Thread: [sqlsmith] Parallel worker executor crash on master
Hi,

sqlsmith just crashed a parallel worker while testing master at
699bf7d05c.  I can reproduce it with the following recipe on a fresh
regression database.  Backtrace and query plan are below as well.

regards,
Andreas

--8<---------------cut here---------------start------------->8---
set min_parallel_table_scan_size = '8kB';
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;

select *
from public.prt1_m_p3 as ref_3
  inner join pg_catalog.pg_class as sample_1 tablesample bernoulli (2.1)
    on (ref_3.c = sample_1.relpages)
  inner join public.rtest_view4 as ref_4
    on (ref_4.b @@ (select keyword from public.test_tsquery limit 1 offset 2)),
  lateral (select ref_3.a as c0
           from public.lseg_tbl as ref_5
           where (select f1 from public.timetz_tbl limit 1 offset 5)
               > (select pg_catalog.min(timetzcol) from public.brintest)
           limit 18) as subq_2
where true
limit 102;
--8<---------------cut here---------------end--------------->8---

                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Limit  (cost=107.69..1421.69 rows=102 width=315)
   InitPlan 1 (returns $1)
     ->  Limit  (cost=0.35..0.52 rows=1 width=32)
           ->  Gather  (cost=0.00..1.04 rows=6 width=32)
                 Workers Planned: 1
                 ->  Parallel Seq Scan on test_tsquery  (cost=0.00..1.04 rows=4 width=32)
   ->  Nested Loop  (cost=107.17..5775.39 rows=440 width=315)
         Join Filter: (ref_3.c = sample_1.relpages)
         ->  Nested Loop  (cost=107.17..5416.40 rows=250 width=16)
               ->  Gather  (cost=0.00..1.29 rows=50 width=12)
                     Workers Planned: 1
                     ->  Parallel Seq Scan on prt1_m_p3 ref_3  (cost=0.00..1.29 rows=29 width=12)
               ->  Limit  (cost=107.17..108.20 rows=5 width=4)
                     InitPlan 2 (returns $4)
                       ->  Limit  (cost=0.45..0.54 rows=1 width=12)
                             ->  Gather  (cost=0.00..1.07 rows=12 width=12)
                                   Workers Planned: 1
                                   ->  Parallel Seq Scan on timetz_tbl  (cost=0.00..1.07 rows=7 width=12)
                     InitPlan 3 (returns $5)
                       ->  Aggregate  (cost=106.62..106.64 rows=1 width=12)
                             ->  Seq Scan on brintest  (cost=0.00..106.30 rows=130 width=12)
                     ->  Gather  (cost=0.00..1.03 rows=5 width=4)
                           Workers Planned: 1
                           Params Evaluated: $4, $5
                           ->  Result  (cost=0.00..1.03 rows=3 width=0)
                                 One-Time Filter: ($4 > $5)
                                 ->  Parallel Seq Scan on lseg_tbl ref_5  (cost=0.00..1.03 rows=3 width=0)
         ->  Materialize  (cost=0.00..194.10 rows=44 width=299)
               ->  Nested Loop  (cost=0.00..193.88 rows=44 width=299)
                     ->  Gather  (cost=0.00..140.00 rows=1 width=40)
                           Workers Planned: 2
                           Params Evaluated: $1
                           ->  Parallel Seq Scan on rtest_view4 ref_4  (cost=0.00..140.00 rows=1 width=40)
                                 Filter: (b @@ $1)
                     ->  Sample Scan on pg_class sample_1  (cost=0.00..53.44 rows=44 width=259)
                           Sampling: bernoulli ('2.1'::real)

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055c9dba2d7f8 in timetz_gt (fcinfo=0x55c9dd7295a0) at date.c:2183
2183            TimeTzADT  *time2 = PG_GETARG_TIMETZADT_P(1);
#1  0x000055c9db8ae8b2 in ExecInterpExpr (state=0x55c9dd729f00, econtext=0x55c9dd7299b8, isnull=<optimized out>) at execExprInterp.c:692
#2  0x000055c9db8d6753 in ExecEvalExprSwitchContext (isNull=0x7ffd2f99e55f "", econtext=0x55c9dd7299b8, state=<optimized out>) at ../../../src/include/executor/executor.h:300
#3  ExecQual (econtext=0x55c9dd7299b8, state=<optimized out>) at ../../../src/include/executor/executor.h:369
#4  ExecResult (pstate=0x55c9dd729c38) at nodeResult.c:84
#5  0x000055c9db8b21ea in ExecProcNode (node=0x55c9dd729c38) at ../../../src/include/executor/executor.h:242
#6  ExecutePlan (execute_once=<optimized out>, dest=0x55c9dd6fff78, direction=<optimized out>, numberTuples=18, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x55c9dd729c38, estate=0x55c9dd729118) at execMain.c:1718
#7  standard_ExecutorRun (queryDesc=0x55c9dd742b48, direction=<optimized out>, count=18, execute_once=<optimized out>) at execMain.c:361
#8  0x000055c9db8b6e99 in ParallelQueryMain (seg=0x55c9dd6a8ea8, toc=0x7f9109496000) at execParallel.c:1294
#9  0x000055c9db7911d9 in ParallelWorkerMain (main_arg=<optimized out>) at parallel.c:1150
#10 0x000055c9db975a63 in StartBackgroundWorker () at bgworker.c:841
#11 0x000055c9db982015 in do_start_bgworker (rw=0x55c9dd6a0380) at postmaster.c:5741
#12 maybe_start_bgworkers () at postmaster.c:5954
#13 0x000055c9db982ae8 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5134
#14 <signal handler called>
#15 0x00007f91088503d3 in select () from /lib/x86_64-linux-gnu/libc.so.6
#16 0x000055c9db7024ed in ServerLoop () at postmaster.c:1721
#17 0x000055c9db984194 in PostmasterMain (argc=3, argv=0x55c9dd67a5a0) at postmaster.c:1365
#18 0x000055c9db70448d in main (argc=3, argv=0x55c9dd67a5a0) at main.c:228
On Sat, Dec 16, 2017 at 12:57 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> Hi,
>
> sqlsmith just crashed a parallel worker while testing master at
> 699bf7d05c.  I can reproduce it with the following recipe on a fresh
> regression database.  Backtrace and query plan are below as well.
>

This seems to be another symptom of the problem related to es_query_dsa
for which Thomas has sent a patch on a different thread [1].  After
applying that patch, I am not able to see the problem.  I think the
wrong usage of dsa across nodes can lead to sending some wrong values
for params to workers.

[1] - https://www.postgresql.org/message-id/CAEepm%3D0Mv9BigJPpribGQhnHqVGYo2%2BkmzekGUVJJc9Y_ZVaYA%40mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
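To make that failure mode concrete, here is an illustrative fragment
only (not the executor's actual param-serialization code, and the area
arguments are hypothetical stand-ins): a dsa_pointer is only meaningful
relative to the dsa_area it was allocated in, so if a param value is
stashed through one Gather's area but resolved through a different area
left behind in estate->es_query_dsa, the resulting address is unrelated
or invalid memory — the kind of bogus pointer that then blows up when
timetz_gt() dereferences its argument.

/*
 * Sketch only: demonstrates why a dsa_pointer must be resolved against
 * the same dsa_area it was allocated from.  The area parameters are
 * hypothetical stand-ins for two different Gather nodes' DSA areas.
 */
#include "postgres.h"
#include "utils/dsa.h"
#include "utils/date.h"

static dsa_pointer
stash_param(dsa_area *area_of_gather_a, const TimeTzADT *val)
{
    /* copy the value into the DSA area belonging to Gather A */
    dsa_pointer dp = dsa_allocate(area_of_gather_a, sizeof(TimeTzADT));

    memcpy(dsa_get_address(area_of_gather_a, dp), val, sizeof(TimeTzADT));
    return dp;
}

static TimeTzADT *
fetch_param(dsa_area *area_of_gather_b, dsa_pointer dp)
{
    /*
     * Resolving dp against a *different* area yields an unrelated (or
     * invalid) address; dereferencing it gives a crash like frame #0.
     */
    return (TimeTzADT *) dsa_get_address(area_of_gather_b, dp);
}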
Amit Kapila writes:

> This seems to be another symptom of the problem related to es_query_dsa
> for which Thomas has sent a patch on a different thread [1].  After
> applying that patch, I am not able to see the problem.  I think the
> wrong usage of dsa across nodes can lead to sending some wrong values
> for params to workers.
>
> [1] - https://www.postgresql.org/message-id/CAEepm%3D0Mv9BigJPpribGQhnHqVGYo2%2BkmzekGUVJJc9Y_ZVaYA%40mail.gmail.com

While my posted recipe is indeed inconspicuous with the patch applied,
it seems to have made matters worse from the sqlsmith perspective:
instead of one core dump per hour I get one per minute.  A sample
backtrace is below.  I could not find a recipe yet to reproduce these
(beyond starting sqlsmith).

regards,
Andreas

Core was generated by `postgres: smith regression [local] SELECT '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  gather_getnext (gatherstate=0x555a5fff1350) at nodeGather.c:283
283         estate->es_query_dsa = gatherstate->pei->area;
#1  ExecGather (pstate=0x555a5fff1350) at nodeGather.c:216
#2  0x0000555a5d51a1ea in ExecProcNode (node=0x555a5fff1350) at ../../../src/include/executor/executor.h:242
#3  ExecutePlan (execute_once=<optimized out>, dest=0x555a604f78a0, direction=<optimized out>, numberTuples=1, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x555a5fff1350, estate=0x555a5fff1138) at execMain.c:1718
#4  standard_ExecutorRun (queryDesc=0x555a604f78f8, direction=<optimized out>, count=1, execute_once=<optimized out>) at execMain.c:361
#5  0x0000555a5d5267cc in postquel_getnext (es=0x555a604f7418, es=0x555a604f7418, fcache=0x555a5fd1a658, fcache=0x555a5fd1a658) at functions.c:865
#6  fmgr_sql (fcinfo=0x555a60376470) at functions.c:1161
#7  0x0000555a5d5224f7 in ExecMakeFunctionResultSet (fcache=0x555a60376400, econtext=econtext@entry=0x555a60374090, argContext=0x555a5fd449d0, isNull=0x555a6037a60e "", isDone=isDone@entry=0x555a6037a698) at execSRF.c:604
#8  0x0000555a5d53dcbb in ExecProjectSRF (node=node@entry=0x555a60373f78, continuing=continuing@entry=0 '\000') at nodeProjectSet.c:175
#9  0x0000555a5d53ddf5 in ExecProjectSet (pstate=0x555a60373f78) at nodeProjectSet.c:105
#10 0x0000555a5d53d556 in ExecProcNode (node=0x555a60373f78) at ../../../src/include/executor/executor.h:242
#11 ExecNestLoop (pstate=0x555a60373da0) at nodeNestloop.c:109
#12 0x0000555a5d53d556 in ExecProcNode (node=0x555a60373da0) at ../../../src/include/executor/executor.h:242
#13 ExecNestLoop (pstate=0x555a60373248) at nodeNestloop.c:109
#14 0x0000555a5d536699 in ExecProcNode (node=0x555a60373248) at ../../../src/include/executor/executor.h:242
#15 ExecLimit (pstate=0x555a60372650) at nodeLimit.c:95
#16 0x0000555a5d5433eb in ExecProcNode (node=0x555a60372650) at ../../../src/include/executor/executor.h:242
#17 ExecSetParamPlan (node=<optimized out>, econtext=0x555a6045e948) at nodeSubplan.c:968
#18 0x0000555a5d513da8 in ExecEvalParamExec (state=<optimized out>, op=0x555a604619f0, econtext=<optimized out>) at execExprInterp.c:1921
#19 0x0000555a5d516b7e in ExecInterpExpr (state=0x555a604616e0, econtext=0x555a6045e948, isnull=<optimized out>) at execExprInterp.c:1038
#20 0x0000555a5d547cad in ExecEvalExprSwitchContext (isNull=0x7ffecac290ce "", econtext=0x555a6045e948, state=0x555a604616e0) at ../../../src/include/executor/executor.h:300
#21 ExecProject (projInfo=0x555a604616d8) at ../../../src/include/executor/executor.h:334
#22 ExecWindowAgg (pstate=0x555a6045e670) at nodeWindowAgg.c:1761
#23 0x0000555a5d536699 in ExecProcNode (node=0x555a6045e670) at ../../../src/include/executor/executor.h:242
#24 ExecLimit (pstate=0x555a6045df28) at nodeLimit.c:95
#25 0x0000555a5d51a1ea in ExecProcNode (node=0x555a6045df28) at ../../../src/include/executor/executor.h:242
#26 ExecutePlan (execute_once=<optimized out>, dest=0x555a604322a0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x555a6045df28, estate=0x555a5ffef128) at execMain.c:1718
#27 standard_ExecutorRun (queryDesc=0x555a5ff8e418, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:361
#28 0x0000555a5d668ecc in PortalRunSelect (portal=portal@entry=0x555a5fbf5f00, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x555a604322a0) at pquery.c:932
#29 0x0000555a5d66a4c0 in PortalRun (portal=portal@entry=0x555a5fbf5f00, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x555a604322a0, altdest=altdest@entry=0x555a604322a0, completionTag=0x7ffecac29380 "") at pquery.c:773
#30 0x0000555a5d66608b in exec_simple_query (query_string=0x555a5fb78178 "[...]") at postgres.c:1120
#31 0x0000555a5d667de1 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x555a5fbb5710, dbname=<optimized out>, username=<optimized out>) at postgres.c:4139
#32 0x0000555a5d36af16 in BackendRun (port=0x555a5fb9d280) at postmaster.c:4412
#33 BackendStartup (port=0x555a5fb9d280) at postmaster.c:4084
#34 ServerLoop () at postmaster.c:1757
#35 0x0000555a5d5ec214 in PostmasterMain (argc=3, argv=0x555a5fb725a0) at postmaster.c:1365
#36 0x0000555a5d36c48d in main (argc=3, argv=0x555a5fb725a0) at main.c:228
On Sat, Dec 16, 2017 at 10:13 PM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> Amit Kapila writes:
>
>> This seems to be another symptom of the problem related to es_query_dsa
>> for which Thomas has sent a patch on a different thread [1].  After
>> applying that patch, I am not able to see the problem.  I think the
>> wrong usage of dsa across nodes can lead to sending some wrong values
>> for params to workers.
>>
>> [1] - https://www.postgresql.org/message-id/CAEepm%3D0Mv9BigJPpribGQhnHqVGYo2%2BkmzekGUVJJc9Y_ZVaYA%40mail.gmail.com
>
> While my posted recipe is indeed inconspicuous with the patch applied,
> it seems to have made matters worse from the sqlsmith perspective:
> instead of one core dump per hour I get one per minute.  A sample
> backtrace is below.  I could not find a recipe yet to reproduce these
> (beyond starting sqlsmith).
>
> regards,
> Andreas
>
> Core was generated by `postgres: smith regression [local] SELECT '.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  gather_getnext (gatherstate=0x555a5fff1350) at nodeGather.c:283
> 283         estate->es_query_dsa = gatherstate->pei->area;
> #1  ExecGather (pstate=0x555a5fff1350) at nodeGather.c:216

Hmm, thanks.  That's not good.  Do we know if gatherstate->pei is NULL,
or if it's somehow pointing to garbage?  Not sure how either of those
things could happen, since we only set it to NULL in
ExecShutdownGather(), after which point we shouldn't call ExecGather()
again, and any MemoryContext problems with pei should have caused
problems already without this patch (for example in
ExecParallelCleanup).  Clearly I'm missing something.

--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro writes:

> On Sat, Dec 16, 2017 at 10:13 PM, Andreas Seltenreich
> <seltenreich@gmx.de> wrote:
>> Core was generated by `postgres: smith regression [local] SELECT '.
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> #0  gather_getnext (gatherstate=0x555a5fff1350) at nodeGather.c:283
>> 283         estate->es_query_dsa = gatherstate->pei->area;
>> #1  ExecGather (pstate=0x555a5fff1350) at nodeGather.c:216
>
> Hmm, thanks.  That's not good.  Do we know if gatherstate->pei is
> NULL, or if it's somehow pointing to garbage?

It was NULL in all the core dumps I looked into.  Below[1] is a full
gatherstate.

> Not sure how either of those things could happen, since we only set it
> to NULL in ExecShutdownGather(), after which point we shouldn't call
> ExecGather() again, and any MemoryContext problems with pei should
> have caused problems already without this patch (for example in
> ExecParallelCleanup).  Clearly I'm missing something.

FWIW, all backtraces collected so far are identical for the first nine
frames.  After ExecProjectSet, they are pretty random executor innards.

,----
| #1  ExecGather at nodeGather.c:216
| #2  0x0000555bc9fb41ea in ExecProcNode at ../../../src/include/executor/executor.h:242
| #3  ExecutePlan at execMain.c:1718
| #4  standard_ExecutorRun at execMain.c:361
| #5  0x0000555bc9fc07cc in postquel_getnext at functions.c:865
| #6  fmgr_sql (fcinfo=0x555bcba07748) at functions.c:1161
| #7  0x0000555bc9fbc4f7 in ExecMakeFunctionResultSet at execSRF.c:604
| #8  0x0000555bc9fd7cbb in ExecProjectSRF at nodeProjectSet.c:175
| #9  0x0000560828dc8df5 in ExecProjectSet at nodeProjectSet.c:105
`----

regards,
Andreas

Footnotes:
[1]  (gdb) p *gatherstate
$3 = {
  ps = {
    type = T_GatherState,
    plan = 0x555bcb9faf30,
    state = 0x555bcba3d098,
    ExecProcNode = 0x555bc9fc9e30 <ExecGather>,
    ExecProcNodeReal = 0x555bc9fc9e30 <ExecGather>,
    instrument = 0x0,
    worker_instrument = 0x0,
    qual = 0x0,
    lefttree = 0x555bcba3d678,
    righttree = 0x0,
    initPlan = 0x0,
    subPlan = 0x0,
    chgParam = 0x0,
    ps_ResultTupleSlot = 0x555bcba3d5b8,
    ps_ExprContext = 0x555bcba3d3c8,
    ps_ProjInfo = 0x0
  },
  initialized = 1 '\001',
  need_to_scan_locally = 1 '\001',
  tuples_needed = -1,
  funnel_slot = 0x555bcba3d4c0,
  pei = 0x0,
  nworkers_launched = 0,
  nreaders = 0,
  nextreader = 0,
  reader = 0x0
}
On Sun, Dec 17, 2017 at 12:26 PM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> Thomas Munro writes:
>
>> On Sat, Dec 16, 2017 at 10:13 PM, Andreas Seltenreich
>> <seltenreich@gmx.de> wrote:
>>> Core was generated by `postgres: smith regression [local] SELECT '.
>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>> #0  gather_getnext (gatherstate=0x555a5fff1350) at nodeGather.c:283
>>> 283         estate->es_query_dsa = gatherstate->pei->area;
>>> #1  ExecGather (pstate=0x555a5fff1350) at nodeGather.c:216
>>
>> Hmm, thanks.  That's not good.  Do we know if gatherstate->pei is
>> NULL, or if it's somehow pointing to garbage?
>
> It was NULL in all the core dumps I looked into.  Below[1] is a full
> gatherstate.
>
>> Not sure how either of those things could happen, since we only set it
>> to NULL in ExecShutdownGather(), after which point we shouldn't call
>> ExecGather() again, and any MemoryContext problems with pei should
>> have caused problems already without this patch (for example in
>> ExecParallelCleanup).  Clearly I'm missing something.
>
> FWIW, all backtraces collected so far are identical for the first nine
> frames.  After ExecProjectSet, they are pretty random executor innards.

Oh, right.  We only create pei if (gather->num_workers > 0 &&
estate->es_use_parallel_mode), so I need to teach that patch that pei
may be NULL.  I'll go and post a new version like that over on the
other thread.

--
Thomas Munro
http://www.enterprisedb.com
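For reference, a minimal sketch of what that implies for the assignment
that crashed at nodeGather.c:283 (a guess at the shape of the guard, not
the actual patch posted on the other thread):

/*
 * Sketch only, not the committed fix: pei is built in ExecGather() only
 * when gather->num_workers > 0 and es_use_parallel_mode is true, so any
 * code installing pei->area into the EState must tolerate pei == NULL.
 */
if (gatherstate->pei != NULL)
    estate->es_query_dsa = gatherstate->pei->area;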