Thread: [sqlsmith] Parallel worker executor crash on master
Hi,

sqlsmith just crashed a parallel worker while testing master at
699bf7d05c.  I can reproduce it with the following recipe on a fresh
regression database.  Backtrace and query plan are below as well.

regards,
Andreas

--8<---------------cut here---------------start------------->8---
set min_parallel_table_scan_size = '8kB';
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;

select *
from public.prt1_m_p3 as ref_3
  inner join pg_catalog.pg_class as sample_1 tablesample bernoulli (2.1)
    on (ref_3.c = sample_1.relpages)
  inner join public.rtest_view4 as ref_4
    on (ref_4.b @@ (select keyword from public.test_tsquery limit 1 offset 2)),
  lateral (select ref_3.a as c0
           from public.lseg_tbl as ref_5
           where (select f1 from public.timetz_tbl limit 1 offset 5)
               > (select pg_catalog.min(timetzcol) from public.brintest)
           limit 18) as subq_2
where true
limit 102;
--8<---------------cut here---------------end--------------->8---

                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Limit  (cost=107.69..1421.69 rows=102 width=315)
   InitPlan 1 (returns $1)
     ->  Limit  (cost=0.35..0.52 rows=1 width=32)
           ->  Gather  (cost=0.00..1.04 rows=6 width=32)
                 Workers Planned: 1
                 ->  Parallel Seq Scan on test_tsquery  (cost=0.00..1.04 rows=4 width=32)
   ->  Nested Loop  (cost=107.17..5775.39 rows=440 width=315)
         Join Filter: (ref_3.c = sample_1.relpages)
         ->  Nested Loop  (cost=107.17..5416.40 rows=250 width=16)
               ->  Gather  (cost=0.00..1.29 rows=50 width=12)
                     Workers Planned: 1
                     ->  Parallel Seq Scan on prt1_m_p3 ref_3  (cost=0.00..1.29 rows=29 width=12)
               ->  Limit  (cost=107.17..108.20 rows=5 width=4)
                     InitPlan 2 (returns $4)
                       ->  Limit  (cost=0.45..0.54 rows=1 width=12)
                             ->  Gather  (cost=0.00..1.07 rows=12 width=12)
                                   Workers Planned: 1
                                   ->  Parallel Seq Scan on timetz_tbl  (cost=0.00..1.07 rows=7 width=12)
                     InitPlan 3 (returns $5)
                       ->  Aggregate  (cost=106.62..106.64 rows=1 width=12)
                             ->  Seq Scan on brintest  (cost=0.00..106.30 rows=130 width=12)
                     ->  Gather  (cost=0.00..1.03 rows=5 width=4)
                           Workers Planned: 1
                           Params Evaluated: $4, $5
                           ->  Result  (cost=0.00..1.03 rows=3 width=0)
                                 One-Time Filter: ($4 > $5)
                                 ->  Parallel Seq Scan on lseg_tbl ref_5  (cost=0.00..1.03 rows=3 width=0)
         ->  Materialize  (cost=0.00..194.10 rows=44 width=299)
               ->  Nested Loop  (cost=0.00..193.88 rows=44 width=299)
                     ->  Gather  (cost=0.00..140.00 rows=1 width=40)
                           Workers Planned: 2
                           Params Evaluated: $1
                           ->  Parallel Seq Scan on rtest_view4 ref_4  (cost=0.00..140.00 rows=1 width=40)
                                 Filter: (b @@ $1)
                     ->  Sample Scan on pg_class sample_1  (cost=0.00..53.44 rows=44 width=259)
                           Sampling: bernoulli ('2.1'::real)

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055c9dba2d7f8 in timetz_gt (fcinfo=0x55c9dd7295a0) at date.c:2183
2183            TimeTzADT  *time2 = PG_GETARG_TIMETZADT_P(1);
#1  0x000055c9db8ae8b2 in ExecInterpExpr (state=0x55c9dd729f00, econtext=0x55c9dd7299b8, isnull=<optimized out>) at execExprInterp.c:692
#2  0x000055c9db8d6753 in ExecEvalExprSwitchContext (isNull=0x7ffd2f99e55f "", econtext=0x55c9dd7299b8, state=<optimized out>) at ../../../src/include/executor/executor.h:300
#3  ExecQual (econtext=0x55c9dd7299b8, state=<optimized out>) at ../../../src/include/executor/executor.h:369
#4  ExecResult (pstate=0x55c9dd729c38) at nodeResult.c:84
#5  0x000055c9db8b21ea in ExecProcNode (node=0x55c9dd729c38) at ../../../src/include/executor/executor.h:242
#6  ExecutePlan (execute_once=<optimized out>, dest=0x55c9dd6fff78, direction=<optimized out>, numberTuples=18, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x55c9dd729c38, estate=0x55c9dd729118) at execMain.c:1718
#7  standard_ExecutorRun (queryDesc=0x55c9dd742b48, direction=<optimized out>, count=18, execute_once=<optimized out>) at execMain.c:361
#8  0x000055c9db8b6e99 in ParallelQueryMain (seg=0x55c9dd6a8ea8, toc=0x7f9109496000) at execParallel.c:1294
#9  0x000055c9db7911d9 in ParallelWorkerMain (main_arg=<optimized out>) at parallel.c:1150
#10 0x000055c9db975a63 in StartBackgroundWorker () at bgworker.c:841
#11 0x000055c9db982015 in do_start_bgworker (rw=0x55c9dd6a0380) at postmaster.c:5741
#12 maybe_start_bgworkers () at postmaster.c:5954
#13 0x000055c9db982ae8 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5134
#14 <signal handler called>
#15 0x00007f91088503d3 in select () from /lib/x86_64-linux-gnu/libc.so.6
#16 0x000055c9db7024ed in ServerLoop () at postmaster.c:1721
#17 0x000055c9db984194 in PostmasterMain (argc=3, argv=0x55c9dd67a5a0) at postmaster.c:1365
#18 0x000055c9db70448d in main (argc=3, argv=0x55c9dd67a5a0) at main.c:228
On Sat, Dec 16, 2017 at 12:57 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> Hi,
>
> sqlsmith just crashed a parallel worker while testing master at
> 699bf7d05c.  I can reproduce it with the following recipe on a fresh
> regression database.  Backtrace and query plan are below as well.
>

This seems to be another symptom of the problem related to es_query_dsa
for which Thomas has sent a patch on a different thread [1].  After
applying that patch, I am not able to see the problem.  I think the
wrong usage of dsa across nodes can lead to sending some wrong values
for params to workers.

[1] - https://www.postgresql.org/message-id/CAEepm%3D0Mv9BigJPpribGQhnHqVGYo2%2BkmzekGUVJJc9Y_ZVaYA%40mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
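To make that failure mode concrete, here is an illustrative fragment
only (not the executor's actual param-serialization code, and the area
arguments are hypothetical stand-ins): a dsa_pointer is only meaningful
relative to the dsa_area it was allocated in, so if a param value is
stashed through one Gather's area but resolved through a different area
left behind in estate->es_query_dsa, the resulting address is unrelated
or invalid memory — the kind of bogus pointer that then blows up when
timetz_gt() dereferences its argument.

/*
 * Sketch only: demonstrates why a dsa_pointer must be resolved against
 * the same dsa_area it was allocated from.  The area parameters are
 * hypothetical stand-ins for two different Gather nodes' DSA areas.
 */
#include "postgres.h"
#include "utils/dsa.h"
#include "utils/date.h"

static dsa_pointer
stash_param(dsa_area *area_of_gather_a, const TimeTzADT *val)
{
    /* copy the value into the DSA area belonging to Gather A */
    dsa_pointer dp = dsa_allocate(area_of_gather_a, sizeof(TimeTzADT));

    memcpy(dsa_get_address(area_of_gather_a, dp), val, sizeof(TimeTzADT));
    return dp;
}

static TimeTzADT *
fetch_param(dsa_area *area_of_gather_b, dsa_pointer dp)
{
    /*
     * Resolving dp against a *different* area yields an unrelated (or
     * invalid) address; dereferencing it gives a crash like frame #0.
     */
    return (TimeTzADT *) dsa_get_address(area_of_gather_b, dp);
}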
Amit Kapila writes:

> This seems to be another symptom of the problem related to es_query_dsa
> for which Thomas has sent a patch on a different thread [1].  After
> applying that patch, I am not able to see the problem.  I think the
> wrong usage of dsa across nodes can lead to sending some wrong values
> for params to workers.
>
> [1] - https://www.postgresql.org/message-id/CAEepm%3D0Mv9BigJPpribGQhnHqVGYo2%2BkmzekGUVJJc9Y_ZVaYA%40mail.gmail.com

While my posted recipe is indeed inconspicuous with the patch applied,
it seems to have made matters worse from the sqlsmith perspective:
instead of one core dump per hour I get one per minute.  A sample
backtrace is below.  I could not find a recipe yet to reproduce these
(beyond starting sqlsmith).

regards,
Andreas

Core was generated by `postgres: smith regression [local] SELECT '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  gather_getnext (gatherstate=0x555a5fff1350) at nodeGather.c:283
283         estate->es_query_dsa = gatherstate->pei->area;
#1  ExecGather (pstate=0x555a5fff1350) at nodeGather.c:216
#2  0x0000555a5d51a1ea in ExecProcNode (node=0x555a5fff1350) at ../../../src/include/executor/executor.h:242
#3  ExecutePlan (execute_once=<optimized out>, dest=0x555a604f78a0, direction=<optimized out>, numberTuples=1, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x555a5fff1350, estate=0x555a5fff1138) at execMain.c:1718
#4  standard_ExecutorRun (queryDesc=0x555a604f78f8, direction=<optimized out>, count=1, execute_once=<optimized out>) at execMain.c:361
#5  0x0000555a5d5267cc in postquel_getnext (es=0x555a604f7418, es=0x555a604f7418, fcache=0x555a5fd1a658, fcache=0x555a5fd1a658) at functions.c:865
#6  fmgr_sql (fcinfo=0x555a60376470) at functions.c:1161
#7  0x0000555a5d5224f7 in ExecMakeFunctionResultSet (fcache=0x555a60376400, econtext=econtext@entry=0x555a60374090, argContext=0x555a5fd449d0, isNull=0x555a6037a60e "", isDone=isDone@entry=0x555a6037a698) at execSRF.c:604
#8  0x0000555a5d53dcbb in ExecProjectSRF (node=node@entry=0x555a60373f78, continuing=continuing@entry=0 '\000') at nodeProjectSet.c:175
#9  0x0000555a5d53ddf5 in ExecProjectSet (pstate=0x555a60373f78) at nodeProjectSet.c:105
#10 0x0000555a5d53d556 in ExecProcNode (node=0x555a60373f78) at ../../../src/include/executor/executor.h:242
#11 ExecNestLoop (pstate=0x555a60373da0) at nodeNestloop.c:109
#12 0x0000555a5d53d556 in ExecProcNode (node=0x555a60373da0) at ../../../src/include/executor/executor.h:242
#13 ExecNestLoop (pstate=0x555a60373248) at nodeNestloop.c:109
#14 0x0000555a5d536699 in ExecProcNode (node=0x555a60373248) at ../../../src/include/executor/executor.h:242
#15 ExecLimit (pstate=0x555a60372650) at nodeLimit.c:95
#16 0x0000555a5d5433eb in ExecProcNode (node=0x555a60372650) at ../../../src/include/executor/executor.h:242
#17 ExecSetParamPlan (node=<optimized out>, econtext=0x555a6045e948) at nodeSubplan.c:968
#18 0x0000555a5d513da8 in ExecEvalParamExec (state=<optimized out>, op=0x555a604619f0, econtext=<optimized out>) at execExprInterp.c:1921
#19 0x0000555a5d516b7e in ExecInterpExpr (state=0x555a604616e0, econtext=0x555a6045e948, isnull=<optimized out>) at execExprInterp.c:1038
#20 0x0000555a5d547cad in ExecEvalExprSwitchContext (isNull=0x7ffecac290ce "", econtext=0x555a6045e948, state=0x555a604616e0) at ../../../src/include/executor/executor.h:300
#21 ExecProject (projInfo=0x555a604616d8) at ../../../src/include/executor/executor.h:334
#22 ExecWindowAgg (pstate=0x555a6045e670) at nodeWindowAgg.c:1761
#23 0x0000555a5d536699 in ExecProcNode (node=0x555a6045e670) at ../../../src/include/executor/executor.h:242
#24 ExecLimit (pstate=0x555a6045df28) at nodeLimit.c:95
#25 0x0000555a5d51a1ea in ExecProcNode (node=0x555a6045df28) at ../../../src/include/executor/executor.h:242
#26 ExecutePlan (execute_once=<optimized out>, dest=0x555a604322a0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x555a6045df28, estate=0x555a5ffef128) at execMain.c:1718
#27 standard_ExecutorRun (queryDesc=0x555a5ff8e418, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:361
#28 0x0000555a5d668ecc in PortalRunSelect (portal=portal@entry=0x555a5fbf5f00, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x555a604322a0) at pquery.c:932
#29 0x0000555a5d66a4c0 in PortalRun (portal=portal@entry=0x555a5fbf5f00, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x555a604322a0, altdest=altdest@entry=0x555a604322a0, completionTag=0x7ffecac29380 "") at pquery.c:773
#30 0x0000555a5d66608b in exec_simple_query (query_string=0x555a5fb78178 "[...]") at postgres.c:1120
#31 0x0000555a5d667de1 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x555a5fbb5710, dbname=<optimized out>, username=<optimized out>) at postgres.c:4139
#32 0x0000555a5d36af16 in BackendRun (port=0x555a5fb9d280) at postmaster.c:4412
#33 BackendStartup (port=0x555a5fb9d280) at postmaster.c:4084
#34 ServerLoop () at postmaster.c:1757
#35 0x0000555a5d5ec214 in PostmasterMain (argc=3, argv=0x555a5fb725a0) at postmaster.c:1365
#36 0x0000555a5d36c48d in main (argc=3, argv=0x555a5fb725a0) at main.c:228
On Sat, Dec 16, 2017 at 10:13 PM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> Amit Kapila writes:
>
>> This seems to be another symptom of the problem related to es_query_dsa
>> for which Thomas has sent a patch on a different thread [1].  After
>> applying that patch, I am not able to see the problem.  I think the
>> wrong usage of dsa across nodes can lead to sending some wrong values
>> for params to workers.
>>
>> [1] - https://www.postgresql.org/message-id/CAEepm%3D0Mv9BigJPpribGQhnHqVGYo2%2BkmzekGUVJJc9Y_ZVaYA%40mail.gmail.com
>
> While my posted recipe is indeed inconspicuous with the patch applied,
> it seems to have made matters worse from the sqlsmith perspective:
> instead of one core dump per hour I get one per minute.  A sample
> backtrace is below.  I could not find a recipe yet to reproduce these
> (beyond starting sqlsmith).
>
> regards,
> Andreas
>
> Core was generated by `postgres: smith regression [local] SELECT '.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  gather_getnext (gatherstate=0x555a5fff1350) at nodeGather.c:283
> 283         estate->es_query_dsa = gatherstate->pei->area;
> #1  ExecGather (pstate=0x555a5fff1350) at nodeGather.c:216

Hmm, thanks.  That's not good.  Do we know if gatherstate->pei is NULL,
or if it's somehow pointing to garbage?  Not sure how either of those
things could happen, since we only set it to NULL in
ExecShutdownGather(), after which point we shouldn't call ExecGather()
again, and any MemoryContext problems with pei should have caused
problems already without this patch (for example in
ExecParallelCleanup).  Clearly I'm missing something.

--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro writes:

> On Sat, Dec 16, 2017 at 10:13 PM, Andreas Seltenreich
> <seltenreich@gmx.de> wrote:
>> Core was generated by `postgres: smith regression [local] SELECT '.
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> #0  gather_getnext (gatherstate=0x555a5fff1350) at nodeGather.c:283
>> 283         estate->es_query_dsa = gatherstate->pei->area;
>> #1  ExecGather (pstate=0x555a5fff1350) at nodeGather.c:216
>
> Hmm, thanks.  That's not good.  Do we know if gatherstate->pei is
> NULL, or if it's somehow pointing to garbage?

It was NULL in all the core dumps I looked into.  Below[1] is a full
gatherstate.

> Not sure how either of those things could happen, since we only set it
> to NULL in ExecShutdownGather(), after which point we shouldn't call
> ExecGather() again, and any MemoryContext problems with pei should
> have caused problems already without this patch (for example in
> ExecParallelCleanup).  Clearly I'm missing something.

FWIW, all backtraces collected so far are identical for the first nine
frames.  After ExecProjectSet, they are pretty random executor innards.

,----
| #1  ExecGather at nodeGather.c:216
| #2  0x0000555bc9fb41ea in ExecProcNode at ../../../src/include/executor/executor.h:242
| #3  ExecutePlan at execMain.c:1718
| #4  standard_ExecutorRun at execMain.c:361
| #5  0x0000555bc9fc07cc in postquel_getnext at functions.c:865
| #6  fmgr_sql (fcinfo=0x555bcba07748) at functions.c:1161
| #7  0x0000555bc9fbc4f7 in ExecMakeFunctionResultSet at execSRF.c:604
| #8  0x0000555bc9fd7cbb in ExecProjectSRF at nodeProjectSet.c:175
| #9  0x0000560828dc8df5 in ExecProjectSet at nodeProjectSet.c:105
`----

regards,
Andreas

Footnotes:
[1]  (gdb) p *gatherstate
$3 = {
  ps = {
    type = T_GatherState,
    plan = 0x555bcb9faf30,
    state = 0x555bcba3d098,
    ExecProcNode = 0x555bc9fc9e30 <ExecGather>,
    ExecProcNodeReal = 0x555bc9fc9e30 <ExecGather>,
    instrument = 0x0,
    worker_instrument = 0x0,
    qual = 0x0,
    lefttree = 0x555bcba3d678,
    righttree = 0x0,
    initPlan = 0x0,
    subPlan = 0x0,
    chgParam = 0x0,
    ps_ResultTupleSlot = 0x555bcba3d5b8,
    ps_ExprContext = 0x555bcba3d3c8,
    ps_ProjInfo = 0x0
  },
  initialized = 1 '\001',
  need_to_scan_locally = 1 '\001',
  tuples_needed = -1,
  funnel_slot = 0x555bcba3d4c0,
  pei = 0x0,
  nworkers_launched = 0,
  nreaders = 0,
  nextreader = 0,
  reader = 0x0
}
On Sun, Dec 17, 2017 at 12:26 PM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> Thomas Munro writes:
>
>> On Sat, Dec 16, 2017 at 10:13 PM, Andreas Seltenreich
>> <seltenreich@gmx.de> wrote:
>>> Core was generated by `postgres: smith regression [local] SELECT '.
>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>> #0  gather_getnext (gatherstate=0x555a5fff1350) at nodeGather.c:283
>>> 283         estate->es_query_dsa = gatherstate->pei->area;
>>> #1  ExecGather (pstate=0x555a5fff1350) at nodeGather.c:216
>>
>> Hmm, thanks.  That's not good.  Do we know if gatherstate->pei is
>> NULL, or if it's somehow pointing to garbage?
>
> It was NULL in all the core dumps I looked into.  Below[1] is a full
> gatherstate.
>
>> Not sure how either of those things could happen, since we only set it
>> to NULL in ExecShutdownGather(), after which point we shouldn't call
>> ExecGather() again, and any MemoryContext problems with pei should
>> have caused problems already without this patch (for example in
>> ExecParallelCleanup).  Clearly I'm missing something.
>
> FWIW, all backtraces collected so far are identical for the first nine
> frames.  After ExecProjectSet, they are pretty random executor innards.

Oh, right.  We only create pei if (gather->num_workers > 0 &&
estate->es_use_parallel_mode), so I need to teach that patch that pei
may be NULL.  I'll go and post a new version like that over on the
other thread.

--
Thomas Munro
http://www.enterprisedb.com
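For reference, a minimal sketch of what that implies for the assignment
that crashed at nodeGather.c:283 (a guess at the shape of the guard, not
the actual patch posted on the other thread):

/*
 * Sketch only, not the committed fix: pei is built in ExecGather() only
 * when gather->num_workers > 0 and es_use_parallel_mode is true, so any
 * code installing pei->area into the EState must tolerate pei == NULL.
 */
if (gatherstate->pei != NULL)
    estate->es_query_dsa = gatherstate->pei->area;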