Thread: [sqlsmith] crashes in RestoreSnapshot on hot standby

[sqlsmith] crashes in RestoreSnapshot on hot standby

From

Andreas Seltenreich

Date:

30 June 2016, 21:26:35

Running sqlsmith on a streaming slave (master as of f8c5855) is
inconspicuous as long as the master is idle.  As soon as I start it on
the master as well, the standby frequently crashes in RestoreSnapshot.
It doesn't seem to be specific to the queries, as they don't trigger a
crash when re-run.

Backtraces always look like the ones below.

regards,
Andreas

**** BEGIN BACKTRACE OF CORE FILE ./slave/postgres.9826@.core ON doombat ****
Core was generated by `postgres: smith regression [local] SELECT                         '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:167
167    ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Datei oder Verzeichnis nicht gefunden.
(gdb) bt
#0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:167
#1  0x0000000000822032 in RestoreSnapshot (start_address=start_address@entry=0x7f2701d5a110 <error: Cannot access
memoryat address 0x7f2701d5a110>) at snapmgr.c:2020
 
#2  0x00000000004a934a in heap_beginscan_parallel (relation=0x2060a90,
parallel_scan=parallel_scan@entry=0x7f2701d5a0f8)at heapam.c:1657
 
#3  0x00000000005fbedf in ExecSeqScanInitializeDSM (node=0x1f5b470, pcxt=0x221af88) at nodeSeqscan.c:327
#4  0x00000000005dd0ad in ExecParallelInitializeDSM (planstate=planstate@entry=0x1f5b470, d=d@entry=0x7ffd4ba200d0) at
execParallel.c:245
#5  0x00000000005dd425 in ExecInitParallelPlan (planstate=0x1f5b470, estate=estate@entry=0x1f5ab28, nworkers=2) at
execParallel.c:477
#6  0x00000000005ef4a4 in ExecGather (node=node@entry=0x1f5b048) at nodeGather.c:159
#7  0x00000000005dda48 in ExecProcNode (node=node@entry=0x1f5b048) at execProcnode.c:515
#8  0x00000000005f4b30 in ExecLimit (node=node@entry=0x1f5acd0) at nodeLimit.c:91
#9  0x00000000005dd9d8 in ExecProcNode (node=node@entry=0x1f5acd0) at execProcnode.c:531
#10 0x00000000005fef7c in ExecSetParamPlan (node=<optimized out>, econtext=0x1f5c138) at nodeSubplan.c:999
#11 0x00000000005e28b5 in ExecEvalParamExec (exprstate=<optimized out>, econtext=<optimized out>, isNull=0x22045b0 "",
isDone=<optimizedout>) at execQual.c:1135
 
#12 0x00000000005deb6d in ExecMakeFunctionResultNoSets (fcache=0x2204200, econtext=0x1f5c138, isNull=0x2203d98 "",
isDone=<optimizedout>) at execQual.c:2015
 
#13 0x00000000005de29a in ExecEvalCoalesce (coalesceExpr=<optimized out>, econtext=0x1f5c138, isNull=0x2203d98 "",
isDone=<optimizedout>) at execQual.c:3446
 
#14 0x00000000005deb6d in ExecMakeFunctionResultNoSets (fcache=0x22039e8, econtext=0x1f5c138, isNull=0x7ffd4ba203df "",
isDone=<optimizedout>) at execQual.c:2015
 
#15 0x00000000005e4939 in ExecQual (qual=<optimized out>, econtext=econtext@entry=0x1f5c138,
resultForNull=resultForNull@entry=0'\000') at execQual.c:5269
 
#16 0x00000000005faef1 in ExecResult (node=node@entry=0x1f5c020) at nodeResult.c:82
#17 0x00000000005ddbf8 in ExecProcNode (node=node@entry=0x1f5c020) at execProcnode.c:392
#18 0x00000000005d9c1f in ExecutePlan (dest=0x1ebb7d0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized
out>,operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1f5c020, estate=0x1f5ab28) at
execMain.c:1567
#19 standard_ExecutorRun (queryDesc=0x1f5a718, direction=<optimized out>, count=0) at execMain.c:338
#20 0x00000000006f7238 in PortalRunSelect (portal=portal@entry=0x1d13be8, forward=forward@entry=1 '\001', count=0,
count@entry=9223372036854775807,dest=dest@entry=0x1ebb7d0) at pquery.c:946
 
#21 0x00000000006f875e in PortalRun (portal=0x1d13be8, count=9223372036854775807, isTopLevel=<optimized out>,
dest=0x1ebb7d0,altdest=0x1ebb7d0, completionTag=0x7ffd4ba20840 "") at pquery.c:787
 
#22 0x00000000006f6003 in exec_simple_query (query_string=<optimized out>) at postgres.c:1094
#23 PostgresMain (argc=30489576, argv=0x1ecfb08, dbname=0x1cf5a00 "regression", username=0x1ecfc20 "\b\373\354\001") at
postgres.c:4074
#24 0x000000000046ca67 in BackendRun (port=0x1d17b50) at postmaster.c:4262
#25 BackendStartup (port=0x1d17b50) at postmaster.c:3936
#26 ServerLoop () at postmaster.c:1693
#27 0x0000000000690ab7 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1cf45e0) at postmaster.c:1301
#28 0x000000000046d9cd in main (argc=3, argv=0x1cf45e0) at main.c:228
(gdb) p debug_query_string
$1 = 0x1d68a78 "select  \n  sample_0.j as c0\nfrom \n  public.testjsonb as sample_0 tablesample system (8) \nwhere
cast(coalesce(pg_catalog.char_length(\n     cast((select comment from public.room limit 1 offset 20)\n         as
text)),\n   pg_catalog.pg_trigger_depth()) as integer) <> 3"
 

**** BEGIN BACKTRACE OF CORE FILE ./slave/postgres.8104@.core ON marbit ****
Core was generated by `postgres: bgworker: parallel worker for PID 2610                  '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:167
167    ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Datei oder Verzeichnis nicht gefunden.
(gdb) bt
#0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:167
#1  0x0000000000822032 in RestoreSnapshot (start_address=0x7f7b1ee4fa58 <error: Cannot access memory at address
0x7f7b1ee4fa58>)at snapmgr.c:2020
 
#2  0x00000000004e6142 in ParallelWorkerMain (main_arg=<optimized out>) at parallel.c:1010
#3  0x0000000000683b62 in StartBackgroundWorker () at bgworker.c:726
#4  0x000000000068ec32 in do_start_bgworker (rw=0x22a8900) at postmaster.c:5535
#5  maybe_start_bgworker () at postmaster.c:5709
#6  0x000000000068f686 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:4971
#7  <signal handler called>
#8  0x00007f7b1e3f3ac3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
#9  0x000000000046c381 in ServerLoop () at postmaster.c:1657
#10 0x0000000000690ab7 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x22825e0) at postmaster.c:1301
#11 0x000000000046d9cd in main (argc=3, argv=0x22825e0) at main.c:228

Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

From

Michael Paquier

Date:

01 July 2016, 02:17:22

On Fri, Jul 1, 2016 at 6:26 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> #1  0x0000000000822032 in RestoreSnapshot (start_address=start_address@entry=0x7f2701d5a110 <error: Cannot access
memoryat address 0x7f2701d5a110>) at snapmgr.c:2020
 
       memcpy(snapshot->subxip, serialized_xids + serialized_snapshot->xcnt,              serialized_snapshot->subxcnt
*sizeof(TransactionId));
 
So this is choking here? Is one of those pointers NULL?
-- 
Michael

Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

From

Thomas Munro

Date:

01 July 2016, 03:19:02

On Fri, Jul 1, 2016 at 2:17 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Jul 1, 2016 at 6:26 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
>> #1  0x0000000000822032 in RestoreSnapshot (start_address=start_address@entry=0x7f2701d5a110 <error: Cannot access
memoryat address 0x7f2701d5a110>) at snapmgr.c:2020 
>
>         memcpy(snapshot->subxip, serialized_xids + serialized_snapshot->xcnt,
>                serialized_snapshot->subxcnt * sizeof(TransactionId));
> So this is choking here? Is one of those pointers NULL?

Theory 1:
If serialized_snapshot->xcnt == 0, then snapshot->xip never gets
initialized to a non-NULL value.  Then if serialized_snapshot->subxcnt
> 0, we set snapshot->subxip = snapshot->xip +
serialized_snapshot->xcnt (so that's NULL too).  Then in line the line
you show we call memcpy(snapshot->subxip, ...).  The fix might be
something like the attached.

Theory 2:
The DSM segment was deleted underneath us.  We can see that it was not
mapped by the time GDB dumped that (start_address is not accessible).

Theory 3:
Somehow the xcnt or xsubcnt was wrong or the serialized snapshot was
truncated, and we read past the end of the DSM, who knows...

--
Thomas Munro
http://www.enterprisedb.com

Attachment

fix-subxip.patch

Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

From

Amit Kapila

Date:

01 July 2016, 03:26:02

On Fri, Jul 1, 2016 at 8:48 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> On Fri, Jul 1, 2016 at 2:17 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Fri, Jul 1, 2016 at 6:26 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> >> #1 0x0000000000822032 in RestoreSnapshot (start_address=start_address@entry=0x7f2701d5a110 <error: Cannot access memory at address 0x7f2701d5a110>) at snapmgr.c:2020
> >
> > memcpy(snapshot->subxip, serialized_xids + serialized_snapshot->xcnt,
> > serialized_snapshot->subxcnt * sizeof(TransactionId));
> > So this is choking here? Is one of those pointers NULL?
>
> Theory 1:
> If serialized_snapshot->xcnt == 0, then snapshot->xip never gets
> initialized to a non-NULL value. Then if serialized_snapshot->subxcnt
> > 0, we set snapshot->subxip = snapshot->xip +
> serialized_snapshot->xcnt (so that's NULL too). Then in line the line
> you show we call memcpy(snapshot->subxip, ...). The fix might be
> something like the attached.
>

I was just typing the mail, when I see this mail. I also reached to the conclusion that this is the reason of crash. You can see how CopySnapshot calculates the subxipoff, may be writing code that way will be more consistent. In case of recovery, I think serialized_snapshot->xcnt will always be zero as we fill everything in subxip array (refer below code in GetSnapshotData).

GetSnapshotData()
{
/*
* We're in hot standby, so get XIDs from KnownAssignedXids.

}

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

From

Thomas Munro

Date:

01 July 2016, 04:08:15

On Fri, Jul 1, 2016 at 3:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jul 1, 2016 at 8:48 AM, Thomas Munro <thomas.munro@enterprisedb.com>
> wrote:
>> If serialized_snapshot->xcnt == 0, then snapshot->xip never gets
>> initialized to a non-NULL value.  Then if serialized_snapshot->subxcnt
>> > 0, we set snapshot->subxip = snapshot->xip +
>> serialized_snapshot->xcnt (so that's NULL too).  Then in line the line
>> you show we call memcpy(snapshot->subxip, ...).  The fix might be
>> something like the attached.
>
> I was just typing the mail, when I see this mail.  I also reached to the
> conclusion that this is the reason of crash.  You can see how CopySnapshot
> calculates the subxipoff, may be writing code that way will be more
> consistent.

Or maybe just like this?

-               snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt;
+               snapshot->subxip = ((TransactionId *) (snapshot + 1)) +
+                       serialized_snapshot->xcnt;

--
Thomas Munro
http://www.enterprisedb.com

Attachment

fix-subxip-v2.patch

Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

From

Amit Kapila

Date:

01 July 2016, 04:23:07

On Fri, Jul 1, 2016 at 9:38 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Jul 1, 2016 at 3:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, Jul 1, 2016 at 8:48 AM, Thomas Munro <thomas.munro@enterprisedb.com>
>> wrote:
>>> If serialized_snapshot->xcnt == 0, then snapshot->xip never gets
>>> initialized to a non-NULL value.  Then if serialized_snapshot->subxcnt
>>> > 0, we set snapshot->subxip = snapshot->xip +
>>> serialized_snapshot->xcnt (so that's NULL too).  Then in line the line
>>> you show we call memcpy(snapshot->subxip, ...).  The fix might be
>>> something like the attached.
>>
>> I was just typing the mail, when I see this mail.  I also reached to the
>> conclusion that this is the reason of crash.  You can see how CopySnapshot
>> calculates the subxipoff, may be writing code that way will be more
>> consistent.
>
> Or maybe just like this?
>
> -               snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt;
> +               snapshot->subxip = ((TransactionId *) (snapshot + 1)) +
> +                       serialized_snapshot->xcnt;
>

This way it looks better to me.  Thanks for the patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

From

Andreas Seltenreich

Date:

01 July 2016, 05:41:31

Amit Kapila writes:
> On Fri, Jul 1, 2016 at 9:38 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> Or maybe just like this?
>>
>> -               snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt;
>> +               snapshot->subxip = ((TransactionId *) (snapshot + 1)) +
>> +                       serialized_snapshot->xcnt;
>>
>
> This way it looks better to me.  Thanks for the patch.

I no longer see these crashes when testing with the patch applied.

thanks,
Andreas

Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

From

Robert Haas

Date:

01 July 2016, 13:04:39

On Fri, Jul 1, 2016 at 1:41 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> Amit Kapila writes:
>> On Fri, Jul 1, 2016 at 9:38 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>>> Or maybe just like this?
>>>
>>> -               snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt;
>>> +               snapshot->subxip = ((TransactionId *) (snapshot + 1)) +
>>> +                       serialized_snapshot->xcnt;
>>>
>>
>> This way it looks better to me.  Thanks for the patch.
>
> I no longer see these crashes when testing with the patch applied.

Committed and back-patched to 9.5 where this code was added.  Thanks
to all for the report, patch, review, and testing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company