Thread: [sqlsmith] crashes in RestoreSnapshot on hot standby
Running sqlsmith on a streaming slave (master as of f8c5855) is inconspicuous as long as the master is idle. As soon as I start it on the master as well, the standby frequently crashes in RestoreSnapshot. It doesn't seem to be specific to the queries, as they don't trigger a crash when re-run. Backtraces always look like the ones below. regards, Andreas **** BEGIN BACKTRACE OF CORE FILE ./slave/postgres.9826@.core ON doombat **** Core was generated by `postgres: smith regression [local] SELECT '. Program terminated with signal SIGSEGV, Segmentation fault. #0 __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:167 167 ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Datei oder Verzeichnis nicht gefunden. (gdb) bt #0 __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:167 #1 0x0000000000822032 in RestoreSnapshot (start_address=start_address@entry=0x7f2701d5a110 <error: Cannot access memoryat address 0x7f2701d5a110>) at snapmgr.c:2020 #2 0x00000000004a934a in heap_beginscan_parallel (relation=0x2060a90, parallel_scan=parallel_scan@entry=0x7f2701d5a0f8)at heapam.c:1657 #3 0x00000000005fbedf in ExecSeqScanInitializeDSM (node=0x1f5b470, pcxt=0x221af88) at nodeSeqscan.c:327 #4 0x00000000005dd0ad in ExecParallelInitializeDSM (planstate=planstate@entry=0x1f5b470, d=d@entry=0x7ffd4ba200d0) at execParallel.c:245 #5 0x00000000005dd425 in ExecInitParallelPlan (planstate=0x1f5b470, estate=estate@entry=0x1f5ab28, nworkers=2) at execParallel.c:477 #6 0x00000000005ef4a4 in ExecGather (node=node@entry=0x1f5b048) at nodeGather.c:159 #7 0x00000000005dda48 in ExecProcNode (node=node@entry=0x1f5b048) at execProcnode.c:515 #8 0x00000000005f4b30 in ExecLimit (node=node@entry=0x1f5acd0) at nodeLimit.c:91 #9 0x00000000005dd9d8 in ExecProcNode (node=node@entry=0x1f5acd0) at execProcnode.c:531 #10 0x00000000005fef7c in ExecSetParamPlan (node=<optimized out>, econtext=0x1f5c138) at nodeSubplan.c:999 #11 0x00000000005e28b5 in ExecEvalParamExec (exprstate=<optimized out>, econtext=<optimized out>, isNull=0x22045b0 "", isDone=<optimizedout>) at execQual.c:1135 #12 0x00000000005deb6d in ExecMakeFunctionResultNoSets (fcache=0x2204200, econtext=0x1f5c138, isNull=0x2203d98 "", isDone=<optimizedout>) at execQual.c:2015 #13 0x00000000005de29a in ExecEvalCoalesce (coalesceExpr=<optimized out>, econtext=0x1f5c138, isNull=0x2203d98 "", isDone=<optimizedout>) at execQual.c:3446 #14 0x00000000005deb6d in ExecMakeFunctionResultNoSets (fcache=0x22039e8, econtext=0x1f5c138, isNull=0x7ffd4ba203df "", isDone=<optimizedout>) at execQual.c:2015 #15 0x00000000005e4939 in ExecQual (qual=<optimized out>, econtext=econtext@entry=0x1f5c138, resultForNull=resultForNull@entry=0'\000') at execQual.c:5269 #16 0x00000000005faef1 in ExecResult (node=node@entry=0x1f5c020) at nodeResult.c:82 #17 0x00000000005ddbf8 in ExecProcNode (node=node@entry=0x1f5c020) at execProcnode.c:392 #18 0x00000000005d9c1f in ExecutePlan (dest=0x1ebb7d0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>,operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1f5c020, estate=0x1f5ab28) at execMain.c:1567 #19 standard_ExecutorRun (queryDesc=0x1f5a718, direction=<optimized out>, count=0) at execMain.c:338 #20 0x00000000006f7238 in PortalRunSelect (portal=portal@entry=0x1d13be8, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807,dest=dest@entry=0x1ebb7d0) at pquery.c:946 #21 0x00000000006f875e in PortalRun (portal=0x1d13be8, count=9223372036854775807, isTopLevel=<optimized out>, dest=0x1ebb7d0,altdest=0x1ebb7d0, completionTag=0x7ffd4ba20840 "") at pquery.c:787 #22 0x00000000006f6003 in exec_simple_query (query_string=<optimized out>) at postgres.c:1094 #23 PostgresMain (argc=30489576, argv=0x1ecfb08, dbname=0x1cf5a00 "regression", username=0x1ecfc20 "\b\373\354\001") at postgres.c:4074 #24 0x000000000046ca67 in BackendRun (port=0x1d17b50) at postmaster.c:4262 #25 BackendStartup (port=0x1d17b50) at postmaster.c:3936 #26 ServerLoop () at postmaster.c:1693 #27 0x0000000000690ab7 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1cf45e0) at postmaster.c:1301 #28 0x000000000046d9cd in main (argc=3, argv=0x1cf45e0) at main.c:228 (gdb) p debug_query_string $1 = 0x1d68a78 "select \n sample_0.j as c0\nfrom \n public.testjsonb as sample_0 tablesample system (8) \nwhere cast(coalesce(pg_catalog.char_length(\n cast((select comment from public.room limit 1 offset 20)\n as text)),\n pg_catalog.pg_trigger_depth()) as integer) <> 3" **** BEGIN BACKTRACE OF CORE FILE ./slave/postgres.8104@.core ON marbit **** Core was generated by `postgres: bgworker: parallel worker for PID 2610 '. Program terminated with signal SIGSEGV, Segmentation fault. #0 __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:167 167 ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Datei oder Verzeichnis nicht gefunden. (gdb) bt #0 __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:167 #1 0x0000000000822032 in RestoreSnapshot (start_address=0x7f7b1ee4fa58 <error: Cannot access memory at address 0x7f7b1ee4fa58>)at snapmgr.c:2020 #2 0x00000000004e6142 in ParallelWorkerMain (main_arg=<optimized out>) at parallel.c:1010 #3 0x0000000000683b62 in StartBackgroundWorker () at bgworker.c:726 #4 0x000000000068ec32 in do_start_bgworker (rw=0x22a8900) at postmaster.c:5535 #5 maybe_start_bgworker () at postmaster.c:5709 #6 0x000000000068f686 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:4971 #7 <signal handler called> #8 0x00007f7b1e3f3ac3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81 #9 0x000000000046c381 in ServerLoop () at postmaster.c:1657 #10 0x0000000000690ab7 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x22825e0) at postmaster.c:1301 #11 0x000000000046d9cd in main (argc=3, argv=0x22825e0) at main.c:228
On Fri, Jul 1, 2016 at 6:26 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote: > #1 0x0000000000822032 in RestoreSnapshot (start_address=start_address@entry=0x7f2701d5a110 <error: Cannot access memoryat address 0x7f2701d5a110>) at snapmgr.c:2020 memcpy(snapshot->subxip, serialized_xids + serialized_snapshot->xcnt, serialized_snapshot->subxcnt *sizeof(TransactionId)); So this is choking here? Is one of those pointers NULL? -- Michael
On Fri, Jul 1, 2016 at 2:17 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Fri, Jul 1, 2016 at 6:26 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote: >> #1 0x0000000000822032 in RestoreSnapshot (start_address=start_address@entry=0x7f2701d5a110 <error: Cannot access memoryat address 0x7f2701d5a110>) at snapmgr.c:2020 > > memcpy(snapshot->subxip, serialized_xids + serialized_snapshot->xcnt, > serialized_snapshot->subxcnt * sizeof(TransactionId)); > So this is choking here? Is one of those pointers NULL? Theory 1: If serialized_snapshot->xcnt == 0, then snapshot->xip never gets initialized to a non-NULL value. Then if serialized_snapshot->subxcnt > 0, we set snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt (so that's NULL too). Then in line the line you show we call memcpy(snapshot->subxip, ...). The fix might be something like the attached. Theory 2: The DSM segment was deleted underneath us. We can see that it was not mapped by the time GDB dumped that (start_address is not accessible). Theory 3: Somehow the xcnt or xsubcnt was wrong or the serialized snapshot was truncated, and we read past the end of the DSM, who knows... -- Thomas Munro http://www.enterprisedb.com
Attachment
On Fri, Jul 1, 2016 at 8:48 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> On Fri, Jul 1, 2016 at 2:17 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Fri, Jul 1, 2016 at 6:26 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> >> #1 0x0000000000822032 in RestoreSnapshot (start_address=start_address@entry=0x7f2701d5a110 <error: Cannot access memory at address 0x7f2701d5a110>) at snapmgr.c:2020
> >
> > memcpy(snapshot->subxip, serialized_xids + serialized_snapshot->xcnt,
> > serialized_snapshot->subxcnt * sizeof(TransactionId));
> > So this is choking here? Is one of those pointers NULL?
>
> Theory 1:
> If serialized_snapshot->xcnt == 0, then snapshot->xip never gets
> initialized to a non-NULL value. Then if serialized_snapshot->subxcnt
> > 0, we set snapshot->subxip = snapshot->xip +
> serialized_snapshot->xcnt (so that's NULL too). Then in line the line
> you show we call memcpy(snapshot->subxip, ...). The fix might be
> something like the attached.
>
GetSnapshotData()
{
/*
* We're in hot standby, so get XIDs from KnownAssignedXids.
--
>
> On Fri, Jul 1, 2016 at 2:17 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Fri, Jul 1, 2016 at 6:26 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> >> #1 0x0000000000822032 in RestoreSnapshot (start_address=start_address@entry=0x7f2701d5a110 <error: Cannot access memory at address 0x7f2701d5a110>) at snapmgr.c:2020
> >
> > memcpy(snapshot->subxip, serialized_xids + serialized_snapshot->xcnt,
> > serialized_snapshot->subxcnt * sizeof(TransactionId));
> > So this is choking here? Is one of those pointers NULL?
>
> Theory 1:
> If serialized_snapshot->xcnt == 0, then snapshot->xip never gets
> initialized to a non-NULL value. Then if serialized_snapshot->subxcnt
> > 0, we set snapshot->subxip = snapshot->xip +
> serialized_snapshot->xcnt (so that's NULL too). Then in line the line
> you show we call memcpy(snapshot->subxip, ...). The fix might be
> something like the attached.
>
I was just typing the mail, when I see this mail. I also reached to the conclusion that this is the reason of crash. You can see how CopySnapshot calculates the subxipoff, may be writing code that way will be more consistent. In case of recovery, I think serialized_snapshot->xcnt will always be zero as we fill everything in subxip array (refer below code in GetSnapshotData).
GetSnapshotData()
{
/*
* We're in hot standby, so get XIDs from KnownAssignedXids.
..
..
}
On Fri, Jul 1, 2016 at 3:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Jul 1, 2016 at 8:48 AM, Thomas Munro <thomas.munro@enterprisedb.com> > wrote: >> If serialized_snapshot->xcnt == 0, then snapshot->xip never gets >> initialized to a non-NULL value. Then if serialized_snapshot->subxcnt >> > 0, we set snapshot->subxip = snapshot->xip + >> serialized_snapshot->xcnt (so that's NULL too). Then in line the line >> you show we call memcpy(snapshot->subxip, ...). The fix might be >> something like the attached. > > I was just typing the mail, when I see this mail. I also reached to the > conclusion that this is the reason of crash. You can see how CopySnapshot > calculates the subxipoff, may be writing code that way will be more > consistent. Or maybe just like this? - snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt; + snapshot->subxip = ((TransactionId *) (snapshot + 1)) + + serialized_snapshot->xcnt; -- Thomas Munro http://www.enterprisedb.com
Attachment
On Fri, Jul 1, 2016 at 9:38 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Fri, Jul 1, 2016 at 3:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Fri, Jul 1, 2016 at 8:48 AM, Thomas Munro <thomas.munro@enterprisedb.com> >> wrote: >>> If serialized_snapshot->xcnt == 0, then snapshot->xip never gets >>> initialized to a non-NULL value. Then if serialized_snapshot->subxcnt >>> > 0, we set snapshot->subxip = snapshot->xip + >>> serialized_snapshot->xcnt (so that's NULL too). Then in line the line >>> you show we call memcpy(snapshot->subxip, ...). The fix might be >>> something like the attached. >> >> I was just typing the mail, when I see this mail. I also reached to the >> conclusion that this is the reason of crash. You can see how CopySnapshot >> calculates the subxipoff, may be writing code that way will be more >> consistent. > > Or maybe just like this? > > - snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt; > + snapshot->subxip = ((TransactionId *) (snapshot + 1)) + > + serialized_snapshot->xcnt; > This way it looks better to me. Thanks for the patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Amit Kapila writes: > On Fri, Jul 1, 2016 at 9:38 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> Or maybe just like this? >> >> - snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt; >> + snapshot->subxip = ((TransactionId *) (snapshot + 1)) + >> + serialized_snapshot->xcnt; >> > > This way it looks better to me. Thanks for the patch. I no longer see these crashes when testing with the patch applied. thanks, Andreas
On Fri, Jul 1, 2016 at 1:41 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote: > Amit Kapila writes: >> On Fri, Jul 1, 2016 at 9:38 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >>> Or maybe just like this? >>> >>> - snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt; >>> + snapshot->subxip = ((TransactionId *) (snapshot + 1)) + >>> + serialized_snapshot->xcnt; >>> >> >> This way it looks better to me. Thanks for the patch. > > I no longer see these crashes when testing with the patch applied. Committed and back-patched to 9.5 where this code was added. Thanks to all for the report, patch, review, and testing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company