PostgreSQL crashes with SIGSEGV - Mailing list pgsql-bugs
From | Bernd Helmle |
---|---|
Subject | PostgreSQL crashes with SIGSEGV |
Date | |
Msg-id | 1512661638.9720.34.camel@oopsware.de Whole thread Raw |
Responses |
Re: PostgreSQL crashes with SIGSEGV
Re: PostgreSQL crashes with SIGSEGV Re: PostgreSQL crashes with SIGSEGV |
List | pgsql-bugs |
A customer recently reported a crash in a postgres backend. The backend encountered a SIGSEGV, crashing during SELECTs from a fairly complicated view using a grouping set directive. I've managed to reproduce it by tracking it down to a specific SELECT, but unfortunately couldn't yet manage to strip it down to a small, repeatable test case which doesn't involve the whole (sensitive) dataset. I'm reporting my findings so far, maybe it helps to track it down already. The crashed backend gives the following backtrace: Program received signal SIGSEGV, Segmentation fault. 0x000000000099be84 in pfree (pointer=0x17194e8) at mcxt.c:1007 1007 context = ((StandardChunkHeader *) (gdb) bt #0 0x000000000099be84 in pfree (pointer=0x17194e8) at mcxt.c:1007 #1 0x00000000004798d8 in heap_free_minimal_tuple (mtup=0x17194e8) at heaptuple.c:1403 #2 0x0000000000687202 in ExecClearTuple (slot=0x10a3f88) at execTuples.c:455 #3 0x0000000000686b89 in ExecResetTupleTable (tupleTable=0x10a06a0, shouldFree=0 '\000') at execTuples.c:169 #4 0x00000000006773ff in ExecEndPlan (planstate=0x10a0398, estate=0x10a01f0) at execMain.c:1469 #5 0x0000000000675927 in standard_ExecutorEnd (queryDesc=0x10910c0) at execMain.c:468 #6 0x0000000000675865 in ExecutorEnd (queryDesc=0x10910c0) at execMain.c:439 #7 0x000000000062c71c in PortalCleanup (portal=0xf79ae0) at portalcmds.c:280 #8 0x000000000099ce97 in PortalDrop (portal=0xf79ae0, isTopCommit=0 '\000') at portalmem.c:510 #9 0x0000000000813639 in exec_simple_query ( query_string=0xf3dbe0 "SELECT * FROM \"xxxx\".\"xxxx\" WHERE (xxxx = 'xxxx') LIMIT 1000;") at postgres.c:1095 #10 0x00000000008177ea in PostgresMain (argc=1, argv=0xee9e10, dbname=0xeb9a78 "xxxx", username=0xee9c78 "bernd") at postgres.c:4072 #11 0x000000000078cc27 in BackendRun (port=0xedc700) at postmaster.c:4342 #12 0x000000000078c377 in BackendStartup (port=0xedc700) at postmaster.c:4016 #13 0x0000000000788983 in ServerLoop () at postmaster.c:1721 #14 0x0000000000787fb8 in PostmasterMain (argc=1, argv=0xeb7960) at postmaster.c:1329 #15 0x00000000006d1b52 in main (argc=1, argv=0xeb7960) at main.c:228 I've tested this so far against very current REL9_6_STABLE and REL9_5_STABLE and got them to crash with the same backtrace. The crash is dependent on the chosen plan, experiments with work_mem show that the crash seems to happen only if you get external sorts into the execution plan. REL10_STABLE seems not affected, as my extracted application query doesn't crash there. Running the query against REL9_6_STABLE with valgrind shows the following results: ==00:00:01:33.067 13158== Invalid write of size 8 ==00:00:01:33.067 13158== at 0x93D53B: AllocSetFree (aset.c:998) ==00:00:01:33.067 13158== by 0x93EC03: pfree (mcxt.c:1012) ==00:00:01:33.067 13158== by 0x476E34: heap_free_minimal_tuple (heaptuple.c:1403) ==00:00:01:33.067 13158== by 0x6521A3: ExecClearTuple (execTuples.c:455) ==00:00:01:33.067 13158== by 0x651D4B: ExecResetTupleTable (execTuples.c:169) ==00:00:03:34.540 14381== Invalid read of size 8 ==00:00:03:34.540 14381== at 0x99E29C: pfree (mcxt.c:1007) ==00:00:03:34.540 14381== by 0x4798D7: heap_free_minimal_tuple (heaptuple.c:1403) ==00:00:03:34.540 14381== by 0x68753E: ExecClearTuple (execTuples.c:455) ==00:00:03:34.540 14381== by 0x686EC5: ExecResetTupleTable (execTuples.c:169) ==00:00:03:34.540 14381== by 0x67773B: ExecEndPlan (execMain.c:1469) ==00:00:03:34.540 14381== by 0x675C63: standard_ExecutorEnd (execMain.c:468) ==00:00:03:34.540 14381== by 0x675BA1: ExecutorEnd (execMain.c:439) ==00:00:03:34.540 14381== by 0x62CA58: PortalCleanup (portalcmds.c:280) ==00:00:03:34.540 14381== by 0x99F412: PortalDrop (portalmem.c:510) ==00:00:03:34.540 14381== by 0x8142A4: exec_simple_query (postgres.c:1095) ==00:00:03:34.540 14381== by 0x818455: PostgresMain (postgres.c:4072) ==00:00:03:34.540 14381== by 0x78CF63: BackendRun (postmaster.c:4342) ==00:00:03:34.540 14381== Address 0x188cd220 is 474,080 bytes inside a block of size 1,048,576 free'd ==00:00:03:34.540 14381== at 0x4C2FD18: free (vg_replace_malloc.c:530) ==00:00:03:34.540 14381== by 0x99A8DB: AllocSetDelete (aset.c:653) ==00:00:03:34.540 14381== by 0x99C7A4: MemoryContextDelete (mcxt.c:225) ==00:00:03:34.540 14381== by 0x99C855: MemoryContextDeleteChildren (mcxt.c:245) ==00:00:03:34.540 14381== by 0x99C772: MemoryContextDelete (mcxt.c:208) ==00:00:03:34.540 14381== by 0x9A5F10: tuplesort_end (tuplesort.c:1198) ==00:00:03:34.540 14381== by 0x692568: ExecEndAgg (nodeAgg.c:3449) ==00:00:03:34.540 14381== by 0x67BB46: ExecEndNode (execProcnode.c:755) ==00:00:03:34.540 14381== by 0x6AD652: ExecEndSubqueryScan (nodeSubqueryscan.c:181) ==00:00:03:34.540 14381== by 0x67BA69: ExecEndNode (execProcnode.c:697) ==00:00:03:34.540 14381== by 0x69DF6E: ExecEndLimit (nodeLimit.c:438) ==00:00:03:34.540 14381== by 0x67BB9D: ExecEndNode (execProcnode.c:779) ==00:00:03:34.540 14381== Block was alloc'd at ==00:00:03:34.540 14381== at 0x4C2EB6B: malloc (vg_replace_malloc.c:299) ==00:00:03:34.540 14381== by 0x99AED4: AllocSetAlloc (aset.c:866) ==00:00:03:34.540 14381== by 0x99DC7A: palloc (mcxt.c:904) ==00:00:03:34.540 14381== by 0x4798F6: heap_copy_minimal_tuple (heaptuple.c:1417) ==00:00:03:34.540 14381== by 0x6878D7: ExecCopySlotMinimalTuple (execTuples.c:593) ==00:00:03:34.540 14381== by 0x9AC1D2: copytup_heap (tuplesort.c:3998) ==00:00:03:34.540 14381== by 0x9A616C: tuplesort_puttupleslot (tuplesort.c:1345) ==00:00:03:34.540 14381== by 0x68D7D8: fetch_input_tuple (nodeAgg.c:601) ==00:00:03:34.540 14381== by 0x68FEC1: agg_retrieve_direct (nodeAgg.c:2168) ==00:00:03:34.540 14381== by 0x68F9AB: ExecAgg (nodeAgg.c:1903) ==00:00:03:34.540 14381== by 0x67B70A: ExecProcNode (execProcnode.c:503) ==00:00:03:34.540 14381== by 0x6AD40F: SubqueryNext (nodeSubqueryscan.c:53) ==00:00:03:34.540 14381== { <insert_a_suppression_name_here> Memcheck:Addr8 fun:pfree fun:heap_free_minimal_tuple fun:ExecClearTuple fun:ExecResetTupleTable fun:ExecEndPlan fun:standard_ExecutorEnd fun:ExecutorEnd fun:PortalCleanup fun:PortalDrop fun:exec_simple_query fun:PostgresMain fun:BackendRun } ==00:00:03:34.548 14381== Invalid read of size 4 ==00:00:03:34.548 14381== at 0x99E2AE: pfree (mcxt.c:1010) ==00:00:03:34.548 14381== by 0x4798D7: heap_free_minimal_tuple (heaptuple.c:1403) ==00:00:03:34.548 14381== by 0x68753E: ExecClearTuple (execTuples.c:455) ==00:00:03:34.548 14381== by 0x686EC5: ExecResetTupleTable (execTuples.c:169) ==00:00:03:34.548 14381== by 0x67773B: ExecEndPlan (execMain.c:1469) ==00:00:03:34.548 14381== by 0x675C63: standard_ExecutorEnd (execMain.c:468) ==00:00:03:34.548 14381== by 0x675BA1: ExecutorEnd (execMain.c:439) ==00:00:03:34.548 14381== by 0x62CA58: PortalCleanup (portalcmds.c:280) ==00:00:03:34.548 14381== by 0x99F412: PortalDrop (portalmem.c:510) ==00:00:03:34.548 14381== by 0x8142A4: exec_simple_query (postgres.c:1095) ==00:00:03:34.548 14381== by 0x818455: PostgresMain (postgres.c:4072) ==00:00:03:34.548 14381== by 0x78CF63: BackendRun (postmaster.c:4342) ==00:00:03:34.548 14381== Address 0x7f7f7f7f7f7f7f7f is not stack'd, malloc'd or (recently) free'd ==00:00:03:34.548 14381== { <insert_a_suppression_name_here> Memcheck:Addr4 fun:pfree fun:heap_free_minimal_tuple fun:ExecClearTuple fun:ExecResetTupleTable fun:ExecEndPlan fun:standard_ExecutorEnd fun:ExecutorEnd fun:PortalCleanup fun:PortalDrop fun:exec_simple_query fun:PostgresMain fun:BackendRun } ==00:00:03:34.548 14381== ==00:00:03:34.548 14381== Process terminating with default action of signal 11 (SIGSEGV): dumping core ==00:00:03:34.548 14381== General Protection Fault ==00:00:03:34.548 14381== at 0x99E2AE: pfree (mcxt.c:1010) ==00:00:03:34.548 14381== by 0x4798D7: heap_free_minimal_tuple (heaptuple.c:1403) ==00:00:03:34.548 14381== by 0x68753E: ExecClearTuple (execTuples.c:455) ==00:00:03:34.548 14381== by 0x686EC5: ExecResetTupleTable (execTuples.c:169) ==00:00:03:34.548 14381== by 0x67773B: ExecEndPlan (execMain.c:1469) ==00:00:03:34.548 14381== by 0x675C63: standard_ExecutorEnd (execMain.c:468) ==00:00:03:34.548 14381== by 0x675BA1: ExecutorEnd (execMain.c:439) ==00:00:03:34.548 14381== by 0x62CA58: PortalCleanup (portalcmds.c:280) ==00:00:03:34.548 14381== by 0x99F412: PortalDrop (portalmem.c:510) ==00:00:03:34.548 14381== by 0x8142A4: exec_simple_query (postgres.c:1095) ==00:00:03:34.548 14381== by 0x818455: PostgresMain (postgres.c:4072) ==00:00:03:34.548 14381== by 0x78CF63: BackendRun (postmaster.c:4342) ==00:00:03:35.088 14381== ==00:00:03:35.088 14381== Process terminating with default action of signal 11 (SIGSEGV) ==00:00:03:35.088 14381== General Protection Fault ==00:00:03:35.088 14381== at 0x632016C: _dl_catch_error (in /usr/lib64/libc-2.25.so) ==00:00:03:35.088 14381== by 0x631F8E6: __libc_dlclose (in /usr/lib64/libc-2.25.so) ==00:00:03:35.088 14381== by 0x634B5E4: free_mem (in /usr/lib64/libc-2.25.so) ==00:00:03:35.088 14381== by 0x634B1E1: __libc_freeres (in /usr/lib64/libc-2.25.so) ==00:00:03:35.088 14381== by 0x4A296DB: _vgnU_freeres (vg_preloaded.c:77) ==00:00:03:35.088 14381== by 0x18118EA7: ??? ==00:00:03:35.088 14381== by 0x4798D7: heap_free_minimal_tuple (heaptuple.c:1403) ==00:00:03:35.088 14381== by 0x68753E: ExecClearTuple (execTuples.c:455) ==00:00:03:35.088 14381== by 0x686EC5: ExecResetTupleTable (execTuples.c:169) ==00:00:03:35.088 14381== by 0x67773B: ExecEndPlan (execMain.c:1469) ==00:00:03:35.088 14381== by 0x675C63: standard_ExecutorEnd (execMain.c:468) ==00:00:03:35.088 14381== by 0x675BA1: ExecutorEnd (execMain.c:439) ==00:00:03:35.088 14381== ==00:00:03:35.088 14381== HEAP SUMMARY: ==00:00:03:35.088 14381== in use at exit: 6,833,625 bytes in 531 blocks ==00:00:03:35.088 14381== total heap usage: 814,111 allocs, 1,264 frees, 118,761,978 bytes allocated ==00:00:03:35.088 14381== ==00:00:03:35.088 14381== For a detailed leak analysis, rerun with: -- leak-check=full ==00:00:03:35.088 14381== ==00:00:03:35.088 14381== For counts of detected and suppressed errors, rerun with: -v ==00:00:03:35.088 14381== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 35 from 6) It seems the backend tries to free a minimal tuple at executor end, which is already gone by deleting the memory context it was allocated in before. ExecResetTupleTable() is looping through a list with 70 entries, it encounters (after 6 or seven rounds) the first tuple slot with tts_shouldFreeMin set, all others before don't have it set: (gdb) p *slot $6 = {type = T_TupleTableSlot, tts_isempty = 0 '\000', tts_shouldFree = 0 '\000', tts_shouldFreeMin = 1 '\001', tts_slow = 1 '\001', tts_tuple = 0x10a6c48, tts_tupleDescriptor = 0x10d1be8, tts_mcxt = 0xf59668, tts_buffer = 0, tts_nvalid = 6, tts_values = 0x10d2640, tts_isnull = 0x10d26d8 "", tts_mintuple = 0x171c168, tts_minhdr = {t_len = 162, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 0}, t_tableOid = 0, t_data = 0x171c160}, tts_off = 88} (gdb) p list_length(tupleTable) tts_mintuple is invalid, though: (gdb) p *slot->tts_mintuple Cannot access memory at address 0x171c168 The following attempt to clean it in ExecClearTuple() then crashes the backend. Bernd
pgsql-bugs by date: