Thread: skink's test_decoding failures in 9.4 branch
I can't help noticing that the failure rate on skink has gone from "rare"
to "100%" since 3d5b227:

http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=skink&br=REL9_4_STABLE

I think we need to put some effort into figuring out what's up there.

Also, this morning curculio showed what might be the same issue:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2016-07-20%2014%3A30%3A01

I've still had no luck reproducing it here, though.  Wonder if it is
specific to certain compilers.

			regards, tom lane
I wrote:
> I've still had no luck reproducing it here, though.

Hah --- I take that back.  On about the fourth or fifth trial:

==00:00:00:34.291 21525== Invalid read of size 1
==00:00:00:34.291 21525==    at 0x4A08DEC: memcpy (mc_replace_strmem.c:882)
==00:00:00:34.291 21525==    by 0x66FA54: DecodeXLogTuple (decode.c:899)
==00:00:00:34.291 21525==    by 0x670561: LogicalDecodingProcessRecord (decode.c:711)
==00:00:00:34.291 21525==    by 0x671BC3: pg_logical_slot_get_changes_guts (logicalfuncs.c:440)
==00:00:00:34.291 21525==    by 0x5C0B6B: ExecMakeTableFunctionResult (execQual.c:2196)
==00:00:00:34.291 21525==    by 0x5D4131: FunctionNext (nodeFunctionscan.c:95)
==00:00:00:34.291 21525==    by 0x5C170D: ExecScan (execScan.c:82)
==00:00:00:34.291 21525==    by 0x5BA007: ExecProcNode (execProcnode.c:426)
==00:00:00:34.291 21525==    by 0x5B8A61: standard_ExecutorRun (execMain.c:1490)
==00:00:00:34.291 21525==    by 0x6BFE36: PortalRunSelect (pquery.c:942)
==00:00:00:34.291 21525==    by 0x6C11EF: PortalRun (pquery.c:786)
==00:00:00:34.291 21525==    by 0x6BD7E3: exec_simple_query (postgres.c:1072)
==00:00:00:34.291 21525==  Address 0xe5311d6 is 6 bytes after a block of size 8,192 alloc'd
==00:00:00:34.291 21525==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==00:00:00:34.291 21525==    by 0x4ED399: XLogReaderAllocate (xlogreader.c:83)
==00:00:00:34.291 21525==    by 0x6710B3: StartupDecodingContext (logical.c:161)
==00:00:00:34.291 21525==    by 0x671303: CreateDecodingContext (logical.c:413)
==00:00:00:34.291 21525==    by 0x671AF7: pg_logical_slot_get_changes_guts (logicalfuncs.c:394)
==00:00:00:34.291 21525==    by 0x5C0B6B: ExecMakeTableFunctionResult (execQual.c:2196)
==00:00:00:34.291 21525==    by 0x5D4131: FunctionNext (nodeFunctionscan.c:95)
==00:00:00:34.291 21525==    by 0x5C170D: ExecScan (execScan.c:82)
==00:00:00:34.291 21525==    by 0x5BA007: ExecProcNode (execProcnode.c:426)
==00:00:00:34.291 21525==    by 0x5B8A61: standard_ExecutorRun (execMain.c:1490)
==00:00:00:34.291 21525==    by 0x6BFE36: PortalRunSelect (pquery.c:942)
==00:00:00:34.291 21525==    by 0x6C11EF: PortalRun (pquery.c:786)
==00:00:00:34.291 21525==
...
...
==00:00:00:35.011 21525== Invalid read of size 1
==00:00:00:35.011 21525==    at 0x4A08CCA: memcpy (mc_replace_strmem.c:882)
==00:00:00:35.011 21525==    by 0x66FA54: DecodeXLogTuple (decode.c:899)
==00:00:00:35.011 21525==    by 0x670561: LogicalDecodingProcessRecord (decode.c:711)
==00:00:00:35.011 21525==    by 0x671BC3: pg_logical_slot_get_changes_guts (logicalfuncs.c:440)
==00:00:00:35.011 21525==    by 0x5C0B6B: ExecMakeTableFunctionResult (execQual.c:2196)
==00:00:00:35.011 21525==    by 0x5D4131: FunctionNext (nodeFunctionscan.c:95)
==00:00:00:35.011 21525==    by 0x5C170D: ExecScan (execScan.c:82)
==00:00:00:35.011 21525==    by 0x5BA007: ExecProcNode (execProcnode.c:426)
==00:00:00:35.012 21525==    by 0x5B8A61: standard_ExecutorRun (execMain.c:1490)
==00:00:00:35.012 21525==    by 0x6BFE36: PortalRunSelect (pquery.c:942)
==00:00:00:35.012 21525==    by 0x6C11EF: PortalRun (pquery.c:786)
==00:00:00:35.012 21525==    by 0x6BD7E3: exec_simple_query (postgres.c:1072)
==00:00:00:35.012 21525==  Address 0x4ff2450 is 0 bytes after a block of size 8,192 alloc'd
==00:00:00:35.012 21525==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==00:00:00:35.012 21525==    by 0x4ED399: XLogReaderAllocate (xlogreader.c:83)
==00:00:00:35.012 21525==    by 0x6710B3: StartupDecodingContext (logical.c:161)
==00:00:00:35.012 21525==    by 0x671303: CreateDecodingContext (logical.c:413)
==00:00:00:35.012 21525==    by 0x671AF7: pg_logical_slot_get_changes_guts (logicalfuncs.c:394)
==00:00:00:35.012 21525==    by 0x5C0B6B: ExecMakeTableFunctionResult (execQual.c:2196)
==00:00:00:35.012 21525==    by 0x5D4131: FunctionNext (nodeFunctionscan.c:95)
==00:00:00:35.012 21525==    by 0x5C170D: ExecScan (execScan.c:82)
==00:00:00:35.012 21525==    by 0x5BA007: ExecProcNode (execProcnode.c:426)
==00:00:00:35.012 21525==    by 0x5B8A61: standard_ExecutorRun (execMain.c:1490)
==00:00:00:35.012 21525==    by 0x6BFE36: PortalRunSelect (pquery.c:942)
==00:00:00:35.012 21525==    by 0x6C11EF: PortalRun (pquery.c:786)
==00:00:00:35.012 21525==
==00:00:00:35.012 21525== Invalid read of size 1
==00:00:00:35.012 21525==    at 0x4A08CB8: memcpy (mc_replace_strmem.c:882)
==00:00:00:35.012 21525==    by 0x66FA54: DecodeXLogTuple (decode.c:899)
==00:00:00:35.012 21525==    by 0x670561: LogicalDecodingProcessRecord (decode.c:711)
==00:00:00:35.012 21525==    by 0x671BC3: pg_logical_slot_get_changes_guts (logicalfuncs.c:440)
==00:00:00:35.012 21525==    by 0x5C0B6B: ExecMakeTableFunctionResult (execQual.c:2196)
==00:00:00:35.012 21525==    by 0x5D4131: FunctionNext (nodeFunctionscan.c:95)
==00:00:00:35.012 21525==    by 0x5C170D: ExecScan (execScan.c:82)
==00:00:00:35.012 21525==    by 0x5BA007: ExecProcNode (execProcnode.c:426)
==00:00:00:35.012 21525==    by 0x5B8A61: standard_ExecutorRun (execMain.c:1490)
==00:00:00:35.012 21525==    by 0x6BFE36: PortalRunSelect (pquery.c:942)
==00:00:00:35.012 21525==    by 0x6C11EF: PortalRun (pquery.c:786)
==00:00:00:35.012 21525==    by 0x6BD7E3: exec_simple_query (postgres.c:1072)
==00:00:00:35.012 21525==  Address 0x4ff2451 is 1 bytes after a block of size 8,192 alloc'd
==00:00:00:35.012 21525==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==00:00:00:35.012 21525==    by 0x4ED399: XLogReaderAllocate (xlogreader.c:83)
==00:00:00:35.012 21525==    by 0x6710B3: StartupDecodingContext (logical.c:161)
==00:00:00:35.012 21525==    by 0x671303: CreateDecodingContext (logical.c:413)
==00:00:00:35.012 21525==    by 0x671AF7: pg_logical_slot_get_changes_guts (logicalfuncs.c:394)
==00:00:00:35.012 21525==    by 0x5C0B6B: ExecMakeTableFunctionResult (execQual.c:2196)
==00:00:00:35.012 21525==    by 0x5D4131: FunctionNext (nodeFunctionscan.c:95)
==00:00:00:35.012 21525==    by 0x5C170D: ExecScan (execScan.c:82)
==00:00:00:35.012 21525==    by 0x5BA007: ExecProcNode (execProcnode.c:426)
==00:00:00:35.012 21525==    by 0x5B8A61: standard_ExecutorRun (execMain.c:1490)
==00:00:00:35.012 21525==    by 0x6BFE36: PortalRunSelect (pquery.c:942)
==00:00:00:35.012 21525==    by 0x6C11EF: PortalRun (pquery.c:786)
==00:00:00:35.012 21525==
...
...
LOG:  server process (PID 21525) exited with exit code 128

This is rather interesting because I do not recall that any of skink's
failures have shown an access more than 1 byte past the end of the buffer.

Any suggestions how to debug this?

			regards, tom lane
Hi,

On 2016-07-20 12:45:04 -0400, Tom Lane wrote:
> I wrote:
> > I've still had no luck reproducing it here, though.

Same here so far.

> Hah --- I take that back.  On about the fourth or fifth trial:

Interesting.

> ==00:00:00:34.291 21525== Invalid read of size 1
> ==00:00:00:34.291 21525==    at 0x4A08DEC: memcpy (mc_replace_strmem.c:882)
> ==00:00:00:34.291 21525==    by 0x66FA54: DecodeXLogTuple (decode.c:899)
> ==00:00:00:34.291 21525==    by 0x670561: LogicalDecodingProcessRecord (decode.c:711)
> ==00:00:00:34.291 21525==    by 0x671BC3: pg_logical_slot_get_changes_guts (logicalfuncs.c:440)
> ==00:00:00:34.291 21525==    by 0x5C0B6B: ExecMakeTableFunctionResult (execQual.c:2196)
> ==00:00:00:34.291 21525==    by 0x5D4131: FunctionNext (nodeFunctionscan.c:95)
> ==00:00:00:34.291 21525==    by 0x5C170D: ExecScan (execScan.c:82)
> ==00:00:00:34.291 21525==    by 0x5BA007: ExecProcNode (execProcnode.c:426)
> ==00:00:00:34.291 21525==    by 0x5B8A61: standard_ExecutorRun (execMain.c:1490)
> ==00:00:00:34.291 21525==    by 0x6BFE36: PortalRunSelect (pquery.c:942)
> ==00:00:00:34.291 21525==    by 0x6C11EF: PortalRun (pquery.c:786)
> ==00:00:00:34.291 21525==    by 0x6BD7E3: exec_simple_query (postgres.c:1072)
> ==00:00:00:34.291 21525==  Address 0xe5311d6 is 6 bytes after a block of size 8,192 alloc'd
> ==00:00:00:34.291 21525==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
> ==00:00:00:34.291 21525==    by 0x4ED399: XLogReaderAllocate (xlogreader.c:83)
> ==00:00:00:34.291 21525==    by 0x6710B3: StartupDecodingContext (logical.c:161)
> ==00:00:00:34.291 21525==    by 0x671303: CreateDecodingContext (logical.c:413)
> ==00:00:00:34.291 21525==    by 0x671AF7: pg_logical_slot_get_changes_guts (logicalfuncs.c:394)
> ==00:00:00:34.291 21525==    by 0x5C0B6B: ExecMakeTableFunctionResult (execQual.c:2196)
> ==00:00:00:34.291 21525==    by 0x5D4131: FunctionNext (nodeFunctionscan.c:95)
> ==00:00:00:34.291 21525==    by 0x5C170D: ExecScan (execScan.c:82)
> ==00:00:00:34.291 21525==    by 0x5BA007: ExecProcNode (execProcnode.c:426)
> ==00:00:00:34.291 21525==    by 0x5B8A61: standard_ExecutorRun (execMain.c:1490)
> ==00:00:00:34.291 21525==    by 0x6BFE36: PortalRunSelect (pquery.c:942)
> ==00:00:00:34.291 21525==    by 0x6C11EF: PortalRun (pquery.c:786)
> ==00:00:00:34.291 21525==

> This is rather interesting because I do not recall that any of skink's
> failures have shown an access more than 1 byte past the end of the buffer.
>
> Any suggestions how to debug this?

I guess either using valgrind's gdb server on error, or putting some
asserts checking the size would be best.  I can look into it, but it'll
not be today likely.

Regards,

Andres
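For reference, the gdb-server route looks roughly like this; the binary
location and data directory below are placeholders, not anything from the
actual buildfarm setup:

```shell
# Run the backend under valgrind, halting at the first reported error so a
# debugger can be attached while the bad read is still on the stack:
valgrind --vgdb=yes --vgdb-error=1 postgres -D "$PGDATA"

# When valgrind stops, it prints the exact vgdb invocation to use; from a
# second terminal, attach gdb via valgrind's embedded gdb server:
#   gdb /path/to/postgres
#   (gdb) target remote | vgdb
```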
Andres Freund <andres@anarazel.de> writes:
> I guess either using valgrind's gdb server on error, or putting some
> asserts checking the size would be best.  I can look into it, but it'll
> not be today likely.

I believe the problem is that DecodeUpdate is not on the same page as the
WAL-writing routines about how much data there is for an old_key_tuple.
Specifically, I see this in 9.4's log_heap_update():

    if (old_key_tuple)
    {
        ...
        xlhdr_idx.t_len = old_key_tuple->t_len;

        rdata[nr].data = (char *) old_key_tuple->t_data
            + offsetof(HeapTupleHeaderData, t_bits);
        rdata[nr].len = old_key_tuple->t_len
            - offsetof(HeapTupleHeaderData, t_bits);
        ...
    }

so that the amount of tuple data that's *actually* in WAL is
offsetof(HeapTupleHeaderData, t_bits) less than what t_len says.
However, over in DecodeUpdate, this is processed with

    xl_heap_header_len xlhdr;

    memcpy(&xlhdr, data, sizeof(xlhdr));
    ...
    datalen = xlhdr.t_len + SizeOfHeapHeader;
    ...
    DecodeXLogTuple(data, datalen, change->data.tp.oldtuple);

and what DecodeXLogTuple does is

    int         datalen = len - SizeOfHeapHeader;

    (so we're back to datalen == xlhdr.t_len)
    ...
    memcpy(((char *) tuple->tuple.t_data) + offsetof(HeapTupleHeaderData, t_bits),
           data + SizeOfHeapHeader,
           datalen);

so that we are copying offsetof(HeapTupleHeaderData, t_bits) too much data
from the WAL buffer.  Most of the time this doesn't hurt, but it's making
valgrind complain, and on an unlucky day we might crash entirely.

I have not looked to see if the bug also exists in > 9.4.  Also, it's not
very clear to me whether other call sites for DecodeXLogTuple might have
related bugs.

			regards, tom lane