Thread: skink's test_decoding failures in 9.4 branch
I can't help noticing that the failure rate on skink has gone from "rare"
to "100%" since 3d5b227:

http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=skink&br=REL9_4_STABLE

I think we need to put some effort into figuring out what's up there.

Also, this morning curculio showed what might be the same issue:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2016-07-20%2014%3A30%3A01

I've still had no luck reproducing it here, though.  Wonder if it is
specific to certain compilers.

			regards, tom lane
I wrote:
> I've still had no luck reproducing it here, though.

Hah --- I take that back.  On about the fourth or fifth trial:

==00:00:00:34.291 21525== Invalid read of size 1
==00:00:00:34.291 21525==    at 0x4A08DEC: memcpy (mc_replace_strmem.c:882)
==00:00:00:34.291 21525==    by 0x66FA54: DecodeXLogTuple (decode.c:899)
==00:00:00:34.291 21525==    by 0x670561: LogicalDecodingProcessRecord (decode.c:711)
==00:00:00:34.291 21525==    by 0x671BC3: pg_logical_slot_get_changes_guts (logicalfuncs.c:440)
==00:00:00:34.291 21525==    by 0x5C0B6B: ExecMakeTableFunctionResult (execQual.c:2196)
==00:00:00:34.291 21525==    by 0x5D4131: FunctionNext (nodeFunctionscan.c:95)
==00:00:00:34.291 21525==    by 0x5C170D: ExecScan (execScan.c:82)
==00:00:00:34.291 21525==    by 0x5BA007: ExecProcNode (execProcnode.c:426)
==00:00:00:34.291 21525==    by 0x5B8A61: standard_ExecutorRun (execMain.c:1490)
==00:00:00:34.291 21525==    by 0x6BFE36: PortalRunSelect (pquery.c:942)
==00:00:00:34.291 21525==    by 0x6C11EF: PortalRun (pquery.c:786)
==00:00:00:34.291 21525==    by 0x6BD7E3: exec_simple_query (postgres.c:1072)
==00:00:00:34.291 21525==  Address 0xe5311d6 is 6 bytes after a block of size 8,192 alloc'd
==00:00:00:34.291 21525==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==00:00:00:34.291 21525==    by 0x4ED399: XLogReaderAllocate (xlogreader.c:83)
==00:00:00:34.291 21525==    by 0x6710B3: StartupDecodingContext (logical.c:161)
==00:00:00:34.291 21525==    by 0x671303: CreateDecodingContext (logical.c:413)
==00:00:00:34.291 21525==    by 0x671AF7: pg_logical_slot_get_changes_guts (logicalfuncs.c:394)
==00:00:00:34.291 21525==    by 0x5C0B6B: ExecMakeTableFunctionResult (execQual.c:2196)
==00:00:00:34.291 21525==    by 0x5D4131: FunctionNext (nodeFunctionscan.c:95)
==00:00:00:34.291 21525==    by 0x5C170D: ExecScan (execScan.c:82)
==00:00:00:34.291 21525==    by 0x5BA007: ExecProcNode (execProcnode.c:426)
==00:00:00:34.291 21525==    by 0x5B8A61: standard_ExecutorRun (execMain.c:1490)
==00:00:00:34.291 21525==    by 0x6BFE36: PortalRunSelect (pquery.c:942)
==00:00:00:34.291 21525==    by 0x6C11EF: PortalRun (pquery.c:786)
==00:00:00:34.291 21525==
...
...
==00:00:00:35.011 21525== Invalid read of size 1
==00:00:00:35.011 21525==    at 0x4A08CCA: memcpy (mc_replace_strmem.c:882)
==00:00:00:35.011 21525==    by 0x66FA54: DecodeXLogTuple (decode.c:899)
==00:00:00:35.011 21525==    by 0x670561: LogicalDecodingProcessRecord (decode.c:711)
==00:00:00:35.011 21525==    by 0x671BC3: pg_logical_slot_get_changes_guts (logicalfuncs.c:440)
==00:00:00:35.011 21525==    by 0x5C0B6B: ExecMakeTableFunctionResult (execQual.c:2196)
==00:00:00:35.011 21525==    by 0x5D4131: FunctionNext (nodeFunctionscan.c:95)
==00:00:00:35.011 21525==    by 0x5C170D: ExecScan (execScan.c:82)
==00:00:00:35.011 21525==    by 0x5BA007: ExecProcNode (execProcnode.c:426)
==00:00:00:35.012 21525==    by 0x5B8A61: standard_ExecutorRun (execMain.c:1490)
==00:00:00:35.012 21525==    by 0x6BFE36: PortalRunSelect (pquery.c:942)
==00:00:00:35.012 21525==    by 0x6C11EF: PortalRun (pquery.c:786)
==00:00:00:35.012 21525==    by 0x6BD7E3: exec_simple_query (postgres.c:1072)
==00:00:00:35.012 21525==  Address 0x4ff2450 is 0 bytes after a block of size 8,192 alloc'd
==00:00:00:35.012 21525==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==00:00:00:35.012 21525==    by 0x4ED399: XLogReaderAllocate (xlogreader.c:83)
==00:00:00:35.012 21525==    by 0x6710B3: StartupDecodingContext (logical.c:161)
==00:00:00:35.012 21525==    by 0x671303: CreateDecodingContext (logical.c:413)
==00:00:00:35.012 21525==    by 0x671AF7: pg_logical_slot_get_changes_guts (logicalfuncs.c:394)
==00:00:00:35.012 21525==    by 0x5C0B6B: ExecMakeTableFunctionResult (execQual.c:2196)
==00:00:00:35.012 21525==    by 0x5D4131: FunctionNext (nodeFunctionscan.c:95)
==00:00:00:35.012 21525==    by 0x5C170D: ExecScan (execScan.c:82)
==00:00:00:35.012 21525==    by 0x5BA007: ExecProcNode (execProcnode.c:426)
==00:00:00:35.012 21525==    by 0x5B8A61: standard_ExecutorRun (execMain.c:1490)
==00:00:00:35.012 21525==    by 0x6BFE36: PortalRunSelect (pquery.c:942)
==00:00:00:35.012 21525==    by 0x6C11EF: PortalRun (pquery.c:786)
==00:00:00:35.012 21525==
==00:00:00:35.012 21525== Invalid read of size 1
==00:00:00:35.012 21525==    at 0x4A08CB8: memcpy (mc_replace_strmem.c:882)
==00:00:00:35.012 21525==    by 0x66FA54: DecodeXLogTuple (decode.c:899)
==00:00:00:35.012 21525==    by 0x670561: LogicalDecodingProcessRecord (decode.c:711)
==00:00:00:35.012 21525==    by 0x671BC3: pg_logical_slot_get_changes_guts (logicalfuncs.c:440)
==00:00:00:35.012 21525==    by 0x5C0B6B: ExecMakeTableFunctionResult (execQual.c:2196)
==00:00:00:35.012 21525==    by 0x5D4131: FunctionNext (nodeFunctionscan.c:95)
==00:00:00:35.012 21525==    by 0x5C170D: ExecScan (execScan.c:82)
==00:00:00:35.012 21525==    by 0x5BA007: ExecProcNode (execProcnode.c:426)
==00:00:00:35.012 21525==    by 0x5B8A61: standard_ExecutorRun (execMain.c:1490)
==00:00:00:35.012 21525==    by 0x6BFE36: PortalRunSelect (pquery.c:942)
==00:00:00:35.012 21525==    by 0x6C11EF: PortalRun (pquery.c:786)
==00:00:00:35.012 21525==    by 0x6BD7E3: exec_simple_query (postgres.c:1072)
==00:00:00:35.012 21525==  Address 0x4ff2451 is 1 bytes after a block of size 8,192 alloc'd
==00:00:00:35.012 21525==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==00:00:00:35.012 21525==    by 0x4ED399: XLogReaderAllocate (xlogreader.c:83)
==00:00:00:35.012 21525==    by 0x6710B3: StartupDecodingContext (logical.c:161)
==00:00:00:35.012 21525==    by 0x671303: CreateDecodingContext (logical.c:413)
==00:00:00:35.012 21525==    by 0x671AF7: pg_logical_slot_get_changes_guts (logicalfuncs.c:394)
==00:00:00:35.012 21525==    by 0x5C0B6B: ExecMakeTableFunctionResult (execQual.c:2196)
==00:00:00:35.012 21525==    by 0x5D4131: FunctionNext (nodeFunctionscan.c:95)
==00:00:00:35.012 21525==    by 0x5C170D: ExecScan (execScan.c:82)
==00:00:00:35.012 21525==    by 0x5BA007: ExecProcNode (execProcnode.c:426)
==00:00:00:35.012 21525==    by 0x5B8A61: standard_ExecutorRun (execMain.c:1490)
==00:00:00:35.012 21525==    by 0x6BFE36: PortalRunSelect (pquery.c:942)
==00:00:00:35.012 21525==    by 0x6C11EF: PortalRun (pquery.c:786)
==00:00:00:35.012 21525==
...
...
LOG:  server process (PID 21525) exited with exit code 128

This is rather interesting because I do not recall that any of skink's
failures have shown an access more than 1 byte past the end of the buffer.

Any suggestions how to debug this?

			regards, tom lane
Hi,

On 2016-07-20 12:45:04 -0400, Tom Lane wrote:
> I wrote:
> > I've still had no luck reproducing it here, though.

Same here so far.

> Hah --- I take that back.  On about the fourth or fifth trial:

Interesting.

> ==00:00:00:34.291 21525== Invalid read of size 1
> ==00:00:00:34.291 21525==    at 0x4A08DEC: memcpy (mc_replace_strmem.c:882)
> ==00:00:00:34.291 21525==    by 0x66FA54: DecodeXLogTuple (decode.c:899)
> ==00:00:00:34.291 21525==    by 0x670561: LogicalDecodingProcessRecord (decode.c:711)
> ==00:00:00:34.291 21525==    by 0x671BC3: pg_logical_slot_get_changes_guts (logicalfuncs.c:440)
> ==00:00:00:34.291 21525==    by 0x5C0B6B: ExecMakeTableFunctionResult (execQual.c:2196)
> ==00:00:00:34.291 21525==    by 0x5D4131: FunctionNext (nodeFunctionscan.c:95)
> ==00:00:00:34.291 21525==    by 0x5C170D: ExecScan (execScan.c:82)
> ==00:00:00:34.291 21525==    by 0x5BA007: ExecProcNode (execProcnode.c:426)
> ==00:00:00:34.291 21525==    by 0x5B8A61: standard_ExecutorRun (execMain.c:1490)
> ==00:00:00:34.291 21525==    by 0x6BFE36: PortalRunSelect (pquery.c:942)
> ==00:00:00:34.291 21525==    by 0x6C11EF: PortalRun (pquery.c:786)
> ==00:00:00:34.291 21525==    by 0x6BD7E3: exec_simple_query (postgres.c:1072)
> ==00:00:00:34.291 21525==  Address 0xe5311d6 is 6 bytes after a block of size 8,192 alloc'd
> ==00:00:00:34.291 21525==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
> ==00:00:00:34.291 21525==    by 0x4ED399: XLogReaderAllocate (xlogreader.c:83)
> ==00:00:00:34.291 21525==    by 0x6710B3: StartupDecodingContext (logical.c:161)
> ==00:00:00:34.291 21525==    by 0x671303: CreateDecodingContext (logical.c:413)
> ==00:00:00:34.291 21525==    by 0x671AF7: pg_logical_slot_get_changes_guts (logicalfuncs.c:394)
> ==00:00:00:34.291 21525==    by 0x5C0B6B: ExecMakeTableFunctionResult (execQual.c:2196)
> ==00:00:00:34.291 21525==    by 0x5D4131: FunctionNext (nodeFunctionscan.c:95)
> ==00:00:00:34.291 21525==    by 0x5C170D: ExecScan (execScan.c:82)
> ==00:00:00:34.291 21525==    by 0x5BA007: ExecProcNode (execProcnode.c:426)
> ==00:00:00:34.291 21525==    by 0x5B8A61: standard_ExecutorRun (execMain.c:1490)
> ==00:00:00:34.291 21525==    by 0x6BFE36: PortalRunSelect (pquery.c:942)
> ==00:00:00:34.291 21525==    by 0x6C11EF: PortalRun (pquery.c:786)
> ==00:00:00:34.291 21525==

> This is rather interesting because I do not recall that any of skink's
> failures have shown an access more than 1 byte past the end of the buffer.
>
> Any suggestions how to debug this?

I guess either using valgrind's gdb server on error, or putting some
asserts checking the size would be best.  I can look into it, but it'll
not be today likely.

Regards,

Andres
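For reference, the gdb-server route looks roughly like this; the binary
location and data directory below are placeholders, not anything from the
actual buildfarm setup:

```shell
# Run the backend under valgrind, halting at the first reported error so a
# debugger can be attached while the bad read is still on the stack:
valgrind --vgdb=yes --vgdb-error=1 postgres -D "$PGDATA"

# When valgrind stops, it prints the exact vgdb invocation to use; from a
# second terminal, attach gdb via valgrind's embedded gdb server:
#   gdb /path/to/postgres
#   (gdb) target remote | vgdb
```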
Andres Freund <andres@anarazel.de> writes:
> I guess either using valgrind's gdb server on error, or putting some
> asserts checking the size would be best.  I can look into it, but it'll
> not be today likely.

I believe the problem is that DecodeUpdate is not on the same page as the
WAL-writing routines about how much data there is for an old_key_tuple.
Specifically, I see this in 9.4's log_heap_update():

    if (old_key_tuple)
    {
        ...
        xlhdr_idx.t_len = old_key_tuple->t_len;

        rdata[nr].data = (char *) old_key_tuple->t_data
            + offsetof(HeapTupleHeaderData, t_bits);
        rdata[nr].len = old_key_tuple->t_len
            - offsetof(HeapTupleHeaderData, t_bits);
        ...
    }

so that the amount of tuple data that's *actually* in WAL is
offsetof(HeapTupleHeaderData, t_bits) less than what t_len says.
However, over in DecodeUpdate, this is processed with

    xl_heap_header_len xlhdr;

    memcpy(&xlhdr, data, sizeof(xlhdr));
    ...
    datalen = xlhdr.t_len + SizeOfHeapHeader;
    ...
    DecodeXLogTuple(data, datalen, change->data.tp.oldtuple);

and what DecodeXLogTuple does is

    int         datalen = len - SizeOfHeapHeader;

    (so we're back to datalen == xlhdr.t_len)
    ...
    memcpy(((char *) tuple->tuple.t_data) + offsetof(HeapTupleHeaderData, t_bits),
           data + SizeOfHeapHeader,
           datalen);

so that we are copying offsetof(HeapTupleHeaderData, t_bits) too much data
from the WAL buffer.  Most of the time this doesn't hurt, but it's making
valgrind complain, and on an unlucky day we might crash entirely.

I have not looked to see if the bug also exists in > 9.4.  Also, it's not
very clear to me whether other call sites for DecodeXLogTuple might have
related bugs.

			regards, tom lane