Thread: Weird irreproducible behaviors in plpython tests

Weird irreproducible behaviors in plpython tests

From

Tom Lane

Date:

10 April 2016, 21:11:05

Buildfarm member tick failed today with what appears to be a bug
introduced by (or at least exposed by) the recent plpython ereport
patch.  Presumably the fact that it's failing and no other critters are
has to do with its use of -DCLOBBER_CACHE_ALWAYS and/or
-DRANDOMIZE_ALLOCATED_MEMORY.  However, I cannot reproduce the failure
by using those switches, even though tick's CentOS platform should be
pretty much identical to my RHEL6 machine.  Can anyone else reproduce it?

After failing at that, it occurred to me to try it under valgrind,
and kaboom!  I found a *different* bug, which has apparently been
there a long time.  (I say different because I don't see how this
one could produce tick's symptoms; it's a reference to already-freed
strings, but not an attempt to pfree one.)  I'll be fixing this one
shortly, but now we have another puzzle: why isn't buildfarm member
skink seeing the same failure?  It is running the plpython tests.
Can anyone else reproduce a failure by valgrind'ing the plpython
tests?  It looks here like

==00:00:00:29.670 24996== Invalid read of size 1
==00:00:00:29.670 24996==    at 0x4A07F92: strlen (mc_replace_strmem.c:403)
==00:00:00:29.670 24996==    by 0x826860: MemoryContextStrdup (mcxt.c:1158)
==00:00:00:29.670 24996==    by 0x800441: set_errdata_field (elog.c:1230)
==00:00:00:29.670 24996==    by 0x803EE8: err_generic_string (elog.c:1210)
==00:00:00:29.670 24996==    by 0xDFEF382: PLy_elog (plpy_elog.c:117)
==00:00:00:29.670 24996==    by 0xDFF049D: PLy_procedure_call (plpy_exec.c:1084)
==00:00:00:29.670 24996==    by 0xDFF1BBF: PLy_exec_function (plpy_exec.c:105)
==00:00:00:29.670 24996==    by 0xDFF25C6: plpython_inline_handler (plpy_main.c:345)
==00:00:00:29.670 24996==    by 0x809737: OidFunctionCall1Coll (fmgr.c:1596)
==00:00:00:29.670 24996==    by 0x70E97F: standard_ProcessUtility (utility.c:515)
==00:00:00:29.670 24996==    by 0x70A856: PortalRunUtility (pquery.c:1175)
==00:00:00:29.670 24996==    by 0x70B8FF: PortalRunMulti (pquery.c:1306)
==00:00:00:29.670 24996==  Address 0xefe2894 is 36 bytes inside a block of size 49 free'd
==00:00:00:29.670 24996==    at 0x4A06430: free (vg_replace_malloc.c:446)
==00:00:00:29.670 24996==    by 0x398AE90D5C: ??? (in /usr/lib64/libpython2.6.so.1.0)
==00:00:00:29.670 24996==    by 0x398AE79E3A: ??? (in /usr/lib64/libpython2.6.so.1.0)
==00:00:00:29.670 24996==    by 0x398AE5C586: ??? (in /usr/lib64/libpython2.6.so.1.0)
==00:00:00:29.670 24996==    by 0x398AE5C5BF: ??? (in /usr/lib64/libpython2.6.so.1.0)
==00:00:00:29.670 24996==    by 0x398AE9A704: ??? (in /usr/lib64/libpython2.6.so.1.0)
==00:00:00:29.670 24996==    by 0xDFEECFE: PLy_traceback (plpy_elog.c:358)
==00:00:00:29.670 24996==    by 0xDFEF21E: PLy_elog (plpy_elog.c:83)
==00:00:00:29.670 24996==    by 0xDFF049D: PLy_procedure_call (plpy_exec.c:1084)
==00:00:00:29.670 24996==    by 0xDFF1BBF: PLy_exec_function (plpy_exec.c:105)
==00:00:00:29.670 24996==    by 0xDFF25C6: plpython_inline_handler (plpy_main.c:345)
==00:00:00:29.670 24996==    by 0x809737: OidFunctionCall1Coll (fmgr.c:1596)

the core of the problem being that PLy_get_spi_error_data returns a
pointer to a string that is pointing into the "val" (a/k/a "v") Python
object, which PLy_traceback has carefully released the last refcount on.
So this coding should have failed immediately under any valgrind
testing.  The ereport patch perhaps gave us more scope for the error,
but for me, valgrind'ing the plpython tests fails similarly in 9.5.
So we should have noticed this long since.

I dislike bugs that are platform-dependent with no obvious
reason for it :-(
        regards, tom lane

Re: Weird irreproducible behaviors in plpython tests

From

Andres Freund

Date:

10 April 2016, 21:26:42

On 2016-04-10 17:10:57 -0400, Tom Lane wrote:
> After failing at that, it occurred to me to try it under valgrind,
> and kaboom!  I found a *different* bug, which has apparently been
> there a long time.  (I say different because I don't see how this
> one could produce tick's symptoms; it's a reference to already-freed
> strings, but not an attempt to pfree one.)  I'll be fixing this one
> shortly, but now we have another puzzle: why isn't buildfarm member
> skink seeing the same failure?  It is running the plpython tests.
> Can anyone else reproduce a failure by valgrind'ing the plpython
> tests?  It looks here like

I possible that it's hidden by the broad python error suppressions I
added to valgrind.supp, and the reason they're showing for you is that
you don't appear to have python debugging symbols. But I don't exactly
see the error matching up...

- Andres

Re: Weird irreproducible behaviors in plpython tests

From

Tom Lane

Date:

10 April 2016, 21:55:30

Andres Freund <andres@anarazel.de> writes:
> On 2016-04-10 17:10:57 -0400, Tom Lane wrote:
>> ... I'll be fixing this one
>> shortly, but now we have another puzzle: why isn't buildfarm member
>> skink seeing the same failure?  It is running the plpython tests.

> I possible that it's hidden by the broad python error suppressions I
> added to valgrind.supp, and the reason they're showing for you is that
> you don't appear to have python debugging symbols. But I don't exactly
> see the error matching up...

Hmm.  It's true that I don't have the python debuginfo RPM installed.
But this is a live bug, so I suspect you were too generous about
those suppressions.

FWIW, HEAD passes cleanly under valgrind for me after fixing this one
problem.  I have to leave shortly but will work on the back branches
later.
        regards, tom lane

Re: Weird irreproducible behaviors in plpython tests

From

Andres Freund

Date:

10 April 2016, 21:57:56

On 2016-04-10 17:55:25 -0400, Tom Lane wrote:
> Hmm.  It's true that I don't have the python debuginfo RPM installed.
> But this is a live bug, so I suspect you were too generous about
> those suppressions.

Could be - I just used the ones (after adapting for 32 vs. 64 bit
issues) provided by upstream.

> FWIW, HEAD passes cleanly under valgrind for me after fixing this one
> problem.  I have to leave shortly but will work on the back branches
> later.

Looking through them again:
# Python's allocator does some low-level tricks for efficiency. Those
# can be disabled for better instrumentation; but few people testing
# postgres will have such a build of python. So add broad
# suppressions of the resulting errors.
# See also https://svn.python.org/projects/python/trunk/Misc/README.valgrind
{  python_clever_allocator  Memcheck:Addr4  fun:PyObject_Free
}

{  python_clever_allocator  Memcheck:Addr8  fun:PyObject_Free
}

{  python_clever_allocator  Memcheck:Value4  fun:PyObject_Free
}

{  python_clever_allocator  Memcheck:Value8  fun:PyObject_Free
}

{  python_clever_allocator  Memcheck:Cond  fun:PyObject_Free
}

{  python_clever_allocator  Memcheck:Addr4  fun:PyObject_Realloc
}

{  python_clever_allocator  Memcheck:Addr8  fun:PyObject_Realloc
}

{  python_clever_allocator  Memcheck:Value4  fun:PyObject_Realloc
}

{  python_clever_allocator  Memcheck:Value8  fun:PyObject_Realloc
}

{  python_clever_allocator  Memcheck:Cond  fun:PyObject_Realloc
}

I can't actually see any triggering invalidly with the backtrace you
provided :(

Greetings,

Andres Freund

Re: Weird irreproducible behaviors in plpython tests

From

Tom Lane

Date:

11 April 2016, 02:03:59

Andres Freund <andres@anarazel.de> writes:
> On 2016-04-10 17:55:25 -0400, Tom Lane wrote:
>> Hmm.  It's true that I don't have the python debuginfo RPM installed.
>> But this is a live bug, so I suspect you were too generous about
>> those suppressions.

> Could be - I just used the ones (after adapting for 32 vs. 64 bit
> issues) provided by upstream.

I still see the same failure after installing python-debuginfo:

==00:00:00:55.235 18250== Invalid read of size 1
==00:00:00:55.235 18250==    at 0x4A07F92: strlen (mc_replace_strmem.c:403)
==00:00:00:55.235 18250==    by 0x826860: MemoryContextStrdup (mcxt.c:1158)
==00:00:00:55.235 18250==    by 0x800441: set_errdata_field (elog.c:1230)
==00:00:00:55.235 18250==    by 0x803EE8: err_generic_string (elog.c:1210)
==00:00:00:55.235 18250==    by 0xDFEF2DD: PLy_elog (plpy_elog.c:117)
==00:00:00:55.235 18250==    by 0xDFF018D: PLy_procedure_call (plpy_exec.c:1084)
==00:00:00:55.235 18250==    by 0xDFF14C6: PLy_exec_function (plpy_exec.c:105)
==00:00:00:55.235 18250==    by 0xDFF2103: plpython_inline_handler (plpy_main.c:345)
==00:00:00:55.235 18250==    by 0x809737: OidFunctionCall1Coll (fmgr.c:1596)
==00:00:00:55.235 18250==    by 0x70E97F: standard_ProcessUtility (utility.c:515)
==00:00:00:55.235 18250==    by 0x70A856: PortalRunUtility (pquery.c:1175)
==00:00:00:55.235 18250==    by 0x70B8FF: PortalRunMulti (pquery.c:1306)
==00:00:00:55.235 18250==  Address 0xefe1954 is 36 bytes inside a block of size 49 free'd
==00:00:00:55.235 18250==    at 0x4A06430: free (vg_replace_malloc.c:446)
==00:00:00:55.235 18250==    by 0x398AE90D5C: tupledealloc (tupleobject.c:170)
==00:00:00:55.235 18250==    by 0x398AE79E3A: dict_dealloc (dictobject.c:911)
==00:00:00:55.235 18250==    by 0x398AE5C586: BaseException_clear (exceptions.c:77)
==00:00:00:55.235 18250==    by 0x398AE5C5BF: BaseException_dealloc (exceptions.c:87)
==00:00:00:55.235 18250==    by 0x398AE9A704: subtype_dealloc (typeobject.c:1019)
==00:00:00:55.236 18250==    by 0xDFEEDBC: PLy_traceback (plpy_elog.c:358)
==00:00:00:55.236 18250==    by 0xDFEF154: PLy_elog (plpy_elog.c:83)
==00:00:00:55.236 18250==    by 0xDFF018D: PLy_procedure_call (plpy_exec.c:1084)
==00:00:00:55.236 18250==    by 0xDFF14C6: PLy_exec_function (plpy_exec.c:105)
==00:00:00:55.236 18250==    by 0xDFF2103: plpython_inline_handler (plpy_main.c:345)
==00:00:00:55.236 18250==    by 0x809737: OidFunctionCall1Coll (fmgr.c:1596)


With the patch I'm working on, no error, not even with the python-specific
suppressions all removed from valgrind.supp.  Not sure what to make of
that.  RHEL6's version of python is 2.6.6, which is not real new, but
the documentation that comes with it indicates that the false-valgrind-
warnings problem exists.
        regards, tom lane

Re: Weird irreproducible behaviors in plpython tests

From

Andres Freund

Date:

11 April 2016, 02:18:31

On 2016-04-10 22:03:53 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2016-04-10 17:55:25 -0400, Tom Lane wrote:
> >> Hmm.  It's true that I don't have the python debuginfo RPM installed.
> >> But this is a live bug, so I suspect you were too generous about
> >> those suppressions.
>
> > Could be - I just used the ones (after adapting for 32 vs. 64 bit
> > issues) provided by upstream.
>
> I still see the same failure after installing python-debuginfo:
>
> ==00:00:00:55.235 18250== Invalid read of size 1
> ==00:00:00:55.235 18250==    at 0x4A07F92: strlen (mc_replace_strmem.c:403)
> ==00:00:00:55.235 18250==    by 0x826860: MemoryContextStrdup (mcxt.c:1158)
> ==00:00:00:55.235 18250==    by 0x800441: set_errdata_field (elog.c:1230)
> ==00:00:00:55.235 18250==    by 0x803EE8: err_generic_string (elog.c:1210)
> ==00:00:00:55.235 18250==    by 0xDFEF2DD: PLy_elog (plpy_elog.c:117)
> ==00:00:00:55.235 18250==    by 0xDFF018D: PLy_procedure_call (plpy_exec.c:1084)
> ==00:00:00:55.235 18250==    by 0xDFF14C6: PLy_exec_function (plpy_exec.c:105)
> ==00:00:00:55.235 18250==    by 0xDFF2103: plpython_inline_handler (plpy_main.c:345)
> ==00:00:00:55.235 18250==    by 0x809737: OidFunctionCall1Coll (fmgr.c:1596)
> ==00:00:00:55.235 18250==    by 0x70E97F: standard_ProcessUtility (utility.c:515)
> ==00:00:00:55.235 18250==    by 0x70A856: PortalRunUtility (pquery.c:1175)
> ==00:00:00:55.235 18250==    by 0x70B8FF: PortalRunMulti (pquery.c:1306)
> ==00:00:00:55.235 18250==  Address 0xefe1954 is 36 bytes inside a block of size 49 free'd
> ==00:00:00:55.235 18250==    at 0x4A06430: free (vg_replace_malloc.c:446)
> ==00:00:00:55.235 18250==    by 0x398AE90D5C: tupledealloc (tupleobject.c:170)
> ==00:00:00:55.235 18250==    by 0x398AE79E3A: dict_dealloc (dictobject.c:911)
> ==00:00:00:55.235 18250==    by 0x398AE5C586: BaseException_clear (exceptions.c:77)
> ==00:00:00:55.235 18250==    by 0x398AE5C5BF: BaseException_dealloc (exceptions.c:87)
> ==00:00:00:55.235 18250==    by 0x398AE9A704: subtype_dealloc (typeobject.c:1019)
> ==00:00:00:55.236 18250==    by 0xDFEEDBC: PLy_traceback (plpy_elog.c:358)
> ==00:00:00:55.236 18250==    by 0xDFEF154: PLy_elog (plpy_elog.c:83)
> ==00:00:00:55.236 18250==    by 0xDFF018D: PLy_procedure_call (plpy_exec.c:1084)
> ==00:00:00:55.236 18250==    by 0xDFF14C6: PLy_exec_function (plpy_exec.c:105)
> ==00:00:00:55.236 18250==    by 0xDFF2103: plpython_inline_handler (plpy_main.c:345)
> ==00:00:00:55.236 18250==    by 0x809737: OidFunctionCall1Coll (fmgr.c:1596)
>
>
> With the patch I'm working on, no error, not even with the python-specific
> suppressions all removed from valgrind.supp.  Not sure what to make of
> that.  RHEL6's version of python is 2.6.6, which is not real new, but
> the documentation that comes with it indicates that the false-valgrind-
> warnings problem exists.

I downloaded the RHEL6 srpm, and it appears to be patched to
automatically disable pymalloc when running under valgrind (via
disable-pymalloc-on-valgrind-py26.patch).  That explains why you're
seing the problem, but skink isn't (it's running debian).

Greetings,

Andres Freund

Attachment

disable-pymalloc-on-valgrind-py26.patch

Re: Weird irreproducible behaviors in plpython tests

From

Tom Lane

Date:

11 April 2016, 02:47:36

Andres Freund <andres@anarazel.de> writes:
> I downloaded the RHEL6 srpm, and it appears to be patched to
> automatically disable pymalloc when running under valgrind (via
> disable-pymalloc-on-valgrind-py26.patch).  That explains why you're
> seing the problem, but skink isn't (it's running debian).

Hah.  I suspected Red Hat had done something to hide the issue, but
didn't get around to investigating yet.  Anyway, it's clearly a real
bug, even if skink is failing to find it.
        regards, tom lane