Thread: Weird irreproducible behaviors in plpython tests
Buildfarm member tick failed today with what appears to be a bug introduced by (or at least exposed by) the recent plpython ereport patch. Presumably the fact that it's failing and no other critters are has to do with its use of -DCLOBBER_CACHE_ALWAYS and/or -DRANDOMIZE_ALLOCATED_MEMORY. However, I cannot reproduce the failure by using those switches, even though tick's CentOS platform should be pretty much identical to my RHEL6 machine. Can anyone else reproduce it? After failing at that, it occurred to me to try it under valgrind, and kaboom! I found a *different* bug, which has apparently been there a long time. (I say different because I don't see how this one could produce tick's symptoms; it's a reference to already-freed strings, but not an attempt to pfree one.) I'll be fixing this one shortly, but now we have another puzzle: why isn't buildfarm member skink seeing the same failure? It is running the plpython tests. Can anyone else reproduce a failure by valgrind'ing the plpython tests? It looks here like ==00:00:00:29.670 24996== Invalid read of size 1 ==00:00:00:29.670 24996== at 0x4A07F92: strlen (mc_replace_strmem.c:403) ==00:00:00:29.670 24996== by 0x826860: MemoryContextStrdup (mcxt.c:1158) ==00:00:00:29.670 24996== by 0x800441: set_errdata_field (elog.c:1230) ==00:00:00:29.670 24996== by 0x803EE8: err_generic_string (elog.c:1210) ==00:00:00:29.670 24996== by 0xDFEF382: PLy_elog (plpy_elog.c:117) ==00:00:00:29.670 24996== by 0xDFF049D: PLy_procedure_call (plpy_exec.c:1084) ==00:00:00:29.670 24996== by 0xDFF1BBF: PLy_exec_function (plpy_exec.c:105) ==00:00:00:29.670 24996== by 0xDFF25C6: plpython_inline_handler (plpy_main.c:345) ==00:00:00:29.670 24996== by 0x809737: OidFunctionCall1Coll (fmgr.c:1596) ==00:00:00:29.670 24996== by 0x70E97F: standard_ProcessUtility (utility.c:515) ==00:00:00:29.670 24996== by 0x70A856: PortalRunUtility (pquery.c:1175) ==00:00:00:29.670 24996== by 0x70B8FF: PortalRunMulti (pquery.c:1306) ==00:00:00:29.670 24996== Address 0xefe2894 is 36 bytes inside a block of size 49 free'd ==00:00:00:29.670 24996== at 0x4A06430: free (vg_replace_malloc.c:446) ==00:00:00:29.670 24996== by 0x398AE90D5C: ??? (in /usr/lib64/libpython2.6.so.1.0) ==00:00:00:29.670 24996== by 0x398AE79E3A: ??? (in /usr/lib64/libpython2.6.so.1.0) ==00:00:00:29.670 24996== by 0x398AE5C586: ??? (in /usr/lib64/libpython2.6.so.1.0) ==00:00:00:29.670 24996== by 0x398AE5C5BF: ??? (in /usr/lib64/libpython2.6.so.1.0) ==00:00:00:29.670 24996== by 0x398AE9A704: ??? (in /usr/lib64/libpython2.6.so.1.0) ==00:00:00:29.670 24996== by 0xDFEECFE: PLy_traceback (plpy_elog.c:358) ==00:00:00:29.670 24996== by 0xDFEF21E: PLy_elog (plpy_elog.c:83) ==00:00:00:29.670 24996== by 0xDFF049D: PLy_procedure_call (plpy_exec.c:1084) ==00:00:00:29.670 24996== by 0xDFF1BBF: PLy_exec_function (plpy_exec.c:105) ==00:00:00:29.670 24996== by 0xDFF25C6: plpython_inline_handler (plpy_main.c:345) ==00:00:00:29.670 24996== by 0x809737: OidFunctionCall1Coll (fmgr.c:1596) the core of the problem being that PLy_get_spi_error_data returns a pointer to a string that is pointing into the "val" (a/k/a "v") Python object, which PLy_traceback has carefully released the last refcount on. So this coding should have failed immediately under any valgrind testing. The ereport patch perhaps gave us more scope for the error, but for me, valgrind'ing the plpython tests fails similarly in 9.5. So we should have noticed this long since. I dislike bugs that are platform-dependent with no obvious reason for it :-( regards, tom lane
On 2016-04-10 17:10:57 -0400, Tom Lane wrote: > After failing at that, it occurred to me to try it under valgrind, > and kaboom! I found a *different* bug, which has apparently been > there a long time. (I say different because I don't see how this > one could produce tick's symptoms; it's a reference to already-freed > strings, but not an attempt to pfree one.) I'll be fixing this one > shortly, but now we have another puzzle: why isn't buildfarm member > skink seeing the same failure? It is running the plpython tests. > Can anyone else reproduce a failure by valgrind'ing the plpython > tests? It looks here like I possible that it's hidden by the broad python error suppressions I added to valgrind.supp, and the reason they're showing for you is that you don't appear to have python debugging symbols. But I don't exactly see the error matching up... - Andres
Andres Freund <andres@anarazel.de> writes: > On 2016-04-10 17:10:57 -0400, Tom Lane wrote: >> ... I'll be fixing this one >> shortly, but now we have another puzzle: why isn't buildfarm member >> skink seeing the same failure? It is running the plpython tests. > I possible that it's hidden by the broad python error suppressions I > added to valgrind.supp, and the reason they're showing for you is that > you don't appear to have python debugging symbols. But I don't exactly > see the error matching up... Hmm. It's true that I don't have the python debuginfo RPM installed. But this is a live bug, so I suspect you were too generous about those suppressions. FWIW, HEAD passes cleanly under valgrind for me after fixing this one problem. I have to leave shortly but will work on the back branches later. regards, tom lane
On 2016-04-10 17:55:25 -0400, Tom Lane wrote: > Hmm. It's true that I don't have the python debuginfo RPM installed. > But this is a live bug, so I suspect you were too generous about > those suppressions. Could be - I just used the ones (after adapting for 32 vs. 64 bit issues) provided by upstream. > FWIW, HEAD passes cleanly under valgrind for me after fixing this one > problem. I have to leave shortly but will work on the back branches > later. Looking through them again: # Python's allocator does some low-level tricks for efficiency. Those # can be disabled for better instrumentation; but few people testing # postgres will have such a build of python. So add broad # suppressions of the resulting errors. # See also https://svn.python.org/projects/python/trunk/Misc/README.valgrind { python_clever_allocator Memcheck:Addr4 fun:PyObject_Free } { python_clever_allocator Memcheck:Addr8 fun:PyObject_Free } { python_clever_allocator Memcheck:Value4 fun:PyObject_Free } { python_clever_allocator Memcheck:Value8 fun:PyObject_Free } { python_clever_allocator Memcheck:Cond fun:PyObject_Free } { python_clever_allocator Memcheck:Addr4 fun:PyObject_Realloc } { python_clever_allocator Memcheck:Addr8 fun:PyObject_Realloc } { python_clever_allocator Memcheck:Value4 fun:PyObject_Realloc } { python_clever_allocator Memcheck:Value8 fun:PyObject_Realloc } { python_clever_allocator Memcheck:Cond fun:PyObject_Realloc } I can't actually see any triggering invalidly with the backtrace you provided :( Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2016-04-10 17:55:25 -0400, Tom Lane wrote: >> Hmm. It's true that I don't have the python debuginfo RPM installed. >> But this is a live bug, so I suspect you were too generous about >> those suppressions. > Could be - I just used the ones (after adapting for 32 vs. 64 bit > issues) provided by upstream. I still see the same failure after installing python-debuginfo: ==00:00:00:55.235 18250== Invalid read of size 1 ==00:00:00:55.235 18250== at 0x4A07F92: strlen (mc_replace_strmem.c:403) ==00:00:00:55.235 18250== by 0x826860: MemoryContextStrdup (mcxt.c:1158) ==00:00:00:55.235 18250== by 0x800441: set_errdata_field (elog.c:1230) ==00:00:00:55.235 18250== by 0x803EE8: err_generic_string (elog.c:1210) ==00:00:00:55.235 18250== by 0xDFEF2DD: PLy_elog (plpy_elog.c:117) ==00:00:00:55.235 18250== by 0xDFF018D: PLy_procedure_call (plpy_exec.c:1084) ==00:00:00:55.235 18250== by 0xDFF14C6: PLy_exec_function (plpy_exec.c:105) ==00:00:00:55.235 18250== by 0xDFF2103: plpython_inline_handler (plpy_main.c:345) ==00:00:00:55.235 18250== by 0x809737: OidFunctionCall1Coll (fmgr.c:1596) ==00:00:00:55.235 18250== by 0x70E97F: standard_ProcessUtility (utility.c:515) ==00:00:00:55.235 18250== by 0x70A856: PortalRunUtility (pquery.c:1175) ==00:00:00:55.235 18250== by 0x70B8FF: PortalRunMulti (pquery.c:1306) ==00:00:00:55.235 18250== Address 0xefe1954 is 36 bytes inside a block of size 49 free'd ==00:00:00:55.235 18250== at 0x4A06430: free (vg_replace_malloc.c:446) ==00:00:00:55.235 18250== by 0x398AE90D5C: tupledealloc (tupleobject.c:170) ==00:00:00:55.235 18250== by 0x398AE79E3A: dict_dealloc (dictobject.c:911) ==00:00:00:55.235 18250== by 0x398AE5C586: BaseException_clear (exceptions.c:77) ==00:00:00:55.235 18250== by 0x398AE5C5BF: BaseException_dealloc (exceptions.c:87) ==00:00:00:55.235 18250== by 0x398AE9A704: subtype_dealloc (typeobject.c:1019) ==00:00:00:55.236 18250== by 0xDFEEDBC: PLy_traceback (plpy_elog.c:358) ==00:00:00:55.236 18250== by 0xDFEF154: PLy_elog (plpy_elog.c:83) ==00:00:00:55.236 18250== by 0xDFF018D: PLy_procedure_call (plpy_exec.c:1084) ==00:00:00:55.236 18250== by 0xDFF14C6: PLy_exec_function (plpy_exec.c:105) ==00:00:00:55.236 18250== by 0xDFF2103: plpython_inline_handler (plpy_main.c:345) ==00:00:00:55.236 18250== by 0x809737: OidFunctionCall1Coll (fmgr.c:1596) With the patch I'm working on, no error, not even with the python-specific suppressions all removed from valgrind.supp. Not sure what to make of that. RHEL6's version of python is 2.6.6, which is not real new, but the documentation that comes with it indicates that the false-valgrind- warnings problem exists. regards, tom lane
On 2016-04-10 22:03:53 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2016-04-10 17:55:25 -0400, Tom Lane wrote: > >> Hmm. It's true that I don't have the python debuginfo RPM installed. > >> But this is a live bug, so I suspect you were too generous about > >> those suppressions. > > > Could be - I just used the ones (after adapting for 32 vs. 64 bit > > issues) provided by upstream. > > I still see the same failure after installing python-debuginfo: > > ==00:00:00:55.235 18250== Invalid read of size 1 > ==00:00:00:55.235 18250== at 0x4A07F92: strlen (mc_replace_strmem.c:403) > ==00:00:00:55.235 18250== by 0x826860: MemoryContextStrdup (mcxt.c:1158) > ==00:00:00:55.235 18250== by 0x800441: set_errdata_field (elog.c:1230) > ==00:00:00:55.235 18250== by 0x803EE8: err_generic_string (elog.c:1210) > ==00:00:00:55.235 18250== by 0xDFEF2DD: PLy_elog (plpy_elog.c:117) > ==00:00:00:55.235 18250== by 0xDFF018D: PLy_procedure_call (plpy_exec.c:1084) > ==00:00:00:55.235 18250== by 0xDFF14C6: PLy_exec_function (plpy_exec.c:105) > ==00:00:00:55.235 18250== by 0xDFF2103: plpython_inline_handler (plpy_main.c:345) > ==00:00:00:55.235 18250== by 0x809737: OidFunctionCall1Coll (fmgr.c:1596) > ==00:00:00:55.235 18250== by 0x70E97F: standard_ProcessUtility (utility.c:515) > ==00:00:00:55.235 18250== by 0x70A856: PortalRunUtility (pquery.c:1175) > ==00:00:00:55.235 18250== by 0x70B8FF: PortalRunMulti (pquery.c:1306) > ==00:00:00:55.235 18250== Address 0xefe1954 is 36 bytes inside a block of size 49 free'd > ==00:00:00:55.235 18250== at 0x4A06430: free (vg_replace_malloc.c:446) > ==00:00:00:55.235 18250== by 0x398AE90D5C: tupledealloc (tupleobject.c:170) > ==00:00:00:55.235 18250== by 0x398AE79E3A: dict_dealloc (dictobject.c:911) > ==00:00:00:55.235 18250== by 0x398AE5C586: BaseException_clear (exceptions.c:77) > ==00:00:00:55.235 18250== by 0x398AE5C5BF: BaseException_dealloc (exceptions.c:87) > ==00:00:00:55.235 18250== by 0x398AE9A704: subtype_dealloc (typeobject.c:1019) > ==00:00:00:55.236 18250== by 0xDFEEDBC: PLy_traceback (plpy_elog.c:358) > ==00:00:00:55.236 18250== by 0xDFEF154: PLy_elog (plpy_elog.c:83) > ==00:00:00:55.236 18250== by 0xDFF018D: PLy_procedure_call (plpy_exec.c:1084) > ==00:00:00:55.236 18250== by 0xDFF14C6: PLy_exec_function (plpy_exec.c:105) > ==00:00:00:55.236 18250== by 0xDFF2103: plpython_inline_handler (plpy_main.c:345) > ==00:00:00:55.236 18250== by 0x809737: OidFunctionCall1Coll (fmgr.c:1596) > > > With the patch I'm working on, no error, not even with the python-specific > suppressions all removed from valgrind.supp. Not sure what to make of > that. RHEL6's version of python is 2.6.6, which is not real new, but > the documentation that comes with it indicates that the false-valgrind- > warnings problem exists. I downloaded the RHEL6 srpm, and it appears to be patched to automatically disable pymalloc when running under valgrind (via disable-pymalloc-on-valgrind-py26.patch). That explains why you're seing the problem, but skink isn't (it's running debian). Greetings, Andres Freund
Attachment
Andres Freund <andres@anarazel.de> writes: > I downloaded the RHEL6 srpm, and it appears to be patched to > automatically disable pymalloc when running under valgrind (via > disable-pymalloc-on-valgrind-py26.patch). That explains why you're > seing the problem, but skink isn't (it's running debian). Hah. I suspected Red Hat had done something to hide the issue, but didn't get around to investigating yet. Anyway, it's clearly a real bug, even if skink is failing to find it. regards, tom lane