Thread: Re: Why our Valgrind reports suck

Re: Why our Valgrind reports suck

From
Tom Lane
Date:
Here's a v3 patchset that's rebased up to HEAD (on top of the
extracted fixes I already pushed) and responds to your review
comments.  I adopted your suggestions except for a couple:

Andres Freund <andres@anarazel.de> writes:
> 0007:
> +    /* Run the rest in xact context to avoid Valgrind leak complaints */
> +    MemoryContextSwitchTo(TopTransactionContext);
> It seems like it also protects at least somewhat against actual leaks?

Well, the worker is going to exit immediately after it completes
this transaction, so I'm not seeing that there's much useful
effect there.  I tweaked the comment a bit, but I don't think it
should oversell the value.

> 0016:
> Agreed with the concern stated in the commit message, but addressing that
> would obviously be a bigger project.

I remembered that I had a patch (written during that old thread) that
moved BuildEventTriggerCache's MemoryContextSwitchTo calls to reduce
the amount of stuff that happens in EventTriggerCacheContext, so I
incorporated that rather than just whining.

> 0017:
> A tad weird to leave the comments above the removed = NILs in place, even
> though it's obviously still correct.

I left this alone; I think it's good to point out that we're not
bothering to free those lists.


Some other notes:

To keep the patch numbering the same, I replaced 0013 (no longer
needed/relevant after 1722d5eb0) with a new patch that hides a
plancache leak that is reported in src/pl/plpgsql's tests.
I'm not terribly thrilled with that patch because it only fixes
the reported case and not adjacent ones, as mentioned in the commit
message.  But I don't think we have copying infrastructure that would
allow fixing the others.

I spent some time trying to construct suppressions that would block
Valgrind's complaints about perl/python/tcl, and eventually gave up.
It seemed too messy, and likely dependent on the platform as well as
the particular versions of those libraries.  The biggest problem is
that in many cases, Valgrind's stack trace doesn't go down far enough
to reach any code of ours, so that the suppression patterns would have
to depend on internals of the language implementation, which seems
fragile.  A possible answer is to require use of a larger
--num-callers setting, but I'm afraid that that'd make Valgrind even
slower.  For the moment I have no suggestion other than to not specify
--with-perl etc. while building a valgrind'able installation.

I've now run contrib (at least the installcheck tests) with this
patchset, and we seem to be in mostly good shape there.  postgres_fdw
and dblink each have some issues, but I didn't see others.

            regards, tom lane


Attachment