Thread: Re: Why our Valgrind reports suck

Re: Why our Valgrind reports suck

From
Tom Lane
Date:
Here's a v3 patchset that's rebased up to HEAD (on top of the
extracted fixes I already pushed) and responds to your review
comments.  I adopted your suggestions except for a couple:

Andres Freund <andres@anarazel.de> writes:
> 0007:
> +    /* Run the rest in xact context to avoid Valgrind leak complaints */
> +    MemoryContextSwitchTo(TopTransactionContext);
> It seems like it also protects at least somewhat against actual leaks?

Well, the worker is going to exit immediately after it completes
this transaction, so I'm not seeing that there's much useful
effect there.  I tweaked the comment a bit, but I don't think it
should oversell the value.

> 0016:
> Agreed with the concern stated in the commit message, but addressing that
> would obviously be a bigger project.

I remembered that I had a patch (written during that old thread) that
moved BuildEventTriggerCache's MemoryContextSwitchTo calls to reduce
the amount of stuff that happens in EventTriggerCacheContext, so I
incorporated that rather than just whining.

> 0017:
> A tad weird to leave the comments above the removed = NILs in place, even
> though it's obviously still correct.

I left this alone; I think it's good to point out that we're not
bothering to free those lists.


Some other notes:

To keep the patch numbering the same, I replaced 0013 (no longer
needed/relevant after 1722d5eb0) with a new patch that hides a
plancache leak that is reported in src/pl/plpgsql's tests.
I'm not terribly thrilled with that patch because it only fixes
the reported case and not adjacent ones, as mentioned in the commit
message.  But I don't think we have copying infrastructure that would
allow fixing the others.

I spent some time trying to construct suppressions that would block
Valgrind's complaints about perl/python/tcl, and eventually gave up.
It seemed too messy, and likely dependent on the platform as well as
the particular versions of those libraries.  The biggest problem is
that in many cases, Valgrind's stack trace doesn't go down far enough
to reach any code of ours, so that the suppression patterns would have
to depend on internals of the language implementation, which seems
fragile.  A possible answer is to require use of a larger
--num-callers setting, but I'm afraid that that'd make Valgrind even
slower.  For the moment I have no suggestion other than to not specify
--with-perl etc. while building a valgrind'able installation.

I've now run contrib (at least the installcheck tests) with this
patchset, and we seem to be in mostly good shape there.  postgres_fdw
and dblink each have some issues, but I didn't see others.

            regards, tom lane


Attachment

Re: Why our Valgrind reports suck

From
Tom Lane
Date:
I wrote:
> Here's a v3 patchset that's rebased up to HEAD (on top of the
> extracted fixes I already pushed) and responds to your review
> comments.

Now that v19 development is open, I'd like to get this work
pushed sooner rather than later, before the patches bit-rot
too much, and so that we can get started on follow-on work
to remove remaining leaks.  Does anyone want to review it
further?

            regards, tom lane



Re: Why our Valgrind reports suck

From
Richard Guo
Date:
On Thu, Jul 10, 2025 at 3:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Now that v19 development is open, I'd like to get this work
> pushed sooner rather than later, before the patches bit-rot
> too much, and so that we can get started on follow-on work
> to remove remaining leaks.  Does anyone want to review it
> further?

I'm just skimming through the changes and happened to spot two minor
things.

In 0008:

        if (pq_mq_handle != NULL)
+       {
            shm_mq_detach(pq_mq_handle);
+           pfree(pq_mq_handle);
+       }
        pq_mq_handle = NULL;

Maybe we could move "pq_mq_handle = NULL;" inside the if branch?
Though I see we're doing it in your way on master.

In 0015:

I noticed that we're freeing the list returned from
logicalrep_workers_find().  Should we do the same for the "workers"
list in AtEOXact_LogicalRepWorkers()?


This is very useful work; I hope we can get it in soon.

Thanks
Richard



Re: Why our Valgrind reports suck

From
Tom Lane
Date:
Richard Guo <guofenglinux@gmail.com> writes:
> I'm just skimming through the changes and happened to spot two minor
> things.

> In 0008:
>         if (pq_mq_handle != NULL)
> +       {
>             shm_mq_detach(pq_mq_handle);
> +           pfree(pq_mq_handle);
> +       }
>         pq_mq_handle = NULL;
> Maybe we could move "pq_mq_handle = NULL;" inside the if branch?
> Though I see we're doing it in your way on master.

Yeah, we could make it be like that.  I was just trying to do the
minimal change from master.

> In 0015:
> I noticed that we're freeing the list returned from
> logicalrep_workers_find().  Should we do the same for the "workers"
> list in AtEOXact_LogicalRepWorkers()?

Seems probably unnecessary.  AtEOXact functions should run in the
transaction's CurTransactionContext, which will be reset or deleted
once the transaction is done.  If we were talking about a large amount
of memory it might be worth reclaiming sooner, but surely we are not.

> This is very useful work; I hope we can get it in soon.

Thanks for looking at it!

            regards, tom lane