Re: Stack-based tracking of per-node WAL/buffer usage - Mailing list pgsql-hackers

From Lukas Fittl
Subject Re: Stack-based tracking of per-node WAL/buffer usage
Date
Msg-id CAP53Pkw1RpyAVsvJwUOMgGnVoQUibMk+EB+w-8WqY8kzsC4WjA@mail.gmail.com
Whole thread Raw
In response to Re: Stack-based tracking of per-node WAL/buffer usage  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi Andres,

Thanks for the review!

Attached an updated patch set that addresses the feedback, and also adds the complete removal of the global pgBufferUsage variable in later patches (0005-0007), to avoid counting both the stack and the variable.

FWIW, pgWalUsage would also be nice to remove, but it has some interesting interactions with the cumulative statistics system and heap_page_prune_and_freeze, and seems less performance critical.

On Thu, Sep 4, 2025 at 1:23 PM Andres Freund <andres@anarazel.de> wrote:
> 0001: Separate node instrumentation from other use of Instrumentation struct
...
It's mildly odd that the two types of instrumentation have a different
definition of 'total' (one double, one instr_time).

Yeah, agreed. I added in a new 0001 patch that changes this to instr_time consistently. I don't see a good reason to keep the transformation to seconds in the Instrumentation logic, since all in-tree callers convert it to milliseconds anyway.
 
> 0003: Introduce stack for tracking per-node WAL/buffer usage

Why are we doing Instr{Start,Stop}Query when not using instrumentation?
Resowner stuff ain't free, so I'd skip them when not necessary.

Makes sense, I've adjusted that to be conditional (in the now renamed 0004 patch).

In the updated patch I've also decided to piggyback on QueryDesc totaltime as the "owning" Instrumentation here for the query's lifetime. It seems simpler that way and avoids having special purpose methods. To go along with that I've changed the general purpose Instrumentation struct to use stack-based instrumentation at the same time.

> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
> index d286471254b..7436f307994 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -823,8 +823,17 @@ ExecShutdownNode_walker(PlanState *node, void *context)
...
> +             InstrNodeAddToCurrent(&node->instrument->stack);
...

Can we rely on this being reached? Note that ExecutePlan() calls
ExecShutdownNode() conditionally:

You are of course correct, I didn't consider cursors correctly here.

It seems there isn't an existing executor node walk that could be repurposed, so I added a new one in ExecutorFinish ("ExecAccumNodeInstrumentation"). From my read of the code there are no use cases where we need aggregated instrumentation data before ExecutorFinish.

> +static void
> +ResOwnerReleaseInstrStack(Datum res)
> +{
> +     /*
> +      * XXX: Registered resources are *not* called in reverse order, i.e. we'll
> +      * get what was first registered first at shutdown. To avoid handling
> +      * that, we are resetting the stack here on abort (instead of recovering
> +      * to previous).
> +      */
> +     pgInstrStack = NULL;
> +}

Hm, doesn't that mean we loose track of instrumentation if you e.g. do an
EXPLAIN ANALYZE of a query that executes a function, which internally triggers
an error and catches it?

I wonder if the solution could be to walk the stack and search for the
to-be-released element.

Yes, good point, I did not consider that case. I've addressed this by only updating the current stack if its not already a parent of the element being released. We are also always adding the element's statistics to the (updated) active stack element at abort.

Thanks,
Lukas

--
Lukas Fittl
Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)
Next
From: Nathan Bossart
Date:
Subject: Re: shmem_startup_hook called twice on Windows