Home > mailing lists
Re: Stack-based tracking of per-node WAL/buffer usage - Mailing list pgsql-hackers

From	Lukas Fittl
Subject	Re: Stack-based tracking of per-node WAL/buffer usage
Date	January 13 11:01:53
Msg-id	CAP53Pkw85U-aMRzkZ+kRKfCh0pA5vyo=_W16gPK4sirZxiJy=A@mail.gmail.com Whole thread Raw
In response to	Re: Stack-based tracking of per-node WAL/buffer usage (Andres Freund <andres@anarazel.de>)
List	pgsql-hackers
Tree view
On Fri, Jan 9, 2026 at 11:38 AM Andres Freund <andres@anarazel.de> wrote:
> I pushed 0001. The only changes I made were to break a few long lines.

Thanks!

Attached v5, feedback addressed unless otherwise noted, with the patch
adding macros to access pgBufferUsage/pgWalUsage (was v4/0003) to the
front per your request.

Its also worth noting that 2defd0006255 just added a new
"instrument_node.h". We could consider moving the per-node
instrumentation structs and associated functions there (but probably
keep a single "instrument.c" unit file?). Thoughts?

> Hm. It looks to me that after moving trigger instrumentation to its own thing,
> there are no users of InstrAlloc() with n > 1. I think it may make sense to
> drop that?

Yep, makes sense, done like that. I also did a code search for any
extension use cases and couldn't find anything that'd need multiples
of the Instrumentation struct (either the new general purpose one, or
the per-node one).

> > +++ b/src/backend/executor/execParallel.c
> > ...
> > -     /* array of num_plan_nodes * num_workers Instrumentation objects follows */
> > +
> > +     /*
> > +      * array of num_plan_nodes * num_workers NodeInstrumentation objects
> > +      * follows
> > +      */
> >  };
>
> Spurious change, I think?

pgindent forces the line break here, because of the Instrumentation =>
NodeInstrumentation change.

Are you saying we shouldn't make that wording change? then I'd suggest
we lower case the "i" of "instrumentation", so it doesn't get confused
with the struct of that name.

> >  typedef struct WorkerInstrumentation
> >  {
> >       int                     num_workers;    /* # of structures that follow */
> > -     Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
> > +     NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
> >  } WorkerInstrumentation;
>
> Hm. Shouldn't this be WorkerNodeInstrumentation now?

Yeah, the name feels a bit lengthy, but given that I found some of the
parallel query instrumentation logic confusing, erring on the side of
being more explicit is probably good here. Done that way.

> > Subject: [PATCH v4 4/7] Optimize measuring WAL/buffer usage through
> >  stack-based instrumentation
>
> I think it may be good to have some tests for edge cases. E.g. testing that a
> query that an explain analyze of a query that executes a plpgsql function
> which in turn does another explain analyze or just another query without
> explain analyze, does something reasonable.
>

Yeah, that makes sense - I ran out of time to add this now (and I feel
there are other things worth discussing in the meantime), but will
work on this.

> > @@ -1266,7 +1280,15 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
> >               resultRelInfo->ri_TrigWhenExprs = (ExprState **)
> >                       palloc0(n * sizeof(ExprState *));
> >               if (instrument_options)
> > -                     resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
> > +             {
> > +                     /*
> > +                      * Triggers do not individually track buffer/WAL usage, even if
> > +                      * otherwise tracked
> > +                      */
>
> Why is that?

I originally special cased triggers (i.e. didn't permit trigger
instrumentation to track WAL/buffer usage) because the current code
does not need it, and it simplified the logic, because we always need
to add child stack information to the parent so that totals are
correct, e.g. for pg_stat_statements.

But, I see your point that this isn't great, and we may want to emit
WAL/buffer usage for triggers in EXPLAIN in the future. To handle this
I adjusted this so we now also finalize the per-trigger
instrumentation in ExecutorFinish, like we do it for per-node
instrumentation.

>
>
> > @@ -59,15 +125,31 @@ InstrStart(Instrumentation *instr)
> >               !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
> >               elog(ERROR, "InstrStart called twice in a row");
> >
> > -     if (instr->need_bufusage)
> > -             instr->bufusage_start = pgBufferUsage;
> > +     if (instr->need_bufusage || instr->need_walusage)
> > +     {
> > +             Assert(CurrentResourceOwner != NULL);
> > +             instr->owner = CurrentResourceOwner;
> > +
> > +             /*
> > +              * Allocate the stack resource in a memory context that survives
> > +              * during an abort. This will be freed by InstrStop (regular
> > +              * execution) or ResOwnerReleaseInstrumentation (abort).
> > +              *
> > +              * We don't do this in InstrAlloc to avoid leaking when InstrStart +
> > +              * InstrStop isn't called.
> > +              */
> > +             if (instr->stack == NULL)
> > +                     instr->stack = MemoryContextAllocZero(CurTransactionContext, sizeof(InstrStack));
> >
> > -     if (instr->need_walusage)
> > -             instr->walusage_start = pgWalUsage;
> > +             ResourceOwnerEnlarge(instr->owner);
> > +             ResourceOwnerRememberInstrStack(instr->owner, instr->stack);
> > +
> > +             InstrPushStack(instr->stack);
> > +     }
> >  }
>
> I'm still worried that allocating something in CurTransactionContext will bite
> us eventually. It seems like it'd be pretty sane to want to have
> instrumentation around procedure execution, for example.

Hmm, I'll have to spend more time thinking through that future
procedure instrumentation case and how it interacts with resource
owners.

Just to state it, the main point of using CurTransactionContext
(instead of CurrentMemoryContext) here is that we handle an abort
situation that may propagate above the function that called
InstrStart. We need to make sure memory survives until the resource
owner gets cleaned up (either by aborting, or by InstrStop being
called with finalize = true), so we can add the stack to its parent.

I'm happy to adjust this to TopTransactionContext if that's what you
had in mind?

Or maybe I'm missing some other way to get a memory context that
relates to the current resource owner? (i.e. survives at least until
the resource owner gets released)

> >  void
> > -InstrStop(Instrumentation *instr)
> > +InstrStop(Instrumentation *instr, bool finalize)
> >  {
> >       instr_time      endtime;
> >
> > @@ -83,14 +165,28 @@ InstrStop(Instrumentation *instr)
> >               INSTR_TIME_SET_ZERO(instr->starttime);
> >       }
> >
> > -     /* Add delta of buffer usage since entry to node's totals */
> > -     if (instr->need_bufusage)
> > -             BufferUsageAccumDiff(&instr->bufusage,
> > -                                                      &pgBufferUsage, &instr->bufusage_start);
> > +     if (instr->need_bufusage || instr->need_walusage)
> > +     {
> > +             InstrPopStack(instr->stack, finalize);
> >
> > -     if (instr->need_walusage)
> > -             WalUsageAccumDiff(&instr->walusage,
> > -                                               &pgWalUsage, &instr->walusage_start);
> > +             Assert(instr->owner != NULL);
> > +             ResourceOwnerForgetInstrStack(instr->owner, instr->stack);
> > +             instr->owner = NULL;
> > +
> > +             if (finalize)
> > +             {
> > +                     /*
> > +                      * To avoid keeping memory allocated beyond when its needed, copy
> > +                      * the result to the current memory context, and free it in the
> > +                      * transaction context.
> > +                      */
> > +                     InstrStack *stack = palloc(sizeof(InstrStack));
> > +
> > +                     memcpy(stack, instr->stack, sizeof(InstrStack));
> > +                     pfree(instr->stack);
> > +                     instr->stack = stack;
> > +             }
> > +     }
> >  }
>
> Hm. I'm not entirely sure I understand the gain due to the palloc & copy &
> pfree. Won't that typically increase memory usage due to temporarily having
> two versions of the InstrStack?

This is mainly about making the caller to InstrStop take
responsibility for freeing the memory allocation when its current
context gets freed, which will potentially be before
CurTransactionContext (or whichever context we chose) gets freed. If
we kept it around and e.g. this was a long transaction with many
queries executed, we'd keep consuming more and more memory otherwise.

Or looking at it differently, the principle here is, if you've reached
InstrStop (with finalize=true), no aborts involving this stack can
occur anymore, and so all that's left is for the caller to utilize the
instrumentation data as it sees fit, e.g. emitting it to the logs,
etc.

> >  /* report usage after parallel executor shutdown */
> >  void
> > -InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
> > +InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage)
>
> Wonder if we should slap a pg_nodiscard on InstrStartParallelQuery() to make
> it easier for extensions to notice this? OTOH, InstrEndParallelQuery() will
> make that pretty clear...

Good point, why not, added pg_nodiscard - seems like cheap insurance
to help notice someone they're calling it incorrectly.

> > @@ -975,19 +976,17 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
> >                       int64           total_blks_hit;
> >                       int64           total_blks_read;
> >                       int64           total_blks_dirtied;
> > +                     BufferUsage bufusage = instr->stack->bufusage;
> > +                     WalUsage        walusage = instr->stack->walusage;
> > ...
> > -                     total_blks_dirtied = bufferusage.shared_blks_dirtied +
> > -                             bufferusage.local_blks_dirtied;
> > +
> > +                     total_blks_hit = bufusage.shared_blks_hit +
> ...
>
> Why rename and move the declaration of bufferusage here? That seems to make
> the diff unnecessarily noisy?
>
> Kinda going the other way: Why copy this on the stack, rather than use a
> pointer?

For now to keep the diff smaller, I went the way of not renaming the variable.

I don't think efficiency matters much in that code path (i.e. the copy
of ~150 bytes seems fine to me), but happy to adjust if you think a
pointer is better.

Thanks,
Lukas

--
Lukas Fittl
Attachment

pgsql-hackers by date:
From: Bertrand Drouvot
Date: 13 January, 10:27:11
Subject: Re: Safer hash table initialization macro
From: "zengman"
Date: 13 January, 11:05:57
Subject: Re: [[BUG] pg_stat_statements crashes with var and non-var expressions in IN clause
Re: Stack-based tracking of per-node WAL/buffer usage - Mailing list pgsql-hackers

Attachment

Previous

Next