Re: Enhancing Memory Context Statistics Reporting - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Enhancing Memory Context Statistics Reporting
Date
Msg-id CA+TgmoZ2QodyjBOrA9OHdqbH9c7na+YRAYU2sey0ofXFM2FBAw@mail.gmail.com
Whole thread Raw
In response to Re: Enhancing Memory Context Statistics Reporting  (Rahila Syed <rahilasyed90@gmail.com>)
Responses Re: Enhancing Memory Context Statistics Reporting
List pgsql-hackers
On Fri, Jan 9, 2026 at 6:22 AM Rahila Syed <rahilasyed90@gmail.com> wrote:
> After further discussion and reviewing Robert's email[1] on this topic, a safer solution
> is to avoid running ProcessGetMemoryContextInterrupt during an aborted transaction.
> This should help prevent additional errors when the transaction is already in error handling
> state.  Also, reporting memory context statistics from an aborting transaction won't
> be very useful as some of that memory usage won't be valid after abort completes.
> Attached is the updated patch that addresses this.

I think the question here is whether it's safe to interrupt the server
at an arbitrary CFI to run ProcessGetMemoryContextInterrupt. The good
news is that we now know that that function won't be executed (or at
least not in a substantive way) when we're an aborted transaction. But
we have no guarantee that we're in a transaction at all, which I think
is concerning, because it seems hard to write code that behaves
properly both inside of a transaction and outside of a transaction.
Some things I notice reading through:

+       LWLockAcquire(client_keys_lock, LW_SHARED);

LWLocks normally use NamesLikeThis not names_like_this and are
generally created by just listing them in lwlocklist.h.

Also, it's not really safe to acquire an LWLock if there's no
transaction active. If we error afterward, what will release the
LWLock?

+       else
+       {
+               clientProcNumber = client_keys[MyProcNumber];
+               client_keys[MyProcNumber] = -1;
+               LWLockRelease(client_keys_lock);
+       }

The "else" is not really necessary here because the "if" portion ends
with "return".

+       memstats_ctx = AllocSetContextCreate((MemoryContext) NULL,
+
          "publish_memory_context_statistics",
+
          ALLOCSET_SMALL_SIZES);

The comments do a good job justifying this, but as far as I know it
would be the only instance of this pattern in the entire source tree.
Are we really sure we want to deviate from the idea of having the
memory context tree be a tree? And is it really so bad if the memory
used to report memory contexts is included in the output?

+               MemoryStatsDsaArea =
GetNamedDSA("memory_context_statistics_dsa",
+
          &found);
...
+               MemoryStatsDsHash =
GetNamedDSHash("memory_context_statistics_dshash",
+
            &memctx_dsh_params, &found);
...
+       entry = dshash_find_or_insert(MemoryStatsDsHash, key, &found);

Like LWLockAcquire, all of this code supposes that there's a
transaction available to manage resource acquisition and release and
to clean up after errors. I doubt that any of this is safe without a
transaction.

I think my view overall here is: bailing out when the current
transaction is aborted is a good start in terms of making this safe,
but I don't think it's enough. If we get here when the backend is
idle, for example, which seems like it would be possible, then we'll
have no active transaction, and then I think that a lot of what's
being called here is not going to necessarily work as intended. There
are other places in the source code that do various push-ups to get
around this problem -- e.g. look at RemoveTempRelationsCallback, which
knows that it will be called from a transaction but not whether that
transaction will be aborted. I don't think that's the approach you
want here: restarting an aborted transaction in this context wouldn't
be smart. But spinning one up if there isn't one might be reasonable.
See ProcessCatchupInterrupt() for an example of existing code that
deals with this problem -- again that's probably not quite the same
situation, but it gives you an idea what other people have done
before.

One other point to note here is that if, by some chance, a failure
occurs as a result of receiving a memory-context interrupt, it's going
to fail the currently-running transaction. Unlike the problem
discussed in the previous paragraph, that's not a system-integrity
issue: from the point of view of PostgreSQL's transaction system,
having process #1 cause process #2's transaction to fail is 100% OK
and nothing will break. From the user's point of view, however, this
will be very painful: you just wanted to inspect the state of a
running transaction, very possibly a long-running one that had done a
lot of work, and you accidentally killed it. So I think it's going to
be important that such cases are extremely rare in practice. If they
happen because of extremely weird scenarios like somebody manually
removing a shared memory segment while it's still in use by
PostgreSQL, or the system running out of memory at exactly the wrong
moment, I think that's probably fine. If they happen because of
something like a low-probability race condition, that's going to be
unacceptable to users. Nobody is going to be happy about using the
feature if it can cause the transactions that they're querying to
randomly die, even if the chances are low. I'm not saying the patch
actually has this problem, just that it's something to think very
carefully about.

--
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Nico Williams
Date:
Subject: Re: [oauth] SASL mechanisms
Next
From: Robert Haas
Date:
Subject: code contributions for 2025, WIP version