Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData
Date
Msg-id aEFFyLi3YqOEAJy7@paquier.xyz
In response to Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData  (Noah Misch <noah@leadboat.com>)
Responses Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData
List pgsql-hackers
On Mon, Jun 02, 2025 at 06:48:46PM -0700, Noah Misch wrote:
> The wasShutdown case reaches consistency from the beginning, so I don't see
> that as an example of a time we benefit from reading pg_twophase before
> reaching consistency.  Can you elaborate on that?
>
> What's the benefit you're trying to get by reading pg_twophase before reaching
> consistency?

My point is mostly about code simplicity and consistency: use the
same logic (i.e. reading the contents of pg_twophase) at the same
point in the recovery process, and filter out the contents we know we
will not be able to trust.  So I mean to do that once at the
beginning of recovery, where we only compare the 2PC file names with
the XID boundaries in the checkpoint record:
- Discard any files with an XID newer than the next XID.  We know that
these would be in WAL anyway, if we replay from a timeline where they
matter.
- Discard any files that are older than the oldest XID.  We know that
these files don't matter as 2PC transactions hold the XID horizon.
- Keep the others for later evaluation.

So there's no actual need to check the contents of the files, though
that still implies trusting the names of the 2PC files in pg_twophase/.

> I can think of one benefit of attempting to read pg_twophase before reaching
> consistency.  Suppose we can prove that a pg_twophase file will cause an error
> by end of recovery, regardless of what WAL contains.  It's nice to fail
> recovery immediately instead of failing recovery when we reach consistency.
> However, I doubt that benefit is important enough to depart from our usual
> principle and incur additional storage seeks in order to achieve that benefit.
> If recovery will certainly fail, you are going to have a bad day anyway.
> Accelerating recovery failure is a small benefit, particularly when we'd
> accelerate failure for only a small slice of recovery failure causes.

Well, I kind of disagree here.  Failing recovery faster can be
beneficial.  It perhaps has less merit since 7ff23c6d277d, which
means we should replay less after a failure during crash recovery,
but it still seems useful to me if we can do it.  That depends on the
amount of trust put in the data we read, of course; if only WAL is
trusted, there's not much that can be done at the beginning of
recovery.

>> I agree that moving towards a solution where we get rid entirely of
>> the CLOG lookups in ProcessTwoPhaseBuffer() is what we should aim for,
>> and actually is there a reason to not just nuke and replace them
>> something based on the checkpoint record itself?
>
> I don't know what this means.

Regarding the contents of the last patch posted on this thread, I am
referring to moving the calls to TransactionIdDidCommit() and
TransactionIdDidAbort() from ProcessTwoPhaseBuffer(), which is called
almost every time WAL is replayed (whether or not consistency has
been reached), to RecoverPreparedTransactions(), which runs just
before consistency is marked as such in the system.  When
RecoverPreparedTransactions() is called, we're ready to mark the
cluster as OK for writes, and WAL has already been fully replayed
with all sanity checks done.  CLOG accesses at this stage would not
be an issue.

>> Wouldn't it be OK in this case to assume that the contents of this
>> file will be in WAL anyway?
>
> Sure.  Meanwhile, if a twophase file is going to be in later WAL, what's the
> value in opening the file before we get to that WAL?

True.  We could avoid loading some 2PC files if we know that their
contents will be in WAL when replaying.

There is still one point that I'm really unclear about.  Some 2PC
transactions are flushed at checkpoint time, so we will have to trust
the contents of pg_twophase/ at some point.  Do you mean to always
delay that until consistency is reached, or until we know that we're
starting from a clean state?  What I'd really prefer to avoid is
having two code paths in charge of reading the contents of
pg_twophase.

The point I am trying to make is that there has to be a certain level
of trust in the contents of pg_twophase, at some point during replay.
Or, we invent a new mechanism where all the twophase files go through
WAL and remove the need for pg_twophase/ when recovering.  For
example, we could have an extra record generated at each checkpoint
with the contents of the 2PC files still pending for commit
(potentially costly if the same transaction persists across multiple
checkpoints as this gets repeated), or something entirely different.
Something like that would put WAL as the sole source of trust by
design.

Or are you seeing things differently?  In that case, I am not sure
where you would draw the "correct" line here (well, you did say to
trust only WAL until consistency is reached), nor do I completely
understand how much 2PC transaction state we should keep in shared
memory until consistency is reached.  Or perhaps you mean to somehow
eliminate even more than that?  I'm unsure how much this would imply
for the existing recovery mechanisms (replication origin advancement
at replay for prepared transactions may be one area to look at, for
example).
--
Michael
