Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData
Date
Msg-id aFSk-NQ0efcFQ5IF@paquier.xyz
Whole thread Raw
In response to Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers
On Fri, Jun 06, 2025 at 03:34:13PM -0700, Noah Misch wrote:
> On Thu, Jun 05, 2025 at 04:22:48PM +0900, Michael Paquier wrote:
>> On Mon, Jun 02, 2025 at 06:48:46PM -0700, Noah Misch wrote:
>> and perform some filtering of the contents we
>> know we will not be able to trust.  So I mean to do that once at its
>> beginning of recovery, where we only compare the names of the 2PC file
>> names with the XID boundaries in the checkpoint record:
>>
>> - Discard any files with an XID newer than the next XID.  We know that
>> these would be in WAL anyway, if they replay from a timeline where it
>> matters.
>
> v3-0002-Improve-handling-of-2PC-files-during-recovery.patch has this continue
> to emit a WARNING.  However, this can happen without any exceptional events in
> the backup's history.  The message should be LOG or DEBUG if done this early.
> This is not new in the patch, and it would be okay to not change it as part of
> $SUBJECT.  I mention it because, if we scanned pg_twophase after reaching
> consistency, an ERROR would be appropriate.  (A too-new pg_twophase file then
> signifies the backup creator having copied files after the backup stop, a
> violation of the backup protocol.)

I need to ponder more about this point, I think.

>> - Discard any files that are older than the oldest XID.  We know that
>> these files don't matter as 2PC transactions hold the XID horizon.
>
> In contrast, a WARNING is fine here.  A too-old file implies one of these
> three, each of which counts as an exceptional event:
>
> - an OS crash revived a file after unlink(), since RemoveTwoPhaseFile() does not fsync
> - unlink() failed in RemoveTwoPhaseFile()
> - backup protocol violation
>
> I agree start-of-recovery can correctly do your list of two discard actions;
> they do not require a consistent state.  If the patch brings us to the point
> that recovery does only those two things with pg_twophase before reaching
> consistency, that sounds fine.  Doing them that early doesn't sound optimal to
> me, but I've not edited that area as much as you have.  If it gets the
> user-visible behaviors right and doesn't look fragile, I'll be fine with it.

Noted.

>> So there's no actual need to check the contents of the files, still
>> that implies trusting the names of the 2PC files in pg_twophase/.
>
> Trusting file names is fine.

Noted, thanks.

>> or if we know that we're
>> starting from a clean state?
>
> I don't know what "clean slate" means.  If "clean state"=wasShutdown, that's
> just a special case of "consistency is reached".

By "clean state", I meant wasShutdown = true, yes.  Sorry for the
confusion.

>> Or are you seeing things differently?  In which case, I am not sure
>> what's the line you think would be "correct" here (well, you did say
>> only WAL until consistency is reached), nor do I completely understand
>> how much 2PC transaction state we should have in shared memory until
>> consistency is reached.
>
> Shared memory should (not "must") omit a GXACT whose prepare_start_lsn lies
> beyond the point recovery has reached.  In other words, recovery should keep
> anachronistic information out of shared memory.  For example, one could
> imagine a straw man implementation in which, before reaching consistency, we
> load each pg_twophase file that does pass its CRC check.  I'd dislike that,
> because it would facilitate anachronisms involving the
> GlobalTransactionData.gid field.  We could end up with two GXACT having the
> same gid, one being the present (according to recovery progress) instance of
> that gid and another being a future instance.  The system might not
> malfunction today, but I'd consider that fragile.  Anachronistic entries might
> cause recovery to need more shared memory than max_prepared_transactions has
> allocated.

So this comes down to attempting to translate the information from the
2PC files to shared memory at a point earlier than what we could.
Thinking more about this point, there is something that I can see us
potentially do here: how about tracking the file names in shared
memory when we go through them in restoreTwoPhaseData()?  That
addresses my point about doing only one scan of pg_twophase/ at the
beginning of recovery.  I have not studied in details how doable it
is, but we should be able to delay any calls to
ProcessTwoPhaseBuffer() until we are sure that it is safe to do, which
would in charge of reading the files based on the names we've picked
at the early stages.

RecoverPreparedTransactions() is an easy one, we only call it at the
end of recovery.  The sticky point to study is the call of
StandbyRecoverPreparedTransactions() when replaying a shutdown
checkpoint..  Ugh.

> You might find some more-important goal preempts that no-anachronisms goal.

Right, thanks.

By the way, what do you think we should do with 0001 at this stage
(your refactoring work with some changes I've sprinkled)?  I was
looking at it this morning with fresh eyes and my opinion about it is
that it improves the situation in all these 2PC callers on HEAD now
that we need to deal with fxids for the file names, so that's an
independent piece.  Perhaps this should be applied once v19 opens for
business while we sort out the rest that 0002 was trying to cover?
It's not something that we can backpatch safely due to the ABI change
in 2PC callback, unfortunately, just a cleanup piece.

Note: this introduced two calls to FullTransactionIdFromAllowableAt()
in StandbyTransactionIdIsPrepared() and PrepareRedoAdd(), which worried
me a bit on second look and gave me a pause:
- The call in StandbyTransactionIdIsPrepared() should use
AdjustToFullTransactionId(), ReadTwoPhaseFile() calls
TwoPhaseFilePath(), which uses AdjustToFullTransactionId() on HEAD.
- The call in PrepareRedoAdd() slightly worried me, but I think that
we're OK with using FullTransactionIdFromAllowableAt(), as we are in
the case of a 2PC transaction recovered from WAL, with the XID coming
from the file's header.  I think that we'd better put an
Assert(InRecovery) in this path, just for safety.

What do you think?  0002 needs more thoughts, so I'm attaching a
slightly-reviewed version of 0001 for now.
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: Re: Improve the performance of Unicode Normalization Forms.
Next
From: Alexander Korotkov
Date:
Subject: Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly