Re: Fix slot synchronization with two_phase decoding enabled - Mailing list pgsql-hackers

From	Nisha Moond
Subject	Re: Fix slot synchronization with two_phase decoding enabled
Date	May 21 07:48:03
Msg-id	CABdArM6XdTMjPXq0d6GWNHz9KHTB+RaVx=aJU-9_TaqVTND4Pg@mail.gmail.com
In response to	RE: Fix slot synchronization with two_phase decoding enabled ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>)
List	pgsql-hackers

On Tue, May 6, 2025 at 4:52 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Mon, May 5, 2025 at 6:59 PM Amit Kapila wrote:
> >
> > On Sun, May 4, 2025 at 2:33 PM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > >
> > > While I cannot be entirely certain of my analysis, I believe the root
> > > cause might be related to the backward movement of the confirmed_flush
> > > LSN. The following scenario seems possible:
> > >
> > > 1. The walsender enables the two_phase and sets two_phase_at (which
> > > should be the same as confirmed_flush).
> > > 2. The slot's confirmed_flush regresses for some reason.
> > > 3. The slotsync worker retrieves the remote slot information and
> > > enables two_phase for the local slot.
> > >
> >
> > Yes, this is possible. Here is my theory as to how it can happen in the current
> > case. In the failed test, after the primary has prepared a transaction, the
> > transaction won't be replicated to the subscriber as two_phase was not
> > enabled for the slot. However, subsequent keepalive messages can send the
> > latest WAL location to the subscriber and get the confirmation of the same from
> > the subscriber without its origin being moved. Now, after we restart the apply
> > worker (due to disable/enable for a subscription), it will use the previous
> > origin_lsn to temporarily move back the confirmed flush LSN as explained in
> > one of the previous emails in another thread [1]. During this temporary
> > movement of confirm flush LSN, the slotsync worker fetches the two_phase_at
> > and confirm_flush_lsn values, leading to the assertion failure. We see this
> > issue intermittently because it depends on the timing of slotsync worker's
> > request to fetch the slot's value.
>
> Based on this theory, I can reproduce the BF failure in the 040 tap-test on
> HEAD after applying the 0001 patch. This is achieved by using the injection
> point to stop the walsender from sending a keepalive before receiving the old
> origin position from the apply worker, ensuring the confirmed_flush
> consistently moves backward before slotsync.
>
> Additionally, I've reproduced the duplicate data issue on HEAD without slotsync
> using the attached script (after applying the injection point patch). This
> issue arises if we immediately disable the subscription after the
> confirm_flush_lsn moves backward, preventing the walsender from advancing the
> confirm_flush_lsn.
>
> In this case, if a prepared transaction exists before two_phase_at, then after
> re-enabling the subscription, it will replicate that prepared transaction when
> decoding the PREPARE record and replicate that again when decoding the COMMIT
> PREPARED record. In such cases, the apply worker keeps reporting the error:
>
> ERROR: transaction identifier "pg_gid_16387_755" is already in use.
>
> Apart from above, we're investigating whether the same issue can occur in
> back-branches and will share the results once ready.
>

The issue was confirmed to occur on back branches as well, due to
confirmed_flush_lsn moving backward. It has now been fixed on HEAD and
all supported back-branches down to PG13.

For details, refer to the separate thread [1]; the fix was committed
(commit: ad5eaf3)[2].
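
To illustrate the general idea of not letting confirmed_flush move backward, here is a minimal standalone sketch in C. It is not the committed patch and does not use PostgreSQL's real structures: the names Lsn, SlotModel, confirm_received_location and slot_is_consistent are hypothetical, and the invariant two_phase_at <= confirmed_flush is an assumption based on the analysis quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for an LSN; modeled here as a plain 64-bit integer (assumption). */
typedef uint64_t Lsn;

/* Hypothetical, simplified slot state: only the two fields discussed here. */
typedef struct SlotModel
{
    Lsn confirmed_flush;   /* last position the subscriber confirmed */
    Lsn two_phase_at;      /* position at which two_phase was enabled */
} SlotModel;

/*
 * Apply a flush position reported by the subscriber.
 *
 * A report that is not ahead of the current confirmed_flush (for example,
 * an old origin position sent by a restarted apply worker) is ignored, so
 * the value never regresses. Returns true only if the slot advanced.
 */
static bool
confirm_received_location(SlotModel *slot, Lsn reported)
{
    if (reported <= slot->confirmed_flush)
        return false;

    slot->confirmed_flush = reported;
    return true;
}

/*
 * Assumed invariant for a reader of the slot (such as the slotsync worker):
 * two_phase_at must not be ahead of confirmed_flush.
 */
static bool
slot_is_consistent(const SlotModel *slot)
{
    return slot->two_phase_at <= slot->confirmed_flush;
}
```

In this model, a slot with two_phase_at = 100 that has advanced to confirmed_flush = 150 stays at 150 when a stale report of 90 arrives, so a concurrent reader never observes two_phase_at ahead of confirmed_flush.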

The BF failure has not occurred since the fix, but we'll continue to
keep an eye on it.

[1] https://www.postgresql.org/message-id/CAJpy0uDZ29P=BYB1JDWMCh-6wXaNqMwG1u1mB4=10Ly0x7HhwQ@mail.gmail.com
[2] https://github.com/postgres/postgres/commit/ad5eaf390c58294e2e4c1509aa87bf13261a5d15

--
Thanks,
Nisha
