Re: Fix slot synchronization with two_phase decoding enabled - Mailing list pgsql-hackers
From | Nisha Moond |
---|---|
Subject | Re: Fix slot synchronization with two_phase decoding enabled |
Date | |
Msg-id | CABdArM6XdTMjPXq0d6GWNHz9KHTB+RaVx=aJU-9_TaqVTND4Pg@mail.gmail.com |
In response to | RE: Fix slot synchronization with two_phase decoding enabled ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
List | pgsql-hackers |
On Tue, May 6, 2025 at 4:52 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Mon, May 5, 2025 at 6:59 PM Amit Kapila wrote:
> >
> > On Sun, May 4, 2025 at 2:33 PM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > >
> > > While I cannot be entirely certain of my analysis, I believe the root
> > > cause might be related to the backward movement of the confirmed_flush
> > > LSN. The following scenario seems possible:
> > >
> > > 1. The walsender enables the two_phase and sets two_phase_at (which
> > > should be the same as confirmed_flush).
> > > 2. The slot's confirmed_flush regresses for some reason.
> > > 3. The slotsync worker retrieves the remote slot information and
> > > enables two_phase for the local slot.
> > >
> >
> > Yes, this is possible. Here is my theory as to how it can happen in the current
> > case. In the failed test, after the primary has prepared a transaction, the
> > transaction won't be replicated to the subscriber as two_phase was not
> > enabled for the slot. However, subsequent keepalive messages can send the
> > latest WAL location to the subscriber and get the confirmation of the same from
> > the subscriber without its origin being moved. Now, after we restart the apply
> > worker (due to disable/enable for a subscription), it will use the previous
> > origin_lsn to temporarily move back the confirmed flush LSN as explained in
> > one of the previous emails in another thread [1]. During this temporary
> > movement of confirm flush LSN, the slotsync worker fetches the two_phase_at
> > and confirm_flush_lsn values, leading to the assertion failure. We see this
> > issue intermittently because it depends on the timing of slotsync worker's
> > request to fetch the slot's value.
>
> Based on this theory, I can reproduce the BF failure in the 040 tap-test on
> HEAD after applying the 0001 patch. This is achieved by using the injection
> point to stop the walsender from sending a keepalive before receiving the old
> origin position from the apply worker, ensuring the confirmed_flush
> consistently moves backward before slotsync.
>
> Additionally, I've reproduced the duplicate data issue on HEAD without slotsync
> using the attached script (after applying the injection point patch). This
> issue arises if we immediately disable the subscription after the
> confirm_flush_lsn moves backward, preventing the walsender from advancing the
> confirm_flush_lsn.
>
> In this case, if a prepared transaction exists before two_phase_at, then after
> re-enabling the subscription, it will replicate that prepared transaction when
> decoding the PREPARE record and replicate that again when decoding the COMMIT
> PREPARED record. In such cases, the apply worker keeps reporting the error:
>
> ERROR: transaction identifier "pg_gid_16387_755" is already in use.
>
> Apart from above, we're investigating whether the same issue can occur in
> back-branches and will share the results once ready.
>

The issue was confirmed to occur on back branches as well, due to
confirmed_flush_lsn moving backward. It has now been fixed on HEAD and
all supported back-branches down to PG13.

For details, refer to the separate thread [1]; the fix was committed
as commit ad5eaf3 [2].

The BF failure has not occurred since the fix, but we’ll continue to
keep an eye on it.

[1] https://www.postgresql.org/message-id/CAJpy0uDZ29P=BYB1JDWMCh-6wXaNqMwG1u1mB4=10Ly0x7HhwQ@mail.gmail.com
[2] https://github.com/postgres/postgres/commit/ad5eaf390c58294e2e4c1509aa87bf13261a5d15

--
Thanks,
Nisha
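
For illustration only, the scenario discussed in this message can be modeled with a toy sketch. This is not PostgreSQL code: the `Slot` class, both `ack_*` functions, and the LSN values are hypothetical, and the invariant `two_phase_at <= confirmed_flush` is an assumption about what the failing assertion checks. The monotonic guard is one plausible way to rule out the regression in this model; it is not claimed to match the committed fix [2].

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    """Toy stand-in for a logical replication slot (hypothetical)."""
    confirmed_flush: int
    two_phase_at: Optional[int] = None


def ack_naive(slot: Slot, lsn: int) -> None:
    # Accept whatever position the subscriber reports, even an older one.
    slot.confirmed_flush = lsn


def ack_monotonic(slot: Slot, lsn: int) -> None:
    # Only ever advance; ignore a report that would move the LSN backward.
    if lsn > slot.confirmed_flush:
        slot.confirmed_flush = lsn


def invariant_holds(slot: Slot) -> bool:
    # Assumed invariant: two_phase_at never lies ahead of confirmed_flush.
    return slot.two_phase_at is None or slot.two_phase_at <= slot.confirmed_flush


if __name__ == "__main__":
    # Keepalives advanced confirmed_flush to 300; two_phase enabled there.
    naive = Slot(confirmed_flush=300, two_phase_at=300)
    # The restarted apply worker reports its old origin position, 100.
    ack_naive(naive, 100)
    print("naive:", naive, "invariant holds:", invariant_holds(naive))

    guarded = Slot(confirmed_flush=300, two_phase_at=300)
    ack_monotonic(guarded, 100)
    print("guarded:", guarded, "invariant holds:", invariant_holds(guarded))
```

In the naive variant, a reader that samples the slot right after the stale report sees `confirmed_flush` behind `two_phase_at`, which corresponds to the window in which the slotsync worker is described as fetching the values.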