Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Conflict detection for update_deleted in logical replication |
Date | |
Msg-id | CAA4eK1Jm4fCbjv4g=qjVWmCcXJV9QAPJ3UuxW9QMdCFumcTSeA@mail.gmail.com Whole thread Raw |
In response to | Re: Conflict detection for update_deleted in logical replication (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: Conflict detection for update_deleted in logical replication
RE: Conflict detection for update_deleted in logical replication |
List | pgsql-hackers |
On Mon, May 26, 2025 at 12:46 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Attaching the V32 patch set which addressed comments in [1]~[5]. > Review comments: =============== * +advance_conflict_slot_xmin(FullTransactionId new_xmin) +{ + FullTransactionId full_xmin; + FullTransactionId next_full_xid; + + Assert(MyReplicationSlot); + Assert(FullTransactionIdIsValid(new_xmin)); + + next_full_xid = ReadNextFullTransactionId(); + + /* + * Compute FullTransactionId for the current xmin. This handles the case + * where transaction ID wraparound has occurred. + */ + full_xmin = FullTransactionIdFromAllowableAt(next_full_xid, + MyReplicationSlot->data.xmin); + + if (FullTransactionIdPrecedesOrEquals(new_xmin, full_xmin)) + return; The above code suggests that the launcher could compute a new xmin that is less than slot's xmin. At first, this looks odd to me, but IIUC, this can happen when the user toggles retain_conflict_info flag at some random time when the launcher is trying to compute the new xmin value for the slot. One of the possible combinations of steps for this race could be as follows: 1. The subscriber has two subscriptions, A and B. Subscription A has retain_conflict_info as true, and B has retain_conflict_info as false 2. Say the launcher calls get_subscription_list(), and worker A is already alive. 3. Assuming the apply worker will restart on changing retain_conflict_info, the user enables retain_conflict_info for subscription B. 4. The launcher processes the subscription B first in the first cycle, and starts worker B. Say, worker B gets 759 as candidate_xid. 5. The launcher creates the conflict detection slot, xmin = 759 6. Say a new txn happens, worker A gets 760 as candidate_xid and updates it to oldest_nonremovable_xid. 7. The launcher processes the subscription A in the first cycle, and the final xmin value is 760, because it only checks the oldest_nonremovable_xid from worker A. The launcher then updates the value to slot.xmin. 8. In the next cycle, the launcher finds that worker B has an older oldest_nonremovable_xid 759, so the minimal xid would now be 759. The launher would have retreated the slot's xmin unless we had the above check in the quoted code. I think the above race is possible because the system lets the changed subscription values of a subscription take effect asynchronously by workers. The one more similar race condition handled by the patch is as follows: * ... + * It's necessary to use FullTransactionId here to mitigate potential race + * conditions. Such scenarios might occur if the replication slot is not + * yet created by the launcher while the apply worker has already + * initialized this field. During this period, a transaction ID wraparound + * could falsely make this ID appear as if it originates from the future + * w.r.t the transaction ID stored in the slot maintained by launcher. See + * advance_conflict_slot_xmin. ... + FullTransactionId oldest_nonremovable_xid; This case can happen if the user disables and immediately enables the retain_conflict_info option. In this case, the launcher may drop the slot after noticing the disable. But the apply worker may not notice the disable and it only notices that the retain_conflict_info is still enabled, so it will keep maintaining oldest_nonremovable_xid when the slot is not created. It is okay to handle both the race conditions, but I am worried we may miss some such race conditions which could lead to difficult-to-find bugs. So, at least for the first version of this option (aka for patches 0001 to 0003), we can add a condition that allows us to change retain_conflict_info only on disabled subscriptions. This will simplify the patch. We can make a separate patch to allow changing retain_conflict_info option for enabled subscriptions. That will make it easier to evaluate such race conditions and the solutions more deeply. -- With Regards, Amit Kapila.
pgsql-hackers by date: