Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Conflict detection for update_deleted in logical replication
Date
Msg-id CAA4eK1Jm4fCbjv4g=qjVWmCcXJV9QAPJ3UuxW9QMdCFumcTSeA@mail.gmail.com
Whole thread Raw
In response to Re: Conflict detection for update_deleted in logical replication  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Conflict detection for update_deleted in logical replication
RE: Conflict detection for update_deleted in logical replication
List pgsql-hackers
On Mon, May 26, 2025 at 12:46 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Attaching the V32 patch set which addressed comments in [1]~[5].
>

Review comments:
===============
*
+advance_conflict_slot_xmin(FullTransactionId new_xmin)
+{
+ FullTransactionId full_xmin;
+ FullTransactionId next_full_xid;
+
+ Assert(MyReplicationSlot);
+ Assert(FullTransactionIdIsValid(new_xmin));
+
+ next_full_xid = ReadNextFullTransactionId();
+
+ /*
+ * Compute FullTransactionId for the current xmin. This handles the case
+ * where transaction ID wraparound has occurred.
+ */
+ full_xmin = FullTransactionIdFromAllowableAt(next_full_xid,
+ MyReplicationSlot->data.xmin);
+
+ if (FullTransactionIdPrecedesOrEquals(new_xmin, full_xmin))
+ return;

The above code suggests that the launcher could compute a new xmin
that is less than slot's xmin. At first, this looks odd to me, but
IIUC, this can happen when the user toggles retain_conflict_info flag
at some random time when the launcher is trying to compute the new
xmin value for the slot. One of the possible combinations of steps for
this race could be as follows:

1. The subscriber has two subscriptions, A and B. Subscription A has
retain_conflict_info as true, and B has retain_conflict_info as false
2. Say the launcher calls get_subscription_list(), and worker A is
already alive.
3. Assuming the apply worker will restart on changing
retain_conflict_info, the user enables retain_conflict_info for
subscription B.
4. The launcher processes the subscription B first in the first cycle,
and starts worker B. Say, worker B gets 759 as candidate_xid.
5. The launcher creates the conflict detection slot, xmin = 759
6. Say a new txn happens, worker A gets 760 as candidate_xid and
updates it to oldest_nonremovable_xid.
7. The launcher processes the subscription A in the first cycle, and
the final xmin value is 760, because it only checks the
oldest_nonremovable_xid from worker A. The launcher then updates the
value to slot.xmin.
8. In the next cycle, the launcher finds that worker B has an older
oldest_nonremovable_xid 759, so the minimal xid would now be 759. The
launher would have retreated the slot's xmin unless we had the above
check in the quoted code.

I think the above race is possible because the system lets the changed
subscription values of a subscription take effect asynchronously by
workers. The one more similar race condition handled by the patch is
as follows:

*
...
+ * It's necessary to use FullTransactionId here to mitigate potential race
+ * conditions. Such scenarios might occur if the replication slot is not
+ * yet created by the launcher while the apply worker has already
+ * initialized this field. During this period, a transaction ID wraparound
+ * could falsely make this ID appear as if it originates from the future
+ * w.r.t the transaction ID stored in the slot maintained by launcher. See
+ * advance_conflict_slot_xmin.
...
+ FullTransactionId oldest_nonremovable_xid;

This case can happen if the user disables and immediately enables the
retain_conflict_info option. In this case, the launcher may drop the
slot after noticing the disable. But the apply worker may not notice
the disable and it only notices that the retain_conflict_info is still
enabled, so it will keep maintaining oldest_nonremovable_xid when the
slot is not created.

It is okay to handle both the race conditions, but I am worried we may
miss some such race conditions which could lead to difficult-to-find
bugs. So, at least for the first version of this option (aka for
patches 0001 to 0003), we can add a condition that allows us to change
retain_conflict_info only on disabled subscriptions. This will
simplify the patch. We can make a separate patch to allow changing
retain_conflict_info option for enabled subscriptions. That will make
it easier to evaluate such race conditions and the solutions more
deeply.

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: John Naylor
Date:
Subject: Re: Batch TIDs lookup in ambulkdelete
Next
From: Fujii Masao
Date:
Subject: Re: Suggestions for improving \conninfo output in v18