Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Conflict detection for update_deleted in logical replication
Date
Msg-id CAA4eK1+POfiOGDaGQ=Mab5R8Vgd8k1N9vKkqMay+jYQKT3w+cA@mail.gmail.com
Whole thread Raw
In response to RE: Conflict detection for update_deleted in logical replication  ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>)
Responses RE: Conflict detection for update_deleted in logical replication
List pgsql-hackers
On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Here is V71 patch set which addressed above comments and [1].
>

IIUC, this patch after stopping the retention, it immediately starts
retrying to resume by transitioning through various phases. This can
consume CPU and network resources, if the apply_worker takes a long
time to catch up.

Few other comments:
1.
+ /*
+ * Return if the launcher has not initialized oldest_nonremovable_xid.
+ *
+ * It might seem feasible to directly check the conflict detection
+ * slot.xmin instead of relying on the launcher to assign the worker's
+ * oldest_nonremovable_xid; however, that could lead to a race condition
+ * where slot.xmin is set to InvalidTransactionId immediately after the
+ * check. In such cases, oldest_nonremovable_xid would no longer be
+ * protected by a replication slot and could become unreliable if a
+ * wraparound occurs.
+ */
+ if (!TransactionIdIsValid(nonremovable_xid))
+ return;

I understand the reason why we get assigned the worker's
non_removable_xid from launcher but all this makes the code complex to
understand. Isn't there any other way to achieve it? One naïve and
inefficient way could be to just restart the worker probably after
updating its retentionactive flag. I am not suggesting to make this
change but just a brainstorming point.

2. The function should_stop_conflict_info_retention() is invoked from
wait_for_local_flush() and then it can lead further state
transitioning which doesn't appear neat and makes code difficult to
understand.

3.
+ /*
+ * If conflict info retention was previously stopped due to a timeout, and
+ * the time required to advance the non-removable transaction ID has now
+ * decreased to within acceptable limits, resume the rentention.
+ */
+ if (!MySubscription->retentionactive)
+ {
+ rdt_data->phase = RDT_RESUME_CONFLICT_INFO_RETENTION;
+ process_rdt_phase_transition(rdt_data, false);
+ return;
+ }

In this check, where do we check the time has come in acceptable
range? Can you update comments to make it clear?

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Richard Guo
Date:
Subject: Re: Eager aggregation, take 3
Next
From: Richard Guo
Date:
Subject: Re: Eager aggregation, take 3