On Tuesday, September 9, 2025 7:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > Here is V71 patch set which addressed above comments and [1].
> >
>
> IIUC, this patch after stopping the retention, it immediately starts retrying to
> resume by transitioning through various phases. This can consume CPU and
> network resources, if the apply_worker takes a long time to catch up.
I agree. I think one way is to increase the interval in each cycle when the retention has
been stopped and the worker is retrying to resume the retention. I have
updated the patch for the same.
>
> Few other comments:
> 1.
> + /*
> + * Return if the launcher has not initialized oldest_nonremovable_xid.
> + *
> + * It might seem feasible to directly check the conflict detection
> + * slot.xmin instead of relying on the launcher to assign the worker's
> + * oldest_nonremovable_xid; however, that could lead to a race
> + condition
> + * where slot.xmin is set to InvalidTransactionId immediately after the
> + * check. In such cases, oldest_nonremovable_xid would no longer be
> + * protected by a replication slot and could become unreliable if a
> + * wraparound occurs.
> + */
> + if (!TransactionIdIsValid(nonremovable_xid))
> + return;
>
> I understand the reason why we get assigned the worker's non_removable_xid
> from launcher but all this makes the code complex to understand. Isn't there
> any other way to achieve it? One naïve and inefficient way could be to just
> restart the worker probably after updating its retentionactive flag. I am not
> suggesting to make this change but just a brainstorming point.
I'll keep thinking about it, and for now, I've added a comment mentioning
that rebooting is a simpler solution.
>
> 2. The function should_stop_conflict_info_retention() is invoked from
> wait_for_local_flush() and then it can lead further state transitioning which
> doesn't appear neat and makes code difficult to understand.
I changed the logic to avoid proceeding to next phase when
the retention is stopped.
>
> 3.
> + /*
> + * If conflict info retention was previously stopped due to a timeout,
> + and
> + * the time required to advance the non-removable transaction ID has
> + now
> + * decreased to within acceptable limits, resume the rentention.
> + */
> + if (!MySubscription->retentionactive)
> + {
> + rdt_data->phase = RDT_RESUME_CONFLICT_INFO_RETENTION;
> + process_rdt_phase_transition(rdt_data, false); return; }
>
> In this check, where do we check the time has come in acceptable range? Can
> you update comments to make it clear?
I updated the comments to mention that the code can reach here only
when the time is within max_retention_duration.
Here is the V72 patch set which addressed above and Shveta's comments[1].
[1] https://www.postgresql.org/message-id/CAJpy0uDw7SmCN_jOvpNUzo_sE4ZsgpqQ5_vHLjm4aCm10eBApA%40mail.gmail.com
Best Regards,
Hou zj