RE: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: Conflict detection for update_deleted in logical replication
Date
Msg-id TY4PR01MB169075232AE3A6E643F0EB749940EA@TY4PR01MB16907.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Conflict detection for update_deleted in logical replication  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Conflict detection for update_deleted in logical replication
List pgsql-hackers
On Tuesday, September 9, 2025 7:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > Here is V71 patch set which addressed above comments and [1].
> >
> 
> IIUC, this patch after stopping the retention, it immediately starts retrying to
> resume by transitioning through various phases. This can consume CPU and
> network resources, if the apply_worker takes a long time to catch up.

I agree. I think one way is to increase the interval in each cycle when the retention has
been stopped and the worker is retrying to resume the retention. I have
updated the patch for the same.

> 
> Few other comments:
> 1.
> + /*
> + * Return if the launcher has not initialized oldest_nonremovable_xid.
> + *
> + * It might seem feasible to directly check the conflict detection
> + * slot.xmin instead of relying on the launcher to assign the worker's
> + * oldest_nonremovable_xid; however, that could lead to a race
> + condition
> + * where slot.xmin is set to InvalidTransactionId immediately after the
> + * check. In such cases, oldest_nonremovable_xid would no longer be
> + * protected by a replication slot and could become unreliable if a
> + * wraparound occurs.
> + */
> + if (!TransactionIdIsValid(nonremovable_xid))
> + return;
> 
> I understand the reason why we get assigned the worker's non_removable_xid
> from launcher but all this makes the code complex to understand. Isn't there
> any other way to achieve it? One naïve and inefficient way could be to just
> restart the worker probably after updating its retentionactive flag. I am not
> suggesting to make this change but just a brainstorming point.

I'll keep thinking about it, and for now, I've added a comment mentioning
that rebooting is a simpler solution.

> 
> 2. The function should_stop_conflict_info_retention() is invoked from
> wait_for_local_flush() and then it can lead further state transitioning which
> doesn't appear neat and makes code difficult to understand.

I changed the logic to avoid proceeding to next phase when
the retention is stopped.

> 
> 3.
> + /*
> + * If conflict info retention was previously stopped due to a timeout,
> + and
> + * the time required to advance the non-removable transaction ID has
> + now
> + * decreased to within acceptable limits, resume the rentention.
> + */
> + if (!MySubscription->retentionactive)
> + {
> + rdt_data->phase = RDT_RESUME_CONFLICT_INFO_RETENTION;
> + process_rdt_phase_transition(rdt_data, false); return; }
> 
> In this check, where do we check the time has come in acceptable range? Can
> you update comments to make it clear?

I updated the comments to mention that the code can reach here only
when the time is within max_retention_duration.

Here is the V72 patch set which addressed above and Shveta's comments[1].

[1] https://www.postgresql.org/message-id/CAJpy0uDw7SmCN_jOvpNUzo_sE4ZsgpqQ5_vHLjm4aCm10eBApA%40mail.gmail.com

Best Regards,
Hou zj



Attachment

pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: PostgreSQL 18 GA press release draft
Next
From: "Zhijie Hou (Fujitsu)"
Date:
Subject: RE: Conflict detection for update_deleted in logical replication