RE: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: Conflict detection for update_deleted in logical replication
Date
Msg-id TY4PR01MB16907A4AA86C504E2DC5088BD9408A@TY4PR01MB16907.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Conflict detection for update_deleted in logical replication  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Friday, September 12, 2025 2:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Thu, Sep 11, 2025 at 3:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Sep 10, 2025 at 2:09 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > On Wed, Sep 10, 2025 at 9:08 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > On Tuesday, September 9, 2025 7:01 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > > > > On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu)
> > > > > <houzj.fnst@fujitsu.com>
> > > > > wrote:
> > > > >
> > > > > Few other comments:
> > > > > 1.
> > > > > + /*
> > > > > + * Return if the launcher has not initialized oldest_nonremovable_xid.
> > > > > + *
> > > > > + * It might seem feasible to directly check the conflict
> > > > > + detection
> > > > > + * slot.xmin instead of relying on the launcher to assign the
> > > > > + worker's
> > > > > + * oldest_nonremovable_xid; however, that could lead to a race
> > > > > + condition
> > > > > + * where slot.xmin is set to InvalidTransactionId immediately
> > > > > + after the
> > > > > + * check. In such cases, oldest_nonremovable_xid would no
> > > > > + longer be
> > > > > + * protected by a replication slot and could become unreliable
> > > > > + if a
> > > > > + * wraparound occurs.
> > > > > + */
> > > > > + if (!TransactionIdIsValid(nonremovable_xid))
> > > > > + return;
> > > > >
> > > > > I understand the reason why we get assigned the worker's
> > > > > non_removable_xid from launcher but all this makes the code
> > > > > complex to understand. Isn't there any other way to achieve it?
> > > > > One naïve and inefficient way could be to just restart the
> > > > > worker probably after updating its retentionactive flag. I am not
> suggesting to make this change but just a brainstorming point.
> > > >
> > > > I'll keep thinking about it, and for now, I've added a comment
> > > > mentioning that rebooting is a simpler solution.
> > > >
> > >
> > > Say we have a LW LogicalRepRetentionLock. We acquire it in SHARED
> > > mode as soon as we encounter the first subscription with
> > > retain_dead_tuples set during the traversal of the sublist. We
> > > release it only after updating xmin outside the sublist traversal.
> > > We then acquire it in EXCLUSIVE mode to fetch the resume the
> > > retention in worker for the period till we fetch slot's xmin.
> > >
> > > This will close the above race condition but acquiring LWLock while
> > > traversing subscription is not advisable as that will make it
> > > uninterruptible. The other possibility is to use some heavy-weight
> > > lock here, say a lock on pg_subscription catalog but that has a
> > > drawback that it can conflict with DDL operations. Yet another way
> > > is to invent a new lock-type for this.
> > >
> > > OTOH, the simple strategy to let apply-worker restart to resume
> > > retention will keep the handling simpler. We do something similar at
> > > the start of apply-worker if we find that some subscription
> > > parameter is changed. I think we need more opinions on this matter.
> >
> > IMHO the situation of retention being disabled and re-enabled is not a
> > common occurrence. It typically happens in specific scenarios where
> > there's a significant replication lag or the user has not configured
> > the retention timeout correctly. Because these are corner cases, I
> > believe we should avoid over-engineering a solution and simply restart
> > the worker, as Amit suggested.
> 
> +1
> 
> While it's ideal if workers could initialize their oldest_nonremovable_xid values
> on-the-fly, I believe we can begin with the simple solution given that stopping
> and resuming retaining of conflict info would not happen so often. In fact,
> frequent stops and restarts would typically be a sign that users might be not
> configuring the options properly for their systems. IIUC if the workers are able to
> do that, we can support to activate retain_conflict_info even for enabled
> subscriptions. I think we can leave it for future improvements if necessary.

I agree. Here is a V73 patch that will restart the worker if the retention
resumes. I also addressed other comments posted by Amit[1].

[1] https://www.postgresql.org/message-id/CAA4eK1LA7mnvKT9L8Nx_h%2B0TCvq-Ob2BGPO1bQKhkGHtoZKsow%40mail.gmail.com

Best Regards,
Hou zj

Attachment

pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Re: Incorrect logic in XLogNeedsFlush()
Next
From: Naga Appani
Date:
Subject: Re: [Proposal] Expose internal MultiXact member count function for efficient monitoring