Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: Conflict detection for update_deleted in logical replication
Date
Msg-id CAFiTN-uXki5nX+1RpFvo++dT1NSrEJhPxGtJAqQRS89vuPX7pA@mail.gmail.com
Whole thread Raw
In response to Re: Conflict detection for update_deleted in logical replication  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Conflict detection for update_deleted in logical replication
List pgsql-hackers
On Wed, Sep 10, 2025 at 2:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Sep 10, 2025 at 9:08 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, September 9, 2025 7:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > > wrote:
> > >
> > > Few other comments:
> > > 1.
> > > + /*
> > > + * Return if the launcher has not initialized oldest_nonremovable_xid.
> > > + *
> > > + * It might seem feasible to directly check the conflict detection
> > > + * slot.xmin instead of relying on the launcher to assign the worker's
> > > + * oldest_nonremovable_xid; however, that could lead to a race
> > > + condition
> > > + * where slot.xmin is set to InvalidTransactionId immediately after the
> > > + * check. In such cases, oldest_nonremovable_xid would no longer be
> > > + * protected by a replication slot and could become unreliable if a
> > > + * wraparound occurs.
> > > + */
> > > + if (!TransactionIdIsValid(nonremovable_xid))
> > > + return;
> > >
> > > I understand the reason why we get assigned the worker's non_removable_xid
> > > from launcher but all this makes the code complex to understand. Isn't there
> > > any other way to achieve it? One naïve and inefficient way could be to just
> > > restart the worker probably after updating its retentionactive flag. I am not
> > > suggesting to make this change but just a brainstorming point.
> >
> > I'll keep thinking about it, and for now, I've added a comment mentioning
> > that rebooting is a simpler solution.
> >
>
> Say we have a LW LogicalRepRetentionLock. We acquire it in SHARED mode
> as soon as we encounter the first subscription with retain_dead_tuples
> set during the traversal of the sublist. We release it only after
> updating xmin outside the sublist traversal. We then acquire it in
> EXCLUSIVE mode to fetch the resume the retention in worker for the
> period till we fetch slot's xmin.
>
> This will close the above race condition but acquiring LWLock while
> traversing subscription is not advisable as that will make it
> uninterruptible. The other possibility is to use some heavy-weight
> lock here, say a lock on pg_subscription catalog but that has a
> drawback that it can conflict with DDL operations. Yet another way is
> to invent a new lock-type for this.
>
> OTOH, the simple strategy to let apply-worker restart to resume
> retention will keep the handling simpler. We do something similar at
> the start of apply-worker if we find that some subscription parameter
> is changed. I think we need more opinions on this matter.

IMHO the situation of retention being disabled and re-enabled is not a
common occurrence. It typically happens in specific scenarios where
there's a significant replication lag or the user has not configured
the retention timeout correctly. Because these are corner cases, I
believe we should avoid over-engineering a solution and simply restart
the worker, as Amit suggested.

--
Regards,
Dilip Kumar
Google



pgsql-hackers by date:

Previous
From: Vamshikrishna T
Date:
Subject: Re: AIX support
Next
From: "Greg Burd"
Date:
Subject: Re: [PATCH] Add tests for Bitmapset