Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers
From: Dilip Kumar
Subject: Re: Conflict detection for update_deleted in logical replication
Msg-id: CAFiTN-uXki5nX+1RpFvo++dT1NSrEJhPxGtJAqQRS89vuPX7pA@mail.gmail.com
In response to: Re: Conflict detection for update_deleted in logical replication (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On Wed, Sep 10, 2025 at 2:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Sep 10, 2025 at 9:08 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, September 9, 2025 7:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > > wrote:
> > >
> > > Few other comments:
> > > 1.
> > > + /*
> > > + * Return if the launcher has not initialized oldest_nonremovable_xid.
> > > + *
> > > + * It might seem feasible to directly check the conflict detection
> > > + * slot.xmin instead of relying on the launcher to assign the worker's
> > > + * oldest_nonremovable_xid; however, that could lead to a race condition
> > > + * where slot.xmin is set to InvalidTransactionId immediately after the
> > > + * check. In such cases, oldest_nonremovable_xid would no longer be
> > > + * protected by a replication slot and could become unreliable if a
> > > + * wraparound occurs.
> > > + */
> > > + if (!TransactionIdIsValid(nonremovable_xid))
> > > + return;
> > >
> > > I understand the reason why we get assigned the worker's non_removable_xid
> > > from launcher but all this makes the code complex to understand. Isn't there
> > > any other way to achieve it? One naïve and inefficient way could be to just
> > > restart the worker probably after updating its retentionactive flag. I am not
> > > suggesting to make this change but just a brainstorming point.
> >
> > I'll keep thinking about it, and for now, I've added a comment mentioning
> > that rebooting is a simpler solution.
> >
>
> Say we have a LW LogicalRepRetentionLock. We acquire it in SHARED mode
> as soon as we encounter the first subscription with retain_dead_tuples
> set during the traversal of the sublist. We release it only after
> updating xmin outside the sublist traversal. We then acquire it in
> EXCLUSIVE mode to fetch the resume the retention in worker for the
> period till we fetch slot's xmin.
>
> This will close the above race condition but acquiring LWLock while
> traversing subscription is not advisable as that will make it
> uninterruptible. The other possibility is to use some heavy-weight
> lock here, say a lock on pg_subscription catalog but that has a
> drawback that it can conflict with DDL operations. Yet another way is
> to invent a new lock-type for this.
>
> OTOH, the simple strategy to let apply-worker restart to resume
> retention will keep the handling simpler. We do something similar at
> the start of apply-worker if we find that some subscription parameter
> is changed. I think we need more opinions on this matter.

IMHO the situation of retention being disabled and re-enabled is not a
common occurrence. It typically happens in specific scenarios where
there's a significant replication lag or the user has not configured
the retention timeout correctly. Because these are corner cases, I
believe we should avoid over-engineering a solution and simply restart
the worker, as Amit suggested.

--
Regards,
Dilip Kumar
Google
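[Editor's note] The check-then-use race in the quoted comment, and the way holding a lock across both the check and the read closes it, can be illustrated with a minimal, self-contained Python sketch. All names here (ConflictSlot, INVALID_XID, guarded_read) are illustrative stand-ins, not PostgreSQL structures; a plain mutex plays the role of the proposed LogicalRepRetentionLock.

```python
import threading

INVALID_XID = 0  # stand-in for InvalidTransactionId


class ConflictSlot:
    """Illustrative stand-in for the conflict-detection slot; not PG code."""

    def __init__(self, xmin):
        self.xmin = xmin
        # Plays the role of the proposed retention lock in this sketch.
        self.lock = threading.Lock()


def racy_read(slot):
    # Unlocked check-then-use: another thread may set slot.xmin to
    # INVALID_XID between the check and the moment the value is used,
    # which is the race described in the quoted comment.
    if slot.xmin != INVALID_XID:
        return slot.xmin
    return None


def guarded_read(slot):
    # The check and the read both happen while the lock is held, so an
    # invalidating writer must wait; in the real proposal the lock would
    # be held across the whole xmin-dependent update.
    with slot.lock:
        if slot.xmin != INVALID_XID:
            return slot.xmin
        return None


def invalidate(slot):
    with slot.lock:
        slot.xmin = INVALID_XID


slot = ConflictSlot(xmin=1234)
before = guarded_read(slot)   # xmin still protected
invalidate(slot)
after = guarded_read(slot)    # xmin no longer valid
```

This only shows the locking pattern; the thread's actual concern is that a shared LWLock held while traversing the subscription list would make the traversal uninterruptible.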
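[Editor's note] The "just restart" strategy Amit compares to the existing handling of changed subscription parameters can also be sketched. This is a toy model under hypothetical names (apply_worker, launcher, read_params): the worker snapshots its parameters at startup and exits when they change, and the launcher starts a fresh worker that re-reads everything from scratch.

```python
def apply_worker(read_params):
    """One worker lifetime: snapshot parameters at start, exit on change."""
    startup_params = read_params()
    work_done = 0
    while work_done < 10:          # pretend to apply a batch of changes
        if read_params() != startup_params:
            # A parameter changed mid-run: rather than adjusting in-flight
            # state, exit and let the launcher restart us cleanly.
            return ("restart_requested", work_done)
        work_done += 1
    return ("finished", work_done)


def launcher(read_params, max_restarts=5):
    """Restart the worker until it completes; count the restarts."""
    restarts = 0
    while restarts <= max_restarts:
        status, _ = apply_worker(read_params)
        if status == "finished":
            return restarts
        restarts += 1
    raise RuntimeError("too many restarts")


# Simulate a parameter flipping once mid-run, as if ALTER SUBSCRIPTION
# toggled retention while the worker was busy.
state = {"reads": 0}


def read_params():
    state["reads"] += 1
    return {"retain_dead_tuples": state["reads"] > 5}


restart_count = launcher(read_params)
```

The appeal of this design, as the thread notes, is that no in-place resume logic (and no extra lock) is needed: a freshly started worker sees a consistent snapshot by construction.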