Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Conflict detection for update_deleted in logical replication |
Date | |
Msg-id | CAD21AoDox9K9Vnhssj+S0ovW2bJFsNddXY1A-MWXVVfbim2NJg@mail.gmail.com Whole thread Raw |
In response to | Re: Conflict detection for update_deleted in logical replication (Dilip Kumar <dilipbalaut@gmail.com>) |
Responses |
RE: Conflict detection for update_deleted in logical replication
|
List | pgsql-hackers |
On Thu, Sep 11, 2025 at 3:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Sep 10, 2025 at 2:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Sep 10, 2025 at 9:08 AM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > On Tuesday, September 9, 2025 7:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > > > > wrote: > > > > > > > > Few other comments: > > > > 1. > > > > + /* > > > > + * Return if the launcher has not initialized oldest_nonremovable_xid. > > > > + * > > > > + * It might seem feasible to directly check the conflict detection > > > > + * slot.xmin instead of relying on the launcher to assign the worker's > > > > + * oldest_nonremovable_xid; however, that could lead to a race > > > > + condition > > > > + * where slot.xmin is set to InvalidTransactionId immediately after the > > > > + * check. In such cases, oldest_nonremovable_xid would no longer be > > > > + * protected by a replication slot and could become unreliable if a > > > > + * wraparound occurs. > > > > + */ > > > > + if (!TransactionIdIsValid(nonremovable_xid)) > > > > + return; > > > > > > > > I understand the reason why we get assigned the worker's non_removable_xid > > > > from launcher but all this makes the code complex to understand. Isn't there > > > > any other way to achieve it? One naïve and inefficient way could be to just > > > > restart the worker probably after updating its retentionactive flag. I am not > > > > suggesting to make this change but just a brainstorming point. > > > > > > I'll keep thinking about it, and for now, I've added a comment mentioning > > > that rebooting is a simpler solution. > > > > > > > Say we have a LW LogicalRepRetentionLock. We acquire it in SHARED mode > > as soon as we encounter the first subscription with retain_dead_tuples > > set during the traversal of the sublist. We release it only after > > updating xmin outside the sublist traversal. We then acquire it in > > EXCLUSIVE mode to fetch the resume the retention in worker for the > > period till we fetch slot's xmin. > > > > This will close the above race condition but acquiring LWLock while > > traversing subscription is not advisable as that will make it > > uninterruptible. The other possibility is to use some heavy-weight > > lock here, say a lock on pg_subscription catalog but that has a > > drawback that it can conflict with DDL operations. Yet another way is > > to invent a new lock-type for this. > > > > OTOH, the simple strategy to let apply-worker restart to resume > > retention will keep the handling simpler. We do something similar at > > the start of apply-worker if we find that some subscription parameter > > is changed. I think we need more opinions on this matter. > > IMHO the situation of retention being disabled and re-enabled is not a > common occurrence. It typically happens in specific scenarios where > there's a significant replication lag or the user has not configured > the retention timeout correctly. Because these are corner cases, I > believe we should avoid over-engineering a solution and simply restart > the worker, as Amit suggested. +1 While it's ideal if workers could initialize their oldest_nonremovable_xid values on-the-fly, I believe we can begin with the simple solution given that stopping and resuming retaining of conflict info would not happen so often. In fact, frequent stops and restarts would typically be a sign that users might be not configuring the options properly for their systems. IIUC if the workers are able to do that, we can support to activate retain_conflict_info even for enabled subscriptions. I think we can leave it for future improvements if necessary. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: