RE: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers
| From | Zhijie Hou (Fujitsu) |
|---|---|
| Subject | RE: Conflict detection for update_deleted in logical replication |
| Date | |
| Msg-id | TY4PR01MB16907A4AA86C504E2DC5088BD9408A@TY4PR01MB16907.jpnprd01.prod.outlook.com |
| In response to | Re: Conflict detection for update_deleted in logical replication (Masahiko Sawada <sawada.mshk@gmail.com>) |
| List | pgsql-hackers |
On Friday, September 12, 2025 2:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, Sep 11, 2025 at 3:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Sep 10, 2025 at 2:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Sep 10, 2025 at 9:08 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > On Tuesday, September 9, 2025 7:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> > > > >
> > > > > Few other comments:
> > > > > 1.
> > > > > + /*
> > > > > +  * Return if the launcher has not initialized oldest_nonremovable_xid.
> > > > > +  *
> > > > > +  * It might seem feasible to directly check the conflict detection
> > > > > +  * slot.xmin instead of relying on the launcher to assign the worker's
> > > > > +  * oldest_nonremovable_xid; however, that could lead to a race condition
> > > > > +  * where slot.xmin is set to InvalidTransactionId immediately after the
> > > > > +  * check. In such cases, oldest_nonremovable_xid would no longer be
> > > > > +  * protected by a replication slot and could become unreliable if a
> > > > > +  * wraparound occurs.
> > > > > +  */
> > > > > + if (!TransactionIdIsValid(nonremovable_xid))
> > > > > +  return;
> > > > >
> > > > > I understand the reason why we get the worker's non_removable_xid
> > > > > assigned from the launcher, but all this makes the code complex to
> > > > > understand. Isn't there any other way to achieve it? One naïve and
> > > > > inefficient way could be to just restart the worker, probably after
> > > > > updating its retention_active flag. I am not suggesting making this
> > > > > change; it is just a brainstorming point.
> > > >
> > > > I'll keep thinking about it, and for now, I've added a comment
> > > > mentioning that rebooting is a simpler solution.
> > >
> > > Say we have an LWLock, LogicalRepRetentionLock. We acquire it in
> > > SHARED mode as soon as we encounter the first subscription with
> > > retain_dead_tuples set during the traversal of the sublist, and we
> > > release it only after updating xmin outside the sublist traversal. We
> > > then acquire it in EXCLUSIVE mode to resume the retention in the
> > > worker for the period till we fetch the slot's xmin.
> > >
> > > This will close the above race condition, but acquiring an LWLock
> > > while traversing the subscription list is not advisable, as that will
> > > make the traversal uninterruptible. The other possibility is to use
> > > some heavy-weight lock here, say a lock on the pg_subscription
> > > catalog, but that has the drawback that it can conflict with DDL
> > > operations. Yet another way is to invent a new lock type for this.
> > >
> > > OTOH, the simple strategy of letting the apply worker restart to
> > > resume retention will keep the handling simpler. We do something
> > > similar at the start of the apply worker if we find that some
> > > subscription parameter has changed. I think we need more opinions on
> > > this matter.
> >
> > IMHO the situation of retention being disabled and re-enabled is not a
> > common occurrence. It typically happens in specific scenarios where
> > there's a significant replication lag or the user has not configured
> > the retention timeout correctly. Because these are corner cases, I
> > believe we should avoid over-engineering a solution and simply restart
> > the worker, as Amit suggested.
>
> +1
>
> While it would be ideal if workers could initialize their
> oldest_nonremovable_xid values on the fly, I believe we can begin with
> the simple solution, given that stopping and resuming the retention of
> conflict info would not happen very often. In fact, frequent stops and
> restarts would typically be a sign that users might not be configuring
> the options properly for their systems.
> IIUC, if the workers were able to do that, we could support activating
> retain_conflict_info even for enabled subscriptions. I think we can leave
> it for future improvements if necessary.

I agree. Here is a V73 patch that will restart the worker if the retention resumes. I also addressed the other comments posted by Amit[1].

[1] https://www.postgresql.org/message-id/CAA4eK1LA7mnvKT9L8Nx_h%2B0TCvq-Ob2BGPO1bQKhkGHtoZKsow%40mail.gmail.com

Best Regards,
Hou zj