Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From shveta malik
Subject Re: Conflict detection for update_deleted in logical replication
Msg-id CAJpy0uA1espW3t7o9BMYN-fNtmpvSLMLGuM3PxNsU=z5UHsNEg@mail.gmail.com
In response to RE: Conflict detection for update_deleted in logical replication  ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>)
List pgsql-hackers
On Tue, Sep 16, 2025 at 10:08 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 16, 2025 11:54 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> >
> > On Monday, September 15, 2025 8:11 PM Amit Kapila
> > <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Sep 15, 2025 at 1:07 PM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > Thanks, the changes look good to me. I have merged them in V75 patch.
> > > >
> > >
> > > Pushed. I see a BF failure which is not related to this commit but to a
> > > previous commit for the work in this thread.
> > >
> > > See LOGs [1]:
> > > regress_log_035_conflicts
> > > -----------------------------------
> > > [11:16:47.604](0.015s) not ok 24 - the deleted column is removed
> > > [11:16:47.605](0.002s) #   Failed test 'the deleted column is removed'
> > > #   at /home/bf/bf-build/kestrel/HEAD/pgsql/src/test/subscription/t/035_conflicts.pl line 562.
> > >
> > > Then the corresponding subscriber LOG:
> > >
> > > 2025-09-15 11:16:47.600 CEST [1888170][client backend][1/13:0] INFO:  vacuuming "postgres.public.tab"
> > > 2025-09-15 11:16:47.600 CEST [1888170][client backend][1/13:0] INFO:  finished vacuuming "postgres.public.tab": index scans: 0
> > > pages: 0 removed, 1 remain, 1 scanned (100.00% of total), 0 eagerly scanned
> > > tuples: 0 removed, 1 remain, 0 are dead but not yet removable
> > > tuples missed: 1 dead from 1 pages not removed due to cleanup lock contention
> > > removable cutoff: 787, which was 0 XIDs old when operation ended ...
> > >
> > > This indicates that the VACUUM is not able to remove the row even
> > > after the slot is advanced, because some other background process holds
> > > a lock/pin on the page. I think that is possible because the page was
> > > dirty due to the apply of an update operation, and bgwriter/checkpointer
> > > could try to write such a page.
> > >
> > > I'll analyze more tomorrow and share if I have any new findings.
> >
> > I agree with the analysis. I attempted to delete a tuple from a table
> > and, while executing VACUUM (VERBOSE) on this table, I executed a
> > checkpoint concurrently. Using the debugger, I stopped in SyncOneBuffer()
> > after it had acquired the buffer for the page. This allowed me to
> > reproduce the same log where the deleted row could not be removed:
> >
> > --
> > tuples: 0 removed, 1 remain, 0 are dead but not yet removable
> > tuples missed: 1 dead from 1 pages not removed due to cleanup lock contention
> > --
> >
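For readers following along, the reproduction boils down to something like
the TAP-style sketch below. The node handle, table name, and the manual gdb
step are illustrative assumptions on my part, not part of any committed test:

use PostgreSQL::Test::Cluster;
use Test::More;

my $node = PostgreSQL::Test::Cluster->new('repro');
$node->init;
$node->start;

$node->safe_psql('postgres', 'CREATE TABLE vac_test (a int)');
$node->safe_psql('postgres', 'INSERT INTO vac_test VALUES (1)');
$node->safe_psql('postgres', 'DELETE FROM vac_test');

# Manual step (not scriptable here): attach gdb to the checkpointer, set a
# breakpoint inside SyncOneBuffer() after it has pinned the heap page's
# buffer, and issue CHECKPOINT from another session to hit the breakpoint.

# With the buffer still pinned elsewhere, VACUUM cannot take the cleanup
# lock and reports the dead tuple as missed, as in the log above:
my ($ret, $out, $err) = $node->psql('postgres', 'VACUUM (VERBOSE) vac_test');
note($err);    # VACUUM (VERBOSE) emits its INFO lines on stderr
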
> > I think we can remove the VACUUM used to verify that the deleted column
> > is removed. We have already confirmed that the replication slot's xmin
> > has advanced, which should be sufficient to prove that the feature works
> > correctly.
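
As a cross-check, the xmin advancement can also be asserted directly. Here
is a minimal sketch in the style of 035_conflicts.pl, where the $subscriber
handle and the conflict-detection slot name 'pg_conflict_detection' are
assumptions taken from this patch series:

my $oldest_xmin = $subscriber->safe_psql('postgres',
	"SELECT xmin FROM pg_replication_slots WHERE slot_name = 'pg_conflict_detection'");

# Waiting for the slot's xmin to move past the captured value shows the
# retention machinery is advancing, without depending on VACUUM winning
# the cleanup-lock race against bgwriter/checkpointer.
$subscriber->poll_query_until('postgres',
	"SELECT xmin::text <> '$oldest_xmin' FROM pg_replication_slots WHERE slot_name = 'pg_conflict_detection'")
  or die "timed out waiting for the conflict-detection slot xmin to advance";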
>
> Apart from the above fix, I noticed another BF failure[1].
>
> --
> timed out waiting for match: (?^:logical replication worker for subscription "tap_sub_a_b" will resume retaining the information for detecting conflicts
> --
>
> It is clear from the log[2] that the apply worker resumes retention
> immediately after the synchronized_standby_slots configuration is removed,
> but before max_retention_duration is set to 0. We expected resumption to
> occur only after adjusting max_retention_duration to 0, and thus the test
> missed the expected log message. To ensure stability, we should postpone
> the removal of synchronized_standby_slots until after
> max_retention_duration has been set to 0.
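
The reordering amounts to something like the sketch below. The node handles
and names come from my reading of 035_conflicts.pl, and I am assuming
max_retention_duration is the subscription option from this patch series;
treat the specifics as illustrative:

# Set max_retention_duration to 0 before touching the publisher-side GUC,
# so the worker already has the updated value once resumption is possible.
$subscriber->safe_psql('postgres',
	"ALTER SUBSCRIPTION tap_sub_a_b SET (max_retention_duration = 0)");

# Only afterwards drop the physical slot and reset
# synchronized_standby_slots; resumption becomes possible here, and the
# test can now reliably wait for the "will resume retaining" message.
$publisher->safe_psql('postgres',
	"SELECT pg_drop_replication_slot('standby_slot')");
$publisher->append_conf('postgresql.conf', "synchronized_standby_slots = ''");
$publisher->reload;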
>
> I can reproduce it locally by adding "sleep 10;" after resetting the
> synchronized_standby_slots GUC and before the resume test.
>
> I updated the patch to fix this as well.
>

Thank you for the patch. The fix looks good.

Shall we update the comment:
+# Drop the physical slot and reset the synchronized_standby_slots setting. We
+# change this after setting max_retention_duration to 0, ensuring the apply
+# worker does not resume prematurely without noticing the updated
+# max_retention_duration value.

to:
Drop the physical slot and reset the synchronized_standby_slots
setting. This adjustment is made after setting max_retention_duration
to 0 to ensure consistent results in the test case, because resumption
would otherwise become possible immediately after resetting
synchronized_standby_slots, owing to the small max_retention_duration
value of 1ms configured previously.

thanks
Shveta


