Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers
From: shveta malik
Subject: Re: Conflict detection for update_deleted in logical replication
Date:
Msg-id: CAJpy0uA1espW3t7o9BMYN-fNtmpvSLMLGuM3PxNsU=z5UHsNEg@mail.gmail.com
In response to: RE: Conflict detection for update_deleted in logical replication ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>)
Responses: Re: Conflict detection for update_deleted in logical replication
List: pgsql-hackers
On Tue, Sep 16, 2025 at 10:08 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 16, 2025 11:54 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Monday, September 15, 2025 8:11 PM Amit Kapila
> > <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Sep 15, 2025 at 1:07 PM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > Thanks, the changes look good to me. I have merged them in the V75 patch.
> > > >
> > >
> > > Pushed. I see a BF failure which is not related to this commit, but to a
> > > previous commit for the work in this thread.
> > >
> > > See LOGs [1]:
> > > regress_log_035_conflicts
> > > -----------------------------------
> > > [11:16:47.604](0.015s) not ok 24 - the deleted column is removed
> > > [11:16:47.605](0.002s) # Failed test 'the deleted column is removed'
> > > # at /home/bf/bf-build/kestrel/HEAD/pgsql/src/test/subscription/t/035_conflicts.pl line 562.
> > >
> > > Then the corresponding subscriber LOG:
> > >
> > > 2025-09-15 11:16:47.600 CEST [1888170][client backend][1/13:0] INFO: vacuuming "postgres.public.tab"
> > > 2025-09-15 11:16:47.600 CEST [1888170][client backend][1/13:0] INFO: finished vacuuming "postgres.public.tab": index scans: 0
> > > pages: 0 removed, 1 remain, 1 scanned (100.00% of total), 0 eagerly scanned
> > > tuples: 0 removed, 1 remain, 0 are dead but not yet removable
> > > tuples missed: 1 dead from 1 pages not removed due to cleanup lock contention
> > > removable cutoff: 787, which was 0 XIDs old when operation ended
> > > ...
> > >
> > > This indicates that VACUUM is not able to remove the row even after the
> > > slot is advanced, because some other background process holds a lock/pin
> > > on the page. I think that is possible because the page was dirty due to
> > > the apply of the update operation, and the bgwriter/checkpointer could
> > > try to write such a page.
> > >
> > > I'll analyze more tomorrow and share if I have any new findings.
> >
> > I agree with the analysis. I attempted to delete a tuple from a table
> > and, while executing VACUUM (VERBOSE) on this table, I executed a
> > checkpoint concurrently. Using the debugger, I stopped in SyncOneBuffer()
> > after it acquired the page lock. This allowed me to reproduce the same
> > log where the deleted row could not be removed:
> >
> > --
> > tuples: 0 removed, 1 remain, 0 are dead but not yet removable
> > tuples missed: 1 dead from 1 pages not removed due to cleanup lock contention
> > --
> >
> > I think we can remove the VACUUM for removing the deleted column. We
> > have already confirmed that the replication slot's xmin has advanced,
> > which should be sufficient to prove that the feature works correctly.
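To make the race concrete, here is a minimal SQL sketch of the above
reproduction, assuming a one-row table "tab" as in the test log (the column
name and value are illustrative):

-- session 1: create a dead tuple
DELETE FROM tab WHERE a = 1;

-- session 2: start a checkpoint, with a debugger breakpoint set inside
-- SyncOneBuffer() so the checkpointer keeps holding its pin on the
-- table's page
CHECKPOINT;

-- session 1: VACUUM needs a cleanup lock (i.e. it must be the sole
-- pinner of the page) to prune dead tuples; the checkpointer's pin
-- blocks that, so the tuple is reported as missed due to cleanup lock
-- contention instead of being removed
VACUUM (VERBOSE) tab;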
>
> Apart from the above fix, I noticed another BF failure[1].
>
> --
> timed out waiting for match: (?^:logical replication worker for subscription
> "tap_sub_a_b" will resume retaining the information for detecting conflicts
> --
>
> It is clear from the log[2] that the apply worker resumes retention
> immediately after the synchronized_standby_slots configuration is removed,
> but before max_retention_duration is set to 0. We expected resumption to
> occur only after adjusting max_retention_duration to 0, and thus overlooked
> the log. To ensure stability, we should postpone the removal of
> synchronized_standby_slots until after setting max_retention_duration to 0.
>
> I can reproduce it locally by adding "sleep 10;" after resetting the
> synchronized_standby_slots GUC and before the resume test.
>
> I updated the patch to fix this as well.

Thank you for the patch. The fix looks good. Shall we update the comment:

+# Drop the physical slot and reset the synchronized_standby_slots setting. We
+# change this after setting max_retention_duration to 0, ensuring the apply
+# worker does not resume prematurely without noticing the updated
+# max_retention_duration value.

to:

Drop the physical slot and reset the synchronized_standby_slots setting. This
adjustment is made after setting max_retention_duration to 0, ensuring
consistent results in the test case, as resumption becomes possible
immediately after resetting synchronized_standby_slots due to the smaller
max_retention_duration value of 1ms used previously.

thanks
Shveta
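PS: For reference, a rough SQL sketch of the reordered teardown described
above. The physical slot name is illustrative, and max_retention_duration is
the option introduced by this patch series, so the exact syntax may differ
from what ends up committed:

-- 1. First set max_retention_duration to 0, so that any subsequent
--    resumption happens with the updated value already in effect
ALTER SUBSCRIPTION tap_sub_a_b SET (max_retention_duration = 0);

-- 2. Only then drop the physical slot and reset
--    synchronized_standby_slots; the resumption this enables is now
--    observed deterministically by the test
SELECT pg_drop_replication_slot('sb_slot');
ALTER SYSTEM RESET synchronized_standby_slots;
SELECT pg_reload_conf();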