RE: Assertion failure in SnapBuildInitialSnapshot() - Mailing list pgsql-hackers
| From | Zhijie Hou (Fujitsu) |
|---|---|
| Subject | RE: Assertion failure in SnapBuildInitialSnapshot() |
| Date | |
| Msg-id | TY4PR01MB16907CA7F45898B73F084F80B94C2A@TY4PR01MB16907.jpnprd01.prod.outlook.com Whole thread Raw |
| In response to | Re: Assertion failure in SnapBuildInitialSnapshot() (Masahiko Sawada <sawada.mshk@gmail.com>) |
| Responses |
Re: Assertion failure in SnapBuildInitialSnapshot()
|
| List | pgsql-hackers |
On Thursday, October 30, 2025 7:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Oct 27, 2025 at 5:22 AM Pradeep Kumar > <spradeepkumar29@gmail.com> wrote: > > > > Hi All, > > In this thread they proposed fix_concurrent_slot_xmin_update.patch will > solve this assert failure. After applying this patch I execute > pg_sync_replication_slots() (which calls SyncReplicationSlots → > synchronize_slots() → synchronize_one_slot() → > ReplicationSlotsComputeRequiredXmin(true)) can hit an assertion failure in > ReplicationSlotsComputeRequiredXmin() because the > ReplicationSlotControlLock is not held in that code path. By default > sync_replication_slots is off, so the background slot-sync worker is not > spawned; invoking the UDF directly exercises the path without the lock. I have > a small patch that acquires ReplicationSlotControlLock in the manual sync > path; that stops the assert. > > > > > > The assert is raised inside ReplicationSlotsComputeRequiredXmin() > because that function expects either that already_locked is false (and it will > acquire what it needs), or that callers already hold both > ReplicationSlotControlLock (exclusive) and ProcArrayLock (exclusive). In the > manual-sync path called by the UDF, neither lock is held, so the assertion trips. > > > > Why this happens: > > The background slot sync worker (spawned when sync_replication_slots = > on) acquires the necessary locks before calling the routines that > update/compute slot xmins, so the worker path is safe.The manual path > through the SQL-callable UDF does not take the same locks before calling > synchronize_slots()/synchronize_one_slot(). As a result the invariant > assumed by ReplicationSlotsComputeRequiredXmin() can be violated, leading > to the assert. > > > > Proposed fix: > > In synchronize_slots() (the code path used by > SyncReplicationSlots()/pg_sync_replication_slots()), acquire > ReplicationSlotControlLock before any call that can end up calling > ReplicationSlotsComputeRequiredXmin(true). > > It would be great if we have a test case for this issue possibly using injection > points. > > Also, I think it's worth considering the idea Robert shared before[1]: > > --- > But what about just surgically preventing that? > ProcArraySetReplicationSlotXmin() could refuse to retreat the values, > perhaps? If it computes an older value than what's there, it just does nothing? > --- > > We did a similar fix for confirmed_flush LSN by commit ad5eaf390c582, and it > sounds reasonable to me that ProcArraySetReplicationSlotXmin() refuses to > retreat the values. I reviewed the thread and think that we could not straightforwardly apply a similar strategy to prevent the retreat of xmin/catalog_xmin here. This is because we maintain a central value (replication_slot_xmin/replication_slot_catalog_xmin) in ProcArraySetReplicationSlotXmin, where the value is expected to decrease when certain slots are dropped or invalidated. Therefore, I think we might need to continue with the original proposal to invert the lock and also address the code path for slotsync. > [1] > https://www.postgresql.org/message-id/CA%2BTgmoYLzJxCEa0aCan3KR7o > _25G52cbqw-90Q0VGRmV3a8XGQ%40mail.gmail.com Best Regards, Hou zj
pgsql-hackers by date: