RE: Assertion failure in SnapBuildInitialSnapshot() - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: Assertion failure in SnapBuildInitialSnapshot()
Date
Msg-id TY4PR01MB16907CA7F45898B73F084F80B94C2A@TY4PR01MB16907.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Assertion failure in SnapBuildInitialSnapshot()  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Assertion failure in SnapBuildInitialSnapshot()
List pgsql-hackers
On Thursday, October 30, 2025 7:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Mon, Oct 27, 2025 at 5:22 AM Pradeep Kumar
> <spradeepkumar29@gmail.com> wrote:
> >
> > Hi All,
> > In this thread they proposed fix_concurrent_slot_xmin_update.patch will
> solve this assert failure. After applying this patch I execute
> pg_sync_replication_slots() (which calls SyncReplicationSlots →
> synchronize_slots() → synchronize_one_slot() →
> ReplicationSlotsComputeRequiredXmin(true)) can hit an assertion failure in
> ReplicationSlotsComputeRequiredXmin() because the
> ReplicationSlotControlLock is not held in that code path. By default
> sync_replication_slots is off, so the background slot-sync worker is not
> spawned; invoking the UDF directly exercises the path without the lock. I have
> a small patch that acquires ReplicationSlotControlLock in the manual sync
> path; that stops the assert.
> >
> >
> > The assert is raised inside ReplicationSlotsComputeRequiredXmin()
> because that function expects either that already_locked is false (and it will
> acquire what it needs), or that callers already hold both
> ReplicationSlotControlLock (exclusive) and ProcArrayLock (exclusive). In the
> manual-sync path called by the UDF, neither lock is held, so the assertion trips.
> >
> > Why this happens:
> > The background slot sync worker (spawned when sync_replication_slots =
> on) acquires the necessary locks before calling the routines that
> update/compute slot xmins, so the worker path is safe.The manual path
> through the SQL-callable UDF does not take the same locks before calling
> synchronize_slots()/synchronize_one_slot(). As a result the invariant
> assumed by ReplicationSlotsComputeRequiredXmin() can be violated, leading
> to the assert.
> >
> > Proposed fix:
> > In synchronize_slots() (the code path used by
> SyncReplicationSlots()/pg_sync_replication_slots()), acquire
> ReplicationSlotControlLock before any call that can end up calling
> ReplicationSlotsComputeRequiredXmin(true).
> 
> It would be great if we have a test case for this issue possibly using injection
> points.
> 
> Also, I think it's worth considering the idea Robert shared before[1]:
> 
> ---
> But what about just surgically preventing that?
> ProcArraySetReplicationSlotXmin() could refuse to retreat the values,
> perhaps? If it computes an older value than what's there, it just does nothing?
> ---
> 
> We did a similar fix for confirmed_flush LSN by commit ad5eaf390c582, and it
> sounds reasonable to me that ProcArraySetReplicationSlotXmin() refuses to
> retreat the values.

I reviewed the thread and think that we could not straightforwardly apply a
similar strategy to prevent the retreat of xmin/catalog_xmin here. This is
because we maintain a central value
(replication_slot_xmin/replication_slot_catalog_xmin) in
ProcArraySetReplicationSlotXmin, where the value is expected to decrease when
certain slots are dropped or invalidated. Therefore, I think we might need to
continue with the original proposal to invert the lock and also address the code
path for slotsync.

> [1]
> https://www.postgresql.org/message-id/CA%2BTgmoYLzJxCEa0aCan3KR7o
> _25G52cbqw-90Q0VGRmV3a8XGQ%40mail.gmail.com

Best Regards,
Hou zj

pgsql-hackers by date:

Previous
From: Bryan Green
Date:
Subject: Re: [Patch] Windows relation extension failure at 2GB and 4GB
Next
From: Paul A Jungwirth
Date:
Subject: Re: GiST README typos