Re: Clear logical slot's 'synced' flag on promotion of standby - Mailing list pgsql-hackers

From Ashutosh Sharma
Subject Re: Clear logical slot's 'synced' flag on promotion of standby
Msg-id CAE9k0P=WXRHXLGxkegFLj9tVLrY45+uTtdgv+Pjt1mqyit4zZw@mail.gmail.com
In response to Re: Clear logical slot's 'synced' flag on promotion of standby  (Ajin Cherian <itsajin@gmail.com>)
List pgsql-hackers
Hi,

On Tue, Sep 9, 2025 at 12:53 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Tue, Sep 9, 2025 at 4:21 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > Hi,
> >
> > This is a spin-off thread from [1].
> >
> > Currently, in the slot-sync worker, we have an error scenario [2]
> > where, during slot synchronization, if we detect a slot with the same
> > name and its synced flag is set to false, we emit an error. The
> > rationale is to avoid potentially overwriting a user-created slot.
> >
> > But while analyzing [1], we observed that this error can lead to
> > inconsistent behavior during switchovers. On the first switchover, the
> > new standby logs an error: "Exiting from slot synchronization because
> > a slot with the same name already exists on the standby."   But during
> > a double switchover, this error does not occur.
> >
> > Upon re-evaluating this, it seems more appropriate to clear the synced
> > flag after promotion, as the flag does not hold any meaning on the
> > primary. Doing so would ensure consistent behavior across all
> > switchovers, as the same error will be raised avoiding the risk of
> > overwriting user's slots.
> >
> > A patch can be posted soon on the same idea.
>
> Hi Shveta,
>
> Here’s a patch that addresses this issue. It clears any “synced” flags
> on logical replication slots when a standby is promoted. I’ve also
> added handling for crashes; if the server crashes before the flags are
> cleared, they are reset on restart.
> The restart logic was a bit tricky, since I had to rely on the
> database state to decide when the reset is needed. Documentation on
> these states is sparse, but from my testing I found that
> DB_IN_CRASH_RECOVERY occurs when a standby crashes during promotion.
> That’s the state I use to trigger the flag reset on restart.
>

+ * required resources. Clear any leftover 'synced' flags on replication
+ * slots when in crash recovery on the primary. The DB_IN_CRASH_RECOVERY
+ * state check ensures that this code is only reached when a standby
+ * server crashes during promotion.
  */
  StartupReplicationSlots();
+ if (ControlFile->state == DB_IN_CRASH_RECOVERY)

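I believe a server that was never a standby can also be found in the
DB_IN_CRASH_RECOVERY state at startup. For example, if the primary
crashes, restarts into crash recovery, and then crashes again before
that recovery finishes, it will restart with the control file still
showing DB_IN_CRASH_RECOVERY, no? So this check alone does not prove
that a standby crashed during promotion.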
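
To make the concern concrete, here is a toy model of the double-crash
case. It is not PostgreSQL code: the Toy* names are hypothetical
stand-ins, and it assumes the check is evaluated against the state
found in the control file before the startup process records that
crash recovery has begun.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for two of the control-file states. */
typedef enum ToyDBState
{
	TOY_DB_IN_PRODUCTION,
	TOY_DB_IN_CRASH_RECOVERY
} ToyDBState;

/* The condition the patch tests, applied to the on-disk state. */
static bool
toy_check_fires(ToyDBState on_disk)
{
	return on_disk == TOY_DB_IN_CRASH_RECOVERY;
}

/*
 * Model one restart after a crash: evaluate the check against the state
 * left on disk (assumed ordering), then record that crash recovery has
 * started. Returns whether the check fired on this restart.
 */
static bool
toy_restart_after_crash(ToyDBState *on_disk)
{
	bool		fired = toy_check_fires(*on_disk);

	*on_disk = TOY_DB_IN_CRASH_RECOVERY;
	return fired;
}

/*
 * A server that was never a standby crashes in production, then crashes
 * again during crash recovery. Returns whether the check fires on the
 * second restart.
 */
static bool
toy_primary_double_crash(void)
{
	ToyDBState	on_disk = TOY_DB_IN_PRODUCTION;
	bool		first = toy_restart_after_crash(&on_disk);

	assert(!first);				/* first restart still sees "in production" */
	return toy_restart_after_crash(&on_disk);
}
```

In this model toy_primary_double_crash() returns true, i.e. the check
fires on a server where no promotion was ever involved.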

--

With this change, are we saying that the synced flag must always be
false on the primary? I ask because the PostgreSQL documentation for
pg_replication_slots says:

"The value of this column has no meaning on the primary server; the
column value on the primary is default false for all slots but may (if
leftover from a promoted standby) also be true."

--
With Regards,
Ashutosh Sharma.


