Re: failover logical replication slots - Mailing list pgsql-hackers

From Fabrice Chapuis
Subject Re: failover logical replication slots
Date
Msg-id CAA5-nLCojwRhu5Xmv66wNRC+Q_X-_KESyeihbAfQCVj8ZS1U4A@mail.gmail.com
In response to Re: failover logical replication slots  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
Thanks for the reply Amit, 

I don't really understand the logic of the implementation. If a slot on the standby has the same name as a slot on the primary, and that slot has failover enabled, how could the standby slot be any different?
After the first failover, subsequent failovers will work, because the synced flag is then true on both the primary and standby slots.

After a new standby is attached to the primary, could the slot sync worker, when it starts, check whether a failover slot already exists on the standby and, if so, drop it before recreating a new one for syncing?

Regards,

Fabrice



On Thu, Jun 12, 2025 at 5:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jun 11, 2025 at 10:17 PM Fabrice Chapuis
<fabrice636861@gmail.com> wrote:
>
> Thanks for your reply.
> The problem I see is that after creating a new subscription, we have:
>
> 1) if a failover occurs, on the new primary node, the failover and sync flags are both set to true, so there's no problem.
>
> 2) when the old node returns as a secondary in the cluster, the failover flag is set to true and the sync flag is set to false then
> the error message is generated:  ERROR: exiting from slot synchronization because same name slot "sub_test" already exists on the standby
>
> Why not change the value of the synced flag when the standby joins the cluster? If the slot on the primary node has the same name as the slot on the secondary node and the failover flag is set to true, something like:
>
> if ((slot = SearchNamedReplicationSlot(remote_slot->name, true))) {
>     slot->data.synced = true;
> ...

IIUC, Hou-san also mentioned the same idea, but it is not that
straightforward because the user may have created a logical slot with
the same name but with a few other different properties like
two_phase, slot_type, etc. I think we can try to compare all such slot
properties to ensure that we can overwrite the same name slot, but
there is still a chance that we may overwrite a slot that the user has
created for some other purpose. Now, we may want to extend this
functionality by giving the user some knob that allows us to
overwrite existing slots with the same name. The user can then set this
knob (a GUC or something else) when starting the node as a standby after
a switchover, to allow existing slots to be overwritten.
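
To make the comparison idea concrete, here is a minimal standalone sketch,
not PostgreSQL code. SlotInfo, slot_properties_match(),
can_reuse_local_slot() and the allow_overwrite_existing_slots knob are all
hypothetical names made up for illustration, and the set of properties
compared is only an assumption about what would matter.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified view of a slot; NOT PostgreSQL's ReplicationSlot. */
typedef struct SlotInfo
{
    const char *name;
    const char *plugin;
    const char *database;
    bool        two_phase;
    bool        failover;
    bool        synced;
} SlotInfo;

/* Hypothetical knob standing in for the GUC (or other setting) discussed above. */
static bool allow_overwrite_existing_slots = false;

/* True only if the local slot looks like the same slot as the remote one. */
static bool
slot_properties_match(const SlotInfo *local, const SlotInfo *remote)
{
    return strcmp(local->name, remote->name) == 0 &&
           strcmp(local->plugin, remote->plugin) == 0 &&
           strcmp(local->database, remote->database) == 0 &&
           local->two_phase == remote->two_phase &&
           local->failover && remote->failover;
}

/*
 * Decide whether slot sync may take over an existing local slot that has
 * the same name as a failover slot on the primary.
 */
static bool
can_reuse_local_slot(const SlotInfo *local, const SlotInfo *remote)
{
    if (local->synced)
        return true;            /* already a synced slot, nothing to decide */
    if (!allow_overwrite_existing_slots)
        return false;           /* default: refuse, as slot sync does today */
    return slot_properties_match(local, remote);
}
```

Even with every property matching, this cannot prove that the user did not
create the slot for another purpose, which is why the sketch refuses unless
the knob is explicitly enabled.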

As mentioned by Hou-San and Dilip, I also think it is more important
for the old node that comes as a standby to remove logical slots to
avoid WAL accumulation. For example, we can provide a function like
pg_drop_all_slots() with a type parameter indicating logical or
physical, and then utilities like Patroni that provide switchover
functionality can use that function to remove all existing slots
(maybe keep the slots that are required for failover) when starting
the node as a standby.
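
As a rough illustration of the selection logic such a function might use,
here is a standalone sketch over an in-memory array. pg_drop_all_slots() is
only a proposal in this thread and does not exist; SlotEntry,
drop_all_slots() and the keep_failover parameter are hypothetical names, and
"dropping" here just marks the entry.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum SlotKind
{
    SLOT_PHYSICAL,
    SLOT_LOGICAL
} SlotKind;

/* Hypothetical in-memory model of a slot; not a PostgreSQL structure. */
typedef struct SlotEntry
{
    const char *name;
    SlotKind    kind;
    bool        failover;
    bool        dropped;
} SlotEntry;

/*
 * Mark every slot of the requested kind as dropped, optionally keeping
 * slots that have failover enabled. Returns the number of slots dropped
 * by this call.
 */
static size_t
drop_all_slots(SlotEntry *slots, size_t nslots, SlotKind kind,
               bool keep_failover)
{
    size_t      ndropped = 0;

    for (size_t i = 0; i < nslots; i++)
    {
        if (slots[i].dropped || slots[i].kind != kind)
            continue;           /* already gone, or not the requested type */
        if (keep_failover && slots[i].failover)
            continue;           /* still needed for failover */
        slots[i].dropped = true;
        ndropped++;
    }
    return ndropped;
}
```

A switchover tool would call something like this for logical slots with
keep_failover set when it restarts the old primary as a standby, so that
leftover slots stop holding back WAL.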

--
With Regards,
Amit Kapila.
