Re: Issue with logical replication slot during switchover - Mailing list pgsql-hackers
From: shveta malik
Subject: Re: Issue with logical replication slot during switchover
Msg-id: CAJpy0uD9b-d28yS38mxF4ddDhoP9hX9FtP=pt33bvaD6_wVrng@mail.gmail.com
In response to: Issue with logical replication slot during switchover (Fabrice Chapuis <fabrice636861@gmail.com>)
List: pgsql-hackers
On Thu, Aug 7, 2025 at 6:50 PM Fabrice Chapuis <fabrice636861@gmail.com> wrote:
>
> Hi,
>
> An issue occurred during the initial switchover using PostgreSQL version 17.5.
> The setup consists of a cluster with two nodes, managed by Patroni version
> 4.0.5. Logical replication is configured on the same instance, and the new
> feature enabling logical replication slots to be failover-safe in a highly
> available environment is used. Logical slot management is currently disabled
> in Patroni.
>
> Following are some screens captured during the switchover.
>
> 1. Run the switchover with Patroni:
>
> patronictl switchover
>
> Current cluster topology:
>
> + Cluster: ClusterX (7529893278186104053) ------+----+-----------+
> | Member   | Host         | Role    | State     | TL | Lag in MB |
> +----------+--------------+---------+-----------+----+-----------+
> | node_1   | xxxxxxxxxxxx | Leader  | running   |  4 |           |
> | node_2   | xxxxxxxxxxxx | Replica | streaming |  4 |         0 |
> +----------+--------------+---------+-----------+----+-----------+
>
> 2. Check the slot on the new primary:
>
> select * from pg_replication_slots where slot_type = 'logical';
>
> +-[ RECORD 1 ]--------+----------------+
> | slot_name           | logical_slot   |
> | plugin              | pgoutput       |
> | slot_type           | logical        |
> | datoid              | 25605          |
> | database            | db_test        |
> | temporary           | f              |
> | active              | t              |
> | active_pid          | 3841546        |
> | xmin                |                |
> | catalog_xmin        | 10399          |
> | restart_lsn         | 0/37002410     |
> | confirmed_flush_lsn | 0/37002448     |
> | wal_status          | reserved       |
> | safe_wal_size       |                |
> | two_phase           | f              |
> | inactive_since      |                |
> | conflicting         | f              |
> | invalidation_reason |                |
> | failover            | t              |
> | synced              | t              |
> +---------------------+----------------+
>
> Logical replication is active again after the promote.
>
> 3. Check the slot on the new standby:
>
> select * from pg_replication_slots where slot_type = 'logical';
>
> +-[ RECORD 1 ]--------+-------------------------------+
> | slot_name           | logical_slot                  |
> | plugin              | pgoutput                      |
> | slot_type           | logical                       |
> | datoid              | 25605                         |
> | database            | db_test                       |
> | temporary           | f                             |
> | active              | f                             |
> | active_pid          |                               |
> | xmin                |                               |
> | catalog_xmin        | 10397                         |
> | restart_lsn         | 0/3638F5F0                    |
> | confirmed_flush_lsn | 0/3638F6A0                    |
> | wal_status          | reserved                      |
> | safe_wal_size       |                               |
> | two_phase           | f                             |
> | inactive_since      | 2025-08-05 10:21:03.342587+02 |
> | conflicting         | f                             |
> | invalidation_reason |                               |
> | failover            | t                             |
> | synced              | f                             |
> +---------------------+-------------------------------+
>
> The synced flag keeps the value false.
>
> The following error appears in the log:
>
> 2025-06-10 16:40:58.996 CEST [739829]: [1-1] user=,db=,client=,application= LOG:  slot sync worker started
> 2025-06-10 16:40:59.011 CEST [739829]: [2-1] user=,db=,client=,application= ERROR:  exiting from slot synchronization because same name slot "logical_slot" already exists on the standby
>
> I would like to make a proposal to address the issue:
> Since the logical slot is in a failover state on both the primary and the
> standby, an attempt could be made to resynchronize them.
> I modified the slotsync.c module:
>
> +++ b/src/backend/replication/logical/slotsync.c
> @@ -649,24 +649,46 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
>
>  		return false;
>  	}
> -
> -	/* Search for the named slot */
> +	/* Both local and remote slot have the same name */
>  	if ((slot = SearchNamedReplicationSlot(remote_slot->name, true)))
>  	{
>  		bool		synced;
> +		bool		failover_status = remote_slot->failover;
>
>  		SpinLockAcquire(&slot->mutex);
>  		synced = slot->data.synced;
>  		SpinLockRelease(&slot->mutex);
> +
> +		if (!synced)
> +		{
> +			Assert(!MyReplicationSlot);
> +
> +			if (failover_status)
> +			{
> +				ReplicationSlotAcquire(remote_slot->name, true, true);
> +
> +				/* Set the synced flag to attempt resynchronizing the failover slot on the standby */
> +				MyReplicationSlot->data.synced = true;
> +
> +				ReplicationSlotMarkDirty();
>
> -		/* User-created slot with the same name exists, raise ERROR. */
> -		if (!synced)
> -			ereport(ERROR,
> +				ReplicationSlotRelease();
> +
> +				ereport(WARNING,
> +						errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +						errmsg("slot \"%s\" local slot has the same name as remote slot and they are in failover mode, try to synchronize them",
> +							   remote_slot->name));
> +				return false;	/* Go back to the main loop after marking the failover slot as synced */
> +			}
> +			else
> +				/* User-created slot with the same name exists, raise ERROR. */
> +				ereport(ERROR,
> 						errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> 						errmsg("exiting from slot synchronization because same"
> -							   " name slot \"%s\" already exists on the standby",
> +							   " name slot \"%s\" already exists on the standby",
> 							   remote_slot->name));
> +		}
>
>  		/*
>  		 * The slot has been synchronized before.
>  		 *
>
> This message follows the discussions started in this thread:
> https://www.postgresql.org/message-id/CAA5-nLDvnqGtBsKu4T_s-cS%2BdGbpSLEzRwgep1XfYzGhQ4o65A%40mail.gmail.com
>
> Help would be appreciated to move this point forward.

Thank you for starting a new thread and working on it. I think you have given a reference to the wrong thread; the correct one is [1].

IIUC, the proposed fix checks whether remote_slot is failover-enabled and a slot with the same name exists locally but has 'synced' = false; if so, it enables the 'synced' flag and proceeds with synchronization from the next cycle onward, else it errors out. But the remote slot's failover will always be true if we have reached this stage. Did you actually mean to check whether the local slot is failover-enabled but has 'synced' set to false (indicating it is a new standby after a switchover)? Even with that check, it might not be correct to overwrite such a slot internally.

I think in [1] we discussed a couple of ideas related to a GUC, an alter API, and a drop_all_slots API, but I don't see any of those proposed here. Do we want to try any of them?

[1]: https://www.postgresql.org/message-id/flat/CAA5-nLD0vKn6T1-OHROBNfN2Pxa17zVo4UoVBdfHn2y%3D7nKixA%40mail.gmail.com

thanks
Shveta
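[Editor's note: for readers following the thread, the decision flow under debate can be modeled outside the server. The sketch below is a hypothetical Python model, not PostgreSQL code; the names `decide`, `SyncAction`, and the boolean parameters are invented for illustration. It captures the conflict handling in `synchronize_one_slot()` as discussed above: stock PostgreSQL errors out when a same-named local slot exists with `synced = false`, while the proposed patch would instead adopt such a slot when it is failover-enabled (checking the local flag, per the reply's interpretation).]

```python
from enum import Enum


class SyncAction(Enum):
    CREATE = "no local slot: create a local copy of the remote slot"
    SYNC = "slot was synced before: proceed with synchronization"
    ADOPT = "patched behavior: mark local slot as synced, retry next cycle"
    ERROR = "stock behavior: treat as user-created slot and error out"


def decide(local_exists: bool, local_synced: bool,
           local_failover: bool) -> SyncAction:
    """Illustrative model of the sync worker's conflict handling.

    Stock PG 17: a pre-existing local slot not created by the sync
    worker (synced == false) is assumed user-created -> ERROR.
    Proposed patch: if that slot is failover-enabled (the switchover
    case, where it is the old primary's own slot), adopt it instead.
    """
    if not local_exists:
        return SyncAction.CREATE
    if local_synced:
        return SyncAction.SYNC
    if local_failover:          # the patch's additional branch
        return SyncAction.ADOPT
    return SyncAction.ERROR


# The reported switchover scenario: the new standby still has the old
# primary's slot with failover = true but synced = false.
print(decide(local_exists=True, local_synced=False, local_failover=True))
```

Under this model the reported scenario maps to the ADOPT branch, whereas stock code takes ERROR; the open question in the thread is whether adopting (overwriting) such a slot internally is safe at all.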