Re: sync_standbys_defined and pg_stat_replication - Mailing list pgsql-hackers

From Ants Aasma
Subject Re: sync_standbys_defined and pg_stat_replication
Msg-id CANwKhkMTkckNSoz0kUYFvaMDA_6g2uuABcvhoowu64e9e+op=A@mail.gmail.com
In response to sync_standbys_defined and pg_stat_replication  (Jeremy Schneider <schneider@ardentperf.com>)
Responses Re: sync_standbys_defined and pg_stat_replication
List pgsql-hackers
On Tue, 7 Oct 2025 at 08:59, Jeremy Schneider <schneider@ardentperf.com> wrote:
> For failover to work correctly, if someone changes the GUC
> synchronous_standby_names to enable sync replication, then we need to
> understand the exact moment when backends will begin to block in order
> to correctly determine when we can failover without data loss.

There is an early out in SyncRepWaitForLSN() when
WalSndCtl->sync_standbys_status has SYNC_STANDBY_DEFINED unset. That
flag gets set by the checkpointer in
SyncRepUpdateSyncStandbysDefined(), which is reached via
CheckpointWriteDelay() among other places. But CheckpointWriteDelay()
only gets there when the checkpointer is not executing a fast
checkpoint and is not behind schedule on the current checkpoint.

In other words, synchronous_standby_names will not become effective
until the checkpointer has some downtime. That is a small problem on
its own; the bigger one is that there is no way to check whether the
change has taken effect yet.
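
To make the timing issue concrete, here is a toy Python model of the
behaviour described above. It is not PostgreSQL code: the class, its
fields, and the method names are all made up for illustration, and it
only captures the one condition discussed here (no fast checkpoint and
not behind schedule).

```python
from dataclasses import dataclass


@dataclass
class ToyCheckpointer:
    """Hypothetical stand-in for the checkpointer; not PostgreSQL code."""

    # Value of synchronous_standby_names after the config reload.
    configured_names: str = ""
    # Stands in for the SYNC_STANDBY_DEFINED bit in shared memory.
    sync_standbys_defined: bool = False

    def write_delay(self, fast_checkpoint: bool, on_schedule: bool) -> None:
        """Model of the delay step: the flag is refreshed only when the
        checkpointer can afford to pause."""
        if not fast_checkpoint and on_schedule:
            self.sync_standbys_defined = self.configured_names != ""


def backend_would_wait(cp: ToyCheckpointer) -> bool:
    """Model of the early out: backends skip the wait while the flag is unset."""
    return cp.sync_standbys_defined


cp = ToyCheckpointer()
cp.configured_names = "ANY 1 (A B)"   # operator enables sync replication

cp.write_delay(fast_checkpoint=True, on_schedule=True)
print(backend_would_wait(cp))   # flag not refreshed during a fast checkpoint

cp.write_delay(fast_checkpoint=False, on_schedule=False)
print(backend_would_wait(cp))   # flag not refreshed while behind schedule

cp.write_delay(fast_checkpoint=False, on_schedule=True)
print(backend_would_wait(cp))   # flag refreshed once the checkpointer has slack
```

In this model, commits between the reload and the last call would
return without waiting for any standby, which is exactly the window a
failover tool cannot currently observe.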

For the config update getting delayed the fix seems simple - just do
the config update unconditionally. Patch attached.

For the other problem, my thinking is to provide a new function that
allows a user to check if synchronous replication is active.

Ideally this function would also give other information needed by
cluster managers. Specifically, when a replica is removed from
synchronous_standby_names we would still need to consider that
replica as a potential synchronous replica until a quorum matching the
current synchronous_standby_names setting overtakes the last LSN
confirmed by a replica matching the removed name.

To illustrate the situation where this is needed, consider s_s_n =
'ANY 1 (A B)'. While this setting is active we have to check the latest
replicated LSN from both A and B to know which one to promote. Let's
say transaction X is replicated to A and confirmed, but not yet to B.
Now A is removed so s_s_n becomes 'ANY 1 (B)'. Going by this setting
alone it would always be safe to promote B, but until B receives the
LSN that was current on the primary when synchronous_standby_names was
changed, it might not have all the data. This one is possible to work
around by checking the
relevant values from pg_stat_replication, but it would be nice to have
a neater interface.
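
As a rough illustration of that workaround, the Python sketch below
compares LSNs the way a cluster manager might after reading them from
pg_stat_replication. The function names and the sample LSN values are
made up; the only PostgreSQL-specific fact relied on is the textual
LSN format (two hexadecimal numbers separated by a slash, high 32 bits
first). It treats the k-th highest LSN among the remaining standbys as
the quorum position for 'ANY k', which is an assumption of this sketch.

```python
from typing import List, Optional


def parse_lsn(text: str) -> int:
    """Convert a textual LSN such as '0/3000060' to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def quorum_lsn(standby_lsns: List[str], k: int) -> Optional[int]:
    """Highest LSN confirmed by at least k of the given standbys,
    or None when fewer than k standbys are connected."""
    values = sorted((parse_lsn(lsn) for lsn in standby_lsns), reverse=True)
    if len(values) < k:
        return None
    return values[k - 1]


def removed_standby_still_matters(
    removed_lsn: str, remaining_lsns: List[str], k: int
) -> bool:
    """True while the removed standby may hold confirmed data that the
    current quorum has not yet caught up to."""
    current = quorum_lsn(remaining_lsns, k)
    return current is None or current < parse_lsn(removed_lsn)


# Transaction X was confirmed by A at 0/3000060; B is still at 0/3000000.
# s_s_n has just been changed from 'ANY 1 (A B)' to 'ANY 1 (B)'.
print(removed_standby_still_matters("0/3000060", ["0/3000000"], k=1))

# Once B has caught up, A no longer needs to be considered.
print(removed_standby_still_matters("0/3000060", ["0/3000060"], k=1))
```

The first call reports that A still has to be considered for
promotion, the second that it does not.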

My proposal is something like this:

postgres=# SELECT * FROM pg_sync_replication_status();
 is_active | synchronous_standby_names | has_quorum
-----------+---------------------------+------------
 t         | ANY 1 (A B)               | f
(1 row)

Thoughts?

Regards,
Ants Aasma
