Thread: Small fixes needed by high-availability tools

Small fixes needed by high-availability tools

From
Andrey Borodin
Date:
Hi hackers!

I want to revive attempts to fix some old edge cases of physical quorum replication.

Please find attached draft patches that demonstrate the ideas. These patches are not actually proposed code changes; I
rather want to reach a design consensus first.

1. Allow checking standby sync before making data visible after crash recovery

Problem: a Postgres instance must not allow reading data that is not yet known to be replicated.
Immediately after a crash we do not know whether we are still the cluster primary. We can disallow new
connections until a standby quorum is established. Of course, walsenders and superusers must be exempt from this
restriction.

The key change is the following:
@@ -1214,6 +1215,16 @@ InitPostgres(const char *in_dbname, Oid dboid,
     if (PostAuthDelay > 0)
         pg_usleep(PostAuthDelay * 1000000L);

+    /* Check if we need to wait for startup synchronous replication */
+    if (!am_walsender &&
+        !superuser() &&
+        !StartupSyncRepEstablished())
+    {
+        ereport(FATAL,
+                (errcode(ERRCODE_CANNOT_CONNECT_NOW),
+                 errmsg("cannot connect until synchronous replication is established with standbys according to startup_synchronous_standby_level")));
+    }

We might also want some kind of cache indicating that quorum was already established. Also, the place where the check is
done might not be the most appropriate.
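
To illustrate the caching idea, here is a minimal sketch. Everything except the existing GetFlushRecPtr() and the atomics API is hypothetical and not part of the attached patches:

/* Hypothetical shared state; could live next to SyncRepCtl. */
typedef struct StartupSyncRepShared
{
    pg_atomic_uint32 quorum_established;    /* 0 = unknown, 1 = confirmed */
} StartupSyncRepShared;

static StartupSyncRepShared *startup_syncrep = NULL;

/*
 * Sketch: return true once the configured standby quorum has confirmed
 * the end-of-recovery flush LSN.  The atomic flag caches a positive
 * answer so that subsequent backends skip the quorum probe entirely.
 */
bool
StartupSyncRepEstablished(void)
{
    if (pg_atomic_read_u32(&startup_syncrep->quorum_established) == 1)
        return true;

    /* SyncRepQuorumConfirmsLSN() is a hypothetical probe of walsender state */
    if (SyncRepQuorumConfirmsLSN(GetFlushRecPtr(NULL)))
    {
        pg_atomic_write_u32(&startup_syncrep->quorum_established, 1);
        return true;
    }

    return false;
}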

2. Do not allow cancelling a locally written transaction

The problem has been discussed many times [0,1,2,3], with some agreement on the approach taken. But there were concerns
that the solution is incomplete without the first patch in the current thread.

Problem: a user might try to cancel a locally committed transaction, and if we do so we will show non-replicated data as
committed. This leads to losing data with UPSERTs.

The key change is how we process cancels in SyncRepWaitForLSN().
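
For reference, today SyncRepWaitForLSN() responds to a cancel by emitting a WARNING and releasing the backend, which makes the locally committed transaction visible. The draft changes the wait loop roughly along these lines (a sketch of the idea, not the literal patch):

    /*
     * Sketch of the modified wait loop in SyncRepWaitForLSN(): a cancel
     * no longer releases the backend, because returning here would make
     * a locally committed but non-replicated transaction visible to the
     * client and to subsequent snapshots.
     */
    if (QueryCancelPending)
    {
        QueryCancelPending = false;
        ereport(WARNING,
                (errmsg("canceling the wait for synchronous replication is not allowed"),
                 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
        /* fall through and keep waiting instead of breaking out */
    }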

3. Allow reading LSN written by walreceiver, but not flushed yet

Problem: if we have synchronous_standby_names = ANY(node1,node2), node2 might be ahead of node1 by flush LSN, but
behind by written LSN. If we do a failover we choose node2 instead of node1 and lose data recently committed with
synchronous_commit=remote_write.

Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact returns the flushed LSN, not the written one.
I propose to add a new function which returns the LSN actually written. The internals of this function are already
implemented (GetWalRcvWriteRecPtr()), but unused.
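
A sketch of such a function, modeled on the existing pg_last_wal_receive_lsn(); the SQL-level name is only a placeholder, and a real patch would also need a pg_proc.dat entry:

#include "postgres.h"

#include "fmgr.h"
#include "replication/walreceiver.h"
#include "utils/pg_lsn.h"

/*
 * Placeholder name: returns the LSN written (but not necessarily
 * flushed) by walreceiver, i.e. GetWalRcvWriteRecPtr().
 */
Datum
pg_last_wal_write_lsn(PG_FUNCTION_ARGS)
{
    XLogRecPtr  recptr = GetWalRcvWriteRecPtr();

    if (recptr == 0)
        PG_RETURN_NULL();

    PG_RETURN_LSN(recptr);
}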

Currently we use a separate program, lwaldump [4], which simply reads WAL until the last valid record. In case of
failover, pg_consul uses LSNs from lwaldump. This approach works well, but is cumbersome.


There are other caveats of replication, but IMO these three problems are the most annoying in terms of data durability.

I'd greatly appreciate any thoughts on this.


Best regards, Andrey Borodin.


[0] https://www.postgresql.org/message-id/flat/C1F7905E-5DB2-497D-ABCC-E14D4DEE506C%40yandex-team.ru
[1] https://www.postgresql.org/message-id/flat/CAEET0ZHG5oFF7iEcbY6TZadh1mosLmfz1HLm311P9VOt7Z+jeg@mail.gmail.com
[2]
https://www.postgresql.org/message-id/flat/6a052e81060824a8286148b1165bafedbd7c86cd.camel@j-davis.com#415dc2f7d41b8a251b419256407bb64d
[3] https://www.postgresql.org/message-id/flat/CALj2ACUrOB59QaE6%3DjF2cFAyv1MR7fzD8tr4YM5%2BOwEYG1SNzA%40mail.gmail.com
[4] https://github.com/g0djan/lwaldump



Re: Small fixes needed by high-availability tools

From
Matthias van de Meent
Date:
On Fri, 2 May 2025 at 15:00, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi hackers!
>
> I want to revive attempts to fix some old edge cases of physical quorum replication.
>
> Please find attached draft patches that demonstrate the ideas. These patches are not actually proposed code changes; I
> rather want to reach a design consensus first.

[...]
> 2. Do not allow cancelling a locally written transaction
>
> The problem has been discussed many times [0,1,2,3], with some agreement on the approach taken. But there were concerns
> that the solution is incomplete without the first patch in the current thread.


I'm trying to figure out where in the thread you find this "some
agreement". Could you reference the posts you're referring to?

> Problem: a user might try to cancel a locally committed transaction, and if we do so we will show non-replicated data as
> committed. This leads to losing data with UPSERTs.


Could you explain why specifically UPSERTs would lose data (vs any
other user workload) in cancellations during SyncRepWaitForLSN?

> The key change is how we process cancels in SyncRepWaitForLSN().

I personally think we should rather move to CSN-based snapshots on
both primary and replica (with LSN as CSN), and make visibility of
other transactions depend on how much persistence your session wants
(SQL spec permitting, of course).

I.e., if you have synchronous_commit=remote_apply, you wait with
sending the commit success message until you have confirmation that
your commit LSN has been applied on the configured number of replicas,
and snapshots are taken based on the latest LSN that is known to be
applied everywhere, but if you have synchronous_commit=off, you could
read the commits (even those committed in s_c=remote_apply sessions)
immediately after they've been included in the logs (potentially with
some added slack to account for system state updates).
Similarly, all snapshots you create in a backend with
synchronous_commit=remote_apply would use the highest LSN which is
remotely applied according to the applicable rules, while
synchronous_commit=off implies "all transactions which have been
logged as committed".
Changing synchronous_commit to a value that requires higher
persistence level would cause the backend to wait for its newest
snapshot LSN to reach that persistence level; IMO an acceptable
trade-off for switching s_c at runtime.
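
A rough sketch of the rule I have in mind; everything here is hypothetical, assumes LSN-as-CSN snapshots exist, and the *_lsn inputs would be maintained from walsender feedback (only the SYNCHRONOUS_COMMIT_* levels are PostgreSQL's existing enum):

/*
 * Sketch only: pick a backend's snapshot horizon from its
 * synchronous_commit setting.  Commits with an LSN above the returned
 * horizon are treated as still in progress.
 */
static XLogRecPtr
SnapshotHorizonLSN(int sync_commit, XLogRecPtr remote_apply_lsn,
                   XLogRecPtr remote_flush_lsn, XLogRecPtr local_flush_lsn,
                   XLogRecPtr logged_lsn)
{
    switch (sync_commit)
    {
        case SYNCHRONOUS_COMMIT_REMOTE_APPLY:
            return remote_apply_lsn;    /* only remotely applied commits */
        case SYNCHRONOUS_COMMIT_REMOTE_FLUSH:
            return remote_flush_lsn;    /* only remotely flushed commits */
        case SYNCHRONOUS_COMMIT_LOCAL_FLUSH:
            return local_flush_lsn;     /* locally durable commits */
        case SYNCHRONOUS_COMMIT_OFF:
        default:                        /* remote_write omitted for brevity */
            return logged_lsn;          /* anything logged as committed */
    }
}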

This is based on the assumption that if you don't want your commit to
be very durable, you probably also don't care as much about the
durability of the data you can see, and if you want your commits to be
very durable, you probably want to see only very durable data.

This would also unify the commit visibility order between primary and
secondary nodes, and would allow users to have session-level 'wait for
LSN x to be persistent' with much reduced lock times.

(CC-ed to Ants, given his interest in this topic)

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Small fixes needed by high-availability tools

From
Amit Kapila
Date:
On Fri, May 2, 2025 at 6:30 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> 3. Allow reading LSN written by walreceiver, but not flushed yet
>
> Problem: if we have synchronous_standby_names = ANY(node1,node2), node2 might be ahead of node1 by flush LSN, but
> behind by written LSN. If we do a failover we choose node2 instead of node1 and lose data recently committed with
> synchronous_commit=remote_write.
>

In that case, can we rely on written WAL that is not yet flushed?
Say you decide based on written WAL and choose node-1 in the above
case for failover; what if it restarts without flushing the written
WAL?

> Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact returns the flushed LSN, not the written
> one. I propose to add a new function which returns the LSN actually written. The internals of this function are already
> implemented (GetWalRcvWriteRecPtr()), but unused.
>

It seems to me that this is less controversial than your other two
proposals. So, we can discuss this in a separate thread as well.

--
With Regards,
Amit Kapila.



Re: Small fixes needed by high-availability tools

From
Ants Aasma
Date:
Hi, dropping in my 2 cents here.

On Mon, 12 May 2025 at 18:42, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> >> Problem: a user might try to cancel a locally committed transaction, and if we do so we will show non-replicated data
> >> as committed. This leads to losing data with UPSERTs.

> >
> > Could you explain why specifically UPSERTs would lose data (vs any
> > other user workload) in cancellations during SyncRepWaitForLSN?
>
> Upserts change data conditionally. That's where the observed effect affects written data. But the root problem is
> observing non-replicated data; it only becomes obvious when issuing "INSERT ON CONFLICT DO NOTHING" and retrying it.

> 1. INSERT ON CONFLICT DO NOTHING hangs on waiting for replication
> 2. JDBC cancels the query after the default timeout
> 3. INSERT ON CONFLICT DO NOTHING succeeds, because there's no WAL written

Right. I think upsert is a red herring here. Any system trying to
implement idempotency/exactly once delivery will be built around a
similar pattern. Check if a transaction has already been executed, if
not run the transaction, commit, on failure retry. This is
particularly vulnerable to the visibility issue because the retry is
likely to land on the partitioned off leader.
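
To make the pattern concrete, a minimal sketch in SQL (the table and key are made up for illustration):

-- Hypothetical idempotency table; the client retries on timeout/cancel.
CREATE TABLE IF NOT EXISTS processed_requests (request_id uuid PRIMARY KEY);

-- Attempt 1 commits locally, then hangs in SyncRepWaitForLSN(); the
-- driver cancels after its timeout.  The retry lands on the same
-- (partitioned-off) leader, sees the locally committed row, reports
-- success, and the actual work is silently skipped:
INSERT INTO processed_requests (request_id)
VALUES ('00000000-0000-0000-0000-000000000001')
ON CONFLICT (request_id) DO NOTHING;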

> >
> >> The key change is how we process cancels in SyncRepWaitForLSN().
> >
> > I personally think we should rather move to CSN-based snapshots on
> > both primary and replica (with LSN as CSN), and make visibility of
> > other transactions depend on how much persistence your session wants
> > (SQL spec permitting, of course).
>
> CSN is a snapshot technique and does not affect sync rep in the durability aspect. You still WAL-log the xid commit.

CSN-based snapshots enable delaying visibility without blocking on
cancelling a commit, and relatedly let async commits remain
invisible. Snapshots based on lists of concurrent xids require shared
memory to keep track of which transactions are running and are
therefore limited in size; running a transaction, committing, and then
cancelling allows for a potentially unlimited number of concurrent
transactions.

> > I.e., if you have synchronous_commit=remote_apply, you wait with
> > sending the commit success message until you have confirmation that
> > your commit LSN has been applied on the configured number of replicas,
> > and snapshots are taken based on the latest LSN that is known to be
> > applied everywhere, but if you have synchronous_commit=off, you could
> > read the commits (even those committed in s_c=remote_apply sessions)
> > immediately after they've been included in the logs (potentially with
> > some added slack to account for system state updates).
>
> Introducing a dependency of the snapshot on the synchronous_commit level is an interesting idea, but it still depends on
> a cancel not being able to make the effects of a transaction visible. It does not contradict the ideas I propose here,
> but supports them.

>
> CSN has been discussed for a couple of decades already; does anything make you believe it will arrive soon and that we
> do not need to fix existing problems?


A few things give me hope. One is Heikki's approach of adding an
xid visibility cache to the snapshot [1], which proved to be
surprisingly effective.

The other is having a resolution in sight on how to handle async
transaction visibility. My recollection is that this was the major
issue that derailed the feature last time it was attempted. Allowing
different users to see different states based on their durability
requirements looks like a satisfactory answer to this problem. Whether
it's by overloading synchronous_commit, or a new guc, or a transaction
isolation parameter is a small matter of bikeshedding.

There is even an interesting paper on how this type of approach can be
used to reduce lock durations in contended workloads by moving the
wait to readers [2].

Third, the recent Jepsen report seems to have renewed wider interest
in this problem. [3]

On a related topic, I'm also looking at tail latencies with
synchronous replication. Some storage devices have occasional hiccups,
and because WAL necessarily causes head-of-line blocking, these can
magnify the problem by multiple orders of magnitude. Right now we
don't even try to replicate before WAL is flushed locally. Ideally I
would like to support quorum commit returning while a transaction is
not yet persistent on the local disk. This will require some
additional awareness on the PostgreSQL side that it is running as part
of a cluster, similar to the no-visibility-before-replicated-durability
problem we are discussing here.

Regards,
Ants Aasma

[1]
https://www.postgresql.org/message-id/flat/80f254d3-8ee9-4cde-a7e3-ee99998154da%40iki.fi#8a550e2adaa6810e25c497f24a2a83fd
[2] https://cs.uwaterloo.ca/~kdaudjee/ED.pdf
[3] https://jepsen.io/analyses/amazon-rds-for-postgresql-17.4




Re: Allow reading LSN written by walreceiver, but not flushed yet

From
Fujii Masao
Date:
On 2025/05/13 0:47, Andrey Borodin wrote:
> Moved off from "Small fixes needed by high-availability tools"
> 
>> On 12 May 2025, at 01:33, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, May 2, 2025 at 6:30 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>>>
>>> 3. Allow reading LSN written by walreceiver, but not flushed yet
>>>
>>> Problem: if we have synchronous_standby_names = ANY(node1,node2), node2 might be ahead of node1 by flush LSN, but
>>> behind by written LSN. If we do a failover we choose node2 instead of node1 and lose data recently committed with
>>> synchronous_commit=remote_write.

In this case, doesn't the flush LSN typically catch up to the write LSN on node2
after a few seconds? Even if the walreceiver exits while there's still written
but unflushed WAL, it looks like WalRcvDie() ensures everything is flushed by
calling XLogWalRcvFlush(). So, isn't it safe to rely on the flush LSN when selecting
the most advanced node? No?


>>> Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact returns the flushed LSN, not the written
>>> one. I propose to add a new function which returns the LSN actually written. The internals of this function are
>>> already implemented (GetWalRcvWriteRecPtr()), but unused.


GetWalRcvWriteRecPtr() returns walrcv->writtenUpto, which can move backward
when the walreceiver restarts. Is this behavior OK for your purpose?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Allow reading LSN written by walreceiver, but not flushed yet

From
Mihail Nikalayeu
Date:
Hello, everyone!

> Or we might want an LSN "flush everything you have written, and return that LSN". That will also do the trick, but is
> not necessary.
I think it is a better option. Fewer things can go wrong in such a case.

Best regards,
Mikhail.



Re: Small fixes needed by high-availability tools

From
Mihail Nikalayeu
Date:
Hello, everyone!

> On Mon, 12 May 2025 at 18:42, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> >> Problem: a user might try to cancel a locally committed transaction, and if we do so we will show non-replicated data
>> >> as committed. This leads to losing data with UPSERTs.
>> > >
>> > > Could you explain why specifically UPSERTs would lose data (vs any
>> > > other user workload) in cancellations during SyncRepWaitForLSN?
>>
>> Upserts change data conditionally. That's where the observed effect affects written data. But the root problem is
>> observing non-replicated data; it only becomes obvious when issuing "INSERT ON CONFLICT DO NOTHING" and retrying it.
>> 1. INSERT ON CONFLICT DO NOTHING hangs on waiting for replication
>> 2. JDBC cancels the query after the default timeout
>> 3. INSERT ON CONFLICT DO NOTHING succeeds, because there's no WAL written

> Right. I think upsert is a red herring here. Any system trying to
> implement idempotency/exactly once delivery will be built around a
> similar pattern. Check if a transaction has already been executed, if
> not run the transaction, commit, on failure retry. This is
> particularly vulnerable to the visibility issue because the retry is
> likely to land on the partitioned off leader.

I think UPSERT is just one specific case here. Any data that becomes
visible and then disappears can cause a variety of issues.

For example, the system receives a callback from a payment system,
marks an order as "PAID," commits the transaction, and returns a 200
response to the payment system (so it won't retry the callback).
However, if the transaction is lost due to a new primary, we end up
with an order that is paid in the real world, but the system is
unaware of it.

And yes, that patch has actually been applied on top of HEAD by most
PG cloud providers for over four years now... [0].

> One idea to solve this problem could be that whenever we cancel
> sync_rep_wait, we set some system-wide flag that indicates that any
> new transaction must ensure that all the current data is replicated to
> the synchronous standby. Once we ensure that we have waited for
> pending transactions to replicate, we can toggle back that system-wide
> flag. Now, if the system restarts for any reason during such a wait,
> we can use your idea to disallow new connections until the standby
> quorum is established.

It might not necessarily be a flag; it could be some LSN value instead.
Also, it's not just about a "new transaction," but about any new
snapshot that could see data not yet replicated to the synchronous
standby.

Best regards,
Mikhail.

[0]:
https://www.postgresql.org/message-id/flat/CAAhFRxgcBy-UCvyJ1ZZ1UKf4Owrx4J2X1F4tN_FD%3Dfh5wZgdkw%40mail.gmail.com#9c71a85cb6009eb60d0361de82772a50



Re: Allow reading LSN written by walreceiver, but not flushed yet

From
Alexander Kukushkin
Date:
Hi Fujii,


On Tue, 13 May 2025 at 13:13, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> In this case, doesn't the flush LSN typically catch up to the write LSN on node2
> after a few seconds? Even if the walreceiver exits while there's still written
> but unflushed WAL, it looks like WalRcvDie() ensures everything is flushed by
> calling XLogWalRcvFlush(). So, isn't it safe to rely on the flush LSN when selecting
> the most advanced node? No?

I think it is a bit more complex than that. There are also cases when we want to ensure that there are "healthy" standby nodes when switchover is requested.
Meaning of "healthy" could be something like: "According to the write LSN it is not lagging more than 16MB" or similar.
Now it is possible to extract this value using pg_stat_get_wal_receiver()/pg_stat_wal_receiver, but it works only when the walreceiver process is alive.
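
For example, something along these lines; it returns no row at all when the walreceiver process is not alive:

-- On the standby: position written (not necessarily flushed) by the
-- walreceiver, and its lag behind the last LSN reported by the primary.
SELECT written_lsn,
       pg_wal_lsn_diff(latest_end_lsn, written_lsn) AS write_lag_bytes
FROM pg_stat_wal_receiver;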
 
>>> Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact returns the flushed LSN, not the written one. I propose to add a new function which returns the LSN actually written. The internals of this function are already implemented (GetWalRcvWriteRecPtr()), but unused.

> GetWalRcvWriteRecPtr() returns walrcv->writtenUpto, which can move backward
> when the walreceiver restarts. Is this behavior OK for your purpose?

IMO, most HA tools are prepared for it. They can't rely only on the write/flush LSN, because the standby may be replaying WAL from the archive using restore_command, and as a result only the replay LSN is progressing.
That is, they are supposed to be doing something like max(write_lsn, replay_lsn).
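
In SQL terms, with the proposed function that check would look something like this (pg_last_wal_write_lsn() being the hypothetical new function; today only pg_last_wal_replay_lsn() and the flush-based pg_last_wal_receive_lsn() exist):

-- Position to compare across candidate standbys during failover.
SELECT greatest(pg_last_wal_write_lsn(), pg_last_wal_replay_lsn());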

--
Regards,
--
Alexander Kukushkin

Re: Allow reading LSN written by walreceiver, but not flushed yet

From
Alexander Kukushkin
Date:


On Mon, 12 May 2025 at 17:48, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Done so. Thanks!

TBH, the current function name pg_last_wal_receive_lsn() is confusing, and introducing a new one pg_last_wal_receive_unflushed_lsn() doesn't make it better :(
What about actually adding TWO new functions, pg_last_wal_write_lsn() and pg_last_wal_flush_lsn()?
These names are more aligned with column names in pg_stat_replication view and speak for themselves.

And, we can keep pg_last_wal_receive_lsn() as an alias of pg_last_wal_flush_lsn() for backward compatibility.

--
Regards,
--
Alexander Kukushkin

Re: Small fixes needed by high-availability tools

From
Amit Kapila
Date:
On Wed, May 14, 2025 at 2:15 AM Mihail Nikalayeu
<mihailnikalayeu@gmail.com> wrote:
>
> > One idea to solve this problem could be that whenever we cancel
> > sync_rep_wait, we set some system-wide flag that indicates that any
> > new transaction must ensure that all the current data is replicated to
> > the synchronous standby. Once we ensure that we have waited for
> > pending transactions to replicate, we can toggle back that system-wide
> > flag. Now, if the system restarts for any reason during such a wait,
> > we can use your idea to disallow new connections until the standby
> > quorum is established.
>
> It might not necessarily be a flag—it could be some LSN value instead.
> Also, it's not just about a "new transaction," but about any new
> snapshot that could see data not yet replicated to the synchronous
> standby.
>

Sounds reasonable to me. Let us see what others think about it.

--
With Regards,
Amit Kapila.