Re: Allow reading LSN written by walreciever, but not flushed yet - Mailing list pgsql-hackers

From Andrey Borodin
Subject Re: Allow reading LSN written by walreciever, but not flushed yet
Date
Msg-id 7E23D6B9-B928-41CF-8471-04A6926D8305@yandex-team.ru
In response to Small fixes needed by high-availability tools  (Andrey Borodin <x4mmm@yandex-team.ru>)
List pgsql-hackers

> On 21 May 2025, at 15:03, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
>
>
> On 2025/05/21 17:35, Andrey Borodin wrote:
>> Well, we implemented this and made tests that do a lot of failovers. These tests observed data loss in some
>> infrequent cases due to wrong new primary selection. Because "few seconds" is actually unknown random time.
>
> I see your point. But doesn't a similar issue exist even with the write LSN?
> For example, even if node1's write LSN is ahead of node2's at one moment,
> node2 might catch up or surpass it a few seconds later.
>
> If the walreceiver is no longer running, we can assume the write LSN has
> reached its final value. So by waiting for the walreceiver to exit on both nodes,
> we can "safely" compare their write LSNs to decide which one is ahead.
> Also, in this situation, since XLogWalRcvFlush() is called during WalRcvDie(),
> the flush LSN seems effectively guaranteed to match the write LSN.
> So it seems also safe to use the flush LSN.

You are right. Receive LSN is meaningless when receive is in progress. So the only way to know receive LSN is to stop
receiving...
I need to think more about it.
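
For illustration only, here is a minimal sketch of the comparison step once the walreceiver has exited on every candidate and each node's LSN is final. It assumes the LSNs were already collected as text in the usual "hi/lo" hexadecimal form; parse_lsn() and pick_most_advanced() are hypothetical helper names, not PostgreSQL functions.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Parse an LSN written as two hexadecimal numbers separated by a slash
 * (high 32 bits / low 32 bits) into a single 64-bit value.
 * Returns 0 on success, -1 on malformed input.
 */
static int
parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi, lo;
    char         trailing;

    /* Exactly two fields must match; a third means trailing garbage. */
    if (sscanf(text, "%X/%X%c", &hi, &lo, &trailing) != 2)
        return -1;

    *lsn = ((uint64_t) hi << 32) | lo;
    return 0;
}

/*
 * Hypothetical helper: given the final LSNs of n candidates (collected
 * after each walreceiver has stopped), return the index of the most
 * advanced one, or -1 if any LSN cannot be parsed. Ties go to the
 * lowest index.
 */
static int
pick_most_advanced(const char *const lsns[], int n)
{
    uint64_t best_lsn = 0;
    int      best = -1;

    for (int i = 0; i < n; i++)
    {
        uint64_t lsn;

        if (parse_lsn(lsns[i], &lsn) != 0)
            return -1;
        if (best < 0 || lsn > best_lsn)
        {
            best_lsn = lsn;
            best = i;
        }
    }
    return best;
}
```

The comparison is only meaningful under the assumption discussed above: no candidate is still receiving WAL when its LSN is read.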

>>>>>> Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact returns flushed LSN, not written. I
>>>>>> propose to add a new function which returns LSN actually written. Internals of this function are already implemented
>>>>>> (GetWalRcvWriteRecPtr()), but unused.
>>>
>>> GetWalRcvWriteRecPtr() returns walrcv->writtenUpto, which can move backward
>>> when the walreceiver restarts. This behavior is OK for your purpose?
>> It is OK, because:
>> 1. It's strictly no worse than flushed LSN
>
> Could you clarify this?
>
> XLogWalRcvFlush() only updates flushedUpto if LogstreamResult.Flush has advanced,
> while XLogWalRcvWrite() updates writtenUpto unconditionally. That means the flush
> LSN (as reported by pg_last_wal_receive_lsn()) never moves backward, whereas
> the write LSN might.

Write LSN cannot move backwards beyond flush LSN. Receive LSN >= flush LSN.

> Because of this difference in behavior, I was thinking that
> we might need to track the maximum write LSN seen so far and have the function
> return that value.

That would be ideal. Or, maybe just maximum LSN that we told Primary we have received...
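
To make the "maximum write LSN seen so far" idea concrete, here is a simplified, single-threaded sketch. It is a toy model, not the walreceiver code: the struct and the model_* functions are hypothetical names, the maxWritten field does not exist in PostgreSQL today, and the real code would need atomics or locking around shared state.

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* LSNs are 64-bit values in PostgreSQL */

/* Toy model of the three positions discussed in this thread. */
typedef struct WalRcvModel
{
    XLogRecPtr writtenUpto;     /* overwritten on every write, may go backward */
    XLogRecPtr flushedUpto;     /* updated only when flush has advanced */
    XLogRecPtr maxWritten;      /* hypothetical high-water mark of writes */
} WalRcvModel;

/* Record a write: writtenUpto is set unconditionally, maxWritten only grows. */
static void
model_write(WalRcvModel *m, XLogRecPtr upto)
{
    m->writtenUpto = upto;
    if (upto > m->maxWritten)
        m->maxWritten = upto;
}

/* Record a flush: flushedUpto moves only forward. */
static void
model_flush(WalRcvModel *m)
{
    if (m->writtenUpto > m->flushedUpto)
        m->flushedUpto = m->writtenUpto;
}

/* What a new SQL-level function could report under this proposal. */
static XLogRecPtr
model_reported_write_lsn(const WalRcvModel *m)
{
    return m->maxWritten;
}
```

In this model, writing up to 800 and then, after a restart, writing up to 600 leaves writtenUpto at 600 while the reported value stays at 800, so the reported LSN never moves backward.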


Best regards, Andrey Borodin.



