Re: Implement waiting for wal lsn replay: reloaded - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: Implement waiting for wal lsn replay: reloaded
Msg-id CAPpHfdtgcyh0ZK-AtXXvQBaBsNFR=p_0KF1yN7kouEmuuE2nwA@mail.gmail.com
In response to Re: Implement waiting for wal lsn replay: reloaded  (Xuneng Zhou <xunengzhou@gmail.com>)
List pgsql-hackers
Hi, Xuneng Zhou!

On Thu, Aug 7, 2025 at 6:01 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> Thanks for working on this.
>
> I’ve just come across this thread and haven’t had a chance to dig into
> the patch yet, but I’m keen to review it soon.

Great.  Thank you for your attention to this patch.  I appreciate your
intention to review it.

> In the meantime, I have
> a quick question: is WAIT FOR REPLY intended mainly for user-defined
> functions, or can internal code invoke it as well?

Currently, WaitForLSNReplay() is assumed to be called only from a
backend, as the corresponding shmem is allocated only per-backend.  But
there is no problem tweaking the patch to allocate shmem for every
Postgres process.  This would make it possible to call
WaitForLSNReplay() wherever it is needed.  There is also no problem
extending this approach to support other kinds of LSNs, not just the
replay LSN.


> During a recent performance run [1] I noticed heavy polling in
> read_local_xlog_page_guts(). Heikki’s comment from a few months ago
> also hints that we could replace this check–sleep–repeat loop with the
> condition-variable (CV) infrastructure used by walsender:
>
> /*
>  * Loop waiting for xlog to be available if necessary
>  *
>  * TODO: The walsender has its own version of this function, which uses a
>  * condition variable to wake up whenever WAL is flushed. We could use the
>  * same infrastructure here, instead of the check/sleep/repeat style of
>  * loop.
>  */
>
> Because read_local_xlog_page_guts() waits for a specific flush or
> replay LSN, polling becomes inefficient when the wait is long. I built
> a POC patch that swaps polling for CVs, but a single global CV (or
> even separate “flush” and “replay” CVs) isn’t ideal:
>
> The wake-up routines don’t know which LSN each waiter cares about, so
> they’d have to broadcast on every flush/replay. Caching the minimum
> outstanding LSN could reduce spuriously awakened waiters, yet wouldn’t
> eliminate them—multiple backends might wait for different LSNs
> simultaneously. A more precise solution would require a request queue
> that maps waiters to target LSNs and issues targeted wake-ups, adding
> complexity.
>
> Walsender accepts the potential broadcast overhead by using two cvs
> for different waiters, so it might be acceptable for
> read_local_xlog_page_guts() as well. However, if WAIT FOR REPLY
> becomes available to backend code, we might leverage it to eliminate
> the polling for waiting replay in read_local_xlog_page_guts() without
> introducing a bespoke dispatcher. I’d appreciate any thoughts on
> whether that use case is in scope.

This looks like a great new use case for the facilities developed in
this patch!  I'll remove the restriction that WaitForLSNReplay() can be
used only in a backend.  I think you can write a patch with an
additional pairing heap for the flush LSN and include it in the thread
about the read_local_xlog_page_guts() optimization.  Let me know if you
need any assistance.
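
To illustrate the idea of waking only the waiters whose target LSN has
been reached, here is a rough, standalone sketch.  It is not code from
the patch: it uses a plain array-based binary min-heap as a stand-in for
a pairing heap, and every name in it (Lsn, Waiter, WaitQueue,
wait_queue_add, wait_queue_wakeup) is hypothetical.  Locking and the
actual process wake-up are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* hypothetical stand-in for an LSN type */

#define MAX_WAITERS 64          /* arbitrary capacity for this sketch */

typedef struct Waiter
{
    Lsn     target;             /* LSN this waiter needs to be reached */
    int     proc_id;            /* hypothetical process identifier */
    bool    woken;              /* set once the target has been reached */
} Waiter;

typedef struct WaitQueue
{
    Waiter *heap[MAX_WAITERS];  /* binary min-heap ordered by target */
    size_t  count;
} WaitQueue;

static void
swap_slots(WaitQueue *q, size_t a, size_t b)
{
    Waiter *tmp = q->heap[a];

    q->heap[a] = q->heap[b];
    q->heap[b] = tmp;
}

/* Register a waiter; returns false if the queue is full. */
static bool
wait_queue_add(WaitQueue *q, Waiter *w)
{
    size_t  i;

    if (q->count >= MAX_WAITERS)
        return false;
    w->woken = false;
    i = q->count++;
    q->heap[i] = w;
    /* Sift up until the parent's target is not larger than ours. */
    while (i > 0)
    {
        size_t  parent = (i - 1) / 2;

        if (q->heap[parent]->target <= q->heap[i]->target)
            break;
        swap_slots(q, parent, i);
        i = parent;
    }
    return true;
}

/* Remove and return the waiter with the smallest target LSN. */
static Waiter *
wait_queue_pop_min(WaitQueue *q)
{
    Waiter *min;
    size_t  i = 0;

    assert(q->count > 0);
    min = q->heap[0];
    q->heap[0] = q->heap[--q->count];
    /* Sift down to restore the heap property. */
    for (;;)
    {
        size_t  left = 2 * i + 1;
        size_t  right = left + 1;
        size_t  smallest = i;

        if (left < q->count &&
            q->heap[left]->target < q->heap[smallest]->target)
            smallest = left;
        if (right < q->count &&
            q->heap[right]->target < q->heap[smallest]->target)
            smallest = right;
        if (smallest == i)
            break;
        swap_slots(q, i, smallest);
        i = smallest;
    }
    return min;
}

/*
 * Wake every waiter whose target is <= current and return how many were
 * woken.  Waiters with a later target stay in the heap untouched.
 */
static size_t
wait_queue_wakeup(WaitQueue *q, Lsn current)
{
    size_t  nwoken = 0;

    while (q->count > 0 && q->heap[0]->target <= current)
    {
        Waiter *w = wait_queue_pop_min(q);

        w->woken = true;        /* real code would signal the process here */
        nwoken++;
    }
    return nwoken;
}
```

With waiters registered for three different LSNs, calling
wait_queue_wakeup() with the current LSN touches only the waiters at or
below that LSN, so nobody is woken spuriously.  Supporting both replay
and flush LSNs would then presumably mean keeping one such queue per LSN
kind.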

------
Regards,
Alexander Korotkov
Supabase


