Re: Should io_method=worker remain the default? - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Should io_method=worker remain the default?
Date
Msg-id CA+hUKGKR8m8Cv_rjGQggW6TCXnaqOXyk3ROA-rA69XcP4_63pw@mail.gmail.com
In response to Re: Should io_method=worker remain the default?  (Andres Freund <andres@anarazel.de>)
Responses Re: Should io_method=worker remain the default?
List pgsql-hackers
On Tue, Sep 9, 2025 at 9:02 AM Andres Freund <andres@anarazel.de> wrote:
> On 2025-09-08 16:45:52 -0400, Andres Freund wrote:
> > I don't think accelerating copying from the pagecache into postgres shared
> > buffers really is a goal of AIO.
>
> I forgot an addendum: In fact, if there were a sufficiently cheap way to avoid
> using AIO when data is in the page cache, I'm fairly sure we'd want to use
> that. However, there is not, from what I know (both fincore() and RWF_NOWAIT
> are too expensive). The maximum gain from using AIO when the data is already
> in the page cache is just not very big, and it can cause slowdowns due to IPC
> overhead etc.

FWIW, I briefly played around with RWF_NOWAIT in pre-AIO streaming
work: I tried preadv2(RWF_NOWAIT) before issuing WILLNEED advice.  I
cooked up some experimental heuristics to do it only when it seemed
likely to pay off*.  I also played with probing to find the frontier
where fadvise had "completed", while writing toy implementations of
some of Melanie's feedback control ideas.  It was awful.  Fun systems
programming puzzles, but it felt like jogging with one's shoelaces
tied together compared to proper AIO interfaces.
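
For illustration, the "try preadv2(RWF_NOWAIT) first, otherwise issue WILLNEED advice" idea described above might look roughly like the sketch below. It is not the experimental code from that work. The helper name try_read_nowait is hypothetical, the heuristics for deciding when the attempt is worthwhile are omitted, and it assumes Linux with a glibc recent enough to expose preadv2() and RWF_NOWAIT.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE				/* for preadv2() and RWF_NOWAIT on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper, not PostgreSQL code.
 *
 * Try to copy data straight out of the page cache without blocking.
 * Returns the number of bytes read (possibly fewer than len if only part
 * of the range is cached), or -1 with errno set.  If the kernel reports
 * EAGAIN, meaning the read would have had to wait for the device, start
 * readahead with WILLNEED advice so that a later read is more likely to
 * find the data in the page cache.
 */
static ssize_t
try_read_nowait(int fd, void *buf, size_t len, off_t offset)
{
	struct iovec iov = {.iov_base = buf, .iov_len = len};
	ssize_t		nread;

	assert(len > 0);

	nread = preadv2(fd, &iov, 1, offset, RWF_NOWAIT);
	if (nread >= 0)
		return nread;			/* page cache hit (or EOF if 0) */

	if (errno == EAGAIN)
	{
		/* Not cached: ask the kernel to begin reading it in. */
		(void) posix_fadvise(fd, offset, (off_t) len, POSIX_FADV_WILLNEED);
		errno = EAGAIN;			/* report the miss to the caller */
	}

	/* Other errors (e.g. EOPNOTSUPP, ENOSYS) are passed through as-is. */
	return -1;
}
```

A caller would fall back to an ordinary blocking read later when this returns -1, and would also have to cope with short reads. Even in this toy form, every block pays for an extra system call on a miss, which is part of why the approach compares poorly with proper AIO interfaces.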

Note that io_uring already has vaguely similar behaviour internally:
see IOSQE_ASYNC heuristics in method_io_uring.c and man
io_uring_sqe_set_flags.

In the new AIO world, I therefore assume we'd only be talking about a
potential path that could skip some overheads for
io_method=worker/sync with a primed page cache, and that seems to have
some fundamental issues: (1) AFAIK the plan is to drop io_method=sync
soon, it's only a temporary be-more-like-v17 mitigation in case of
unforeseen problems or in case we decided not to launch with worker by
default, and this thread has (re-)concluded we should stick with
worker, (2) preadv2() is Linux-only and I'm not aware of a similar
"non-blocking file I/O" interface on any other system**, and yet
io_method=worker is primarily intended as a portable fallback for
systems lacking a better native option, and (3) even though it should
win for Jeff's test case by skipping workers entirely if an initial
RWF_NOWAIT attempt succeeds, you could presumably change some
parameters and make it lose (number of backends vs number of I/O
workers performing copyout, cf in_flight_before > 4 in io_uring code,
and performing checksums as discussed).

Still, it's interesting to contemplate the two independent points of
variation: concurrency of page cache copyout (IOSQE_ASYNC, magic
number 4, what other potential native I/O methods do here) and
concurrency of checksum computation (potential for worker pool
handoff).

*One scheme kept stats in a per-relation shm object.  That was
abandoned per the above reasoning, and, digressing a bit here, I'm
currently much more interested in tracking facts about our own buffer
pool contents, to inform streaming decisions and skip the buffer
mapping table in many common cases.  Digressing even further, my first
priority for per-relation shm objects is not even that, it's to
improve the fsync hand-off queue: (1) we probably shouldn't trust
Linux to remember relation sizes until we've fsync'd, and (2) Andres's
asynchronous buffer write project wants a no-throw guarantee when
enqueuing in a critical section.

**Anyone know of one?  This is basically a really ancient and
deliberate Unix design decision to hide I/O asynchrony and buffering
from user space completely, unlike pretty much every other OS, shining
through.  (I've thought about proposing it for FreeBSD as a
programming exercise but I'd rather spend my spare time making its AIO
better.)


