Re: Should io_method=worker remain the default? - Mailing list pgsql-hackers
From | Thomas Munro
Subject | Re: Should io_method=worker remain the default?
Date |
Msg-id | CA+hUKGKR8m8Cv_rjGQggW6TCXnaqOXyk3ROA-rA69XcP4_63pw@mail.gmail.com
In response to | Re: Should io_method=worker remain the default? (Andres Freund <andres@anarazel.de>)
Responses | Re: Should io_method=worker remain the default?
List | pgsql-hackers
On Tue, Sep 9, 2025 at 9:02 AM Andres Freund <andres@anarazel.de> wrote:
> On 2025-09-08 16:45:52 -0400, Andres Freund wrote:
> > I don't think accelerating copying from the pagecache into postgres shared
> > buffers really is a goal of AIO.
>
> I forgot an addendum: In fact, if there were a sufficiently cheap way to avoid
> using AIO when data is in the page cache, I'm fairly sure we'd want to use
> that. However, there is not, from what I know (both fincore() and RWF_NOWAIT
> are too expensive). The maximum gain from using AIO when the data is already
> in the page cache is just not very big, and it can cause slowdowns due to IPC
> overhead etc.

FWIW, I briefly played around with RWF_NOWAIT in pre-AIO streaming work: I
tried preadv2(RWF_NOWAIT) before issuing WILLNEED advice, and I cooked up
some experimental heuristics to do it only when it seemed likely to pay
off*.  I also played with probing to find the frontier where fadvise had
"completed", while writing toy implementations of some of Melanie's feedback
control ideas.  It was awful.  Fun systems programming puzzles, but it felt
like jogging with one's shoelaces tied together compared to proper AIO
interfaces.  (A rough sketch of the probe-then-advise shape appears at the
end of this mail.)

Note that io_uring already has vaguely similar behaviour internally: see the
IOSQE_ASYNC heuristics in method_io_uring.c and man io_uring_sqe_set_flags
(a tiny liburing example also appears below).

In the new AIO world, I therefore assume we'd only be talking about a
potential path that could skip some overheads for io_method=worker/sync with
a primed page cache, and that seems to have some fundamental issues:
(1) AFAIK the plan is to drop io_method=sync soon; it's only a temporary
be-more-like-v17 mitigation in case of unforeseen problems or in case we
decided not to launch with worker by default, and this thread has
(re-)concluded that we should stick with worker, (2) preadv2() is Linux-only
and I'm not aware of a similar "non-blocking file I/O" interface on any
other system**, yet io_method=worker is primarily intended as a portable
fallback for systems lacking a better native option, and (3) even though it
should win for Jeff's test case by skipping workers entirely when an initial
RWF_NOWAIT attempt succeeds, you could presumably change some parameters and
make it lose (number of backends vs number of I/O workers performing
copyout, cf in_flight_before > 4 in the io_uring code, and performing
checksums as discussed).  Still, it's interesting to contemplate the two
independent points of variation: concurrency of page cache copyout
(IOSQE_ASYNC, the magic number 4, whatever other potential native I/O
methods do here) and concurrency of checksum computation (potential for
worker pool handoff).

*One scheme kept stats in a per-relation shm object.  That was abandoned per
the above reasoning, and, digressing a bit here, I'm currently much more
interested in tracking facts about our own buffer pool contents, to inform
streaming decisions and skip the buffer mapping table in many common cases.
Digressing even further, my first priority for per-relation shm objects is
not even that, it's to improve the fsync hand-off queue: (1) we probably
shouldn't trust Linux to remember relation sizes until we've fsync'd, and
(2) Andres's asynchronous buffer write project wants a no-throw guarantee
when enqueuing in a critical section.

**Anyone know of one?  This is basically a really ancient and deliberate
Unix design decision (hide I/O asynchrony and buffering from user space
completely, unlike pretty much every other OS) shining through.
(I've thought about proposing it for FreeBSD as a programming exercise but I'd rather spend my spare time making its AIO better.)
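For the curious, here is roughly the shape of the probe-then-advise trick
described above.  This is a from-memory illustration only, not the actual
experimental patch; the function name and the EOPNOTSUPP handling are
assumptions about how one might write it.  The idea: try a non-blocking
copyout first, and only issue WILLNEED advice (and come back later with a
normal blocking read) when the data isn't already in the page cache.

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>          /* posix_fadvise() */
    #include <stdbool.h>
    #include <sys/uio.h>        /* preadv2(), RWF_NOWAIT */

    /*
     * Try to read a block without waiting for device I/O.  On a page cache
     * hit, preadv2(RWF_NOWAIT) copies the data out immediately and we're
     * done.  On a miss it fails with EAGAIN (or returns a short read), so
     * we kick off readahead with WILLNEED and let the caller retry with a
     * plain blocking pread() later.
     */
    static bool
    try_nowait_read(int fd, void *buffer, size_t len, off_t offset)
    {
        struct iovec iov = {.iov_base = buffer, .iov_len = len};
        ssize_t     nread = preadv2(fd, &iov, 1, offset, RWF_NOWAIT);

        if (nread == (ssize_t) len)
            return true;        /* page cache hit, no advice needed */
        if (nread < 0 && errno != EAGAIN && errno != EOPNOTSUPP)
            return false;       /* real error, let the caller deal with it */

        /* Miss (or partial hit): start readahead and retry later. */
        (void) posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
        return false;
    }

The heuristics I mentioned sat on top of something like this, deciding when
the probe was likely to pay for its extra syscall at all.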
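And for completeness, the io_uring flag mentioned above in its plain
liburing form.  This is just a minimal standalone sketch (build with
-luring), not anything taken from method_io_uring.c: without IOSQE_ASYNC the
kernel first attempts the operation inline without blocking and only punts
it to an async worker if that would block, while setting the flag punts it
immediately, which is the copyout-concurrency knob discussed above.

    #include <fcntl.h>
    #include <stdio.h>
    #include <liburing.h>

    int
    main(void)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        static char buf[8192];
        int         fd = open("/etc/hostname", O_RDONLY); /* any readable file */

        if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
            return 1;

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);  /* skip the inline attempt, punt to a worker */

        io_uring_submit(&ring);
        if (io_uring_wait_cqe(&ring, &cqe) == 0)
        {
            printf("read returned %d\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return 0;
    }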