From: Thomas Munro
Subject: Re: Should io_method=worker remain the default?
Msg-id: CA+hUKG+wCagh6mogKS_VL-SVABomho9beg7pLpntHRK2P6QrWA@mail.gmail.com
In response to: Re: Should io_method=worker remain the default? (Jeff Davis <pgsql@j-davis.com>)
List: pgsql-hackers

On Sat, Sep 6, 2025 at 8:26 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2025-09-03 at 11:55 -0400, Andres Freund wrote:
> > I think the regression is not due to anything inherent to worker, but
> > due to pressure on AioWorkerSubmissionQueueLock - at least that's
> > what I'm seeing on a older two socket machine. It's possible the
> > bottleneck is different on a newer machine (my newer workstation is
> > busy on another benchmark rn).
>
> I believe what's happening is that parallelism of the IO completion
> work (e.g. checksum verification) is reduced. In worker mode, the
> completion work is happening on the io workers (of which there are 3);
> while in sync mode the completion work is happening in the backends (of
> which there are 32).

Makes sense.

Some raw thoughts on this topic, and how we got here: this type of
extreme workload, namely not doing any physical I/O, just copying the
same data from the kernel page cache to the buffer pool over and over
again, is also the workload where io_method=worker can beat
io_method=io_uring (and other native I/O methods I've prototyped),
assuming io_workers is increased to a level that keeps up. That does
lead to a couple of ideas, such as handing completions off to a new
generic task worker pool, even for native I/O methods. At a guess, you'd
probably only want to do that when extra kernel completion notifications
are drained ahead of the one you need right now, so that it only kicks
in when you're reading further ahead than you strictly need (that's just
a guess about the cost of IPC and buffer header cache line ping-pong
effects vs the cost of running the checksums yourself, not properly
investigated).

I also experimented with going the other way: having I/O workers do
*just* the preadv() and then hand completion duties back to the backend
that waits for the IO. I have patches for that, creating a per-backend
completion queue that I/O methods could opt to use (I experimented with
that model as part of my ongoing struggle with multi-process PostgreSQL
+ native AIO, which fundamentally assumes threads on all other OSes, but
it's trivial to enable it for io_method=worker too). But that leveled
the playing field in the wrong direction! This I/O worker property is
actually good when tuned correctly, so that you get more rather than
less checksum computation concurrency, so I abandoned that experiment.
Of course if we did that, in your case here you'd get all 32 backends
computing checksums no matter how many I/O workers are running.
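I won't post those patches here, but just to illustrate the shape of the
idea, a minimal sketch; every name below is invented, and overflow
handling, latches and the synchronous fallback are all elided:

#include "postgres.h"

#include "storage/spin.h"

/* Hypothetical per-backend queue of completed-but-unprocessed IOs. */
typedef struct PgAioCompletionQueue
{
    slock_t     lock;           /* protects head/tail/ios; init elided */
    uint32      head;
    uint32      tail;
    uint32      ios[64];        /* IO handle indexes, power-of-two size */
} PgAioCompletionQueue;

/*
 * I/O worker side: after the preadv(), just record the finished IO and
 * leave the completion callbacks (checksum verification etc.) to the
 * backend that owns it.
 */
static void
pgaio_cq_push(PgAioCompletionQueue *cq, uint32 io_index)
{
    SpinLockAcquire(&cq->lock);
    cq->ios[cq->head++ % lengthof(cq->ios)] = io_index;
    SpinLockRelease(&cq->lock);
}

/*
 * Backend side: drain the queue and run completion work locally, so the
 * checksum CPU time lands on the (many) backends instead of the (few)
 * I/O workers.
 */
static void
pgaio_cq_drain(PgAioCompletionQueue *cq)
{
    for (;;)
    {
        uint32      io_index;

        SpinLockAcquire(&cq->lock);
        if (cq->tail == cq->head)
        {
            SpinLockRelease(&cq->lock);
            break;
        }
        io_index = cq->ios[cq->tail++ % lengthof(cq->ios)];
        SpinLockRelease(&cq->lock);

        /* run completion callbacks for io_index here */
        (void) io_index;
    }
}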
The original motivation for workers processing completions, rather than
more directly simulating io_uring (which was the original and only
implementation when I got involved in this), was the sheer simplicity of
it, and this choice didn't matter much before the checksums-by-default
change went in just a little ahead of basic AIO support.

> There may be lock contention too, but I don't think that's the primary
> issue.

Interesting that it shows up so clearly for Andres but not for you. I've
made a note to try out those lock improvement patches I mentioned on a
dual socket box once I'm done with some other work that is keeping me
busy...

> I attached a test patch for illustration. It simplifies the code inside
> the LWLock to enqueue/dequeue only, and simplifies and reduces the
> wakeups by doing pseudo-random wakeups only when enqueuing. Reducing
> the wakeups should reduce the number of signals generated without
> hurting my case, because the workers are never idle. And reducing the
> instructions while holding the LWLock should reduce lock contention.
> But the patch barely makes a difference: still around 24tps.

BTW, there are already a couple of levels of signal suppression: if
workers are not idle then we don't set any latches, and even if we did,
SetLatch() only sends signals when the recipient is actually waiting,
which shouldn't happen when the pool is completely busy.

+ nextWakeupWorker = (nextWakeupWorker + 1) % io_workers;

FWIW, I experimented extensively with wakeup distribution schemes like
that ^ and others, and found that they performed significantly worse in
terms of avg(latency) and stddev(latency) for partially idle worker
pools, as discussed a bit in that auto-tuning thread. It seems to be
generally better to keep a small number of workers busy.
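To make the comparison concrete, a toy version of the two policies; the
idle[] bookkeeping is invented for illustration and is not how the real
pool tracks its state:

#include "postgres.h"

/* Toy state only. */
static int  nextWakeupWorker = 0;

/* Round-robin, as in the quoted hunk: spreads wakeups across the pool. */
static int
choose_worker_round_robin(int io_workers)
{
    int         victim = nextWakeupWorker;

    nextWakeupWorker = (nextWakeupWorker + 1) % io_workers;
    return victim;
}

/*
 * Alternative that measured better for avg/stddev latency with partially
 * idle pools: always wake the lowest-numbered idle worker, so a small hot
 * subset stays busy and the rest can stay asleep.
 */
static int
choose_worker_lowest_idle(const bool *idle, int io_workers)
{
    for (int i = 0; i < io_workers; i++)
    {
        if (idle[i])
            return i;
    }
    return -1;                  /* everyone is busy: no wakeup needed */
}

Obviously the real code has to do this under the submission queue lock
and combine it with the latch logic above; this is just to show the
policy difference.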
> What *does* make a difference is changing io_worker_queue_size. A lower
> value of 16 effectively starves the workers of work to do, and I get a
> speedup to about 28tps. A higher value of 512 gives the workers more
> chance to issue the IOs -- and more responsibility to complete them --
> and it drops to 17tps. Furthermore, while the test is running, the io
> workers are constantly at 100% (mostly verifying checksums) and the
> backends are at 50% (20% when io_worker_queue_size=512).

I would value your feedback and this type of analysis on the thread
about automatic tuning for v19. It should find a setting that keeps your
executor backends 100% busy, and some number of low-numbered I/O workers
100% busy (only because here you're doing no physical I/O that would
cause them to sleep in D state), but leave some spare worker capacity to
allow for recent variation. It should also keep the queue from
overflowing and falling back to synchronous I/O. Keeping the queue size
low is important for latency, because we don't want new queries to get
stuck behind a huge queue: each new IO should always be processed ASAP
if the storage can keep up. That's kinda why a queue size of 64 is
probably "enough": 32 workers is the highest pool size, and we don't
really want more than one extra queued IO per currently running worker
(which is debatable; see the discussion of the simplicity of that policy
vs theoretical arguments about some unknown variables). Though I do
think there are good reasons to make the queue size configurable for
developers only, and I'm planning to propose that and more as a debug_
setting, since it turns out to be quite useful for testing theories
about higher level rate control systems, discussed a bit in that thread,
and hopefully with some more patches a bit later...

You might also be interested in the phenomenon discussed in the index
prefetching thread, where concurrent reads of the *same* block from
different backends currently force a wait, which can be avoided with a
patch posted in that thread. There it was happening because index scans
tried to stream the same block repeatedly in one stream, but it's also
known to happen in torture tests much like yours where different
backends step on each other's toes; on the other hand you're reporting
100% CPU for the backends, so I guess it's not happening much... hmm,
perhaps you disabled synchronous scans, or something about your test
conditions avoids that?

> As an aside, I'm building with meson using -Dc_args="-msse4.2
> -Wtype-limits -Werror=missing-braces". But I notice that the meson
> build doesn't seem to use -funroll-loops or -ftree-vectorize when
> building checksums.c. Is that intentional? If not, perhaps slower
> checksum calculations explain my results.

Ah, good discovery.
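For anyone following along, the reason those flags matter: the hot loop
in src/include/storage/checksum_impl.h is written as N_SUMS independent
accumulators precisely so the compiler can unroll and vectorise it, and
IIRC the autoconf build adds CFLAGS_UNROLL_LOOPS/CFLAGS_VECTORIZE just
for that file. Roughly, in a simplified sketch (real constants, base
offsets and setup omitted):

#include "postgres.h"

#define N_SUMS 32
#define FNV_PRIME 16777619

/*
 * Simplified sketch of the pg_checksum_block() inner loop: N_SUMS
 * independent FNV-1a-style accumulators.  Without unrolling and
 * vectorisation the compiler chews through these one at a time and the
 * data parallelism is wasted.  Assumes nwords is a multiple of N_SUMS,
 * as it is for an 8kB page.
 */
static uint32
checksum_block_sketch(const uint32 *page, size_t nwords)
{
    uint32      sums[N_SUMS] = {0};
    uint32      result = 0;

    for (size_t i = 0; i < nwords; i += N_SUMS)
    {
        for (int j = 0; j < N_SUMS; j++)
        {
            uint32      tmp = sums[j] ^ page[i + j];

            sums[j] = tmp * FNV_PRIME ^ (tmp >> 17);
        }
    }

    for (int j = 0; j < N_SUMS; j++)
        result ^= sums[j];

    return result;
}

So if meson really isn't applying the equivalent flags, slower checksums
would be the expected result.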