From: Thomas Munro
Subject: Re: Should io_method=worker remain the default?
Msg-id: CA+hUKG+wCagh6mogKS_VL-SVABomho9beg7pLpntHRK2P6QrWA@mail.gmail.com
In response to: Re: Should io_method=worker remain the default? (Jeff Davis <pgsql@j-davis.com>)
List: pgsql-hackers

On Sat, Sep 6, 2025 at 8:26 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2025-09-03 at 11:55 -0400, Andres Freund wrote:
> > I think the regression is not due to anything inherent to worker, but
> > due to pressure on AioWorkerSubmissionQueueLock - at least that's
> > what I'm seeing on a older two socket machine. It's possible the
> > bottleneck is different on a newer machine (my newer workstation is
> > busy on another benchmark rn).
>
> I believe what's happening is that parallelism of the IO completion
> work (e.g. checksum verification) is reduced. In worker mode, the
> completion work is happening on the io workers (of which there are 3);
> while in sync mode the completion work is happening in the backends (of
> which there are 32).

Makes sense.

Some raw thoughts on this topic, and how we got here: this type of
extreme workload, namely not doing any physical I/O, just copying the
same data from the kernel page cache to the buffer pool over and over
again, is also the workload where io_method=worker can beat
io_method=io_uring (and other native I/O methods I've prototyped),
assuming io_workers is increased to a level that keeps up. That does
lead to a couple of ideas, such as handing completions off to a new
generic task worker pool, even for native I/O methods. At a guess, you'd
probably only want to do that when extra kernel completion notifications
are drained ahead of the one you need right now, so that it only kicks
in when you're reading further ahead than you strictly need (that's just
a guess about the cost of IPC and buffer header cache line ping-pong
effects vs the cost of running the checksums yourself, not properly
investigated).

I also experimented with going the other way: having I/O workers do
*just* the preadv() and then hand completion duties back to the backend
that waits for the IO. I have patches for that, creating a per-backend
completion queue that I/O methods could opt to use (I experimented with
that model as part of my ongoing struggle with multi-process PostgreSQL
+ native AIO, which fundamentally assumes threads on all other OSes, but
it's trivial to enable it for io_method=worker too). But that leveled
the playing field in the wrong direction! This I/O worker property is
actually good when tuned correctly, so that you get more rather than
less checksum computation concurrency, so I abandoned that experiment.
Of course if we did that, in your case here you'd get all 32 backends
computing checksums no matter how many I/O workers are running.
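I won't post those patches here, but just to illustrate the shape of the
idea, a minimal sketch; every name below is invented, and overflow
handling, latches and the synchronous fallback are all elided:

#include "postgres.h"

#include "storage/spin.h"

/* Hypothetical per-backend queue of completed-but-unprocessed IOs. */
typedef struct PgAioCompletionQueue
{
    slock_t     lock;           /* protects head/tail/ios; init elided */
    uint32      head;
    uint32      tail;
    uint32      ios[64];        /* IO handle indexes, power-of-two size */
} PgAioCompletionQueue;

/*
 * I/O worker side: after the preadv(), just record the finished IO and
 * leave the completion callbacks (checksum verification etc.) to the
 * backend that owns it.
 */
static void
pgaio_cq_push(PgAioCompletionQueue *cq, uint32 io_index)
{
    SpinLockAcquire(&cq->lock);
    cq->ios[cq->head++ % lengthof(cq->ios)] = io_index;
    SpinLockRelease(&cq->lock);
}

/*
 * Backend side: drain the queue and run completion work locally, so the
 * checksum CPU time lands on the (many) backends instead of the (few)
 * I/O workers.
 */
static void
pgaio_cq_drain(PgAioCompletionQueue *cq)
{
    for (;;)
    {
        uint32      io_index;

        SpinLockAcquire(&cq->lock);
        if (cq->tail == cq->head)
        {
            SpinLockRelease(&cq->lock);
            break;
        }
        io_index = cq->ios[cq->tail++ % lengthof(cq->ios)];
        SpinLockRelease(&cq->lock);

        /* run completion callbacks for io_index here */
        (void) io_index;
    }
}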
The original motivation for workers processing completions, rather than
more directly simulating io_uring (which was the original and only
implementation when I got involved in this), was the sheer simplicity of
it, and this choice didn't matter much before the checksums-by-default
change went in just a little ahead of basic AIO support.

> There may be lock contention too, but I don't think that's the primary
> issue.

Interesting that it shows up so clearly for Andres but not for you. I've
made a note to try out those lock improvement patches I mentioned on a
dual socket box once I'm done with some other work that is keeping me
busy...

> I attached a test patch for illustration. It simplifies the code inside
> the LWLock to enqueue/dequeue only, and simplifies and reduces the
> wakeups by doing pseudo-random wakeups only when enqueuing. Reducing
> the wakeups should reduce the number of signals generated without
> hurting my case, because the workers are never idle. And reducing the
> instructions while holding the LWLock should reduce lock contention.
> But the patch barely makes a difference: still around 24tps.

BTW, there are already a couple of levels of signal suppression: if
workers are not idle then we don't set any latches, and even if we did,
SetLatch() only sends signals when the recipient is actually waiting,
which shouldn't happen when the pool is completely busy.

+ nextWakeupWorker = (nextWakeupWorker + 1) % io_workers;

FWIW, I experimented extensively with wakeup distribution schemes like
that ^ and others, and found that they performed significantly worse in
terms of avg(latency) and stddev(latency) for partially idle worker
pools, as discussed a bit in that auto-tuning thread. It seems to be
generally better to keep a small number of workers busy.
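To make the comparison concrete, a toy version of the two policies; the
idle[] bookkeeping is invented for illustration and is not how the real
pool tracks its state:

#include "postgres.h"

/* Toy state only. */
static int  nextWakeupWorker = 0;

/* Round-robin, as in the quoted hunk: spreads wakeups across the pool. */
static int
choose_worker_round_robin(int io_workers)
{
    int         victim = nextWakeupWorker;

    nextWakeupWorker = (nextWakeupWorker + 1) % io_workers;
    return victim;
}

/*
 * Alternative that measured better for avg/stddev latency with partially
 * idle pools: always wake the lowest-numbered idle worker, so a small hot
 * subset stays busy and the rest can stay asleep.
 */
static int
choose_worker_lowest_idle(const bool *idle, int io_workers)
{
    for (int i = 0; i < io_workers; i++)
    {
        if (idle[i])
            return i;
    }
    return -1;                  /* everyone is busy: no wakeup needed */
}

Obviously the real code has to do this under the submission queue lock
and combine it with the latch logic above; this is just to show the
policy difference.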
> What *does* make a difference is changing io_worker_queue_size. A lower
> value of 16 effectively starves the workers of work to do, and I get a
> speedup to about 28tps. A higher value of 512 gives the workers more
> chance to issue the IOs -- and more responsibility to complete them --
> and it drops to 17tps. Furthermore, while the test is running, the io
> workers are constantly at 100% (mostly verifying checksums) and the
> backends are at 50% (20% when io_worker_queue_size=512).

I would value your feedback and this type of analysis on the thread
about automatic tuning for v19. It should find a setting that keeps your
executor backends 100% busy, and some number of low-numbered I/O workers
100% busy (only because here you're doing no physical I/O that would
cause them to sleep in D state), but leave some spare worker capacity to
allow for recent variation. It should also keep the queue from
overflowing and falling back to synchronous I/O. Keeping the queue size
low is important for latency, because we don't want new queries to get
stuck behind a huge queue: each new IO should always be processed ASAP
if the storage can keep up. That's kinda why a queue size of 64 is
probably "enough": 32 workers is the highest pool size, and we don't
really want more than one extra queued IO per currently running worker
(which is debatable; see the discussion of the simplicity of that policy
vs theoretical arguments about some unknown variables). Though I do
think there are good reasons to make the queue size configurable for
developers only, and I'm planning to propose that and more as a debug_
setting, since it turns out to be quite useful for testing theories
about higher level rate control systems, discussed a bit in that thread,
and hopefully with some more patches a bit later...

You might also be interested in the phenomenon discussed in the index
prefetching thread, where concurrent reads of the *same* block from
different backends currently force a wait, which can be avoided with a
patch posted in that thread. There it was happening because index scans
tried to stream the same block repeatedly in one stream, but it's also
known to happen in torture tests much like yours where different
backends step on each other's toes; on the other hand you're reporting
100% CPU for the backends, so I guess it's not happening much... hmm,
perhaps you disabled synchronous scans, or something about your test
conditions avoids that?

> As an aside, I'm building with meson using -Dc_args="-msse4.2
> -Wtype-limits -Werror=missing-braces". But I notice that the meson
> build doesn't seem to use -funroll-loops or -ftree-vectorize when
> building checksums.c. Is that intentional? If not, perhaps slower
> checksum calculations explain my results.

Ah, good discovery.
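For anyone following along, the reason those flags matter: the hot loop
in src/include/storage/checksum_impl.h is written as N_SUMS independent
accumulators precisely so the compiler can unroll and vectorise it, and
IIRC the autoconf build adds CFLAGS_UNROLL_LOOPS/CFLAGS_VECTORIZE just
for that file. Roughly, in a simplified sketch (real constants, base
offsets and setup omitted):

#include "postgres.h"

#define N_SUMS 32
#define FNV_PRIME 16777619

/*
 * Simplified sketch of the pg_checksum_block() inner loop: N_SUMS
 * independent FNV-1a-style accumulators.  Without unrolling and
 * vectorisation the compiler chews through these one at a time and the
 * data parallelism is wasted.  Assumes nwords is a multiple of N_SUMS,
 * as it is for an 8kB page.
 */
static uint32
checksum_block_sketch(const uint32 *page, size_t nwords)
{
    uint32      sums[N_SUMS] = {0};
    uint32      result = 0;

    for (size_t i = 0; i < nwords; i += N_SUMS)
    {
        for (int j = 0; j < N_SUMS; j++)
        {
            uint32      tmp = sums[j] ^ page[i + j];

            sums[j] = tmp * FNV_PRIME ^ (tmp >> 17);
        }
    }

    for (int j = 0; j < N_SUMS; j++)
        result ^= sums[j];

    return result;
}

So if meson really isn't applying the equivalent flags, slower checksums
would be the expected result.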