Kernel AIO on FreeBSD, macOS and a couple of other Unixen - Mailing list pgsql-hackers

From Thomas Munro
Subject Kernel AIO on FreeBSD, macOS and a couple of other Unixen
Msg-id CA+hUKGJwxOqoyfj7F12BgJR1P6s4aOPgMcxvzg-=qq+OiJ6Tdg@mail.gmail.com
List pgsql-hackers
Hi,

Here is a proof-of-concept patch for io_method=posix_aio.

It works pretty well on FreeBSD, making good use of a couple of
extensions.  It's working better than all my previous attempts on
macOS, but I guess it's more of a curiosity for developers, and scan
performance is also affected by the lack of vectored I/O, meaning
more and smaller IOs depending on buffer pool state.  It also passes
tests on NetBSD, and once worked on AIX.  That's the four surviving
Unixen with general kernel-mode POSIX AIO, as far as I know.

Up against io_method=worker in simple tests, you sometimes can't win,
because the workers provide extra help with checksum validation when
they run completions, and if the data is served straight out of the
kernel cache that might be a bigger factor; but that's not specific to
this IO method.

It also compiles and passes tests on Linux with glibc's user space
thread-based POSIX AIO implementation, which could be convenient for
developers, but it's emphatically not a target, it's utterly terrible.

The main problem I faced was cross-process completions.  As
implemented, it slows down concurrent scans with colliding buffers by
introducing a bunch of cross-process ping-pong.  The only way I can
think of to do better involves a helper thread (an approach
tentatively used in experiments for Windows AIO, but we already had a
helper thread there so that's a free pass).  The real solution is
general multi-threading to make that pain all go away (and uhh replace
it with other pain).

Another problem was the lack of a completion queue on macOS.  I think
the history is that they ported kqueue from FreeBSD but removed
EVFILT_AIO because they didn't have AIO yet, and then a couple of
years later introduced AIO but never connected them together.
Recently I hit on a simple new idea that doesn't require massive
amounts of IO polling to recreate a completion queue, and it seems to
have legs.

Some changes and problems that came up along the way as I pulled this
old AIOv1 code apart and put it back together again for AIOv2:

 * I adopted the same general queue/completion_lock-per-backend design
that io_method=io_uring uses in AIOv2; I like it much better this way,
and it has a nice just-delete-lots-of-lines pathway to a
multi-threaded mode
 * I don't think it's right for this IO method to PANIC if aio_read()
etc fail with EAGAIN: that makes some sense for io_method=io_uring
because it sizes its own submission queue perfectly, but with this
API the equivalent resources are configured elsewhere, e.g. sysctl -a
| grep aio.  So I think it should follow io_method=worker instead and
fall back to synchronous execution if there's an unknowable system
limit in the way (in practice this seems to be easiest to hit on a Mac
with default sysctls)
 * To do that, I had to deviate slightly from the contract (and name)
of pgaio_prepare_submit() and call it *after* submitting.  But if
submission fails with EAGAIN, there was no supported way to set the
PGAIO_HS_SYNCHRONOUS flag while already in PGAIO_HS_STAGED state
before going to PGAIO_HS_SUBMITTED, and without that flag a backend
might concurrently reach wait_one() and hang; so I invented
pgaio_io_prepare_submit_synchronously(), which just sets that flag for
you first
 * io_method=worker should of course use that too: it has no
wait_one(), so we probably just never thought about that, but
pg_aios.f_sync might as well tell the truth, as a diagnostic clue
 * In that new function, I discovered that if I didn't insert
pg_read_barrier() before ioh->flags |= PGAIO_HS_SYNCHRONOUS, the flag
could occasionally be lost on my Mac laptop.  That would normally
imply a memory model screwup, except that I don't expect ioh->flags to
be written by any other backend, not in a previous generation, not
ever after initialisation, which made me begin to wonder if I might be
seeing something related to what Alexander and Konstantin reported,
despite knowing that when you start to suspect the compiler or CPU is
wrong it's usually just time to forget about computers and go for a
walk.  What am I missing?  The test workload was deliberately
generating a lot of concurrent scans of the same table to exercise the
cross-process stuff, which means lots of asynchronous signals being
handled while submission is underway...
