Re: AIO v2.5 - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: AIO v2.5
Date
Msg-id CAEze2Wh-6VqUEgqjhmYK_v+ca+98DPGpt1nj08QceF4Ka0mf6g@mail.gmail.com
Whole thread Raw
In response to Re: AIO v2.5  (Andres Freund <andres@anarazel.de>)
Responses Re: AIO v2.5
List pgsql-hackers
On Wed, 9 Jul 2025 at 16:59, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-07-09 13:26:09 +0200, Matthias van de Meent wrote:
> > I've been going through the new AIO code as an effort to rebase and
> > adapt Neon to PG18. In doing so, I found the following
> > items/curiosities:
> >
> > 1. In aio/README.md, the following code snippet is found:
> >
> > [...]
> > pgaio_io_set_handle_data_32(ioh, (uint32 *) buffer, 1);
> > [...]
> >
> > I believe it would be clearer if it took a reference to the buffer:
> >
> > pgaio_io_set_handle_data_32(ioh, (uint32 *) &buffer, 1);
> >
> > The main reason here is that common practice is to have a `Buffer
> > buffer;` whereas a Buffer * is more commonly plural.
>
> It's also just simply wrong as-is :/. Interpreting the buffer id as a pointer
> obviously makes no sense...

Given that the snippet didn't contain type indications for buffer upto
that point, technically the buffer variable could've been defined as
`Buffer* buffer;` which would've been type-correct. That would be very
confusing however, hence the suggested change.

After your mail, I also noticed the later snippet which should be updated, too:

```
-smgrstartreadv(ioh, operation->smgr, forknum, blkno,
-               BufferGetBlock(buffer), 1);
+void *page = BufferGetBlock(buffer);
+smgrstartreadv(ioh, operation->smgr, forknum, blkno,
+               &page, 1);
```

> > 3. I noticed that there is AIO code for writev-related operations
> > (specifically, pgaio_io_start_writev is exposed, as is
> > PGAIO_OP_WRITEV), but no practical way to excercise that code: it's
> > not called from anywhere in the project, and there is no way for
> > extensions to register the relevant callbacks required to make writev
> > work well on buffered contents. Is that intentional?
>
> Yes.  We obviously do want to support writes eventually, and it didn't seem
> useful to not have the most basic code for writes in the AIO infrastructure.
>
> You could still use it to e.g. write out temporary file data or such.

Yes, though IIUC that would require an implementation of at least
PgAioTargetInfo for such a use case (it's definitely not a SMGR
target), which currently isn't available and can't be registered
dynamically by an extension. Or maybe did I miss something?

> > and there is no way for extensions to register the relevant callbacks
> > required to make writev work well on buffered contents. Is that intentional?
>
> FWIW, the problem with writev for buffered IO is not so much with the AIO
> infrastructure, but with buffer locking and sync.c...

Maybe "buffered contents" wasn't the right phrasing, but my main point
remains - I can't seem to find a way to make use of the
pgaio_start_writev API in the 18 branch that doesn't require patching
core code or heavily relying on the behaviour of various pgaio
internals:

The only valid pgaio target is PGAIO_TID_SMGR, which effectively
forces you to use buffered relations as IO target (either shared, or
backend-local/temp). Then, there are no callbacks defined for writev,
so the user will effectively have to do setup and cleanup on their
own, making worker-based IO practically impossible, and requiring the
user to handle all lock acquisition/releases in their own process'
time instead of in the AIO paths - requiring significantly more
involved error handling.


(PS. I'm not quite 100% sure that it is impossible to use, just that
there are rather few handles available for using this part of the new
tool, and it seems completely untested in the PG18 branch)

-----

Something else I've just noticed is the use of int32 in
PgAIOHandle->result. In sync and worker mode, pg_preadv and pg_pwritev
return ssize_t, which most modern systems can't fit in int32 (the
output was int before, then size_t, then ssize_t: [0]). While not
directly an issue in default PG18 due to the use of 1GB relation
segments capping the max IO size for SMGR-managed IOs (and various
other code-level constraints), this may have more issues when an
extension starts bulk-reading data on a system compiled with
RELSEG_SIZE >= 2GB; I can't find any protective checks against
overflows in downcasting the IO result.

I did notice Linux seems to guarantee it won't read more than
0x7FFF_F000 in any one operation, but I can't find any similar
guarantees for e.g. MacOS or Windows (well, our win32 port limits
pg_pread/writev to 1GB, which seems sufficient for this exact case).

If we're keeping int32, then it would be great if the considerations
for why it's OK to use 32 bits to store the ssize_t vlaue would be
included in the docs/comments.


Kind regards,

Matthias van de Meent
Databricks

[0] https://postgr.es/m/flat/1672202.1703441340%40sss.pgh.pa.us



pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: Re: Remaining dependency on setlocale()
Next
From: "David E. Wheeler"
Date:
Subject: Re: encode/decode support for base64url