From: Tomas Vondra
Subject: Re: AIO v2.5
Msg-id: e6db33f3-50de-43d3-9d9f-747c3b376e80@vondra.me
In response to: Re: AIO v2.5 (Noah Misch <noah@leadboat.com>)
List: pgsql-hackers

Hi,

I've been running some benchmarks comparing the io_methods, to help
with resolving this PG18 open item. So here are some results, and my
brief analysis of them.

I was hoping to get this out sooner, before beta2 :-( and some tests
are still running, but I don't think they'll affect the conclusions.


The TL;DR version
-----------------

* The "worker" method seems good, and I think we should keep it as a
default. We should probably think about increasing the number of workers
a bit, the current io_workers=3 seems to be too low and regresses in a
couple tests.

* The "sync" seems OK too, but it's more of a conservative choice, i.e.
more weight for keeping the PG17 behavior / not causing regressions. But
I haven't seen that (with enough workers). And there are cases when the
"worker" is much faster. It'd be a shame to throw away that benefit.

* There might be bugs in "worker", simply because it has to deal with
multiple concurrent processes etc. But I guess we'll fix those just like
other bugs. I don't think it's a good argument against "worker" default.

* All my tests were done on Linux and NVMe drives. It'd be good to do
similar testing on other platforms (e.g. FreeBSD) and/or storage. I plan
to do some of that, but it'd be great to cover more cases. I can help
with getting my script running; a run takes ~1-2 days.


A more detailed version ...
---------------------------

The benchmark I did works like this:

1) It generates datasets with different data distributions (patterns in
the data - uniform, linear, cyclic, ...). Each table is ~4.3GB of data.
(A sketch of one such generator follows after this list.)

2) It then runs queries on that, with a BETWEEN clause with a certain
selectivity (matching % of the rows), forcing a particular scan type
(indexscan, bitmapscan, seqscan).

3) For each query it measures the duration of "cold" and "warm" runs.
Cold means "nothing in the page cache / shared buffers", while "warm"
means everything is cached somewhere (by the first "cold" run).
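
For illustration, generating one dataset might look roughly like this
(a minimal sketch - the real generator is create2.sql, linked below;
the database/table/column names and the row count are made up):

  # roughly uniform distribution; row count chosen so the table is a
  # few GB, in the same ballpark as the real datasets
  psql test -c "CREATE TABLE uniform_test AS
                SELECT i AS id,
                       (random() * 100000)::int AS val, -- uniform values
                       md5(i::text) AS padding          -- widens the rows
                FROM generate_series(1, 50000000) s(i)"
  psql test -c "CREATE INDEX ON uniform_test (val)"
  psql test -c "VACUUM ANALYZE uniform_test"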

A couple of relevant parameters were varied between the runs:

* effective_io_concurrency = [0, 1, 16, 32]
* shared_buffers = [4GB, 32GB]

Parallel query was disabled for these tests.
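
Put together, measuring one combination might look like this (a
simplified sketch, not the actual test-full-cost-2.sh; it assumes a
$PGDATA environment variable and sudo access, and reuses the
hypothetical uniform_test table from above; shared_buffers is set in
postgresql.conf before the restart):

  # restart to empty shared buffers, then drop the OS page cache
  pg_ctl -D "$PGDATA" restart
  sync
  echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null

  # first iteration is the "cold" run, the second one the "warm" run
  for run in cold warm; do
      psql test -c "SET effective_io_concurrency = 16;
                    SET max_parallel_workers_per_gather = 0;
                    SET enable_seqscan = off;     -- with indexscan off too,
                    SET enable_indexscan = off;   -- this forces a bitmapscan
                    EXPLAIN (ANALYZE, COSTS OFF)
                    SELECT * FROM uniform_test
                     WHERE val BETWEEN 0 AND 10000"
  done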

The test also included PG17 for comparison, but I forgot that PG18
enables checksums by default. So the PG17 results are with checksums
off, which in some cases makes PG17 seem a little bit faster. I'm
re-running it with checksums enabled on PG17, and that seems to
eliminate the differences (as expected).


Scripts
-------

The benchmark scripts/results/charts are available here:

  https://github.com/tvondra/iomethod-tests

The SQL data generator and the script driving the benchmark:

* https://github.com/tvondra/iomethod-tests/blob/master/create2.sql

* https://github.com/tvondra/iomethod-tests/blob/master/test-full-cost-2.sh

The script is fairly simple. It first generates the combinations of
parameters to test, randomizes them, and then tests each of them.
Results are recorded in a CSV file "results.csv".
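
In outline, the loop looks something like this (a stripped-down sketch;
run_query is a hypothetical stand-in for the per-combination logic
sketched above):

  # generate all parameter combinations, shuffle them, then test each one
  for dataset in uniform linear cyclic; do
      for scan in seqscan indexscan bitmapscan; do
          for eic in 0 1 16 32; do
              echo "$dataset $scan $eic"
          done
      done
  done | shuf | while read -r dataset scan eic; do
      duration=$(run_query "$dataset" "$scan" "$eic")   # duration in ms
      echo "$dataset,$scan,$eic,$duration" >> results.csv
  done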


Test machines
-------------

As usual, I did this on two machines I have at home:

* ryzen (~2024)
  * Ryzen 9 9900X (12 cores)
  * 64GB RAM
  * 4x NVMe SSD (Samsung 990 PRO 1TB) in RAID0

* xeon (~2016)
  * 2x Xeon 2699v4 (44 cores)
  * 64GB RAM
  * 1x NVMe SSD WDC Ultrastar DC SN640 960GB

The ryzen is much beefier in terms of I/O; it can do ~20GB/s in
sequential reads. The xeon is much more modest (and generally older).


Charts
------

Most of the repository is PDF charts generated from the CSV results.
There's a README explaining the naming convention of the charts, and
some other details.

But in general there are two "types" of charts. The first type fixes
all parameters except for "dataset", and then shows a comparison of
results across all datasets.

For example the attached "ryzen-rows-cold-32GB-16-unscaled.pdf" shows
results for:

 * ryzen machine
 * selectivity calculated as "% of rows" (not pages)
 * cold runs, i.e. data not cached
 * shared_buffers=32GB
 * effective_io_concurrency=16
 * unscaled - each plot has custom y-range

There are "scaled" plots too, with all the plots scaled to the same
y-range. This makes it easier to confirm plots from the same PDF (e.g.
how the different scans perform). But that's irrelevant for picking the
io_method default. and it often hides details in some plots.

The other type fixes "dataset" and varies effective_io_concurrency, to
see how it affects duration (e.g. due to prefetching). For example the
attached ryzen-rows-cold-32GB-uniform-unscaled.pdf shows that for the
"uniform" data set.

The charts show behavior for the whole selectivity range, from 0% to
100%, for each scan type. Of course, the scan type may not be the right
choice for that selectivity point. For example, we'd probably not pick
an index scan for 50% selectivity, or a seqscan for 0.01%. But I think
it makes sense to still consider the whole range, for robustness. We
make estimation/planning mistakes fairly regularly, so it's important to
care about what happens in those cases.


Findings
--------

I'm attaching only three PDFs with charts from the cold runs, to keep
the e-mail small (each PDF is ~100-200kB). Feel free to check the other
PDFs in the git repository, but it's all very similar and the attached
PDFs are quite representative.

Some basic observations:

a) index scans

There's almost no difference for indexscans, i.e. the middle column in
the PDFs. There's a bit of variation on some of the cyclic/linear data
sets, but it seems more like random noise than a systematic difference.

Which is not all that surprising, considering index scans don't really
use read_stream yet, so there's no prefetching etc.

b) bitmapscans (ryzen-bitmapscan-uniform.png)

Bitmapscans are much more affected, with a lot of differences. See the
attached PNG image, for example. It shows that at ~10% selectivity the
query is 2x faster with io_method=worker than with "sync".

And it also shows that PG18 with io_method=sync is much faster than
PG17. Which is a bit surprising, TBH. At first I thought it was due to
the "checksums=off" on PG17 in this run, but that doesn't seem to be
the case - results from PG17 with checksums=on show the same thing. In
fact, the difference increases, which is rather bizarre. (There's a
branch "run2-17-checksums-on" with results from that run.)

I believe this is likely due to the bitmapscan prefetch fixes, which we
decided to not backpatch (IIRC), because the results with e_i_c=0 show
no difference between PG17 and PG18/sync.

The "ryzen" results however demonstrate that 3 workers may be too low.
The timing spikes to ~3000ms (at ~1% selectivity), before quickly
dropping back to ~1000ms. The other datasets show similar difference.
With 12 workers, there's no such problem.

On "xeon" the differences are much smaller, or not visible at all. My
guess it's due to the hardware differences (slower CPU / single NVMe for
storage).

But I'm also wondering what would happen if there were multiple queries
doing bitmapscans. Consider that the benchmark is only a single backend
running queries, with parallel query disabled.

The problems with too few workers are much more visible in the
log-scale charts (just add "-log" at the end). See the attached example
(ryzen-bitmapscan-uniform-log.png). It shows how much slower "worker"
with 3 workers is compared to all the other methods (effectively a
regression compared to PG17) for a huge chunk of the selectivity range.
I'm sure we'd pick bitmapscan for some of those queries.


c) seqscan (ryzen-seqscan-uniform.png)

This shows a lot of significant differences between different io_method
options. And by significant I mean ~2x difference, on both machines.

On "ryzen" the "worker" takes ~800ms, while "sync" is ~1700ms. PG17
seems a bit faster, but that's due to missing checksums, and with
checksums it's almost exactly the same as PG18/sync. io_uring is in
between at ~1200ms.

On "xeon" it's very similar, but io_uring does better and worker/3 a bit
worse (which I think is another reason to increase io_workers default).

I believe these differences can be explained by where the work happens
for the different io_method cases - and by "work" I mean memcpy and
checksum verification. With "sync"/"io_uring" it happens in the backend
itself, while with "worker" it's offloaded to the worker processes. So
it's single-process vs. multi-process memory bandwidth.

Which is nice, but we need to have enough of them ;-)


d) warm runs

The "warm" runs are very uninteresting. There's literally no difference
between the various io_methods. On "ryzen" it's very smooth, while on
"xeon" the "seqscan" is very noisy. I believe this is due to NUMA
effects (two sockets etc.), but I need to verify that. It affects all
io_methods equally, so it's irrelevant for this comparison.


e) indexscan regression (ryzen-indexscan-uniform-pg17-checksums.png)

There's an interesting difference I noticed in the run with checksums
on PG17. The full PDF is available here:

https://github.com/tvondra/iomethod-tests/blob/run2-17-checksums-on/ryzen-rows-cold-32GB-16-unscaled.pdf

The interesting thing is that PG17 indexscans on the uniform dataset
got a little bit faster. In the attached PDF it's exactly on par with
PG18, but here it got a bit faster - which makes no sense if it now
also has to verify checksums. I haven't had time to investigate this
yet.


Conclusion
----------

That's all I have at the moment. I still think it makes sense to keep
io_method=worker, but bump io_workers a bit higher. Could we also add
some suggestions to the docs on how to pick a good value?
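
For anyone who wants to experiment, bumping the worker count is simple
(the value 8 below is just an example, not a recommendation derived
from these results):

  psql -c "ALTER SYSTEM SET io_workers = 8"
  psql -c "SELECT pg_reload_conf()"   # io_workers is picked up on reload

  # io_method itself can only be changed with a restart
  psql -c "ALTER SYSTEM SET io_method = 'worker'"
  pg_ctl -D "$PGDATA" restart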

You can probably find more interesting results in the other PDFs in the
repository. I'm certainly going to keep looking.

You might also run the benchmark on different hardware, and either
build/publish the plots somewhere, or just give me the CSV and I'll do
that. Better to find strange stuff / regressions now.

The repository also has branches with plots showing results with WIP
indexscan prefetching. (Those are excluded from the PDFs I presented
here.) The conclusions are similar to what we found here - "worker" is
good with enough workers, and io_uring is good too. "sync" has issues
for some of the data sets, but it still helps a lot.


regards

-- 
Tomas Vondra
