Re: AIO v2.5 - Mailing list pgsql-hackers
From:           Tomas Vondra
Subject:        Re: AIO v2.5
Msg-id:         e6db33f3-50de-43d3-9d9f-747c3b376e80@vondra.me
In response to: Re: AIO v2.5 (Noah Misch <noah@leadboat.com>)
List:           pgsql-hackers

Hi,

I've been running some benchmarks comparing the io_methods, to help with resolving this PG18 open item. So here are some results and my brief analysis. I was hoping to get this out sooner, before beta2 :-( and some tests are still running, but I don't think that will affect the conclusions.

The TL;DR version
-----------------

* The "worker" method seems good, and I think we should keep it as the default. We should probably think about increasing the number of workers a bit, though - the current io_workers=3 seems to be too low and regresses in a couple of tests.

* The "sync" method seems OK too, but it's more of a conservative choice, i.e. it puts more weight on keeping the PG17 behavior / not causing regressions. But I haven't seen such regressions (with enough workers), and there are cases where "worker" is much faster. It'd be a shame to throw away that benefit.

* There might be bugs in "worker", simply because it has to deal with multiple concurrent processes etc. But I guess we'll fix those just like other bugs. I don't think that's a good argument against the "worker" default.

* All my tests were done on Linux and NVMe drives. It'd be good to do similar testing on other platforms (e.g. FreeBSD) and/or storage. I plan to do some of that, but it'd be great to cover more cases. I can help with getting my script running; a run takes ~1-2 days.

A more detailed version ...
---------------------------

The benchmark works like this:

1) It generates datasets with different data distributions (patterns in the data - uniform, linear, cyclic, ...). Each table is ~4.3GB of data.

2) It then runs queries on that, with a BETWEEN clause of a certain selectivity (matching % of the rows), forcing a particular scan type (indexscan, bitmapscan, seqscan).

3) For each query it measures the duration of a "cold" and a "warm" run. Cold means "nothing in page cache / shared buffers", while "warm" means everything is cached somewhere (by the first "cold" run).

There are a couple of relevant parameters varied between the runs:

* effective_io_concurrency = [0, 1, 16, 32]
* shared_buffers = [4GB, 32GB]

Parallel query was disabled for these tests.

The test also included PG17 for comparison, but I forgot that PG18 enables checksums by default. So the PG17 results are with checksums off, which in some cases makes PG17 seem a little bit faster. I'm re-running it with checksums enabled on PG17, and that seems to eliminate the differences (as expected).
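To make that concrete, a single tested combination boils down to roughly the following. This is only a sketch - the table/column names are made up here; the real schema and driver are in create2.sql and test-full-cost-2.sh, linked below:

    -- per-combination settings (shared_buffers is set in postgresql.conf and needs a restart)
    SET max_parallel_workers_per_gather = 0;   -- parallel query disabled
    SET effective_io_concurrency = 16;         -- one of 0, 1, 16, 32

    -- force the scan type being tested, here a bitmapscan
    SET enable_seqscan = off;
    SET enable_indexscan = off;
    SET enable_indexonlyscan = off;
    SET enable_bitmapscan = on;

    -- "cold" run: restart the server and drop the OS page cache first
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM t_uniform
     WHERE val BETWEEN 0 AND 100000;           -- bounds picked to hit the target selectivity

    -- "warm" run: repeat the same query, now served from shared buffers / page cache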
Scripts
-------

The benchmark scripts/results/charts are available here:

  https://github.com/tvondra/iomethod-tests

The SQL data generator and the script driving the benchmark:

* https://github.com/tvondra/iomethod-tests/blob/master/create2.sql
* https://github.com/tvondra/iomethod-tests/blob/master/test-full-cost-2.sh

The script is fairly simple. It first generates the combinations of parameters to test, randomizes them, and then tests each of them. Results are recorded in a CSV file "results.csv".

Test machines
-------------

As usual, I did this on two machines I have at home:

* ryzen (~2024)
  - Ryzen 9 9900X (12 cores)
  - 64GB RAM
  - 4x NVMe SSD (Samsung 990 PRO 1TB) in RAID0

* xeon (~2016)
  - 2x Xeon 2699v4 (44 cores)
  - 64GB RAM
  - 1x NVMe SSD (WDC Ultrastar DC SN640 960GB)

The ryzen is much beefier in terms of I/O - it can do ~20GB/s in sequential reads. The xeon is much more modest (and generally older).

Charts
------

Most of the repository is PDF charts generated from the CSV results. There's a README explaining the naming convention of the charts, and some other details. But in general there are two "types" of charts.

The first type fixes all parameters except the "dataset", and then compares the results for all datasets. For example, the attached "ryzen-rows-cold-32GB-16-unscaled.pdf" shows results for:

* ryzen machine
* selectivity calculated as "% of rows" (not pages)
* cold runs, i.e. data not cached
* shared_buffers=32GB
* effective_io_concurrency=16
* unscaled - each plot has a custom y-range

There are "scaled" plots too, with all the plots scaled to the same y-range. That makes it easier to compare plots within the same PDF (e.g. how the different scans perform), but it's irrelevant for picking the io_method default, and it often hides details in some plots.

The other type fixes the "dataset" and varies effective_io_concurrency, to see how it affects duration (e.g. due to prefetching). For example, the attached ryzen-rows-cold-32GB-uniform-unscaled.pdf shows that for the "uniform" dataset.

The charts show behavior for the whole selectivity range, from 0% to 100%, for each scan type. Of course, a scan type may not be the right choice for a given selectivity point - we'd probably not pick an index scan for 50% selectivity, or a seqscan for 0.01%. But I think it makes sense to still consider the whole range, for robustness. We make estimation/planning mistakes fairly regularly, so it's important to care about what happens in those cases.

Findings
--------

I'm attaching only three PDFs with charts from the cold runs, to keep the e-mail small (each PDF is ~100-200kB). Feel free to check the other PDFs in the git repository, but it's all very similar and the attached PDFs are quite representative.

Some basic observations:

a) index scans

There's almost no difference for indexscans, i.e. the middle column in the PDFs. There's a bit of variation on some of the cyclic/linear datasets, but it seems more like random noise than a systematic difference. That's not all that surprising, considering index scans don't really use read_stream yet, so there's no prefetching etc.

b) bitmapscans (ryzen-bitmapscan-uniform.png)

Bitmapscans are much more affected, with a lot of differences. See the attached PNG image, for example. It shows that at ~10% selectivity the query is 2x faster with io_method=worker than with "sync". It also shows that PG18 with io_method=sync is much faster than PG17, which is a bit surprising, TBH. At first I thought it was due to "checksums=off" on PG17 in this run, but that doesn't seem to be the case - results from PG17 with checksums=on show the same thing. In fact, the difference increases, which is rather bizarre. (There's a branch "run2-17-checksums-on" with results from that run.) I believe this is likely due to the bitmapscan prefetch fixes, which we decided not to backpatch (IIRC), because the results with e_i_c=0 show no difference between PG17 and PG18/sync.

The "ryzen" results, however, demonstrate that 3 workers may be too low. The timing spikes to ~3000ms (at ~1% selectivity), before quickly dropping back to ~1000ms. The other datasets show a similar difference. With 12 workers there's no such problem.
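For anyone wanting to try a run with more workers, it's just a config change. A minimal sketch (my assumption being that io_workers can be changed with a reload, while io_method itself requires a restart):

    ALTER SYSTEM SET io_method = 'worker';   -- requires a server restart to take effect
    ALTER SYSTEM SET io_workers = 12;        -- instead of the default 3
    SELECT pg_reload_conf();                 -- should be enough for the io_workers change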
On "xeon" the differences are much smaller, or not visible at all. My guess is that's due to the hardware differences (slower CPU / single NVMe for storage). But I'm also wondering what would happen with multiple queries doing bitmapscans concurrently - remember the benchmark is just a single backend running queries, and parallel query is disabled.

The problems with too few workers are much more visible in the log-scale charts (just add "-log" at the end). See the attached example (ryzen-bitmapscan-uniform-log.png). It shows how much slower it is compared to all the other methods (effectively a regression compared to PG17) for a huge chunk of the selectivity range. I'm sure we'd pick a bitmapscan for some of those queries.

c) seqscan (ryzen-seqscan-uniform.png)

This shows a lot of significant differences between the io_method options. And by significant I mean a ~2x difference, on both machines. On "ryzen" the "worker" method takes ~800ms, while "sync" takes ~1700ms. PG17 seems a bit faster, but that's due to the missing checksums; with checksums it's almost exactly the same as PG18/sync. io_uring is in between at ~1200ms. On "xeon" it's very similar, except io_uring does better and worker/3 a bit worse (which I think is another reason to increase the io_workers default).

I believe these differences can be explained by where the work happens for the different io_method cases - and by work I mean memcpy and checksum verification. With "sync" and "io_uring" it happens in the backend itself, while with "worker" it's offloaded to the worker processes. So it's single-process vs. multi-process memory bandwidth. Which is nice, but we need to have enough workers ;-)

d) warm runs

The "warm" runs are very uninteresting - there's literally no difference between the various io_methods. On "ryzen" it's very smooth, while on "xeon" the "seqscan" is very noisy. I believe this is due to NUMA effects (two sockets etc.), but I need to verify that. It affects all io_methods equally, so it's irrelevant here.

e) indexscan regression (ryzen-indexscan-uniform-pg17-checksums.png)

There's an interesting difference I noticed in the run with checksums on PG17. The full PDF is available here:

  https://github.com/tvondra/iomethod-tests/blob/run2-17-checksums-on/ryzen-rows-cold-32GB-16-unscaled.pdf

The interesting thing is that PG17 indexscans on the uniform dataset got a little bit faster. In the PDF attached to this message it's exactly on par with PG18, but here it got a bit faster - which makes no sense if it also has to verify checksums. I haven't had time to investigate this yet.

Conclusion
----------

That's all I have at the moment. I still think it makes sense to keep io_method=worker, but bump io_workers a bit higher. Could we also add some suggestions to the docs on how to pick a good value?

You can probably find more interesting results in the other PDFs in the repository. I'm certainly going to keep looking. You might also run the benchmark on different hardware, and either build/publish the plots somewhere, or just give me the CSV and I'll do that. Better to find strange stuff / regressions now.

The repository also has branches with plots showing results with the WIP indexscan prefetching (excluded from the PDFs presented here). The conclusions are similar to what we found here - "worker" is good with enough workers, and io_uring is good too. Sync has issues for some of the datasets, but still helps a lot.

regards

--
Tomas Vondra