Re: index prefetching - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: index prefetching
Msg-id: 8c86c3a6-074e-6c88-3e7e-9452b6a37b9b@enterprisedb.com
In response to: Re: index prefetching (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: index prefetching
List: pgsql-hackers
Hi,

I have results from the new extended round of prefetch tests. I've pushed everything to

https://github.com/tvondra/index-prefetch-tests-2

There are the scripts I used to run this (run-*.sh), raw results, and various kinds of processed summaries (pdf, ods, ...) that I'll mention later.

As before, this tests a number of query types:

- point queries with btree and hash (equality)
- ORDER BY queries with btree (inequality + order by)
- SAOP queries with btree (column IN (values))

It's probably futile to go through the details of all the tests - it's easier to go through the (hopefully fairly readable) shell scripts. But in principle, the scripts run some simple queries while varying both the data set and the workload:

- the data set may be random, sequential or cyclic (with different cycle lengths)
- the number of matches per value differs (i.e. the equality condition may match 1, 10, 100, ..., 100k rows)
- each query forces a particular scan type (indexscan, bitmapscan, seqscan)
- each query is executed twice - the first run (right after restarting the DB and dropping caches) is uncached, the second run should have the data cached
- each query is executed 5x with different parameters (so 10x in total)

This is tested with three basic data sizes - fits into shared buffers, fits into RAM, and exceeds RAM. The sizes are roughly 350MB, 3.5GB and 20GB (i5) / 40GB (xeon).

Note: The xeon has 64GB RAM, so technically the largest scale fits into RAM. But that should not matter, thanks to the drop-caches and restart.

I also attempted to pin the backend to a particular core, in an effort to eliminate scheduling-related noise. It's mostly what taskset does, but I did that from an extension (https://github.com/tvondra/taskset) which allows me to do it as part of the SQL script.

For the results, I'll talk about the v1 patch (as submitted here) first. I'll use the PDF results in the "pdf" directory, which generally show a pivot table by different test parameters, comparing the results by different parameters (prefetching on/off, master/patched). Feel free to do your own analysis from the raw CSV data, ofc.

For example, this:

https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/patch-v1-point-queries-builds.pdf

shows how the prefetching affects timing for point queries with different numbers of matches (1 to 100k). The numbers are timings for the master and patched builds. The last group is (patched/master), so the lower the number the better - 50% means the patch makes the query 2x faster. There's also a heatmap, with green=good and red=bad, which makes it easier to spot the cases that got slower/faster.

The really interesting stuff starts on page 7 (in this PDF), because the first couple pages are "cached" (so those are more about measuring the overhead when prefetching has no benefit).

Right on page 7 you can see a couple cases with a mix of slower/faster results, roughly in the +/- 30% range. However, this is unrelated to the patch, because those are results for bitmapheapscan. For indexscans (page 8), the results are invariably improved - the more matches the better (up to ~10x faster for 100k matches).

Those were results for the "cyclic" data set. For the random data set (pages 9-11) the results are pretty similar, but for "sequential" data (pages 11-13) the prefetching is actually harmful - there are red clusters, with up to 500% slowdowns.
I'm not going to explain the summary for SAOP queries (https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/patch-v1-saop-queries-builds.pdf) in detail - the story is roughly the same, except that there are more tested query combinations (because we also vary the pattern in the IN() list - the number of values etc.).

So, the conclusion from this is - generally very good results for the random and cyclic data sets, but pretty bad results for sequential. And even for the random/cyclic cases there are combinations (especially with many matches) where prefetching doesn't help or even hurts.

The only way to deal with this is (I think) a cheap way to identify and skip inefficient prefetches, essentially by doing two things:

a) remembering the recently prefetched blocks (say, the last 1000+) and not prefetching them over and over

b) identifying sequential access patterns, where readahead already seems to do a pretty good job (although I've heard some disagreement)

I've been thinking about how to do this. Doing (a) seems pretty hard, because on the one hand we want to remember a fair number of blocks, and on the other hand we want the check "did we prefetch X?" to be very cheap. So a hash table seems nice. OTOH we want to expire "old" blocks and only keep the most recent ones, and a hash table doesn't really support that. Perhaps there is a great data structure for this, not sure.

But after thinking about this I realized we don't need perfect accuracy - it's fine to have false positives/negatives. It's fine to forget we already prefetched block X and prefetch it again, or to mistakenly believe we prefetched it and skip it. It's not a matter of correctness, just a matter of efficiency - after all, we can't know whether the block is still in memory anyway, we only know whether we prefetched it fairly recently.

This led me to a "hash table of LRU caches" thing. Imagine a tiny LRU cache that's small enough to be searched linearly (say, 8 blocks). And we have many of them (e.g. 128), so that in total we can remember 1024 block numbers. Now, every block number is mapped to a single LRU by hashing, as if we had a hash table:

    index = hash(blockno) % 128

and we only use that one LRU to track the block. It's tiny, so we can search it linearly.

To expire prefetched blocks, there's a counter incremented every time we prefetch a block, and we store it in the LRU with the block number. When checking the LRU we ignore old entries (with a counter value more than 1000 back), and we also evict/replace the oldest entry when needed.

This seems to work pretty well for the first requirement, but it doesn't allow identifying the sequential pattern cheaply. To do that, I added a tiny queue with a couple entries that can be checked to see whether the last few blocks were sequential. A minimal sketch of both mechanisms follows.
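To make the idea concrete, here's a self-contained sketch of the two mechanisms in C. All the names (PrefetchCache, prefetch_wanted, hash_block), the queue length, and the multiplicative hash are mine, chosen for illustration - the actual 0002+0003 patches may structure this quite differently:

#include <stdbool.h>
#include <stdint.h>

#define PREFETCH_LRU_COUNT  128     /* number of tiny LRU caches */
#define PREFETCH_LRU_SIZE   8       /* entries per LRU, searched linearly */
#define PREFETCH_EXPIRE_AGE 1000    /* ignore entries older than this */
#define SEQ_QUEUE_SIZE      4       /* how many recent blocks to remember */

typedef uint32_t BlockNumber;       /* PostgreSQL defines this already */

typedef struct PrefetchEntry
{
    BlockNumber block;      /* prefetched block number */
    uint64_t    counter;    /* global counter value at prefetch time */
} PrefetchEntry;

typedef struct PrefetchCache
{
    /* counter == 0 means the slot is empty */
    PrefetchEntry entries[PREFETCH_LRU_COUNT][PREFETCH_LRU_SIZE];
    uint64_t      counter;                  /* bumped on every prefetch */
    BlockNumber   seq_queue[SEQ_QUEUE_SIZE]; /* last few requested blocks */
    uint64_t      seq_index;                 /* number of requests so far */
} PrefetchCache;

/* stand-in for a proper hash function (e.g. hash_bytes) */
static inline uint32_t
hash_block(BlockNumber block)
{
    return (uint32_t) (block * 2654435761u);
}

/*
 * Return true if the caller should prefetch the block, false if we
 * prefetched it recently or the access pattern looks sequential (in
 * which case OS readahead should already be doing the work for us).
 */
static bool
prefetch_wanted(PrefetchCache *cache, BlockNumber block)
{
    PrefetchEntry *lru;
    PrefetchEntry *victim;

    /* sequential iff the previous requests were block-1, block-2, ... */
    bool sequential = (cache->seq_index >= SEQ_QUEUE_SIZE);

    for (int i = 1; sequential && i <= SEQ_QUEUE_SIZE; i++)
    {
        uint64_t pos = (cache->seq_index - i) % SEQ_QUEUE_SIZE;

        if (cache->seq_queue[pos] + i != block)
            sequential = false;
    }

    /* remember this request for future pattern checks */
    cache->seq_queue[cache->seq_index++ % SEQ_QUEUE_SIZE] = block;

    if (sequential)
        return false;

    /* every block maps to exactly one tiny LRU */
    lru = cache->entries[hash_block(block) % PREFETCH_LRU_COUNT];
    victim = &lru[0];

    for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
    {
        /* recent enough hit => skip the prefetch */
        if (lru[i].counter > 0 && lru[i].block == block &&
            cache->counter - lru[i].counter < PREFETCH_EXPIRE_AGE)
            return false;

        /* track the oldest entry as the eviction candidate */
        if (lru[i].counter < victim->counter)
            victim = &lru[i];
    }

    /* not found (or expired) - replace the oldest entry and prefetch */
    victim->block = block;
    victim->counter = ++cache->counter;
    return true;
}

The caller would zero-initialize the struct and gate every prefetch on it, roughly like this (PrefetchBuffer being the existing bufmgr call):

    PrefetchCache cache = {0};
    ...
    if (prefetch_wanted(&cache, blkno))
        PrefetchBuffer(rel, MAIN_FORKNUM, blkno);

One open design choice in the sketch: a request classified as sequential is recorded only in the queue, not in the LRU, on the theory that readahead covers it anyway.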
And this is what the attached 0002+0003 patches do. There are PDFs with results for this build prefixed with "patch-v3", and the results are pretty good - the regressions are largely gone.

It's even clearer in the PDFs comparing the impact of the two patches:

https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/comparison-point.pdf
https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/comparison-saop.pdf

which simply show the "speedup heatmap" for the two patches - the "v3" heatmap has far fewer red regression clusters.

Note: The comparison-point.pdf summary has another group of columns illustrating whether this scan type would actually be used, with "green" meaning "yes". This provides additional context, because e.g. for the "noisy bitmapscans" it's all white, i.e. without setting the GUCs the optimizer would pick something else (hence it's a non-issue).

Let me know if the results are not clear enough (I tried to cover the important stuff, but I'm sure there are lots of details I didn't cover), or if you think some other summary would be better.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company