Re: index prefetching - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: index prefetching |
Date | |
Msg-id | CAH2-WzkPh+L2u8_4jG=NgGgzFNqW7ZZhSxGb6mJR=2YdouL1_Q@mail.gmail.com Whole thread Raw |
In response to | Re: index prefetching (Tomas Vondra <tomas@vondra.me>) |
Responses |
Re: index prefetching
Re: index prefetching |
List | pgsql-hackers |
On Tue, Aug 5, 2025 at 4:56 PM Tomas Vondra <tomas@vondra.me> wrote: > Probably. It was hard to predict which values will be interesting, maybe > we can pick some subset now. I'll start by just doing larger steps, I > think. Maybe increase by 4x rather than 2x, that'll reduce the number of > combinations a lot. Also, I plan to stick to fillfactor=20, it doesn't > seem to have a lot of impact anyway. I don't think that fillfactor matters all that much, either way. A low setting provides a simple way of simulating "wide" heap tuples, but that probably isn't going to make the crucial difference. It's not like the TPC-C index I used in my own recent testing (which showed that the complex patch was almost 3x faster than the simple patch) has all that strong of a pg_stats.correlation. You can probably come up with indexes/test cases where groups of related TIDs that each point to the same heap block appear together, even though in general the index tuple heap TIDs appear completely out of order. It probably isn't even that different to a simple pgbench_accounts_pkey from a prefetching POV, though, in spite of these rather conspicuous differences. In time we might find that just using pgbench_accounts_pkey directly works just as well for our purposes (unsure of that, but seems possible). > So what other index scan variations would you suggest to test? I can > imagine e.g. IN () conditions with variable list length, maybe > multi-column indexes, and/or skip scan cases. Any other ideas? The only thing that's really interesting about IN() conditions is that they provide an easy way to write a query that only returns a subset of all index tuples from every leaf page read. You can get a similar access pattern from other types of quals, but that's not quite as intuitive. I really don't think that IN() conditions are all that special. They're perfectly fine as a way of getting this general access pattern. I like to look for and debug "behavioral inconsistencies". For example, I have an open item in my notes (which I sent to you over IM a short while ago) about a backwards scan that is significantly slower than an "equivalent" forwards scan. This involves pgbench_accounts_pkey. It's quite likely that the underlying problem has nothing much to do with backwards scans. I suspect that the underlying problem is a more general one, that could also be seen with the right forwards scan test case. In general, it might make the most sense to look for pairs of similar-ish queries that are inconsistent in a way that doesn't make sense intuitively, in order to understand and fix the inconsistency. Since chances are that it's actually just some kind of performance bug that accidentally doesn't happen in only one variant of the query. I bet that there's at least a couple of not-that-noticeable performance bugs, for example due to some hard to pin down issue with prefetch distance getting out of hand. Possibly because the read stream doesn't get to see contiguous requests for TIDs that point to the same heap page, but does see it when things are slightly out of order. Two different queries that have approximately the same accesses should have approximately the same performance -- minor variations in leaf page layout or heap page layout or scan direction shouldn't be confounding. > FWIW I'm not planning to keep testing simple vs complex patches. We've > seen the complex patch can do much better in certain workloads cases, > the fact that we can discover more such cases does not change much. > > I'm much more interested in benchmarking master vs. complex patch. Great! > > It'd also be good to just not test "sync" anymore, at some point. And > > maybe to standardize on testing either "worker" or "io_uring" for most > > individual tests. There's just too many tests right now. > > > > Agreed. Might also make sense to standardize on direct I/O when testing the patch (but probably not when testing master). The fact that we can't get any OS readahead is likely to be useful. > > Andres recently told me that he isn't expecting to be able to simulate > > read-ahead with direct I/O. It seems possible that read-ahead > > eventually won't be used at all, which argues for the complex patch. > > > > True, the complex patch could prefetch the leaf pages. What I meant was that the complex patch can make up for the fact that direct I/O presumably won't ever have an equivalent to simple read-ahead. Just by having a very flexible prefetching implementation (and without any special sequential access heuristics ever being required). > > BTW, I experimented with using READ_STREAM_USE_BATCHING (not > > READ_STREAM_DEFAULT) in the complex patch. That's probably > > deadlock-prone, but I suspect that it works well enough to get a good > > sense of what is possible. What I saw (with that same TPC-C test > > query) was that "I/O Timings" was about 10x lower, even though the > > query runtime didn't change at all. This suggests to me that "I/O > > Timings" is an independently interesting measure: getting it lower > > might not visibly help when only one query runs, but it'll likely > > still lead to more efficient use of available I/O bandwidth in the > > aggregate (when many queries run at the same time). > > > > Interesting. Does that mean we should try enabling batching in some > cases? Or just that there's room for improvement? I don't know what it means myself. I never got as far as even starting to understand what it would take to make READ_STREAM_USE_BATCHING work. AFAIK it wouldn't be hard to make that work here at all, in which case we should definitely use it. OTOH, maybe it's really hard. I just don't know right now. > Could we do the next_block callbacks in a way that make deadlocks > impossible? > > I'm not that familiar with the batch mode - how would the deadlock even > happen in index scans? I have no idea. Maybe it's already safe. I didn't notice any problems (but didn't look for them, beyond running my tests plus the regression tests). > I think the only way is to try reworking some of the index AMs to use > the new interface. For some AMs (e.g. hash) it's going to be very > similar to what you did with btree, because it basically works like a > btree. For others (GiST/SP-GiST) it may be more work. The main difficulty with GiST may be that we may be obligated to fix existing (unfixed!) bugs that affect index-only scans. The master branch is subtly broken, but we can't in good conscience ignore those problems while making these kinds of changes. > It doesn't need to be committable, just good enough to be reasonably > certain it's possible. That's what I have in mind, too. If we have support for a second index AM, then we're much less likely to over-optimize for nbtree in a way that doesn't really make sense. > Understood, and I agree in principle. It's just that given the fuzziness > I find it hard how it should look like. I suspect that index AMs are much more similar for the purposes of prefetching than they are in other ways. -- Peter Geoghegan
pgsql-hackers by date: