Re: index prefetching - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: index prefetching
Msg-id: 2e63cadd-2a03-46b1-866e-7ea5d3ffd37f@vondra.me
In response to: Re: index prefetching (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On 8/5/25 23:35, Peter Geoghegan wrote:
> On Tue, Aug 5, 2025 at 4:56 PM Tomas Vondra <tomas@vondra.me> wrote:
>> Probably. It was hard to predict which values will be interesting, maybe we can pick some subset now. I'll start by just doing larger steps, I think. Maybe increase by 4x rather than 2x, that'll reduce the number of combinations a lot. Also, I plan to stick to fillfactor=20, it doesn't seem to have a lot of impact anyway.
>
> I don't think that fillfactor matters all that much, either way. A low setting provides a simple way of simulating "wide" heap tuples, but that probably isn't going to make the crucial difference.

Agreed.

> It's not like the TPC-C index I used in my own recent testing (which showed that the complex patch was almost 3x faster than the simple patch) has all that strong of a pg_stats.correlation. You can probably come up with indexes/test cases where groups of related TIDs that each point to the same heap block appear together, even though in general the index tuple heap TIDs appear completely out of order. It probably isn't even that different to a simple pgbench_accounts_pkey from a prefetching POV, though, in spite of these rather conspicuous differences. In time we might find that just using pgbench_accounts_pkey directly works just as well for our purposes (unsure of that, but seems possible).

That's quite possible. What concerns me about using tables like the pgbench accounts table is reproducibility - initially it's correlated, and then it gets "randomized" by the workload. But maybe the exact pattern depends on the workload - how many clients, how long it runs, how it correlates with vacuum, etc. Reproducing the dataset might be quite tricky. That's why I prefer using "reproducible" data sets.

The data sets with "fuzz" seem like a pretty good model. I plan to experiment with adding some duplicate values / runs, possibly with two "levels" of randomness (global for all runs, and smaller local perturbations).
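Something along these lines, perhaps (just a sketch - the table name, run length and constants are all made up):

  -- each "run" of 100 rows gets a random global position, and every row
  -- in the run is then shifted by a much smaller local random offset
  CREATE TABLE fuzz_test AS
  WITH runs AS (
      SELECT r AS run_id,
             (random() * 10000000)::bigint AS run_pos   -- global randomness, per run
      FROM generate_series(1, 10000) r
  )
  SELECT run_pos + (random() * 100)::int AS val,        -- local perturbation, per row
         repeat('x', 100) AS padding                    -- make the tuples a bit wider
  FROM runs, generate_series(1, 100) i;

  CREATE INDEX ON fuzz_test (val);
  VACUUM ANALYZE fuzz_test;

Varying the run length and the two random ranges should give anything from almost perfectly correlated to essentially random, and the duplicates fall out of the runs naturally.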
>> So what other index scan variations would you suggest to test? I can imagine e.g. IN () conditions with variable list length, maybe multi-column indexes, and/or skip scan cases. Any other ideas?
>
> The only thing that's really interesting about IN() conditions is that they provide an easy way to write a query that only returns a subset of all index tuples from every leaf page read. You can get a similar access pattern from other types of quals, but that's not quite as intuitive.
>
> I really don't think that IN() conditions are all that special. They're perfectly fine as a way of getting this general access pattern.

OK

> I like to look for and debug "behavioral inconsistencies". For example, I have an open item in my notes (which I sent to you over IM a short while ago) about a backwards scan that is significantly slower than an "equivalent" forwards scan. This involves pgbench_accounts_pkey. It's quite likely that the underlying problem has nothing much to do with backwards scans. I suspect that the underlying problem is a more general one, that could also be seen with the right forwards scan test case.
>
> In general, it might make the most sense to look for pairs of similar-ish queries that are inconsistent in a way that doesn't make sense intuitively, in order to understand and fix the inconsistency. Since chances are that it's actually just some kind of performance bug that accidentally doesn't happen in only one variant of the query.

Yeah, cases like that are interesting. I plan to do some randomized testing, exploring "strange" combinations of parameters, looking for weird behaviors like that. The question is what parameters to consider - the data distribution is one such parameter. Different "types" of scans are another.
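For example, a pair like this on a stock pgbench_accounts table is roughly the kind of forwards/backwards comparison I have in mind (again just a sketch - the range is arbitrary, and the index scans may need to be forced):

  -- force plain index scans, so the two plans differ only in scan direction
  SET enable_seqscan = off;
  SET enable_bitmapscan = off;

  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM pgbench_accounts
   WHERE aid BETWEEN 1000000 AND 1500000 ORDER BY aid ASC;

  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM pgbench_accounts
   WHERE aid BETWEEN 1000000 AND 1500000 ORDER BY aid DESC;

If the two show very different timings (or very different prefetch distances), that's the kind of inconsistency worth digging into.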
> I bet that there's at least a couple of not-that-noticeable performance bugs, for example due to some hard to pin down issue with prefetch distance getting out of hand. Possibly because the read stream doesn't get to see contiguous requests for TIDs that point to the same heap page, but does see it when things are slightly out of order. Two different queries that have approximately the same accesses should have approximately the same performance -- minor variations in leaf page layout or heap page layout or scan direction shouldn't be confounding.

I think cases like that are somewhat inherent; I probably wouldn't even call them "bugs". Any heuristic (driving the distance) will have such issues - give me a heuristic and I'll construct an adversarial case that breaks it. I think the question is how likely (and how serious) such cases are. If they're rare / limited to cases where we're unlikely to pick an index scan anyway, then maybe it's OK.

>> FWIW I'm not planning to keep testing simple vs complex patches. We've seen the complex patch can do much better in certain workloads, the fact that we can discover more such cases does not change much.
>>
>> I'm much more interested in benchmarking master vs. complex patch.
>
> Great!
>
>>> It'd also be good to just not test "sync" anymore, at some point. And maybe to standardize on testing either "worker" or "io_uring" for most individual tests. There's just too many tests right now.
>>
>> Agreed.
>
> Might also make sense to standardize on direct I/O when testing the patch (but probably not when testing master). The fact that we can't get any OS readahead is likely to be useful.

I plan to keep testing with buffered I/O (with "io_method=worker"), simply because that's what most systems will keep using for a while. But it's a good idea to test with direct I/O too.

>>> Andres recently told me that he isn't expecting to be able to simulate read-ahead with direct I/O. It seems possible that read-ahead eventually won't be used at all, which argues for the complex patch.
>>
>> True, the complex patch could prefetch the leaf pages.
>
> What I meant was that the complex patch can make up for the fact that direct I/O presumably won't ever have an equivalent to simple read-ahead. Just by having a very flexible prefetching implementation (and without any special sequential access heuristics ever being required).

OK

>>> BTW, I experimented with using READ_STREAM_USE_BATCHING (not READ_STREAM_DEFAULT) in the complex patch. That's probably deadlock-prone, but I suspect that it works well enough to get a good sense of what is possible. What I saw (with that same TPC-C test query) was that "I/O Timings" was about 10x lower, even though the query runtime didn't change at all. This suggests to me that "I/O Timings" is an independently interesting measure: getting it lower might not visibly help when only one query runs, but it'll likely still lead to more efficient use of available I/O bandwidth in the aggregate (when many queries run at the same time).
>>
>> Interesting. Does that mean we should try enabling batching in some cases? Or just that there's room for improvement?
>
> I don't know what it means myself. I never got as far as even starting to understand what it would take to make READ_STREAM_USE_BATCHING work.
>
> AFAIK it wouldn't be hard to make that work here at all, in which case we should definitely use it. OTOH, maybe it's really hard. I just don't know right now.

Same here. I read the comments about batch mode and deadlocks multiple times, and it's still not clear to me what exactly would be needed to make it safe.

>> Could we do the next_block callbacks in a way that makes deadlocks impossible?
>>
>> I'm not that familiar with the batch mode - how would the deadlock even happen in index scans?
>
> I have no idea. Maybe it's already safe. I didn't notice any problems (but didn't look for them, beyond running my tests plus the regression tests).

OK

>> I think the only way is to try reworking some of the index AMs to use the new interface. For some AMs (e.g. hash) it's going to be very similar to what you did with btree, because it basically works like a btree. For others (GiST/SP-GiST) it may be more work.
>
> The main difficulty with GiST may be that we may be obligated to fix existing (unfixed!) bugs that affect index-only scans. The master branch is subtly broken, but we can't in good conscience ignore those problems while making these kinds of changes.

Right, that's a valid point. The thing that worries me a bit is that the ordered scans (e.g. with reordering by distance) detach the scan from the leaf pages, i.e. the batches are no longer "tied" to a leaf page. Perhaps "worries" is not the right word - I don't think it should be a problem, but it's a difference.

>> It doesn't need to be committable, just good enough to be reasonably certain it's possible.
>
> That's what I have in mind, too. If we have support for a second index AM, then we're much less likely to over-optimize for nbtree in a way that doesn't really make sense.

Yep.

>> Understood, and I agree in principle. It's just that given the fuzziness I find it hard to say how it should look.
>
> I suspect that index AMs are much more similar for the purposes of prefetching than they are in other ways.

Probably.


regards

--
Tomas Vondra