From: Tomas Vondra
Subject: Re: index prefetching
Msg-id: 2e63cadd-2a03-46b1-866e-7ea5d3ffd37f@vondra.me
In response to: Re: index prefetching (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On 8/5/25 23:35, Peter Geoghegan wrote:
> On Tue, Aug 5, 2025 at 4:56 PM Tomas Vondra <tomas@vondra.me> wrote:
>> Probably. It was hard to predict which values would be interesting;
>> maybe we can pick some subset now. I'll start by just doing larger
>> steps, I think. Maybe increase by 4x rather than 2x, that'll reduce
>> the number of combinations a lot. Also, I plan to stick to
>> fillfactor=20, it doesn't seem to have much impact anyway.
> 
> I don't think that fillfactor matters all that much, either way. A low
> setting provides a simple way of simulating "wide" heap tuples, but
> that probably isn't going to make the crucial difference.
> 

Agreed.

> It's not like the TPC-C index I used in my own recent testing (which
> showed that the complex patch was almost 3x faster than the simple
> patch) has all that strong of a pg_stats.correlation. You can probably
> come up with indexes/test cases where groups of related TIDs that each
> point to the same heap block appear together, even though in general
> the index tuple heap TIDs appear completely out of order. It probably
> isn't even that different to a simple pgbench_accounts_pkey from a
> prefetching POV, though, in spite of these rather conspicuous
> differences. In time we might find that just using
> pgbench_accounts_pkey directly works just as well for our purposes
> (unsure of that, but seems possible).
> 

That's quite possible. What concerns me about tables like
pgbench_accounts is reproducibility - initially the table is correlated,
and then it gets "randomized" by the workload. But the exact pattern
depends on the workload - how many clients, how long it runs, how it
interacts with vacuum, etc. Reproducing the dataset might be quite
tricky.

That's why I prefer "reproducible" data sets, and I think data sets with
"fuzz" are a pretty good model. I plan to experiment with adding some
duplicate values / runs, possibly with two "levels" of randomness
(global for all runs, and smaller local perturbations).
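
Roughly something like this (just a sketch - the table name, the sizes
and all the constants are made up, and I'll need to tune them):

    -- two levels of randomness: each run gets a random "base" value
    -- (global), values within a run get a small perturbation (local)
    CREATE TABLE fuzz (a bigint, padding text) WITH (fillfactor = 20);

    WITH runs AS MATERIALIZED (
        SELECT (random() * 100000000)::bigint AS base
          FROM generate_series(1, 100000)
    )
    INSERT INTO fuzz
    SELECT base + (random() * 10)::bigint,    -- local perturbation
           repeat('x', 100)                   -- make the tuples wider
      FROM runs, generate_series(1, 50);      -- ~50 duplicates per run

    CREATE INDEX fuzz_a_idx ON fuzz (a);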

>> So what other index scan variations would you suggest to test? I can
>> imagine e.g. IN () conditions with variable list length, maybe
>> multi-column indexes, and/or skip scan cases. Any other ideas?
> 
> The only thing that's really interesting about IN() conditions is that
> they provide an easy way to write a query that only returns a subset
> of all index tuples from every leaf page read. You can get a similar
> access pattern from other types of quals, but that's not quite as
> intuitive.
> 
> I really don't think that IN() conditions are all that special.
> They're perfectly fine as a way of getting this general access
> pattern.
> 

OK
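
To be concrete, I mean queries like this (a psql sketch that generates
a random IN() list of a given length - the table and the value range
are placeholders):

    SELECT format('SELECT * FROM fuzz WHERE a IN (%s)',
                  string_agg((random() * 100000000)::bigint::text, ', '))
      FROM generate_series(1, 500)    -- length of the IN() list
    \gexec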

> I like to look for and debug "behavioral inconsistencies". For
> example, I have an open item in my notes (which I sent to you over IM
> a short while ago) about a backwards scan that is significantly slower
> than an "equivalent" forwards scan. This involves
> pgbench_accounts_pkey. It's quite likely that the underlying problem
> has nothing much to do with backwards scans. I suspect that the
> underlying problem is a more general one, that could also be seen with
> the right forwards scan test case.
> 
> In general, it might make the most sense to look for pairs of
> similar-ish queries that are inconsistent in a way that doesn't make
> sense intuitively, in order to understand and fix the inconsistency.
> Since chances are that it's actually just some kind of performance bug
> that accidentally doesn't happen in only one variant of the query.
> 

Yeah, cases like that are interesting. I plan to do some randomized
testing, exploring "strange" combinations of parameters, looking for
weird behaviors like that.

The question is what parameters to consider - the data distribution is
one such parameter, and the different "types" of scans are another.
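
For example, pairs like this - both scans read the same index and heap
pages, just in the opposite order, so intuitively they "should" perform
about the same (the LIMIT value is arbitrary):

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM pgbench_accounts ORDER BY aid ASC LIMIT 1000000;

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM pgbench_accounts ORDER BY aid DESC LIMIT 1000000;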

> I bet that there's at least a couple of not-that-noticeable
> performance bugs, for example due to some hard to pin down issue with
> prefetch distance getting out of hand. Possibly because the read
> stream doesn't get to see contiguous requests for TIDs that point to
> the same heap page, but does see it when things are slightly out of
> order. Two different queries that have approximately the same accesses
> should have approximately the same performance -- minor variations in
> leaf page layout or heap page layout or scan direction shouldn't be
> confounding.
> 

I think cases like that are somewhat inherent, and I probably wouldn't
even call them "bugs". Any heuristic (driving the distance) will have
such issues - give me a heuristic and I'll construct an adversarial case
that breaks it.

I think the question is how likely (and how serious) such cases are. If
they're rare / limited to cases where we're unlikely to pick an index
scan anyway, then maybe it's OK.

>> FWIW I'm not planning to keep testing simple vs. complex patches.
>> We've seen the complex patch can do much better in certain workloads;
>> the fact that we can discover more such cases does not change much.
>>
>> I'm much more interested in benchmarking master vs. complex patch.
> 
> Great!
> 
>>> It'd also be good to just not test "sync" anymore, at some point. And
>>> maybe to standardize on testing either "worker" or "io_uring" for most
>>> individual tests. There's just too many tests right now.
>>>
>>
>> Agreed.
> 
> Might also make sense to standardize on direct I/O when testing the
> patch (but probably not when testing master). The fact that we can't
> get any OS readahead is likely to be useful.
> 

I plan to keep testing with buffered I/O (with "io_method=worker"),
simply because that's what most systems will keep using for a while. But
it's a good idea to test with direct I/O too.
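
That is, something like these two configurations (both GUCs require a
restart to take effect):

    -- buffered I/O, worker-based AIO (my default for the benchmarks)
    ALTER SYSTEM SET io_method = 'worker';

    -- direct I/O for data files, which takes OS readahead out of the
    -- picture entirely
    ALTER SYSTEM SET io_method = 'io_uring';
    ALTER SYSTEM SET debug_io_direct = 'data';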

>>> Andres recently told me that he isn't expecting to be able to simulate
>>> read-ahead with direct I/O. It seems possible that read-ahead
>>> eventually won't be used at all, which argues for the complex patch.
>>>
>>
>> True, the complex patch could prefetch the leaf pages.
> 
> What I meant was that the complex patch can make up for the fact that
> direct I/O presumably won't ever have an equivalent to simple
> read-ahead. Just by having a very flexible prefetching implementation
> (and without any special sequential access heuristics ever being
> required).
> 

OK

>>> BTW, I experimented with using READ_STREAM_USE_BATCHING (not
>>> READ_STREAM_DEFAULT) in the complex patch. That's probably
>>> deadlock-prone, but I suspect that it works well enough to get a good
>>> sense of what is possible. What I saw (with that same TPC-C test
>>> query) was that "I/O Timings" was about 10x lower, even though the
>>> query runtime didn't change at all. This suggests to me that "I/O
>>> Timings" is an independently interesting measure: getting it lower
>>> might not visibly help when only one query runs, but it'll likely
>>> still lead to more efficient use of available I/O bandwidth in the
>>> aggregate (when many queries run at the same time).
>>>
>>
>> Interesting. Does that mean we should try enabling batching in some
>> cases? Or just that there's room for improvement?
> 
> I don't know what it means myself. I never got as far as even starting
> to understand what it would take to make READ_STREAM_USE_BATCHING
> work.
> 
> AFAIK it wouldn't be hard to make that work here at all, in which case
> we should definitely use it. OTOH, maybe it's really hard. I just
> don't know right now.
> 

Same here. I read the comments about batch mode and deadlocks multiple
times, and it's still not clear to me what exactly would be needed to
make it safe.

>> Could we do the next_block callbacks in a way that makes deadlocks
>> impossible?
>>
>> I'm not that familiar with the batch mode - how would the deadlock even
>> happen in index scans?
> 
> I have no idea. Maybe it's already safe. I didn't notice any problems
> (but didn't look for them, beyond running my tests plus the regression
> tests).
> 

OK

>> I think the only way is to try reworking some of the index AMs to use
>> the new interface. For some AMs (e.g. hash) it's going to be very
>> similar to what you did with btree, because it basically works like a
>> btree. For others (GiST/SP-GiST) it may be more work.
> 
> The main difficulty with GiST may be that we may be obligated to fix
> existing (unfixed!) bugs that affect index-only scans. The master
> branch is subtly broken, but we can't in good conscience ignore those
> problems while making these kinds of changes.
> 

Right, that's a valid point.

The thing that worries me a bit is that the ordered scans (e.g. with
reordering by distance) detach the scan from the leaf pages, i.e. the
batches are no longer "tied" to a leaf page.

Perhaps "worries" is not the right word - I don't think it should be a
problem, but it's a difference.
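
I mean scans like this (hypothetical table with a GiST index on a point
column), where the output order has nothing to do with the leaf page
order:

    CREATE INDEX ON points USING gist (p);

    SELECT * FROM points
     ORDER BY p <-> point '(0,0)'    -- reorder by distance (kNN)
     LIMIT 100;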

>> It doesn't need to be committable, just good enough to be reasonably
>> certain it's possible.
> 
> That's what I have in mind, too. If we have support for a second index
> AM, then we're much less likely to over-optimize for nbtree in a way
> that doesn't really make sense.
> 

Yep.

>> Understood, and I agree in principle. It's just that given the
>> fuzziness, I find it hard to say what it should look like.
> 
> I suspect that index AMs are much more similar for the purposes of
> prefetching than they are in other ways.
> 

Probably.



regards

-- 
Tomas Vondra



