Re: index prefetching - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: index prefetching
Msg-id 7c2f6350-6fca-4e39-b0a8-8ac735f5d58a@vondra.me
In response to Re: index prefetching  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers

On 8/5/25 19:19, Peter Geoghegan wrote:
> On Tue, Aug 5, 2025 at 10:52 AM Tomas Vondra <tomas@vondra.me> wrote:
>> I ran some more tests, comparing the two patches, using data sets
>> generated in a way to have a more gradual transition between correlated
>> and random cases.
> 
> Cool.
> 
>> So this fuzz is the primary knob. Obviously, incrementing fuzz by "1" it
>> would be way too many tests, with very little change. Instead, I used
>> the doubling strategy - 0, 1, 2, 4, 8, 16, 32, 64, ... $rows. This way
>> it takes only about ~25 steps for the $fuzz to exceed $rows=10M.
> 
> I think that it probably makes sense to standardize on using fewer
> distinct "fuzz" settings than this going forward. It's useful to test
> more things at first, but I expect that the performance impact of
> changes from a given new patch revision will become important soon.
> 

Probably. It was hard to predict which values would be interesting;
maybe we can pick a subset now. I'll start by just taking larger steps,
I think - increasing by 4x rather than 2x, which reduces the number of
combinations a lot. Also, I plan to stick to fillfactor=20, as it
doesn't seem to have much impact anyway.
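
To illustrate, the stepping I have in mind is just a geometric
progression, something like this (a trivial C sketch for illustration
only, with the series capped at $rows):

/* illustration only: 4x steps for $fuzz instead of 2x */
#include <stdio.h>

int
main(void)
{
	long	rows = 10000000;	/* $rows */
	long	fuzz;

	printf("fuzz = 0\n");		/* perfectly correlated case */
	for (fuzz = 1; fuzz < rows; fuzz *= 4)
		printf("fuzz = %ld\n", fuzz);	/* 2x gives ~25 values, 4x about half */
	printf("fuzz = %ld\n", rows);		/* capped at $rows, i.e. random */

	return 0;
}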

>> I don't claim those data sets are perfect, or a great representation of
>> particular (real-world) data sets. It seems like a much nicer transition
>> between random and correlated data sets.
> 
> That makes sense to me.
> 
> A test suite that is representative of real-world usage patterns isn't
> so important. But it is important that we have at least one test for
> each interesting variation of an index scan. What exactly that means
> is subject to interpretation, and will likely evolve over time. But
> the general idea is that we should choose tests that experience has
> shown to be particularly good at highlighting the advantages or
> disadvantages of one approach over another (e.g., simple vs complex).
> 

True. These tests use very simple queries, with a single range clause.

So what other index scan variations would you suggest testing? I can
imagine e.g. IN () conditions with variable list lengths, maybe
multi-column indexes, and/or skip scan cases. Any other ideas?

FWIW I'm not planning to keep comparing the simple and complex patches.
We've seen that the complex patch can do much better on certain
workloads; the fact that we can discover more such cases does not
change much.

I'm much more interested in benchmarking master vs. complex patch.

> It's just as important that we cut tests that don't seem to tell us
> anything we can't get from some other tests. I suspect that many $fuzz
> values aren't at all interesting. We double to get each increment, but
> that probably isn't all that informative, outside of the extremes.
> 
> It'd also be good to just not test "sync" anymore, at some point. And
> maybe to standardize on testing either "worker" or "io_uring" for most
> individual tests. There's just too many tests right now.
> 

Agreed.

>> In most cases the two patches perform fairly close - the green and red
>> data series mostly overlap. But there are cases where the complex patch
>> performs much better - especially for low fuzz values. Which is not
>> surprising, because those cases require higher prefetch distance, and
>> the complex patch can do that.
> 
> Right.
> 
>> It surprised me a bit the complex patch can actually help even cases
>> where I'd not expect prefetching to help very much - e.g. fuzz=0 is
>> perfectly correlated, I'd expect read-ahead to work just fine. Yet the
>> complex patch can help ~2x (at least when scanning larger fraction of
>> the data).
> 
> Maybe it has something to do with reading multiple leaf pages together
> leading to fewer icache misses.
> 

Maybe, not sure.

> Andres recently told me that he isn't expecting to be able to simulate
> read-ahead with direct I/O. It seems possible that read-ahead
> eventually won't be used at all, which argues for the complex patch.
> 

True, the complex patch could prefetch the leaf pages.

> BTW, I experimented with using READ_STREAM_USE_BATCHING (not
> READ_STREAM_DEFAULT) in the complex patch. That's probably
> deadlock-prone, but I suspect that it works well enough to get a good
> sense of what is possible. What I saw (with that same TPC-C test
> query) was that "I/O Timings" was about 10x lower, even though the
> query runtime didn't change at all. This suggests to me that "I/O
> Timings" is an independently interesting measure: getting it lower
> might not visibly help when only one query runs, but it'll likely
> still lead to more efficient use of available I/O bandwidth in the
> aggregate (when many queries run at the same time).
> 

Interesting. Does that mean we should try enabling batching in some
cases? Or just that there's room for improvement?

Could we write the next_block callbacks in a way that makes deadlocks
impossible?

I'm not that familiar with the batch mode - how would a deadlock even
happen in index scans? I suppose there might be two index scans in
opposite directions, requesting the pages in a different order. Or even
just index scans with different keys that happen to touch the heap
pages in a different order. Also, if we "streamify" the access to leaf
pages, there could be a deadlock between the two streams.

Not sure how to prevent these cases.
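
Just so I understand what the experiment looked like - I imagine the
difference is mostly the flag passed when creating the stream, roughly
like this (a sketch against the read_stream API on master; the callback
and its state are made up here, not what either patch actually does):

#include "postgres.h"

#include "storage/read_stream.h"
#include "utils/rel.h"

/* made-up state, standing in for whatever feeds heap blocks from the index */
typedef struct PrefetchState
{
	BlockNumber *blocks;		/* heap blocks queued by the index scan */
	int			next;
	int			nblocks;
} PrefetchState;

static BlockNumber
prefetch_next_block(ReadStream *stream, void *callback_private_data,
					void *per_buffer_data)
{
	PrefetchState *ps = (PrefetchState *) callback_private_data;

	/*
	 * With READ_STREAM_USE_BATCHING this runs while a batch of I/Os is
	 * being staged, so presumably it must not block on other buffers or
	 * submit I/O of its own - which I assume is where the deadlock risk
	 * comes from.
	 */
	if (ps->next >= ps->nblocks)
		return InvalidBlockNumber;

	return ps->blocks[ps->next++];
}

static ReadStream *
start_prefetch_stream(Relation heapRel, PrefetchState *ps, bool batching)
{
	int			flags = batching ? READ_STREAM_USE_BATCHING : READ_STREAM_DEFAULT;

	return read_stream_begin_relation(flags,
									  NULL,		/* default strategy */
									  heapRel,
									  MAIN_FORKNUM,
									  prefetch_next_block,
									  ps,
									  0);		/* no per-buffer data */
}

If that's roughly right, then my question above boils down to whether
we can guarantee the callback never needs to block.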


>> There are also a couple cases where "simple" performs better than
>> "complex". But most of the time this is only for the "sync" iomethod,
>> and when scanning significant fraction of the data (10%+). So that
>> doesn't seem like a great argument in favor of the simple patch,
>> considering "sync" is not a proper AIO method, I've been arguing against
>> using it as a default, and with methods like "worker" the "complex"
>> patch often performs better ...
> 
> I suspect that this is just a case of "sync" making aggressive
> prefetching a bad idea in general.
> 
>> Let's say the complex patch is the way to go. What are the open problems
>> / missing parts we need to address to make it committable?
> 
> I think that what you're interested in here is mostly project risk --
> things that come with a notable risk of blocking commit/significantly
> undermining our general approach.
> 

In a way, yes. I'm interested in anything I have not thought about.

>> I can think of these issues. I'm sure the list is incomplete and there
>> are many "smaller issues" and things I haven't even thought about:
> 
> I have a list of issues to solve in my personal notes. Most of them
> aren't particularly important.
> 

Good to hear.

>> 1) Making sure the interface can work for other index AMs (both in core
>> and out-of-core), including cases like GiST etc.
> 
> What would put your mind at ease here? Maybe you'd feel better about
> this if we also implemented prefetching for at least one other index
> AM. Maybe GiST, since it's likely both the next-hardest and next most
> important index AM (after nbtree).
> 
> Right now, I'm not motivated to work on the patch at all, since it's
> still not clear that any of it has buy-in from you. I'm willing to do
> more work to try to convince you, but it's not clear what it would
> take/where your doubts are. I'm starting to be concerned about that
> just never happening, quite honestly. Getting a feature of this
> complexity into committable shape requires laser focus.
> 

I think the only way is to try reworking some of the index AMs to use
the new interface. For some AMs (e.g. hash) it's going to be very
similar to what you did with btree, because it basically works like a
btree. For others (GiST/SP-GiST) it may be more work.

Not sure about out-of-core AMs, like pgvector etc. That may be a step
too far / too much work.

It doesn't need to be committable, just good enough to be reasonably
certain it's possible.

>> 2) Proper layering between index AM and table AM (like the TID issue
>> pointed out by Andres some time ago).
>>
>> 3) Allowing more flexible management of prefetch distance (this might
>> involve something like the "scan manager" idea suggested by Peter),
>> various improvements to ReadStream heuristics, etc.
> 
> The definition of "scan manager" is quite fuzzy right now. I think
> that the "complex" patch already implements a very basic version of
> that idea.
> 
> To me, the important point was always that the general design/API of
> index prefetching be structured in a way that would allow us to
> accommodate more sophisticated strategies. As I've said many times,
> somebody needs to see all of the costs and all of the benefits --
> that's what's needed to make optimal choices.
> 

Understood, and I agree in principle. It's just that, given the
fuzziness, I find it hard to say what it should look like.

>> 4) More testing to minimize the risk of regressions.
>>
>> 5) Figuring out how to make this work for IOS (the simple patch has some
>> special logic in the callback, which may not be great, not sure what's
>> the right solution in the complex patch).
> 
> I agree that all these items are probably the biggest risks to the
> project. I'm not sure that I can attribute this to the use of the
> "complex" approach over the "simple" approach.
> 

True, most of these points apply to both patches - including the IOS
handling in the callback. The complex patch could do it the same way,
except there would be just one callback, not one per index AM.
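
For the record, by "IOS handling in the callback" I mean roughly a
visibility-map check like this, skipping heap pages an index-only scan
would never actually read (again just a sketch, all the names are made
up):

#include "postgres.h"

#include "access/visibilitymap.h"
#include "storage/read_stream.h"

/* made-up callback state */
typedef struct IOSPrefetchState
{
	Relation	heapRel;
	Buffer		vmbuffer;		/* cached VM page, initially InvalidBuffer */
	BlockNumber *blocks;		/* candidate heap blocks from the index */
	int			next;
	int			nblocks;
} IOSPrefetchState;

static BlockNumber
ios_next_block(ReadStream *stream, void *callback_private_data,
			   void *per_buffer_data)
{
	IOSPrefetchState *ps = (IOSPrefetchState *) callback_private_data;

	while (ps->next < ps->nblocks)
	{
		BlockNumber blkno = ps->blocks[ps->next++];

		/*
		 * An index-only scan never reads all-visible heap pages, so
		 * prefetching them would only waste I/O - skip them.
		 */
		if (!VM_ALL_VISIBLE(ps->heapRel, blkno, &ps->vmbuffer))
			return blkno;
	}

	return InvalidBlockNumber;	/* vmbuffer gets released when the scan ends */
}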

>> 6) ????
> 
> I guess that this means "unknown unknowns", which are another significant risk.
> 

Yeah, that's what I meant. Sorry, I should have been more explicit.

regards

-- 
Tomas Vondra



