Re: index prefetching - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: index prefetching
Msg-id: CAH2-Wzk588ESCcigCURKbOHiyFEvTX-2CdyUi+Nv5RCYJ5Z15Q@mail.gmail.com
In response to: Re: index prefetching (Tomas Vondra <tomas@vondra.me>)
Responses: Re: index prefetching
List: pgsql-hackers
On Mon, Sep 15, 2025 at 9:00 AM Tomas Vondra <tomas@vondra.me> wrote:
> Yeah, this heuristic seems very effective in eliminating the
> regression (at least judging by the test results I've seen so far).
> Two or three questions bother me about it, though:

I more or less agree with all of your concerns about the
INDEX_SCAN_MIN_TUPLE_DISTANCE optimization.

> We're prefetching too close ahead, so close that the I/O can't
> possibly complete, and the overhead of submitting the I/O using AIO
> is higher than what async "saves".
>
> That's great, but is the distance a good measure of that?

The big underlying problem is that the INDEX_SCAN_MIN_TUPLE_DISTANCE
heuristic was developed in a way that overrelied on brute-force
testing. It probably is still flawed in some specific way, one that
some real query will actually run into, and that cannot be justified.
So at this point INDEX_SCAN_MIN_TUPLE_DISTANCE should still be
considered nothing more than a promising experiment. It's promising
because it does appear to work with a large variety of queries (we
know of no query where it doesn't basically work right now). My hope
is that we'll be able to come up with a simpler and more robust
approach, one where we fully understand the downsides.

> 2) It's a one-time decision, not adaptive. We start prefetching, and
> then at some point (not too long after the scan starts) we make a
> decision whether to continue with prefetching or not. And if we
> disable it, it's disabled forever. That's fine for the synthetic
> data sets we use for testing, because those are synthetic. I'm not
> sure it'll work this well for real-world data sets, where different
> parts of the file may be very different.

We'll probably need to accept some hard trade-off in this area. In
general (not just with this patch), prefetching works through trial
and error -- the errors are useful information, and useful information
isn't free.

The regressions that the INDEX_SCAN_MIN_TUPLE_DISTANCE heuristic
addresses are cases where the errors seem unlikely to pay for
themselves. Let's not forget that these are not huge regressions --
it's not as if the patch ever does completely the wrong thing without
INDEX_SCAN_MIN_TUPLE_DISTANCE. It's more that it hurts us to be
constantly on the verge of doing the right thing, without ever quite
doing it.

Fundamentally, we need to be willing to pay for the cost of
information through which we might be able to do better. We might be
able to get that cost down, through some kind of targeted
optimization, but it's unlikely to ever be close to free.

> This is perfectly fine for a WIP patch, but I believe we should try
> to make this adaptive. Which probably means we need to invent a
> "light" version of read_stream that initially does sync I/O, and
> only switches to async (with all the expensive initialization)
> later. And then can switch back to sync, but is ready to maybe start
> prefetching again if the data pattern changes.

That does seem like it'd be ideal. But how are we supposed to decide
to switch back? Right now, disabling prefetching disables the only
mechanism we have for noticing that prefetching might be useful
(which is to notice that we're failing to keep up with our prefetch
distance).

Without INDEX_SCAN_MIN_TUPLE_DISTANCE, for those queries where
prefetch distance collapses to ~2.0, we really can "decide to switch
back to prefetching". But maintaining the option of switching back
costs us too much (that's what we need INDEX_SCAN_MIN_TUPLE_DISTANCE
to manage).
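As an aside, to make the shape of the current gate concrete, it
amounts to something like the following standalone sketch. To be
clear, this is not the patch's actual code -- the struct, the function
name, and the threshold value are all invented for illustration.

/*
 * Sketch of a "minimum tuple distance" prefetch gate. Hypothetical
 * names and constants throughout; compiles standalone.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_MIN_TUPLE_DISTANCE 8     /* invented threshold */

typedef struct PrefetchGate
{
    int64_t     read_pos;       /* tuple the scan is returning now */
    int64_t     prefetch_pos;   /* tuple the read stream wants next */
    bool        enabled;        /* one-time kill switch */
} PrefetchGate;

/*
 * Decide whether to submit an async read for the next heap block.
 * When the prefetch position can't get far enough ahead of the read
 * position, the I/O can't complete before the scan needs the block,
 * and the AIO submission overhead is pure cost -- so give up, for
 * good (the one-time decision from point 2).
 */
static bool
gate_should_prefetch(PrefetchGate *gate)
{
    if (!gate->enabled)
        return false;

    if (gate->prefetch_pos - gate->read_pos < SKETCH_MIN_TUPLE_DISTANCE)
    {
        gate->enabled = false;
        return false;
    }
    return true;
}

int
main(void)
{
    PrefetchGate gate = {.read_pos = 100, .prefetch_pos = 103,
                         .enabled = true};

    /* Distance 3 is below the threshold: prefetching shuts off. */
    printf("prefetch? %d\n", gate_should_prefetch(&gate));
    return 0;
}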
> 3) Now that I look at the code in index_scan_stream_read_next, it
> feels a bit weird that we make the decision based on the "immediate"
> distance only. I suspect this may make it quite fragile, in the
> sense that even a small local irregularity in the data may result in
> different "step" changes. Wouldn't it be better to base this on some
> "average" distance?
>
> In other words, I'm afraid (2) and (3) are pretty much a
> "performance cliff", where a tiny difference in the input can result
> in wildly different behavior.

You can say the same thing about hash join spilling. It might not be
practical to make a strong guarantee that this will never happen. It
might be more useful to focus on finding a way to make it as rare as
possible. (A throwaway sketch of the kind of averaging you might have
in mind appears at the end of this mail.)

If problems like this are possible, but require a "perfect storm" of
buffer hits and misses that occur in precisely the same order, then
maybe it can't be too much of a problem in practice, since it
shouldn't occur again and again.

> > Much cleanup work remains to get the changes I just described in
> > proper shape (to say nothing about open items that we haven't made
> > a start on yet, like moving the read stream out of indexam.c and
> > into heapam). But it has been too long since the last revision.
> > I'd like to establish a regular cadence for posting new revisions
> > of the patch set.
>
> Thank you! I appreciate the collaboration, it's a huge help.

I've enjoyed our collaboration. It feels like things are moving in
the right direction. This is definitely a challenging project.

> I kept running the stress test, trying to find cases that regress,
> and also to better understand the behavior. The script/charts are
> available here: https://github.com/tvondra/prefetch-tests
>
> So far I haven't found any massive regressions (relative to master).
> There are data sets where we regress by a couple percent (and it's
> not noise). I haven't looked into the details, but I believe most of
> this can be attributed to the "AIO costs" we discussed recently
> (with signals), and similar things.

The overall picture that these tests show is a positive one. I think
that this might actually be an acceptable performance profile, across
the board. What's not acceptable is the code itself, and the current
uncertainty about how fragile our approach is. I hope that we can
make it less fragile in the coming weeks.

--
Peter Geoghegan
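PS: Here's the promised throwaway sketch of gating on a smoothed
distance rather than the instantaneous one. Again, nothing here is
patch code; the weight and the threshold are made up, and a real
version would need tuning.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_MIN_TUPLE_DISTANCE 8     /* invented threshold */
#define SKETCH_EWMA_WEIGHT 0.125        /* invented smoothing factor */

typedef struct AvgGate
{
    double      avg_distance;   /* exponentially weighted average */
    bool        enabled;
} AvgGate;

/*
 * Fold the latest observed distance into the running average before
 * testing the threshold, so a single local irregularity in the data
 * can't flip the decision by itself. This smooths the cliff from
 * points 2 and 3, though it doesn't eliminate it.
 */
static bool
avg_gate_should_prefetch(AvgGate *gate, int64_t distance)
{
    if (!gate->enabled)
        return false;

    gate->avg_distance +=
        SKETCH_EWMA_WEIGHT * ((double) distance - gate->avg_distance);

    if (gate->avg_distance < SKETCH_MIN_TUPLE_DISTANCE)
    {
        gate->enabled = false;
        return false;
    }
    return true;
}

int
main(void)
{
    AvgGate gate = {.avg_distance = 32.0, .enabled = true};

    /* One close-ahead request no longer disables prefetching. */
    printf("prefetch? %d\n", avg_gate_should_prefetch(&gate, 2));
    return 0;
}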