Re: index prefetching - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: index prefetching
Msg-id: 6butbqln6ewi5kuxz3kfv2mwomnlgtate4mb4lpa7gb2l63j4t@stlwbi2dvvev
In response to: Re: index prefetching (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers

Hi,

On 2025-08-14 19:36:49 -0400, Andres Freund wrote:
> On 2025-08-14 17:55:53 -0400, Peter Geoghegan wrote:
> > On Thu, Aug 14, 2025 at 5:06 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > > We can optimize that by deferring the StartBufferIO() if we're encountering a
> > > > buffer that is undergoing IO, at the cost of some complexity.  I'm not sure
> > > > real-world queries will encounter the pattern of the same block being
> > > > read in by a read stream multiple times in close proximity sufficiently often
> > > > to make that worth it.
> > >
> > > We definitely need to be prepared for duplicate prefetch requests in
> > > the context of index scans.
> >
> > Can you (or anybody else) think of a quick and dirty way of working
> > around the problem on the read stream side? I would like to prioritize
> > getting the patch into a state where its overall performance profile
> > "feels right". From there we can iterate on fixing the underlying
> > issues in more principled ways.
>
> I think I can see a way to fix the issue, below the read stream layer.
> Basically, whenever AsyncReadBuffers() finds a buffer that has ongoing IO,
> instead of waiting, as we do today, copy the wref to the ReadBuffersOperation
> and set a new flag indicating that we are waiting for an IO that we did not
> start ourselves. Then, in WaitReadBuffers(), we wait for such foreign-started
> IOs. That has to be somewhat different code from today, because we have to
> deal with the possibility that the "foreign" IO has failed.
>
> I'll try writing a prototype for that tomorrow. I think to actually get that
> into a committable shape we need a test harness (probably a read stream
> controlled by an SQL function that gets an array of buffers).

Attached is a prototype of this approach. It does seem to fix this issue.
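
To make the mechanism concrete, here is a rough sketch of the control flow.
This is not the attached patch: the READ_BUFFERS_FOREIGN_IO flag and the exact
placement of the checks are illustrative assumptions, while PgAioWaitRef,
pgaio_wref_valid() and pgaio_wref_wait() are the existing AIO primitives:

    /* AsyncReadBuffers(): found a buffer whose IO someone else started */
    if (!StartBufferIO(bufHdr, true, true))     /* nowait: don't block here */
    {
        /* remember the in-flight "foreign" IO instead of waiting for it */
        operation->io_wref = bufHdr->io_wref;
        operation->flags |= READ_BUFFERS_FOREIGN_IO;   /* hypothetical flag */
        return;
    }

    /* WaitReadBuffers(): wait for an IO that we did not start ourselves */
    if (operation->flags & READ_BUFFERS_FOREIGN_IO)
    {
        if (pgaio_wref_valid(&operation->io_wref))
            pgaio_wref_wait(&operation->io_wref);

        /*
         * The foreign IO reported its result to whoever issued it, and it
         * may have failed, so recheck the buffer and, if it still isn't
         * valid, fall back to issuing the read ourselves.
         */
        if (!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID))
        {
            /* ... re-issue the read for this buffer ... */
        }
    }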

New code disabled:

    #### backwards sequential table ####
    ┌──────────────────────────────────────────────────────────────────────┐
    │                              QUERY PLAN                              │
    ├──────────────────────────────────────────────────────────────────────┤
    │ Index Scan Backward using t_pk on t (actual rows=1048576.00 loops=1) │
    │   Index Cond: ((a >= 16336) AND (a <= 49103))                        │
    │   Index Searches: 1                                                  │
    │   Buffers: shared hit=10291 read=49933                               │
    │   I/O Timings: shared read=213.277                                   │
    │ Planning:                                                            │
    │   Buffers: shared hit=91 read=19                                     │
    │   I/O Timings: shared read=2.124                                     │
    │ Planning Time: 3.269 ms                                              │
    │ Execution Time: 1023.279 ms                                          │
    └──────────────────────────────────────────────────────────────────────┘
    (10 rows)


New code enabled:

    #### backwards sequential table ####
    ┌──────────────────────────────────────────────────────────────────────┐
    │                              QUERY PLAN                              │
    ├──────────────────────────────────────────────────────────────────────┤
    │ Index Scan Backward using t_pk on t (actual rows=1048576.00 loops=1) │
    │   Index Cond: ((a >= 16336) AND (a <= 49103))                        │
    │   Index Searches: 1                                                  │
    │   Buffers: shared hit=10291 read=49933                               │
    │   I/O Timings: shared read=217.225                                   │
    │ Planning:                                                            │
    │   Buffers: shared hit=91 read=19                                     │
    │   I/O Timings: shared read=2.009                                     │
    │ Planning Time: 2.685 ms                                              │
    │ Execution Time: 602.987 ms                                           │
    └──────────────────────────────────────────────────────────────────────┘
    (10 rows)


With the change enabled, the backwards sequential query is now faster than the
backwards random query:

    #### backwards random table ####
    ┌────────────────────────────────────────────────────────────────────────────────────────────┐
    │                                         QUERY PLAN                                         │
    ├────────────────────────────────────────────────────────────────────────────────────────────┤
    │ Index Scan Backward using t_randomized_pk on t_randomized (actual rows=1048576.00 loops=1) │
    │   Index Cond: ((a >= 16336) AND (a <= 49103))                                              │
    │   Index Searches: 1                                                                        │
    │   Buffers: shared hit=6085 read=77813                                                      │
    │   I/O Timings: shared read=347.285                                                         │
    │ Planning:                                                                                  │
    │   Buffers: shared hit=127 read=5                                                           │
    │   I/O Timings: shared read=1.001                                                           │
    │ Planning Time: 1.751 ms                                                                    │
    │ Execution Time: 820.544 ms                                                                 │
    └────────────────────────────────────────────────────────────────────────────────────────────┘
    (10 rows)
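
As for the test harness: a minimal SQL-callable driver might look roughly like
the sketch below. The function name read_stream_test and the no-NULLs,
one-dimensional array handling are illustrative assumptions; the read stream
API calls themselves (read_stream_begin_relation() and friends) are the
existing ones:

    #include "postgres.h"

    #include "access/relation.h"
    #include "fmgr.h"
    #include "storage/bufmgr.h"
    #include "storage/read_stream.h"
    #include "utils/array.h"
    #include "utils/rel.h"

    PG_MODULE_MAGIC;

    typedef struct
    {
        BlockNumber *blocks;
        int         nblocks;
        int         next;
    } BlockListState;

    /* read stream callback: hand out the caller-supplied block numbers,
     * duplicates included, in order */
    static BlockNumber
    block_list_cb(ReadStream *stream, void *callback_private_data,
                  void *per_buffer_data)
    {
        BlockListState *state = (BlockListState *) callback_private_data;

        if (state->next >= state->nblocks)
            return InvalidBlockNumber;
        return state->blocks[state->next++];
    }

    PG_FUNCTION_INFO_V1(read_stream_test);

    /* read_stream_test(regclass, int4[]): stream the given blocks and
     * return the number of buffers read */
    Datum
    read_stream_test(PG_FUNCTION_ARGS)
    {
        Oid         relid = PG_GETARG_OID(0);
        ArrayType  *arr = PG_GETARG_ARRAYTYPE_P(1);
        Relation    rel = relation_open(relid, AccessShareLock);
        BlockListState state;
        ReadStream *stream;
        Buffer      buf;
        int64       nread = 0;

        /* assume a one-dimensional int4 array without NULLs, for brevity */
        state.blocks = (BlockNumber *) ARR_DATA_PTR(arr);
        state.nblocks = ArrayGetNItems(ARR_NDIM(arr), ARR_DIMS(arr));
        state.next = 0;

        stream = read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
                                            rel, MAIN_FORKNUM,
                                            block_list_cb, &state, 0);

        while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
        {
            ReleaseBuffer(buf);
            nread++;
        }

        read_stream_end(stream);
        relation_close(rel, AccessShareLock);

        PG_RETURN_INT64(nread);
    }

Issuing e.g. SELECT read_stream_test('t'::regclass, ARRAY[10, 11, 10, 12])
would then request the same block twice in close proximity, exercising the
foreign-IO wait path above.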



Greetings,

Andres Freund
