Don't synchronously wait for already-in-progress IO in read stream - Mailing list pgsql-hackers

From Andres Freund
Subject Don't synchronously wait for already-in-progress IO in read stream
Msg-id zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
List pgsql-hackers
Hi,

In the index prefetching thread we discovered that read stream performance
suffers rather substantially when a read stream reads the same blocks multiple
times within the readahead distance.

The problem leading to that is that we currently wait synchronously for IO on
a buffer when AsyncReadBuffers() encounters a buffer that already has IO in
progress. If a block is read twice within the readahead distance, that means
we won't actually have enough IOs in flight for good performance. What's
worse, currently the wait is not attributed to IO wait (since we're waiting in
WaitIO(), on the buffer's condition variable, rather than waiting for the IO
itself).
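
To illustrate, the current code path looks roughly like this (a simplified
sketch, not the actual bufmgr.c code; the exact checks and locking differ):

    /*
     * Simplified sketch of the current behaviour when AsyncReadBuffers()
     * encounters a buffer that another IO is already operating on.
     */
    if (pg_atomic_read_u32(&buf_hdr->state) & BM_IO_IN_PROGRESS)
    {
        /*
         * Block until the other IO completes.  This is reported as a
         * BufferIO condition-variable wait, not as IO wait time, and it
         * stalls the stream: no further IOs are issued while sleeping.
         */
        WaitIO(buf_hdr);
    }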

This does not commonly occur with in-tree users of read streams, as users like
seqscans, bitmap heap scans, vacuum, ... never try to read the same block
twice. With index prefetching, however, it is a more common case.

It is possible to encounter a version of this on 18/master: if multiple scans
of the same table are in progress, they can end up waiting synchronously for
each other. However, that is a much less severe issue, as the scan that is
"further ahead" will not be blocked.


To fix it, the attached patch has AsyncReadBuffers() check whether the
"target" buffer already has IO in progress. If so, it assigns the buffer's IO
wait reference to the ReadBuffersOperation, which allows WaitReadBuffers() to
wait for the IO. To make that work correctly, the buffer stats etc. have to be
updated in WaitReadBuffers() in that case.
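
In rough pseudo-C, the approach looks like this (a sketch only; field names
like io_wref on the buffer header and the operation are assumptions about the
AIO infrastructure, not necessarily what the patch does literally):

    /* In AsyncReadBuffers(), when the target buffer has IO in flight: */
    if (pgaio_wref_valid(&buf_hdr->io_wref))
    {
        /*
         * IO for this buffer was already started (by us or by another
         * backend).  Instead of waiting synchronously, remember the IO's
         * wait reference in the operation, so that WaitReadBuffers() can
         * wait for it later, while the read stream keeps issuing further
         * IOs in the meantime.
         */
        operation->io_wref = buf_hdr->io_wref;
        return;                 /* no new IO needs to be started */
    }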


I did not feel sufficiently confident in making this work without tests.
However, some versions of this are not exactly trivial to test, due to the way
multiple processes need to be coordinated. It took way, way longer to write
the tests than to make the code actually work :/.

The attached tests add a new read_stream_for_blocks() function to test_aio. I
found it also rather useful for reproducing the performance issue without the
index prefetching patch applied.  To test the cross-process case, the
injection point infrastructure in test_aio had to be extended a bit.
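
For reference, the core of read_stream_for_blocks() is just a read stream
whose block-number callback walks a caller-supplied array; roughly like the
following sketch (names and structure are illustrative, the actual test_aio
code differs in detail):

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/read_stream.h"
    #include "utils/rel.h"

    /* callback state: the array of block numbers to stream, in order */
    typedef struct BlockArrayState
    {
        BlockNumber *blocks;
        int          nblocks;
        int          next;
    } BlockArrayState;

    /* read stream callback: hand out the next block, or end the stream */
    static BlockNumber
    block_array_cb(ReadStream *stream, void *callback_private_data,
                   void *per_buffer_data)
    {
        BlockArrayState *state = callback_private_data;

        if (state->next >= state->nblocks)
            return InvalidBlockNumber;
        return state->blocks[state->next++];
    }

    /* consume the stream, pinning and immediately releasing each buffer */
    static void
    read_blocks(Relation rel, BlockArrayState *state)
    {
        ReadStream *stream;
        Buffer      buf;

        stream = read_stream_begin_relation(READ_STREAM_DEFAULT, NULL, rel,
                                            MAIN_FORKNUM, block_array_cb,
                                            state, 0);
        while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
            ReleaseBuffer(buf);
        read_stream_end(stream);
    }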


Attached are three patches:

0001: Introduces a TestAio package and splits some existing tests out of
      001_aio.pl

0002: Add read_stream test infrastructure & tests

      Note that the tests don't verify that we avoid the unnecessary wait
      described above, just that IO works correctly.

0003: Improve performance of read stream when re-encountering blocks


To reproduce the issue, the read_stream_for_blocks() function added to
test_aio can be used, in combination with debug_io_direct=data (it's also
possible without DIO, it'd just be more work).

prep:
CREATE EXTENSION test_aio;
CREATE TABLE large AS SELECT i, repeat(random()::text, 5) FROM generate_series(1, 1000000) g(i);

test:
SELECT pg_buffercache_evict_relation('large');
EXPLAIN ANALYZE SELECT * FROM read_stream_for_blocks('large',
  ARRAY(SELECT i + generate_series(0, 3) FROM generate_series(1,10000) g(i)));


Without 0003 applied:

                                                              QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Function Scan on read_stream_for_blocks  (cost=975.00..985.00 rows=1000 width=12) (actual time=673.647..675.254 rows=40000.00 loops=1)
   Buffers: shared hit=29997 read=10003
   I/O Timings: shared read=16.116
   InitPlan 1
     ->  Result  (cost=0.00..975.00 rows=40000 width=4) (actual time=0.556..7.575 rows=40000.00 loops=1)
           ->  ProjectSet  (cost=0.00..375.00 rows=40000 width=8) (actual time=0.556..4.804 rows=40000.00 loops=1)
                 ->  Function Scan on generate_series g  (cost=0.00..100.00 rows=10000 width=4) (actual time=0.554..0.988 rows=10000.00 loops=1)
 Planning Time: 0.060 ms
 Execution Time: 676.436 ms
(9 rows)

Time: 676.753 ms


With 0003:


                                                              QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Function Scan on read_stream_for_blocks  (cost=975.00..985.00 rows=1000 width=12) (actual time=87.730..89.453 rows=40000.00 loops=1)
   Buffers: shared hit=29997 read=10003
   I/O Timings: shared read=50.467
   InitPlan 1
     ->  Result  (cost=0.00..975.00 rows=40000 width=4) (actual time=0.541..7.496 rows=40000.00 loops=1)
           ->  ProjectSet  (cost=0.00..375.00 rows=40000 width=8) (actual time=0.540..4.772 rows=40000.00 loops=1)
                 ->  Function Scan on generate_series g  (cost=0.00..100.00 rows=10000 width=4) (actual time=0.539..0.965 rows=10000.00 loops=1)
 Planning Time: 0.046 ms
 Execution Time: 90.661 ms
(9 rows)

Time: 90.887 ms

Greetings,

Andres Freund

