On 10/8/25 18:17, Tomas Vondra wrote:
>
>
> On 10/8/25 17:14, Ants Aasma wrote:
>> On Mon, 6 Oct 2025 at 12:24, Tomas Vondra <tomas@vondra.me> wrote:
>>> On 10/6/25 07:29, Tom Lane wrote:
>>>> Another angle is that I expect that the ongoing AIO work will largely
>>>> destroy the existing model altogether, at least if you think in terms
>>>> of the model as trying to predict query execution time. But if what
>>>> we're trying to model is net resource demands, with an eye to
>>>> minimizing the total system load not execution time of any one query,
>>>> maybe we can continue to work with something close to what we've
>>>> traditionally done.
>>>>
>>>> No answers here, just more questions ...
>>>>
>>>
>>> I had the same thought, when working on the (index) prefetching. Which
>>> of course now relies on AIO. Without concurrency, there wasn't much
>>> difference between optimizing for resources and time, but AIO changes
>>> that. In fact, parallel query has a similar effect, because it also
>>> spreads the work to multiple concurrent processes.
>>>
>>> Parallel query simply divides the cost between workers, as if each
>>> used a fraction of the resources. And the cost of the parallel plan
>>> is lower than summing up the per-worker costs. Maybe AIO should do
>>> something similar? That is, estimate the I/O concurrency and lower
>>> the cost a bit?
>>
>> OS and/or disk read-ahead is muddying the water here. Most modern
>> storage can saturate its bandwidth with enough concurrent or large
>> enough requests. The read-ahead is effectively increasing request
>> concurrency behind PostgreSQL's back, while random I/O is running
>> with concurrency 1. It would be very interesting to see what
>> debug_io_direct does, and also fio numbers for direct I/O.
>>
>
> I think I've done some of the runs with direct I/O (e.g. the Azure runs
> were doing that), and the conclusions were mostly the same. I did a
> couple of runs on the other machines, but I don't have results I could
> present. I'll try to get some, maybe it'll be different.
>
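As for the fio suggestion, a direct I/O random-read test might look
something like the sketch below (the file path, size and runtime are
placeholders, not what I actually ran; psync keeps the queue depth
at 1, matching the "random with concurrency 1" case):

  fio --name=randread --filename=/mnt/data/fio-testfile --size=32G \
      --rw=randread --bs=8k --direct=1 --ioengine=psync \
      --runtime=60 --time_based
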
Here are results with debug_io_direct enabled (and shared_buffers=256MB
in both cases), from the single NVMe device:

                 seqscan (s)   index scan (s)   random_page_cost
  ---------------------------------------------------------------
  buffered I/O      115             40404              16.6
  direct I/O        108             53053              23.4

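For reference, the direct I/O run amounts to something like this in
postgresql.conf (a sketch - both settings require a restart, and the
'data' value limits direct I/O to relation data files):

  shared_buffers = '256MB'
  debug_io_direct = 'data'
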
I believe the difference is mostly due to the page cache - with 182GB
of data on 64GB of RAM, that's about a 30% cache hit ratio, give or
take. And the buffered index scan is about 25% faster - not exactly
30%, but close. Also, funnily enough, the seqscans are faster with
direct I/O (i.e. without kernel read-ahead).
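
Back-of-the-envelope, using the numbers from the table:

  expected cache hits:  64GB / 182GB             ~ 0.35
  buffered time saved:  (53053 - 40404) / 53053  ~ 0.24

The 0.35 is an upper bound (the kernel can't dedicate all of the RAM
to the page cache), so the observed ~24% speedup seems to be in the
right ballpark.
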
It's just one run, of course. But the results seem reasonable.

regards
--
Tomas Vondra