Thread: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward

Hello list,

I'm submitting a patch for improving an almost 1h long pause at the start
of parallel pg_restore of a big archive. Related discussion has taken
place at pgsql-performance mailing list at:

https://www.postgresql.org/message-id/flat/6bd16bdb-aa5e-0512-739d-b84100596035%40gmx.net

I think I explain it rather well in the commit message, so I paste it
inline:


Improve the performance of parallel pg_restore (-j) from a custom format
pg_dump archive that does not include data offsets - typically the case
when pg_dump has written the archive to stdout instead of a file.

In this case pg_restore workers show a constant loop of reading small
sizes (4KB) and seeking forward by small lengths (around 10KB for a
compressed archive):

read(4, "..."..., 4096) = 4096
lseek(4, 55544369152, SEEK_SET)         = 55544369152
read(4, "..."..., 4096) = 4096
lseek(4, 55544381440, SEEK_SET)         = 55544381440
read(4, "..."..., 4096) = 4096
lseek(4, 55544397824, SEEK_SET)         = 55544397824
read(4, "..."..., 4096) = 4096
lseek(4, 55544414208, SEEK_SET)         = 55544414208
read(4, "..."..., 4096) = 4096
lseek(4, 55544426496, SEEK_SET)         = 55544426496

This happens because each worker scans the whole file until it finds the
entry it wants, seeking forward over each block. Combined with the small
block size of the custom format dump, this causes many seeks and low
performance.

Fix by replacing forward seeks with sequential reads for jumps of less
than 1MB.

Performance gain can be significant, depending on the size of the dump
and the I/O subsystem. On my local NVMe drive, read speeds for that
phase of pg_restore increased from 150MB/s to 3GB/s.
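For readers following along, the idea in the commit message looks roughly
like this (a minimal sketch, not the actual patch: the helper name, the
threshold macro, and the SEEK_CUR-based interface are all illustrative):

```c
#include <stdio.h>

/* Sketch of the idea: for short forward jumps, read and discard bytes so
 * the OS sees one sequential stream; only fall back to a real seek for
 * long jumps. */
#define MAX_SEQ_SKIP (1024 * 1024)      /* the 1MB threshold */

static int
skip_forward(FILE *fp, long len)
{
    char        discard[4096];

    if (len > MAX_SEQ_SKIP)
        return fseeko(fp, len, SEEK_CUR);   /* long jump: a real seek wins */

    while (len > 0)
    {
        size_t      chunk = (len < (long) sizeof(discard)) ?
            (size_t) len : sizeof(discard);

        if (fread(discard, 1, chunk, fp) != chunk)
            return -1;                  /* short read: treat as error */
        len -= (long) chunk;
    }
    return 0;
}
```

For jumps under the threshold the kernel then sees only read() calls,
which keeps its readahead heuristics engaged.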


This is my first patch submission, all help is much appreciated.
Regards,
Dimitris


P.S.  What is the recommended way to test a change, besides a generic make
check? And how do I run selectively only the pg_dump/restore tests, in
order to speed up my development routine?


Attachment
On Sat, 29 Mar 2025, Dimitrios Apostolou wrote:
>
> P.S.  What is the recommended way to test a change, besides a generic make
> check? And how do I run selectively only the pg_dump/restore tests, in order
> to speed up my development routine?

I have tested it with:

   make  -C src/bin/pg_dump  check

It didn't break any test, but I also don't see any difference; the
performance boost is noticeable only when restoring a huge archive that is
missing offsets.

Any volunteer to review this one-line patch?

Thanks,
Dimitris




On Tue, Apr 01, 2025 at 09:33:32PM +0200, Dimitrios Apostolou wrote:
> It didn't break any test, but I also don't see any difference, the
> performance boost is noticeable only when restoring a huge archive that is
> missing offsets.

This seems generally reasonable to me, but how did you decide on 1MB as the
threshold?  Have you tested other values?  Could the best threshold vary
based on the workload and hardware?

-- 
nathan



Thanks. This is the first value I tried and it works well. In the archive I
have, all blocks seem to be between 8 and 20KB, so the jump forward before
the change never even got close to 1MB. Could it be bigger in an
uncompressed archive? Or in a future pg_dump that raises the block size? I
don't really know, so it is difficult to test such scenarios, but it made
sense to guard against these cases too.

I chose 1MB by doing a very crude calculation in my mind: when would it be
worth seeking forward instead of reading? On very slow drives, 60MB/s
sequential and 60 IOPS for random reads is a plausible speed. In that worst
case it would be better to seek() forward only for lengths of over 1MB.
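That crude calculation can be written down in one line (a sketch only; the
60MB/s and 60 IOPS figures are the assumed worst case above, not
measurements):

```c
/* Back-of-the-envelope check of the 1MB figure: a forward seek saves at
 * most 1/IOPS seconds of reading, so the break-even jump length is
 * however many bytes could have been read sequentially in that time. */
static double
breakeven_jump_bytes(double seq_bytes_per_sec, double iops)
{
    return seq_bytes_per_sec / iops;    /* 60e6 / 60 = 1e6 bytes, ~1MB */
}
```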

On 1 April 2025 22:04:00 CEST, Nathan Bossart <nathandbossart@gmail.com> wrote:
>On Tue, Apr 01, 2025 at 09:33:32PM +0200, Dimitrios Apostolou wrote:
>> It didn't break any test, but I also don't see any difference, the
>> performance boost is noticeable only when restoring a huge archive that is
>> missing offsets.
>
>This seems generally reasonable to me, but how did you decide on 1MB as the
>threshold?  Have you tested other values?  Could the best threshold vary
>based on the workload and hardware?
>



I just managed to run pgindent, here is v2 with the comment style fixed.
Attachment
On Fri, 4 Apr 2025, Dimitrios Apostolou wrote:

> I just managed to run pgindent, here is v2 with the comment style fixed.

Any feedback on this one-liner? Or is the lack of feedback a clue that I
have been missing something important in my patch submission? :-)

Should I CC people that are frequent committers to the file?


Thanks,
Dimitris




Dimitrios Apostolou <jimis@gmx.net> writes:
> Any feedback on this one-liner? Or is the lack of feedback a clue that I 
> have been missing something important in my patch submission? :-)

The calendar ;-).  At this point we're in feature freeze for v18,
so things that aren't bugs aren't likely to get much attention
until v19 development opens up (in July, unless things are really
going badly with v18).

You should add your patch to the July commitfest [1] to make sure
we don't lose track of it.

            regards, tom lane

[1] https://commitfest.postgresql.org/53/




On Mon, 14 Apr 2025, Tom Lane wrote:

>
> You should add your patch to the July commitfest [1] to make sure
> we don't lose track of it.

I rebased the patch (attached) and created an entry in the commitfest:

https://commitfest.postgresql.org/patch/5809/


Thanks!
Dimitris




Attached now...

On Mon, 9 Jun 2025, Dimitrios Apostolou wrote:

>
>
> On Mon, 14 Apr 2025, Tom Lane wrote:
>
>>
>>  You should add your patch to the July commitfest [1] to make sure
>>  we don't lose track of it.
>
> I rebased the patch (attached) and created an entry in the commitfest:
>
> https://commitfest.postgresql.org/patch/5809/
>
>
> Thanks!
> Dimitris
>
>
>
Attachment
On Wed, Jun 11, 2025 at 12:32:58AM +0200, Dimitrios Apostolou wrote:
> Thank you for benchmarking! Before answering in more depth, I'm curious,
> what read-seek pattern do you see on the system call level (as shown by
> strace)? In pg_restore it was a constant loop of read(4K)-lseek(8-16K).

For fseeko(), sizes less than 4096 produce a repeating pattern of read()
calls followed by approximately (4096 / size) lseek() calls.  For greater
sizes, it's just a stream of lseek().  For fread(), sizes less than 4096
produce a stream of read(fd, "...", 4096), and for greater sizes, the only
difference is that the last argument is the size.

-- 
nathan



On Tue, 10 Jun 2025, Nathan Bossart wrote:

> So, fseeko() starts winning around 4096 bytes.  On macOS, the differences
> aren't quite as dramatic, but 4096 bytes is the break-even point there,
> too.  I imagine there's a buffer around that size somewhere...
>
> This doesn't fully explain the results you are seeing, but it does seem to
> validate the idea.  I'm curious if you see further improvement with even
> lower thresholds (e.g., 8KB, 16KB, 32KB).

By the way, I might have set the threshold to 1MB in my program, but
lowering it won't show a difference in my test case, since the lseek()s I
was noticing before the patch were mostly 8-16KB forward. Not sure what is
the defining factor for that. Maybe the compression algorithm, or how wide
the table is?


Dimitris




On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote:
> By the way, I might have set the threshold to 1MB in my program, but
> lowering it won't show a difference in my test case, since the lseek()s I
> was noticing before the patch were mostly 8-16KB forward. Not sure what is
> the defining factor for that. Maybe the compression algorithm, or how wide
> the table is?

I may have missed it, but could you share what the strace looks like with
the patch applied?

-- 
nathan



On Wed, Jun 11, 2025 at 9:48 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> So, fseeko() starts winning around 4096 bytes.  On macOS, the differences
> aren't quite as dramatic, but 4096 bytes is the break-even point there,
> too.  I imagine there's a buffer around that size somewhere...

BTW you can call setvbuf(f, my_buffer, _IOFBF, my_buffer_size) to
control FILE buffering.  I suspect that glibc ignores the size if you
pass NULL for my_buffer, so you'd need to allocate it yourself and it
should probably be aligned on PG_IO_ALIGN_SIZE for best results
(minimising the number of VM pages that must be held/pinned).  Then
you might be able to get better and less OS-dependent results.  I
haven't studied this seek business so I have no opinion on that and
what a good size would be, but interesting sizes might be
rounded to both PG_IO_ALIGN_SIZE and filesystem block size according
to fstat(fileno(stream)).  IDK, just a thought...
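A minimal sketch of that suggestion, assuming a POSIX system (MY_ALIGN
stands in for PG_IO_ALIGN_SIZE here, and the 128KB size is an arbitrary
placeholder, not a recommendation):

```c
#include <stdio.h>
#include <stdlib.h>

/* Give a stream a caller-allocated, aligned buffer.  Must be called
 * before any I/O is done on the stream, per setvbuf() semantics. */
#define MY_ALIGN 4096
#define MY_BUFSZ (128 * 1024)

static char *
set_large_buffer(FILE *f)
{
    char       *buf;

    if (posix_memalign((void **) &buf, MY_ALIGN, MY_BUFSZ) != 0)
        return NULL;
    if (setvbuf(f, buf, _IOFBF, MY_BUFSZ) != 0)
    {
        free(buf);
        return NULL;
    }
    return buf;     /* caller must keep this alive until fclose() */
}
```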



On Sat, 14 Jun 2025, Dimitrios Apostolou wrote:

> On Fri, 13 Jun 2025, Nathan Bossart wrote:
>
>>  On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote:
>>>  By the way, I might have set the threshold to 1MB in my program, but
>>>  lowering it won't show a difference in my test case, since the lseek()s I
>>>  was noticing before the patch were mostly 8-16KB forward. Not sure what
>>>  is
>>>  the defining factor for that. Maybe the compression algorithm, or how
>>>  wide
>>>  the table is?
>>
>>  I may have missed it, but could you share what the strace looks like with
>>  the patch applied?
>
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096


This was from pg_restoring a zstd-compressed custom format dump.

Out of curiosity I've tried the same with an uncompressed dump
(--compress=none). Surprisingly, the block size seems to be even smaller.

With my patched pg_restore I only get 4K reads and nothing else on
the strace output.

read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096

The unpatched pg_restore gives me the weirdest output ever:

read(4, "..."..., 4096) = 4096
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
[ ... repeats about 80 times ...]
read(4, "..."..., 4096) = 4096
lseek(4, 98527920128, SEEK_SET)         = 98527920128
lseek(4, 98527920128, SEEK_SET)         = 98527920128
lseek(4, 98527920128, SEEK_SET)         = 98527920128
lseek(4, 98527920128, SEEK_SET)         = 98527920128
[ ... repeats ... ]



Seeing this, I think we should really consider raising the pg_dump block
size, as Tom suggested in a previous thread.


Dimitris