Thread: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward
[PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward
From
Dimitrios Apostolou
Date:
Hello list,

I'm submitting a patch that removes an almost 1h long pause at the start of parallel pg_restore of a big archive. Related discussion has taken place on the pgsql-performance mailing list at:

https://www.postgresql.org/message-id/flat/6bd16bdb-aa5e-0512-739d-b84100596035%40gmx.net

I think I explain it rather well in the commit message, so I paste it inline:

Improve the performance of parallel pg_restore (-j) from a custom format pg_dump archive that does not include data offsets - typically happening when pg_dump has generated it by writing to stdout instead of a file.

In this case pg_restore workers loop constantly, reading small chunks (4KB) and seeking forward by small lengths (around 10KB for a compressed archive):

    read(4, "..."..., 4096) = 4096
    lseek(4, 55544369152, SEEK_SET) = 55544369152
    read(4, "..."..., 4096) = 4096
    lseek(4, 55544381440, SEEK_SET) = 55544381440
    read(4, "..."..., 4096) = 4096
    lseek(4, 55544397824, SEEK_SET) = 55544397824
    read(4, "..."..., 4096) = 4096
    lseek(4, 55544414208, SEEK_SET) = 55544414208
    read(4, "..."..., 4096) = 4096
    lseek(4, 55544426496, SEEK_SET) = 55544426496

This happens because each worker scans the whole file until it finds the entry it wants, skipping forward over each block. In combination with the small block size of the custom format dump, this causes many seeks and low performance.

Fix by avoiding forward seeks for jumps of less than 1MB forward. Do sequential reads instead.

Performance gain can be significant, depending on the size of the dump and the I/O subsystem. On my local NVMe drive, read speeds for that phase of pg_restore increased from 150MB/s to 3GB/s.

This is my first patch submission, all help is much appreciated.

Regards,
Dimitris

P.S. What is the recommended way to test a change, besides a generic make check? And how do I selectively run only the pg_dump/restore tests, in order to speed up my development routine?
Attachment
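The idea in the commit message above (read through short gaps, seek only over long ones) can be sketched roughly as follows. This is not the attached patch: the function name skip_forward, the macro SKIP_BY_READING_THRESHOLD and the 4KB scratch buffer are assumptions made for illustration. Only the 1MB threshold comes from the message.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for fseeko() and off_t */
#endif
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical name; 1MB is the threshold mentioned in the message. */
#define SKIP_BY_READING_THRESHOLD (1024 * 1024)

/*
 * Advance the stream by "distance" bytes.  Jumps shorter than the
 * threshold are consumed with fread(), so the access pattern stays
 * sequential; longer jumps use fseeko().
 * Returns 0 on success, -1 on error or premature end of file.
 */
int
skip_forward(FILE *fp, off_t distance)
{
	char		buf[4096];

	assert(fp != NULL && distance >= 0);

	if (distance >= SKIP_BY_READING_THRESHOLD)
		return (fseeko(fp, distance, SEEK_CUR) == 0) ? 0 : -1;

	while (distance > 0)
	{
		size_t		want = (distance < (off_t) sizeof(buf))
			? (size_t) distance : sizeof(buf);
		size_t		got = fread(buf, 1, want, fp);

		if (got == 0)
			return -1;			/* EOF or read error before the target */
		distance -= (off_t) got;
	}
	return 0;
}
```

A caller that knows the length of the block it wants to skip would call skip_forward(fp, block_length) instead of seeking unconditionally. Whether the threshold should be a constant or tuned per system is exactly the open question raised later in the thread.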
Re: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward
From
Dimitrios Apostolou
Date:
On Sat, 29 Mar 2025, Dimitrios Apostolou wrote:
>
> P.S. What is the recommended way to test a change, besides a generic make
> check? And how do I run selectively only the pg_dump/restore tests, in order
> to speed up my development routine?

I have tested it with:

    make -C src/bin/pg_dump check

It didn't break any test, but I also don't see any difference: the performance boost is noticeable only when restoring a huge archive that is missing offsets.

Any volunteer to review this one-line patch?

Thanks,
Dimitris
Re: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward
From
Nathan Bossart
Date:
On Tue, Apr 01, 2025 at 09:33:32PM +0200, Dimitrios Apostolou wrote: > It didn't break any test, but I also don't see any difference, the > performance boost is noticeable only when restoring a huge archive that is > missing offsets. This seems generally reasonable to me, but how did you decide on 1MB as the threshold? Have you tested other values? Could the best threshold vary based on the workload and hardware? -- nathan
Re: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward
From
Dimitrios Apostolou
Date:
Thanks. This is the first value I tried and it works well. In the archive I have, all blocks seem to be between 8 and 20KB, so the jump forward before the change never even got close to 1MB. Could it be bigger in an uncompressed archive? Or in a future pg_dump that raises the block size? I don't really know, so it is difficult to test such a scenario, but it made sense to guard against these cases too.

I chose 1MB by basically doing a very crude calculation in my mind: when would it be worth seeking forward instead of reading? On very slow drives, 60MB/s sequential and 60 IOPS for random reads is a possible speed. In that worst case it would be better to seek() forward for lengths of over 1MB.

On 1 April 2025 22:04:00 CEST, Nathan Bossart <nathandbossart@gmail.com> wrote:
>On Tue, Apr 01, 2025 at 09:33:32PM +0200, Dimitrios Apostolou wrote:
>> It didn't break any test, but I also don't see any difference, the
>> performance boost is noticeable only when restoring a huge archive that is
>> missing offsets.
>
>This seems generally reasonable to me, but how did you decide on 1MB as the
>threshold? Have you tested other values? Could the best threshold vary
>based on the workload and hardware?
>
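The back-of-the-envelope estimate in this message can be written out as a one-line calculation: the break-even distance is the number of bytes the drive reads sequentially in the time one random access takes. The function name below is hypothetical, and the model is as crude as the message says. It charges every seek a full random access, which is likely pessimistic for short forward jumps that the OS may already have read ahead.

```c
#include <assert.h>

/*
 * Bytes a drive can read sequentially in the time of one random access.
 * For jumps shorter than this, reading through the gap costs no more
 * than seeking over it (under this simple model).
 */
double
seek_break_even_bytes(double seq_bytes_per_sec, double random_iops)
{
	assert(seq_bytes_per_sec > 0 && random_iops > 0);
	return seq_bytes_per_sec / random_iops;
}
```

With the figures from the message, seek_break_even_bytes(60e6, 60.0) gives 1e6 bytes, i.e. the roughly 1MB threshold chosen for the patch.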
Re: [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
From
Dimitrios Apostolou
Date:
I just managed to run pgindent, here is v2 with the comment style fixed.
Attachment
[PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
From
Dimitrios Apostolou
Date:
On Fri, 4 Apr 2025, Dimitrios Apostolou wrote: > I just managed to run pgindent, here is v2 with the comment style fixed. Any feedback on this one-liner? Or is the lack of feedback a clue that I have been missing something important in my patch submission? :-) Should I CC people that are frequent committers to the file? Thanks, Dimitris
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
From
Tom Lane
Date:
Dimitrios Apostolou <jimis@gmx.net> writes: > Any feedback on this one-liner? Or is the lack of feedback a clue that I > have been missing something important in my patch submission? :-) The calendar ;-). At this point we're in feature freeze for v18, so things that aren't bugs aren't likely to get much attention until v19 development opens up (in July, unless things are really going badly with v18). You should add your patch to the July commitfest [1] to make sure we don't lose track of it. regards, tom lane [1] https://commitfest.postgresql.org/53/
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
From
Dimitrios Apostolou
Date:
On Mon, 14 Apr 2025, Tom Lane wrote: > > You should add your patch to the July commitfest [1] to make sure > we don't lose track of it. I rebased the patch (attached) and created an entry in the commitfest: https://commitfest.postgresql.org/patch/5809/ Thanks! Dimitris
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
From
Dimitrios Apostolou
Date:
Attached now... On Mon, 9 Jun 2025, Dimitrios Apostolou wrote: > > > On Mon, 14 Apr 2025, Tom Lane wrote: > >> >> You should add your patch to the July commitfest [1] to make sure >> we don't lose track of it. > > I rebased the patch (attached) and created an entry in the commitfest: > > https://commitfest.postgresql.org/patch/5809/ > > > Thanks! > Dimitris > > >
Attachment
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
From
Nathan Bossart
Date:
On Wed, Jun 11, 2025 at 12:32:58AM +0200, Dimitrios Apostolou wrote: > Thank you for benchmarking! Before answering in more depth, I'm curious, > what read-seek pattern do you see on the system call level (as shown by > strace)? In pg_restore it was a constant loop of read(4K)-lseek(8-16K). For fseeko(), sizes less than 4096 produce a repeating pattern of read() calls followed by approximately (4096 / size) lseek() calls. For greater sizes, it's just a stream of lseek(). For fread(), sizes less than 4096 produce a stream of read(fd, "...", 4096), and for greater sizes, the only difference is that the last argument is the size. -- nathan
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
From
Dimitrios Apostolou
Date:
On Tue, 10 Jun 2025, Nathan Bossart wrote: > So, fseeko() starts winning around 4096 bytes. On macOS, the differences > aren't quite as dramatic, but 4096 bytes is the break-even point there, > too. I imagine there's a buffer around that size somewhere... > > This doesn't fully explain the results you are seeing, but it does seem to > validate the idea. I'm curious if you see further improvement with even > lower thresholds (e.g., 8KB, 16KB, 32KB). By the way, I might have set the threshold to 1MB in my program, but lowering it won't show a difference in my test case, since the lseek()s I was noticing before the patch were mostly 8-16KB forward. Not sure what is the defining factor for that. Maybe the compression algorithm, or how wide the table is? Dimitris
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
From
Nathan Bossart
Date:
On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote: > By the way, I might have set the threshold to 1MB in my program, but > lowering it won't show a difference in my test case, since the lseek()s I > was noticing before the patch were mostly 8-16KB forward. Not sure what is > the defining factor for that. Maybe the compression algorithm, or how wide > the table is? I may have missed it, but could you share what the strace looks like with the patch applied? -- nathan
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
From
Thomas Munro
Date:
On Wed, Jun 11, 2025 at 9:48 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > So, fseeko() starts winning around 4096 bytes. On macOS, the differences > aren't quite as dramatic, but 4096 bytes is the break-even point there, > too. I imagine there's a buffer around that size somewhere... BTW you can call setvbuf(f, my_buffer, _IOFBF, my_buffer_size) to control FILE buffering. I suspect that glibc ignores the size if you pass NULL for my_buffer, so you'd need to allocate it yourself and it should probably be aligned on PG_IO_ALIGN_SIZE for best results (minimising the number of VM pages that must be held/pinned). Then you might be able to get better and less OS-dependent results. I haven't studied this seek business so I have no opinion on that and what a good size would be, but interesting sizes might be rounded to both PG_IO_ALIGN_SIZE and filesystem block size according to fstat(fileno(stream)). IDK, just a thought...
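A minimal sketch of the setvbuf() suggestion above, assuming a POSIX system. The helper name set_aligned_stream_buffer and the macro EXAMPLE_IO_ALIGN are made up for this example. The macro stands in for PG_IO_ALIGN_SIZE and is assumed to be 4096 here. A good buffer size is left open, as in the message.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for posix_memalign() */
#endif
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed alignment; stands in for PostgreSQL's PG_IO_ALIGN_SIZE. */
#define EXAMPLE_IO_ALIGN 4096

/*
 * Give "fp" a caller-allocated, aligned, fully buffered stdio buffer.
 * Must be called before any other I/O on the stream.  Returns the buffer
 * on success (free it only after fclose()), or NULL on failure.
 */
void *
set_aligned_stream_buffer(FILE *fp, size_t size)
{
	void	   *buf = NULL;

	assert(fp != NULL);

	/* Keep the example simple: require a whole number of aligned units. */
	if (size == 0 || size % EXAMPLE_IO_ALIGN != 0)
		return NULL;
	if (posix_memalign(&buf, EXAMPLE_IO_ALIGN, size) != 0)
		return NULL;
	if (setvbuf(fp, buf, _IOFBF, size) != 0)
	{
		free(buf);
		return NULL;
	}
	return buf;
}
```

For instance, calling set_aligned_stream_buffer(fp, 128 * 1024) right after fopen() would make stdio issue larger read() calls than the 4096-byte ones seen in the strace output. Whether that helps the seek pattern discussed here has not been measured in this thread.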
Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
From
Dimitrios Apostolou
Date:
On Sat, 14 Jun 2025, Dimitrios Apostolou wrote:

> On Fri, 13 Jun 2025, Nathan Bossart wrote:
>
>> On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote:
>>> By the way, I might have set the threshold to 1MB in my program, but
>>> lowering it won't show a difference in my test case, since the lseek()s I
>>> was noticing before the patch were mostly 8-16KB forward. Not sure what is
>>> the defining factor for that. Maybe the compression algorithm, or how wide
>>> the table is?
>>
>> I may have missed it, but could you share what the strace looks like with
>> the patch applied?
>
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096

This was from pg_restoring a zstd-compressed custom format dump.

Out of curiosity I've tried the same with an uncompressed dump (--compress=none). Surprisingly it seems the blocksize is even smaller.

With my patched pg_restore I only get 4K reads and nothing else on the strace output.

    read(4, "..."..., 4096) = 4096
    read(4, "..."..., 4096) = 4096
    read(4, "..."..., 4096) = 4096
    read(4, "..."..., 4096) = 4096
    read(4, "..."..., 4096) = 4096
    read(4, "..."..., 4096) = 4096

The unpatched pg_restore gives me the weirdest output ever:

    read(4, "..."..., 4096) = 4096
    lseek(4, 98527916032, SEEK_SET) = 98527916032
    lseek(4, 98527916032, SEEK_SET) = 98527916032
    lseek(4, 98527916032, SEEK_SET) = 98527916032
    lseek(4, 98527916032, SEEK_SET) = 98527916032
    lseek(4, 98527916032, SEEK_SET) = 98527916032
    lseek(4, 98527916032, SEEK_SET) = 98527916032
    [ ... repeats about 80 times ...]
    read(4, "..."..., 4096) = 4096
    lseek(4, 98527920128, SEEK_SET) = 98527920128
    lseek(4, 98527920128, SEEK_SET) = 98527920128
    lseek(4, 98527920128, SEEK_SET) = 98527920128
    lseek(4, 98527920128, SEEK_SET) = 98527920128
    [ ... repeats ... ]

Seeing this, I think we should really consider raising the pg_dump block size like Tom suggested on a previous thread.

Dimitris