
From: Andres Freund
Subject: Re: AIO v2.0
Date:
Msg-id: 6y5xyw3q2773mvvsjgap27js3guklxxgjy5o24f67vkkjliubv@pio54caabde2
In response to: Re: AIO v2.0 (Jakub Wartak <jakub.wartak@enterprisedb.com>)
Responses: Re: AIO v2.0
List: pgsql-hackers
Hi,

On 2025-01-09 20:10:24 +0200, Ants Aasma wrote:
> On Thu, 9 Jan 2025 at 18:25, Andres Freund <andres@anarazel.de> wrote:
> > > I'm curious about this because the checksum code should be fast enough
> > > to easily handle that throughput.
> >
> > It seems to top out at about ~5-6 GB/s on my 2x Xeon Gold 6442Y
> > workstation. But we don't have a good ready-made way of testing that without
> > also doing IO, so it's kinda hard to say.
>
> Interesting, I wonder if it's related to Intel increasing vpmulld
> latency to 10 already back in Haswell. The Zen 3 I'm testing on has
> latency 3 and has twice the throughput.

> Attached is a naive and crude benchmark that I used for testing here.
> Compiled with:
>
> gcc -O2 -funroll-loops -ftree-vectorize -march=native \
>   -I$(pg_config --includedir-server) \
>   bench-checksums.c -o bench-checksums-native
>
> Just fills up an array of pages and checksums them, first argument is
> number of checksums, second is array size. I used 1M checksums and 100
> pages for in cache behavior and 100000 pages for in memory
> performance.
>
> 869.85927ms @ 9.418 GB/s - generic from memory
> 772.12252ms @ 10.610 GB/s - generic in cache
> 442.61869ms @ 18.508 GB/s - native from memory
> 137.07573ms @ 59.763 GB/s - native in cache
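
For anyone not digging out the attachment, such a benchmark presumably boils
down to something like the sketch below (an illustration, not Ants's actual
bench-checksums.c; it only assumes the server include path from the gcc
invocation above):

/*
 * Illustrative sketch only, not Ants's actual bench-checksums.c: fill an
 * array of pages with arbitrary data, checksum them in a loop, and report
 * elapsed time and throughput.  argv[1] is the number of checksum calls,
 * argv[2] the number of distinct pages cycled through.
 */
#include "postgres_fe.h"

#include "storage/checksum_impl.h"	/* provides pg_checksum_page() */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int
main(int argc, char **argv)
{
	long		ncalls = atol(argv[1]);
	long		npages = atol(argv[2]);
	char	   *pages = aligned_alloc(BLCKSZ, (size_t) npages * BLCKSZ);
	struct timespec start, end;
	uint64		sink = 0;	/* keeps the loop from being optimized away */
	double		secs;

	/* arbitrary, non-new page contents */
	for (long i = 0; i < npages * BLCKSZ; i++)
		pages[i] = (char) (i + 1);

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < ncalls; i++)
		sink += pg_checksum_page(pages + (i % npages) * BLCKSZ,
								 (BlockNumber) i);
	clock_gettime(CLOCK_MONOTONIC, &end);

	secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
	printf("%.5fms @ %.3f GB/s (sink %llu)\n",
		   secs * 1000.0,
		   ncalls * (double) BLCKSZ / secs / 1e9,
		   (unsigned long long) sink);

	return 0;
}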

printf '%16s\t%16s\t%s\n' march mem result; \
for mem in 100 100000 1000000; do \
  for march in x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native; do \
    printf "%16s\t%16s\t" $march $mem; \
    gcc -g -g3 -O2 -funroll-loops -ftree-vectorize -march=$march \
        -I ~/src/postgresql/src/include/ -I src/include/ \
        /tmp/bench-checksums.c -o bench-checksums-native && \
      numactl --physcpubind 1 --membind 0 ./bench-checksums-native 1000000 $mem; \
  done; \
done
 

Workstation w/ 2x Xeon Gold 6442Y:

           march                 mem    result
          x86-64                 100    731.87779ms @ 11.193 GB/s
       x86-64-v2                 100    327.18580ms @ 25.038 GB/s
       x86-64-v3                 100    264.03547ms @ 31.026 GB/s
       x86-64-v4                 100    282.08065ms @ 29.041 GB/s
          native                 100    246.13766ms @ 33.282 GB/s
          x86-64              100000    842.66827ms @ 9.722 GB/s
       x86-64-v2              100000    604.52959ms @ 13.551 GB/s
       x86-64-v3              100000    477.16239ms @ 17.168 GB/s
       x86-64-v4              100000    476.07039ms @ 17.208 GB/s
          native              100000    456.08080ms @ 17.962 GB/s
          x86-64             1000000    845.51132ms @ 9.689 GB/s
       x86-64-v2             1000000    612.07973ms @ 13.384 GB/s
       x86-64-v3             1000000    485.23738ms @ 16.882 GB/s
       x86-64-v4             1000000    483.86411ms @ 16.930 GB/s
          native             1000000    462.88461ms @ 17.698 GB/s



Zen 4 laptop (AMD Ryzen 7 PRO 7840U):
           march                 mem    result
          x86-64                 100    417.19762ms @ 19.636 GB/s
       x86-64-v2                 100    130.67596ms @ 62.689 GB/s
       x86-64-v3                 100    97.07758ms @ 84.386 GB/s
       x86-64-v4                 100    95.67704ms @ 85.621 GB/s
          native                 100    95.15734ms @ 86.089 GB/s
          x86-64              100000    431.38370ms @ 18.990 GB/s
       x86-64-v2              100000    215.74856ms @ 37.970 GB/s
       x86-64-v3              100000    199.74492ms @ 41.012 GB/s
       x86-64-v4              100000    186.98300ms @ 43.811 GB/s
          native              100000    187.68125ms @ 43.648 GB/s
          x86-64             1000000    433.87893ms @ 18.881 GB/s
       x86-64-v2             1000000    217.46561ms @ 37.670 GB/s
       x86-64-v3             1000000    200.40667ms @ 40.877 GB/s
       x86-64-v4             1000000    187.51978ms @ 43.686 GB/s
          native             1000000    190.29273ms @ 43.049 GB/s


Workstation w/ 2x Xeon Gold 5215:
           march                 mem    result
          x86-64                 100    780.38881ms @ 10.497 GB/s
       x86-64-v2                 100    389.62005ms @ 21.026 GB/s
       x86-64-v3                 100    323.97294ms @ 25.286 GB/s
       x86-64-v4                 100    274.19493ms @ 29.877 GB/s
          native                 100    283.48674ms @ 28.897 GB/s
          x86-64              100000    1112.63898ms @ 7.363 GB/s
       x86-64-v2              100000    831.45641ms @ 9.853 GB/s
       x86-64-v3              100000    696.20789ms @ 11.767 GB/s
       x86-64-v4              100000    685.61636ms @ 11.948 GB/s
          native              100000    689.78023ms @ 11.876 GB/s
          x86-64             1000000    1128.65580ms @ 7.258 GB/s
       x86-64-v2             1000000    843.92594ms @ 9.707 GB/s
       x86-64-v3             1000000    718.78848ms @ 11.397 GB/s
       x86-64-v4             1000000    687.68258ms @ 11.912 GB/s
          native             1000000    705.34731ms @ 11.614 GB/s


That's quite a drastic difference between AMD and Intel. Of course, it's also
comparing a multi-core server uarch (lower per-core bandwidth, much higher
aggregate bandwidth) with a client uarch.


The difference between the baseline CPU target and a more modern profile is
also rather impressive.  Looks like some CPU-capability-based dispatch would
likely be worth it, even if it didn't matter in my numbers here due to
-march=native.
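
To make that concrete, here's a minimal sketch of runtime CPU-capability
dispatch, loosely following the pattern we already use to pick a CRC32C
implementation at runtime. All function names below are made up for
illustration and the per-ISA bodies are placeholders; this is not an actual
patch.

#include <stdint.h>

typedef uint16_t (*checksum_page_fn) (const char *page, uint32_t blkno);

/*
 * Placeholder per-ISA variants.  In a real patch these would be the
 * FNV-based checksum loop from checksum_impl.h, either built in separate
 * translation units with different -march flags or annotated with target
 * attributes as below.
 */
__attribute__((target("avx512f,avx512bw")))
static uint16_t
checksum_page_avx512(const char *page, uint32_t blkno)
{
	return 0;					/* placeholder */
}

__attribute__((target("avx2")))
static uint16_t
checksum_page_avx2(const char *page, uint32_t blkno)
{
	return 0;					/* placeholder */
}

static uint16_t
checksum_page_generic(const char *page, uint32_t blkno)
{
	return 0;					/* placeholder */
}

static uint16_t checksum_page_choose(const char *page, uint32_t blkno);

/* the first call runs the chooser, which installs the best implementation */
static checksum_page_fn checksum_page_impl = checksum_page_choose;

static uint16_t
checksum_page_choose(const char *page, uint32_t blkno)
{
	checksum_page_fn fn = checksum_page_generic;

	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx512bw"))
		fn = checksum_page_avx512;
	else if (__builtin_cpu_supports("avx2"))
		fn = checksum_page_avx2;

	checksum_page_impl = fn;
	return fn(page, blkno);
}

uint16_t
pg_checksum_page_dispatch(const char *page, uint32_t blkno)
{
	return checksum_page_impl(page, blkno);
}

The steady-state cost is one indirect call per page; __builtin_cpu_supports()
just reads a capability bitmap that libgcc initializes, and it's only
consulted once.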


I just realized that

a) The meson build doesn't use the relevant flags for bufpage.c - it didn't
   matter in my numbers though, because I was building with -O3 and
   -march=native.

   This clearly ought to be fixed.

b) Neither build uses the optimized flags for pg_checksums and pg_upgrade, both
   of which include checksum_impl.h directly.

   This probably should be fixed too - perhaps by building the relevant code
   once as part of fe_utils or such (sketched below)?
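
Something like the following could be enough: a single frontend translation
unit that includes checksum_impl.h once and gets built with the optimized
flags, so the frontend programs just link it. The file name
src/fe_utils/checksum.c is only a placeholder, mirroring what
src/backend/storage/page/checksum.c already does for the backend.

/*
 * Hypothetical src/fe_utils/checksum.c: the one frontend translation unit
 * that pulls in checksum_impl.h, compiled with the vectorization-friendly
 * flags; pg_checksums, pg_upgrade etc. would then call pg_checksum_page()
 * from libpgfeutils instead of re-including the header themselves.
 */
#include "postgres_fe.h"

#include "storage/checksum.h"		/* declares pg_checksum_page() */
#include "storage/checksum_impl.h"	/* provides the definition */

The direct #include of checksum_impl.h in those frontend programs would then
go away.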


It probably matters less than it used to - these days -O2 turns on
-ftree-loop-vectorize and -ftree-slp-vectorize. But loop unrolling isn't
enabled.

I do see a perf difference at -O2 between using/not using -funroll-loops.
Interestingly not at -O3, despite -funroll-loops not actually being enabled by
-O3. I think the relevant option that *is* turned on by -O3 is -fpeel-loops.

Here's a comparison of different flags, run on the 6442Y:

printf '%16s\t%32s\t%16s\t%s\n' march flags mem result; \
for mem in 100 100000; do \
  for march in x86-64 x86-64-v2 x86-64-v3 x86-64-v4 native; do \
    for flags in "-O2" "-O2 -funroll-loops" "-O3" "-O3 -funroll-loops"; do \
      printf "%16s\t%32s\t%16s\t" "$march" "$flags" "$mem"; \
      gcc $flags -march=$march -I ~/src/postgresql/src/include/ -I src/include/ \
          /tmp/bench-checksums.c -o bench-checksums-native && \
        numactl --physcpubind 3 --membind 0 ./bench-checksums-native 3000000 $mem; \
    done; \
  done; \
done
 
           march                               flags                 mem    result
          x86-64                                 -O2                 100    2280.86253ms @ 10.775 GB/s
          x86-64                  -O2 -funroll-loops                 100    2195.66942ms @ 11.193 GB/s
          x86-64                                 -O3                 100    2422.57588ms @ 10.145 GB/s
          x86-64                  -O3 -funroll-loops                 100    2243.75826ms @ 10.953 GB/s
       x86-64-v2                                 -O2                 100    1243.68063ms @ 19.761 GB/s
       x86-64-v2                  -O2 -funroll-loops                 100    979.67783ms @ 25.086 GB/s
       x86-64-v2                                 -O3                 100    988.80296ms @ 24.854 GB/s
       x86-64-v2                  -O3 -funroll-loops                 100    991.31632ms @ 24.791 GB/s
       x86-64-v3                                 -O2                 100    1146.90165ms @ 21.428 GB/s
       x86-64-v3                  -O2 -funroll-loops                 100    785.81395ms @ 31.275 GB/s
       x86-64-v3                                 -O3                 100    800.53627ms @ 30.699 GB/s
       x86-64-v3                  -O3 -funroll-loops                 100    790.21230ms @ 31.101 GB/s
       x86-64-v4                                 -O2                 100    883.82916ms @ 27.806 GB/s
       x86-64-v4                  -O2 -funroll-loops                 100    831.55372ms @ 29.554 GB/s
       x86-64-v4                                 -O3                 100    843.23141ms @ 29.145 GB/s
       x86-64-v4                  -O3 -funroll-loops                 100    821.19969ms @ 29.927 GB/s
          native                                 -O2                 100    1197.41357ms @ 20.524 GB/s
          native                  -O2 -funroll-loops                 100    718.05253ms @ 34.226 GB/s
          native                                 -O3                 100    747.94090ms @ 32.858 GB/s
          native                  -O3 -funroll-loops                 100    751.52379ms @ 32.702 GB/s
          x86-64                                 -O2              100000    2911.47087ms @ 8.441 GB/s
          x86-64                  -O2 -funroll-loops              100000    2525.45504ms @ 9.731 GB/s
          x86-64                                 -O3              100000    2497.42016ms @ 9.841 GB/s
          x86-64                  -O3 -funroll-loops              100000    2346.33551ms @ 10.474 GB/s
       x86-64-v2                                 -O2              100000    2124.10102ms @ 11.570 GB/s
       x86-64-v2                  -O2 -funroll-loops              100000    1819.09659ms @ 13.510 GB/s
       x86-64-v2                                 -O3              100000    1613.45823ms @ 15.232 GB/s
       x86-64-v2                  -O3 -funroll-loops              100000    1607.09245ms @ 15.292 GB/s
       x86-64-v3                                 -O2              100000    1972.89390ms @ 12.457 GB/s
       x86-64-v3                  -O2 -funroll-loops              100000    1432.58229ms @ 17.155 GB/s
       x86-64-v3                                 -O3              100000    1533.18003ms @ 16.029 GB/s
       x86-64-v3                  -O3 -funroll-loops              100000    1539.39779ms @ 15.965 GB/s
       x86-64-v4                                 -O2              100000    1591.96881ms @ 15.437 GB/s
       x86-64-v4                  -O2 -funroll-loops              100000    1434.91828ms @ 17.127 GB/s
       x86-64-v4                                 -O3              100000    1454.30133ms @ 16.899 GB/s
       x86-64-v4                  -O3 -funroll-loops              100000    1429.13733ms @ 17.196 GB/s
          native                                 -O2              100000    1980.53734ms @ 12.409 GB/s
          native                  -O2 -funroll-loops              100000    1373.95337ms @ 17.887 GB/s
          native                                 -O3              100000    1517.90164ms @ 16.191 GB/s
          native                  -O3 -funroll-loops              100000    1508.37021ms @ 16.293 GB/s



> > > Is it just that the calculation is slow, or is it the fact that checksumming
> > > needs to bring the page into the CPU cache. Did you notice any hints which
> > > might be the case?
> >
> > I don't think the issue is that checksumming pulls the data into CPU caches
> >
> > 1) This is visible with SELECT that actually uses the data
> >
> > 2) I added prefetching to avoid any meaningful amount of cache misses and it
> >    doesn't change the overall timing much
> >
> > 3) It's visible with buffered IO, which has pulled the data into CPU caches
> >    already
>
> I didn't yet check the code, when doing aio completions checksumming
> be running on the same core as is going to be using the page?

With io_uring normally yes, the exception being that another backend that
needs the same page could end up running the completion.

With worker mode normally no.

Greetings,

Andres Freund


