Re: Adding basic NUMA awareness - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Adding basic NUMA awareness
Msg-id 05df16f8-025a-43cd-9636-3194464012ed@vondra.me
In response to Re: Adding basic NUMA awareness  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
List pgsql-hackers
On 11/25/25 15:12, Jakub Wartak wrote:
> Hi Tomas!
> 
> [..]
>> Which I think is mostly the same thing you're saying, and you have the maps to support it.
> 
> Right, the thread is kind of long. You were right back then, but at
> least we now have a solid explanation with data.
> 
>> Here's an updated version of the patch series.
> 
> Just to double-confirm, I've used those (v20251121*) and they indeed
> interleaved parts of the shared memory.
> 
>> It fixes a bunch of issues in pg_buffercache_pages.c - duplicate attnums
>> and an incorrect array length.
> 
> You'll need to rebase again; pg_buffercache_numa got updated again on
> Monday and clashes with 0006.
> 

Rebased patch series attached.

>> The main change is in 0006 - it sets the default allocation policy for
>> shmem to interleaving, before doing the explicit partitioning for shared
>> buffers. It does this by calling numa_set_membind() before the mmap(),
>> and then numa_interleave_memory() on the allocated shmem. This is done
>> to allow using MAP_POPULATE - but that's commented out by default.
>>
>> This does seem to solve the SIGBUS failures for me. I still think there
>> might be a small chance of hitting that, because of locating an extra
>> "boundary" page on one of the nodes. But it should be solvable by
>> reserving a couple more pages.
> 
> I can confirm I never got any SIGBUS during the benchmarks described
> below, so it's much better now.
> 

Good!

>> Jakub, what do you think?
> 
> On one hand, not using MAP_POPULATE gives instant startup; on the
> other hand, using it gives much more predictable latencies, especially
> fresh after startup (this might matter to folks who like to benchmark
> -- us? -- but initially I just used it as a simple hack to touch
> memory). I would be wary of using MAP_POPULATE with s_b sized in
> hundreds of GBs: startup could take minutes, which would be terrible
> if someone hit a SIGSEGV in production and expected
> restart_after_crash=true to save them. I mean, a WAL redo crash would
> be terrible, but this would be terrible * 2. Also, longer-term with
> DIO we'll get much bigger s_b anyway (hopefully), so it would hurt
> even more. So I think that would be a bad path(?)
> 

I think MAP_POPULATE should be optional, enabled by a GUC.
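
For illustration, here's a minimal sketch of the allocation sequence
described above, assuming libnuma; the function name is hypothetical
and the "populate" flag stands in for such a GUC (this is not the
actual patch code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <numa.h>
    #include <sys/mman.h>

    static void *
    shmem_alloc_interleaved(size_t size, bool populate)
    {
        int   flags = MAP_SHARED | MAP_ANONYMOUS;
        void *ptr;

        if (populate)
            flags |= MAP_POPULATE;  /* pre-fault all pages at mmap() time */

        /* allow allocations from all nodes before the mmap() */
        numa_set_membind(numa_all_nodes_ptr);

        /* with populate, pages are faulted here under that policy */
        ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
        if (ptr == MAP_FAILED)
            return NULL;

        /* spread the not-yet-touched pages round-robin across nodes */
        numa_interleave_memory(ptr, size, numa_all_nodes_ptr);

        return ptr;
    }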

> I've benchmarked the thing in two scenarios (read-only pgbench with
> data < s_b size, across variations of code and connection counts, and
> a second one with concurrent seqscans) in solid, stable conditions:
> 4s32c64t == 4 NUMA nodes, 128GB RAM, 31GB shared_buffers, dbsize
> ~29GB, kernel 6.14.x, no idle CPU states, no turbo boost, and so on
> (literally a great home heater when it's -3C outside!)
> 
> The data shows master as the "100%" baseline for each huge pages (HP)
> setting, on and off (so each row shows the % diff vs. master with the
> same HP setting):
> 
> scenario I: pgbench -S
> 
>                  connections
> branch   HP      1       8       64      128     1024
> master   off     100.00% 100.00% 100.00% 100.00% 100.00%
> master   on      100.00% 100.00% 100.00% 100.00% 100.00%
> numa16   off     99.13%  100.46% 99.66%  99.44%  89.60%
> numa16   on      101.80% 100.89% 99.36%  99.89%  93.43%
> numa4    off     96.82%  100.61% 99.37%  99.92%  94.41%
> numa4    on      101.83% 100.61% 99.35%  99.69%  101.48%
> pgproc16 off     99.13%  100.84% 99.38%  99.85%  91.15%
> pgproc16 on      101.72% 101.40% 99.72%  100.14% 95.20%
> pgproc4  off     98.63%  101.44% 100.05% 100.14% 90.97%
> pgproc4  on      101.05% 101.46% 99.92%  100.31% 97.60%
> sweep16  off     99.53%  101.14% 100.71% 100.75% 101.52%
> sweep16  on      97.63%  102.49% 100.42% 100.75% 105.56%
> sweep4   off     99.43%  101.59% 100.06% 100.45% 104.63%
> sweep4   on      97.69%  101.59% 100.70% 100.69% 104.70%
> 
> I would consider everything within +/- 3% to be noise (technically
> each branch was a different compilation/ELF binary, as getting 4 vs.
> 16 partitions required changing a #define and recompiling; please see
> the attached script). I'm missing an explanation for why, without HP,
> it deteriorates so much at c=1024 with the patches.

I wouldn't expect a big difference for "pgbench -S". That workload has
so much other fairly expensive stuff (e.g. initializing index scans)
that the cost of buffer replacement is going to be fairly limited.

The regressions for the numa/pgproc patches with 1024 clients are
annoying, but how realistic is such a scenario? With 32/64 CPUs, having
1024 active connections is a substantial overload. If we can fix this,
great. But I think such a regression may be OK if we get benefits for
reasonable setups (with fewer clients).

I don't know why it's happening, though. I haven't been testing cases
with so many clients (compared to the number of CPUs).

> 
> scenario II: pgbench -f seqconcurrscans.pgb; 64 partitions from
> pgbench --partitions=64 -i -s 2000 [~29GB], hammered in modulo
> fashion, without parallel query, by:
>     \set num (:client_id % 8) + 1
>     select sum(octet_length(filler)) from pgbench_accounts_:num;
> 
>                  connections
> branch   HP      1       8       64      128
> master   off     100.00% 100.00% 100.00% 100.00%
> master   on      100.00% 100.00% 100.00% 100.00%
> numa16   off     115.62% 108.87% 101.08% 111.56%
> numa16   on      107.68% 104.90% 102.98% 105.51%
> numa4    off     113.55% 111.41% 101.45% 113.10%
> numa4    on      107.90% 106.60% 103.68% 106.98%
> pgproc16 off     111.70% 108.27% 98.69%  109.36%
> pgproc16 on      106.98% 100.69% 101.98% 103.42%
> pgproc4  off     112.41% 106.15% 100.03% 112.03%
> pgproc4  on      106.73% 105.77% 103.74% 101.13%
> sweep16  off     100.63% 100.38% 98.41%  103.46%
> sweep16  on      109.03% 99.15%  101.17% 99.19%
> sweep4   off     102.04% 101.16% 101.71% 91.86%
> sweep4   on      108.33% 101.69% 97.14%  100.92%
> 
> The benefit varies around +3-10% depending on connection count. Quite
> frankly, I was expecting a bit more, especially after re-reading [1].
> Maybe you preloaded it there using pg_prewarm? (Here I've warmed it
> randomly using pgbench.) It's probably something with my test; I'll
> take another look, hopefully soon. The good thing is that it never
> crashed, and I haven't seen any errors like "Bad address" (probably
> related to AIO) as you saw in [1]; perhaps that's because I wasn't
> using io_uring.
> 

Hmmm. I'd have expected better results for this workload. So I tried
re-running my seqscan benchmark on the 176-core instance, and I got this:

    clients   master   0001   0002   0003   0004   0005   0006   0007
    -----------------------------------------------------------------
     64           44     43     35     40     53     53     46     45
     96           55     54     42     47     57     58     53     53
    128           59     59     46     50     58     58     57     60

    clients   0001   0002   0003   0004   0005   0006   0007
    --------------------------------------------------------
     64        98%    79%    92%   122%   122%   105%   104%
     96        99%    76%    86%   104%   105%    97%    97%
    128        99%    77%    84%    98%    98%    97%   101%

I ran the benchmark for the individual parts of the patch series.
There's a clear (~20%) speedup for 0005, but 0006 and 0007 make it go
away. The 0002/0003 patches regress it quite a bit, and with 128
clients there's no improvement at all.

This was with the default number of partitions (i.e. 4). If I increase
the number to 16, I get this:

    clients   master   0001   0002   0003   0004   0005   0006   0007
    -----------------------------------------------------------------
     64           44     43     69     82     87     87     78     79
     96           55     54     65     85     91     91     86     86
    128           59     59     66     77     83     83     82     86

    clients   0001   0002   0003   0004   0005   0006   0007
    --------------------------------------------------------
     64        99%   158%   189%   199%   199%   180%   180%
     96       100%   119%   156%   167%   167%   157%   158%
    128        99%   112%   130%   140%   140%   139%   145%

And with 32 partitions, I get this:

    clients   master   0001   0002   0003   0004   0005   0006   0007
    -----------------------------------------------------------------
     64           44     44     88     91     90     90     84     84
     96           55     54     89     93     93     92     90     91
    128           59     59     85     84     86     85     88     87

    clients   0001   0002   0003   0004   0005   0006   0007
    --------------------------------------------------------
     64       100%   202%   208%   207%   207%   193%   193%
     96       100%   163%   169%   171%   168%   165%   166%
    128        99%   144%   142%   146%   144%   149%   146%

Those are clearly much better results, so I guess the default number of
partitions may be too low.

What bothers me is that this seems like a very narrow benchmark. I mean,
few systems are doing concurrent seqscans putting this much pressure on
buffer replacement. And once the plans start to do other stuff, the
contention on clock sweep seems to go down substantially (as shown by
the read-only pgbench). So the question is - is this really worth it?
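
(To make the contention point concrete, here's a toy sketch of what
partitioning the clock sweep means - not the actual patch code: each
partition keeps its own hand over a slice of the buffer pool, so
backends stop serializing on a single global atomic counter.)

    #include <stdatomic.h>
    #include <stdint.h>

    #define NUM_SWEEP_PARTITIONS 16     /* cf. the #define mentioned above */

    typedef struct
    {
        _Atomic uint64_t hand;          /* per-partition clock hand */
        uint32_t         first;         /* first buffer id of the slice */
        uint32_t         nbuffers;      /* number of buffers in the slice */
    } SweepPartition;

    static SweepPartition sweep_parts[NUM_SWEEP_PARTITIONS];

    /* Pick the next victim candidate from one partition's slice. */
    static uint32_t
    sweep_next(SweepPartition *p)
    {
        uint64_t ticks = atomic_fetch_add(&p->hand, 1);

        return p->first + (uint32_t) (ticks % p->nbuffers);
    }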

> 0007 (PROCs) still complains with "mbind: Invalid argument" (alignment issue)
> 

Should be fixed by the attached patches. The 0006 patch had an mbind
issue too, but it was visible only when shared buffers were not a nice
multiple of the memory page size (multiples of 1GB are fine).
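
(For context: mbind(2) returns EINVAL when the start address is not
page-aligned, so a fix is roughly of this shape - a sketch under that
assumption, not the actual patch code.)

    #include <numaif.h>
    #include <stdint.h>
    #include <unistd.h>

    /*
     * Round the region out to page boundaries before mbind(). Rounding
     * can extend the policy to neighboring bytes sharing the boundary
     * pages; memory policies always apply to whole pages anyway.
     */
    static long
    mbind_aligned(void *addr, size_t len, int mode,
                  const unsigned long *nodemask, unsigned long maxnode)
    {
        size_t    pagesz = (size_t) sysconf(_SC_PAGESIZE);
        uintptr_t start  = (uintptr_t) addr & ~(pagesz - 1);
        uintptr_t end    = ((uintptr_t) addr + len + pagesz - 1)
                           & ~(pagesz - 1);

        return mbind((void *) start, end - start, mode,
                     nodemask, maxnode, 0);
    }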

This also moves the memset() to after placing the PGPROC partitions on
different NUMA nodes.
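
(The ordering matters because pages are placed at first touch, per the
policy in effect at that moment. A sketch with hypothetical names, not
the actual patch code:)

    #include <numa.h>
    #include <string.h>

    static void
    place_and_zero(void *part, size_t part_size, int node)
    {
        /* set the placement policy first ... */
        numa_tonode_memory(part, part_size, node);
        /* ... then touch the pages, so they fault in on that node */
        memset(part, 0, part_size);
    }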

The results above are from v20251121. I'll rerun the tests with the new
version of the patches, but that can only change the 0006/0007 results,
of course; 0001-0005 are the same.


regards

-- 
Tomas Vondra
