Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: Adding basic NUMA awareness |
| Date | |
| Msg-id | 05df16f8-025a-43cd-9636-3194464012ed@vondra.me |
| In response to | Re: Adding basic NUMA awareness (Jakub Wartak <jakub.wartak@enterprisedb.com>) |
| List | pgsql-hackers |
On 11/25/25 15:12, Jakub Wartak wrote:
> Hi Tomas!
>
> [..]
>> Which I think is mostly the same thing you're saying, and you have the maps to support it.
>
> Right, the thread is kind of long - you were right back then, but at
> least now we've got a solid explanation with data.
>
>> Here's an updated version of the patch series.
>
> Just to double-confirm, I've used those patches (v20251121*) and they
> indeed interleaved parts of the shared memory.
>
>> It fixes a bunch of issues in pg_buffercache_pages.c - duplicate attnums
>> and an incorrect array length.
>
> You'll need to rebase again, pg_buffercache_numa got updated again on
> Monday and clashes with 0006.
>
Rebased patch series attached.
>> The main change is in 0006 - it sets the default allocation policy for
>> shmem to interleaving, before doing the explicit partitioning for shared
>> buffers. It does it by calling numa_set_membind before the mmap(), and
>> then numa_interleave_memory() on the allocated shmem. It does this to
>> allow using MAP_POPULATE - but that's commented out by default.
>>
>> This does seem to solve the SIGBUS failures for me. I still think there
>> might be a small chance of hitting that, because of locating an extra
>> "boundary" page on one of the nodes. But it should be solvable by
>> reserving a couple more pages.
>
> I can confirm - I never got any SIGBUS during the benchmarks described
> below, so it's much better now.
>
Good!
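BTW the allocation sequence in 0006 described above boils down to roughly
the following simplified sketch (not the actual patch code, error handling
omitted, build with -lnuma):

#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <numa.h>

/*
 * Simplified sketch: set the membind policy before mmap(), interleave the
 * mapped region afterwards, and optionally pre-fault it with MAP_POPULATE.
 */
static void *
shmem_alloc_interleaved(size_t size, bool populate)
{
    int flags = MAP_SHARED | MAP_ANONYMOUS;

    if (populate)
        flags |= MAP_POPULATE;  /* optional - commented out by default in 0006 */

    /* allow the kernel to place pages on any NUMA node */
    numa_set_membind(numa_all_nodes_ptr);

    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (ptr == MAP_FAILED)
        return NULL;

    /* default policy for the whole segment: interleave across all nodes */
    numa_interleave_memory(ptr, size, numa_all_nodes_ptr);

    return ptr;
}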
>> Jakub, what do you think?
>
> On one hand, not using MAP_POPULATE gives instant startup, but on the
> other, using it gives much more predictable latencies, especially right
> after startup (this might matter to folks who like to benchmark -- us?,
> but initially I just used it as a simple hack to touch memory). I would
> be wary of using MAP_POPULATE with s_b sized in hundreds of GBs - startup
> could take minutes, which would be terrible if someone hit SIGSEGV in
> production and expected restart_after_crash=true to save them. I mean, a
> WAL redo crash would be terrible, but that would be terrible * 2. Also,
> pretty long-term with DIO we'll get a much bigger s_b anyway (hopefully),
> so it would hurt even more - I think that would be a bad path(?)
>
I think MAP_POPULATE should be optional, enabled by a GUC.
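Something like a hypothetical bool GUC following the usual guc_tables.c
pattern - the name and grouping below are just placeholders, not taken from
the patches:

/* hypothetical GUC - name and grouping invented for illustration */
bool numa_populate_shmem = false;

/* sketch of an entry for the ConfigureNamesBool[] array in guc_tables.c */
{
    {"numa_populate_shmem", PGC_POSTMASTER, RESOURCES_MEM,
        gettext_noop("Pre-faults shared memory at server start using MAP_POPULATE."),
        NULL
    },
    &numa_populate_shmem,
    false,
    NULL, NULL, NULL
},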
> I've benchmarked the thing in two scenarios (read-only pgbench with the
> dataset < s_b size, across variations of code and connection counts, and
> a 2nd one with concurrent seqscans) in solid, stable conditions:
> 4s32c64t == 4 NUMA nodes, 128GB RAM, 31GB shared_buffers, dbsize ~29GB,
> kernel 6.14.x, no idle CPU states, no turbo boost, and so on - literally
> a great home heater when it's -3C outside!
>
> The data uses master as the 100% baseline for each HP (huge pages)
> setting, so every row shows the % diff from master with the same HP setting:
>
> scenario I: pgbench -S
>
>                           connections
> branch    HP        1        8       64      128     1024
> master    off  100.00%  100.00%  100.00%  100.00%  100.00%
> master    on   100.00%  100.00%  100.00%  100.00%  100.00%
> numa16    off   99.13%  100.46%   99.66%   99.44%   89.60%
> numa16    on   101.80%  100.89%   99.36%   99.89%   93.43%
> numa4     off   96.82%  100.61%   99.37%   99.92%   94.41%
> numa4     on   101.83%  100.61%   99.35%   99.69%  101.48%
> pgproc16  off   99.13%  100.84%   99.38%   99.85%   91.15%
> pgproc16  on   101.72%  101.40%   99.72%  100.14%   95.20%
> pgproc4   off   98.63%  101.44%  100.05%  100.14%   90.97%
> pgproc4   on   101.05%  101.46%   99.92%  100.31%   97.60%
> sweep16   off   99.53%  101.14%  100.71%  100.75%  101.52%
> sweep16   on    97.63%  102.49%  100.42%  100.75%  105.56%
> sweep4    off   99.43%  101.59%  100.06%  100.45%  104.63%
> sweep4    on    97.69%  101.59%  100.70%  100.69%  104.70%
>
> I would consider everything +/- 3% to be noise (technically each branch
> was a different compilation/ELF binary, since changing this #define to
> get 4 vs 16 required a rebuild; please see the attached script). I'm
> missing an explanation for why it deteriorates so much at c=1024 without
> HP with the patches applied.
I wouldn't expect a big difference for "pgbench -S". That workload does
so much other fairly expensive stuff (e.g. initializing index scans,
etc.) that the cost of buffer replacement is going to be fairly limited.
The regressions for the numa/pgproc patches with 1024 clients are annoying,
but how realistic is such a scenario? With 32/64 CPUs, having 1024 active
connections is a substantial overload. If we can fix this, great. But I
think such a regression may be OK if we get benefits for reasonable setups
(with fewer clients).
I don't know why it's happening, though. I haven't been testing cases
with that many clients (relative to the number of CPUs).
>
> scenario II: pgbench -f seqconcurrscans.pgb; 64 partitions from
> pgbench --partitions=64 -i -s 2000 [~29GB], hammered modulo the client
> id, without parallel query, by:
> \set num (:client_id % 8) + 1
> select sum(octet_length(filler)) from pgbench_accounts_:num;
>
>                           connections
> branch    HP        1        8       64      128
> master    off  100.00%  100.00%  100.00%  100.00%
> master    on   100.00%  100.00%  100.00%  100.00%
> numa16    off  115.62%  108.87%  101.08%  111.56%
> numa16    on   107.68%  104.90%  102.98%  105.51%
> numa4     off  113.55%  111.41%  101.45%  113.10%
> numa4     on   107.90%  106.60%  103.68%  106.98%
> pgproc16  off  111.70%  108.27%   98.69%  109.36%
> pgproc16  on   106.98%  100.69%  101.98%  103.42%
> pgproc4   off  112.41%  106.15%  100.03%  112.03%
> pgproc4   on   106.73%  105.77%  103.74%  101.13%
> sweep16   off  100.63%  100.38%   98.41%  103.46%
> sweep16   on   109.03%   99.15%  101.17%   99.19%
> sweep4    off  102.04%  101.16%  101.71%   91.86%
> sweep4    on   108.33%  101.69%   97.14%  100.92%
>
> The benefit varies, roughly +3-10% depending on connection count.
> Quite frankly I was expecting a little bit more, especially after
> re-reading [1]. Maybe you preloaded the data there using pg_prewarm?
> (Here I warmed it up randomly using pgbench.) It's probably something
> with my test; I'll take another look, hopefully soon. The good thing is
> that it never crashed, and I haven't seen any errors like "Bad address"
> (probably related to AIO) as you saw in [1] - perhaps I wasn't using
> io_uring.
>
Hmmm. I'd have expected better results for this workload. So I tried
re-running my seqscan benchmark on the 176-core instance, and I got this:
 clients  master  0001  0002  0003  0004  0005  0006  0007
 ----------------------------------------------------------
      64      44    43    35    40    53    53    46    45
      96      55    54    42    47    57    58    53    53
     128      59    59    46    50    58    58    57    60

And the same results, relative to master at the same client count:

 clients  0001  0002  0003  0004  0005  0006  0007
 ---------------------------------------------------
      64   98%   79%   92%  122%  122%  105%  104%
      96   99%   76%   86%  104%  105%   97%   97%
     128   99%   77%   84%   98%   98%   97%  101%
I did the benchmark for individual parts of the patch series. There's a
clear (~20%) speedup for 0005, but 0006 and 0007 make it go away. The
0002/0003 regress it quite a bit. And with 128 clients there's no
improvement at all.
This was with the default number of partitions (i.e. 4). If I increase
the number to 16, I get this:
 clients  master  0001  0002  0003  0004  0005  0006  0007
 ----------------------------------------------------------
      64      44    43    69    82    87    87    78    79
      96      55    54    65    85    91    91    86    86
     128      59    59    66    77    83    83    82    86

 clients  0001  0002  0003  0004  0005  0006  0007
 ---------------------------------------------------
      64   99%  158%  189%  199%  199%  180%  180%
      96  100%  119%  156%  167%  167%  157%  158%
     128   99%  112%  130%  140%  140%  139%  145%
And with 32 partitions, I get this:
 clients  master  0001  0002  0003  0004  0005  0006  0007
 ----------------------------------------------------------
      64      44    44    88    91    90    90    84    84
      96      55    54    89    93    93    92    90    91
     128      59    59    85    84    86    85    88    87

 clients  0001  0002  0003  0004  0005  0006  0007
 ---------------------------------------------------
      64  100%  202%  208%  207%  207%  193%  193%
      96  100%  163%  169%  171%  168%  165%  166%
     128   99%  144%  142%  146%  144%  149%  146%
Those are clearly much better results, so I guess the default number of
partitions may be too low.
What bothers me is that this seems like a very narrow benchmark. I mean,
few systems are doing concurrent seqscans putting this much pressure on
buffer replacement. And once the plans start to do other stuff, the
contention on clock sweep seems to go down substantially (as shown by
the read-only pgbench). So the question is - is this really worth it?
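To illustrate what the partitioning is about, here's a heavily simplified
conceptual sketch (invented names, not the actual patch code): each
partition gets its own clock hand, so buffer replacement stops serializing
on a single shared atomic counter.

#include <stdatomic.h>
#include <stdint.h>

#define NUM_SWEEP_PARTITIONS 16     /* the patches currently default to 4 */

typedef struct SweepPartition
{
    _Atomic uint64_t next_victim;   /* per-partition clock hand */
    uint32_t         first_buffer;  /* first buffer id in this partition */
    uint32_t         num_buffers;   /* buffers owned by this partition */
} SweepPartition;

static SweepPartition sweep_parts[NUM_SWEEP_PARTITIONS];

/*
 * Return the next victim candidate from the backend's "home" partition.
 * A real implementation also has to handle usage counts, pinned buffers,
 * and falling back to other partitions when the home one is exhausted.
 */
static uint32_t
clock_sweep_next(int my_partition)
{
    SweepPartition *p = &sweep_parts[my_partition];
    uint64_t ticket = atomic_fetch_add(&p->next_victim, 1);

    return p->first_buffer + (uint32_t) (ticket % p->num_buffers);
}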
> 0007 (PROCs) still complains with "mbind: Invalid argument" (alignment issue)
>
Should be fixed by the attached patches. The 0006 patch has an issue with
mbind too, but it was only visible when the buffers were not a nice
multiple of memory pages (multiples of 1GB are fine).
This also moves the memset() to after the PGPROC partitions have been
placed on different NUMA nodes.
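For reviewers, the gist of that error (simplified sketch of my reading, not
the actual patch code): mbind() rejects ranges that aren't aligned to the
underlying page size with EINVAL, so the range has to be rounded out to
page boundaries first. The helper name is made up.

#include <numaif.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Sketch only: bind [start, start+len) to a single NUMA node, after
 * rounding the range out to page boundaries, since mbind() rejects
 * unaligned addresses with EINVAL. pagesize must be the size backing the
 * mapping (4kB pages or the huge page size).
 */
static long
mbind_to_node(void *start, size_t len, int node, size_t pagesize)
{
    uintptr_t     addr = (uintptr_t) start;
    uintptr_t     aligned_start = addr & ~(pagesize - 1);
    size_t        aligned_len = ((addr + len + pagesize - 1) & ~(pagesize - 1))
                                - aligned_start;
    unsigned long nodemask = 1UL << node;

    /* maxnode is the number of bits in the node mask */
    return mbind((void *) aligned_start, aligned_len, MPOL_BIND,
                 &nodemask, sizeof(nodemask) * 8, 0);
}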
The results above are from v20251121. I'll rerun the tests with the new
version of the patches. But that can only change the 0006/0007 results, of
course. The 0001-0005 results are the same.
regards
--
Tomas Vondra
Attachment
- v20251126-0009-mbind-procs.patch
- v20251126-0008-NUMA-partition-PGPROC.patch
- v20251126-0007-mbind-buffers.patch
- v20251126-0006-NUMA-shared-buffers-partitioning.patch
- v20251126-0005-clock-sweep-weighted-balancing.patch
- v20251126-0004-clock-sweep-scan-all-partitions.patch
- v20251126-0003-clock-sweep-balancing-of-allocations.patch
- v20251126-0002-clock-sweep-basic-partitioning.patch
- v20251126-0001-Infrastructure-for-partitioning-shared-buf.patch