Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
| From | Jakub Wartak |
|---|---|
| Subject | Re: Adding basic NUMA awareness |
| Date | |
| Msg-id | CAKZiRmzRFGtGcdVxROVipG2OmvLwsKbt6ivdgyuLsehykihfJA@mail.gmail.com |
| In response to | Re: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>) |
| Responses | Re: Adding basic NUMA awareness |
| List | pgsql-hackers |
On Tue, Jan 13, 2026 at 2:13 AM Tomas Vondra <tomas@vondra.me> wrote:
>
> On 1/13/26 01:24, Andres Freund wrote:
> > Hi,
> >
> > On 2026-01-12 19:10:00 -0500, Andres Freund wrote:
> >> On 2026-01-13 00:58:49 +0100, Tomas Vondra wrote:
> >>> On 1/10/26 02:42, Andres Freund wrote:
> >>>> psql -Xq -c 'SELECT pg_buffercache_evict_all();' -c 'SELECT numa_node, sum(size) FROM pg_shmem_allocations_numa GROUP BY 1;' && perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write -a psql -c 'SELECT sum(abalance) FROM pgbench_accounts;'
> >>
> >>> And then I initialized pgbench with scale that is much larger than
> >>> shared buffers, but fits into RAM. So cached, but definitely > NB/4. And
> >>> then I ran
> >>>
> >>> select * from pgbench_accounts offset 1000000000;
> >>>
> >>> which does a sequential scan with the circular buffer you mention above
> >>
> >> Did you try it with the query I suggested? One plausible reason why you did
> >> not see an effect with your query is that with a huge offset you actually
> >> never deform the tuple, which is an important and rather latency sensitive
> >> path.
> >
> > Btw, this doesn't need anywhere close to as much data, it should be visible as
> > soon as you're >> L3.
> >
> > To show why
> > SELECT * FROM pgbench_accounts OFFSET 100000000
> > doesn't show an effect but
> > SELECT sum(abalance) FROM pgbench_accounts;
> >
> > does, just look at the difference using the perf command I posted. Here on a
> > scale 200.
> >
>
> OK, I tried with smaller scale (and larger shared buffers, to make the
> data set smaller than NBuffers/4).
>
> On the azure VM (scale 200, 32GB sb), there's still no difference:
> [..]
> and on xeon (scale 100, 8GB sb), there's a bit of a difference:
> [..]
>
> So roughly 20%. There's also a bigger difference in the perf, about
> 5944.3 MB/s vs. 5202.3 MB/s.
>
> >
> > Interestingly I do see a performance difference, albeit a smaller one, even
> > with OFFSET. I see similar numbers on two different 2 socket machines.
> >
>
> I wonder how significant is the number of sockets. The Azure is a single
> socket with 2 NUMA nodes, so maybe the latency differences are not
> significant enough to affect this kind of tests.
>
> The xeon is a 2-socket machine, but it's also older (~10y).

My $0.02 from intentionally running on even slower hardware with even more sockets, where some of these effects are more visible (maybe this helps the $thread). There are two ways we could benefit on NUMA multi-socket systems: a) potentially getting lower latency, and b) avoiding the interconnect's maximum bandwidth limitation.

- It started with an offline discussion yesterday: earlier we had seen some gain in pgbench -S results, but I could not reproduce the gain reported for NUMAv2 in [1]. I was trying to replicate this result: "I did some quick perf testing on my old xeon machine (2 NUMA nodes), and the results are encouraging. For a read-only pgbench (2x shared buffers, within RAM), I saw an increase from 1.1M tps to 1.3M." [1]. I was using numa_buffers_interleave=on, numa_localalloc=on, numa_partition_freelist=node, numa_procs_interleave=on, numa_procs_pin=off, no huge pages (due to the "Bad address" bug in that patchset), two pgbench -i scales (2000 and 500), with and without checksums, and with/without -M prepared.
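(For context, the read-only runs were of roughly this shape - just a sketch, the scale, client count, duration and database name below are illustrative rather than my exact invocations; the patchset GUCs listed above were set in postgresql.conf:)

    # hedged sketch of the kind of pgbench -S run described above
    pgbench -i -s 500 bench
    pgbench -n -S -M prepared -c 64 -j 64 -T 120 bench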
More or less it's always like that:

    jul17__64__off_pgbench.log:tps = 564153.278183 (without initial connection time)
    numav2__64__off_pgbench.log:tps = 562068.655263 (without initial connection time)

so that probably rules out a recently introduced bug in the patchset (but it may also mean that the 1.1M->1.3M boost came from something else?).

- So, per the above and in my opinion, classic OLTP pgbench (-S) is way too CPU-heavy, even with -M prepared, to show NUMA latency effects, both on master and with all the patchsets here; this affects *both* the NUMAv2 (Aug 2025) [1] and the Nov 2025 patchset versions. Even at 0.5M tps on a pgbench -i scale of 500, the generated memory traffic stays far below the QPI (interconnect) max link bandwidth (3.8-4.2 GB/s in my case [2], while pgbench -S consumes at most ~1.2 GB/s assuming proper interleaving).

- The problem with the single-session seqscan "SELECT sum(abalance) FROM pgbench_accounts;" (or rather the lack of a problem) is that on stock master you may end up with all the data on a single NUMA node. On an otherwise idle system running just one instance of this query, I see something like ~750 MB/s of memory reads from the socket where the data is located (again, far below the interconnect limit of 3.8-4.2 GB/s).

- So yes, when I set synchronize_seqscans=off and start throwing those seqscans at the relation from *other* sockets (numactl --cpunodebind=1,2,3, while the relation sits in node 0's DRAM), I can of course see the interconnect become the choke point (it can feed data at ~4.2 GB/s, and a bit more for short periods, but that's it). That is problem 1 (optimization 1) we can realistically solve here, I think, with interleaving, and it has been shown here multiple times: in that scenario I simply get much more juice, because interleaving gives access to better aggregate bandwidth for such seqscans and lets me put more CPU cores (on nodes 1-3) to work instead of feeding everything from node 0's RAM. It also reduces the latency deviation under such conditions. To add to the fun, the Linux kernel likes to place the shm segment somewhat randomly (e.g. if it fits into a single node's free hugepages it will be put there, or it ends up spanning nodes, but not interleaved, so depending on the relation the data is here or there). So far, so good as far as interleaving is concerned.

- However, the above may simply not hold on single-socket NUMA systems (EPYC) or on more modern multi-socket (but still single-chassis) NUMA systems - EPYC again, with its wild ~500 GB/s interconnect? There the interconnect may effectively not exist (on a single socket, the BIOS/UEFI may merely expose a group of cores [a CCD] as a NUMA node), or it may have so much bandwidth that there is no realistic choke point for us, so it would not make sense to benchmark on such hardware to spot these effects.

- So the remaining question is: what about better locality/latency, say ~100 ns on local DRAM vs. roughly 3x that for remote access [3][4] - how much can we gain? (BTW this 3x factor seems to be a pretty universal rule of thumb for most hardware, see [5].) If only shm is remote to the CPU, for this query I am getting:

    remote: Time: 1925.820 ms
    local:  Time: 1796.678 ms

so 1.071x. And in a somewhat different situation, where both shm and the process's heap memory are remote to the CPU, I was getting:

    local:  Time: 9773.179 ms (00:09.773)
    remote: Time: 12154.636 ms (00:12.155)

+/- 300 ms (and I can see more traffic on the sockets).
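(For reference, one way to take local vs. remote timings like the above is to pin the whole cluster's memory to one node and move only its CPUs between runs - a rough sketch, not necessarily what I did; node numbers, $PGDATA and log names are just examples, and the numactl policy is inherited by the forked backends:)

    # "local" run: CPUs and memory both on node 0
    numactl --cpunodebind=0 --membind=0 pg_ctl -D "$PGDATA" -l local.log start
    psql -Xqc 'SELECT sum(abalance) FROM pgbench_accounts;'
    pg_ctl -D "$PGDATA" stop

    # "remote" run: memory still forced onto node 0, CPUs moved to node 1
    numactl --cpunodebind=1 --membind=0 pg_ctl -D "$PGDATA" -l remote.log start
    psql -Xqc 'SELECT sum(abalance) FROM pgbench_accounts;'
    pg_ctl -D "$PGDATA" stop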
So technically I should be getting this ~7%..22% gain from lower latency if I were fetching ONLY local memory (but with NUMA interleaving we are not doing that, right? We interleave, so we hit all the sockets most of the time to fetch data). So, thinking about benchmarking: maybe we would have to use multiple pgbench_accounts_N tables (not just one, but one per backend/CPU, pgbench_accounts_$CPU?), together larger than s_b so as to cause natural evictions; the partitioned clocksweep would then make it more likely that each backend's data is on the local NUMA node and stays hot for some time, while avoiding any access to the VFS cache (so stick to reading only from s_b)? (And we could measure the local vs. remote access ratio along the way; see the sketch below the footnotes.) Also, just for this research, we should disable clocksweep balancing. It should be enough to demonstrate that the patch is working, right? (So one backend should mostly be reading from just one NUMA node, that should produce some gain and a high local-access ratio, and we could build up from there if necessary, rather than enabling all of the v20251126* patches.)

XXX: but an eviction would be followed by something like a pread() from the VFS cache, and what if that VFS cache is also on another node?

XXX: those BAS strategies are just a pain and yet another thing to care about; couldn't we have some debug GUC to turn them off (debug_benchmarking=true)?

-J.

[1] - https://www.postgresql.org/message-id/3223cdcd-6d16-4e90-a3a6-b957f762dc5a%40vondra.me

[2] - mlc --bandwidth_matrix: // that's sequential memory reads, but with -U it is pretty close (90%) to the numbers below

    Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
    Using all the threads from each core if Hyper-threading is enabled
    Using Read-only traffic type
                    Numa node
    Numa node        0        1        2        3
        0        22677.4   3784.2   3720.6   4199.1
        1         3855.7  22463.1   4226.8   3732.0
        2         3713.7   4228.3  21816.6   3886.3
        3         4190.7   3692.9   3796.3  22673.0

[3] - mlc --latency_matrix:

    Measuring idle latencies (in ns)...
                    Numa node
    Numa node        0        1        2        3
        0           96.7    296.0    300.7    292.5
        1          289.7     96.9    289.7    303.8
        2          309.9    289.3     97.0    291.0
        3          300.0    303.7    296.5     97.1

[4] - numactl --hardware lies a little bit (real latency vs. the numbers below):

    node distances:
    node   0   1   2   3
      0:  10  20  30  20
      1:  20  10  20  30
      2:  30  20  10  20
      3:  20  30  20  10

[5] - https://github.com/nviennot/core-to-core-latency?tab=readme-ov-file#dual-sockets-results
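PS. A rough sketch of the per-backend-table idea from above (all names here - the pgbench_accounts_N copies, local_scan.sql, the client count - are made up for illustration; pgbench's :client_id variable is 0-based):

    # one private copy of pgbench_accounts per pgbench client; together
    # they should be larger than shared_buffers to force natural evictions
    CLIENTS=8
    for i in $(seq 0 $((CLIENTS - 1))); do
        psql -Xqc "CREATE TABLE pgbench_accounts_$i AS SELECT * FROM pgbench_accounts;"
    done

    # each client scans only "its" table, so with a partitioned clocksweep
    # that data should mostly be allocated on (and re-read from) the local node
    echo 'SELECT sum(abalance) FROM pgbench_accounts_:client_id;' > local_scan.sql
    pgbench -n -f local_scan.sql -c "$CLIENTS" -j "$CLIENTS" -T 120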