Re: Adding basic NUMA awareness - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: Adding basic NUMA awareness
Date
Msg-id CAKZiRmzRFGtGcdVxROVipG2OmvLwsKbt6ivdgyuLsehykihfJA@mail.gmail.com
In response to Re: Adding basic NUMA awareness  (Tomas Vondra <tomas@vondra.me>)
Responses Re: Adding basic NUMA awareness
List pgsql-hackers
On Tue, Jan 13, 2026 at 2:13 AM Tomas Vondra <tomas@vondra.me> wrote:
>
> On 1/13/26 01:24, Andres Freund wrote:
> > Hi,
> >
> > On 2026-01-12 19:10:00 -0500, Andres Freund wrote:
> >> On 2026-01-13 00:58:49 +0100, Tomas Vondra wrote:
> >>> On 1/10/26 02:42, Andres Freund wrote:
> >>>> psql -Xq -c 'SELECT pg_buffercache_evict_all();' -c 'SELECT numa_node, sum(size) FROM pg_shmem_allocations_numa
> >>>> GROUP BY 1;' && perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write -a psql -c 'SELECT sum(abalance)
> >>>> FROM pgbench_accounts;'
> >>
> >>> And then I initialized pgbench with scale that is much larger than
> >>> shared buffers, but fits into RAM. So cached, but definitely > NB/4. And
> >>> then I ran
> >>>
> >>>   select * from pgbench_accounts offset 1000000000;
> >>>
> >>> which does a sequential scan with the circular buffer you mention above
> >>
> >> Did you try it with the query I suggested? One plausible reason why you did
> >> not see an effect with your query is that with a huge offset you actually
> >> never deform the tuple, which is an important and rather latency sensitive
> >> path.
> >
> > Btw, this doesn't need anywhere close to as much data, it should be visible as
> > soon as you're >> L3.
> >
> > To show why
> >   SELECT * FROM pgbench_accounts OFFSET 100000000
> > doesn't show an effect but
> >   SELECT sum(abalance) FROM pgbench_accounts;
> >
> > does, just look at the difference using the perf command I posted. Here on a
> > scale 200.
> >
>
> OK, I tried with smaller scale (and larger shared buffers, to make the
> data set smaller than NBuffers/4).
>
> On the azure VM (scale 200, 32GB sb), there's still no difference:
>
[..]
> and on xeon (scale 100, 8GB sb), there's a bit of a difference:
>
[..]
>
> So roughly 20%. There's also a bigger difference in the perf, about
> 5944.3 MB/s vs. 5202.3 MB/s.
>
> >
> > Interestingly I do see a performance difference, albeit a smaller one, even
> > with OFFSET. I see similar numbers on two different 2 socket machines.
> >
>
> I wonder how significant the number of sockets is. The Azure VM is a single
> socket with 2 NUMA nodes, so maybe the latency differences are not
> significant enough to affect this kind of test.
>
> The xeon is a 2-socket machine, but it's also older (~10y).

My $0.02 from intentionally having even slower hardware with even more
sockets, where some of these effects are even more visible (maybe this
helps the $thread):

There are two ways we could benefit on multi-socket NUMA systems:
a) potentially getting lower latency
b) avoiding the interconnect's maximum bandwidth limitation

- It started with an offline discussion yesterday: we had earlier seen some
gain in pgbench -S results, but I could not reproduce the gain from NUMAv2
[1]. I was trying to replicate this result: "I did some quick perf testing
on my old xeon machine (2 NUMA nodes), and the results are encouraging. For
a read-only pgbench (2x shared buffers, within RAM), I saw an increase from
1.1M tps to 1.3M." [1]. I was using numa_buffers_interleave=on,
numa_localalloc=on, numa_partition_freelist=node, numa_procs_interleave=on,
numa_procs_pin=off, no huge pages (due to the "Bad address" bug in that
patchset), two pgbench -i scales (2000, 500), with and without checksums,
and with/without -M prepared. More or less it always comes out like this:
  jul17__64__off_pgbench.log:tps = 564153.278183 (without initial connection time)
  numav2__64__off_pgbench.log:tps = 562068.655263 (without initial connection time)
That probably rules out a recently introduced bug in the patchset (but it
may also mean that the 1.1M->1.3M boost was something else?)
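
For reference, the shape of the runs I was comparing is roughly the
following (the pg_ctl invocation, client count and duration here are
illustrative rather than a verbatim copy of my scripts; the GUC names are
the ones from the patchset):

  # patched build, NUMA GUCs enabled, no huge pages (illustrative)
  pg_ctl -D $PGDATA -o "-c huge_pages=off \
    -c numa_buffers_interleave=on -c numa_localalloc=on \
    -c numa_partition_freelist=node -c numa_procs_interleave=on \
    -c numa_procs_pin=off" start
  pgbench -i -s 2000 postgres            # also tried -s 500
  pgbench -S -M prepared -c 64 -j 64 -T 300 postgres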

- So, per the above and in my opinion, classic OLTP pgbench (-S) is far too
CPU-heavy, even with -M prepared, to show NUMA latency effects, both on
master and with all the patchsets here. This affects *both* the NUMAv2 (Aug
2025) [1] and the Nov 2025 patchset versions. Even at 0.5M tps with pgbench
-i scale 500, the generated traffic stays far below the QPI (interconnect)
max link bandwidth (3.8-4.2 GB/s in my case, while pgbench -S can consume
at most ~1.2 GB/s, assuming proper interleaving).
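
Back-of-the-envelope: ~1.2 GB/s at ~0.5M tps is only about 2.4 kB of DRAM
traffic per transaction, so pgbench -S never gets anywhere near the link
limit. This is easy to watch while the run is in flight, using the same
per-socket metrics as in the perf command quoted at the top:

  # sample per-socket memory bandwidth for 10s while pgbench -S is running
  perf stat --per-socket -a -M memory_bandwidth_read,memory_bandwidth_write sleep 10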

- The problem with the single seqscan "SELECT sum(abalance) FROM
pgbench_accounts;" (or the lack of a problem, with a single session) is
that on stock master you may end up with the data on just a single NUMA
node. If the system is idle and runs just one instance of this query, I'm
getting something like ~750 MB/s of memory reads from the socket where the
data is located (again, much below the interconnect limit of 3.8-4.2 GB/s).
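
(The placement is easy to check after warming the table with one scan,
using the same view as in the command quoted at the top; 'Buffer Blocks'
is, as far as I can tell, the name of the buffer pool allocation there:)

  SELECT numa_node, pg_size_pretty(sum(size))
    FROM pg_shmem_allocations_numa
   WHERE name = 'Buffer Blocks'
   GROUP BY 1 ORDER BY 1;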

- So yes, when I set synchronize_seqscans=off and start throwing those
seqscans from *other* sockets [--cpubindnodes 1,2,3, while the relation is
hosted in node 0's DRAM], I can of course see a choke point on the
interconnect (it can feed data at ~4.2 GB/s, and even a bit more for short
periods, but that's it). That's problem 1 (optimization 1) that we can
realistically solve here, I think, with interleaving, and that has been
shown here multiple times: I simply get much more juice in the above
scenario, because interleaving gives access to better aggregate bandwidth
for seqscans like that, and lets me put more CPU cores (from nodes 1, 2, 3)
to work feeding the data from RAM on node #0. It also helps reduce latency
deviation under such conditions. To add to the fun, the Linux kernel likes
to place shm segments somewhat randomly (e.g. if a segment fits into a
single node's free hugepages it will be put there, or it ends up spanning
nodes, but not interleaved, so depending on the relation you get it here or
there). So far, so good as far as interleaving is concerned.
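
One crude way to force that topology from scratch - not exactly what I did
(I pinned only the scanning sessions), but it should reproduce the choke
point; numactl flags and session count are illustrative:

  # keep all server memory on node 0, run all server processes on nodes 1-3
  numactl --membind=0 --cpunodebind=1,2,3 pg_ctl -D $PGDATA -w start
  # fire a few concurrent scans and watch the interconnect with the perf
  # command quoted at the top of the thread
  for i in 1 2 3 4 5 6 7 8; do
    psql -Xq -c "SET synchronize_seqscans = off;" \
             -c "SELECT sum(abalance) FROM pgbench_accounts;" &
  done
  wait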

- However, the above may simply not hold on single-socket NUMA systems
(EPYC) or on more modern multi-socket (but still single-chassis) NUMA
systems - so EPYC again, with its wild ~500 GB/s interconnect? There the
interconnect may effectively not exist (single socket; BIOS/UEFI may have a
setting to expose a group of cores [CCD] as a NUMA node), or it has enough
bandwidth that there is no realistic chokepoint for us, so it wouldn't make
sense to benchmark on such hardware to spot these effects.
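
So before spending time on a given box it's worth checking what it actually
exposes, e.g.:

  numactl --hardware        # node count and (nominal) distances, as in [4]
  lscpu | grep -i numa      # NUMA nodes and their CPU lists
  mlc --latency_matrix      # real local vs. remote latencies, as in [3]
  mlc --bandwidth_matrix    # per-node-pair bandwidth / interconnect limit, as in [2]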

- So the remaining question is: what about the better locality/latency, say
~100 ns on local DRAM vs. 3x as much for remote [3][4] - how much can we
gain? (BTW, this 3x factor seems to be a pretty universal rule of thumb for
most hardware, see [5].) If just the shm is remote to the CPU running this
query, I'm getting:
remote: Time: 1925.820 ms
local:  Time: 1796.678 ms
so 1.071x.
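
(The mechanism to reproduce that comparison, roughly - the taskset CPU list
is machine-specific and this is a sketch rather than my exact scripts:)

  -- in a psql session:
  SELECT pg_backend_pid();
  SELECT numa_node, sum(size) FROM pg_shmem_allocations_numa GROUP BY 1;
  -- from a shell, pin that backend either to CPUs on the node holding the
  -- buffers ("local") or to CPUs on another node ("remote"), e.g.:
  --   taskset -cp 0-15 <backend_pid>
  \timing on
  SELECT sum(abalance) FROM pgbench_accounts;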

And in a different situation, where both the shm and the backend's heap
memory are remote to the CPU running the query, I was getting:
local:  Time: 9773.179 ms (00:09.773)
remote: Time: 12154.636 ms (00:12.155) +/- 300ms (and I can see more
cross-socket traffic).

So technically I should be getting this ~7%..22% benefit from lower latency
if I were fetching ONLY local memory (but with the NUMA patches we are not
doing that, right? We are interleaving, so we hit all sockets most of the
time to fetch data). So, thinking about benchmarking: maybe we would have
to use multiple pgbench_accounts_N tables (not just one, but one per
backend/CPU, pgbench_accounts_$CPU?), together larger than s_b to cause
natural evictions. That would make the partitioned clocksweep more likely
to keep the data on the local NUMA node and hot for some time, while
avoiding any access to the VFS cache (so stick to reading only from s_b)?
(And we could measure the local vs. remote access ratio along the way; a
sketch of such a setup is below.) Also, just for research, we should
disable clocksweep balancing. That should be enough to demonstrate that the
patch is working, right? (1 backend should then mostly be reading from 1
NUMA node, which should produce some gain and a high local access ratio,
and we could build up from there if necessary, rather than enabling all of
the v20251126* patches.)
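
A sketch of what I mean by that per-backend layout (table count, client
count and the :client_id trick are illustrative and untested):

  # one accounts copy per pgbench client (client_id counts from 0)
  for i in $(seq 0 63); do
    psql -Xq -c "CREATE TABLE pgbench_accounts_$i AS TABLE pgbench_accounts;"
  done
  # each client then scans only "its" table, hopefully staying on one node
  echo 'SELECT sum(abalance) FROM pgbench_accounts_:client_id;' > numa_local_seqscan.sql
  pgbench -n -f numa_local_seqscan.sql -c 64 -j 64 -T 300 postgres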

XXX: but an eviction would be followed by something like a pread() from the
VFS cache, and what if that VFS (page) cache is also on another node?

XXX: those BAS (buffer access strategy) ring buffers are a pain and yet
another thing to take care of; couldn't we have some debug GUC to turn them
off (debug_benchmarking=true)?

-J.

[1] https://www.postgresql.org/message-id/3223cdcd-6d16-4e90-a3a6-b957f762dc5a%40vondra.me

[2] mlc --bandwidth_matrix (sequential memory reads; with -U the numbers
are pretty close, ~90% of those below):
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0       1       2       3
       0        22677.4  3784.2  3720.6  4199.1
       1         3855.7 22463.1  4226.8  3732.0
       2         3713.7  4228.3 21816.6  3886.3
       3         4190.7  3692.9  3796.3 22673.0

[3] mlc --latency_matrix:
Measuring idle latencies (in ns)...
                Numa node
Numa node            0       1       2       3
       0          96.7   296.0   300.7   292.5
       1         289.7    96.9   289.7   303.8
       2         309.9   289.3    97.0   291.0
       3         300.0   303.7   296.5    97.1

[4] numactl --hardware lies a little bit (compare the distances below with the real latencies in [3]):
node distances:
node   0   1   2   3
  0:  10  20  30  20
  1:  20  10  20  30
  2:  30  20  10  20
  3:  20  30  20  10

[5] https://github.com/nviennot/core-to-core-latency?tab=readme-ov-file#dual-sockets-results


