Re: Adding basic NUMA awareness - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Adding basic NUMA awareness
Date
Msg-id clx4zzd7kau4vvh5ynu5ssxg3jqfqzurgcbtotytzgzkhb3nis@qfl5xwv44yad
In response to Re: Adding basic NUMA awareness  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
Hi,

On 2026-01-13 02:13:40 +0100, Tomas Vondra wrote:
> On the azure VM (scale 200, 32GB sb), there's still no difference:

One possibility is that the host is configured with memory interleaving. That
configures the memory so that physical memory addresses interleave between the
different NUMA nodes, instead of really being node-local. That can help avoid
bad performance characteristics for NUMA-naive applications.
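
To make the interleaving idea concrete, here is a toy model of it. The 4kB
granule and the plain round-robin mapping are assumptions chosen for
illustration; real memory controllers use their own granularity and address
hashing, and node_for_address() / node_share() are hypothetical helpers, not
any real API.

```python
from collections import Counter

GRANULE = 4096   # assumed interleave granularity in bytes (hypothetical)
N_NODES = 2      # the Azure VM discussed here exposes 2 NUMA nodes


def node_for_address(phys_addr, n_nodes=N_NODES, granule=GRANULE):
    """Toy round-robin mapping: consecutive granules rotate across nodes."""
    return (phys_addr // granule) % n_nodes


def node_share(start, length, step=GRANULE):
    """Fraction of a physically contiguous range that lands on each node."""
    counts = Counter(
        node_for_address(addr) for addr in range(start, start + length, step)
    )
    total = sum(counts.values())
    return {node: n / total for node, n in sorted(counts.items())}


# Share of a 1GB physically contiguous range backed by each node.
print(node_share(0, 1 << 30))
```

Under this model any large allocation is split evenly across the nodes, so
every CPU sees the same average latency and NUMA-aware placement cannot show a
benefit.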

I don't quite know how to figure that out though, particularly from within a
VM :(.  Even something like https://github.com/nviennot/core-to-core-latency
or Intel's mlc will not necessarily be helpful, because it depends on which
node the measured cacheline ends up on.

But I'd probably still test it, just to see whether you're observing very
different latencies between the systems.


> > Interestingly I do see a performance difference, albeit a smaller one, even
> > with OFFSET. I see similar numbers on two different 2 socket machines.
> >
>
> I wonder how significant is the number of sockets. The Azure is a single
> socket with 2 NUMA nodes, so maybe the latency differences are not
> significant enough to affect this kind of tests.

Ah, yes, a single socket machine might not show that much of an increase, at
least in simpler cases.  One of my workstations has two sockets, each with two
NUMA nodes. The latency difference between the local NUMA node and the other
NUMA node in the same socket is small, but the difference to the other socket
is ~1.5x.

Using intel's mlc:

Measuring idle latencies for sequential access (in ns)...
        Numa node
Numa node         0         1         2         3
       0      98.6     106.9     157.6     167.9
       1     105.8      99.4     158.4     170.5
       2     157.2     167.4     103.6     105.6
       3     158.4     171.2     104.5     104.3

So there's about a 2-10ns latency difference within a socket (0 vs 1, and
2 vs 3), but about a 50-60ns difference across sockets...
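
For anybody wanting to do the same comparison on their own mlc output, a
minimal sketch follows. The latencies are copied from the table above; the
node-to-socket mapping (0,1 on one socket, 2,3 on the other) is an assumption
that has to be adjusted per machine, and extra_latency() is just a
hypothetical helper.

```python
# Idle latencies (ns) from the mlc output above. Rows are the node of the
# accessing CPU, columns are the node the memory lives on.
LAT = [
    [98.6, 106.9, 157.6, 167.9],
    [105.8, 99.4, 158.4, 170.5],
    [157.2, 167.4, 103.6, 105.6],
    [158.4, 171.2, 104.5, 104.3],
]

# Assumption: nodes 0,1 share one socket and nodes 2,3 the other.
SOCKET = [0, 0, 1, 1]


def extra_latency(lat, socket):
    """Return (same_socket, cross_socket) lists of ns added over local access."""
    same, cross = [], []
    for src, row in enumerate(lat):
        local = row[src]
        for dst, ns in enumerate(row):
            if dst == src:
                continue
            delta = round(ns - local, 1)
            if socket[src] == socket[dst]:
                same.append(delta)
            else:
                cross.append(delta)
    return same, cross


same, cross = extra_latency(LAT, SOCKET)
print("same socket:  %.1f to %.1f ns" % (min(same), max(same)))
print("cross socket: %.1f to %.1f ns" % (min(cross), max(cross)))
```

With the numbers above, every same-socket delta stays below 10ns and every
cross-socket delta is above 50ns.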


> The xeon is a 2-socket machine, but it's also older (~10y).

It's perhaps worth noting that memory access latency has been *in*creasing in
the last generation or two of hardware...

Greetings,

Andres Freund


