Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: Adding basic NUMA awareness |
| Date | |
| Msg-id | w2fqzrcwo6ofjy56e5pd7hsjdnlhc5tckgpsio77sqtgcylbvx@eeknrzc7o7ov |
| In response to | Re: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>) |
| List | pgsql-hackers |
Hi,

On 2026-01-15 00:26:47 +0100, Tomas Vondra wrote:
> D96 (v6):
>
>                Numa node
>   Numa node        0        1
>           0    129.9    129.9
>           1    128.3    128.1

I wonder if D96 has turned on memory interleaving... These numbers are so
close to each other that they're a tad hard to believe.

> HB176 (v4):
>
>                Numa node
>   Numa node        0        1        2        3
>           0    107.3    116.8    207.3    207.0
>           1    120.5    110.6    207.5    207.1
>           2    207.0    207.2    107.8    116.8
>           3    204.4    204.7    117.7    107.9
>
> I guess this confirms that D96 is mostly useless for evaluation of the
> NUMA patches. This is a single-socket machine, with one NUMA node per
> chiplet (I assume), and there's about no difference in latency.
>
> For HB176 there clearly seems to be a difference of ~90ns between the
> sockets, i.e. the latency about doubles in some cases. Each socket has
> two chiplets - and there the story is about the same as on D96.

It looks to me like within a socket there is a latency difference of about
10ns? Only when going between sockets is there no difference in which of
the remote nodes is accessed - which makes sense to me.

For newer single-node EPYC,
https://chipsandcheese.com/p/amds-epyc-9355p-inside-a-32-core has some
numbers for within-socket latencies. They also see about 10ns between
inside-socket-local and inside-socket-remote.
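FWIW, if mlc isn't handy on one of those machines, the matrices above can
be roughly reproduced with a dumb pointer chase over a buffer bound to one
node while executing on another. Untested sketch (needs libnuma, and the
buffer size / stride / iteration count are just guesses that would need
tuning per machine):

```c
/*
 * Crude node-to-node latency probe (untested sketch, not mlc): execute on
 * one node, bind a buffer to another, and time a random pointer chase so
 * that every load depends on the previous one.
 * Build with: cc -O2 probe.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_SIZE (512UL * 1024 * 1024)	/* large enough to defeat L3 */
#define STRIDE   128					/* bytes between chased pointers */
#define ITERS    (10UL * 1000 * 1000)

int
main(int argc, char **argv)
{
	size_t		nslots = BUF_SIZE / STRIDE;
	int			cpu_node, mem_node;
	char	   *buf, *p;
	size_t	   *perm;
	struct timespec t0, t1;
	double		ns;

	if (argc != 3 || numa_available() < 0)
	{
		fprintf(stderr, "usage: %s <cpu-node> <mem-node>\n", argv[0]);
		return 1;
	}
	cpu_node = atoi(argv[1]);
	mem_node = atoi(argv[2]);

	/* execute on cpu_node, allocate the buffer on mem_node */
	numa_run_on_node(cpu_node);
	buf = numa_alloc_onnode(BUF_SIZE, mem_node);
	if (buf == NULL)
	{
		fprintf(stderr, "allocation failed\n");
		return 1;
	}

	/* build a random cycle over all slots ... */
	perm = malloc(nslots * sizeof(size_t));
	for (size_t i = 0; i < nslots; i++)
		perm[i] = i;
	for (size_t i = nslots - 1; i > 0; i--)
	{
		size_t		j = random() % (i + 1);
		size_t		tmp = perm[i];

		perm[i] = perm[j];
		perm[j] = tmp;
	}

	/* ... and embed it as pointers, so the access order is unpredictable */
	for (size_t i = 0; i < nslots; i++)
		*(char **) (buf + perm[i] * STRIDE) =
			buf + perm[(i + 1) % nslots] * STRIDE;

	p = buf + perm[0] * STRIDE;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (size_t i = 0; i < ITERS; i++)
		p = *(char **) p;		/* dependent load, nothing to prefetch */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("cpu node %d -> mem node %d: %.1f ns/load (%p)\n",
		   cpu_node, mem_node, ns / ITERS, (void *) p);

	numa_free(buf, BUF_SIZE);
	free(perm);
	return 0;
}
```

Running it once per (cpu node, mem node) pair should show the same
local / remote-chiplet / remote-socket pattern as the matrices above.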
> I did this on my old-ish Xeon too, and it's somewhere in between. There
> clearly is difference between the sockets, but it's smaller than on
> HB176. Which matches with your observation that the latency is really
> increasing over time.

FWIW https://chipsandcheese.com/p/a-look-into-intel-xeon-6s-memory has some
numbers in the "NUMA/Chiplet Characteristics" section too. One aspect in it
caught my eye:

> Thus accesses to a remote NUMA node are only cached by the remote die’s
> L3. Accessing the L3 on an adjacent die increases latency by about 24
> ns. Crossing two die boundaries adds a similar penalty, increasing latency
> to nearly 80 ns for a L3 hit

Afaict that translates to an L3 hit consistently taking 80ns when accessing
remote memory - that's quite something.

> I doubt the interleaving mode is enabled. It clearly is not enabled on
> the HB176 machine (otherwise we wouldn't see the difference, I think),
> and the smaller instance can be explained by having a single socket.

As you say, there obviously is no interleaving on the HB176. I do wonder
about the D96, but ...

I wonder if the configuration is somehow visible in MSRs...

> The numbers are timings per query (avg latency reported by pgbench). I
> think this mostly aligns with the mlc results - the D96 shows no
> difference, while HB176 shows clear differences when memory/cpu get
> pinned to different sockets (but not chiplets in the same socket).

Yea, that makes sense.

> But there are some interesting details too, particularly when it comes
> to behavior of the two queries. The "offset" query is affected by
> latency even with no parallelism (max_parallel_workers_per_gather=0),
> and it shows ~30% hit for cross-socket runs. But for "agg" there's no
> difference in that case, and the hit is visible only with 4 or 8
> workers. That's interesting.

Huh, that *is* interesting. I guess the hardware prefetchers are good
enough to prefetch the tuple headers in this case, possibly because the
tuples are small and regular enough that the hardware prefetchers manage
to prefetch everything in time?

E.g. https://docs.amd.com/api/khub/documents/goX~9ubv8i5r60A_Qrp3Rw/content
documents the "L1 Stride Prefetcher" as:

> The prefetcher uses the L1 cache memory access history of individual
> instructions to fetch additional lines when each access is a constant
> distance from the previous.

Greetings,

Andres Freund