Re: NUMA shared memory interleaving - Mailing list pgsql-hackers

From:           Jakub Wartak
Subject:        Re: NUMA shared memory interleaving
Msg-id:         CAKZiRmxqebtjknDdvfJbBEscWNnP0KkTehVq_k51tWdRUpTe1w@mail.gmail.com
In response to: NUMA shared memory interleaving (Jakub Wartak <jakub.wartak@enterprisedb.com>)
Responses:      Re: NUMA shared memory interleaving
List:           pgsql-hackers

On Mon, Jun 30, 2025 at 9:23 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> I wasn't suggesting to do "numactl --interleave=all". My argument was
> simply that doing numa_interleave_memory() has most of the same issues,
> because it's oblivious to what's stored in the shared memory. Sure, the
> fact that local memory is not interleaved too is an improvement.

... and that's enough for me to start this ;)

> But I just don't see how this could be 0001, followed by some later
> improvements. ISTM the improvements would have to largely undo 0001
> first, and it would be nontrivial if an optimization needs to do that
> only for some part of the shared memory.

OK, maybe I'll back off a bit and wait for your ideas first. It seems
you are thinking about having multiple separate shared memory segments.

> I certainly agree it'd be good to improve the NUMA support, otherwise I
> wouldn't be messing with Andres' PoC patches myself.

Yup, cool, let's stick to that.

> > * I've raised this question in the first post "How to name this GUC
> > (numa or numa_shm_interleave) ?" I still have no idea, but `numa`,
> > simply looks better, and we could just add way more stuff to it over
> > time (in PG19 or future versions?). Does that sound good?
> >
> I'm not sure. In my WIP patch I have a bunch of numa_ GUCs, for
> different parts of the shared memory. But that's mostly for development,
> to allow easy experimentation. [..]
> I don't have a clear idea what UX should look like.

Later (after research/experiments), I could still imagine sticking to
one big `numa` switch like it is today in v4-0001, but maybe with 1-2
additional `really_advanced_numa=stuff` GUCs (not lots of them; I would
imagine e.g. that NUMA for analytics could be a different setup than
NUMA for OLTP -- AKA do we want to optimize for interconnect bandwidth
or for latency?).

> >> That's something numa_interleave_memory simply can't do for us, and I
> >> suppose it might also have other downsides on large instances. I mean,
> >> doesn't it have to create a separate mapping for each memory page?
> >> Wouldn't that be a bit inefficient/costly for big instances?
> >
> > No? Or what kind of mapping do you have in mind? I think our shared
> > memory on the kernel side is just a single VMA (contiguous memory
> > region), on which technically we execute mbind() (libnuma is just a
> > wrapper around it). I have not observed any kind of regressions,
> > actually quite the opposite. Not sure what you also mean by 'big
> > instances' (AFAIK 1-2TB shared_buffers might even fail to start).
> >
> Something as simple as giving a contiguous chunk to each NUMA node.

That would actually mean multiple separate VMAs/shared memory regions
(the main one, plus NUMA-specific ones per structure) and potentially -
speculating here - a slower fork()? Related: the only complaint I have
so far about memory allocated via mmap(MAP_SHARED|MAP_HUGETLB) with
NUMA is that if the per-node free huge page pool is too small
(especially with HP=on), allocations silently spill over to the other
nodes without interleaving and without any notification; you may hit
the same problem here unless the allocation is strict.

> Essentially 1/nodes goes to the first NUMA node, and so on. I haven't
> looked into the details of how NUMA interleaving works, but from the
> discussions I had about it, I understood it might be expensive. Not
> sure, maybe that's wrong.

I would really like to hear the argument for how NUMA interleaving is
expensive on the kernel side. It's literally bandwidth over latency.
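
Just to make the "single VMA + mbind()" point more concrete, the whole
thing boils down to roughly this (untested sketch with a made-up segment
size, not the actual v4-0001 code; numa_interleave_memory() is just a
wrapper around mbind(MPOL_INTERLEAVE)):

#include <sys/mman.h>
#include <numa.h>

int
main(void)
{
    size_t  shmem_size = (size_t) 4 << 30;   /* stand-in for the real segment size */
    void   *shmem = mmap(NULL, shmem_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS /* | MAP_HUGETLB */, -1, 0);

    if (shmem == MAP_FAILED || numa_available() < 0)
        return 1;

    /*
     * One call covering the whole range: the kernel attaches a single
     * MPOL_INTERLEAVE policy to the VMA, it does not build per-page
     * bookkeeping.  Pages are spread round-robin across the allowed
     * nodes as they are faulted in.
     */
    numa_interleave_memory(shmem, shmem_size, numa_all_nodes_ptr);

    return 0;
}

(compile with -lnuma)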
I can buy the argument that e.g. having a dedicated mmap(MAP_SHARED)
for ProcArray[] (assuming we are at HP=on/2MB), with a smaller page
size just for this struct (sizeof() =~ 832b? so let's assume even
wasting 4kB per entry), could better enable the kernel's NUMA
autobalancing to relocate the necessary pages closer to the active
processes (warning: I'm making lots of assumptions, I haven't really
checked the memory access patterns for this struct). No idea how bad
making the entries that big would be for the CPU caches either, in this
theoretical setup. But the easy counter-argument could be: smaller page
size = no HP available --> potentially making it __swap-able__ and
potentially causing worse dTLB hit rates? ... and we are just
discussing a single shared memory entry, and there are 73 of them :)

My take is that - as an opt-in - basic interleaving is safe, proven,
and gives *more* predictable latency than without it (of course, as you
mention, we could do better with dedicated allocations for specific
structures, but how do you know where the CPU scheduler puts stuff?). I
think we would need to limit ourselves to optimizing just the most
crucial (hot) stuff, like ProcArray[], and it probably doesn't make
sense to investigate structures like multixacts/subtransactions in this
attempt.

E.g. I could even imagine boosting the standby's NUMA-awareness too,
just by putting the most used memory (e.g. XLOG) on the same node that
is used by startup/recovery and walreceiver (via CPU pinning). Not sure
it is worth the effort in this attempt, though, and the problem would
be: what to do with those low-level/optimized allocations after
pg_promote() to primary? In theory this quickly escalates to calling
interleave on that struct again, so maybe let's put it aside.

To sum up, my problem is that the optimization possibilities are quite
endless, so we need to settle on something realistic, right?

> But the other reason for a simpler mapping is that it seems useful to be
> able to easily calculate which NUMA node a buffer belongs to. Because
> then you can do NUMA-aware freelists, clocksweep, etc.

Yay, sounds pretty advanced!

> +1 to collaboration, absolutely. I was actually planning to ping you
> once I have something workable. I hope I'll be able to polish the WIP
> patches a little bit and post them sometime this week.

Cool.

-J.
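
PS. To double-check that I understand the "simpler mapping" idea: with
a contiguous 1/nodes chunk of buffers per node, the buffer-to-node
lookup becomes trivial arithmetic, something like this (made-up names,
just a sketch of my understanding, not from any patch):

static inline int
BufferGetNode(int buf_id, int buffers_per_node)
{
    return buf_id / buffers_per_node;
}

... and at startup each chunk would get pinned to "its" node (e.g. via
numa_tonode_memory(chunk_start, chunk_len, node)), so the freelists and
clocksweep could be partitioned per node without any extra lookups.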