Re: NUMA shared memory interleaving - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: NUMA shared memory interleaving
Date
Msg-id CAKZiRmxqebtjknDdvfJbBEscWNnP0KkTehVq_k51tWdRUpTe1w@mail.gmail.com
In response to NUMA shared memory interleaving  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
List pgsql-hackers
On Mon, Jun 30, 2025 at 9:23 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> I wasn't suggesting to do "numactl --interleave=all". My argument was
> simply that doing numa_interleave_memory() has most of the same issues,
> because it's oblivious to what's stored in the shared memory. Sure, the
> fact that local memory is not interleaved too is an improvement.

... and that's enough for me to start this ;)

> But I just don't see how this could be 0001, followed by some later
> improvements. ISTM the improvements would have to largely undo 0001
> first, and it would be nontrivial if an optimization needs to do that
> only for some part of the shared memory.

OK, maybe I'll back off a bit and see your ideas first. It sounds like
you are thinking about having multiple separate shared memory segments.

> I certainly agree it'd be good to improve the NUMA support, otherwise I
> wouldn't be messing with Andres' PoC patches myself.

Yup, cool, let's stick to that.

> > * I've raised this question in the first post "How to name this GUC
> > (numa or numa_shm_interleave) ?" I still have no idea, but `numa`,
> > simply looks better, and we could just add way more stuff to it over
> > time (in PG19 or future versions?). Does that sound good?
> >
>
> I'm not sure. In my WIP patch I have a bunch of numa_ GUCs, for
> different parts of the shared memory. But that's mostly for development,
> to allow easy experimentation.
[..]
> I don't have a clear idea what UX should look like.

Later (after research/experiments), I could still imagine sticking to
one big `numa` switch like today's v4-0001, but maybe with an
additional 1-2 `really_advanced_numa=stuff` GUCs (not lots of them; I
would imagine e.g. that NUMA for analytics could need a different
setup than NUMA for OLTP -- AKA do we want to optimize for
interconnect bandwidth or for latency?).

> >> That's something numa_interleave_memory simply can't do for us, and I
> >> suppose it might also have other downsides on large instances. I mean,
> >> doesn't it have to create a separate mapping for each memory page?
> >> Wouldn't that be a bit inefficient/costly for big instances?
> >
> > No? Or what kind of mapping do you have in mind? I think our shared
> > memory on the kernel side is just a single VMA (contiguous memory
> > region), on which technically we execute mbind() (libnuma is just a
> > wrapper around it). I have not observed any kind of regressions,
> > actually quite the opposite. Not sure what you also mean by 'big
> > instances' (AFAIK 1-2TB shared_buffers might even fail to start).
> >
>
> Something as simple as giving a contiguous chunk of memory to each NUMA node.

That would actually mean multiple separate VMAs/shared memory regions
(the main one, plus NUMA-specific ones per structure) and
potentially -- speculating here -- a slower fork()?
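
For reference, what v4-0001 effectively boils down to on the kernel
side is a single policy call over that one VMA; a rough sketch below
(shmem_base/shmem_size are placeholders, not the actual patch code),
and the per-structure idea would simply repeat this once per separate
mapping:

```
#include <numa.h>   /* numa_available(), numa_interleave_memory(); link with -lnuma */

/*
 * Sketch only: interleave one contiguous shared memory VMA across all
 * NUMA nodes.  libnuma translates this into a single mbind() with
 * MPOL_INTERLEAVE over the range -- the kernel keeps one policy for
 * the whole VMA, it does not build a per-page mapping.
 */
static void
interleave_whole_segment(void *shmem_base, size_t shmem_size)
{
    if (numa_available() == -1)
        return;                 /* no NUMA support, nothing to do */

    numa_interleave_memory(shmem_base, shmem_size, numa_all_nodes_ptr);
}
```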

Related: the only complaint I have so far about memory allocated via
mmap(MAP_SHARED|MAP_HUGETLB) with NUMA is that if the per-node free
huge page memory is too small (especially with HP=on), allocations
start to spill over to the other nodes without interleaving and
without any notification. You may have the same problem here unless
the allocation is strict.
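
We could at least detect that condition up front by checking the
per-node counters in sysfs before deciding to interleave; rough sketch
below (the helper names and the warning are mine, not anything in the
patch). libnuma also has numa_set_strict() to make node-specific
allocations fail rather than silently fall back, though I'm not sure
how much that helps for fault-time huge page spill-over.

```
#include <stdio.h>

/*
 * Sketch: free 2MB huge pages on a given node, read from the standard
 * per-node sysfs counter, or -1 if it cannot be read.
 */
static long
free_hugepages_on_node(int node)
{
    char    path[128];
    FILE   *f;
    long    n = -1;

    snprintf(path, sizeof(path),
             "/sys/devices/system/node/node%d/hugepages/hugepages-2048kB/free_hugepages",
             node);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &n) != 1)
        n = -1;
    fclose(f);
    return n;
}

/*
 * Sketch: warn if some node has fewer free huge pages than the slice
 * of the segment it is expected to back (size / nnodes), because the
 * kernel will silently satisfy those faults from other nodes instead.
 */
static void
warn_if_hugepages_unbalanced(size_t segment_size, int nnodes)
{
    size_t  pages_per_node = segment_size / nnodes / (2 * 1024 * 1024);

    for (int node = 0; node < nnodes; node++)
    {
        long    freepages = free_hugepages_on_node(node);

        if (freepages >= 0 && (size_t) freepages < pages_per_node)
            fprintf(stderr, "node %d: only %ld free 2MB huge pages, expected ~%zu\n",
                    node, freepages, pages_per_node);
    }
}
```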

> Essentially 1/nodes goes to the first NUMA node, and so on. I haven't
> looked into the details of how NUMA interleaving works, but from the
> discussions I had about it, I understood it might be expensive. Not
> sure, maybe that's wrong.

I would really like to hear the argument for how NUMA interleaving is
expensive on the kernel side. It's literally choosing bandwidth over
latency. I can buy the argument that e.g. having a dedicated
mmap(MAP_SHARED) just for ProcArray[] (assuming we are HP=on/2MB
elsewhere), with a smaller page size just for this struct (sizeof() =~
832b? so let's assume even wasting 4kB per entry), could better enable
the kernel's NUMA autobalancing to relocate the necessary pages closer
to the active processes (warning: I'm making lots of assumptions and
haven't really checked the memory access patterns for this struct). No
idea how bad blowing the struct up like that would be for the CPU
caches either. But the easy counterargument could be: smaller page
size = no HP available --> potentially making it __swap-able__ and
potentially causing worse dTLB hit rates? ... and we are discussing
just a single shared memory entry, and there are 73 of them :)
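
Just to make that ProcArray[] thought experiment concrete, a purely
hypothetical sketch (the struct choice, the sizes and the mlock() are
my assumptions): give the hot structure its own 4kB-page mapping and
set no NUMA policy on it, so the kernel's automatic NUMA balancing is
free to migrate individual pages toward whichever backends touch them:

```
#include <sys/mman.h>

/*
 * Hypothetical: map a hot structure (say ProcArray) into its own
 * 4kB-page shared mapping instead of the main huge page segment, and
 * deliberately set no NUMA policy on it, so automatic NUMA balancing
 * can migrate pages node-by-node toward the processes using them.
 */
static void *
map_hot_struct_separately(size_t size)
{
    void   *p = mmap(NULL, size,
                     PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS,    /* note: no MAP_HUGETLB */
                     -1, 0);

    if (p == MAP_FAILED)
        return NULL;

    /* 4kB pages are swappable; pinning them avoids that downside. */
    (void) mlock(p, size);

    return p;
}
```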

My take is that -- as an opt-in -- basic interleaving is safe, proven,
and gives *more* predictable latency than running without it (of
course, as you mention, we could do better with dedicated allocations
for specific structures, but how do you know where the CPU scheduler
puts things?). I think we would need to limit ourselves to optimizing
just the most crucial (hot) stuff, like ProcArray[]; it probably
doesn't make sense to investigate structures like
multixacts/subtransactions in this attempt.

E.g. I could even imagine boosting a standby's NUMA-awareness too,
just by putting the most used memory (e.g. XLOG) on the same node that
is used by startup/recovery and the walreceiver (via CPU pinning). Not
sure it's worth the effort in this attempt though, and the problem
would be: what to do with those low-level/optimized allocations after
pg_promote() to primary? In theory this quickly escalates to calling
interleave on that struct again, so maybe let's put it aside.
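
(If we ever did go there, it would probably boil down to two libnuma
calls; sketch below, where the node choice and applying it to the XLOG
buffers are just my assumptions:)

```
#include <numa.h>

/*
 * Hypothetical: on a standby, keep the WAL buffers and the processes
 * that touch them most (startup/recovery, walreceiver) on one node.
 * node, xlog_buffers and xlog_size are illustrative placeholders.
 */
static void
bind_wal_path_to_node(void *xlog_buffers, size_t xlog_size, int node)
{
    /* Allocate/move the buffer's pages onto the chosen node. */
    numa_tonode_memory(xlog_buffers, xlog_size, node);

    /* Restrict the calling process (e.g. startup) to that node's CPUs. */
    (void) numa_run_on_node(node);
}
```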

To sum up, my problem is that the optimization possibilities are
practically endless, so we need to settle on something realistic, right?

> But the other reason for a simpler mapping is that it seems useful to be
> able to easily calculate which NUMA node a buffer belongs to. Because
> then you can do NUMA-aware freelists, clocksweep, etc.

Yay, sounds pretty advanced!

> +1 to collaboration, absolutely. I was actually planning to ping you
> once I have something workable. I hope I'll be able to polish the WIP
> patches a little bit and post them sometime this week.

Cool.

-J.


