Re: NUMA shared memory interleaving - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: NUMA shared memory interleaving
Msg-id: 90a4f403-2ea3-4867-a714-3e59b09d6595@vondra.me
In response to: Re: NUMA shared memory interleaving (Jakub Wartak <jakub.wartak@enterprisedb.com>)
List: pgsql-hackers
On 7/1/25 11:04, Jakub Wartak wrote:
> On Mon, Jun 30, 2025 at 9:23 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> I wasn't suggesting to do "numactl --interleave=all". My argument was
>> simply that doing numa_interleave_memory() has most of the same
>> issues, because it's oblivious to what's stored in the shared memory.
>> Sure, the fact that local memory is not interleaved too is an
>> improvement.
>
> ... and that's enough for me to start this ;)
>
>> But I just don't see how this could be 0001, followed by some later
>> improvements. ISTM the improvements would have to largely undo 0001
>> first, and it would be nontrivial if an optimization needs to do that
>> only for some part of the shared memory.
>
> OK, maybe I'll back off a bit to see Your ideas first. It seems you
> are thinking about having multiple separate shared memory segments.
>
>> I certainly agree it'd be good to improve the NUMA support, otherwise
>> I wouldn't be messing with Andres' PoC patches myself.
>
> Yup, cool, let's stick to that.
>
>>> * I've raised this question in the first post "How to name this GUC
>>> (numa or numa_shm_interleave)?" I still have no idea, but `numa`
>>> simply looks better, and we could just add way more stuff to it over
>>> time (in PG19 or future versions?). Does that sound good?
>>>
>>
>> I'm not sure. In my WIP patch I have a bunch of numa_ GUCs, for
>> different parts of the shared memory. But that's mostly for
>> development, to allow easy experimentation.
> [..]
>> I don't have a clear idea what the UX should look like.
>
> Later (after research/experiments), I could still imagine sticking to
> one big `numa` switch like it's today in v4-0001, but maybe with 1-2
> additional `really_advanced_numa=stuff` GUCs (but not lots of them).
> I would imagine e.g. that NUMA for analytics could be a different
> setup than NUMA for OLTP -- AKA do we want to optimize for
> interconnect bandwidth or latency?
>

Maybe. I have no clear idea yet, but I'd like to keep the number of new
GUCs as low as possible.

>>>> That's something numa_interleave_memory simply can't do for us, and
>>>> I suppose it might also have other downsides on large instances. I
>>>> mean, doesn't it have to create a separate mapping for each memory
>>>> page? Wouldn't that be a bit inefficient/costly for big instances?
>>>
>>> No? Or what kind of mapping do you have in mind? I think our shared
>>> memory on the kernel side is just a single VMA (contiguous memory
>>> region), on which technically we execute mbind() (libnuma is just a
>>> wrapper around it). I have not observed any kind of regressions,
>>> actually quite the opposite. Not sure what you also mean by 'big
>>> instances' (AFAIK 1-2TB shared_buffers might even fail to start).
>>>
>>
>> Something as simple as giving a contiguous chunk of memory to each
>> NUMA node.
>
> That would actually be multiple separate VMAs/shared memory regions
> (main, plus structure-specific ones in the NUMA case) and
> potentially - speculating here - a slower fork()?
>

I may be confused about what you mean by VMA, but it certainly does not
require creating separate shared memory segments. Interleaving does not
require that either. You can move a certain range of memory to a
particular NUMA node, and that's it.

We may end up with separate shared memory segments for different parts
of shared memory (instead of a single segment like now), e.g. to
support dynamic shared_buffers resizing. But even then we'd have a
shared memory segment for each part, split between NUMA nodes.
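To illustrate what I mean, here's a rough, untested sketch (not the
actual patch -- the segment size is a placeholder and huge-page
alignment is ignored): a single mapping, i.e. a single VMA, split into
per-node chunks with plain libnuma calls, no extra segments needed.

#include <numa.h>
#include <sys/mman.h>
#include <stdio.h>

int
main(void)
{
    size_t  total = (size_t) 1024 * 1024 * 1024;    /* stand-in for the shared segment */
    size_t  chunk;
    char   *base;
    int     nodes;

    if (numa_available() < 0)
    {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    nodes = numa_num_configured_nodes();

    /* one big mapping, a single VMA, like our anonymous shared memory */
    base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }

    /* give a contiguous 1/nodes chunk to each node (alignment ignored) */
    chunk = total / nodes;
    for (int n = 0; n < nodes; n++)
        numa_tonode_memory(base + (size_t) n * chunk, chunk, n);

    /* the interleaving approach would instead do this for the whole range: */
    /* numa_interleave_memory(base, total, numa_all_nodes_ptr); */

    return 0;
}

The commented-out numa_interleave_memory() line is roughly the
interleaving approach discussed above; the loop is the "contiguous
chunk per node" variant, which also makes it trivial to compute which
node a given offset ended up on.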
Well, we'd probably want separate segments, because for some parts 2MB
pages are too coarse, and you can only use huge pages for the whole
segment.

> Related, the only complaint I have so far about memory allocated via
> mmap(MAP_SHARED|MAP_HUGETLB) with NUMA is that if the per-zone HP free
> memory is too small (especially with HP=on), it starts to spill over
> to the other nodes without interleaving and without notification; you
> may have the same problem here unless it is a strict allocation.
>
>> Essentially 1/nodes goes to the first NUMA node, and so on. I haven't
>> looked into the details of how NUMA interleaving works, but from the
>> discussions I had about it, I understood it might be expensive. Not
>> sure, maybe that's wrong.
>
> I would really like to hear the argument for how NUMA interleaving is
> expensive on the kernel side. It's literally bandwidth over latency.
> I can buy the argument that e.g. having a dedicated mmap(MAP_SHARED)
> for ProcArray[] (assuming we are HP=on/2MB), with a smaller page size
> just for this struct (sizeof() =~ 832b? so let's assume even wasting
> 4kB per entry), could better enable the kernel's NUMA autobalancing
> to relocate those pages closer to the active processes (warning: I'm
> making lots of assumptions, haven't really checked the memory access
> patterns for this struct). No idea how bad it would be for the CPU
> caches, though, by making it so big in this theoretical context. But
> the easy counter-argument could be: smaller page size = no HP
> available --> potentially making it __swappable__ and potentially
> causing worse dTLB hit rates? ... and we are just discussing a single
> shared memory entry and there are 73 :)
>

I admit I don't recall the details of why exactly interleaving would be
expensive on the kernel side. I've been told by smarter people that it
might be the case, but I don't remember the exact explanation. And
maybe it isn't measurably more expensive ... I've been focusing on the
aspect that it makes certain things more difficult, or even
impossible ...

> My take is that - as an opt-in - basic interleaving is safe, proven
> and gives *more* predictable latency than without it (of course, as
> You mention, we could do better with specific allocations for
> specific structures, but how do You know where the CPU scheduler puts
> stuff?). I think we would need to limit ourselves to optimizing just
> the most crucial (hot) stuff, like ProcArray[], and it probably
> doesn't make sense to investigate structures like
> multixacts/subtransactions in this attempt.
>

My argument is that if we allocate the structs "well" then we can do
something smart later, like pick a PGPROC placed on the current NUMA
node (at connection time), and possibly even pin it to that NUMA node
so that it doesn't get migrated. This means not just the PGPROC itself,
but also stuff like the fast-path locking arrays (which are stored
separately). And interleaving could easily place them on a different
NUMA node.
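To make the "something smart later" a bit more concrete, here's a
trivial (untested) standalone sketch, assuming libnuma and per-node
PGPROC pools set up elsewhere: at connection time the backend figures
out which node it's running on, so we could hand it a PGPROC from that
node's pool and maybe keep it there.

#define _GNU_SOURCE             /* for sched_getcpu() */
#include <sched.h>
#include <numa.h>
#include <stdio.h>

int
main(void)
{
    int     cpu;
    int     node;

    if (numa_available() < 0)
        return 1;

    cpu = sched_getcpu();           /* CPU this process was scheduled on */
    if (cpu < 0)
        return 1;

    node = numa_node_of_cpu(cpu);   /* NUMA node that CPU belongs to */

    printf("running on cpu %d, node %d\n", cpu, node);

    /*
     * The idea: pick a PGPROC (and its fast-path lock arrays) from the
     * pool allocated on "node", and optionally keep the backend there
     * so it doesn't get migrated away from its PGPROC.
     */
    numa_run_on_node(node);

    return 0;
}

Whether pinning with numa_run_on_node() is actually a good idea is a
separate question, of course -- it takes scheduling freedom away from
the kernel.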
> E.g. I could even imagine that we could boost the standby's
> NUMA-awareness too, just by putting the most used memory (e.g. XLOG)
> on the same node that is used by startup/recovery and walreceiver (by
> CPU pinning). Not sure it's worth the effort in this attempt, though,
> and the problem would be: what to do with those low-level/optimized
> allocations after pg_promote() to primary? In theory this quickly
> escalates to calling interleave on that struct again, so maybe let's
> put it aside.
>
> To sum up, my problem is that the optimization possibilities are
> quite endless, so we need to settle on something realistic, right?
>

Perhaps. But maybe we should explore the possibilities first, before
settling on something at the very beginning. To make an informed
decision we need to know what the costs/benefits are, and we need to
understand what the "advanced" solution might look like, so that we
don't pick a design that makes it impossible.

>> But the other reason for a simpler mapping is that it seems useful to
>> be able to easily calculate which NUMA node a buffer belongs to.
>> Because then you can do NUMA-aware freelists, clocksweep, etc.
>
> Yay, sounds pretty advanced!
>
>> +1 to collaboration, absolutely. I was actually planning to ping you
>> once I have something workable. I hope I'll be able to polish the WIP
>> patches a little bit and post them sometime this week.
>
> Cool.
>

regards

--
Tomas Vondra