
From Jakub Wartak
Subject Re: Adding basic NUMA awareness
Date
Msg-id CAKZiRmy7VW5b0OKwHXtRSudCBFB48-kYnOEXeqrLfbTQ=5qQrA@mail.gmail.com
In response to Adding basic NUMA awareness  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:

Hi!

> 1) v1-0001-NUMA-interleaving-buffers.patch
[..]
> It's a bit more complicated, because the patch distributes both the
> blocks and descriptors, in the same way. So a buffer and it's descriptor
> always end on the same NUMA node. This is one of the reasons why we need
> to map larger chunks, because NUMA works on page granularity, and the
> descriptors are tiny - many fit on a memory page.

Oh, now I get it! OK, let's stick to this one.
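
(Spelling out the arithmetic for myself, assuming the usual padded 64-byte
BufferDesc: one 4 kB page holds 64 descriptors, i.e. 64 x 8 kB = 512 kB of
buffers, and one 2 MB huge page holds 32768 descriptors, i.e. 256 MB of
buffers -- so the mapping indeed has to be carved up in chunks of that size,
not per buffer.)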

> I don't think the splitting would actually make some things simpler, or
> maybe more flexible - in particular, it'd allow us to enable huge pages
> only for some regions (like shared buffers), and keep the small pages
> e.g. for PGPROC. So that'd be good.

You have made the assumption that this is good, but small pages (4 kB) are
not hugetlb and they are *swappable* (transparent huge pages are swappable
too; manually allocated ones, as with mmap(MAP_HUGETLB), are not) [1]. The
most frequent problem I see these days is OOMs, and it makes me believe
that making certain critical parts of shared memory swappable just to get
page-size granularity is possibly throwing the baby out with the bathwater.
I'm thinking about bad situations like: wrong vm.swappiness settings that
people (or distros?) keep, and PG's general inability to refrain from
allocating more memory in some cases.
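
To make the distinction concrete, the difference is basically just one
mmap() flag (a minimal sketch, hypothetical helper name, not code from
either patch):

    #include <stdbool.h>
    #include <stddef.h>
    #include <sys/mman.h>

    /* sketch: same anonymous shared mapping, only the flag differs */
    static void *
    alloc_shmem(size_t size, bool huge)
    {
        int     flags = MAP_SHARED | MAP_ANONYMOUS;

        if (huge)
            flags |= MAP_HUGETLB;   /* hugetlb pages are never swapped out */

        return mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    }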

> The other thing I haven't thought about very much is determining on
> which CPUs/nodes the instance is allowed to run. I assume we'd start by
> simply inherit/determine that at the start through libnuma, not through
> some custom PG configuration (which the patch [2] proposed to do).

0. I think we could do better; some counterarguments to having no
configuration at all:

a. As Robert & Bertrand already put it in their review: let's say I want
to run only on NUMA node #2; I would then need to override systemd's
ExecStart= to wrap the server in numactl (not elegant?). I could also use
`CPUAffinity=1,3,5,7..`, but that's about it, and it is even less friendly.
It also probably requires root to edit/reload the systemd unit, while
having a GUC for this, as in my proposal, makes it smoother (I think?).

b. Wouldn't it be better if this stayed opt-in rather than always on? If
there's a problem, how do you disable these internal optimizations when
they do harm in some cases (or, say, when I want to play with
MPOL_INTERLEAVE_WEIGHTED)? So at least a boolean numa_buffers_interleave
GUC would be nice (rough sketch below, after point d).

c. What if I want my standby (walreceiver + startup/recovery) to run with
NUMA affinity to get better performance? I'm not going to hack around the
systemd script every time, but I could imagine changing numa=X,Y,Z after a
restart / before promotion.

d. Now, if for some reason I were forced to do that numactl(1) voodoo and
use the above-mentioned overrides, and PG had no GUC for this (let's say I
used `numactl --weighted-interleave=0,1`), then this part of the 0002
description:

> 2) v1-0002-NUMA-localalloc.patch
> This simply sets "localalloc" when initializing a backend, so that all
> memory allocated later is local, not interleaved. Initially this was
> necessary because the patch set the allocation policy to interleaving
> before initializing shared memory, and we didn't want to interleave the
> private memory. But that's no longer the case - the explicit mapping to
> nodes does not have this issue. I'm keeping the patch for convenience,
> it allows experimenting with numactl etc.

... would no longer be accurate, and we would still need that patch (again
with a GUC)?
Thoughts? I can add my part into your patches if you want.
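
For concreteness, the kind of boolean GUC I have in mind would be just a
regular entry in ConfigureNamesBool[] in guc_tables.c; a minimal sketch
(the name is taken from my proposal above, defaults/wording are only
illustrative):

    bool        numa_buffers_interleave = false;

    /* ... in ConfigureNamesBool[] ... */
    {
        {"numa_buffers_interleave", PGC_POSTMASTER, RESOURCES_MEM,
            gettext_noop("Interleaves shared buffers across NUMA nodes."),
            NULL
        },
        &numa_buffers_interleave,
        false,
        NULL, NULL, NULL
    },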

A way-too-quick review and some very fast benchmark probes; I've
concentrated only on v1-0001 and v1-0005 (the efficiency of the buffer
management changes is too new a topic for me), but let's start:

1. Normal pgbench -S (still with just s_b@4GB), many tries, consistent
benefit for the patch, around a +8..10% boost on a generic run:

   numa_buffers_interleave=off  numa_pgproc_interleave=on (due to that
always-on "if"), s_b landing entirely on one NUMA node (which can happen):
        latency average = 0.373 ms
        latency stddev = 0.237 ms
        initial connection time = 45.899 ms
        tps = 160242.147877 (without initial connection time)

   numa_buffers_interleave=on   numa_pgproc_interleave=on
        latency average = 0.345 ms
        latency stddev = 0.373 ms
        initial connection time = 44.485 ms
        tps = 177564.686094 (without initial connection time)

2. I tested it the same way as I did for my patch (problem #2 from
Andres's presentation): 4s32c128t, s_b=4GB (on 128GB), prewarm test (with
seqconcurrscans.pgb as earlier):
   default/numa_buffers_interleave=off
        latency average = 1375.478 ms
        latency stddev = 1141.423 ms
        initial connection time = 46.104 ms
        tps = 45.868075 (without initial connection time)

   numa_buffers_interleave=on
        latency average = 838.128 ms
        latency stddev = 498.787 ms
        initial connection time = 43.437 ms
        tps = 75.413894 (without initial connection time)

    and I've repeated the same test (identical conditions) with my patch,
which got me slightly more juice:
        latency average = 727.717 ms
        latency stddev = 410.767 ms
        initial connection time = 45.119 ms
        tps = 86.844161 (without initial connection time)

    (but mine didn't get that boost on the normal pgbench -S from #1 -- my
numa='all' stays at ~160k TPS, just like numa_buffers_interleave=off), so
this idea is clearly better.
    So should I close https://commitfest.postgresql.org/patch/5703/ and
you'll open a new one, or should I just edit #5703, alter it and add this
thread to it?

3. The patch does not interleave PQ shared memory; do we want to add that
as some next item, like v1-0007? The question is whether OS interleaving
makes sense there. I believe it does; please see my thread
(NUMA_pq_cpu_pinning_results.txt). The issue is that PQ workers are spawned
by the postmaster and may end up on different NUMA nodes randomly, so
OS-interleaving that memory actually reduces jitter there (AKA bandwidth
over latency). My thinking is that one cannot expect a static/forced
CPU-to-just-one-NUMA-node assignment for a backend and its PQ workers,
because that node will not always have CPU capacity available, so it might
be useful to interleave that shared memory too (as a separate patch item?).
A rough sketch of what I mean follows.
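
Just to illustrate the idea (a sketch only, not the patch; hypothetical
function name, assuming we would do it right after the parallel-query DSM
segment is created, e.g. in InitializeParallelDSM()):

    #ifdef USE_LIBNUMA
    #include <numa.h>
    #include "storage/dsm.h"

    static void
    interleave_pq_segment(dsm_segment *seg)
    {
        void   *addr = dsm_segment_address(seg);
        size_t  len = dsm_segment_map_length(seg);

        /*
         * Set an interleave policy on the whole mapping; the pages are
         * then spread across nodes as they get faulted in.
         */
        numa_interleave_memory(addr, len, numa_all_nodes_ptr);
    }
    #endif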

4. In BufferManagerShmemInit() you call numa_num_configured_nodes() (also
in v1-0005). My worry is: should we maybe add some known-limitations docs
(?) from the start and mention that if the VM is resized and new NUMA
nodes appear, they might not be used until restart?

5. In v1-0001, pg_numa_interleave_memory()

+                * XXX no return value, to make this fail on error, has to use
+                * numa_set_strict

Yes, my patch has numa_error() and numa_warn() handlers in pg_numa too.
Feel free to use them for better UX.
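
Roughly along these lines (just a sketch of the idea; libnuma lets the
application override these two symbols, so messages go through ereport()
instead of being printed to stderr; this would live in pg_numa.c, so
postgres.h/ereport are available):

    #include <stdarg.h>

    void
    numa_warn(int num, char *fmt, ...)
    {
        char        buf[1024];
        va_list     ap;

        va_start(ap, fmt);
        vsnprintf(buf, sizeof(buf), fmt, ap);
        va_end(ap);

        ereport(WARNING,
                (errmsg_internal("libnuma (%d): %s", num, buf)));
    }

    void
    numa_error(char *where)
    {
        ereport(WARNING,
                (errmsg_internal("libnuma: error in %s", where)));
    }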

+                * XXX Should we still touch the memory first, like with
+                * numa_move_pages, or is that not necessary?

It's not necessary to touch the memory after numa_tonode_memory() (or
numa_interleave_memory()); to the best of my knowledge, if it is going to
be used anyway, it will be placed correctly when first faulted in.

6. diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c

Accidental indentation changes (and this hunk also fails to apply).

7. We are missing the pg_numa_* shims, but that's surely for later, also
to avoid those Linux-specific #ifdef USE_LIBNUMA blocks everywhere?
Something along these lines:
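
A sketch of the shim pattern I mean (hypothetical function name, just to
show that the #ifdef would live in one place and callers stay
platform-agnostic):

    /* e.g. in pg_numa.c */
    int
    pg_numa_num_nodes(void)
    {
    #ifdef USE_LIBNUMA
        return numa_num_configured_nodes();
    #else
        return 1;               /* without libnuma, behave as a single node */
    #endif
    }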

8. v1-0005 has, in two places: + /* if (numa_procs_interleave) */

   Ha! It's a TRAP! I uncommented it because I wanted to try running
without it (just by setting the GUC off), but then "MyProc->sem" is NULL:

    2025-07-04 12:31:08.103 CEST [28754] LOG:  starting PostgreSQL
19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
    [..]
    2025-07-04 12:31:08.109 CEST [28754] LOG:  io worker (PID 28755)
was terminated by signal 11: Segmentation fault
    2025-07-04 12:31:08.109 CEST [28754] LOG:  terminating any other
active server processes
    2025-07-04 12:31:08.114 CEST [28754] LOG:  shutting down because
"restart_after_crash" is off
    2025-07-04 12:31:08.116 CEST [28754] LOG:  database system is shut down

    [New LWP 28755]
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    Core was generated by `postgres: io worker                     '.
    Program terminated with signal SIGSEGV, Segmentation fault.
    #0  __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
    136     ./nptl/sem_waitcommon.c: No such file or directory.
    (gdb) where
    #0  __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0)
at ./nptl/sem_waitcommon.c:136
    #1  __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
    #2  0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at
../src/backend/port/posix_sema.c:302
    #3  0x0000556191970553 in InitAuxiliaryProcess () at
../src/backend/storage/lmgr/proc.c:992
    #4  0x00005561918e51a2 in AuxiliaryProcessMainCommon () at
../src/backend/postmaster/auxprocess.c:65
    #5  0x0000556191940676 in IoWorkerMain (startup_data=<optimized
out>, startup_data_len=<optimized out>) at
../src/backend/storage/aio/method_worker.c:393
    #6  0x00005561918e8163 in postmaster_child_launch
(child_type=child_type@entry=B_IO_WORKER, child_slot=20086,
startup_data=startup_data@entry=0x0,
        startup_data_len=startup_data_len@entry=0,
client_sock=client_sock@entry=0x0) at
../src/backend/postmaster/launch_backend.c:290
    #7  0x00005561918ea09a in StartChildProcess
(type=type@entry=B_IO_WORKER) at
../src/backend/postmaster/postmaster.c:3973
    #8  0x00005561918ea308 in maybe_adjust_io_workers () at
../src/backend/postmaster/postmaster.c:4404
    [..]
    (gdb) print *MyProc->sem
    Cannot access memory at address 0x0

9. v1-0006: is this just a thought or a serious candidate? I can imagine
it easily blowing up, with some backends somehow requesting CPUs only from
one NUMA node while the other node sits idle. Isn't it better to just
leave CPU scheduling, well, to the CPU scheduler? The problem is that you
have tools showing overall CPU usage, even mpstat(1) per CPU, but nothing
that readily shows per-NUMA-node CPU util%, so it would be hard for
someone to realize that this is happening.

-J.

[1] - https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt


