Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
From: Jakub Wartak
Subject: Re: Adding basic NUMA awareness
Msg-id: CAKZiRmy7VW5b0OKwHXtRSudCBFB48-kYnOEXeqrLfbTQ=5qQrA@mail.gmail.com
In response to: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers
On Tue, Jul 1, 2025 at 9:07 PM Tomas Vondra <tomas@vondra.me> wrote:

Hi!

> 1) v1-0001-NUMA-interleaving-buffers.patch
[..]
> It's a bit more complicated, because the patch distributes both the
> blocks and descriptors, in the same way. So a buffer and it's descriptor
> always end on the same NUMA node. This is one of the reasons why we need
> to map larger chunks, because NUMA works on page granularity, and the
> descriptors are tiny - many fit on a memory page.

Oh, now I get it! OK, let's stick to this one.

> I don't think the splitting would actually make some things simpler, or
> maybe more flexible - in particular, it'd allow us to enable huge pages
> only for some regions (like shared buffers), and keep the small pages
> e.g. for PGPROC. So that'd be good.

You have made the assumption that this is good, but small pages (4kB) are not hugetlb and are *swappable* (transparent huge pages are swappable too; manually allocated ones, as with mmap(MAP_HUGETLB), are not) [1]. The most frequent problem I see these days is OOMs, and that makes me believe that making certain critical parts of shared memory swappable, just to get page-size granularity, is possibly throwing the baby out with the bathwater. I'm thinking about bad situations like: wrong settings of vm.swappiness that people keep (or distros keep?), and PG's general inability to restrain itself from allocating more memory in some cases.

> The other thing I haven't thought about very much is determining on
> which CPUs/nodes the instance is allowed to run. I assume we'd start by
> simply inherit/determine that at the start through libnuma, not through
> some custom PG configuration (which the patch [2] proposed to do).

0. I think that we could do better; some counterarguments to no-configuration-at-all:

a. As Robert & Bertrand already put it there after review: let's say I want to run just on NUMA node #2, so here I would need to override systemd's ExecStart= to include that numactl (not elegant?).
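To illustrate the non-elegance: without a GUC, binding the whole instance to node #2 today means a root-edited systemd drop-in along these lines (unit name and binary/data paths here are made up, adjust for your distro):

```ini
# /etc/systemd/system/postgresql.service.d/numa.conf  (hypothetical paths)
[Service]
# Clear the packaged ExecStart, then re-add it wrapped in numactl(8):
ExecStart=
ExecStart=/usr/bin/numactl --cpunodebind=2 --membind=2 /usr/lib/postgresql/bin/postgres -D /var/lib/postgresql/data
```

...followed by a root `systemctl daemon-reload` and a restart, so every policy change becomes an OS-level operation rather than a PG setting.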
I could also use `CPUAffinity=1,3,5,7..`, but that's all, and it is even less friendly. It also probably requires root to edit/reload systemd, while having a GUC for this, as in my proposal, makes it smoother (I think?).

b. Wouldn't it be better if that stayed as a drop-in rather than always-on? What if there's a problem - how do you disable those internal optimizations if they do harm in some cases? (Or let's say I want to play with MPOL_INTERLEAVE_WEIGHTED?) So at least a boolean numa_buffers_interleave would be nice?

c. What if I want my standby (walreceiver + startup/recovery) to run with NUMA affinity to get better performance? (I'm not going to hack around the systemd script every time, but I could imagine changing numa=X,Y,Z after restart/before promotion.)

d. Now, if I were forced for some reason to do that numactl(1) voodoo and use the above-mentioned overrides, and PG had no GUC (let's say I would use `numactl --weighted-interleave=0,1`), then:

> 2) v1-0002-NUMA-localalloc.patch
> This simply sets "localalloc" when initializing a backend, so that all
> memory allocated later is local, not interleaved. Initially this was
> necessary because the patch set the allocation policy to interleaving
> before initializing shared memory, and we didn't want to interleave the
> private memory. But that's no longer the case - the explicit mapping to
> nodes does not have this issue. I'm keeping the patch for convenience,
> it allows experimenting with numactl etc.

... is not accurate anymore, and we would be required to have that in (still with a GUC)? Thoughts? I can add that part of mine into your patches if you want.

Way too quick review and some very fast benchmark probes; I've concentrated only on v1-0001 and v1-0005 (efficiency of buffermgmt would be too new a topic for me), but let's start:

1.
Normal pgbench -S (still with just s_b@4GB), many tries, consistent benefit for the patch with a +8..10% boost on a generic run:

numa_buffers_interleave=off, numa_pgproc_interleave=on (due to that always-on "if"), s_b just on 1 NUMA node (might happen):
latency average = 0.373 ms
latency stddev = 0.237 ms
initial connection time = 45.899 ms
tps = 160242.147877 (without initial connection time)

numa_buffers_interleave=on, numa_pgproc_interleave=on:
latency average = 0.345 ms
latency stddev = 0.373 ms
initial connection time = 44.485 ms
tps = 177564.686094 (without initial connection time)

2. Tested it the same way as I did for mine (problem #2 from Andres's presentation): 4s32c128t, s_b=4GB (on 128GB), prewarm test (with seqconcurrscans.pgb as earlier):

default/numa_buffers_interleave=off:
latency average = 1375.478 ms
latency stddev = 1141.423 ms
initial connection time = 46.104 ms
tps = 45.868075 (without initial connection time)

numa_buffers_interleave=on:
latency average = 838.128 ms
latency stddev = 498.787 ms
initial connection time = 43.437 ms
tps = 75.413894 (without initial connection time)

I've repeated the same test (identical conditions) with my patch, which got me slightly more juice:
latency average = 727.717 ms
latency stddev = 410.767 ms
initial connection time = 45.119 ms
tps = 86.844161 (without initial connection time)

(But mine didn't get that boost from the normal pgbench -S as per #1 -- my numa='all' stays @ 160k TPS, just as numa_buffers_interleave=off does), so this idea is clearly better. So should I close https://commitfest.postgresql.org/patch/5703/ and you'll open a new one, or should I just edit #5703, alter it, and add this thread too?

3. The patch is not calling interleave on PQ shmem; do we want to add that as some next item, like v1-0007? The question is whether OS interleaving makes sense there.
I believe it does; please see my thread (NUMA_pq_cpu_pinning_results.txt). The issue is that PQ workers are spawned by postmaster and may end up on different NUMA nodes randomly, so OS-interleaving that memory actually reduces jitter there (AKA bandwidth-over-latency). My thinking is that one cannot expect a static/forced CPU-to-just-one-NUMA-node assignment for a backend and its PQ workers, because it is impossible to always have CPU power available in that NUMA node, so it might be useful to interleave that shared mem there too (as a separate patch item?).

4. In BufferManagerShmemInit() you call numa_num_configured_nodes() (also in v1-0005). My worry is: should we perhaps put some known-limitations docs (?) in from the start and mention that if the VM is greatly resized and new NUMA nodes appear, they might not be used until restart?

5. In v1-0001, pg_numa_interleave_memory():

+ * XXX no return value, to make this fail on error, has to use
+ * numa_set_strict

Yes, my patch has those numa_error() and numa_warn() handlers too in pg_numa. Feel free to use it for better UX.

+ * XXX Should we still touch the memory first, like with numa_move_pages,
+ * or is that not necessary?

It's not necessary to touch after numa_tonode_memory() (wrapper around numa_interleave_memory()); if it is going to be used anyway, it will be correctly placed, to the best of my knowledge.

6. diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c

Accidental indents (it also fails to apply).

7. We miss the pg_numa_* shims, but for sure that's for later, and also to avoid those Linux-specific #ifdef USE_LIBNUMA and so on?

8. v1-0005, 2x:

+ /* if (numa_procs_interleave) */

Ha! It's a TRAP! I uncommented it because I wanted to try it out without it (just by setting the GUC off), but "MyProc->sem" is NULL:

2025-07-04 12:31:08.103 CEST [28754] LOG: starting PostgreSQL 19devel on x86_64-linux, compiled by gcc-12.2.0, 64-bit
[..]
2025-07-04 12:31:08.109 CEST [28754] LOG: io worker (PID 28755) was terminated by signal 11: Segmentation fault
2025-07-04 12:31:08.109 CEST [28754] LOG: terminating any other active server processes
2025-07-04 12:31:08.114 CEST [28754] LOG: shutting down because "restart_after_crash" is off
2025-07-04 12:31:08.116 CEST [28754] LOG: database system is shut down

[New LWP 28755]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: io worker '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) at ./nptl/sem_waitcommon.c:136
136 ./nptl/sem_waitcommon.c: No such file or directory.
(gdb) where
#0 __new_sem_wait_fast (definitive_result=1, sem=sem@entry=0x0) at ./nptl/sem_waitcommon.c:136
#1 __new_sem_trywait (sem=sem@entry=0x0) at ./nptl/sem_wait.c:81
#2 0x00005561918e0cac in PGSemaphoreReset (sema=0x0) at ../src/backend/port/posix_sema.c:302
#3 0x0000556191970553 in InitAuxiliaryProcess () at ../src/backend/storage/lmgr/proc.c:992
#4 0x00005561918e51a2 in AuxiliaryProcessMainCommon () at ../src/backend/postmaster/auxprocess.c:65
#5 0x0000556191940676 in IoWorkerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at ../src/backend/storage/aio/method_worker.c:393
#6 0x00005561918e8163 in postmaster_child_launch (child_type=child_type@entry=B_IO_WORKER, child_slot=20086, startup_data=startup_data@entry=0x0, startup_data_len=startup_data_len@entry=0, client_sock=client_sock@entry=0x0) at ../src/backend/postmaster/launch_backend.c:290
#7 0x00005561918ea09a in StartChildProcess (type=type@entry=B_IO_WORKER) at ../src/backend/postmaster/postmaster.c:3973
#8 0x00005561918ea308 in maybe_adjust_io_workers () at ../src/backend/postmaster/postmaster.c:4404
[..]
(gdb) print *MyProc->sem
Cannot access memory at address 0x0

9. v1-0006: is this just a thought or a serious candidate?
I can imagine it could easily blow up, with some backends somehow requesting CPUs only from one NUMA node while the second node sits idle. Isn't it better just to leave CPU scheduling, well, to the CPU scheduler? The problem is that you have tools showing overall CPU usage, even mpstat(1) per CPU, but no tools for per-NUMA-node CPU util%, so it would be hard for someone to realize that this is happening.

-J.

[1] - https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt