Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
| From | Jakub Wartak |
|---|---|
| Subject | Re: Adding basic NUMA awareness |
| Date | |
| Msg-id | CAKZiRmx=0C5k3Qs0DdHZw9cL+72sX_ZH_RXdUW-7U1-978Kvnw@mail.gmail.com |
| In response to | Re: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>) |
| List | pgsql-hackers |
On Tue, Nov 4, 2025 at 10:21 PM Tomas Vondra <tomas@vondra.me> wrote:

Hi Tomas,

> > 0007a: pg_buffercache_pgproc returns pgproc_ptr and fastpath_ptr in
> > bigint and not hex? I've wanted to adjust that to TEXTOID, but instead
> > I've thought it is going to be simpler to use to_hex() -- see 0009
> > attached.
> >
>
> I don't know. I added simply because it might be useful for development,
> but we probably don't want to expose these pointers at all.
>
> > 0007b: pg_buffercache_pgproc -- nitpick, but maybe it would be better
> > called pg_shm_pgproc?
> >
>
> Right. It does not belong to pg_buffercache at all, I just added it
> there because I've been messing with that code already.

Please keep them in at least for some time (perhaps a standalone patch
marked as not intended to be committed would work?). I find the view
extremely useful, as it will allow us to pinpoint local-vs-remote NUMA
fetches (we need to know the address).

> > 0007c with check_numa='buffers,procs' throws 'mbind Invalid argument'
> > during start:
> >
> > 2025-11-04 10:02:27.055 CET [58464] DEBUG: NUMA:
> > pgproc_init_partition procs 0x7f8d30400000 endptr 0x7f8d30800000
> > num_procs 2523 node 0
> > 2025-11-04 10:02:27.057 CET [58464] DEBUG: NUMA:
> > pgproc_init_partition procs 0x7f8d30800000 endptr 0x7f8d30c00000
> > num_procs 2523 node 1
> > 2025-11-04 10:02:27.059 CET [58464] DEBUG: NUMA:
> > pgproc_init_partition procs 0x7f8d30c00000 endptr 0x7f8d31000000
> > num_procs 2523 node 2
> > 2025-11-04 10:02:27.061 CET [58464] DEBUG: NUMA:
> > pgproc_init_partition procs 0x7f8d31000000 endptr 0x7f8d31400000
> > num_procs 2523 node 3
> > 2025-11-04 10:02:27.062 CET [58464] DEBUG: NUMA:
> > pgproc_init_partition procs 0x7f8d31400000 endptr 0x7f8d31407cb0
> > num_procs 38 node -1
> > mbind: Invalid argument
> > mbind: Invalid argument
> > mbind: Invalid argument
> > mbind: Invalid argument
> >
>
> I'll take a look, but I don't recall seeing such errors.

Alexy also reported this earlier, here
https://www.postgresql.org/message-id/92e23c85-f646-4bab-b5e0-df30d8ddf4bd%40postgrespro.ru
(just use HP, set some high max_connections). I've double-checked this too:
the numa_tonode_memory() len needs to be a multiple of the HP size.

> > 0007d: so we probably need numa_warn()/numa_error() wrappers (this was
> > initially part of NUMA observability patches but got removed during
> > the course of action), I'm attaching 0008. With that you'll get
> > something a little more up to our standards:
> > 2025-11-04 10:27:07.140 CET [59696] DEBUG:
> > fastpath_parititon_init node = 3, ptr = 0x7f4f4d400000, endptr =
> > 0x7f4f4d4b1660
> > 2025-11-04 10:27:07.140 CET [59696] WARNING: libnuma: ERROR: mbind
> >
>
> Not sure.

Any particular objections? We need to somehow emit them into the logs.

> > 0007f: The "mbind: Invalid argument" issue itself with the below addition: [..]
> >
> > but mbind() was called for just 0x7f39eeab1660-0x7f39eea00000 =
> > 0xB1660 = 726624 bytes, but if adjust blindly endptr in that
> > fastpath_partition_init() to be "char *endptr = ptr + 2*1024*1024;"
> > (HP) it doesn't complain anymore and I get success: [..]
>
> Hmm, so it seems like another hugepage-related issue. The mbind manpage
> says this about "len":
>
> EINVAL An invalid value was specified for flags or mode; or addr + len
> was less than addr; or addr is not a multiple of the system page size.
>
> I don't think that requires (addr+len) to be a multiple of page size,
> but maybe that is required.

I do think that 'system page size' here means the HP page size, but this
time it is just for fastpath_partition_init(); the earlier one seems to be
aligned fine (?? -- I haven't really checked, but there's no error).
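For illustration only, a minimal sketch of the direction implied above (this
is not the actual patch; the helper name and the huge_page_size parameter are
assumptions): round the partition length up to the huge page size before
handing it to numa_tonode_memory(), since mbind() on a hugetlb mapping seems
to reject lengths that are not huge-page multiples. Blindly padding past the
real end of the array is of course only acceptable if the padded tail still
belongs to the same partition.

```c
#include <numa.h>
#include <stddef.h>

/*
 * Hypothetical helper, not PostgreSQL code: bind [ptr, endptr) to "node",
 * rounding the length up to the huge page size (a power of two) so that
 * the underlying mbind() does not fail with EINVAL on hugetlb mappings.
 */
static void
numa_bind_partition(char *ptr, char *endptr, int node, size_t huge_page_size)
{
	size_t		len = (size_t) (endptr - ptr);

	/* round up to the next huge page boundary */
	len = (len + huge_page_size - 1) & ~(huge_page_size - 1);

	/* caution: the padded tail must still belong to this partition */
	numa_tonode_memory(ptr, len, node);
}
```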
> > 0006d: I've got one SIGBUS during a call to select
> > pg_buffercache_numa_pages(); and it looks like that memory accessed is
> > simply not mapped? (bug)
> >
> > Program received signal SIGBUS, Bus error.
> > pg_buffercache_numa_pages (fcinfo=0x561a97e8e680) at
> > ../contrib/pg_buffercache/pg_buffercache_pages.c:386
> > 386 pg_numa_touch_mem_if_required(ptr);
> > (gdb) print ptr
> > $1 = 0x7f4ed0200000 <error: Cannot access memory at address 0x7f4ed0200000>
> > (gdb) where
> > #0 pg_buffercache_numa_pages (fcinfo=0x561a97e8e680) at
> > ../contrib/pg_buffercache/pg_buffercache_pages.c:386
> > #1 0x0000561a672a0efe in ExecMakeFunctionResultSet
> > (fcache=0x561a97e8e5d0, econtext=econtext@entry=0x561a97e8dab8,
> > argContext=0x561a97ec62a0, isNull=0x561a97e8e578,
> > isDone=isDone@entry=0x561a97e8e5c0) at
> > ../src/backend/executor/execSRF.c:624
> > [..]
> >
> > Postmaster had still attached shm (visible via smaps), and if you
> > compare closely 0x7f4ed0200000 against sorted smaps:
> >
> > 7f4921400000-7f4b21400000 rw-s 252600000 00:11 151111 /anon_hugepage (deleted)
> > 7f4b21400000-7f4d21400000 rw-s 452600000 00:11 151111 /anon_hugepage (deleted)
> > 7f4d21400000-7f4f21400000 rw-s 652600000 00:11 151111 /anon_hugepage (deleted)
> > 7f4f21400000-7f4f4bc00000 rw-s 852600000 00:11 151111 /anon_hugepage (deleted)
> > 7f4f4bc00000-7f4f4c000000 rw-s 87ce00000 00:11 151111 /anon_hugepage (deleted)
> >
> > it's NOT there at all (there's no mmap region starting with
> > 0x"7f4e" ). It looks like because pg_buffercache_numa_pages() is not
> > aware of this new mmaped() regions and instead does simple loop over
> > all NBuffers with "for (char *ptr = startptr; ptr < endptr; ptr +=
> > os_page_size)"?
> >
>
> I'm confused. How could that mapping be missing? Was this with huge
> pages / how many did you reserve on the nodes?

OK, I made an error here and only partially got it right (it does crash
reliably), so I partially misled you, apologies; let me explain. There were
two questions for me:

a) why we make a single mmap() and after numa_tonode_memory() we get plenty
of mappings
b) why we get SIGBUS (I thought the regions were not contiguous, but after
triple-checking they are)

ad a) My testing shows this happens with HP, as stated initially ("all of
this was on 4s/4 NUMA nodes with HP on"). That's what the code does: you get
a single mmap() (resulting in a single entry in smaps), but after
numa_tonode_memory() there are many of them. Even on a laptop:

System has 1 NUMA nodes (0 to 0).
Attempting to allocate 8.000000 MB of HugeTLB memory...
Successfully allocated HugeTLB memory at 0x755828800000, smaps before:
755828800000-755829000000 rw-s 00000000 00:11 259808 /anon_hugepage (deleted)
Pinning first part (from 0x755828800000) to NUMA node 0...
smaps after:
755828800000-755828c00000 rw-s 00000000 00:11 259808 /anon_hugepage (deleted)
755828c00000-755829000000 rw-s 00400000 00:11 259808 /anon_hugepage (deleted)
Pinning second part (from 0x755828c00000) to NUMA node 0...
smaps after:
755828800000-755828c00000 rw-s 00000000 00:11 259808 /anon_hugepage (deleted)
755828c00000-755829000000 rw-s 00400000 00:11 259808 /anon_hugepage (deleted)
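In case it helps to reproduce, here is a rough reconstruction of the kind of
standalone test used above (the real testnumammapsplit.c may differ; the 2MB
huge page size, the use of /proc/PID/maps instead of smaps, and the exact
printouts are assumptions). It mmap()s one 8MB MAP_HUGETLB region, then pins
the two 4MB halves with numa_tonode_memory() and shows the single VMA being
split by the kernel. Compile with -lnuma and have at least four 2MB huge
pages reserved.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <numa.h>

#define SZ (8UL * 1024 * 1024)		/* 8MB, i.e. four 2MB huge pages */

/* print the hugetlb VMAs of this process, one line per mapping */
static void
show_maps(const char *label)
{
	char		cmd[64];

	printf("--- %s ---\n", label);
	snprintf(cmd, sizeof(cmd), "grep hugepage /proc/%d/maps", (int) getpid());
	system(cmd);
}

int
main(void)
{
	char	   *p;

	printf("System has %d NUMA node(s).\n", numa_num_configured_nodes());

	p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED)
	{
		perror("mmap");
		return 1;
	}
	show_maps("before");

	/* pin the first 4MB half to node 0: the kernel splits the VMA here */
	numa_tonode_memory(p, SZ / 2, 0);
	show_maps("after pinning first half");

	/* pin the second 4MB half as well */
	numa_tonode_memory(p + SZ / 2, SZ / 2, 0);
	show_maps("after pinning second half");

	return 0;
}
```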
It gets even funnier. Below I again have 8MB with HP on, but this time I
issue numa_tonode_memory() just twice, each call for len 2MB to node 0: once
at ptr, and once at ptr+4MB (the middle of the region):

System has 1 NUMA nodes (0 to 0).
Attempting to allocate 8.000000 MB of HugeTLB memory...
Successfully allocated HugeTLB memory at 0x7302dda00000, smaps before:
7302dda00000-7302de200000 rw-s 00000000 00:11 284859 /anon_hugepage (deleted)
Pinning first part (from 0x7302dda00000) to NUMA node 0...
smaps after:
7302dda00000-7302ddc00000 rw-s 00000000 00:11 284859 /anon_hugepage (deleted)
7302ddc00000-7302de200000 rw-s 00200000 00:11 284859 /anon_hugepage (deleted)
Pinning second part (from 0x7302dde00000) to NUMA node 0...
smaps after:
7302dda00000-7302ddc00000 rw-s 00000000 00:11 284859 /anon_hugepage (deleted)
7302ddc00000-7302dde00000 rw-s 00200000 00:11 284859 /anon_hugepage (deleted)
7302dde00000-7302de000000 rw-s 00400000 00:11 284859 /anon_hugepage (deleted)
7302de000000-7302de200000 rw-s 00600000 00:11 284859 /anon_hugepage (deleted)

Why 4 mappings instead of 1? Because some of the ranges are now "default",
as their policy was not altered:

$ grep huge /proc/$(pidof testnumammapsplit)/numa_maps
7302dda00000 bind:0 file=/anon_hugepage\040(deleted) huge
7302ddc00000 default file=/anon_hugepage\040(deleted) huge
7302dde00000 bind:0 file=/anon_hugepage\040(deleted) huge
7302de000000 default file=/anon_hugepage\040(deleted) huge
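As a side note, the per-range policy that numa_maps reports can also be
checked programmatically with the raw get_mempolicy() call and MPOL_F_ADDR;
a minimal sketch (the probed address here is arbitrary, just the stack),
which would print "default" for the untouched ranges and "bind" for the
rebound ones:

```c
#include <stdio.h>
#include <numaif.h>				/* get_mempolicy(), MPOL_*; link with -lnuma */

int
main(void)
{
	int			mode;
	int			probe;			/* any mapped address works; here: the stack */

	/* MPOL_F_ADDR asks for the policy of the VMA containing the address */
	if (get_mempolicy(&mode, NULL, 0, &probe, MPOL_F_ADDR) != 0)
	{
		perror("get_mempolicy");
		return 1;
	}
	printf("policy: %s\n",
		   mode == MPOL_BIND ? "bind" :
		   mode == MPOL_DEFAULT ? "default" : "other");
	return 0;
}
```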
Back to the original error: the regions are consecutive, and the earlier
problem was:

error: 0x7f4ed0200000 <error: Cannot access memory at address 0x7f4ed0200000>
start: 0x7f4921400000
end:   0x7f4f4c000000

so the address does fit into that range (that was my mistake earlier: I used
just grep, without checking whether the address really falls within one of
the regions), but...

> Maybe there were not enough huge pages left on one of the nodes?

ad b) Right, something like that. I've investigated that SIGBUS (it's going
to be long): with shared_buffers=32GB, huge_pages 17715 (+1 from what
postgres -C shared_memory_size_in_huge_pages returns), right after startup,
but with no touch:

Program received signal SIGBUS, Bus error.
pg_buffercache_numa_pages (fcinfo=0x5572038790b8) at
../contrib/pg_buffercache/pg_buffercache_pages.c:386
386 pg_numa_touch_mem_if_required(ptr);
(gdb) where
#0 pg_buffercache_numa_pages (fcinfo=0x5572038790b8) at
../contrib/pg_buffercache/pg_buffercache_pages.c:386
#1 0x00005571f54ddb7d in ExecMakeTableFunctionResult (setexpr=0x557203870d40,
econtext=0x557203870ba8, argContext=<optimized out>,
expectedDesc=0x557203870f80, randomAccess=false) at
../src/backend/executor/execSRF.c:234
[..]
(gdb) print ptr
$1 = 0x7f6cf8400000 <error: Cannot access memory at address 0x7f6cf8400000>
(gdb)

It then turns out there are no available hugepages on one of the nodes
(while gdb is hanging and preventing the autorestart):

root@swiatowid:/sys/devices/system/node# grep -r -i HugePages_Free node*/meminfo
node0/meminfo:Node 0 HugePages_Free: 299
node1/meminfo:Node 1 HugePages_Free: 299
node2/meminfo:Node 2 HugePages_Free: 299
node3/meminfo:Node 3 HugePages_Free: 0

but the nodes are equal in terms of size:

node0/meminfo:Node 0 HugePages_Total: 4429
node1/meminfo:Node 1 HugePages_Total: 4429
node2/meminfo:Node 2 HugePages_Total: 4429
node3/meminfo:Node 3 HugePages_Total: 4428

smaps shows that this address (7f6cf8400000) is covered by this mapping:

7f6b49c00000-7f6d49c00000 rw-s 652600000 00:11 86064 /anon_hugepage (deleted)

numa_maps for this region shows this is the mapping bound to node3 (notice
N3 + bind:3 matches the lack of free memory in Node 3 HugePages_Free):

7f6b49c00000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=3444 N3=3444 kernelpagesize_kB=2048

The surrounding area looks like this:

7f6549c00000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=4096 N0=4096 kernelpagesize_kB=2048
7f6749c00000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=4096 N1=4096 kernelpagesize_kB=2048
7f6949c00000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=4096 N2=4096 kernelpagesize_kB=2048
7f6b49c00000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=3444 N3=3444 kernelpagesize_kB=2048 <-- this is the one
7f6d49c00000 default file=/anon_hugepage\040(deleted) huge dirty=107 mapmax=6 N3=107 kernelpagesize_kB=2048

Notice it's just N3=3444, while the others are much larger. So something was
using that hugepage memory on N3:

# grep kernelpagesize_kB=2048 /proc/1679/numa_maps | grep -Po N[0-4]=[0-9]+ | sort
N0=2
N0=4096
N1=2
N1=4096
N2=2
N2=4096
N3=1
N3=1
N3=1
N3=1
N3=107
N3=13
N3=3
N3=3444

So per the above, it's not there (at least not as 2MB HP). But the number of
mappings on that node is wild! (The node where it is failing has plenty of
memory and no hugepage memory left, but it has something like 40k+ small
mappings!)

# grep -Po 'N[0-3]=' /proc/1679/numa_maps | sort | uniq -c
     17 N0=
     10 N1=
      3 N2=
  40434 N3=

Most of them are `anon_inode:[io_uring]` (and I had max_connections=10k).
You may ask why, in spite of Andres' optimization for reducing the number of
segments for io_uring, it's not working for me? Well, I've just noticed a
way too silent failure to activate it (although I'm on 6.14.x):

2025-11-06 13:34:49.128 CET [1658] DEBUG: can't use combined memory mapping for io_uring, kernel or liburing too old

and I don't have io_uring_queue_init_mem()/HAVE_LIBURING_QUEUE_INIT_MEM,
apparently because of liburing-2.3 (Debian's default). See [1] for more info
(the fix is sadly not committed yet).

Next try, now with io_method = worker, and right before start:

root@swiatowid:/sys/devices/system/node# grep -r -i HugePages_Total node*/meminfo
node0/meminfo:Node 0 HugePages_Total: 4429
node1/meminfo:Node 1 HugePages_Total: 4429
node2/meminfo:Node 2 HugePages_Total: 4429
node3/meminfo:Node 3 HugePages_Total: 4428

and HugePages_Free was at 100% (while postgresql was down). After start (but
without doing anything else):

root@swiatowid:/sys/devices/system/node# grep -r -i HugePages_Free node*/meminfo
node0/meminfo:Node 0 HugePages_Free: 4393
node1/meminfo:Node 1 HugePages_Free: 4395
node2/meminfo:Node 2 HugePages_Free: 4395
node3/meminfo:Node 3 HugePages_Free: 3446

So sadly the picture is the same (something stole my HP on N3, and it's
PostgreSQL on its own).

After some time of investigating that ("who stole my hugepage across the
whole OS"), I just added MAP_POPULATE to the mix of PG_MMAP_FLAGS and got
this after start:

root@swiatowid:/sys/devices/system/node# grep -r -i HugePages_Free node*/meminfo
node0/meminfo:Node 0 HugePages_Free: 0
node1/meminfo:Node 1 HugePages_Free: 0
node2/meminfo:Node 2 HugePages_Free: 0
node3/meminfo:Node 3 HugePages_Free: 1

and then the SELECT from pg_buffercache_numa works fine(!).
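For context, a minimal standalone sketch of what that experiment amounts to
(this is a demo, not the actual PostgreSQL change; PG_MMAP_FLAGS itself lives
in src/include/portability/mem.h and its exact contents are not reproduced
here): MAP_POPULATE makes the kernel fault the huge pages in at mmap() time,
i.e. before any numa_tonode_memory() rebinding, which matches HugePages_Free
dropping to (near) zero right after startup above.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int
main(void)
{
	size_t		size = 2UL * 1024 * 1024;	/* one 2MB huge page */
	void	   *ptr;

	/*
	 * MAP_POPULATE prefaults the mapping, consuming the huge page
	 * reservation right away instead of at first touch.
	 */
	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE,
			   -1, 0);
	if (ptr == MAP_FAILED)
	{
		perror("mmap");
		return 1;
	}
	printf("mapped and populated %zu bytes at %p\n", size, ptr);
	return 0;
}
```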
Other ways that I have found to eliminate that SIGBUS:

a. Throw in many more HugePages (so that no node runs out of
HugePages_Free), but that's not a real option.

b. Then I reminded myself that I could be running a custom kernel with the
experimental CONFIG_READ_ONLY_THP_FOR_FS (to reduce iTLB misses transparently
with a specially linked PG; I will double-check the exact details later), so
I've thrown "never" into /sys/kernel/mm/transparent_hugepage/enabled and into
defrag too (yes, disabled THP), and with that -- drumroll -- that SELECT
works. The very same PG setup after startup (where earlier it would crash)
now looks like this after the SELECT:

root@swiatowid:/sys/devices/system/node# grep -r -i HugePages_Free node*/meminfo
node0/meminfo:Node 0 HugePages_Free: 83
node1/meminfo:Node 1 HugePages_Free: 0
node2/meminfo:Node 2 HugePages_Free: 81
node3/meminfo:Node 3 HugePages_Free: 82

Hope that helps a little. To me it sounds like THP somehow used memory that
we also wanted to use. With numa_interleave_ptr() that wouldn't be a problem,
because it would probably find something else available, but not here, as we
indicated the exact node.

> > 0006e:
> > I'm seeking confirmation, but is this the issue we have discussed
> > on PgconfEU related to lack of detection of Mems_allowed, right? e.g.
> > $ numactl --membind="0,1" --cpunodebind="0,1"
> > /usr/pgsql19/bin/pg_ctl -D /path start
> > still shows 4 NUMA nodes used. Current patches use
> > numa_num_configured_nodes(), but it says 'This count includes any
> > nodes that are currently DISABLED'. So I was wondering if I could help
> > by migrating towards numa_num_task_nodes() / numa_get_mems_allowed()?
> > It's the same as You wrote earlier to Alexy?
> >
>
> If "mems_allowed" refers to nodes allowing memory allocation, then yes,
> this would be one way to get into that issue. Oh, is this what happened
> in 0006d?

OK, thanks for the confirmation. No, 0006d was about a normal numactl run,
without --membind.

> I did get a couple of "operation canceled" failures, but only on fairly
> old kernel versions (6.1 which came as default with the VM).

OK, I'll try to look at that later too.

BTW, a quick question regarding the partitioned clock-sweep, as I had been
thinking about it: does this open a road towards multiple bgwriters?
(outside of this $thread/v1/PoC)

-J.

[1] - https://www.postgresql.org/message-id/CAKZiRmzxj6Lt1w2ffDoUmN533TgyDeYVULEH1PQFLRyBJSFP6w%40mail.gmail.com