
From Jakub Wartak
Subject Re: Adding basic NUMA awareness
Date
Msg-id CAKZiRmx=0C5k3Qs0DdHZw9cL+72sX_ZH_RXdUW-7U1-978Kvnw@mail.gmail.com
In response to Re: Adding basic NUMA awareness  (Tomas Vondra <tomas@vondra.me>)
List pgsql-hackers
On Tue, Nov 4, 2025 at 10:21 PM Tomas Vondra <tomas@vondra.me> wrote:

Hi Tomas,

> > 0007a: pg_buffercache_pgproc returns pgproc_ptr and fastpath_ptr in
> > bigint and not hex? I've wanted to adjust that to TEXTOID, but instead
> > I've thought it is going to be simpler to use to_hex() -- see 0009
> > attached.
> >
>
> I don't know. I added simply because it might be useful for development,
> but we probably don't want to expose these pointers at all.
>
> > 0007b: pg_buffercache_pgproc -- nitpick, but maybe it would be better
> > called pg_shm_pgproc?
> >
>
> Right. It does not belong to pg_buffercache at all, I just added it
> there because I've been messing with that code already.

Please keep them in, at least for some time (perhaps a standalone
patch marked as not intended to be committed would work?). I find the
view extremely useful, as it will allow us to pinpoint local-vs-remote
NUMA fetches (we need to know the address).
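For reference, what I mean by needing the address (a rough sketch with
a hypothetical helper, not something from the patches): given a pointer
from the view, we can ask the kernel which node currently backs it:

#include <numa.h>
#include <numaif.h>

/* hypothetical helper: which NUMA node backs this address right now?
 * nodes=NULL makes numa_move_pages() only query the status. */
static int
node_of_address(void *addr)
{
    void   *pages[1] = {addr};
    int     status[1] = {-1};

    if (numa_move_pages(0, 1, pages, NULL, status, 0) != 0)
        return -1;

    return status[0];       /* node id, or a negative errno */
}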

> > 0007c with check_numa='buffers,procs' throws 'mbind Invalid argument'
> > during start:
> >
> >     2025-11-04 10:02:27.055 CET [58464] DEBUG:  NUMA:
> > pgproc_init_partition procs 0x7f8d30400000 endptr 0x7f8d30800000
> > num_procs 2523 node 0
> >     2025-11-04 10:02:27.057 CET [58464] DEBUG:  NUMA:
> > pgproc_init_partition procs 0x7f8d30800000 endptr 0x7f8d30c00000
> > num_procs 2523 node 1
> >     2025-11-04 10:02:27.059 CET [58464] DEBUG:  NUMA:
> > pgproc_init_partition procs 0x7f8d30c00000 endptr 0x7f8d31000000
> > num_procs 2523 node 2
> >     2025-11-04 10:02:27.061 CET [58464] DEBUG:  NUMA:
> > pgproc_init_partition procs 0x7f8d31000000 endptr 0x7f8d31400000
> > num_procs 2523 node 3
> >     2025-11-04 10:02:27.062 CET [58464] DEBUG:  NUMA:
> > pgproc_init_partition procs 0x7f8d31400000 endptr 0x7f8d31407cb0
> > num_procs 38 node -1
> >     mbind: Invalid argument
> >     mbind: Invalid argument
> >     mbind: Invalid argument
> >     mbind: Invalid argument
> >
>
> I'll take a look, but I don't recall seeing such errors.
>

Alexy also reported this earlier, here
https://www.postgresql.org/message-id/92e23c85-f646-4bab-b5e0-df30d8ddf4bd%40postgrespro.ru
(just use HP and set some high max_connections). I've double-checked
this too; the numa_tonode_memory() len needs to be aligned to the HP size.
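Roughly speaking, the fix amounts to rounding the length up before the
call (a sketch assuming 2MB huge pages, not the exact patch):

#include <numa.h>
#include <stddef.h>

#define HUGE_PAGE_SIZE  (2 * 1024 * 1024UL)    /* assumed 2MB huge pages */

/* hypothetical helper: bind [start, start+len) to a node, with len
 * rounded up to a multiple of the huge page size so that the mbind()
 * underneath numa_tonode_memory() does not fail with EINVAL */
static void
bind_region_to_node(void *start, size_t len, int node)
{
    size_t  aligned_len = (len + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);

    numa_tonode_memory(start, aligned_len, node);
}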

> > 0007d: so we probably need numa_warn()/numa_error() wrappers (this was
> > initially part of NUMA observability patches but got removed during
> > the course of action), I'm attaching 0008. With that you'll get
> > something a little more up to our standards:
> >     2025-11-04 10:27:07.140 CET [59696] DEBUG:
> > fastpath_parititon_init node = 3, ptr = 0x7f4f4d400000, endptr =
> > 0x7f4f4d4b1660
> >     2025-11-04 10:27:07.140 CET [59696] WARNING:  libnuma: ERROR: mbind
> >
>
> Not sure.

Any particular objections? We need to somehow emit them into the logs.
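For reference, the wrappers in 0008 are roughly of this shape (a sketch,
not the exact patch; libnuma declares numa_error()/numa_warn() as weak
symbols that an application may override):

#include "postgres.h"

#include <stdarg.h>

/* override libnuma's weak symbols so its complaints end up in the log */
void
numa_error(char *where)
{
    ereport(WARNING,
            (errmsg("libnuma: ERROR: %s", where)));
}

void
numa_warn(int num, char *fmt,...)
{
    char        buf[1024];
    va_list     ap;

    va_start(ap, fmt);
    vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);

    ereport(WARNING,
            (errmsg("libnuma: WARNING (%d): %s", num, buf)));
}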

> > 0007f: The "mbind: Invalid argument"" issue itself with the below  addition:
[..]
> >
> >     but mbind() was called for just 0x7f39eeab1660−0x7f39eea00000 =
> > 0xB1660 = 726624 bytes, but if adjust blindly endptr in that
> > fastpath_partition_init() to be "char *endptr = ptr + 2*1024*1024;"
> > (HP) it doesn't complain anymore and I get success:
[..]
>
> Hmm, so it seems like another hugepage-related issue. The mbind manpage
> says this about "len":
>
>   EINVAL An invalid value was specified for flags or mode; or addr + len
>   was less than addr; or addr is not a multiple of the system page size.
>
> I don't think that requires (addr+len) to be a multiple of page size,
> but maybe that is required.

I do think that 'system page size' above means the HP page size, but
this time it's just fastpath_partition_init(); the earlier one seems
to be aligned fine (?? -- I haven't really checked, but there's no error).

> > 0006d: I've got one SIGBUS during a call to select
> > pg_buffercache_numa_pages(); and it looks like that memory accessed is
> > simply not mapped? (bug)
> >
> >     Program received signal SIGBUS, Bus error.
> >     pg_buffercache_numa_pages (fcinfo=0x561a97e8e680) at
> > ../contrib/pg_buffercache/pg_buffercache_pages.c:386
> >     386                                     pg_numa_touch_mem_if_required(ptr);
> >     (gdb) print ptr
> >     $1 = 0x7f4ed0200000 <error: Cannot access memory at address 0x7f4ed0200000>
> >     (gdb) where
> >     #0  pg_buffercache_numa_pages (fcinfo=0x561a97e8e680) at
> > ../contrib/pg_buffercache/pg_buffercache_pages.c:386
> >     #1  0x0000561a672a0efe in ExecMakeFunctionResultSet
> > (fcache=0x561a97e8e5d0, econtext=econtext@entry=0x561a97e8dab8,
> > argContext=0x561a97ec62a0, isNull=0x561a97e8e578,
> > isDone=isDone@entry=0x561a97e8e5c0) at
> > ../src/backend/executor/execSRF.c:624
> >     [..]
> >
> >     Postmaster had still attached shm (visible via smaps), and if you
> > compare closely 0x7f4ed0200000 against sorted smaps:
> >
> >     7f4921400000-7f4b21400000 rw-s 252600000 00:11 151111
> >       /anon_hugepage (deleted)
> >     7f4b21400000-7f4d21400000 rw-s 452600000 00:11 151111
> >       /anon_hugepage (deleted)
> >     7f4d21400000-7f4f21400000 rw-s 652600000 00:11 151111
> >       /anon_hugepage (deleted)
> >     7f4f21400000-7f4f4bc00000 rw-s 852600000 00:11 151111
> >       /anon_hugepage (deleted)
> >     7f4f4bc00000-7f4f4c000000 rw-s 87ce00000 00:11 151111
> >       /anon_hugepage (deleted)
> >
> >     it's NOT there at all (there's no mmap region starting with
> > 0x"7f4e" ). It looks like because pg_buffercache_numa_pages() is not
> > aware of this new mmaped() regions and instead does simple loop over
> > all NBuffers with "for (char *ptr = startptr; ptr < endptr; ptr +=
> > os_page_size)"?
> >
>
> I'm confused. How could that mapping be missing? Was this with huge
> pages / how many did you reserve on the nodes?


OK, I made an error: I partially got it right (it crashes reliably)
and partially misled you, apologies, let me explain. There were two
questions for me:
a) why we make a single mmap() and yet after numa_tonode_memory() we
get plenty of mappings
b) why we get SIGBUS (I thought the regions are not contiguous, but
after triple-checking they are)

ad a) My testing shows this happens with HP, as stated initially ("all
of this was on 4s/4 NUMA nodes with HP on"). That's what the code
does: you get a single mmap() (resulting in a single entry in smaps),
but after numa_tonode_memory() there are many of them. Even on a laptop:

System has 1 NUMA nodes (0 to 0).
Attempting to allocate 8.000000 MB of HugeTLB memory...
Successfully allocated HugeTLB memory at 0x755828800000, smaps before:
755828800000-755829000000 rw-s 00000000 00:11 259808
  /anon_hugepage (deleted)
Pinning first part (from 0x755828800000) to NUMA node 0...
smaps after:
755828800000-755828c00000 rw-s 00000000 00:11 259808
  /anon_hugepage (deleted)
755828c00000-755829000000 rw-s 00400000 00:11 259808
  /anon_hugepage (deleted)
Pinning second part (from 0x755828c00000) to NUMA node 0...
smaps after:
755828800000-755828c00000 rw-s 00000000 00:11 259808
  /anon_hugepage (deleted)
755828c00000-755829000000 rw-s 00400000 00:11 259808
  /anon_hugepage (deleted)

It gets even funnier: below I again have 8MB with HP=on, but issue
numa_tonode_memory() twice, each time for len 2MB to node 0 (first for
ptr, then a second time at the halfway point of the allocation, ptr + 4MB):

System has 1 NUMA nodes (0 to 0).
Attempting to allocate 8.000000 MB of HugeTLB memory...
Successfully allocated HugeTLB memory at 0x7302dda00000, smaps before:
7302dda00000-7302de200000 rw-s 00000000 00:11 284859
  /anon_hugepage (deleted)
Pinning first part (from 0x7302dda00000) to NUMA node 0...
smaps after:
7302dda00000-7302ddc00000 rw-s 00000000 00:11 284859
  /anon_hugepage (deleted)
7302ddc00000-7302de200000 rw-s 00200000 00:11 284859
  /anon_hugepage (deleted)
Pinning second part (from 0x7302dde00000) to NUMA node 0...
smaps after:
7302dda00000-7302ddc00000 rw-s 00000000 00:11 284859
  /anon_hugepage (deleted)
7302ddc00000-7302dde00000 rw-s 00200000 00:11 284859
  /anon_hugepage (deleted)
7302dde00000-7302de000000 rw-s 00400000 00:11 284859
  /anon_hugepage (deleted)
7302de000000-7302de200000 rw-s 00600000 00:11 284859
  /anon_hugepage (deleted)

Why 4 mappings instead of 1? Because some mappings are now "default",
as their policy was not altered:

$ grep huge /proc/$(pidof testnumammapsplit)/numa_maps
7302dda00000 bind:0 file=/anon_hugepage\040(deleted) huge
7302ddc00000 default file=/anon_hugepage\040(deleted) huge
7302dde00000 bind:0 file=/anon_hugepage\040(deleted) huge
7302de000000 default file=/anon_hugepage\040(deleted) huge
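For completeness, the standalone test I used here is roughly the
following (a sketch of my testnumammapsplit, assuming 2MB huge pages
are reserved and node 0 exists; build with -lnuma, pause at each step
and look at smaps/numa_maps of the pid):

#define _GNU_SOURCE
#include <numa.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ      (8 * 1024 * 1024UL)     /* 8MB of HugeTLB memory */
#define HPSZ    (2 * 1024 * 1024UL)     /* assumed huge page size */

int
main(void)
{
    char   *ptr;

    printf("System has %d NUMA nodes.\n", numa_num_configured_nodes());

    ptr = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (ptr == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }
    printf("allocated %zu bytes at %p (pid %d), check smaps\n",
           SZ, (void *) ptr, (int) getpid());
    getchar();

    /* first bind: 2MB at the start -> smaps shows the VMA split in two */
    numa_tonode_memory(ptr, HPSZ, 0);
    printf("pinned first part, check smaps again\n");
    getchar();

    /* second bind: 2MB at the halfway point -> four VMAs in smaps,
     * two of them still with the "default" policy in numa_maps */
    numa_tonode_memory(ptr + SZ / 2, HPSZ, 0);
    printf("pinned second part, check smaps/numa_maps again\n");
    getchar();

    return 0;
}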

Back to the original error: these are consecutive regions, and the
earlier problem was

error: 0x7f4ed0200000 <error: Cannot access memory at address 0x7f4ed0200000>
start: 0x7f4921400000
end:   0x7f4f4c000000

so it fits into that range (that was my mistake earlier: I used just
grep and didn't check whether they are really within that range), but...

> Maybe there were not enough huge pages left on one of the nodes?

ad b) right, something like that. I've investigated that SIGBUS
(this is going to be long):

With shared_buffers=32GB and huge_pages=17715 (+1 over what postgres -C
shared_memory_size_in_huge_pages returns), right after startup, with
nothing touched yet:

Program received signal SIGBUS, Bus error.
pg_buffercache_numa_pages (fcinfo=0x5572038790b8) at
../contrib/pg_buffercache/pg_buffercache_pages.c:386
386                                     pg_numa_touch_mem_if_required(ptr);
(gdb) where
#0  pg_buffercache_numa_pages (fcinfo=0x5572038790b8) at
../contrib/pg_buffercache/pg_buffercache_pages.c:386
#1  0x00005571f54ddb7d in ExecMakeTableFunctionResult
(setexpr=0x557203870d40, econtext=0x557203870ba8,
argContext=<optimized out>, expectedDesc=0x557203870f80,
randomAccess=false) at ../src/backend/executor/execSRF.c:234
[..]
(gdb) print ptr
$1 = 0x7f6cf8400000 <error: Cannot access memory at address 0x7f6cf8400000>
(gdb)


Then it shows (?!) no free hugepages on one of the nodes (while gdb
is hanging there, preventing autorestart):

root@swiatowid:/sys/devices/system/node# grep -r -i HugePages_Free node*/meminfo
node0/meminfo:Node 0 HugePages_Free:    299
node1/meminfo:Node 1 HugePages_Free:    299
node2/meminfo:Node 2 HugePages_Free:    299
node3/meminfo:Node 3 HugePages_Free:      0

but the nodes are (nearly) equal in terms of total hugepages:
node0/meminfo:Node 0 HugePages_Total:  4429
node1/meminfo:Node 1 HugePages_Total:  4429
node2/meminfo:Node 2 HugePages_Total:  4429
node3/meminfo:Node 3 HugePages_Total:  4428

smaps shows that this address (7f6cf8400000) falls within this mapping:
7f6b49c00000-7f6d49c00000 rw-s 652600000 00:11 86064
  /anon_hugepage (deleted)

numa_maps for this region shows the mapping is bound to node 3 (notice
that N3 + bind:3 matches the lack of free memory in Node 3 HugePages_Free):
7f6b49c00000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=3444
N3=3444 kernelpagesize_kB=2048

The surrounding area looks like this:

7f6549c00000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=4096
N0=4096 kernelpagesize_kB=2048
7f6749c00000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=4096
N1=4096 kernelpagesize_kB=2048
7f6949c00000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=4096
N2=4096 kernelpagesize_kB=2048
7f6b49c00000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=3444
N3=3444 kernelpagesize_kB=2048 <-- this is the one
7f6d49c00000 default file=/anon_hugepage\040(deleted) huge dirty=107
mapmax=6 N3=107 kernelpagesize_kB=2048

Notice it's just N3=3444, while the others are much larger. So
something else was using that hugepage memory on N3:

# grep kernelpagesize_kB=2048 /proc/1679/numa_maps | grep -Po
N[0-4]=[0-9]+ | sort
N0=2
N0=4096
N1=2
N1=4096
N2=2
N2=4096
N3=1
N3=1
N3=1
N3=1
N3=107
N3=13
N3=3
N3=3444

So per the above it's not there (at least not as 2MB HP). But the
number of mappings is wild! (The node where it is failing has plenty
of memory overall and no hugepage memory left, but something like 40k+
small mappings!)

# grep -Po 'N[0-3]=' /proc/1679/numa_maps | sort | uniq -c
     17 N0=
     10 N1=
      3 N2=
  40434 N3=

Most of them are `anon_inode:[io_uring]` (and I had
max_connections=10k). You may ask why, despite Andres' optimization
for reducing the number of segments for io_uring, it's not working for
me? Well, I've just noticed a way-too-silent failure to activate it
(although I'm on kernel 6.14.x):
    2025-11-06 13:34:49.128 CET [1658] DEBUG:  can't use combined
memory mapping for io_uring, kernel or liburing too old
Apparently I don't have io_uring_queue_init_mem()/HAVE_LIBURING_QUEUE_INIT_MEM
on liburing-2.3 (Debian's default). See [1] for more info (sadly, the
fix is not committed yet).

Next try, now with io_method = worker and right before start:

root@swiatowid:/sys/devices/system/node# grep -r -i HugePages_Total
node*/meminfo
node0/meminfo:Node 0 HugePages_Total:  4429
node1/meminfo:Node 1 HugePages_Total:  4429
node2/meminfo:Node 2 HugePages_Total:  4429
node3/meminfo:Node 3 HugePages_Total:  4428
and HugePages_Free was at 100% (with PostgreSQL down). After start
(but without doing anything else):
root@swiatowid:/sys/devices/system/node# grep -r -i HugePages_Free node*/meminfo
node0/meminfo:Node 0 HugePages_Free:   4393
node1/meminfo:Node 1 HugePages_Free:   4395
node2/meminfo:Node 2 HugePages_Free:   4395
node3/meminfo:Node 3 HugePages_Free:   3446

So sadly the picture is the same (something stole my HP on N3, and
it's PostgreSQL on its own). After some time investigating that ("who
stole my hugepages across the whole OS?"), I just added MAP_POPULATE
to the mix of PG_MMAP_FLAGS and got this after start:

root@swiatowid:/sys/devices/system/node# grep -r -i HugePages_Free node*/meminfo
node0/meminfo:Node 0 HugePages_Free:      0
node1/meminfo:Node 1 HugePages_Free:      0
node2/meminfo:Node 2 HugePages_Free:      0
node3/meminfo:Node 3 HugePages_Free:      1

and then the SELECT from pg_buffercache_numa works fine(!).
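For the record, the MAP_POPULATE experiment was just a local hack along
these lines in src/include/portability/mem.h (quoted from memory, the
exact baseline definition may differ), pre-faulting the whole mapping
at mmap() time so the per-node hugepage reservation happens up front
rather than lazily:

/* local experiment only, not a proposed change */
#define PG_MMAP_FLAGS   (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE|MAP_POPULATE)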

Other ways I have found to eliminate that SIGBUS:
a. Throw many more HugePages at it (so that the node does not run out
of HugePages_Free), but that's not a real option.
b. Then I remembered that I could be running a custom kernel with
experimental CONFIG_READ_ONLY_THP_FOR_FS (to reduce iTLB misses
transparently with a specially linked PG; I will double-check the
exact setup later), so I've thrown 'never' into
/sys/kernel/mm/transparent_hugepage/enabled and defrag too (yes,
disabled THP), and with that -- drumroll -- that SELECT works. The
very same PG setup after startup (where earlier it would crash) now
looks like this after the SELECT:

root@swiatowid:/sys/devices/system/node# grep -r -i HugePages_Free node*/meminfo
node0/meminfo:Node 0 HugePages_Free:     83
node1/meminfo:Node 1 HugePages_Free:      0
node2/meminfo:Node 2 HugePages_Free:     81
node3/meminfo:Node 3 HugePages_Free:     82

Hope that helps a little. To me it sounds like THP somehow used that
memory which we also wanted to use. With numa_interleave_ptr() that
wouldn't be a problem, because it would probably use something else
that is available, but not here, as we indicated the exact node.

> > 0006e:
> >     I'm seeking confirmation, but is this the issue we have discussed
> > on PgconfEU related to lack of detection of Mems_allowed, right? e.g.
> >     $ numactl --membind="0,1" --cpunodebind="0,1"
> > /usr/pgsql19/bin/pg_ctl -D /path start
> >     still shows 4 NUMA nodes used. Current patches use
> > numa_num_configured_nodes(), but it says 'This count includes any
> > nodes that are currently DISABLED'. So I was wondering if I could help
> > by migrating towards numa_num_task_nodes() / numa_get_mems_allowed()?
> > It's the same as You wrote earlier to Alexy?
> >
>
> If "mems_allowed" refers to nodes allowing memory allocation, then yes,
> this would be one way to get into that issue. Oh, is this what happened
> in 0006d?

OK, thanks for the confirmation. No, 0006d was about a normal numactl
run, without --membind.
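And just to spell out what I meant by the mems_allowed migration in
0006e, something along these lines (hypothetical helper, not from any
of the patches):

#include <numa.h>

/* hypothetical helper: count only the nodes this task may allocate
 * memory on, so numactl --membind=... is respected, instead of
 * numa_num_configured_nodes() which counts all configured nodes */
static int
pg_numa_usable_nodes(void)
{
    struct bitmask *allowed = numa_get_mems_allowed();
    int             nnodes = (int) numa_bitmask_weight(allowed);

    numa_bitmask_free(allowed);

    return nnodes;
}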

> I did get a couple of "operation canceled" failures, but only on fairly
> old kernel versions (6.1 which came as default with the VM).

OK, I'll try to see that later too.

btw, a quick question regarding the partitioned clock-sweep that I have
been thinking about: does this open a road towards multiple bgwriters?
(outside of this $thread/v1/PoC)

-J.

[1] - https://www.postgresql.org/message-id/CAKZiRmzxj6Lt1w2ffDoUmN533TgyDeYVULEH1PQFLRyBJSFP6w%40mail.gmail.com


