Thread: Re: pgsql: Introduce pg_shmem_allocations_numa view

Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Christoph Berg
Date:
Re: Tomas Vondra
> Introduce pg_shmem_allocations_numa view

This is acting up on Debian's 32-bit architectures, namely i386, armel
and armhf:

--- /build/reproducible-path/postgresql-18-18~beta1+20250612/src/test/regress/expected/numa.out    2025-06-12 12:21:21.000000000 +0000
+++ /build/reproducible-path/postgresql-18-18~beta1+20250612/build/src/test/regress/results/numa.out    2025-06-12 20:20:33.124292694 +0000
@@ -6,8 +6,4 @@
 -- switch to superuser
 \c -
 SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
- ok
-----
- t
-(1 row)
-
+ERROR:  invalid NUMA node id outside of allowed range [0, 0]: -14

The diff is the same on all architectures.

-14 seems to be -EFAULT, and move_pages(2) says:

   Page states in the status array
       The following values can be returned in each element of the status array.

       -EFAULT
              This is a zero page or the memory area is not mapped by the process.
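
For the record, here is a minimal standalone sketch of this inquiry mode
(not PostgreSQL code; the buffer setup is illustrative). Passing nodes =
NULL makes move_pages(2) a pure status query, and per-page failures come
back as negative errno values in the status array, not as the return value:

/*
 * Minimal sketch of the move_pages(2) inquiry mode.
 * Build with: gcc -o numa_status numa_status.c -lnuma
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long  pagesize = sysconf(_SC_PAGESIZE);
    char *buf = malloc(4 * pagesize);
    void *pages[4];
    int   status[4];

    for (int i = 0; i < 4; i++)
        pages[i] = buf + i * pagesize;

    buf[0] = 1;                 /* fault in only the first page */

    /* nodes = NULL means "query status"; nothing is moved */
    if (move_pages(0, 4, pages, NULL, status, 0) < 0)
        perror("move_pages");

    for (int i = 0; i < 4; i++)
        printf("page %d: status %d\n", i, status[i]);
    /* expected: page 0 reports its node (e.g. 0); untouched pages
     * report a negative errno such as -ENOENT or -EFAULT */
    return 0;
}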

https://buildd.debian.org/status/logs.php?pkg=postgresql-18&ver=18%7Ebeta1%2B20250612-1

https://buildd.debian.org/status/fetch.php?pkg=postgresql-18&arch=armel&ver=18%7Ebeta1%2B20250612-1&stamp=1749759646&raw=0

Christoph



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Christoph Berg
Date:
Re: To Tomas Vondra
> This is acting up on Debian's 32-bit architectures, namely i386, armel
> and armhf:

... and x32 (x86_64 instruction set with 32-bit pointers).

>  SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
> +ERROR:  invalid NUMA node id outside of allowed range [0, 0]: -14
> 
> -14 seems to be -EFAULT, and move_pages(2) says:
>        -EFAULT
>               This is a zero page or the memory area is not mapped by the process.

I did some debugging on i386 and made it print the page numbers:

 SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+WARNING:  invalid NUMA node id outside of allowed range [0, 0]: -14 for page 35
+WARNING:  invalid NUMA node id outside of allowed range [0, 0]: -14 for page 36
...
+WARNING:  invalid NUMA node id outside of allowed range [0, 0]: -14 for page 32768
+WARNING:  invalid NUMA node id outside of allowed range [0, 0]: -14 for page 32769

So it works for the first few pages and then the rest is EFAULT.

I think the pg_numa_touch_mem_if_required() hack might not be enough
to force the pages to be allocated. Changing that to a memcpy() didn't
help. Is there some optimization where zero pages aren't allocated
until they are written to?

Why do we try to force the pages to be allocated at all? This is just
a monitoring function; it should not change the actual system state.
Why not just skip any page where the status is < 0?

The attached patch removes that logic. Regression tests pass, but we
probably have to think about whether to report these negative numbers
as-is or perhaps convert them to NULL.
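
To illustrate, the defense could be as small as this (a sketch only; the
helper name is made up, and whether to skip, return -1, or emit SQL NULL
at the call site is exactly the open question):

/*
 * Hypothetical helper: move_pages(2) reports per-page errors as negative
 * errno values in the status array.  Map those to -1 here; the SQL-facing
 * caller could then emit NULL for the numa_node column instead of erroring.
 */
static inline int
numa_status_to_node(int status)
{
    return (status < 0) ? -1 : status;
}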

Christoph

Attachment

Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Christoph Berg
Date:
Re: Andres Freund
> > > Why do we try to force the pages to be allocated at all? This is just
> > > a monitoring function, it should not change the actual system state.
> 
> The problem is that the kernel function just gives bogus results for pages
> that *are* present in memory but that have only been touched in another
> process that has mapped the same range of memory.

Ok, so we leave the touching in, but still defend against negative
status values?

Christoph



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Andres Freund
Date:
Hi,

On 2025-06-23 17:59:24 +0200, Christoph Berg wrote:
> Re: To Andres Freund
> > Ok, so we leave the touching in, but still defend against negative
> > status values?
> 
> v2 attached.

How confident are we that this isn't actually because we passed a bogus
address to the kernel or such? With this patch, are *any* pages recognized as
valid on the machines that triggered the error?

I wonder if we ought to report the failures as a separate "numa node"
(e.g. NULL as node id) instead ...

Greetings,

Andres Freund



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Tomas Vondra
Date:
On 6/23/25 21:57, Christoph Berg wrote:
> Re: Andres Freund
>> How confident are we that this isn't actually because we passed a bogus
>> address to the kernel or such? With this patch, are *any* pages recognized as
>> valid on the machines that triggered the error?
> 
> See upthread - the first 35 pages were ok, then a lot of -14.
> 
>> I wonder if we ought to report the failures as a separate "numa node"
>> (e.g. NULL as node id) instead ...
> 
> Did that now, using N+1 (== 1 here) for errors in this Debian i386
> environment (chroot on an amd64 host):
> 
> select * from pg_shmem_allocations_numa \crosstabview 
> 
>                       name                      │    0     │    1
> ────────────────────────────────────────────────┼──────────┼──────────
>  multixact_offset                               │    69632 │    65536
>  subtransaction                                 │   139264 │   131072
>  notify                                         │   139264 │        0
>  Shared Memory Stats                            │   188416 │   131072
>  serializable                                   │   188416 │    86016
>  PROCLOCK hash                                  │     4096 │        0
>  FinishedSerializableTransactions               │     4096 │        0
>  XLOG Ctl                                       │  2117632 │  2097152
>  Shared MultiXact State                         │     4096 │        0
>  Proc Header                                    │     4096 │        0
>  Archiver Data                                  │     4096 │        0
> .... more 0s in the last column ...
>  AioHandleData                                  │  1429504 │        0
>  Buffer Blocks                                  │ 67117056 │ 67108864
>  Buffer IO Condition Variables                  │   266240 │        0
>  Proc Array                                     │     4096 │        0
> .... more 0s
> (73 rows)
> 
> 
> There is something fishy with pg_buffercache. If I restart PG, I'm
> getting "Bad address" (errno 14), this time as return value of
> move_pages().
> 
> postgres =# select * from pg_buffercache_numa;
> DEBUG:  00000: NUMA: NBuffers=16384 os_page_count=32768 os_page_size=4096
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:383
> 2025-06-23 19:41:41.315 UTC [1331894] ERROR:  failed NUMA pages inquiry: Bad address
> 2025-06-23 19:41:41.315 UTC [1331894] STATEMENT:  select * from pg_buffercache_numa;
> ERROR:  XX000: failed NUMA pages inquiry: Bad address
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:394
> 
> Repeated calls are fine.
> 

Huh. So it's only the first call that does this?

Can you maybe print the addresses passed to pg_numa_query_pages? I
wonder if there's some bug in how we fill that array. Not sure why it
would happen only on 32-bit systems, though.

I'll create a 32-bit VM so that I can try reproducing this.


regards

-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Christoph Berg
Date:
Re: Tomas Vondra
> Didn't you say the first ~35 addresses succeed? What about the
> addresses after that?

That was pg_shmem_allocations_numa. The pg_numa_query_pages() in there
works (does not return -1), but then some of the status[] values are
-14.

When pg_buffercache_numa fails, pg_numa_query_pages() itself
returns -14.

The printed os_page_ptrs[] contents are the same for the failing and
non-failing calls, so the problem is probably elsewhere.

        /* Fill pointers for all the memory pages. */
        idx = 0;
        for (char *ptr = startptr; ptr < endptr; ptr += os_page_size)
        {
+           if (idx < 50)
+               elog(DEBUG1, "os_page_ptrs idx %d = %p", idx, ptr);
            os_page_ptrs[idx++] = ptr;


20:47 myon@postgres =# select * from pg_buffercache_numa;
DEBUG:  00000: os_page_ptrs idx 0 = 0xebc44000
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG:  00000: os_page_ptrs idx 1 = 0xebc45000
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
... (idx 2 through 48 elided; consecutive 0x1000 steps) ...
DEBUG:  00000: os_page_ptrs idx 49 = 0xebc75000
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG:  00000: NUMA: NBuffers=16384 os_page_count=32768 os_page_size=4096
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:385
2025-06-23 20:47:41.827 UTC [1368080] ERROR:  failed NUMA pages inquiry: Bad address
2025-06-23 20:47:41.827 UTC [1368080] STATEMENT:  select * from pg_buffercache_numa;
ERROR:  XX000: failed NUMA pages inquiry: Bad address
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:396
Time: 92.757 ms

20:47 myon@postgres =# select * from pg_buffercache_numa;
DEBUG:  00000: os_page_ptrs idx 0 = 0xebc44000
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
... (idx 1 through 49 elided; identical to the previous call) ...
DEBUG:  00000: NUMA: NBuffers=16384 os_page_count=32768 os_page_size=4096
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:385
DEBUG:  00000: NUMA: page-faulting the buffercache for proper NUMA readouts
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:444
Time: 24.547 ms
20:47 myon@postgres =#


Christoph



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Tomas Vondra
Date:
On 6/24/25 10:24, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Jun 24, 2025 at 03:43:19AM +0200, Tomas Vondra wrote:
>> On 6/23/25 23:47, Tomas Vondra wrote:
>>> ...
>>>
>>> Or maybe the 32-bit chroot on a 64-bit host matters and confuses some
>>> calculation.
>>>
>>
>> I think it's likely something like this.
> 
> I think the same.
> 
>> I noticed that if I modify
>> pg_buffercache_numa_pages() to query the addresses one by one, it works.
>> And when I increase the number, it stops working somewhere between 16k
>> and 17k items.
> 
> Yeah, same for me with pg_get_shmem_allocations_numa(). It works if
> pg_numa_query_pages() is done on chunks <= 16 pages but fails if done on more
> than 16 pages.
> 
> It's also confirmed by test_chunk_size.c attached:
> 
> $ gcc-11 -m32 -o test_chunk_size test_chunk_size.c
> $ ./test_chunk_size
>  1 pages: SUCCESS (0 errors)
>  2 pages: SUCCESS (0 errors)
>  3 pages: SUCCESS (0 errors)
>  4 pages: SUCCESS (0 errors)
>  5 pages: SUCCESS (0 errors)
>  6 pages: SUCCESS (0 errors)
>  7 pages: SUCCESS (0 errors)
>  8 pages: SUCCESS (0 errors)
>  9 pages: SUCCESS (0 errors)
> 10 pages: SUCCESS (0 errors)
> 11 pages: SUCCESS (0 errors)
> 12 pages: SUCCESS (0 errors)
> 13 pages: SUCCESS (0 errors)
> 14 pages: SUCCESS (0 errors)
> 15 pages: SUCCESS (0 errors)
> 16 pages: SUCCESS (0 errors)
> 17 pages: 1 errors
> Threshold: 17 pages
> 
> No error if -m32 is not used.
> 
>> It may be a coincidence, but I suspect it's related to the sizeof(void
>> *) being 8 in the kernel, but only 4 in the chroot. So the userspace
>> passes an array of 4-byte items, but kernel interprets that as 8-byte
>> items. That is, we call
>>
>> long move_pages(int pid, unsigned long count, void *pages[.count], const
>> int nodes[.count], int status[.count], int flags);
>>
>> Which (I assume) just passes the parameters to the kernel. And it'll
>> interpret them per kernel pointer size.
>>
> 
> I also suspect something in this area...
> 
>> If this is what's happening, I'm not sure what to do about it ...
> 
> We could work in chunks (of 16?) on 32-bit, but that would probably produce
> a performance degradation (we mention it in the doc, though). Also, would 16
> always be a correct chunk size?

I don't see how this would solve anything?

AFAICS the problem is that the two sides disagree about how large the
array elements are, and so interpret them differently. Using a smaller
array won't solve that. The pg function would still allocate an array of
16 x 32-bit pointers, and the kernel would interpret it as 16 x 64-bit
pointers. And that means the kernel will (a) write into memory beyond
the allocated buffer - a clear buffer overflow, and (b) see bogus
pointers, because it'll fuse two adjacent 32-bit pointers into one.
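
Here is a standalone illustration of that fusing, on a little-endian
machine (not PostgreSQL code):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* what a 32-bit user space hands over ... */
    uint32_t ptrs32[4] = {0xebc44000, 0xebc45000, 0xebc46000, 0xebc47000};
    uint64_t as64[2];

    /* ... and what a consumer expecting 64-bit elements reads back */
    memcpy(as64, ptrs32, sizeof(as64));
    printf("entry 0: 0x%llx\n", (unsigned long long) as64[0]);
    /* prints 0xebc45000ebc44000 on little-endian: a bogus address; the
     * reader also walks twice as far, i.e. beyond the user's buffer */
    return 0;
}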

I don't see how using a smaller array makes this correct. That it works is
more a matter of luck, and also a consequence of still allocating the
whole array, so there's no overflow (at least I kept that, not sure how
you did the chunks).

If I fix the code to make the entries 64-bit (by treating the pointers
as int64), it suddenly starts working - no bad addresses, etc. Well,
almost, because I get this

 bufferid | os_page_num | numa_node
----------+-------------+-----------
        1 |           0 |         0
        1 |           1 |       -14
        2 |           2 |         0
        2 |           3 |       -14
        3 |           4 |         0
        3 |           5 |       -14
        4 |           6 |         0
        4 |           7 |       -14
        ...

The -14 status is interesting, because that's the same value Christoph
reported as the other issue (in pg_shmem_allocations_numa).

I did an experiment and changed os_page_status to be declared as int64,
not just int. And interestingly, that produced this:

 bufferid | os_page_num | numa_node
----------+-------------+-----------
        1 |           0 |         0
        1 |           1 |         0
        2 |           2 |         0
        2 |           3 |         0
        3 |           4 |         0
        3 |           5 |         0
        4 |           6 |         0
        4 |           7 |         0
        ...

But I don't see how this makes any sense, because "int" should be 4B in
both cases (in 64-bit kernel and 32-bit chroot).

FWIW I realized this applies to "official" systems with 32-bit user
space on 64-bit kernels, like e.g. rpi5 with RPi OS 32-bit. (Fun fact,
rpi5 has 8 NUMA nodes, with all CPUs attached to all NUMA nodes.)

I'm starting to think we need to disable NUMA for setups like this,
mixing 64-bit kernels with 32-bit chroot. Is there a good way to detect
those, so that we can error-out?
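
One possible detection sketch (just an idea, not a vetted approach):
compare the compile-time pointer size with the kernel's uname(2) machine
string. Note that a linux32/setarch personality would report a 32-bit
machine name and mask exactly the case we care about:

#include <stdbool.h>
#include <string.h>
#include <sys/utsname.h>

/* heuristic: 32-bit build, but the kernel identifies as a 64-bit arch */
static bool
running_32bit_on_64bit_kernel(void)
{
    struct utsname u;

    if (sizeof(void *) != 4 || uname(&u) != 0)
        return false;
    return strstr(u.machine, "64") != NULL ||
           strcmp(u.machine, "s390x") == 0;
}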

FWIW this doesn't explain the strange valgrind issue, though.


-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Tomas Vondra
Date:
On 6/24/25 13:10, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Jun 24, 2025 at 11:20:15AM +0200, Tomas Vondra wrote:
>> On 6/24/25 10:24, Bertrand Drouvot wrote:
>>> Yeah, same for me with pg_get_shmem_allocations_numa(). It works if
>>> pg_numa_query_pages() is done on chunks <= 16 pages but fails if done on more
>>> than 16 pages.
>>>
>>> It's also confirmed by test_chunk_size.c attached:
>>>
>>> $ gcc-11 -m32 -o test_chunk_size test_chunk_size.c
>>> $ ./test_chunk_size
>>>  1 pages: SUCCESS (0 errors)
>>>  2 pages: SUCCESS (0 errors)
>>>  3 pages: SUCCESS (0 errors)
>>>  4 pages: SUCCESS (0 errors)
>>>  5 pages: SUCCESS (0 errors)
>>>  6 pages: SUCCESS (0 errors)
>>>  7 pages: SUCCESS (0 errors)
>>>  8 pages: SUCCESS (0 errors)
>>>  9 pages: SUCCESS (0 errors)
>>> 10 pages: SUCCESS (0 errors)
>>> 11 pages: SUCCESS (0 errors)
>>> 12 pages: SUCCESS (0 errors)
>>> 13 pages: SUCCESS (0 errors)
>>> 14 pages: SUCCESS (0 errors)
>>> 15 pages: SUCCESS (0 errors)
>>> 16 pages: SUCCESS (0 errors)
>>> 17 pages: 1 errors
>>> Threshold: 17 pages
>>>
>>> No error if -m32 is not used.
>>>
>>> We could work in chunks (of 16?) on 32-bit, but that would probably produce
>>> a performance degradation (we mention it in the doc, though). Also, would 16
>>> always be a correct chunk size?
>>
>> I don't see how this would solve anything?
>>
>> AFAICS the problem is that the two sides disagree about how large the
>> array elements are, and so interpret them differently.
> 
>> I don't see how using a smaller array makes this correct. That it works is
>> more a matter of luck,
> 
> Not sure it's luck, maybe the wrong pointer arithmetic has no effect if the
> batch size is <= 16.
> 
> So we have kernel_move_pages() -> do_pages_stat() (because nodes is NULL here
> for us, as we call "numa_move_pages(pid, count, pages, NULL, status, 0);").
> 
> So, if we look at do_pages_stat() ([1]), we can see that it uses a hardcoded
> "#define DO_PAGES_STAT_CHUNK_NR 16UL" and that this pointer arithmetic:
> 
> "
>         pages += chunk_nr;
>         status += chunk_nr;
> "
> 
> is done but has no effect since nr_pages will exit the loop if we use a batch
> size <= 16.
> 
> So if this pointer arithmetic is not correct (it seems that it should advance
> by 16 * sizeof(compat_uptr_t) instead), then it has no effect as long as the
> batch size is <= 16.
> 
> Does test_chunk_size also fails at 17 for you?

Yes, it fails for me at 17 too. So you're saying the access within each
chunk of 16 elements is OK, but that maybe advancing to the next chunk
is not quite right? In which case limiting the access to 16 entries
might be a workaround.

In any case, this sounds like a kernel bug, right? I don't have much
experience with the kernel code, so don't want to rely too much on my
interpretation of it.


regards

-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Tomas Vondra
Date:
On 6/24/25 13:10, Andres Freund wrote:
> Hi,
> 
> On 2025-06-24 03:43:19 +0200, Tomas Vondra wrote:
>> FWIW while looking into this, I tried running this under valgrind (on a
>> regular 64-bit system, not in the chroot), and I get this report:
>>
>> ==65065== Invalid read of size 8
>> ==65065==    at 0x113B0EBE: pg_buffercache_numa_pages
>> (pg_buffercache_pages.c:380)
>> ==65065==    by 0x6B539D: ExecMakeTableFunctionResult (execSRF.c:234)
>> ==65065==    by 0x6CEB7E: FunctionNext (nodeFunctionscan.c:94)
>> ==65065==    by 0x6B6ACA: ExecScanFetch (execScan.h:126)
>> ==65065==    by 0x6B6B31: ExecScanExtended (execScan.h:170)
>> ==65065==    by 0x6B6C9D: ExecScan (execScan.c:59)
>> ==65065==    by 0x6CEF0F: ExecFunctionScan (nodeFunctionscan.c:269)
>> ==65065==    by 0x6B29FA: ExecProcNodeFirst (execProcnode.c:469)
>> ==65065==    by 0x6A6F56: ExecProcNode (executor.h:313)
>> ==65065==    by 0x6A9533: ExecutePlan (execMain.c:1679)
>> ==65065==    by 0x6A7422: standard_ExecutorRun (execMain.c:367)
>> ==65065==    by 0x6A7330: ExecutorRun (execMain.c:304)
>> ==65065==    by 0x934EF0: PortalRunSelect (pquery.c:921)
>> ==65065==    by 0x934BD8: PortalRun (pquery.c:765)
>> ==65065==    by 0x92E4CD: exec_simple_query (postgres.c:1273)
>> ==65065==    by 0x93301E: PostgresMain (postgres.c:4766)
>> ==65065==    by 0x92A88B: BackendMain (backend_startup.c:124)
>> ==65065==    by 0x85A7C7: postmaster_child_launch (launch_backend.c:290)
>> ==65065==    by 0x860111: BackendStartup (postmaster.c:3580)
>> ==65065==    by 0x85DE6F: ServerLoop (postmaster.c:1702)
>> ==65065==  Address 0x7b6c000 is in a rw- anonymous segment
>>
>>
>> This fails here (on the pg_numa_touch_mem_if_required call):
>>
>>     for (char *ptr = startptr; ptr < endptr; ptr += os_page_size)
>>     {
>>         os_page_ptrs[idx++] = ptr;
>>
>>         /* Only need to touch memory once per backend process */
>>         if (firstNumaTouch)
>>             pg_numa_touch_mem_if_required(touch, ptr);
>>     }
> 
> That's because we mark unpinned pages as inaccessible / mark them as
> accessible when pinning. See logic related to that in PinBuffer():
> 
>                 /*
>                  * Assume that we acquired a buffer pin for the purposes of
>                  * Valgrind buffer client checks (even in !result case) to
>                  * keep things simple.  Buffers that are unsafe to access are
>                  * not generally guaranteed to be marked undefined or
>                  * non-accessible in any case.
>                  */
> 
> 
>> The 0x7b6c000 is the very first pointer, and it's the only pointer that
>> triggers this warning.
> 
> I suspect that that's because valgrind combines different reports or such.
> 

Thanks. It probably is something like that, although I made sure not to
use any such options when running valgrind (so --error-limit=no). But
maybe there's something else hiding the reports.

I guess there are two ways to address this - make sure the buffers are
marked as accessible/defined, or add a valgrind suppression. I think the
suppression is the right approach here, otherwise we'd need to worry
about already pinned buffers etc. That seems not great; the functions
don't even care about buffers right now, they mostly work with memory
pages (especially pg_shmem_allocations_numa).

Barring objections, I'll fix it this way.


regards

-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Christoph Berg
Date:
Re: Tomas Vondra
> If it's a reliable fix, then I guess we can do it like this. But won't
> that be a performance penalty on everyone? Or does the system split the
> array into 16-element chunks anyway, so this makes no difference?

There's still the overhead of the syscall itself. But no idea how
costly it is to have this 16-step loop in user or kernel space.

We could claim that on 32-bit systems, shared_buffers would be smaller
anyway, so there the overhead isn't that big. And the step size should
be larger (if at all) on 64-bit.

> Anyway, maybe we should start by reporting this to the kernel people. Do
> you want me to do that, or shall one of you take care of that? I suppose
> that'd be better, as you already wrote a fix / know the code better.

Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2

Christoph



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Tomas Vondra
Date:
On 6/24/25 17:30, Christoph Berg wrote:
> Re: Tomas Vondra
>> If it's a reliable fix, then I guess we can do it like this. But won't
>> that be a performance penalty on everyone? Or does the system split the
>> array into 16-element chunks anyway, so this makes no difference?
> 
> There's still the overhead of the syscall itself. But no idea how
> costly it is to have this 16-step loop in user or kernel space.
> 
> We could claim that on 32-bit systems, shared_buffers would be smaller
> anyway, so there the overhead isn't that big. And the step size should
> be larger (if at all) on 64-bit.
> 
>> Anyway, maybe we should start by reporting this to the kernel people. Do
>> you want me to do that, or shall one of you take care of that? I suppose
>> that'd be better, as you already wrote a fix / know the code better.
> 
> Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2
> 

Thanks! Now we wait ...

Attached is a minor tweak of the valgrind suppression rules, to add the
two places touching the memory. I was hoping I could add a single rule
for pg_numa_touch_mem_if_required, but that does not work - it's a
macro, not a function. So I had to add one rule for each of the two
functions querying NUMA. That's a bit disappointing, because it means
it'll hide all other failures (of Memcheck:Addr8 type) in those
functions.

Perhaps it'd be better to turn pg_numa_touch_mem_if_required into a
proper (inlined) function, at least with USE_VALGRIND defined. Something
like the v2 patch - needs more testing to ensure the inlined function
doesn't break the touching or something silly like that.

regards

-- 
Tomas Vondra

Attachment

Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Bertrand Drouvot
Date:
Hi,

On Tue, Jun 24, 2025 at 04:41:33PM +0200, Christoph Berg wrote:
> Re: Bertrand Drouvot
> > Yes, I think compat_uptr_t usage is missing in do_pages_stat() (while it's used
> > in do_pages_move()).
> 
> I was also reading the kernel source around that place but you spotted
> the problem before me. This patch resolves the issue here:
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8cf0f9c9599..542c81ec3ed 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2444,7 +2444,13 @@ static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages,
>          if (copy_to_user(status, chunk_status, chunk_nr * sizeof(*status)))
>              break;
> 
> -        pages += chunk_nr;
> +        if (in_compat_syscall()) {
> +            compat_uptr_t __user *pages32 = (compat_uptr_t __user *)pages;
> +
> +            pages32 += chunk_nr;
> +            pages = (const void __user * __user *) pages32;
> +        } else
> +            pages += chunk_nr;
>          status += chunk_nr;
>          nr_pages -= chunk_nr;
>      }
> 

Thanks! Yeah, I had the same kind of patch idea in mind.

> > +               #define NUMA_QUERY_CHUNK_SIZE 16  /* has to be <= DO_PAGES_STAT_CHUNK_NR (do_pages_stat()) */
> > +
> > +               for (uint64 chunk_start = 0; chunk_start < shm_ent_page_count; chunk_start += NUMA_QUERY_CHUNK_SIZE) {
> 
> Perhaps optimize it to this:
> 
> #if SIZEOF_VOID_P == 4
> #define NUMA_QUERY_CHUNK_SIZE 16  /* has to be <= DO_PAGES_STAT_CHUNK_NR (do_pages_stat())*/
> #else
> #define NUMA_QUERY_CHUNK_SIZE 1024
> #endif
> 
> ... or some other bigger number.

I had in mind splitting the batch size on the PG side only for 32-bit;
what about the attached?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Bertrand Drouvot
Date:
Hi,

On Tue, Jun 24, 2025 at 05:30:02PM +0200, Christoph Berg wrote:
> Re: Tomas Vondra
> > If it's a reliable fix, then I guess we can do it like this. But won't
> > that be a performance penalty on everyone? Or does the system split the
> > array into 16-element chunks anyway, so this makes no difference?
> 
> There's still the overhead of the syscall itself. But no idea how
> costly it is to have this 16-step loop in user or kernel space.
> 
> We could claim that on 32-bit systems, shared_buffers would be smaller
> anyway, so there the overhead isn't that big. And the step size should
> be larger (if at all) on 64-bit.

Right, and we already mention in the doc that using those views is "very slow"
or "can take a noticeable amount of time".

> > Anyway, maybe we should start by reporting this to the kernel people. Do
> > you want me to do that, or shall one of you take care of that? I suppose
> > that'd be better, as you already wrote a fix / know the code better.
> 
> Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2

Thanks! I had in mind to look at how to report such a bug and provide a patch
but you beat me to it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Bertrand Drouvot
Date:
Hi,

On Tue, Jun 24, 2025 at 10:32:25PM +0200, Tomas Vondra wrote:
> 
> Attached is a minor tweak of the valgrind suppression rules,

Thanks!

> to add the
> two places touching the memory. I was hoping I could add a single rule
> for pg_numa_touch_mem_if_required, but that does not work - it's a
> macro, not a function. So I had to add one rule for each of the two
> functions querying NUMA. That's a bit disappointing, because it means
> it'll hide all other failures (of Memcheck:Addr8 type) in those functions.
>

Shouldn't we add 2 rules for Memcheck:Addr4 too?

> Perhaps it'd be better to turn pg_numa_touch_mem_if_required into a
> proper (inlined) function, at least with USE_VALGRIND defined.

Yeah, I think that's probably better, to reduce the scope to what we really want.

> Something
> like the v2 patch -

yeah, maybe:

- add a rule for Memcheck:Addr4?
- have the same parameter names for the macro and the function?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Jakub Wartak
Date:
On Tue, Jun 24, 2025 at 5:30 PM Christoph Berg <myon@debian.org> wrote:
>
> Re: Tomas Vondra
> > If it's a reliable fix, then I guess we can do it like this. But won't
> > that be a performance penalty on everyone? Or does the system split the
> > array into 16-element chunks anyway, so this makes no difference?
>
> There's still the overhead of the syscall itself. But no idea how
> costly it is to have this 16-step loop in user or kernel space.
>
> We could claim that on 32-bit systems, shared_buffers would be smaller
> anyway, so there the overhead isn't that big. And the step size should
> be larger (if at all) on 64-bit.
>
> > Anyway, maybe we should start by reporting this to the kernel people. Do
> > you want me to do that, or shall one of you take care of that? I suppose
> > that'd be better, as you already wrote a fix / know the code better.
>
> Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2
>

Hi all, I'm quite late to the party (just noticed the thread), but
here's some additional context: it technically didn't make any sense to
me to have NUMA on 32-bit due to the small amount of addressable memory
(after all, NUMA is about big iron, probably not even VMs), so in the
first versions of the patchset I excluded 32-bit (and back then for
some reason I couldn't even find libnuma for i386, but Andres pointed
out to me that it exists, so we re-added it, probably just to stay
consistent). The thread has kind of snowballed since then, but I still
believe that NUMA on 32-bit does not make a lot of sense.

Even assuming shm interleaving arrives in some future version,
allocations at small s_b sizes will usually fit a single NUMA node.

-J.



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Christoph Berg
Date:
Re: Bertrand Drouvot
> +/*
> + * Work around Linux kernel bug in 32-bit compat mode: do_pages_stat() has
> + * incorrect pointer arithmetic for more than DO_PAGES_STAT_CHUNK_NR pages.
> + */
> +#if SIZEOF_SIZE_T == 4

I was also missing it in my suggested patch draft, but this should
probably include #ifdef __linux__.


Re: Tomas Vondra
> +#ifdef USE_VALGRIND
> +
> +static inline void
> +pg_numa_touch_mem_if_required(uint64 tmp, char *ptr)

Stupid question: if this function gets properly inlined, why not
always use it, as there should be no performance difference vs. using a
macro?

Christoph



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Tomas Vondra
Date:
On 6/25/25 09:15, Jakub Wartak wrote:
> On Tue, Jun 24, 2025 at 5:30 PM Christoph Berg <myon@debian.org> wrote:
>>
>> Re: Tomas Vondra
>>> If it's a reliable fix, then I guess we can do it like this. But won't
>>> that be a performance penalty on everyone? Or does the system split the
>>> array into 16-element chunks anyway, so this makes no difference?
>>
>> There's still the overhead of the syscall itself. But no idea how
>> costly it is to have this 16-step loop in user or kernel space.
>>
>> We could claim that on 32-bit systems, shared_buffers would be smaller
>> anyway, so there the overhead isn't that big. And the step size should
>> be larger (if at all) on 64-bit.
>>
>>> Anyway, maybe we should start by reporting this to the kernel people. Do
>>> you want me to do that, or shall one of you take care of that? I suppose
>>> that'd be better, as you already wrote a fix / know the code better.
>>
>> Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2
>>
> 
> Hi all, I'm quite late to the party (just noticed the thread), but
> here's some additional context: it technically didn't make any sense to
> me to have NUMA on 32-bit due to the small amount of addressable memory
> (after all, NUMA is about big iron, probably not even VMs), so in the
> first versions of the patchset I excluded 32-bit (and back then for
> some reason I couldn't even find libnuma for i386, but Andres pointed
> out to me that it exists, so we re-added it, probably just to stay
> consistent). The thread has kind of snowballed since then, but I still
> believe that NUMA on 32-bit does not make a lot of sense.
> 
> Even assuming shm interleaving arrives in some future version,
> allocations at small s_b sizes will usually fit a single NUMA node.
> 

Not sure. I too thought NUMA doesn't matter very much on 32-bit systems,
exactly because those systems tend to use small amounts of memory. But
then while investigating this issue I realized even rpi5 has NUMA, in
fact it has a whopping 8 nodes:

debian@raspberry-32:~ $ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3
node 0 size: 981 MB
node 0 free: 882 MB
node 1 cpus: 0 1 2 3
node 1 size: 1007 MB
node 1 free: 936 MB
node 2 cpus: 0 1 2 3
node 2 size: 1007 MB
node 2 free: 936 MB
node 3 cpus: 0 1 2 3
node 3 size: 943 MB
node 3 free: 873 MB
node 4 cpus: 0 1 2 3
node 4 size: 1007 MB
node 4 free: 936 MB
node 5 cpus: 0 1 2 3
node 5 size: 1007 MB
node 5 free: 935 MB
node 6 cpus: 0 1 2 3
node 6 size: 1007 MB
node 6 free: 936 MB
node 7 cpus: 0 1 2 3
node 7 size: 990 MB
node 7 free: 918 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  10  10  10  10  10  10  10
  1:  10  10  10  10  10  10  10  10
  2:  10  10  10  10  10  10  10  10
  3:  10  10  10  10  10  10  10  10
  4:  10  10  10  10  10  10  10  10
  5:  10  10  10  10  10  10  10  10
  6:  10  10  10  10  10  10  10  10
  7:  10  10  10  10  10  10  10  10


This is with the 32-bit system (which AFAICS means 64-bit kernel and
32-bit user space). I'm not saying it's a particularly interesting NUMA
system, considering all the costs are 10, and it's not like it's
critical to get the best performance on rpi5. But it's NUMA, and maybe
there are some other (more practical) systems. I find it interesting
mostly for testing purposes.


regards

-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Tomas Vondra
Date:
On 6/26/25 08:00, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Jun 24, 2025 at 10:32:25PM +0200, Tomas Vondra wrote:
>> On 6/24/25 17:30, Christoph Berg wrote:
>>> Re: Tomas Vondra
>>>> If it's a reliable fix, then I guess we can do it like this. But won't
>>>> that be a performance penalty on everyone? Or does the system split the
>>>> array into 16-element chunks anyway, so this makes no difference?
>>>
>>> There's still the overhead of the syscall itself. But no idea how
>>> costly it is to have this 16-step loop in user or kernel space.
>>>
>>> We could claim that on 32-bit systems, shared_buffers would be smaller
>>> anyway, so there the overhead isn't that big. And the step size should
>>> be larger (if at all) on 64-bit.
>>>
>>>> Anyway, maybe we should start by reporting this to the kernel people. Do
>>>> you want me to do that, or shall one of you take care of that? I suppose
>>>> that'd be better, as you already wrote a fix / know the code better.
>>>
>>> Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2
>>>
>>
>> Thanks! Now we wait ...
> 
> It looks like that the bug is "confirmed" and that it will be fixed:
> https://marc.info/?l=linux-kernel&m=175088392116841&w=2
> 

Yay! I like how the first response is "you sent the patch wrong" ;-)


cheers

-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Tomas Vondra
Date:
Here are three small patches that should handle the issue.


0001 - Adds the batching into pg_numa_query_pages, so that the callers
don't need to do anything.

The batching doesn't seem to cause any performance regression. 32-bit
systems can't use that much memory anyway, and on 64-bit systems the
batch is sufficiently large (1024).


0002 - Silences the valgrind warnings about the memory touching. It
replaces the macro with a static inline function, and adds suppressions
for both 32-bit and 64-bit. The 32-bit one may be a bit pointless,
because on my rpi5 valgrind produces a bunch of other failures anyway.
But it doesn't hurt.

The function now looks like this:

  static inline void
  pg_numa_touch_mem_if_required(void *ptr)
  {
      volatile uint64 touch pg_attribute_unused();
      touch = *(volatile uint64 *) ptr;
  }

I did a lot of testing on multiple systems to check that replacing the
macro with a static inline function still works - and it seems it does.
But if someone thinks the function won't work, I'd like to know.


0003 - While working on these patches, it occurred to me we could/should
add CHECK_FOR_INTERRUPTS() into the batch loop. This querying can take
quite a bit of time, so letting people interrupt it seems reasonable.
It wasn't possible with just one call into the kernel, but with the
batching we can add a CFI.
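
For reference, a rough sketch of how 0001 and 0003 could fit together
(based only on the description above, not the committed code; Min and
CHECK_FOR_INTERRUPTS are the usual PostgreSQL macros, and the runtime
sizeof test sidesteps the preprocessor's inability to evaluate sizeof):

/* sketch only - assumes pg_numa.c's usual includes plus miscadmin.h */
#define NUMA_QUERY_CHUNK_SIZE (sizeof(void *) == 4 ? 16 : 1024)

int
pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
{
    unsigned long next = 0;
    long          ret = 0;

    while (next < count && ret == 0)
    {
        unsigned long chunk = Min(count - next, NUMA_QUERY_CHUNK_SIZE);

        /* 0003: keep long inquiries cancellable */
        CHECK_FOR_INTERRUPTS();

        ret = numa_move_pages(pid, chunk, &pages[next], NULL,
                              &status[next], 0);
        next += chunk;
    }
    return (int) ret;
}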


Please, take a look.

regards

-- 
Tomas Vondra

Attachment

Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Bertrand Drouvot
Date:
Hi,

On Mon, Jun 30, 2025 at 08:56:43PM +0200, Tomas Vondra wrote:
> In particular it now uses "chunking" instead of "batching". I believe
> batching is "combining multiple requests into a single one", but we're
> doing exactly the opposite - splitting a large request into smaller
> ones. Which is what "chunking" does.

I do agree that "chuncking" is more appropriate here.

> I plan to push this tomorrow morning.

Thanks!

LGTM, just 2 nits about the commit messages:

For 0001:

Is it worth adding a link to the kernel bug report, or mentioning that
it can be found in the discussion?

For 0003:

"
But with the chunking, introduced to work around the do_pages_stat()
bug"

Do you have in mind quoting the hex commit object name that will be
generated by 0001?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Tomas Vondra
Date:
On 7/1/25 06:06, Bertrand Drouvot wrote:
> Hi,
> 
> On Mon, Jun 30, 2025 at 08:56:43PM +0200, Tomas Vondra wrote:
>> In particular it now uses "chunking" instead of "batching". I believe
>> batching is "combining multiple requests into a single one", but we're
>> doing exactly the opposite - splitting a large request into smaller
>> ones. Which is what "chunking" does.
> 
> I do agree that "chuncking" is more appropriate here.
> 
>> I plan to push this tomorrow morning.
> 
> Thanks!
> 
> LGTM, just 2 nits about the commit messages:
> 
> For 0001:
> 
> Is it worth adding a link to the kernel bug report, or mentioning that
> it can be found in the discussion?
> 
> For 0003:
> 
> "
> But with the chunking, introduced to work around the do_pages_stat()
> bug"
> 
> Do you have in mind quoting the hex commit object name that will be
> generated by 0001?
> 

Thanks! Pushed, with both adjustments (link to kernel thread, adding the
commit hash).

But damn it, right after pushing I noticed two typos in the second
commit message :-/

-- 
Tomas Vondra