Re: Adding basic NUMA awareness - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Adding basic NUMA awareness
Msg-id 6636d842-e8a5-44bc-a206-6df415380a7b@vondra.me
In response to Re: Adding basic NUMA awareness  (Tomas Vondra <tomas@vondra.me>)
Responses Re: Adding basic NUMA awareness
List pgsql-hackers
Hi,

Here's a fresh version of the NUMA patch series. There's a number of
substantial improvements:

1) Rebase to current master, particularly on top of 2c7894052759 which
removed the freelist. The patch that partitioned the freelist is gone.

2) A bunch of fixes, and it now passes CI workflows on github, and all
other testing I did. Of course, more testing is needed.

3) It builds with/without libnuma support, and so on.

4) The separate GUCs were replaced by a single list GUC, similar to what
we do for debug_io_direct. The GUC is called debug_numa, and accepts
"buffers" and "procs", to partition those two areas.
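
For instance, enabling both areas in postgresql.conf would look like
this (following the same list syntax as debug_io_direct):

```
debug_numa = 'buffers, procs'
```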

I'm considering adding a "clock-sweep" option in a future patch. I
didn't do that here, because the buffer partitioning is already enabled
by "buffers", and the clock-sweep just builds on that (partitioning the
same way, pretty much).

5) It also works with EXEC_BACKEND, but this turned out to be a bit more
challenging than expected. The trouble is that some of the parameters
(e.g. memory page size) are used both in the "size" and "init" phases,
and we need to be extra careful not to make anything inconsistent.

For example, we may get confused about the memory page size. The "size"
phase happens before allocation, at which point we don't know whether
we'll succeed in getting enough huge pages. By the time "init" runs, we
do know, so the effective memory page size may differ. We must be
careful, e.g. not to need more memory than we requested.

This is a general problem, but the EXEC_BACKEND makes it a bit trickier.
In regular fork() case we can simply set some global variables in "size"
and use them later in "init", but that doesn't work for EXEC_BACKEND.

The current approach simply does the calculations twice, in a way that
should end with the same results. But I'm not 100% happy with it, and I
suspect it just confirms we should store the results in shared memory
(which is what I called the "NUMA registry" before). That's still a TODO.

6) BufferDescPadded is 64B everywhere. Originally, the padding was
applied only on 64-bit platforms, and on 32-bit systems the struct was
left at 60B. But that is not compatible with buffer partitioning, which
relies on the memory page being a multiple of BufferDescPadded.

Perhaps it could be relaxed (so that a BufferDesc might span two memory
pages), but this seems like the cleanest solution. I don't expect it to
make any measurable difference.

7) I've moved some of the code to BgBufferSyncPartition, to make review
easier (without all the indentation changes).

8) I've realized some of the TAP tests occasionally fail with

    ERROR: no unpinned buffers

and I think I know why. Some of the tests set shared_buffers to a very
low value - like 1MB or even 128kB, and StrategyGetBuffer() may search
only a single partition (but not always). We may run out of unpinned
buffers in that one partition.

This apparently happens more easily on rpi5, due to the weird NUMA
layout (there are 8 nodes with memory, but getcpu() reports node 0 for
all cores).

I suspect the correct fix is to ensure StrategyGetBuffer() scans all
partitions, if there are no unpinned buffers in the current one. On
realistic setups this shouldn't happen very often, I think.

The other issue I just realized is that StrategyGetBuffer() recalculates
the partition index over and over, which seems unnecessary (and possibly
expensive, due to the modulo). And it also does too many loops, because
it used NBuffers instead of the partition size. I'll fix those later.

9) I'm keeping the clock-sweep balancing patches separate for now. In
the end all the clock-sweep patches should be merged, but I think it
keeps the changes easier to review this way.

10) There aren't many changes in the PGPROC partitioning patch. I merely
fixed issues that broke it with EXEC_BACKEND, and did some smaller tweaks.

11) I'm not including the experimental patches to pin backends to CPUs
(or nodes), and so on. It's clear those are unlikely to go in, so it'd
be just a distraction.

12) What's the right / portable way to determine the current CPU for a
process? The clock-sweep / PGPROC patches need this to pick the right
partition, but it's not clear to me which API is the most portable. In
the end I used sched_getcpu(), and then numa_node_of_cpu(), but maybe
there's a better way.


regards

-- 
Tomas Vondra
