Hi!
Here's a slightly improved version of the patch series.
The main improvement is related to rebalancing partitions of different
sizes (which can happen because the sizes have to be a multiple of some
minimal "chunk" determined by memory page size etc.). Part 0009 deals
with that by adjusting the allocations by partition size. It works OK,
though it matters less as the shared_buffers size increases (the relative
difference between the large and small partitions gets smaller).
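
Just to illustrate the general idea (a made-up sketch, not necessarily
how 0009 does it): making each partition's allocation share proportional
to its size means a slightly larger partition simply receives a
proportionally larger share of new allocations.

    #include <stdio.h>

    /*
     * Hypothetical sketch, not the actual 0009 code: each partition's
     * share of new allocations is proportional to its share of the
     * total number of buffers.
     */
    static void
    compute_size_weights(const int *part_nbuffers, int nparts, double *weights)
    {
        long    total = 0;

        for (int i = 0; i < nparts; i++)
            total += part_nbuffers[i];

        for (int i = 0; i < nparts; i++)
            weights[i] = (double) part_nbuffers[i] / total;
    }

    int
    main(void)
    {
        /* two partitions of unequal size (in buffers) */
        int     nbuffers[2] = {196608, 131072};
        double  weights[2];

        compute_size_weights(nbuffers, 2, weights);
        printf("shares: %.2f / %.2f\n", weights[0], weights[1]);  /* 0.60 / 0.40 */
        return 0;
    }
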
The other improvements are related to the pg_buffercache_partitions
view, showing the weights and (running) totals of allocations.
I plan to take a break from this patch series for a while, so this would
be a good time to take a look, do a review, run some tests etc. ;-)
One detail about the balancing I forgot to mention in my last message is
how the patch "distributes" allocations to match the balancing weights.
Consider the example weights from that message:
P1: [ 55, 45]
P2: [ 0, 100]
Imagine a backend located on P1 requests allocation of a buffer. The
weights say 55% of buffers should be allocated from P1, and 45% should be
redirected to P2. One way to achieve that would be to generate a random
number in [1, 100], and if it falls in [1, 55] use P1, otherwise P2.
The patch does a much simpler thing - it treats the weight as a "budget",
i.e. the number of buffers to allocate before proceeding to the "next"
partition. So it allocates 55 buffers from P1, then 45 buffers from P2,
and then goes back to P1 in a round-robin fashion. The advantage is that
it does not need a PRNG at all.
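
A minimal sketch of that idea, with made-up names (this is not the actual
patch code), using the P1 weights from above:

    #include <stdio.h>

    #define NPARTS 2

    /*
     * Treat each weight as a "budget" of allocations: drain the current
     * partition's budget, then move to the next partition round-robin.
     * (Assumes at least one non-zero weight.)
     */
    typedef struct PartitionBudget
    {
        int     weights[NPARTS];    /* balancing weights for this backend */
        int     current;            /* partition currently being drained */
        int     remaining;          /* allocations left in current budget */
    } PartitionBudget;

    static int
    choose_partition(PartitionBudget *pb)
    {
        /* budget exhausted? advance, skipping zero-weight partitions */
        while (pb->remaining == 0)
        {
            pb->current = (pb->current + 1) % NPARTS;
            pb->remaining = pb->weights[pb->current];
        }

        pb->remaining--;
        return pb->current;
    }

    int
    main(void)
    {
        /* the P1 weights from above: 55% local, 45% redirected to P2 */
        PartitionBudget pb = {.weights = {55, 45}, .current = 0, .remaining = 55};
        int     counts[NPARTS] = {0};

        for (int i = 0; i < 200; i++)
            counts[choose_partition(&pb)]++;

        printf("P1: %d, P2: %d\n", counts[0], counts[1]);  /* 110 and 90 */
        return 0;
    }

Over each full cycle of 100 allocations the split matches the weights
exactly, with no random number generation involved.
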
There are two things I'm not entirely sure about:
1) memory model - I'm not quite sure the current code ensures updates to
weights are properly "communicated" to the other processes. That is, if
the bgwriter recalculates the weights, will the other backends see the
new weights right away? Using stale weights won't cause "failures",
the consequence is just a bit of imbalance. But it shouldn't stay like
that for too long, so maybe it'd be good to add some memory barriers or
something like that.
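
For illustration only (a sketch using C11 atomics and made-up names; the
real code would presumably use PostgreSQL's pg_atomic_* and
pg_write_barrier()/pg_read_barrier() primitives), the usual
publish/consume pattern looks like this:

    #include <stdatomic.h>

    #define NPARTS 2

    /* hypothetical shared-memory struct holding the balancing weights */
    typedef struct WeightsShared
    {
        int         weights[NPARTS][NPARTS];
        atomic_uint version;        /* bumped on every recalculation */
    } WeightsShared;

    /* writer (bgwriter): store the new weights, then publish with a release */
    static void
    publish_weights(WeightsShared *ws, int neww[NPARTS][NPARTS])
    {
        for (int i = 0; i < NPARTS; i++)
            for (int j = 0; j < NPARTS; j++)
                ws->weights[i][j] = neww[i][j];

        /* release: the weight stores above can't be reordered past this */
        atomic_fetch_add_explicit(&ws->version, 1, memory_order_release);
    }

    /* reader (backend): acquire the version, then read the weights */
    static void
    fetch_weights(WeightsShared *ws, int out[NPARTS][NPARTS])
    {
        /*
         * Pairs with the release above: once a backend observes the new
         * version it is guaranteed to see the new weights too.  A reader
         * racing with an update may still get a mix of old and new values,
         * which (as noted) only means a bit of temporary imbalance.
         */
        (void) atomic_load_explicit(&ws->version, memory_order_acquire);

        for (int i = 0; i < NPARTS; i++)
            for (int j = 0; j < NPARTS; j++)
                out[i][j] = ws->weights[i][j];
    }
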
2) I'm a bit unsure what "NUMA nodes" actually means. The patch mostly
assumes each core / piece of RAM is assigned to a particular NUMA node.
For the buffer partitioning the patch mostly cares about memory, as it
"locates" the buffers on different NUMA nodes, which works mostly OK
(ignoring the issues with huge pages described in the previous message).
But it also cares about the cores (and the node for each core), because
it uses that to pick the right partition for a backend. And here the
situation is less clear, because the CPUs don't need to be assigned to a
particular node, even on a NUMA system. Consider the rpi5 NUMA layout:
$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3
node 0 size: 992 MB
node 0 free: 274 MB
node 1 cpus: 0 1 2 3
node 1 size: 1019 MB
node 1 free: 327 MB
node 2 cpus: 0 1 2 3
node 2 size: 1019 MB
node 2 free: 321 MB
node 3 cpus: 0 1 2 3
node 3 size: 955 MB
node 3 free: 251 MB
node 4 cpus: 0 1 2 3
node 4 size: 1019 MB
node 4 free: 332 MB
node 5 cpus: 0 1 2 3
node 5 size: 1019 MB
node 5 free: 342 MB
node 6 cpus: 0 1 2 3
node 6 size: 1019 MB
node 6 free: 352 MB
node 7 cpus: 0 1 2 3
node 7 size: 1014 MB
node 7 free: 339 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  10  10  10  10  10  10  10
  1:  10  10  10  10  10  10  10  10
  2:  10  10  10  10  10  10  10  10
  3:  10  10  10  10  10  10  10  10
  4:  10  10  10  10  10  10  10  10
  5:  10  10  10  10  10  10  10  10
  6:  10  10  10  10  10  10  10  10
  7:  10  10  10  10  10  10  10  10
This says there are 8 NUMA nodes, each with ~1GB of RAM. But the 4 cores
are not assigned to particular nodes - each core is mapped to all 8 NUMA
nodes. I'm not sure what to do about this (or how getcpu() or libnuma
handle this). And can the situation be even more complicated?
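
One simple way to find out is a standalone probe like this (not patch
code; getcpu() needs a reasonably recent glibc, and the program links
against libnuma):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>          /* getcpu() */
    #include <numa.h>           /* numa_available(), numa_node_of_cpu() */

    /* build: gcc probe.c -o probe -lnuma */
    int
    main(void)
    {
        unsigned int cpu,
                     node;

        if (numa_available() < 0)
        {
            fprintf(stderr, "libnuma: NUMA not available\n");
            return 1;
        }

        if (getcpu(&cpu, &node) != 0)
        {
            perror("getcpu");
            return 1;
        }

        /* on a layout like the rpi5 above it's unclear which node "wins" */
        printf("cpu %u: getcpu node %u, numa_node_of_cpu %d\n",
               cpu, node, numa_node_of_cpu((int) cpu));
        return 0;
    }
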
regards
--
Tomas Vondra