Re: Adding basic NUMA awareness - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Adding basic NUMA awareness
Date
Msg-id rsjxzhnxxfq5i5yxv66mhinb42o3vzmqpkbfpexpkk5prreh2l@jyp73gsyyfzn
Whole thread Raw
In response to Re: Adding basic NUMA awareness  (Andres Freund <andres@anarazel.de>)
Responses Re: Adding basic NUMA awareness
List pgsql-hackers
Hi,

On 2026-01-12 19:10:00 -0500, Andres Freund wrote:
> On 2026-01-13 00:58:49 +0100, Tomas Vondra wrote:
> > On 1/10/26 02:42, Andres Freund wrote:
> > > psql -Xq -c 'SELECT pg_buffercache_evict_all();' -c 'SELECT numa_node, sum(size) FROM pg_shmem_allocations_numa
GROUPBY 1;' && perf stat --per-socket  -M memory_bandwidth_read,memory_bandwidth_write -a psql -c 'SELECT sum(abalance)
FROMpgbench_accounts;'
 
>
> > And then I initialized pgbench with scale that is much larger than
> > shared buffers, but fits into RAM. So cached, but definitely > NB/4. And
> > then I ran
> >
> >   select * from pgbench_accounts offset 1000000000;
> >
> > which does a sequential scan with the circular buffer you mention abobe
>
> Did you try it with the query I suggested? One plausible reason why you did
> not see an effect with your query is that with a huge offset you actually
> never deform the tuple, which is an important and rather latency sensitive
> path.

Btw, this doesn't need anywhere close to as much data, it should be visible as
soon as you're >> L3.

To show why
  SELECT * FROM pgbench_accounts OFFSET 100000000
doesn't show an effect but
  SELECT sum(abalance) FROM pgbench_accounts;

does, just look at the difference using the perf command I posted. Here on a
scale 200.


numactl --membind 0 --cpunodebind 0

offset:

S0        6         47,138,135      UNC_M_CAS_COUNT.WR               #   3884.1 MB/s  memory_bandwidth_write
S0       20        780,343,577      duration_time
S0        6         61,685,331      UNC_M_CAS_COUNT.RD               #   5082.8 MB/s  memory_bandwidth_read
S0       20        780,353,818      duration_time
S1        6          1,238,568      UNC_M_CAS_COUNT.WR               #    102.1 MB/s  memory_bandwidth_write
S1        6          1,475,224      UNC_M_CAS_COUNT.RD               #    121.6 MB/s  memory_bandwidth_read

       0.776715450 seconds time elapsed


agg:

S0        6         53,145,706      UNC_M_CAS_COUNT.WR               #   2000.8 MB/s  memory_bandwidth_write
S0       20      1,706,046,493      duration_time
S0        6        111,390,488      UNC_M_CAS_COUNT.RD               #   4193.5 MB/s  memory_bandwidth_read
S0       20      1,706,057,341      duration_time
S1        6          3,968,454      UNC_M_CAS_COUNT.WR               #    149.4 MB/s  memory_bandwidth_write
S1        6          4,026,212      UNC_M_CAS_COUNT.RD               #    151.6 MB/s  memory_bandwidth_read



numactl --membind 0 --cpunodebind 1

offset:

S0        6         91,982,003      UNC_M_CAS_COUNT.WR               #   7036.4 MB/s  memory_bandwidth_write
S0       20        842,785,290      duration_time
S0        6        113,076,316      UNC_M_CAS_COUNT.RD               #   8650.1 MB/s  memory_bandwidth_read
S0       20        842,797,430      duration_time
S1        6          1,545,612      UNC_M_CAS_COUNT.WR               #    118.2 MB/s  memory_bandwidth_write
S1        6          2,354,087      UNC_M_CAS_COUNT.RD               #    180.1 MB/s  memory_bandwidth_read

       0.836623794 seconds time elapsed

agg:

S0        6        133,267,754      UNC_M_CAS_COUNT.WR               #   3980.9 MB/s  memory_bandwidth_write
S0       20      2,146,221,284      duration_time
S0        6        159,951,549      UNC_M_CAS_COUNT.RD               #   4777.9 MB/s  memory_bandwidth_read
S0       20      2,146,233,675      duration_time
S1        6         71,543,708      UNC_M_CAS_COUNT.WR               #   2137.1 MB/s  memory_bandwidth_write
S1        6         49,584,957      UNC_M_CAS_COUNT.RD               #   1481.2 MB/s  memory_bandwidth_read

       2.142535432 seconds time elapsed


Note how much bigger the absolute numbers of reads and writes are for the
aggregate compared to the offset.

Interestingly I do see a performance difference, albeit a smaller one, even
with OFFSET. I see similar numbers on two different 2 socket machines.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: docs: clarify ALTER TABLE behavior on partitioned tables
Next
From: Andres Freund
Date:
Subject: Re: Buffer locking is special (hints, checksums, AIO writes)