Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: Adding basic NUMA awareness |
| Date | |
| Msg-id | c73a8abb-afd8-41f5-abcc-77f9a27d99d2@vondra.me |
| In response to | Re: Adding basic NUMA awareness (Andres Freund <andres@anarazel.de>) |
| List | pgsql-hackers |
On 1/13/26 02:08, Andres Freund wrote:
> Hi,
>
> On 2026-01-13 01:51:09 +0100, Tomas Vondra wrote:
>> On 1/13/26 01:10, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2026-01-13 00:58:49 +0100, Tomas Vondra wrote:
>>>> On 1/10/26 02:42, Andres Freund wrote:
>>>>> psql -Xq -c 'SELECT pg_buffercache_evict_all();' -c 'SELECT numa_node, sum(size) FROM pg_shmem_allocations_numa GROUP BY 1;' && perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write -a psql -c 'SELECT sum(abalance) FROM pgbench_accounts;'
>>>
>>>> And then I initialized pgbench with a scale that is much larger than
>>>> shared buffers, but fits into RAM. So cached, but definitely > NB/4. And
>>>> then I ran
>>>>
>>>> select * from pgbench_accounts offset 1000000000;
>>>>
>>>> which does a sequential scan with the circular buffer you mention above
>>>
>>> Did you try it with the query I suggested? One plausible reason why you did
>>> not see an effect with your query is that with a huge offset you actually
>>> never deform the tuple, which is an important and rather latency sensitive
>>> path.
>>>
>>
>> I did try with the agg query too, and there's still no difference on
>> either machine.
>
> Could you provide numactl --hardware for both? There may be more than two
> numa nodes on a system with 2 sockets, due to one socket being split into two
> - in which case the latency between 0,1 might be a lot lower than say 0 and 3.
>

xeon:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
node 0 size: 32066 MB
node 0 free: 13081 MB
node 1 cpus: 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
node 1 size: 32210 MB
node 1 free: 17764 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

azure/epyc:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 193412 MB
node 0 free: 147949 MB
node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 193513 MB
node 1 free: 151577 MB
node distances:
node   0   1
  0:  10  11
  1:  11  10

>
>> I can't do the perf on the Azure VM, because the Ubuntu image is
>> borked and does not allow installing the package. But on my xeon I can
>> do the perf, and that gives me this:
>>
>> numactl --membind=0 --cpunodebind=0 ~/builds/master-test/bin/pg_ctl
>> -----------------------------------------------------------------------
>> S0   1     24,677,226      UNC_M_CAS_COUNT.WR #    79.0 MB/s ... idth_write
>> S0   1 20,001,829,522 ns   duration_time ...
>> S0   1    972,631,426      UNC_M_CAS_COUNT.RD #  3112.2 MB/s ... idth_read
>> S0   1 20,001,822,807 ns   duration_time ...
>> S1   1     15,602,233      UNC_M_CAS_COUNT.WR #    49.9 MB/s ... idth_write
>> S1   1    712,431,146      UNC_M_CAS_COUNT.RD #  2279.6 MB/s ... idth_read
>>
>>
>> numactl --membind=0 --cpunodebind=1 ~/builds/master-test/bin/pg_ctl
>> -----------------------------------------------------------------------
>> S0   1     47,931,019      UNC_M_CAS_COUNT.WR #   153.4 MB/s ... idth_write
>> S0   1 20,002,933,380 ns   duration_time ...
>> S0   1  1,007,386,994      UNC_M_CAS_COUNT.RD #  3223.2 MB/s ... idth_read
>> S0   1 20,002,927,341 ns   duration_time ...
>> S1   1     10,310,201      UNC_M_CAS_COUNT.WR #    33.0 MB/s ... idth_write
>> S1   1    714,826,668      UNC_M_CAS_COUNT.RD #  2287.2 MB/s ... idth_read
>>
>> so there is a little bit of a difference for some stats, but not much.
>>
>>
>> FWIW this is from
>>
>> perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write
>> -a -- sleep 20
>>
>> while the agg query runs in a loop.
>
> FWIW doing one perf stat for each execution is preferable for comparison,
> because otherwise you can hide large differences in total number of memory
> accesses if the runtimes for the queries in the two "numa configurations" are
> different.
>

Good point, I'll do that next time. But in this case they are not all that
different, I think.
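Something like this, presumably: one perf stat invocation wrapping a single
query execution, reusing the metrics and the agg query from above (connection
defaults assumed, evict/placement step taken from your command upthread):

    # evict cached buffers, then check which node the shared memory sits on
    psql -Xq -c 'SELECT pg_buffercache_evict_all();' \
         -c 'SELECT numa_node, sum(size) FROM pg_shmem_allocations_numa GROUP BY 1;'

    # one perf stat run per execution, so the totals stay comparable
    perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write -a -- \
        psql -c 'SELECT sum(abalance) FROM pgbench_accounts;'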
regards

--
Tomas Vondra