Re: Adding basic NUMA awareness - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Adding basic NUMA awareness |
Date | |
Msg-id | uaqt63a2kz2g3qk2ujr55gdvmymnak3zsplcgrlh63ekbrytbc@a2wy22p3slge Whole thread Raw |
In response to | Re: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>) |
Responses |
Re: Adding basic NUMA awareness
|
List | pgsql-hackers |
Hi, On 2025-07-02 14:36:31 +0200, Tomas Vondra wrote: > On 7/2/25 13:37, Ashutosh Bapat wrote: > > On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas@vondra.me> wrote: > >> > >> > >> 3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch > >> > >> Minor optimization. Andres noticed we're tracking the tail of buffer > >> freelist, without using it. So the patch removes that. > >> > > > > The patches for resizing buffers use the lastFreeBuffer to add new > > buffers to the end of free list when expanding it. But we could as > > well add it at the beginning of the free list. Yea, I don't see any point in adding buffers to the tail instead of to the front. We probably want more recently used buffers at the front, since they (and the associated BufferDesc) are more likely to be in a CPU cache. > > This patch seems almost independent of the rest of the patches. Do you > > need it in the rest of the patches? I understand that those patches > > don't need to worry about maintaining lastFreeBuffer after this patch. > > Is there any other effect? > > > > If we are going to do this, let's do it earlier so that buffer > > resizing patches can be adjusted. > > > > My patches don't particularly rely on this bit, it would work even with > lastFreeBuffer. I believe Andres simply noticed the current code does > not use lastFreeBuffer, it just maintains is, so he removed that as an > optimization. Optimiziation / simplification. When building multiple freelists it was harder to maintain the tail pointer, and since it was never used... +1 to just applying that part. > I don't know how significant is the improvement, but if it's measurable we > could just do that independently of our patches. I doubt it's really an improvement in any realistic scenario, but it's also not a regression in any way, since it's never used... FWIW, I've started to wonder if we shouldn't just get rid of the freelist entirely. While clocksweep is perhaps minutely slower in a single thread than the freelist, clock sweep scales *considerably* better [1]. As it's rather rare to be bottlenecked on clock sweep speed for a single thread (rather then IO or memory copy overhead), I think it's worth favoring clock sweep. Also needing to switch between getting buffers from the freelist and the sweep makes the code more expensive. I think just having the buffer in the sweep, with a refcount / usagecount of zero would suffice. That seems particularly advantageous if we invest energy in making the clock sweep deal well with NUMA systems, because we don't need have both a NUMA aware freelist and a NUMA aware clock sweep. Greetings, Andres Freund [1] A single pg_prewarm of a large relation shows a difference between using the freelist and not that's around the noise level, whereas 40 parallel pg_prewarms of seperate relations is over 5x faster when disabling the freelist. For the test: - I modified pg_buffercache_evict_* to put buffers onto the freelist - Ensured all of shared buffers is allocated by querying pg_shmem_allocations_numa, as otherwise the workload is dominated by the kernel zeroing out buffers - used shared_buffers bigger than the data - data for single threaded is 9.7GB, data for the parallel case is 40 relations of 610MB each. - in the single threaded case I pinned postgres to a single core, to make sure core-to-core variation doesn't play a role - single threaded case c=1 && psql -Xq -c "select pg_buffercache_evict_all()" -c 'SELECT numa_node, sum(size), count(*) FROM pg_shmem_allocations_numaWHERE size != 0 GROUP BY numa_node;' && pgbench -n -P1 -c$c -j$c -f <(echo "SELECT pg_prewarm('copytest_large');")-t1 concurrent case: c=40 && psql -Xq -c "select pg_buffercache_evict_all()" -c 'SELECT numa_node, sum(size), count(*) FROM pg_shmem_allocations_numaWHERE size != 0 GROUP BY numa_node;' && pgbench -n -P1 -c$c -j$c -f <(echo "SELECT pg_prewarm('copytest_:client_id');")-t1
pgsql-hackers by date: