Re: Adding basic NUMA awareness - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Adding basic NUMA awareness
Date
Msg-id uaqt63a2kz2g3qk2ujr55gdvmymnak3zsplcgrlh63ekbrytbc@a2wy22p3slge
Whole thread Raw
In response to Re: Adding basic NUMA awareness  (Tomas Vondra <tomas@vondra.me>)
Responses Re: Adding basic NUMA awareness
List pgsql-hackers
Hi,

On 2025-07-02 14:36:31 +0200, Tomas Vondra wrote:
> On 7/2/25 13:37, Ashutosh Bapat wrote:
> > On Wed, Jul 2, 2025 at 12:37 AM Tomas Vondra <tomas@vondra.me> wrote:
> >>
> >>
> >> 3) v1-0003-freelist-Don-t-track-tail-of-a-freelist.patch
> >>
> >> Minor optimization. Andres noticed we're tracking the tail of buffer
> >> freelist, without using it. So the patch removes that.
> >>
> >
> > The patches for resizing buffers use the lastFreeBuffer to add new
> > buffers to the end of free list when expanding it. But we could as
> > well add it at the beginning of the free list.

Yea, I don't see any point in adding buffers to the tail instead of to the
front. We probably want more recently used buffers at the front, since they
(and the associated BufferDesc) are more likely to be in a CPU cache.


> > This patch seems almost independent of the rest of the patches. Do you
> > need it in the rest of the patches? I understand that those patches
> > don't need to worry about maintaining lastFreeBuffer after this patch.
> > Is there any other effect?
> >
> > If we are going to do this, let's do it earlier so that buffer
> > resizing patches can be adjusted.
> >
>
> My patches don't particularly rely on this bit, it would work even with
> lastFreeBuffer. I believe Andres simply noticed the current code does
> not use lastFreeBuffer, it just maintains is, so he removed that as an
> optimization.

Optimiziation / simplification.  When building multiple freelists it was
harder to maintain the tail pointer, and since it was never used...

+1 to just applying that part.


> I don't know how significant is the improvement, but if it's measurable we
> could just do that independently of our patches.

I doubt it's really an improvement in any realistic scenario, but it's also
not a regression in any way, since it's never used...


FWIW, I've started to wonder if we shouldn't just get rid of the freelist
entirely. While clocksweep is perhaps minutely slower in a single thread than
the freelist, clock sweep scales *considerably* better [1]. As it's rather
rare to be bottlenecked on clock sweep speed for a single thread (rather then
IO or memory copy overhead), I think it's worth favoring clock sweep.

Also needing to switch between getting buffers from the freelist and the sweep
makes the code more expensive.  I think just having the buffer in the sweep,
with a refcount / usagecount of zero would suffice.

That seems particularly advantageous if we invest energy in making the clock
sweep deal well with NUMA systems, because we don't need have both a NUMA
aware freelist and a NUMA aware clock sweep.

Greetings,

Andres Freund

[1]

A single pg_prewarm of a large relation shows a difference between using the
freelist and not that's around the noise level, whereas 40 parallel
pg_prewarms of seperate relations is over 5x faster when disabling the
freelist.

For the test:
- I modified pg_buffercache_evict_* to put buffers onto the freelist

- Ensured all of shared buffers is allocated by querying
  pg_shmem_allocations_numa, as otherwise the workload is dominated by the
  kernel zeroing out buffers

- used shared_buffers bigger than the data

- data for single threaded is 9.7GB, data for the parallel case is 40
  relations of 610MB each.

- in the single threaded case I pinned postgres to a single core, to make sure
  core-to-core variation doesn't play a role

- single threaded case

  c=1 && psql -Xq -c "select pg_buffercache_evict_all()" -c 'SELECT numa_node, sum(size), count(*) FROM
pg_shmem_allocations_numaWHERE size != 0 GROUP BY numa_node;' && pgbench -n -P1 -c$c -j$c -f <(echo "SELECT
pg_prewarm('copytest_large');")-t1
 

  concurrent case:

  c=40 && psql -Xq -c "select pg_buffercache_evict_all()" -c 'SELECT numa_node, sum(size), count(*) FROM
pg_shmem_allocations_numaWHERE size != 0 GROUP BY numa_node;' && pgbench -n -P1 -c$c -j$c -f <(echo "SELECT
pg_prewarm('copytest_:client_id');")-t1
 



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: 回复:[Internet]Re: [PATCH] Prevent replacement of a function if it's used in an index expression and is not IMMUTABLE
Next
From: Greg Burd
Date:
Subject: Re: Adding basic NUMA awareness