Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock
Date:
Msg-id: 2avffd4n6e5lu7kbuvpjclw3dzcqsw4qctj5ch4qin5gakk3r3@euyewe6tf3z3
In response to: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock ("Greg Burd" <greg@burd.me>)
List: pgsql-hackers
Hi,

On 2025-07-11 13:26:53 -0400, Greg Burd wrote:
> In conversations [1] recently about considering how best to adapt the code
> to become NUMA-aware, Andres commented, "FWIW, I've started to wonder if
> we shouldn't just get rid of the freelist entirely", and because I'm a
> glutton for punishment (and I think this idea has some merit) I took him
> up on this task.
>
> In freelist.c the function StrategyGetBuffer() currently tries first to
> use the BufferAccessStrategy, if present, via GetBufferFromRing(). Failing
> that, the second step is to check the "freelist" as defined by
> StrategyControl->firstFreeBuffer/lastFreeBuffer to determine if it
> contains any available buffers, and finally it will "Use the clock-sweep
> algorithm to find a free buffer." The freelist was intended to be "a list
> of buffers that are prime candidates for replacement", but I question the
> value of that versus the overhead of managing it. Without the list, some
> operations are likely faster (I plan to measure this) since, as Andres
> points out in [1], "just using clock sweep actually makes things like
> DROP TABLE perform better because it doesn't need to maintain the
> freelist anymore." It may be the case that with very large NBuffers where
> most are in use this approach is slower (I plan to measure this too), but
> in those cases I'd imagine the freelist is likely empty and so the code
> will use the clock-sweep algorithm anyway, so I'm not sure there is a
> performance penalty at all.
>
> This change does remove the have_free_buffer() function used by the
> contrib/pg_prewarm module. On the surface this doesn't seem to cause any
> issues, but honestly I've not thought too deeply about this one.

I think we'll likely need something to replace it.

TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
right. The goal of using have_free_buffer() is obviously to stop prewarming
shared buffers if doing so would just evict buffers.
But it's not clear to me that we should just stop when there aren't any free
buffers - what if the previous buffer contents aren't the right ones? It'd
make more sense to me to stop autoprewarm once NBuffers have been
prewarmed...

> v2-0001 Eliminate the freelist from the buffer manager and depend on
> clock-sweep.
>
> Once removed [2] and tests passing [3], I took a long hard look at the
> buffer_strategy_lock that used to serialize concurrent access to members
> of BufferStrategyControl, and I couldn't find a good reason to keep it
> around. Let's review what it is guarding:
>
> completePasses: a count of the number of times the clock-sweep hand wraps
> around. StrategySyncStart() provides this to the bgwriter, which in turn
> uses it to compute a strategic location at which to start scanning for
> pages to evict. There's an interesting comment that indicates both a
> "requirement" and an equivocal "but that's highly unlikely and wouldn't
> be particularly harmful" statement, conflicting with itself. I tried to
> find a reason that nextVictimBuffer could overflow, or that the use of
> the completePasses value could somehow cause harm if off by one or more
> in the bgwriter, and either I missed it (please tell me) or there isn't
> one. However, it does make sense to change completePasses into an atomic
> value so that it is consistent across backends and in the bgwriter.

I don't think it's *quite* that easy. If you just maintain nextVictimBuffer
and completePasses as separate atomic counters, without a lock keeping the
two consistent, StrategySyncStart(), as coded right now / in your patch,
won't necessarily return reasonable values for the two. Which I think would
lead to bgwriter suddenly becoming overly active for one cycle and then very
inactive for a while after. With really large shared buffers that'd be
rather rare to hit, but with small shared buffers I think it'd be common
enough to worry.
The most obvious way around this would be to make the clock hand a 64bit
atomic, which would avoid the need for a separate tracker for the number of
passes. Unfortunately, doing so would require a modulo operation on each
clock tick, which I think would likely be too expensive on common platforms -
with small shared_buffers I actually see the existing, relatively rarely
reached, modulo in ClockSweepTick() show up on a Skylake-X system. It'd be
easier if we could rely on NBuffers being a power of two, but that doesn't
seem like a realistic requirement.

I think the easiest way here would be to make StrategySyncStart(), a
relatively rare operation, retry whenever it detects that it saw an
out-of-sync nextVictimBuffer. E.g. something roughly like:

    while (true)
    {
        completePasses = atomic_read_u32(&StrategyControl->completePasses);
        pg_memory_barrier();
        nextVictimBuffer = atomic_read_u32(&StrategyControl->nextVictimBuffer);
        pg_memory_barrier();

        if (completePasses == atomic_read_u32(&StrategyControl->completePasses) &&
            nextVictimBuffer <= atomic_read_u32(&StrategyControl->nextVictimBuffer))
            break;
    }

which I think would detect the case that we read a nextVictimBuffer value
that was decreased while reading a completePasses value that was increased.

I think while at it, we should make ClockSweepTick() decrement
nextVictimBuffer by atomically subtracting NBuffers, rather than using CAS.
I recently noticed that the CAS sometimes has to retry a fair number of
times, which in turn makes the `victim % NBuffers` show up in profiles.

> I'd like to hear ideas for worst cases to test and/or benchmark. I plan
> on attempting what I saw Michael do with flamegraphs before/after/delta
> in a follow-up to this. If people feel this has merit I'll add it to
> CF/2.

What I've benchmarked is both single-threaded and concurrent clock sweep, by
doing pg_prewarm() of per-pgbench-client relations.
I used

    c=40 && \
    psql -Xq \
      -c "select pg_buffercache_evict_all()" \
      -c 'SELECT numa_node, sum(size), count(*) FROM pg_shmem_allocations_numa WHERE size != 0 GROUP BY numa_node;' && \
    pgbench -n -P1 -c$c -j$c -f <(echo "SELECT pg_prewarm('copytest_:client_id');") -t1

(with c=1 for the single-threaded case, obviously)

The reason for the pg_shmem_allocations_numa query is to ensure that
shared_buffers is actually mapped, as otherwise the bottleneck will be the
kernel zeroing out buffers.

The reason for doing -t1 is that I wanted to compare freelist vs clock
sweep, rather than clock sweep in general. Note that I patched
EvictUnpinnedBufferInternal() to call StrategyFreeBuffer(), as otherwise
running this a second time won't actually measure the freelist. And the
first run after postmaster start will always be noisier...

Greetings,

Andres Freund