Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From: Tomas Vondra
Subject: Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock
Msg-id: 493bd9a1-788b-41ec-8848-f64949e52b33@vondra.me
In response to: Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock (Greg Burd <greg@burd.me>)
List: pgsql-hackers

Hi,

I took a look at this thread / patches, mostly because it's somewhat related to the NUMA patch series [1]. The freelist is one of the things the NUMA patches tweak, splitting it into per-node partitions. So removing the freelist would affect the NUMA patches, mostly simplifying them by making the freelist part irrelevant.

On the whole the idea seems reasonable - I'm not against removing the freelist, assuming Andres is right and the freelist is not very useful anyway. So I decided to validate this assumption by running a couple of benchmarks ...

benchmarks
----------

I did two tests:

1) run-pgbench.sh

   - read-only pgbench
   - scale set to different fractions of shared buffers (50%, 90%, 110%)
   - tracks tps

2) run-seqscan.sh

   - concurrent evict+seqscans on different tables
   - shared buffers set to different fractions of the total dataset size
     (50%, 90%, 110%)
   - similar to the workload Andres suggested in [2], except that it uses
     a simple SELECT instead of pg_prewarm (the patches change how
     pg_prewarm decides to stop, and I didn't want to be affected by that)
   - tracks tps, and the average latency of the evict + select

The attached test scripts are "complete" - they set up an instance, disable checksums, set a couple of GUCs, etc.

I wanted to see the effect of each patch, so the tests were done on master, and then with the patches applied one by one (after fixing the compile failures in 0001). I did this on two EPYC machines with 96 cores (lscpu attached), and the results are virtually the same. Attached are PDFs with results from one of the machines.

On the whole, the impact of the patches is negligible - it's within 1% in either direction, and I don't see any particular trend / systemic change in the behavior (e.g. regressions for some cases). Seems like noise, which supports the assumption that the freelist is not very useful.

The one positive exception is "evict latency", which tracks the latency of pg_buffercache_evict_relation(). That got consistently a little better, which makes sense as it no longer needs to maintain the freelist.

freelist statistics?
--------------------

There's one caveat, though. I find it tricky to "know" whether the workload actually uses the freelist. Is there a good way to determine if a buffer was acquired from the freelist, or through the clock-sweep? I'm worried the tests might actually have empty freelists, or that the freelists won't be used very much, in which case removing the freelist understandably has no impact. But that says nothing about cases that get buffers from the freelist often ...

I can probably think about each workload and deduce whether there will be freelists. For the benchmarks described earlier, I think the situation is this:

  pgbench, shared buffers < 100%  -> no freelists, uses clock-sweep
  pgbench, shared buffers > 100%  -> freelists, but unused (no evictions)
  seqscan, shared buffers < 100%  -> freelists, but clock-sweep too
  seqscan, shared buffers > 100%  -> freelists, no clock-sweep

But maybe I got it wrong for some cases? It'd be good to have a way to collect some stats for each test, to confirm this and also to quantify the effects. A test that gets 10% of its buffers from the freelist may behave differently from a test that gets 99% of them from the freelist.

I propose we add a system view showing interesting information about freelists and buffer allocation, or perhaps extend an existing one (e.g. pg_stat_bgwriter, which already has buffers_alloc). In the NUMA patches I simply added a new view into pg_buffercache. I'm not suggesting this would get committed - at this point I'm more focused on the development, and whether the view would be useful outside the development is an open question. Also, if it adds stats about freelists (for current master), that'd be useless once we remove them.
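To make that a bit more concrete, here's roughly the kind of thing I have in mind - just a sketch with made-up names, not code from any of the patches:

#include "port/atomics.h"

/*
 * Hypothetical counters for the two buffer allocation paths, kept in
 * shared memory (next to StrategyControl, say). A system view or a
 * pg_buffercache function could expose them, so each test run can
 * report its freelist vs. clock-sweep allocation mix.
 */
typedef struct BufferAllocStats
{
    pg_atomic_uint64 freelist_allocs;   /* buffers taken from the freelist */
    pg_atomic_uint64 clocksweep_allocs; /* buffers found by the clock-sweep */
} BufferAllocStats;

static BufferAllocStats *AllocStats;

/* bumped in StrategyGetBuffer() when a victim comes off the freelist */
static inline void
CountFreelistAlloc(void)
{
    pg_atomic_fetch_add_u64(&AllocStats->freelist_allocs, 1);
}

/* bumped in StrategyGetBuffer() when the clock-sweep finds a victim */
static inline void
CountClockSweepAlloc(void)
{
    pg_atomic_fetch_add_u64(&AllocStats->clocksweep_allocs, 1);
}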
Now, a couple of review comments about the individual patches:

0001
----

1) nitpick: adds a blank line at the end of buffer/README

2) gcc complains about a missing have_free_buffer prototype, but it's no longer needed and can be removed

3) freelist.c fails to compile, because of:

freelist.c: In function ‘have_free_buffer’:
freelist.c:166:51: error: passing argument 1 of ‘pg_atomic_read_u64’ from incompatible pointer type [-Wincompatible-pointer-types]
  166 |     uint64 hand = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
      |                                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                      |
      |                                      pg_atomic_uint32 *

That is, it should be accessed using pg_atomic_read_u32. CI/cfbot does not report this, because it applies all the patches, and 0002 fixes this.

0002
----

1) I think the buffer/README needs a bit more work. It now suddenly mentions completePasses, without explaining what it's for and how it works with nextVictimBuffer. And then it's not mentioned again, so why even mention it? We don't mention various other fields ... It's true completePasses is not a new thing, and probably should have already been explained by the README. Still, it seems a bit confusing.

2) It'd be good if the README explained how we coordinate access to multiple atomic values (with the spinlock removed), and why the current approach is safe, or how it's made safe. Or at least mention that we thought about it, and where to look for details (e.g. in a comment for a particular function).

3) I have no opinion on how "clock-sweep" should be spelled, but if we want to unify the spelling, it should be done in a separate preparatory patch, so that it does not mix with the actual changes. It adds quite a bit of churn.

4) I'm not sure about:

    Assert(prev_strategy_passes <= strategy_passes);

Either it's unlikely but possible, and then it should be elog(ERROR), or we consider it impossible (and then the assert is fine). Maybe just rephrase the comment to say we consider the overflow impossible. Also, no need to split the passes_delta assignment, I think. It's fine to subtract before the assert - the value will be bogus/negative, but so what? It's not an undefined / invalid memory access or anything like that. (A short sketch follows at the end of this section.)

5) nitpick: freelist.c has incorrect formatting of "{" in a couple of places, e.g. it should be on a new line in a struct definition, etc. pgindent would eventually fix this, but it bugs me.

6) I don't like the CLOCK_HAND_POSITION() naming very much, partially because it's not really clear what a "clock hand" is. Intuitively I know what it's supposed to mean, also because I've worked with this code before. But not everyone has that benefit, and may assume something else. IMO if nextVictimBuffer is no longer meant to be a "buffer", but a counter that only ever increases, we should call it something different - say, "clockSweepCounter". Then CLOCK_HAND_POSITION() could be CLOCKSWEEP_NEXT_BUFFER(), and why not have a separate CLOCKSWEEP_COMPLETE_PASSES() macro too? (See the second sketch at the end of this section.)

7) I don't understand why the comment mentions the BufferDescriptor array at all. Sure, we may look at the descriptor, but why is this detail relevant in this place? Also, shouldn't it say "divide by NBuffers" at the end?

8) CLOCK_HAND_POSITION() should probably have a comment explaining why it does the masking, etc., and why it's better than simply doing the modulo on the counter directly. I mean, why is this better than just doing (counter % NBuffers)? It's going to work on uint64 anyway.

9) There was a discussion about doing the modulo in a better way, but I don't see any message explaining it clearly enough for me. And then there were some issues with it, and I guess the idea was abandoned. I'm asking because my NUMA patches do a modulo in a couple of places in the clock-sweep part too, so if there's a better way, I'd like to know.

10) Why initialize "hand" to UINT64_MAX? Seems unnecessary.

11) nitpick: StrategySyncStart formatting is a bit off
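To illustrate 4), this is the shape I had in mind - just a sketch, reusing the variable names quoted above:

    /*
     * Subtract first, assert after. If the overflow is considered
     * impossible, the assert is enough; otherwise make it elog(ERROR).
     */
    passes_delta = strategy_passes - prev_strategy_passes;
    Assert(prev_strategy_passes <= strategy_passes);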
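And for 6), roughly this kind of renaming - the names are just a strawman, of course:

    /*
     * nextVictimBuffer becomes a monotonically increasing counter
     * (clockSweepCounter), and each derived value gets an explicit macro.
     * Shown with plain modulo / division - see 8) for the question of
     * whether the masking approach is actually better.
     */
    #define CLOCKSWEEP_NEXT_BUFFER(counter)     ((int) ((counter) % NBuffers))
    #define CLOCKSWEEP_COMPLETE_PASSES(counter) ((counter) / NBuffers)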
0003
----

1) I don't quite understand the purpose of this part. How is this abstracting anything, compared to what we already have (or what the two earlier parts did)? And if it "abstracts the algorithm", what would be the purpose? Do we expect some alternative algorithms? And would 0003 really make that easier, compared to just changing the code directly? I don't get it.

2) I don't understand why we need to explicitly pass the "clock" to every ClockSweepCycles/ClockSweepPosition call. I mean, there's only ever one instance anyway - at least without "my" NUMA patches. And even with the NUMA patches, the partition is determined inside those functions, otherwise every caller would have to repeat that. I think this is unnecessary/inconvenient.

regards

[1] https://www.postgresql.org/message-id/099b9433-2855-4f1b-b421-d078a5d82017%40vondra.me
[2] https://www.postgresql.org/message-id/2avffd4n6e5lu7kbuvpjclw3dzcqsw4qctj5ch4qin5gakk3r3%40euyewe6tf3z3

-- 
Tomas Vondra