Clock sweep not caching enough B-Tree leaf pages? - Mailing list pgsql-hackers
From:    Peter Geoghegan
Subject: Clock sweep not caching enough B-Tree leaf pages?
Msg-id:  CAM3SWZRM7-qmOGMCzix5U49ndFbLS_WwTx6VJsja+bN_Li5v_A@mail.gmail.com
List:    pgsql-hackers
I have some theories about the PostgreSQL buffer manager/clock sweep. To motivate the reader to get through the material presented here, I present up-front a benchmark of a proof-of-concept patch of mine:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay/

Test Set 4 represents the patch's performance here. It shows some considerable improvements for a tpc-b workload, with 15 minute runs, where the buffer manager struggles with moderately intense cache pressure. shared_buffers is 8GiB, with 32GiB of system memory in total. The scale factor is 5,000, which puts the primary index of the accounts table at a size that makes it impossible to cache entirely within shared_buffers, by a margin of a couple of GiB. pgbench_accounts_pkey is ~10 GB, and pgbench_accounts is ~63 GB. Obviously the heap is much larger, since for that table heap tuples are several times the size of index tuples (the ratio here is probably well below the mean, if I can be permitted to make a vast generalization).

PostgreSQL implements a clock sweep algorithm, which gets us something approaching an LRU for the buffer manager, in exchange for less contention on core structures. Buffers have a usage_count/"popularity" that currently saturates at 5 (BM_MAX_USAGE_COUNT). The classic CLOCK algorithm has only one bit for what approximates our "usage_count" (so it's either 0 or 1). I think that at its core CLOCK is an algorithm with some very desirable properties that must be preserved. Actually, it's probably more accurate to say we use a generalized CLOCK (GCLOCK) variant, a refinement of the original algorithm that replaces the single reference bit with a small counter.

In the past, various hackers have noted problems they've observed with this scheme. A common pathology is frantic searching for a victim buffer, only to find every buffer's usage_count at 5. It may take multiple revolutions of the clock hand before a victim buffer is found, as usage_count is decremented for each and every buffer. Relatedly, BufFreelistLock contention is considered a serious bottleneck [1].

I think a very real problem may be that approximating an LRU is bad simply because an actual LRU is bad, though not because of the problems that CLOCK/our clock sweep algorithm, to its great credit, ameliorates. I don't think that we need a trade-off between an LRU and an MRU, as Atri once suggested [2]; rather, I think we need a trade-off between an LRU and an LFU (very loosely speaking). Something like a pure LRU can make very bad choices for database workloads. This is described extensively in the literature.

As I recently remarked upon for unrelated reasons [3], B+Trees perform excellently when partially cached. There is a remarkable asymmetry between how many pages we must have cached and how much I/O is avoided, particularly the random I/O that is characteristic of scanning a fully uncached B-Tree. Approximately 99% of pages in our nbtree structures can be expected to be leaf pages in many common cases. The very wide fan-out of B-Tree structures makes this possible (with a few hundred index tuples per internal page, a three-level index has on the order of one internal page for every few hundred leaf pages). A lot of material on the subject doesn't emphasize this basic aspect strongly enough, in my view. It also bears mentioning that B-Tree pages are, in an important sense, far denser than heap pages. Let's leave aside inner/root pages, though, because in the primary index of a tpc-b table they're so dramatically useful that they'll always be cached by any non-terrible algorithm.
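For reference, the victim-selection logic at the heart of our clock sweep looks roughly like the following. This is a simplified paraphrase of the relevant part of StrategyGetBuffer() in src/backend/storage/buffer/freelist.c; the freelist, BufFreelistLock, and the buffer header spinlock are omitted, and the wrapper function name is mine:

    /*
     * Simplified sketch of clock sweep victim selection, loosely based on
     * StrategyGetBuffer().  Only meant to show how usage_count interacts
     * with the sweeping clock hand; locking and pin bookkeeping omitted.
     */
    static volatile BufferDesc *
    clock_sweep_victim(void)
    {
        volatile BufferDesc *buf;
        int         trycounter = NBuffers;

        for (;;)
        {
            buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];

            /* advance the clock hand, wrapping at the end of the array */
            if (++StrategyControl->nextVictimBuffer >= NBuffers)
                StrategyControl->nextVictimBuffer = 0;

            if (buf->refcount == 0)
            {
                if (buf->usage_count > 0)
                {
                    /* recently used: spare it this time, but decay its count */
                    buf->usage_count--;
                    trycounter = NBuffers;
                }
                else
                {
                    /* usage_count has decayed to zero: evict this buffer */
                    return buf;
                }
            }
            else if (--trycounter == 0)
            {
                /* every buffer appears pinned: give up */
                elog(ERROR, "no unpinned buffers available");
            }
        }
    }

The point to notice is that a buffer's entire history is compressed into a counter in the range 0..5 that is bumped on every pin and decayed on every pass of the hand; there is no notion of how frequently, in wall-clock terms, the buffer has actually been referenced.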
It beggars belief that the still relatively dense (and consequently *popular*) B+Tree leaf pages get so little credit for being of such long-term utility (in the view of our clock sweep algorithm as implemented). The algorithm has what could loosely be described as an excessively short-term perspective. There is clearly a better balance to be had here. I don't think the answer is to have the B-Tree code give its pages some special treatment that makes them harder to evict, although I will admit that I briefly entertained the idea.

I am aware of the false start that we had with ARC back in 2005. I believe that effort tried to address some of these problems, and I am mindful of the pitfalls there. I'm inclined to think that the decision to create a CLOCK variant in preference to ARC was the right one at the time, because the clock sweep really is at a considerable advantage with regard to lock contention.

The 1993 paper "The LRU-K Page Replacement Algorithm For Database Disk Buffering" [4] proposes what is (roughly speaking) an adaptive hybrid of LRU and LFU. Consider Example 1.1 from that paper; it describes a very simple scenario in which it's possible to have slightly more heap pages cached than B-Tree pages. The scenario is essentially a "pgbench -S" workload, comparable to the simplistic tpc-a; it's nothing more than that. The authors aren't clear on this, but I assumed a uniform distribution (IIRC, the 2Q paper, which was published a year or two later, explicitly assumes this of the very same LRU-K/LRU-2 example).

It is argued within that example, quite cogently in my opinion, that it's very bad that a simple LRU cache will see just under 50% of buffers used to cache B-Tree leaf pages, while over 50% are used for "data" (heap) pages. In actual fact, it is preferable to almost exclusively cache B-Tree pages. Using 50% of the cache on B-Tree pages when it is optimal to cache a number approaching 100% is a pretty large disparity, particularly for such a simple workload. This is literally the furthest possible thing from a contrived corner case.

I believe that it isn't hard to get the clock sweep to do just the same as predicted in the '93 paper, even with a "usage_count". It is nothing more than an LRU approximation, or at least that's what the relevant comments say. I did some experiments here, but it probably isn't worth sharing the raw data, given what I already have.

The example surrounding caching B-Tree leaf pages in the paper draws attention to a particularly acute pathology, but there are probably others. Leaf pages are in many representative cases far denser than heap pages, and therefore can be expected to be accessed far more frequently to service queries. Our current failure to credit them in a way that reflects the frequency with which they're accessed is something I suggest thinking long and hard about.

Has anyone thought about this in the last few years? I know that Tom examined the LRU-K paper back in 2000 [5], but was discouraged by some kind of contention or CPU overhead (although he did say he intended to revisit the question). Obviously a lot has changed in the past 14 years, particularly with regard to CPU characteristics.

Anyway, having gone through the requisite background information, I'll get back to the subject of the pgbench-tools tpc-b benchmark that I started out with, and what I've actually done in the attached patch to improve matters. Let me preface this by saying: this is a rough prototype.
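To make the LRU-2 flavor of the LRU-K idea discussed above concrete, here is a toy sketch of my own (it is not taken from the paper, and it is not what the attached patch does): each page remembers its two most recent reference times, and the replacement policy evicts the page whose second-most-recent reference is oldest.

    /*
     * Toy illustration of LRU-2 (the K = 2 case of LRU-K): track the two
     * most recent reference times of each page, and evict the page whose
     * second-most-recent reference is oldest.  A page referenced only once
     * has no second reference (penultimate_ref == 0), so it looks
     * infinitely old and is evicted first; that is what keeps pages seen
     * only once from crowding out genuinely popular pages.
     */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct
    {
        uint64_t    last_ref;           /* most recent reference time */
        uint64_t    penultimate_ref;    /* second most recent, 0 if none */
    } Lru2Stats;

    static void
    lru2_reference(Lru2Stats *page, uint64_t now)
    {
        page->penultimate_ref = page->last_ref;
        page->last_ref = now;
    }

    /* does 'a' make a better eviction victim than 'b'? */
    static bool
    lru2_prefer_evict(const Lru2Stats *a, const Lru2Stats *b)
    {
        return a->penultimate_ref < b->penultimate_ref;
    }

    /* pick the LRU-2 victim among a set of (unpinned) candidate pages */
    static int
    lru2_choose_victim(const Lru2Stats *pages, int npages)
    {
        int         victim = 0;
        int         i;

        for (i = 1; i < npages; i++)
        {
            if (lru2_prefer_evict(&pages[i], &pages[victim]))
                victim = i;
        }
        return victim;
    }

The paper itself is considerably more careful (correlated references within a short window are collapsed, for instance), but the basic shape is what matters for the argument here; note the resemblance between that refinement and the throttling described next.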
The way I add a field to the buffer descriptor struct clearly isn't going to fly, since we have every reason to believe that the struct's size is itself performance-critical in other scenarios (just look at Andres's recent work on the alignment of these structures in memory). Actually, it probably matters a lot in this scenario too, just not enough to mask the general outcome. I've arrived at this prototype through plenty of principled theorizing, and a bit of unprincipled frobbing. I'm a big fan of building simple prototypes to test theories. In the interest of reproducibility I have not attempted to clean up what I have here just yet, in case I accidentally invalidate the benchmark presented. The prototype patch:

1) Throttles incrementation of usage_count temporally. It becomes impossible to increment usage_count for any given buffer more frequently than once every 3 seconds, while decrementing usage_count is totally unaffected. This is thought to give the algorithm a short-to-medium-term "appreciation" of the frequency of access. It also prevents very closely spaced increments in usage_count due to pinning a buffer twice in the same query or transaction, just because the code in the executor happens to be accidentally structured such that that happens (as when inserting two tuples within a single INSERT DML query), or whatever. Clearly the accounts table is accessed twice in succession in the same transaction by our tpc-b script, creating a sort of misrepresentation of buffer popularity that is likely in and of itself undesirable.

2) Has usage_count saturate at 10 (i.e. BM_MAX_USAGE_COUNT = 10), not 5 as before. This gives us a more granular representation of the popularity/usefulness of each buffer, which I believe spans a dramatically large spectrum (i.e. my guess is that testing will show I didn't go far enough). This step on its own would be assumed extremely counter-productive by those in the know, but I believe that the other measures ameliorate the downsides. I could be wrong about how true that is in other cases, but then the case helped here isn't what you'd call a narrow benchmark. It's the pgbench default (although I do use unlogged tables, in part because the I/O subsystem on this server is quite under-powered, even though it has plenty of memory).

3) Has buffer usage_count start at 3. I also tried 4 (and 1, the current value within master), but settled on 3 for now. Seemingly this makes a very large difference compared to only doing 1) and 2), since it gives each page a fair shot at proving its usefulness. Presumably the throttling in 1) makes frantic buffer scanning much less prevalent, since the clock sweep has the upper hand, so to speak. A tug of war between the clock sweep and backends pinning buffers can be disastrous, but that seems much less likely than before. (A rough sketch of the shape of all three changes appears below.)

Semi-magical constant amounts of time are something you see surprisingly often in the research. Probably most notably, the "five-minute rule" [6] argues that one should only cache randomly accessed pages that are re-used every 5 minutes or less (it's actually a bit more complicated these days, but not much). This rule is a mainstay of database caching theory.

Anyway, I'm not sure whether or not this delay should be exposed as a GUC. I lean towards "no". The amount of evidence it takes to validate something like this is generally enormous. Validating the just-in-time bgwriter strategy was the original reason that Greg Smith wrote pgbench-tools. Still, I'm confident that I've identified a real problem.
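To be concrete about the three changes above, their combined shape is roughly as follows. This is a self-contained sketch rather than the attached patch itself: the stand-in struct, the field name last_inc, and the use of time(2) are purely illustrative, and a real version would need a cheaper clock source, as well as having to worry about widening the performance-critical buffer descriptors (see above).

    /*
     * Sketch of the three prototype changes.  FakeBufferDesc stands in for
     * the relevant parts of BufferDesc; last_inc is an illustrative field.
     */
    #include <time.h>

    #define BM_MAX_USAGE_COUNT      10  /* change 2): was 5 */
    #define BUF_INITIAL_USAGE_COUNT 3   /* change 3): was 1 */
    #define USAGE_INC_DELAY_SECS    3   /* change 1): min seconds between bumps */

    typedef struct
    {
        int         usage_count;
        time_t      last_inc;       /* time of last usage_count increment */
    } FakeBufferDesc;

    /* change 3): what BufferAlloc() would do when a buffer is newly filled */
    static void
    sketch_init_usage(FakeBufferDesc *buf)
    {
        buf->usage_count = BUF_INITIAL_USAGE_COUNT;
        buf->last_inc = time(NULL);
    }

    /* changes 1) and 2): what PinBuffer() would do on each pin */
    static void
    sketch_bump_usage(FakeBufferDesc *buf)
    {
        time_t      now = time(NULL);

        if (buf->usage_count < BM_MAX_USAGE_COUNT &&
            now - buf->last_inc >= USAGE_INC_DELAY_SECS)
        {
            buf->usage_count++;
            buf->last_inc = now;
        }
        /* the decrement in the clock sweep itself is left unchanged */
    }

The decrement path is deliberately untouched, so a genuinely unpopular page still decays to zero and gets evicted within a few revolutions of the clock hand.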
The big picture here is that pages get a fair go at proving their worth, while certain pages can still be recognized as being of dramatically higher long-term utility. Generally speaking, a pure LFU is an inappropriate caching algorithm for almost all use-cases, because it never forgets anything about a page until it forgets everything about it. There is a balance to be had here, in terms of the extent to which we allow pages to "rest on their laurels". I wouldn't be surprised to learn I have the balance wrong right now.

pgbench-tools benchmark interpretation
======================================

I also attach a LibreOffice spreadsheet comparing hit rates for each table (and its indexes) for each run that you see in the pgbench-tools report (except the final do-over of the master baseline). I built this with the help of a small pgbench-tools hack, resetting the pg_statio* views and measuring hit rates per pgbench invocation. The hit rates shown would probably be a lot more interesting if I'd used a rate limit, so that the amount of work performed was consistent across test sets (that is, across variants of the patch, and master). Greg Smith is very enthusiastic about how much further insight can be had using that pgbench feature, and I'd say he's probably right to be. Right now, in terms of hit rate, you mostly see a slightly smaller one for indexes, and a considerably larger one for heap buffers, as compared to the master baseline.

If you run down the individual runs in the pgbench-tools report, and consider what actually happens and when, you tend to see a degradation in performance over successive runs at different client counts for certain cases tested. It's hard to be sure, but I think that might be because earlier runs have a big advantage in that the index is well cached (pgbench-tools vacuums at the start of each run except the first, and VACUUM reverses the situation seen initially, since vacuuming kills heap tuples last, whereas the initial data load creates the index last). In contrast, the impressive performance of the patch (with all 3 of the above measures implemented) shows more consistent throughput even with that noise, which seems to support my theories about what is going on here. Leaving aside cache effectiveness as such, I suspect that in general we currently pay a high price for all too frequently having buffers fall out of shared_buffers into the OS cache, and fall back in again.

Having arrived at the conclusion that these optimizations were worth sharing here, I decided to "do over" the original master baseline (Test Set 1). I ended up with a result (Test Set 5) that is actually quite different from the original, without it being at all clear whether it was better or worse than the first baseline (it was probably worse). This might call into question the accuracy of my results. However, to see what's really going on here, it is necessary to drill down to the individual TPS/latency figures for individual pgbench invocations. That's the real story here.

In general, as I said, the I/O system here is quite under-specified for this workload. This is dedicated physical hardware, but it happens to be what was immediately available to me. There are a couple of SATA 7200 rpm disks in RAID 1. The system specification, as advertised by my hosting provider, is detailed here: https://www.hetzner.de/en/hosting/produkte_rootserver/ex40 . As always with pgbench-tools, you can drill down to see more about each test (including kernel vm settings, etc).
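Going back to the spreadsheet for a moment, to spell out what "hit rate" means there: these are the usual shared_buffers hit ratios taken from the pg_statio_user_tables counters for each table, with the counters reset before each pgbench invocation, roughly:

    heap hit rate  = heap_blks_hit / (heap_blks_hit + heap_blks_read)
    index hit rate = idx_blks_hit / (idx_blks_hit + idx_blks_read)

(A "hit" means the block was already in shared_buffers; a "read" means it had to be requested from the kernel, which may of course still be satisfied from the OS page cache.)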
Inevitably, with this disk setup (which I imagine is particularly bad with random I/O) and with so much memory, it's only a matter of time before things get ugly, even if there is an initial high-performance burst (even on master) before we have to face the music. It seems like the pain is felt at slightly different times between Test Set 1 and Test Set 5 (i.e. the two tests of master). With either of those two test sets, if you drill down to see the moment-to-moment throughput and latency, you'll find tests from each test set where both latency and throughput were on the floor for as long as a couple of minutes at a time, after which there is a sudden recovery. We're subject to the vagaries of kernel I/O scheduling to a much greater extent than with the good patched test sets, it seems. What is seen with master looks very much like sync-phase checkpoint spikes. Once or twice, things are consistently very poor for almost an entire master invocation of pgbench. In general things are very inconsistent, although to be fair it's possible that at its best master has greater throughput than patched for brief moments (of course, the buffer descriptor bloating that I have yet to deal with may well be all that is to blame here).

If you look at the test sets that cover this patch (with all the tricks applied), there are pretty good figures throughout. You can kind of see the pain towards the end, but there are no dramatic falls in responsiveness for minutes at a time. There are latency spikes, but they're *far* shorter, and much better hidden. Leaving aside the individual multi-minute spikes, at the macro level (all client counts for all runs) average latency is about half of what is seen on master.

Does anyone have any high-end hardware that they could use to test this out?

[1] http://www.postgresql.org/message-id/CA+TgmobJm0GHk58nUPRQHCGwY25n1DCkU4ku9aQeczZEjiz9mQ@mail.gmail.com
[2] http://www.postgresql.org/message-id/CAOeZVic4HikhmzVD=ZP4JY9g8PgpyiQQOXOELWP=kR+=H1Frgg@mail.gmail.com
[3] http://www.postgresql.org/message-id/CAM3SWZTcXrdDZSpA11qZXiyo4_jtxwjaNdZpnY54yjzq7d64=A@mail.gmail.com
[4] http://www.cs.cmu.edu/~christos/courses/721-resources/p297-o_neil.pdf
[5] http://www.postgresql.org/message-id/1601.967421129@sss.pgh.pa.us
[6] http://en.wikipedia.org/wiki/Five-minute_rule

--
Peter Geoghegan