Re: Clock sweep not caching enough B-Tree leaf pages? - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Clock sweep not caching enough B-Tree leaf pages? |
Date | |
Msg-id | CAM3SWZTCNobsNvEyiipT1u2xWrOOt-TMRzqyPV=s-x6MG46KJA@mail.gmail.com |
In response to | Re: Clock sweep not caching enough B-Tree leaf pages? (Peter Geoghegan <pg@heroku.com>) |
Responses | Re: Clock sweep not caching enough B-Tree leaf pages? |
List | pgsql-hackers |
Here is a benchmark that is similar to my earlier one, but with a rate limit of 125 tps, to help us better characterize how the prototype patch helps performance: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay-limit/

Again, these are 15 minute runs with unlogged tables at multiple client counts, and scale 5,000. Every test run should have managed to hit that limit, but in the case of master one test did not. I have included vmstat, iostat and meminfo OS instrumentation this time around, which is really interesting for this particular limit-based benchmark.

The prototype patch tested here is a slight refinement of my earlier prototype. Apart from reducing the number of gettimeofday() calls, and moving them out of the critical path, I increased the initial usage_count value to 6, and set BM_MAX_USAGE_COUNT to 30. I guess I couldn't resist the temptation to tweak things, which I actually did very little of before publishing my initial results. This helps, but the additional benefit isn't huge.

The benchmark results show that master cannot even meet the 125 tps limit in 1 test out of 9. More interestingly, the background writer consistently cleans about 10,000 buffers per test run with the patch, while buffers are allocated at a very consistent rate of around 262,000 per patched run. Leaving aside the first test run, the variance in the number cleaned between patched tests is tiny - just a few hundred buffers. In contrast, master shows enormous variance. In just over half of the tests, the background writer does not clean even a single buffer; then, in 2 tests out of 9, it cleans an enormous ~350,000 buffers. The second time that happens, master fails to meet the 125 tps limit (albeit with only one client).

If you drill down to individual test runs, a similar pattern is evident. You'll now find operating system information (meminfo dirty memory) graphed there. Most of the time, master does not hold more than 1,000 kB of dirty memory at a time; once or twice it sits at 0 kB for multiple minutes. However, during the test runs with huge spikes in background-writer-cleaned pages, we also see huge spikes in the total amount of dirty memory (and correlated huge spikes in latency), reaching highs of ~700,000 kB at one point.

In contrast, the patched tests show very consistent amounts of dirty memory. Per test, it almost always tops out at 4,000 kB - 6,000 kB (there is a single 12,000 kB spike, though). There is a consistently distinct zig-zag pattern to the dirty memory graph with the patched tests, regardless of client count or where checkpoints occur. Master shows mountains and valleys for those two particularly problematic tests, correlating with a panicked background writer's aggressive feedback loop. Master also shows less rhythmic zig-zag patterns that peak at only about 600 kB - 1,000 kB for the entire duration of many individual test runs.

Perhaps most notably, average and worst-case latency are far improved with the patch: on average, less than half of master with 1 client, and less than a quarter of master with 32 clients. I think that the rate limiting feature of pgbench is really useful for characterizing how work like this improves performance. I see a far smoother and more consistent pattern of I/O that superficially looks like Postgres is cooperating with the operating system much more than it does on the baseline.
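To make the usage_count tweak concrete, here is a toy, stand-alone C simulation of the clock sweep using those constants. To be clear, this is not the actual bufmgr/freelist code or the patch itself, just an illustration (with a made-up pool size and access pattern) of why a higher starting value and cap let frequently used buffers survive more complete sweeps:

```c
/*
 * Toy clock sweep simulation, using the prototype's constants mentioned
 * above (initial usage_count of 6, BM_MAX_USAGE_COUNT of 30).  Not the real
 * buffer manager code: no pins, no locks, no freelist, pool size made up.
 */
#include <stdio.h>

#define NUM_BUFFERS			8	/* tiny pool, purely for the demo */
#define BM_MAX_USAGE_COUNT	30	/* prototype value; stock value is 5 */
#define INITIAL_USAGE_COUNT	6	/* prototype value; stock value is 1 */

typedef struct
{
	int			buf_id;
	int			usage_count;
} FakeBufferDesc;

static FakeBufferDesc buffers[NUM_BUFFERS];
static int	clock_hand = 0;

/* A buffer access bumps usage_count, saturating at the cap */
static void
access_buffer(int buf_id)
{
	if (buffers[buf_id].usage_count < BM_MAX_USAGE_COUNT)
		buffers[buf_id].usage_count++;
}

/*
 * Clock sweep: advance the hand, decrementing usage_count as we go, and
 * evict the first buffer found with usage_count == 0.
 */
static int
clock_sweep_victim(void)
{
	for (;;)
	{
		FakeBufferDesc *buf = &buffers[clock_hand];

		clock_hand = (clock_hand + 1) % NUM_BUFFERS;

		if (buf->usage_count == 0)
			return buf->buf_id; /* victim found */

		buf->usage_count--;
	}
}

int
main(void)
{
	int			i;

	/* "Allocate" every buffer: each starts at the initial usage_count */
	for (i = 0; i < NUM_BUFFERS; i++)
	{
		buffers[i].buf_id = i;
		buffers[i].usage_count = INITIAL_USAGE_COUNT;
	}

	/* Simulate one hot buffer that keeps getting referenced */
	for (i = 0; i < 40; i++)
		access_buffer(3);

	/* Evict a few buffers; the hot buffer (3) is never chosen */
	for (i = 0; i < 3; i++)
		printf("victim: buffer %d\n", clock_sweep_victim());

	return 0;
}
```

The point is only that with a starting value of 6 and a cap of 30, a buffer has to go unreferenced for many complete sweeps before it becomes a victim, rather than for at most five sweeps with the stock constants.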
It sort of makes sense that the operating system cache doesn't care about frequency while Postgres does. Even if the OS cache did weigh frequency, it would surely not alter the outcome very much, since OS-cached data has presumably not been accessed very frequently recently. I suspect that I've cut down on double buffering by quite a bit. I would like to come up with a simple way of measuring that, using something like pgfincore, but the available interfaces don't seem well-suited to quantifying how much of a problem double buffering is and remains. I guess calling pgfincore on the index segment files might be interesting, since shared_buffers mostly holds index pages (I've verified that using pg_buffercache).

It would be great to test this out with something involving non-uniform distributions, like Gaussian and Zipfian distributions; the LRU-K paper tests Zipfian too. The uniform distribution pgbench uses here, while interesting, doesn't tell the full story at all and is less representative of reality (TPC-C is formally required to use a non-uniform distribution [1] for some things, for example). A Gaussian distribution might show essentially the same failure to properly credit pages with frequency of access, in one additional dimension, so to speak. Someone should probably look at the TPC-C-like DBT-2, since in the past that was considered to be a problematic workload for PostgreSQL [2] due to the heavy I/O. Zipfian is a lot less sympathetic than uniform if the LRU-K paper is any indication, so if we're looking for a worst case, that's probably a good place to start. It's not obvious how you'd go about actually constructing a practical test case for either, though (generating the skewed keys themselves is the easy part; there is a rough sketch after the references below).

I should acknowledge one aspect that the benchmark shows I've clearly regressed: cleanup time (which is recorded per test) takes more than twice as long. This makes intuitive sense, though. VACUUM first creates a list of tuples to kill by scanning the heap, then goes to the indexes and kills the index entries pointing at them, and finally returns to the heap and kills the heap tuples themselves in a second pass. Clearly it is bad for VACUUM that shared_buffers mostly contains index pages when it begins. That said, I haven't actually considered the interactions with buffer access strategies here; it might well be more complicated than I've suggested.

To be thorough, I've repeated each test set. There is a "do over" for both master and patched, which serves to show how repeatable the original test sets are.

[1] Clause 2.1.6, TPC-C specification: http://www.tpc.org/tpcc/spec/tpcc_current.pdf

[2] https://wiki.postgresql.org/wiki/PgCon_2011_Developer_Meeting#DBT-2_I.2FO_Performance
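Just to illustrate the Zipfian case, here is a rough, throwaway key generator (inverse-transform sampling over the Zipf CDF) of the sort that could feed a custom test script. The key-space size and skew constant below are arbitrary placeholders, and none of this is pgbench code; the hard part remains building a realistic workload around the skewed keys, not producing them.

```c
/*
 * Throwaway Zipfian key generator: print N_SAMPLES keys drawn from a Zipf
 * distribution over 1..N_KEYS with skew ZIPF_S.  All constants are
 * placeholders; compile with something like: cc zipf.c -lm
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N_KEYS		1000	/* made-up key space */
#define ZIPF_S		1.0		/* skew parameter */
#define N_SAMPLES	20

int
main(void)
{
	double		cdf[N_KEYS + 1];
	double		norm = 0.0;
	int			k,
				i;

	/* Build the CDF from the Zipf pmf: P(k) is proportional to 1 / k^s */
	for (k = 1; k <= N_KEYS; k++)
		norm += 1.0 / pow((double) k, ZIPF_S);

	cdf[0] = 0.0;
	for (k = 1; k <= N_KEYS; k++)
		cdf[k] = cdf[k - 1] + (1.0 / pow((double) k, ZIPF_S)) / norm;

	srand(42);

	/* Inverse-transform sampling: binary-search the CDF for a uniform draw */
	for (i = 0; i < N_SAMPLES; i++)
	{
		double		u = rand() / ((double) RAND_MAX + 1.0);
		int			lo = 1,
					hi = N_KEYS;

		while (lo < hi)
		{
			int			mid = lo + (hi - lo) / 2;

			if (cdf[mid] < u)
				lo = mid + 1;
			else
				hi = mid;
		}
		/* key 1 is the hottest, key N_KEYS the coldest */
		printf("%d\n", lo);
	}
	return 0;
}
```

--
Peter Geoghegan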