Re: checkpointer continuous flushing - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: checkpointer continuous flushing
Msg-id: 20151111073857.GX32209@alap3.anarazel.de
In response to: Re: checkpointer continuous flushing (Fabien COELHO <coelho@cri.ensmp.fr>)
List: pgsql-hackers
On 2015-09-10 17:15:26 +0200, Fabien COELHO wrote:
> Here is a v13, which is just a rebase after 1aba62ec.

And here's v14. It's not something entirely ready. A lot of details have changed; I unfortunately don't remember them all. But there are more important things than the details of the patch.

I've played *a lot* with this patch, and I found a bunch of issues:

1) The FileFlushContext infrastructure isn't actually correct. There are two problems. First, using the raw 'fd' number to reference a to-be-flushed file isn't meaningful: if lots of files are open, fds get reused within fd.c. That part is easily fixed by referencing a File instead of the fd. The bigger problem is that the infrastructure doesn't deal with files being closed: smgr invalidations, which aren't that hard to trigger, can cause the smgr handle, and thus the file, to be closed. I think this means that the entire flushing infrastructure actually needs to be hoisted up, onto the smgr/md level.

2) I noticed that sync_file_range() blocked far more often than I'd expected. Reading the kernel code, that turned out to be caused by a pessimization introduced years ago: in many situations SYNC_FILE_RANGE_WRITE waited for the writes. A fix for this will be in the 4.4 kernel.

3) I found that latency wasn't improved much for workloads that are significantly bigger than shared buffers. The problem here is that, so far, neither bgwriter nor the backends have done sync_file_range() calls. That means the old problem of having gigabytes of dirty data that periodically get flushed out still exists. Having these processes do flushes mostly attacks that problem.

Benchmarking revealed that for workloads where the hot data set mostly fits into shared buffers, flushing and sorting is anywhere from a small to a massive improvement, both in throughput and latency. That holds even without the kernel fix from 2), although fixing that improves things further.
What I did not expect, and what confounded me for a long while, is that for workloads where the hot data set does *NOT* fit into shared buffers, sorting often led to a noticeable reduction in throughput - up to 30%. The performance was still much more regular than before, i.e. no more multi-second periods without any transactions happening.

By now I think I know what's going on: before the sorting portion of the patch, the write loop in BufferSync() starts at the current clock hand, by using StrategySyncStart(). After the sorting that obviously doesn't happen anymore - buffers are written in their sort order. By starting at the current clock hand and moving on from there, the checkpointer basically makes it less likely that victim buffers need to be written either by the backends themselves or by bgwriter. That means that the sorted checkpoint writes can, indirectly, increase the number of unsorted writes by other processes :(

My benchmarking suggests that that effect is larger the shorter the checkpoint timeout is. That seems to intuitively make sense, given the above explanation attempt: if the checkpoint takes longer, the clock hand will almost certainly soon overtake the checkpoint's 'implicit' hand.

I'm not sure if we can really do anything about this problem. While I'm pretty jet lagged, I still spent a fair amount of time thinking about it, and it seems to suggest that we need to bring back the setting to enable/disable sorting :(

What I think needs to happen next with the patch is:

1) Hoist the FileFlushContext stuff up into the smgr layer, carefully handling the issue of smgr invalidations.

2) Replace the boolean checkpoint_flush_to_disk GUC with a list GUC that can later contain multiple elements like checkpoint, bgwriter, backends, ddl, bulk-writes. That seems better than adding separate GUCs for each. Then make the flush locations in the patch configurable using that.
3) I think we should remove the sort timing from the checkpoint logging before commit; it'll always be pretty short.

Greetings,

Andres Freund