Re: checkpointer continuous flushing - Mailing list pgsql-hackers

| From | Andres Freund |
| --- | --- |
| Subject | Re: checkpointer continuous flushing |
| Date | 2015-08-17 |
| Msg-id | 20150817151306.GB10786@awork2.anarazel.de |
| In response to | Re: checkpointer continuous flushing (Fabien COELHO <coelho@cri.ensmp.fr>) |
| Responses | Re: checkpointer continuous flushing |
| List | pgsql-hackers |
On 2015-08-17 15:21:22 +0200, Fabien COELHO wrote:
> My current thinking is "maybe yes, maybe no" :-), as it may depend on the
> OS implementation of posix_fadvise, so it may differ between OSes.

As long as fadvise has no 'undirty' option, I don't see how that problem
goes away. You're telling the OS to throw the buffer away, so unless it
ignores the request, that'll have consequences when you read the page back
in.

> This is a reason why I think that flushing should be kept a GUC, even if
> the sort GUC is removed and always on. The sync_file_range implementation
> is clearly always very beneficial on Linux, and posix_fadvise may or may
> not induce good behavior depending on the underlying system.

That's certainly an argument.

> This is also a reason why the default value for the flush GUC is
> currently set to false in the patch. The documentation should advise
> turning it on for Linux and testing otherwise. Or, if Linux is assumed to
> be a common host, maybe set the default to on and suggest that on some
> systems it may be better to have it off.

I'd say it should then be an OS-specific default. No point in making
people work for it needlessly on Linux and/or elsewhere.

> (Another reason to keep it "off" is that I'm not sure what happens with
> such HD flushing features on virtual servers.)

I don't see how that matters. Either the host will entirely ignore
flushing, in which case the sync_file_range and the fsync won't cost much,
or fsync will be honored, in which case the pre-flushing is helpful.

> Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD
> host and it was as bad as on Linux (namely the database and even the box
> were offline for long minutes...), and if you can avoid that, having to
> read back some data may not be that bad a down payment.

I don't see how that alleviates my fear. Sure, the latency for many
workloads will be better, but I don't see how that argument says anything
about the reads. And we'll not just use this in cases where it'd be
beneficial...

> The issue is largely mitigated if the data is not removed from
> shared_buffers, because the OS buffer is then just a copy of already-held
> data. What I would do on such systems is increase shared_buffers and keep
> flushing on; that is, count less on the system cache and more on
> postgres' own cache.

That doesn't work that well, for a bunch of reasons. For one, it's
completely non-adaptive. With the OS's page cache you can rely on free
memory being used for caching *and* on it being available should a query
or another program need lots of memory.

> Overall, I'm not convinced that the practice of relying on the OS cache
> is a good one, given what it does with it, at least on Linux.

The alternatives aren't super realistic near-term, though. Using direct IO
efficiently across the set of operating systems we support is *hard*. It's
more or less trivial to hack pg up to use direct IO for
relations/shared_buffers, but it'll perform utterly horribly in many, many
cases. To pick one thing out: without the OS buffering writes, any write
will have to wait for the disks instead of being asynchronous. That'll
make writes performed by backends a massive bottleneck.

> Now, if someone could provide a dedicated box with posix_fadvise (say
> FreeBSD, maybe others...) for testing, that would allow providing data
> instead of speculating... and then maybe deciding to change the default
> value.

Testing, as an approximation, how it turns out to work on Linux would be a
good step.

Greetings,

Andres Freund
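The core of the exchange above is the semantic gap between the two flushing primitives. A minimal sketch of the difference, assuming a hypothetical wrapper name (`pg_flush_hint` is illustrative, not the patch's actual code):

```c
#define _GNU_SOURCE             /* for sync_file_range() on Linux */
#include <fcntl.h>

/*
 * Hypothetical wrapper contrasting the two primitives the thread is
 * about -- a sketch, not the patch under discussion.
 */
static void
pg_flush_hint(int fd, off_t offset, off_t nbytes)
{
#if defined(__linux__)
	/*
	 * Start asynchronous writeback of the dirty range.  The pages stay
	 * in the OS page cache, so re-reading the same blocks stays cheap.
	 */
	(void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
#else
	/*
	 * Portable alternative: POSIX_FADV_DONTNEED typically writes dirty
	 * pages back, but it also asks the kernel to evict them.  There is
	 * no "undirty but keep cached" advice -- exactly the read-back cost
	 * Andres is worried about.
	 */
	(void) posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
#endif
}
```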
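The OS-specific default Andres suggests could be expressed at compile time along these lines; the macro name is made up for illustration:

```c
#include <stdbool.h>

/* Hypothetical compile-time default for the flush GUC. */
#if defined(__linux__)
#define DEFAULT_CHECKPOINT_FLUSH true   /* sync_file_range() is a clear win */
#else
#define DEFAULT_CHECKPOINT_FLUSH false  /* fadvise may evict still-needed pages */
#endif
```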
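The direct IO point is also visible at the syscall level: with O_DIRECT, a write() only returns once the device has accepted the data, whereas a buffered write returns as soon as it hits the page cache. A self-contained sketch under that assumption (file name arbitrary, error handling minimal):

```c
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	void   *buf;
	int		fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0600);

	/* O_DIRECT requires the buffer, offset and length to be aligned. */
	if (fd < 0 || posix_memalign(&buf, 4096, 4096) != 0)
		return 1;
	memset(buf, 0, 4096);

	/*
	 * Bypassing the page cache, this write blocks until the device has
	 * the data, instead of returning once the kernel has buffered it.
	 * Issued from a backend, every such write waits on the disk.
	 */
	(void) write(fd, buf, 4096);

	close(fd);
	free(buf);
	return 0;
}
```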