Re: double writes using "double-write buffer" approach [WIP] - Mailing list pgsql-hackers
| From | Robert Haas |
|---|---|
| Subject | Re: double writes using "double-write buffer" approach [WIP] |
| Date | |
| Msg-id | CA+TgmobCJEVmnWGam7EmWAeZ5zGWYFN4QmC11Ha6JzdeTdX3aQ@mail.gmail.com |
| In response to | Re: double writes using "double-write buffer" approach [WIP] (Dan Scales <scales@vmware.com>) |
| Responses | Re: double writes using "double-write buffer" approach [WIP] |
| List | pgsql-hackers |
On Fri, Feb 3, 2012 at 3:14 PM, Dan Scales <scales@vmware.com> wrote:
> Thanks for the feedback! I think you make a good point about the small
> size of dirty data in the OS cache. I think what you can say about this
> double-write patch is that it will not work well for configurations that
> have a small Postgres cache and a large OS cache, since every write from
> the Postgres cache requires double-writes and an fsync.

The general guidance for setting shared_buffers these days is 25% of RAM up to a maximum of 8GB, so the configuration that you're describing as not optimal for this patch is the one normally used when running PostgreSQL. I've run across several cases where larger values of shared_buffers are a huge win, because the entire working set can then be accommodated in shared_buffers. But it's certainly not the case that all working sets fit. And in this case, I think that's beside the point anyway. I had shared_buffers set to 8GB on a machine with much more memory than that, but the database created by pgbench -i -s 10 is about 156 MB, so the problem isn't that there is too little PostgreSQL cache available. The entire database fits in shared_buffers, with most of it left over. However, because of the BufferAccessStrategy stuff, pages start to get forced out to the OS pretty quickly.

Of course, we could disable the BufferAccessStrategy stuff when double_writes is in use, but bear in mind that the reason we have it in the first place is to prevent cache thrashing effects. It would be imprudent of us to throw that out the window without replacing it with something else that would provide similar protection. And even if we did, that would just delay the day of reckoning. You'd be able to blast through and dirty the entirety of shared_buffers at top speed, but then as soon as you started replacing pages, performance would slow to an utter crawl, just as it did here; you'd only need a bigger scale factor to trigger the problem.

The more general point here is that there are MANY aspects of PostgreSQL's design that assume that shared_buffers accounts for a relatively small percentage of system memory. Here's another one: we assume that backends that need temporary memory for sorts and hashes (i.e. work_mem) can just allocate it from the OS. If we were to start recommending setting shared_buffers to large percentages of the available memory, we'd probably have to rethink that. Most likely, we'd need some kind of in-core mechanism for allocating temporary memory from the shared memory segment. And here's yet another one: we assume that it is better to recycle old WAL files and overwrite the contents rather than create new, empty ones, because we assume that the pages from the old files may still be present in the OS cache. We also rely on the fact that an evicted CLOG page can be pulled back in quickly without (in most cases) a disk access. We also rely on shared_buffers not being too large to avoid walloping the I/O controller too hard at checkpoint time - which is forcing some people to set shared_buffers much smaller than would otherwise be ideal. In other words, even if setting shared_buffers to most of the available system memory would fix the problem I mentioned, it would create a whole bunch of new ones, many of them non-trivial. It may be a good idea to think about what we'd need to do to work efficiently in that sort of configuration, but there is going to be a very large amount of thinking, testing, and engineering that has to be done to make it a reality.
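To make the BufferAccessStrategy point concrete, here is a toy C simulation (not PostgreSQL code; the buffer and ring sizes are made-up illustrative values) of how a small per-backend ring forces dirty pages out to the OS even while most of a large shared_buffers sits empty:

```c
#include <stdio.h>

#define NBUFFERS  1048576   /* a "large" shared_buffers: 1M x 8kB = 8GB (illustrative) */
#define RING_SIZE 2048      /* small per-backend ring used for bulk writes (illustrative) */
#define NPAGES    100000    /* pages dirtied by a bulk load */

int main(void)
{
    int  filled = 0;            /* ring slots used so far */
    long forced_writes = 0;     /* dirty pages pushed out to the OS */

    for (int page = 0; page < NPAGES; page++)
    {
        if (filled < RING_SIZE)
            filled++;           /* first lap: ring still has free slots */
        else
            forced_writes++;    /* ring wrapped: the oldest dirty page must be
                                   written out before its slot can be reused,
                                   even though most of NBUFFERS is empty */
    }

    printf("dirtied %d pages; forced %ld writes with %d buffers unused\n",
           NPAGES, forced_writes, NBUFFERS - RING_SIZE);
    return 0;
}
```

With these numbers the 2048-slot ring wraps almost immediately, so nearly every page of the bulk load is pushed out to the OS despite more than a million free buffers.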
There's another issue here, too. The idea that we're going to write data to the double-write buffer only when we decide to evict the pages strikes me as a bad one. We ought to proactively start dumping pages to the double-write area as soon as they're dirtied, and fsync them after every N pages, so that by the time we need to evict some page that requires a double-write, it's already durably on disk in the double-write buffer, and we can do the real write without having to wait.

It's likely that, to make this perform acceptably for bulk loads, you'll need the writes to the double-write buffer and the fsyncs of that buffer to be done by separate processes, so that one backend (the background writer, perhaps) can continue spooling additional pages to the double-write files while some other process (a new auxiliary process?) fsyncs the ones that are already full. Along with that, the page replacement algorithm probably needs to be adjusted to avoid, like the plague, evicting pages whose double-write has not yet finished, even to the extent of allowing the BufferAccessStrategy rings to grow if the double-writes can't be finished before the ring wraps around.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
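For illustration only (this is not code from the double-write patch or from PostgreSQL itself), here is a minimal single-process C sketch of the "spool as soon as dirtied, fsync every N pages" scheme described above. The file name double_write.buf, the 8kB page size, and the batch size of 32 are assumptions chosen for the example; in the proposal, the fsyncs would be handed off to a separate process so the writer can keep spooling:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE   8192    /* assumed page size for the sketch */
#define FSYNC_EVERY 32      /* fsync the double-write file after N pages */

static int dw_fd = -1;
static int pages_since_fsync = 0;

/* Called when a page is dirtied: copy it into the double-write file right
 * away, so that by eviction time it is usually already durable there. */
static void double_write_spool(const char *page)
{
    if (dw_fd < 0)
        dw_fd = open("double_write.buf", O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (dw_fd < 0 || write(dw_fd, page, PAGE_SIZE) != PAGE_SIZE)
    {
        perror("double-write spool");
        exit(1);
    }
    if (++pages_since_fsync >= FSYNC_EVERY)
    {
        /* In the multi-process scheme sketched above, this fsync would be
         * done by a separate auxiliary process, not by the writer itself. */
        if (fsync(dw_fd) != 0)
        {
            perror("fsync double-write buffer");
            exit(1);
        }
        pages_since_fsync = 0;
    }
}

int main(void)
{
    char page[PAGE_SIZE];

    memset(page, 'x', sizeof(page));
    for (int i = 0; i < 100; i++)
        double_write_spool(page);   /* pretend 100 pages were dirtied */

    if (dw_fd >= 0)
    {
        fsync(dw_fd);               /* flush any partial final batch */
        close(dw_fd);
    }
    return 0;
}
```

The point of the batching is that the fsync cost is amortized over FSYNC_EVERY pages, so that by the time a page actually has to be evicted, its copy in the double-write file is usually already durable and the real write can proceed without waiting.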