WIP(!) Double Writes - Mailing list pgsql-hackers

From: David Fetter
Subject: WIP(!) Double Writes
Msg-id: 20120105061916.GB21048@fetter.org
List: pgsql-hackers
Responses:
  Re: WIP(!) Double Writes
  Re: WIP(!) Double Writes
Folks,

Please find attached two patches, each under the PostgreSQL license: one which implements page checksums vs. REL9_0_STABLE, and one which depends on the first (i.e. requires that it be applied first) and implements double writes. They're vs. REL9_0_STABLE because they're extracted from vPostgres 1.0, a proprietary product currently based on PostgreSQL 9.0.

I had wanted the first patch set to be:

- against git head, and
- based on feedback from Simon's patch.

The checksum part does the wrong thing, namely changes the page format, and has some race conditions that Simon's latest page checksum patch removes. There are doubtless other warts, but I decided not to let the perfect be the enemy of the good. If that's a mistake, it's all mine.

I tested with "make check," which I realize isn't the most thorough, but again, this is mostly to get the general ideas of the patches out so people have actual code to poke at.

Dan Scales <scales@vmware.com> wrote the double write part and extracted the page checksums from previous work by Ganesh Venkitachalam, who's written here before. Dan will be answering questions if I can't :)  Jignesh Shah may be able to answer performance questions, as he has been doing yeoman work on vPostgres in that arena.

Let the brickbats begin!

Cheers,
David.

Caveats (from Dan):

The attached patch implements a "double_write" option. The idea of this option (as has been discussed) is to handle the problem of torn writes for buffer pages by writing (almost) all buffers twice, once to a double-write file and once to the data file. If a crash occurs, then a buffer should always have a correct copy either in the double-write file or in the data file, so the double-write file can be used to correct any torn writes to the data files. The "double_write" option can therefore be used in place of "full_page_writes", and can not only improve performance, but also reduce the size of the WAL log. The patch currently makes use of checksums on the data pages.
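To make the ordering concrete, here is a minimal sketch (not the patch's actual code) of the write path and the torn-page check, assuming fixed 8 KB pages and a trivial checksum in the first four bytes of each page; the names double_write_page, page_set_checksum, and page_checksum_ok are hypothetical, and the real patch uses a stronger per-page checksum:

```c
/* Hypothetical sketch of the double-write idea; not code from the
 * patch. The page goes to the double-write file and is fsynced
 * BEFORE the data file is touched, so at every instant at least one
 * durable, whole copy of the page exists somewhere on disk. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

#define PAGE_SIZE 8192

/* Toy additive checksum over the page body, stored in the first
 * 4 bytes; a stand-in for the real per-page checksum. */
static uint32_t page_checksum(const uint8_t *page)
{
    uint32_t sum = 0;
    for (size_t i = 4; i < PAGE_SIZE; i++)
        sum = sum * 31 + page[i];
    return sum;
}

static void page_set_checksum(uint8_t *page)
{
    uint32_t sum = page_checksum(page);
    memcpy(page, &sum, sizeof(sum));
}

/* Recovery-side check: a page whose stored checksum does not match
 * its contents was probably torn, and the copy in the double-write
 * file (if its own checksum is valid) can replace it. */
static int page_checksum_ok(const uint8_t *page)
{
    uint32_t stored;
    memcpy(&stored, page, sizeof(stored));
    return stored == page_checksum(page);
}

/* Write one page twice: double-write file first, fsync, then the
 * data file, then fsync it too. Returns 0 on success, -1 on error. */
static int double_write_page(int dw_fd, int data_fd, off_t offset,
                             uint8_t *page)
{
    page_set_checksum(page);
    if (pwrite(dw_fd, page, PAGE_SIZE, offset) != PAGE_SIZE)
        return -1;
    if (fsync(dw_fd) != 0)
        return -1;
    if (pwrite(data_fd, page, PAGE_SIZE, offset) != PAGE_SIZE)
        return -1;
    return fsync(data_fd);
}
```

The fsync of the double-write file before the data write is the load-bearing step: it is what guarantees that a crash mid-write to the data file leaves an intact copy to recover from.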
As has been pointed out, double writes only strictly require that the pages in the double-write file be checksummed, and we can fairly easily make data checksums optional. However, if data checksums are used, then Postgres can provide more useful messages on exactly when torn pages have occurred. It is very likely that a torn page happened if, during recovery, the checksum of a data page is incorrect but a copy of the page with a valid checksum is in the double-write file.

To achieve efficiency, the checkpoint writer and bgwriter should batch writes to multiple pages together. Currently, there is an option "batched_buffer_writes" that specifies how many buffers to batch at a time. However, we may want to remove that option from view, and just force batched_buffer_writes to a default (32) if double_writes is enabled. In order to batch, the checkpoint writer must acquire multiple buffer locks simultaneously as it is building up the batch. The patch does simple deadlock detection that ends a batch early if the lock for the next buffer that it wants to include in the batch is held. This situation almost never happens.

Given the batching functionality, double writes by the checkpoint writer (and bgwriter) are implemented efficiently by writing a batch of pages to the double-write file and fsyncing, then writing the pages to the appropriate data files and fsyncing all the necessary data files. While the data fsyncing might be viewed as expensive, it does help eliminate a lot of the fsync overhead at the end of checkpoints. FlushRelationBuffers() and FlushDatabaseBuffers() can be similarly batched. We have some other code (not included) that sorts buffers to be checkpointed in file/block order -- this can reduce fsync overhead further by ensuring that each batch writes to only one or a few data files. The actual batch writes are done using writev(), which might have to be replaced with equivalent code, if this is a portability issue.
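The batching described above can be sketched as follows (again hypothetical code, not from the patch): a batch of page pointers is gathered into an iovec array so the whole batch reaches the double-write file with a single writev() call followed by a single fsync(), rather than one write+fsync per page. The name write_batch and the BATCH_SIZE constant are assumptions; 32 matches the proposed batched_buffer_writes default.

```c
/* Hypothetical sketch of batched double-file writes with writev();
 * not code from the patch. */
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/uio.h>

#define PAGE_SIZE  8192
#define BATCH_SIZE 32   /* proposed batched_buffer_writes default */

/* Write up to BATCH_SIZE pages to fd in one writev() call, then
 * fsync once for the whole batch. Returns bytes written, or -1. */
static ssize_t write_batch(int fd, uint8_t *pages[], int nbuffers)
{
    struct iovec iov[BATCH_SIZE];
    int n = (nbuffers < BATCH_SIZE) ? nbuffers : BATCH_SIZE;

    for (int i = 0; i < n; i++) {
        iov[i].iov_base = pages[i];
        iov[i].iov_len  = PAGE_SIZE;
    }

    ssize_t written = writev(fd, iov, n);
    if (written < 0)
        return -1;
    if (fsync(fd) != 0)
        return -1;
    return written;
}
```

One fsync per batch instead of per page is where the efficiency comes from; sorting the batch in file/block order, as mentioned above, would further confine each batch to one or a few data files.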
A struct iocb structure is currently used for bookkeeping during the low-level batching, since it is compatible with an async IO approach as well (not included).

We do have to do the same double write for dirty buffer evictions by individual backends (in BufferAlloc). This could be expensive if there are a lot of dirty buffer evictions (i.e. where the checkpoint/bgwriter cannot generate enough clean pages for the backends).

Double writes must be done for any page which might be used after recovery, even if there was a full crash while writing the page. This includes all writes to such pages in a checkpoint, not just the first, since Postgres cannot do correct WAL recovery on a torn page (I believe). Pages in temporary tables and some unlogged operations do not require double writes. Feedback is especially welcome on whether we have missed some kinds of pages that do/do not require double writes.

As Jignesh has mentioned on this list, we see significant performance gains when enabling double writes & disabling full_page_writes for OLTP runs with sufficient buffer cache size. We are now trying to measure some runs where the dirty buffer eviction rate by the backends is high.

--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate