Re: 16-bit page checksums for 9.2 - Mailing list pgsql-hackers

From:            Kevin Grittner
Subject:         Re: 16-bit page checksums for 9.2
Msg-id:          4EFC449F02000025000441CD@gw.wicourts.gov
In response to:  16-bit page checksums for 9.2 (Simon Riggs <simon@2ndQuadrant.com>)
Responses:       Re: 16-bit page checksums for 9.2
                 Re: 16-bit page checksums for 9.2
                 Re: 16-bit page checksums for 9.2
List:            pgsql-hackers
Heikki Linnakangas wrote:
> On 28.12.2011 01:39, Simon Riggs wrote:
>> On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas wrote:
>>> On 25.12.2011 15:01, Kevin Grittner wrote:
>>>>
>>>> I don't believe that. Double-writing is a technique to avoid
>>>> torn pages, but it requires a checksum to work. This chicken-
>>>> and-egg problem requires the checksum to be implemented first.
>>>
>>> I don't think double-writes require checksums on the data pages
>>> themselves, just on the copies in the double-write buffers. In
>>> the double-write buffer, you'll need some extra information
>>> per-page anyway, like a relfilenode and block number that
>>> indicates which page it is in the buffer.

You are clearly right -- if there is no checksum in the page itself,
you can put one in the double-write metadata. I've never seen that
discussed before, but I'm embarrassed that it never occurred to me.

>> How would you know when to look in the double write buffer?
>
> You scan the double-write buffer, and every page in the double-write
> buffer that has a valid checksum, you copy to the main storage.
> There's no need to check validity of pages in the main storage.

Right. I'll recap my understanding of double-write (from memory -- if
there's a material error or omission, I hope someone will correct me).

The write-ups I've seen on double-write techniques have all writes go
first to the double-write buffer (a single, sequential file that stays
around). Because this is sequential writing to a file which is
overwritten pretty frequently, the writes to a controller are very
fast, and a BBU write-back cache is unlikely to actually write to disk
very often. On good server-quality hardware, it should be blasting
RAM-to-RAM very efficiently. The file is fsync'd (like I said,
hopefully to BBU cache), then each page in the double-write buffer is
written to its normal page location, and that is fsync'd. Once that is
done, the database writes have no risk of being torn, and the
double-write buffer is marked as empty. This all happens at the point
when you would be writing the page to the database, after the
WAL-logging.

On crash recovery you read through the double-write buffer from the
start and write the pages which look good (including a good checksum)
to the database before replaying WAL. If you find a checksum error
while processing the double-write buffer, you assume that you never got
as far as the fsync of the double-write buffer, which means you never
started writing the buffer contents to the database, which means there
can't be any torn pages there. If you get to the end and fsync, you can
be sure any torn pages from a previous attempt to write to the database
itself have been overwritten with the good copy in the double-write
buffer. Either way, you move on to WAL processing, and you wind up with
a database free of torn pages before you apply WAL.

full_page_writes to the WAL are not needed as long as double-write is
used for any pages which would otherwise have been written to the WAL
as full-page images. If checksums were written to the double-write
metadata instead of being added to the page itself, this could be
implemented on its own. It would probably allow a modest speed
improvement over using full_page_writes, and it would eliminate those
full-page images from the WAL files, making them smaller. If we do add
a checksum to the page header, that could be used for testing for torn
pages in the double-write buffer without needing a redundant
calculation for double-write.
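To make that sequence concrete, here is a minimal sketch in C of the
write path and the recovery scan, using plain POSIX I/O. Everything in
it is hypothetical: the slot layout, the function names, the toy
checksum (a real implementation would use a CRC or similar), and the
single data_fd standing in for real relation files.

    #include <stdint.h>
    #include <unistd.h>

    #define BLCKSZ   8192
    #define DW_SLOTS 64

    /* Per-page metadata kept with each page image in the double-write file. */
    typedef struct DWTag
    {
        uint32_t    relfilenode;    /* which relation the page belongs to */
        uint32_t    blocknum;       /* block number within that relation */
        uint32_t    checksum;       /* checksum of the page image */
    } DWTag;

    typedef struct DWSlot
    {
        DWTag       tag;
        char        page[BLCKSZ];
    } DWSlot;

    static DWSlot   dw_buf[DW_SLOTS];
    static int      dw_count;

    /* Toy checksum for illustration only. */
    static uint32_t
    page_checksum(const char *page)
    {
        uint32_t    sum = 0;

        for (int i = 0; i < BLCKSZ; i++)
            sum = (sum << 5) + sum + (unsigned char) page[i];
        return sum;
    }

    /*
     * Flush a batch: write every slot sequentially to the double-write
     * file and fsync it, then write each page to its home location and
     * fsync that. Only after both fsyncs may the batch be discarded.
     */
    static void
    dw_flush(int dw_fd, int data_fd)
    {
        pwrite(dw_fd, dw_buf, dw_count * sizeof(DWSlot), 0);
        fsync(dw_fd);               /* buffer contents are now durable */

        for (int i = 0; i < dw_count; i++)
            pwrite(data_fd, dw_buf[i].page, BLCKSZ,
                   (off_t) dw_buf[i].tag.blocknum * BLCKSZ);
        fsync(data_fd);             /* home locations are now durable */

        dw_count = 0;               /* mark the double-write buffer empty */
    }

    /*
     * Crash recovery: copy every slot with a valid checksum to its home
     * location before WAL replay. A bad checksum means the batch never
     * reached its first fsync, so no page in main storage can be torn.
     */
    static void
    dw_recover(int dw_fd, int data_fd)
    {
        DWSlot      slot;
        off_t       off = 0;

        while (pread(dw_fd, &slot, sizeof(slot), off) == (ssize_t) sizeof(slot))
        {
            if (page_checksum(slot.page) == slot.tag.checksum)
                pwrite(data_fd, slot.page, BLCKSZ,
                       (off_t) slot.tag.blocknum * BLCKSZ);
            off += sizeof(slot);
        }
        fsync(data_fd);
    }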
With no torn pages in the actual database, checksum failures there
would never be false positives. To get this right for a checksum in
the page header, double-write would need to be used for all cases
where full_page_writes is now used (i.e., the first write of a page
after a checkpoint) and for all unlogged writes (e.g., hint-bit-only
writes). There would be no correctness problem with always using
double-write, but it would be unnecessary overhead for other page
writes, which I think we can avoid. A sketch of that gating decision
is below.
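As a hypothetical sketch only -- the struct and both flags are invented
names for conditions the buffer manager would have to track, not an
existing API:

    #include <stdbool.h>

    /* Per-write facts an implementation might track; invented for this sketch. */
    typedef struct PageWriteInfo
    {
        bool    first_write_since_checkpoint;  /* full_page_writes territory */
        bool    unlogged_change;               /* e.g., hint bits only */
    } PageWriteInfo;

    /*
     * Double-write is required exactly where a torn write could otherwise
     * go undetected or uncorrected: the first write of a page after a
     * checkpoint, and any write not covered by WAL. All other page writes
     * can skip the extra overhead.
     */
    static bool
    needs_double_write(const PageWriteInfo *info)
    {
        return info->first_write_since_checkpoint || info->unlogged_change;
    }

-Kevin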