Re: Page Checksums + Double Writes - Mailing list pgsql-hackers
From | Kevin Grittner |
---|---|
Subject | Re: Page Checksums + Double Writes |
Date | |
Msg-id | 4EF2A9090200002500043FB4@gw.wicourts.gov |
In response to | Page Checksums + Double Writes (David Fetter <david@fetter.org>) |
Responses |
Re: Page Checksums + Double Writes
Re: Page Checksums + Double Writes |
List | pgsql-hackers |
Simon Riggs wrote:

> So overall, I do now think its still possible to add an optional
> checksum in the 9.2 release and am willing to pursue it unless
> there are technical objections.

Just to restate Simon's proposal, to make sure I'm understanding it, we would support a new page header format number and the old one in 9.2, both to be the same size and carefully engineered to minimize what code would need to be aware of the version. PageHeaderIsValid() and PageInit() certainly would, and we would need some way to set, clear (maybe), and validate a CRC. We would need a GUC to indicate whether to write the CRC, and if present we would always test it on read and treat it as a damaged page if it didn't match. (Perhaps other options could be added later, to support recovery attempts, but let's not complicate a first cut.) This whole idea would depend on either (1) trusting your storage system never to tear a page on write or (2) getting the double-write feature added, too.

I see some big advantages to this over what I suggested to David. For starters, using a flag bit and putting the CRC somewhere other than the page header would require that each AM deal with the CRC, exposing some function(s) for that. Simon's idea doesn't require that. I was also a bit concerned about shifting tuple images to convert non-protected pages to protected pages. No need to do that, either. With the bit flags, I think there might be some cases where we would be unable to add a CRC to a converted page because space was too tight; that's not an issue with Simon's proposal.

Heikki was talking about a pre-convert tool. Neither approach really needs that, although with Simon's approach it would be possible to have a background *post*-conversion tool to add CRCs, if desired. Things would continue to function if it wasn't run; you just wouldn't have CRC protection on pages not updated since pg_upgrade was run.

Simon, does it sound like I understand your proposal?
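To make the set/validate idea concrete, here is a minimal sketch in Python (not PostgreSQL's C, just for brevity). Everything specific in it is an assumption for illustration only: the two layout version numbers, the byte offsets of the version and CRC fields, the use of CRC-32 from zlib, and the function names. The `write_page_crc` parameter stands in for the GUC, whose name is not settled.

```python
import struct
import zlib

PAGE_SIZE = 8192      # PostgreSQL's default block size
VERSION_OFF = 0       # hypothetical offset of a 1-byte layout version
CRC_OFF = 4           # hypothetical offset of a 4-byte CRC slot
OLD_VERSION = 4       # hypothetical number for the existing header format
NEW_VERSION = 5       # hypothetical number for the CRC-carrying format


def page_crc(page):
    """CRC-32 of the page, computed with the CRC slot itself zeroed."""
    scratch = bytearray(page)
    scratch[CRC_OFF:CRC_OFF + 4] = b"\x00" * 4
    return zlib.crc32(bytes(scratch)) & 0xFFFFFFFF


def prepare_for_write(page, write_page_crc):
    """Stamp the page just before it is written out.

    With the (assumed) GUC on, mark the page as new-format and store the
    CRC; with it off, leave the page in the old format with no CRC.
    """
    if write_page_crc:
        page[VERSION_OFF] = NEW_VERSION
        struct.pack_into("<I", page, CRC_OFF, page_crc(page))
    else:
        page[VERSION_OFF] = OLD_VERSION


def page_is_valid(page):
    """On read: always test the CRC if the page carries one."""
    version = page[VERSION_OFF]
    if version == OLD_VERSION:
        return True                      # old format: nothing to check
    if version != NEW_VERSION:
        return False                     # unknown layout: treat as damaged
    (stored,) = struct.unpack_from("<I", page, CRC_OFF)
    return stored == page_crc(page)
```

A page written with the setting off stays in the old format and is accepted unchecked on read, which is the behavior described above for pages not updated since pg_upgrade was run.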
Now, on to the separate-but-related topic of double-write. That absolutely requires some form of checksum or CRC to detect torn pages, in order for the technique to work at all. Adding a CRC without double-write would work fine if you have a storage stack which prevents torn pages in the file system or hardware driver. If you don't have that, it could create a damaged page indication after a hardware or OS crash, although I suspect that would be the exception, not the typical case. Given all that, and the fact that it would be cleaner to deal with these as two separate patches, it seems the CRC patch should go in first. (And, if this is headed for 9.2, *very soon*, so there is time for the double-write patch to follow.)

It seems to me that the full_page_writes GUC could become an enumeration, with "off" having the current meaning, "wal" meaning what "on" now does, and "double" meaning that the new double-write technique would be used. (It doesn't seem to make any sense to do both at the same time.) I don't think we need a separate GUC to tell us *what* to protect against torn pages -- if not "off" we should always protect the first write of a page after checkpoint, and if "double" and write_page_crc (or whatever we call it) is "on", then we protect hint-bit-only writes. I think. I can see room to argue that with CRCs on we should do a full-page write to the WAL for a hint-bit-only change, or that we should add another GUC to control when we do this.

I'm going to take a shot at writing a patch for background hinting over the holidays, which I think has benefit alone but also boosts the value of these patches, since it would reduce double-write activity otherwise needed to prevent spurious error when using CRCs.

This whole area has some overlap with spreading writes, I think.
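The enumeration rule for full_page_writes can be written down as a small decision function. This is only a sketch of one literal reading of that rule, in Python for brevity; the enum class, the function name, and the two boolean parameters describing the write are hypothetical, and the open question of WAL-logging hint-bit-only changes when CRCs are on is deliberately left out.

```python
from enum import Enum


class FullPageWrites(Enum):
    OFF = "off"        # current meaning of "off"
    WAL = "wal"        # what "on" does today: full-page image in the WAL
    DOUBLE = "double"  # the proposed double-write technique


def torn_page_protection(mode, write_page_crc,
                         first_write_since_checkpoint, hint_bit_only):
    """Return the mechanism protecting this write, or None if unprotected.

    Literal reading of the rule:
      * not "off": always protect the first write of a page after checkpoint
      * "double" with CRCs on: also protect hint-bit-only writes
    """
    if mode is FullPageWrites.OFF:
        return None
    if first_write_since_checkpoint:
        return mode
    if (hint_bit_only and mode is FullPageWrites.DOUBLE
            and write_page_crc):
        return mode
    return None
```

Under this reading, a hint-bit-only write that is not the first write since checkpoint is protected only in the "double" plus CRC combination; in "wal" mode it is written unprotected, as it is today.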
The double-write approach seems to count on writing a bunch of pages (potentially from different disk files) sequentially to the double-write buffer, fsyncing that, and then writing the actual pages -- which must be fsynced before the related portion of the double-write buffer can be reused. The simple implementation would be to fsync the files just written to if they required a prior write to the double-write buffer, although fancier techniques could be used to try to optimize that. Again, setting hint bits before the write when possible would help reduce the impact of that.

-Kevin
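For reference, the write ordering described for the double-write approach can be sketched as follows. This is an illustration under stated assumptions, in Python for brevity: the function name, the batch layout, and the use of a single double-write file that is overwritten per batch are all hypothetical, and it implements only the "simple implementation" of fsyncing every file just written.

```python
import os

PAGE_SIZE = 8192  # PostgreSQL's default block size


def flush_batch(dw_path, batch):
    """Flush a batch of pages using the double-write ordering.

    batch is a list of (data_file_path, page_number, page_bytes) tuples,
    possibly spanning several data files. Returns the data files fsynced.
    """
    # 1. Write every page sequentially to the double-write buffer and
    #    fsync it, so a full copy is durable before any in-place write.
    with open(dw_path, "wb") as dw:
        for _path, _page_no, page in batch:
            dw.write(page)
        dw.flush()
        os.fsync(dw.fileno())

    # 2. Write the actual pages in place, one descriptor per data file.
    touched = {}
    try:
        for path, page_no, page in batch:
            fd = touched.get(path)
            if fd is None:
                fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
                touched[path] = fd
            os.lseek(fd, page_no * PAGE_SIZE, os.SEEK_SET)
            os.write(fd, page)

        # 3. fsync each file just written. Only after this may the
        #    related portion of the double-write buffer be reused.
        for fd in touched.values():
            os.fsync(fd)
    finally:
        for fd in touched.values():
            os.close(fd)
    return sorted(touched)
```

After a crash, recovery would compare each page's CRC in the data file against the copy in the double-write buffer and restore any torn page from the buffer; that step is omitted here.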