Re: CRCs - Mailing list pgsql-hackers
From | ncm@zembu.com (Nathan Myers) |
---|---|
Subject | Re: CRCs |
Date | |
Msg-id | 20010113024753.B7991@store.zembu.com |
In response to | RE: CRCs ("Mikheev, Vadim" <vmikheev@SECTORBASE.COM>) |
Responses | Re: CRCs |
List | pgsql-hackers |
On Fri, Jan 12, 2001 at 04:38:37PM -0800, Mikheev, Vadim wrote:
> Example.
> 1. Tuple was inserted into index.
> 2. Looking for free buffer bufmgr decides to write index block.
> 3. Following WAL core rule bufmgr first calls XLogFlush() to write
>    and fsync log record related to index tuple insertion.
> 4. *Believing* that log record is on disk now (after successful fsync)
>    bufmgr writes index block.
>
> If log record was not really flushed on disk in 3. but on-disk image of
> index block was updated in 4. and system crashed after this then after
> restart recovery you'll have unlawful index tuple pointing to where?
> Who knows! No guarantee that corresponding heap tuple was flushed on
> disk.
>
> Isn't database corrupted now?

Note, I haven't read the WAL code, so much of what I've said is based on what I know is and isn't possible with logging, rather than on Vadim's actual choices.

I know it's *possible* to implement a logging database which can maintain consistency without need for strict write ordering; but without strict write ordering, it is not possible to guarantee durable transactions. That is, after a power outage, such a database may be guaranteed to recover uncorrupted, but some number (>= 0) of the last few acknowledged/committed transactions may be lost.

Vadim's implementation assumes strict write ordering, so that (e.g.) with IDE disks a corrupt database is possible in the event of a power outage. (Database and OS crashes don't count; those don't keep the blocks from finding their way from on-disk buffers to disk.) This is no criticism; it is more efficient to assume strict write ordering, and a database that can lose (the last few) committed transactions has limited value.

To achieve disk write-order independence is probably not a worthwhile goal, but for systems that cannot provide strict write ordering (e.g., most PCs) it would be helpful to be able to detect that the database has become corrupted.
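
The failure in Vadim's example can be sketched as a toy simulation. This is only an illustration under stated assumptions: `CachingDisk`, `run_scenario` and `is_corrupt` are hypothetical names, not anything from the PostgreSQL source, and the "drive" is modelled as a volatile cache that acknowledges fsync without persisting and later drains blocks in an order of its own choosing.

```python
class CachingDisk:
    """Toy drive with a volatile write cache (hypothetical model)."""

    def __init__(self):
        self.platter = {}  # blocks that are really durable
        self.cache = {}    # blocks acknowledged but not yet durable

    def write(self, block, data):
        self.cache[block] = data

    def fsync(self):
        # Assumption: the drive reports success while data is still cached.
        return True

    def drain(self, block):
        # The drive moves one cached block to the platter.
        self.platter[block] = self.cache.pop(block)

    def power_loss(self):
        # Everything still in the cache is gone.
        self.cache.clear()


def run_scenario(drain_order):
    """Replay steps 1-4, let the drive drain in drain_order, then cut power."""
    disk = CachingDisk()
    disk.write("heap", "tuple t1")                 # heap tuple, not yet flushed
    disk.write("wal", "insert index entry -> t1")  # step 3: log record
    assert disk.fsync()                            # bufmgr believes the log is on disk
    disk.write("index", "entry -> t1")             # step 4: index block
    for block in drain_order:
        disk.drain(block)
    disk.power_loss()
    return disk.platter


def is_corrupt(platter):
    # Index entry survived, but neither the log record that could redo
    # the insertion nor the heap tuple it points to did.
    return "index" in platter and "wal" not in platter and "heap" not in platter
```

With strict ordering (`run_scenario(["wal", "index"])`) the log record reaches the platter before the index block, so recovery has what it needs. If the drive drains the index block first (`run_scenario(["index"])`), the result is the dangling index tuple Vadim describes.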

In Vadim's example above, if the index were to contain not only the heap blocks' numbers, but also their CRCs, then the corruption could be detected when the index is used. When the block is read in, its CRC is checked, and when it is referenced via the index, the two CRC values are simply compared and the corruption is revealed.

On a machine that does provide strict write ordering, the CRCs in the index might be unnecessary overhead, but they also provide cross-checks to help detect corruption introduced by bugs and whatnot.

Or maybe I don't know what I'm talking about.

Nathan Myers
ncm@zembu.com
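
A minimal sketch of the index-held CRC cross-check described in this message, assuming a heap modelled as a dictionary from block number to `(payload, crc)`. The helpers `write_block`, `index_entry` and `fetch_via_index` are hypothetical illustrations, not PostgreSQL functions; `zlib.crc32` stands in for whatever CRC a real implementation would use.

```python
import zlib


def write_block(heap, blkno, payload):
    """Store a block together with the CRC of its contents."""
    heap[blkno] = (payload, zlib.crc32(payload))


def index_entry(heap, blkno):
    """Build an index entry carrying the block number and the block's CRC."""
    _payload, crc = heap[blkno]
    return (blkno, crc)


def fetch_via_index(heap, entry):
    """Follow an index entry, checking both CRCs along the way."""
    blkno, expected_crc = entry
    payload, stored_crc = heap[blkno]
    # First check: the block is internally consistent when read in.
    if zlib.crc32(payload) != stored_crc:
        raise ValueError("block failed its own CRC check")
    # Second check: the block is the version the index entry refers to.
    if stored_crc != expected_crc:
        raise ValueError("index/heap CRC mismatch")
    return payload


# The heap block on disk holds an old version; the new version was only
# ever in memory when the index entry (with the new CRC) reached the disk.
heap_on_disk = {}
write_block(heap_on_disk, 7, b"old contents")
heap_in_memory = dict(heap_on_disk)
write_block(heap_in_memory, 7, b"old contents + new tuple")
entry = index_entry(heap_in_memory, 7)
```

Following `entry` against `heap_in_memory` succeeds, while following it against `heap_on_disk` (the state left after the crash) raises the mismatch error, which is the detection the message proposes.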