Re: CRCs - Mailing list pgsql-hackers

From	ncm@zembu.com (Nathan Myers)
Subject	Re: CRCs
Date	January 13, 2001 05:47:57
Msg-id	20010113024753.B7991@store.zembu.com
In response to	RE: CRCs ("Mikheev, Vadim" <vmikheev@SECTORBASE.COM>)
Responses	Re: CRCs
List	pgsql-hackers

On Fri, Jan 12, 2001 at 04:38:37PM -0800, Mikheev, Vadim wrote:
> Example.
> 1. Tuple was inserted into index.
> 2. Looking for free buffer bufmgr decides to write index block.
> 3. Following WAL core rule bufmgr first calls XLogFlush() to write
>    and fsync log record related to index tuple insertion.
> 4. *Believing* that log record is on disk now (after successful fsync)
>    bufmgr writes index block.
> 
> If log record was not really flushed on disk in 3. but on-disk image of
> index block was updated in 4. and system crashed after this then after
> restart recovery you'll have unlawful index tuple pointing to where?
> Who knows! No guarantee that corresponding heap tuple was flushed on
> disk.
> 
> Isn't database corrupted now?

Note, I haven't read the WAL code, so much of what I've said is based 
on what I know is and isn't possible with logging, rather than on 
Vadim's actual choices.  I know it's *possible* to implement a logging 
database which can maintain consistency without need for strict write 
ordering; but without strict write ordering, it is not possible to 
guarantee durable transactions.  That is, after a power outage, such 
a database may be guaranteed to recover uncorrupted, but some number 
(>= 0) of the last few acknowledged/committed transactions may be lost.

Vadim's implementation assumes strict write ordering, so that (e.g.) 
with IDE disks a corrupt database is possible in the event of a power 
outage.  (Database and OS crashes don't count; those don't keep the 
blocks from finding their way from the drive's own buffers to disk.)  This is 
no criticism; it is more efficient to assume strict write ordering, 
and a database that can lose (the last few) committed transactions 
has limited value.
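
As a rough illustration of the failure mode in Vadim's example, here is a toy 
simulation. It is not based on the actual WAL code; the CachingDisk class, the 
key names, and the is_consistent check are all hypothetical, and the drive is 
assumed to acknowledge a flush without really persisting anything.

```python
class CachingDisk:
    """Toy drive (hypothetical): writes land in a volatile cache, and
    flush() reports success without forcing the cache to the platter."""

    def __init__(self):
        self.platter = {}   # what survives a power outage
        self.cache = []     # pending (key, value) writes, in arrival order

    def write(self, key, value):
        self.cache.append((key, value))

    def flush(self):
        # Assumption: the drive acknowledges without persisting anything.
        return True

    def power_loss(self, persisted):
        """Only the cache entries at the given positions reached the platter."""
        for i in persisted:
            key, value = self.cache[i]
            self.platter[key] = value
        self.cache.clear()


def insert_index_tuple(disk):
    # Step 3: WAL rule, log record first, then "fsync".
    disk.write("wal", "insert index tuple -> heap block 7")
    assert disk.flush()  # bufmgr now believes the record is durable
    # Step 4: write the index block itself.
    disk.write("index_block", "tuple -> heap block 7")


def is_consistent(platter):
    # An index block on disk must be covered by its log record.
    return "index_block" not in platter or "wal" in platter
```

If only the second write survives (power_loss([1])), is_consistent returns 
False: the index block is on the platter with no log record behind it.  If 
only the first survives, the database is consistent but the insertion is lost.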

To achieve disk write-order independence is probably not a worthwhile 
goal, but for systems that cannot provide strict write ordering (e.g., 
most PCs) it would be helpful to be able to detect that the database 
has become corrupted.  In Vadim's example above, if the index were to
contain not only the heap blocks' numbers, but also their CRCs, then 
the corruption could be detected when the index is used.  When the 
block is read in, its CRC is checked, and when it is referenced via 
the index, the two CRC values are simply compared and the corruption
is revealed. 
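
A minimal sketch of that cross-check, under stated assumptions: blocks are 
modeled as dicts, an index entry as a (block number, CRC) pair, and CRC-32 
from zlib stands in for whatever CRC would actually be chosen.  None of these 
names come from the PostgreSQL sources.

```python
import zlib


class CorruptionDetected(Exception):
    pass


def crc(payload: bytes) -> int:
    return zlib.crc32(payload) & 0xFFFFFFFF


def make_block(payload: bytes) -> dict:
    """A heap block that carries its own CRC (hypothetical layout)."""
    return {"payload": payload, "crc": crc(payload)}


def read_block(heap: dict, block_no: int) -> dict:
    """Check the block's own CRC when it is read in."""
    block = heap[block_no]
    if crc(block["payload"]) != block["crc"]:
        raise CorruptionDetected(f"block {block_no}: bad CRC")
    return block


def fetch_via_index(heap: dict, entry: tuple) -> bytes:
    """Follow an index entry and compare its CRC with the block's CRC."""
    block_no, expected_crc = entry
    block = read_block(heap, block_no)
    if block["crc"] != expected_crc:
        raise CorruptionDetected(f"block {block_no}: index and heap disagree")
    return block["payload"]
```

With heap = {7: make_block(b"new tuple")} and entry = (7, crc(b"new tuple")), 
fetch_via_index returns the payload.  If the heap block on disk is still an 
older, internally valid version, the block's own CRC passes but the 
comparison against the index entry raises CorruptionDetected.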

On a machine that does provide strict write ordering, the CRCs in the 
index might be unnecessary overhead, but they also provide cross-checks
to help detect corruption introduced by bugs and whatnot.

Or maybe I don't know what I'm talking about.  

Nathan Myers
ncm@zembu.com
