Re: 8192 BLCKSZ ? - Mailing list pgsql-hackers

From	Nathan Myers
Subject	Re: 8192 BLCKSZ ?
Date	November 28, 2000 16:02:42
Msg-id	20001128130134.C22345@store.zembu.com
In response to	Re: 8192 BLCKSZ ? (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: 8192 BLCKSZ ?
List	pgsql-hackers

On Tue, Nov 28, 2000 at 12:38:37AM -0500, Tom Lane wrote:
> "Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:
> > I don't believe it's a performance issue, I believe it's that writes to
> > blocks greater than 8k cannot be guaranteed 'atomic' by the operating
> > system.  Hence, 32k blocks would break the transactions system.
> 
> As Nathan remarks nearby, it's hard to tell how big a write can be
> assumed atomic, unless you have considerable knowledge of your OS and
> hardware.  

Not to harp on the subject, but even if you _do_ know a great deal
about your OS and hardware, you _still_ can't assume any write is
atomic.

To give an idea of what is involved, consider that modern disk
drives routinely reorder writes on their own.  You think you
have asked for a sequential write of 8K bytes, or 16 sectors of
512 bytes, but the disk might write the first and last sectors first,
and then the middle sectors in random order.  A block of all zeroes
might not be written at all, but just noted in the track metadata.
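
A minimal sketch of that reordering, in Python.  It models one 8K
block as 16 sectors of 512 bytes and a hypothetical drive that picks
its own write order and loses power partway through.  The names
torn_write, SECTOR and PAGE are made up for illustration, and the
model is sector-granular, which is kinder than a real drive that can
stop in mid-sector.

```python
import random

SECTOR = 512                 # assumed sector size: 8192 bytes / 16 sectors
PAGE = 8192                  # one database block
NSECTORS = PAGE // SECTOR

def torn_write(old, new, sectors_done, seed=0):
    """Return the on-disk block if power fails after `sectors_done`
    sectors have been written, in an order chosen by the drive."""
    assert len(old) == len(new) == PAGE
    order = list(range(NSECTORS))
    random.Random(seed).shuffle(order)      # the drive's order, not ours
    disk = bytearray(old)
    for i in order[:sectors_done]:
        disk[i * SECTOR:(i + 1) * SECTOR] = new[i * SECTOR:(i + 1) * SECTOR]
    return bytes(disk)

old = b"A" * PAGE
new = b"B" * PAGE
page = torn_write(old, new, sectors_done=5)

# The result is neither the old block nor the new one.
print(page != old and page != new)
print(page.count(b"B") // SECTOR, "of", NSECTORS, "sectors are new")
```

Which five sectors end up new depends only on the seed here; on a
real drive it depends on firmware you cannot see.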

Most disks have a "feature": they report a write complete as soon
as the data is in their RAM cache, rather than after the sectors
are on the disk.  (It's a "feature" because it makes their
benchmarks come out better.)  It can usually be turned off, but
different vendors have different ways to do it.  Have you turned
it off on your production drives?
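
For illustration, here is roughly the most an application can do
from its side, sketched in Python with the standard os.write and
os.fsync calls.  The helper name write_and_flush and the file name
are hypothetical.  What a returned fsync actually guarantees depends
on the OS and on the drive's cache setting, which is the point of
the paragraph above.

```python
import os
import tempfile

def write_and_flush(path, data):
    """Write `data` to `path` and ask the OS to push it to the device.

    When os.fsync() returns, the OS has done its part.  If the drive's
    write cache is enabled, the drive may have reported completion
    before the sectors were actually on the platter.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:                  # os.write may write less than asked
            n = os.write(fd, view)
            view = view[n:]
        os.fsync(fd)
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "block.dat")
write_and_flush(path, b"\0" * 8192)
print(os.path.getsize(path))
```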

In the event of a power outage, the drive will stop writing in
mid-sector.  If you're lucky, that sector would have a bad checksum
if you tried to read it.  If the half-written sector happens to 
contain track metadata, you might have a bigger problem.  
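
The drive's per-sector checksum is out of the application's reach,
but the same idea can be sketched one level up.  The Python below is
an assumption-laden illustration, not a description of what
PostgreSQL does: it reserves the last 4 bytes of a hypothetical 8K
block for a CRC-32 of the rest, so that a block mixing old and new
sectors can at least be noticed when it is read back.

```python
import zlib

PAGE = 8192
BODY = PAGE - 4              # last 4 bytes hold a CRC-32 of the body

def seal(body):
    """Append the CRC-32 of `body` to form a full block."""
    assert len(body) == BODY
    return body + zlib.crc32(body).to_bytes(4, "big")

def is_intact(page):
    """True if the stored CRC matches the block's body."""
    body, stored = page[:BODY], page[BODY:]
    return zlib.crc32(body) == int.from_bytes(stored, "big")

old = seal(b"A" * BODY)
new = seal(b"B" * BODY)
torn = new[:512] + old[512:]     # only the first sector reached the disk

print(is_intact(old), is_intact(new), is_intact(torn))
```

Detecting the damage is all this does; getting the block back still
takes a log or a backup.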

----
The short summary is: for power outage or OS-crash recovery purposes,
there is no such thing as atomicity.  This is why backups and 
transaction logs are important.
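
A toy sketch of why the log helps, in Python.  Everything here is
hypothetical: the Store class, the in-memory "disk", and above all
the assumption that the log record itself reached stable storage
intact before the crash.  The block image is logged first and only
then written in place, so a torn in-place write can be repaired by
replaying the log.

```python
PAGE = 8192

class Store:
    """Toy block store with a redo log.  Everything lives in memory;
    the log is assumed to survive the crash intact."""

    def __init__(self, npages):
        self.disk = [bytes(PAGE) for _ in range(npages)]
        self.log = []                        # (page_no, full new image)

    def update(self, page_no, image, crash_after=None):
        assert len(image) == PAGE
        self.log.append((page_no, image))    # 1. log first
        if crash_after is None:
            self.disk[page_no] = image       # 2. then overwrite in place
        else:
            # Power fails: only the first `crash_after` bytes land.
            old = self.disk[page_no]
            self.disk[page_no] = image[:crash_after] + old[crash_after:]

    def recover(self):
        for page_no, image in self.log:      # replay in log order
            self.disk[page_no] = image
        self.log.clear()

s = Store(npages=2)
s.update(0, b"x" * PAGE, crash_after=1000)   # torn in mid-sector
torn = s.disk[0]
s.recover()
print(torn == b"x" * PAGE, s.disk[0] == b"x" * PAGE)
```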

"Invest in a UPS."  Use a reliable OS, and operate it in a way that
doesn't stress it.  Even a well-built OS will behave oddly when 
resources are badly stressed.  (That the oddities may be documented
doesn't really help much.)

For performance purposes, it may be more or less efficient to group 
writes into 4K, 8K, or 32K chunks.  That's not a matter of database 
atomicity, but of I/O optimization.  It can only confuse people to 
use "atomicity" in that context.

Nathan Myers
ncm@zembu.com
