Re: 8192 BLCKSZ ? - Mailing list pgsql-hackers
From | Nathan Myers |
---|---|
Subject | Re: 8192 BLCKSZ ? |
Date | |
Msg-id | 20001128130134.C22345@store.zembu.com |
In response to | Re: 8192 BLCKSZ ? (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: 8192 BLCKSZ ? |
List | pgsql-hackers |
On Tue, Nov 28, 2000 at 12:38:37AM -0500, Tom Lane wrote:
> "Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:
> > I don't believe it's a performance issue, I believe it's that writes to
> > blocks greater than 8k cannot be guaranteed 'atomic' by the operating
> > system. Hence, 32k blocks would break the transactions system.
>
> As Nathan remarks nearby, it's hard to tell how big a write can be
> assumed atomic, unless you have considerable knowledge of your OS and
> hardware.

Not to harp on the subject, but even if you _do_ know a great deal about your OS and hardware, you _still_ can't assume any write is atomic.

To give an idea of what is involved, consider that modern disk drives routinely re-order writes, by themselves. You think you have asked for a sequential write of 8K bytes, or 16 sectors, but the disk might write the first and last sectors first, and then the middle sectors in random order. A block of all zeroes might not be written at all, but just noted in the track metadata.

Most disks have a "feature" that they report the write complete as soon as it is in the RAM cache, rather than after the sectors are on the disk. (It's a "feature" because it makes their benchmarks come out better.) It can usually be turned off, but different vendors have different ways to do it. Have you turned it off on your production drives?

In the event of a power outage, the drive will stop writing in mid-sector. If you're lucky, that sector would have a bad checksum if you tried to read it. If the half-written sector happens to contain track metadata, you might have a bigger problem.

----

The short summary is: for power outage or OS-crash recovery purposes, there is no such thing as atomicity. This is why backups and transaction logs are important.

"Invest in a UPS." Use a reliable OS, and operate it in a way that doesn't stress it. Even a well-built OS will behave oddly when resources are badly stressed. (That the oddities may be documented doesn't really help much.)

For performance purposes, it may be more or less efficient to group writes into 4K, 8K, or 32K chunks. That's not a matter of database atomicity, but of I/O optimization. It can only confuse people to use "atomicity" in that context.

Nathan Myers
ncm@zembu.com
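The re-ordered, interrupted write described above can be simulated in a few lines. What follows is only a sketch under stated assumptions: 512-byte sectors, an 8K block, and a hypothetical page layout that stores a CRC32 in the first four bytes. The names `make_page`, `page_is_intact`, and `interrupted_write` are invented for illustration and are not PostgreSQL code. The block-level CRC here stands in for any integrity check; it can at best detect a torn block after the fact, not make the write atomic.

```python
import random
import zlib

SECTOR = 512                      # assumed sector size in bytes
BLOCK = 8192                      # 8K block, i.e. 16 sectors
SECTORS_PER_BLOCK = BLOCK // SECTOR


def make_page(payload: bytes) -> bytes:
    """Build one block: a 4-byte CRC32 of the body, followed by the body."""
    body = payload.ljust(BLOCK - 4, b"\0")
    assert len(body) == BLOCK - 4, "payload too large for one block"
    return zlib.crc32(body).to_bytes(4, "big") + body


def page_is_intact(page: bytes) -> bool:
    """True if the CRC stored in the page matches the page body."""
    return int.from_bytes(page[:4], "big") == zlib.crc32(page[4:])


def interrupted_write(old: bytes, new: bytes, sectors_done: int, seed: int = 0) -> bytes:
    """Simulate a drive that writes the sectors of a block in its own order
    and loses power after `sectors_done` of them have reached the platter."""
    order = list(range(SECTORS_PER_BLOCK))
    random.Random(seed).shuffle(order)    # the drive picks the order, not the caller
    on_disk = bytearray(old)
    for s in order[:sectors_done]:
        on_disk[s * SECTOR:(s + 1) * SECTOR] = new[s * SECTOR:(s + 1) * SECTOR]
    return bytes(on_disk)


old_page = make_page(b"A" * 8000)
new_page = make_page(b"B" * 8000)

# Power fails after 5 of the 16 sectors have been written.
torn = interrupted_write(old_page, new_page, sectors_done=5)
print(page_is_intact(old_page), page_is_intact(new_page), page_is_intact(torn))
```

The block left on disk is a mix of old and new sectors, so it matches neither version. Knowing that it is damaged does not say what it should contain, which is where the transaction log or a backup comes in.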