Re: Enabling Checksums - Mailing list pgsql-hackers
From: Josh Berkus
Subject: Re: Enabling Checksums
Date:
Msg-id: 513A23F1.2040200@agliodbs.com
In response to: Re: Enabling Checksums (Bruce Momjian <bruce@momjian.us>)
Responses: Re: Enabling Checksums
List: pgsql-hackers
> I also see the checksum patch is taking a beating.  I wanted to step
> back and ask what percentage of known corruption cases will this
> checksum patch detect?

I'm pretty sure that early on Jeff posted some statistics which
indicated that the current approach would detect 99% of corruption
introduced at the PostgreSQL, filesystem, or storage layer, and a
significant but minority amount of the corruption introduced through
bad RAM (this is harder to detect, and FS checksums don't catch it
either).

> What percentage of these corruptions would
> filesystem checksums have detected?

In what way is that relevant?  Given that there were already a couple
dozen posts establishing that FS checksums are not adequate, please
don't bring this up again.

> Also, don't all modern storage drives have built-in checksums, and
> report problems to the system administrator?  Does smartctl help report
> storage corruption?

To date, there are no useful tools which would detect user-level file
corruption using these.  Not that there couldn't theoretically be, but
such tools appearing in "enterprise" OSes is at least several years away.

> Let me take a guess at answering this --- we have several layers in a
> database server:
>
> 1 storage
> 2 storage controller
> 3 file system
> 4 RAM
> 5 CPU
>
> My guess is that storage checksums only cover layer 1, while our patch
> covers layers 1-3, and probably not 4-5 because we only compute the
> checksum on write.

You're forgetting two other major causes:

* PostgreSQL bugs
* operator error

(A rough sketch of what "compute on write, verify on read" implies for
layers 4-5 is at the end of this message.)

> If that is correct, the open question is what percentage of corruption
> happens in layers 1-3?

The majority.  I don't know that anyone has done an industry survey to
determine this, but out of the cases of Postgres corruption we've had to
deal with for clients, only one was the result of bad RAM.  I have never
seen corruption caused by a CPU bug.  The rest have been caused by:

* operator error
* postgres bugs
* bad controller/driver
* bad disk
* filesystem bug

Further, the solution for bad RAM is fairly easy: use ECC RAM, and make
sure that the syslog goes to some real person.  ECC RAM is pretty good
at detecting its own errors.

There's also another use case people have not been discussing, which is
the "technical validation" use case.  To give you an example:

We had a client who had a server device running on FreeBSD/UFS.  In
2009, they upgraded the device spec, including new storage and a new
version of PostgreSQL.  Their customers began filing corruption bug
reports.  After some examination of the systems involved, we concluded
that the issue was the FreeBSD drivers for the new storage, which were
unstable and had custom source patches.  However, without PostgreSQL
checksums, we couldn't *prove* it wasn't PostgreSQL at fault.  It ended
up taking weeks of testing, most of which was useless, to prove to them
they had a driver problem so it could be fixed.  If Postgres had had
checksums, we could have avoided wasting a couple of weeks looking for
non-existent PostgreSQL bugs.

In any large enterprise with dozens to hundreds of PostgreSQL servers,
PostgreSQL, the OS/FS, and the hardware are going to be run by 3
different teams.  When corruption occurs, the DBAs need to be able to
demonstrate that the corruption is not in the DBMS, in order to get the
other teams to hunt corruption bugs on their own layers.

Also, I'm kinda embarrassed that, at this point, InnoDB has checksums
and we don't. :-(

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
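
P.S.  For anyone who wants the layer argument made concrete, here is a
minimal sketch in C of checksum-on-write / verify-on-read.  It is not
the algorithm from the actual patch; the page layout, the additive
checksum, and the write_page()/read_page_ok() helpers are simplified
stand-ins for illustration.  The point it shows: whatever is in RAM at
write time (layers 4-5) gets baked into the checksum and slips through,
while any bit flipped below us after the write (layers 1-3) is caught
when the page is read back.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define PAGE_SIZE 8192

    typedef struct
    {
        uint16_t checksum;               /* stored in the page header */
        char     data[PAGE_SIZE - 2];
    } Page;

    /* Toy additive checksum -- stand-in for the real algorithm. */
    static uint16_t
    page_checksum(const Page *page)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < sizeof(page->data); i++)
            sum += (unsigned char) page->data[i];
        return (uint16_t) (sum & 0xFFFF);
    }

    /* On write: whatever the buffer holds right now is what gets
     * checksummed, so corruption introduced earlier in RAM or by the
     * CPU is baked in and never detected. */
    static void
    write_page(Page *page)
    {
        page->checksum = page_checksum(page);
        /* ... hand the page to the kernel / controller / disk ... */
    }

    /* On read: any bit flipped by the filesystem, controller, or disk
     * since the write makes the stored and recomputed values differ. */
    static int
    read_page_ok(const Page *page)
    {
        return page->checksum == page_checksum(page);
    }

    int
    main(void)
    {
        Page p = {0};
        memcpy(p.data, "hello", 5);
        write_page(&p);

        p.data[100] ^= 0x01;             /* simulate on-disk corruption */
        printf("verify: %s\n",
               read_page_ok(&p) ? "ok" : "checksum failure");
        return 0;
    }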