Thread: a provocative question?
I am getting in the habit of storing much of my day-to-day information in postgres, rather than "flat" files. I have not had any problems of data corruption or loss, but others have warned me against abandoning files. I like the benefits of enforced data types, powerful searching, data integrity, etc. But I worry a bit about the "safety" of my data, residing in a big scary database, instead of a simple friendly folder-based file system.

I ran across this quote on Wikipedia at http://en.wikipedia.org/wiki/Eudora_%28e-mail_client%29

"Text files are also much safer than databases, in that should disk corruption occur, most of the mail is likely to be unaffected, and any that is damaged can usually be recovered."

How naive (optimistic?) is it to think that "the database" can replace "the filesystem"?

TJ O'Donnell
http://www.gnova.com/
On 09/06/07 10:43, TJ O'Donnell wrote:
> I am getting in the habit of storing much of my day-to-day
> information in postgres, rather than "flat" files.
> I have not had any problems of data corruption or loss,
> but others have warned me against abandoning files.
> I like the benefits of enforced data types, powerful searching,
> data integrity, etc.
> But I worry a bit about the "safety" of my data, residing
> in a big scary database, instead of a simple friendly
> folder-based files system.
>
> I ran across this quote on Wikipedia at
> http://en.wikipedia.org/wiki/Eudora_%28e-mail_client%29
>
> "Text files are also much safer than databases, in that should disk
> corruption occur, most of the mail is likely to be unaffected, and any
> that is damaged can usually be recovered."
>
> How naive (optimistic?) is it to think that "the database" can
> replace "the filesystem"?

Text files are *simple*. When fsck repairs the disk and creates a bunch of recovery files, just fire up $EDITOR (or cat, for that matter) and piece your text files back together. You may lose a block of data, but the rest is there, easy to read.

Database files are *complex*. Pointers and half-vacuumed freespace and binary fields and indexes and WALs, yadda yadda yadda. And, by design, it's all got to be internally consistent. Any little corruption and *poof*, you've lost a table. A strategically placed corruption and you've lost your database.

But... that's why database vendors create backup/restore commands.

You *do* back up your database(s), right??????

--
Ron Johnson, Jr.
Jefferson LA USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!
"TJ O'Donnell" <tjo@acm.org> writes: > I ran across this quote on Wikipedia at > http://en.wikipedia.org/wiki/Eudora_%28e-mail_client%29 > "Text files are also much safer than databases, in that should disk > corruption occur, most of the mail is likely to be unaffected, and any > that is damaged can usually be recovered." This is mostly FUD. You can get data out of a damaged database, too. (I'd also point out that modern filesystems are nearly as complicated as databases --- try getting your "simple" text files back if the filesystem metadata is fried.) In the end there is no substitute for a good backup policy... regards, tom lane
Tom Lane wrote:
> "TJ O'Donnell" <tjo@acm.org> writes:
>> I ran across this quote on Wikipedia at
>> http://en.wikipedia.org/wiki/Eudora_%28e-mail_client%29
>> "Text files are also much safer than databases, in that should disk
>> corruption occur, most of the mail is likely to be unaffected, and any
>> that is damaged can usually be recovered."

Should probably insert as well the standard disclaimer about Wikipedia. Great source of info, but that particular sentence has not been corrected yet by the forces-that-dictate-everything-ends-up-correct-sooner-or-later to point out the design trade-offs between simple systems like files (or paper for that matter) vs more complex but safer systems such as databases.

And no, I won't write it.... :)

> This is mostly FUD. You can get data out of a damaged database, too.
> (I'd also point out that modern filesystems are nearly as complicated as
> databases --- try getting your "simple" text files back if the
> filesystem metadata is fried.)
>
> In the end there is no substitute for a good backup policy...
>
> regards, tom lane
-- Kenneth Downs Secure Data Software, Inc. www.secdat.com www.andromeda-project.org 631-689-7200 Fax: 631-689-0527 cell: 631-379-0010
tjo@acm.org ("TJ O'Donnell") writes:
> I am getting in the habit of storing much of my day-to-day
> information in postgres, rather than "flat" files.
> I have not had any problems of data corruption or loss,
> but others have warned me against abandoning files.
> I like the benefits of enforced data types, powerful searching,
> data integrity, etc.
> But I worry a bit about the "safety" of my data, residing
> in a big scary database, instead of a simple friendly
> folder-based files system.
>
> I ran across this quote on Wikipedia at
> http://en.wikipedia.org/wiki/Eudora_%28e-mail_client%29
>
> "Text files are also much safer than databases, in that should disk
> corruption occur, most of the mail is likely to be unaffected, and any
> that is damaged can usually be recovered."
>
> How naive (optimistic?) is it to think that "the database" can
> replace "the filesystem"?

There is certainly some legitimacy to the claim; the demerits of things like the Windows Registry as compared to "plain text configuration" have been pretty clear. If the "monstrous fragile binary data structure" gets stomped on, by any means, then you can lose data in pretty massive and invisible ways.

It's most pointedly true if the data representation conflates data and indexes in some attempt to "simplify" things by having Just One File. In such a case, if *any* block gets corrupted, that has the potential to irretrievably destroy the database.

However, the argument may also be taken too far.

-> A PostgreSQL database does NOT assemble data into "one monstrous fragile binary data structure."

   Each table consists of data files that are separate from index files. Blowing up an index file *doesn't* blow up the data.

-> You are taking regular backups, right???

   If you are, that's a considerable mitigation of risks. I don't believe it's typical to set up off-site backups of one's Windows Registry, in contrast...

-> In the case of PostgreSQL, mail stored in tuples is likely to get TOASTed, which changes the shape of things further; the files get smaller (due to compression), which changes the "target profile" for this data.

-> In the contrary direction, storing the data as a set of files, each of which requires storing metadata in binary filesystem data structures, provides an (invisible-to-the-user) interface to what is, no more and no less, a "monstrous fragile binary data structure."

   That is, after all, what a filesystem is, if you strip out the visible APIs that turn it into open()/close()/mkdir() calls.

   If the wrong directory block gets "crunched," then /etc could get munched just like the Windows Registry could.

Much of the work going into filesystem efforts over the last dozen years is *exceedingly* similar to the work going into managing storage in DBMSes. People working in both areas borrow from each other.

The natural result is that they live in fairly transparent homes in relation to one another. Someone who "casts stones" of the sort in your quote is making a fallacious assumption: that because a filesystem's nature as a database of file information is kept fairly much invisible, the filesystem is somehow fundamentally less vulnerable to the same kinds of corruption.

Reality is that they are vulnerable in similar ways.

The one thing I could point to, in Eudora, as a *further* visible merit that DOES retain validity is that there is not terribly much metadata entrusted to the filesystem. Much the same is true for the Rand MH "Mail Handler", where each message is a file with very little filesystem-based metadata. If you should have a filesystem failure, and discover you have a zillion no-longer-named files in lost+found, and decline to recover from a backup, it should nonetheless be possible to re-process them through any mail filters, and rebuild a mail filesystem that will appear roughly similar to what it was like before.

That actually implies that there is *more* "conservatism of format" than first meets the eye; in effect, the data is left in raw form, replete with redundancies that can, in order to retain the ability to perform this recovery process, *never* be taken out.

There is, in effect, more than meets the eye here...

--
(format nil "~S@~S" "cbbrowne" "acm.org")
http://linuxfinances.info/info/advocacy.html
"Lumping configuration data, security data, kernel tuning parameters, etc. into one monstrous fragile binary data structure is really dumb." - David F. Skoll
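[Editor's sketch] The lost+found recovery described above can be illustrated roughly as follows, in Python. This is a minimal sketch under stated assumptions, not how Eudora or MH actually do it: the function name `refile_recovered`, the one-message-per-file layout, and the "file by sender address" rule are all hypothetical stand-ins for whatever mail filters one really uses. The point is only that the headers travel inside each file, so the lost filenames are not needed.

```python
import re
from email import message_from_bytes
from email.utils import parseaddr
from pathlib import Path


def refile_recovered(recovered_dir: Path, mail_root: Path) -> dict:
    """Copy nameless recovered message files into folders chosen from their headers.

    Hypothetical filter rule: one folder per sender address.
    Assumes mail_root starts out empty. Returns {folder name: messages filed}.
    """
    counts = {}
    for path in sorted(recovered_dir.iterdir()):
        if not path.is_file():
            continue
        raw = path.read_bytes()
        # The From header is stored in the message itself, not in the filename.
        sender = message_from_bytes(raw).get("From")
        address = parseaddr(str(sender))[1] if sender is not None else ""
        # Fragments without a usable From header are set aside for manual review.
        folder = re.sub(r"[^A-Za-z0-9@._-]", "_", address) or "unparsed"
        target_dir = mail_root / folder
        target_dir.mkdir(parents=True, exist_ok=True)
        counts[folder] = counts.get(folder, 0) + 1
        # MH-style naming: messages are plain files called 1, 2, 3, ...
        (target_dir / str(counts[folder])).write_bytes(raw)
    return counts
```

Called as `refile_recovered(Path("lost+found"), Path("Mail"))`, it leaves the recovered files untouched and builds a fresh tree next to them, so a bad filter rule can simply be rerun.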
There's also a point in regard to how modifications are made to your data store. In general, things working with text files don't go to much effort to maintain durability like a real database would. The most direct way of editing a text file is to make all the changes in memory, then write the whole thing out. Some editors make backup files, or use a create-delete-rename cycle, but they won't necessarily force the data to disk -- if it's entirely in cache you could end up losing the contents of the file anyway.

In the general case on the systems I work with, corruption is a relatively low concern due to the automatic error detection and correction my disks perform, and the consistency guarantees of modern filesystems. Interruptions (e.g. crashes or power failures) are much more likely, and in that regard the typical modification process of text files is more of a risk than working with a database.

I've also had times where faulty RAM corrupted gigabytes of data on disk due to cache churn alone.

It will always depend on your situation. In both cases, you definitely want backups just for the guarantees neither approach can make.

[way off topic]
In regard to the Windows Registry in particular...

> There is certainly some legitimacy to the claim; the demerits of
> things like the Windows Registry as compared to "plain text
> configuration" have been pretty clear.

> -> You are taking regular backups, right???
>
> If you are, that's a considerable mitigation of risks. I don't
> believe it's typical to set up off-site backups of one's Windows
> Registry, in contrast...

Sometimes I think most people get their defining impressions of the Windows Registry from experience with the Windows 9x line. I'll definitely agree that it was simply awful there, and there's much to complain about still, but...

The Windows Registry in NT is an actual database, with a WAL, structured and split into several files, replication of some portions in certain network arrangements, redundant backup of key parts in a local system, and any external storage or off-site backup system for Windows worth its salt does, indeed, back it up.

It's been that way for about a decade.
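[Editor's sketch] The durability gap described at the top of this message (a rename cycle that never forces data to disk) can be made concrete with a short Python sketch. It assumes a POSIX-like system; the function name `durable_replace` is hypothetical, and this is one common pattern rather than what any particular editor or mail client does. The difference from a plain "write then rename" is the two `fsync` calls.

```python
import os
import tempfile


def durable_replace(path: str, data: bytes) -> None:
    """Replace the file at `path` with `data`, forcing the new contents to disk."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new version next to the old one so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Without this, the new contents may exist only in the OS cache.
            os.fsync(tmp.fileno())
        # Atomic swap: readers see either the old file or the complete new one.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Also sync the directory entry so the rename itself survives a crash
    # (skipped on platforms without O_DIRECTORY, e.g. Windows).
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Even with both syncs, this only protects against interruptions; it does nothing about the corruption cases discussed elsewhere in the thread, so backups still apply.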
quension@gmail.com ("Trevor Talbot") writes:
> There's also a point in regard to how modifications are made to your
> data store. In general, things working with text files don't go to
> much effort to maintain durability like a real database would. The
> most direct way of editing a text file is to make all the changes in
> memory, then write the whole thing out. Some editors make backup
> files, or use a create-delete-rename cycle, but they won't
> necessarily force the data to disk -- if it's entirely in cache you
> could end up losing the contents of the file anyway.

In the case of Eudora, if its filesystem access protocol involves writing a new text file, and completing that before unlinking the old version, then the risk of "utter destruction" remains fairly low specifically because of the nature of that access protocol.

> In the general case on the systems I work with, corruption is a
> relatively low concern due to the automatic error detection and
> correction my disks perform, and the consistency guarantees of
> modern filesystems. Interruptions (e.g. crashes or power failures)
> are much more likely, and in that regard the typical modification
> process of text files is more of a risk than working with a
> database.

Error rates are not so low that it's safe to be cavalier about this.

> I've also had times where faulty RAM corrupted gigabytes of data on
> disk due to cache churn alone.

Yeah, and there is the factor that as disk capacities grow, the chances of there being errors grow (more bytes, more opportunities) and along with that, the number of opportunities for broken checksums to match by accident also grows. (Ergo "don't be cavalier" unless you can be pretty sure that your checksums are getting more careful...)

> It will always depend on your situation. In both cases, you
> definitely want backups just for the guarantees neither approach can
> make.

Certainly.

> [way off topic]
> In regard to the Windows Registry in particular...
>
>> There is certainly some legitimacy to the claim; the demerits of
>> things like the Windows Registry as compared to "plain text
>> configuration" have been pretty clear.
>
>> -> You are taking regular backups, right???
>>
>> If you are, that's a considerable mitigation of risks. I don't
>> believe it's typical to set up off-site backups of one's Windows
>> Registry, in contrast...
>
> Sometimes I think most people get their defining impressions of the
> Windows Registry from experience with the Windows 9x line. I'll
> definitely agree that it was simply awful there, and there's much to
> complain about still, but...
>
> The Windows Registry in NT is an actual database, with a WAL,
> structured and split into several files, replication of some portions
> in certain network arrangements, redundant backup of key parts in a
> local system, and any external storage or off-site backup system for
> Windows worth its salt does, indeed, back it up.
>
> It's been that way for about a decade.

I guess I deserve that :-).

There is a further risk, one that is not directly mitigated by backups: if you don't have some lowest common denominator that's easy to recover from, you may not have a place to recover that data.

In the old days, Unix filesystems were sufficiently buggy and corruptible that it was worthwhile to have an /sbin partition, all statically linked, generally read-only, and therefore seldom corrupted, to have as a base for recovering the rest of the system.

Using files in /etc, for config, and /sbin for enough tools to recover with, provided a basis for recovery.

In contrast, there is definitely risk to stowing all config in a DBMS such that you may have the recursive problem that you can't get the parts of the system up to help you recover it without having the DBMS running, but since it's corrupted, you don't have the config needed to get the system started, and so we recurse...

--
let name="cbbrowne" and tld="linuxdatabases.info" in name ^ "@" ^ tld;;
http://www3.sympatico.ca/cbbrowne/linuxdistributions.html
As of next Monday, TRIX will be flushed in favor of VISI-CALC. Please update your programs.
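[Editor's sketch] The "broken checksums matching by accident" remark in this message can be put into a small formula. The Python sketch below rests on an idealized assumption that is not from the thread: a corrupted block's checksum still matches with probability 2^-k for a k-bit checksum, independently per block. Real checksums and real corruption patterns need not behave this way, and `p_any_undetected` is a hypothetical name.

```python
import math


def p_any_undetected(corrupted_blocks: int, checksum_bits: int) -> float:
    """Probability that at least one corrupted block passes its checksum.

    Idealized model (an assumption): each corrupted block matches its
    checksum by accident with probability 2**-checksum_bits, independently.
    """
    p_single = 2.0 ** -checksum_bits
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(corrupted_blocks * math.log1p(-p_single))
```

Under this model the probability rises with the number of corrupted blocks (more bytes, more opportunities) and falls as the checksum gets wider, which is the trade-off being described.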
On 09/06/07 20:45, Chris Browne wrote:
> quension@gmail.com ("Trevor Talbot") writes:
>> There's also a point in regard to how modifications are made to your
>> data store. In general, things working with text files don't go to
>> much effort to maintain durability like a real database would. The
>> most direct way of editing a text file is to make all the changes in
>> memory, then write the whole thing out. Some editors make backup
>> files, or use a create-delete-rename cycle, but they won't
>> necessarily force the data to disk -- if it's entirely in cache you
>> could end up losing the contents of the file anyway.
>
> In the case of Eudora, if its filesystem access protocol involves
> writing a new text file, and completing that before unlinking the old
> version, then the risk of "utter destruction" remains fairly low
> specifically because of the nature of access protocol.

mbox is a monolithic file also, and you need to copy/delete, copy/delete, yadda yadda yadda. Just to do anything, you need 2x as much free disk space as your biggest mbox file. What a PITA.

mh and Maildir are, as has been partially mentioned, much more efficient in that regard.

(Yes... mbox is an excellent transport format.)

--
Ron Johnson, Jr.
Jefferson LA USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!