Re: Changeset Extraction v7.0 (was logical changeset generation) - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: Changeset Extraction v7.0 (was logical changeset generation)
Date:
Msg-id: CA+TgmoZ1DTGKJ6FthQ7vSAiniih2LZ_aL0FM8kCzQNc8d2Gfmg@mail.gmail.com
In response to: Re: Changeset Extraction v7.0 (was logical changeset generation) (Andres Freund <andres@2ndquadrant.com>)
Responses: Re: Changeset Extraction v7.0 (was logical changeset generation)
List: pgsql-hackers
On Thu, Jan 23, 2014 at 7:05 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I don't think shared buffers fsyncs are the apt comparison. It's more
> something like UpdateControlFile(). Which PANICs.
>
> I really don't get why you fight PANICs in general that much. There are
> some nasty PANICs in postgres which can happen in legitimate situations,
> which should be made to fail more gracefully, but this surely isn't one
> of them. We're doing rename(), unlink() and rmdir(). That's it.
> We should concentrate on the ones that legitimately can happen, not the
> ones created by an admin running a chmod -R 000 . ; rm -rf $PGDATA or
> mount -o remount,ro /. We don't increase reliability a bit by adding
> codepaths that will never get tested.

Sorry, I don't buy it. Lots of people I know have stories that go like this: "$HORRIBLE happened, and PostgreSQL kept on running, and it didn't even lose my data!", where $HORRIBLE may be variously that the disk filled up, that disk writes started failing with I/O errors, that somebody changed the permissions on the data directory inadvertently, that the entire data directory got removed, and so on. I've been through some of those scenarios myself, and the care and effort that's been put into failure modes has saved my bacon more than a few times, too.

We *do* increase reliability by worrying about what will happen even in code paths that very rarely get exercised. It's certainly true that our bug count is higher there than in the parts of our code that get exercised more regularly, but it's also lower than it would be if we didn't make the effort, and the dividend we get from that effort is a well-deserved reputation for reliability.

I think it's completely unacceptable for the failure of routine filesystem operations to result in a PANIC. I grant you that we have some existing cases where that can happen (like UpdateControlFile), but that doesn't mean we should add more. Right this very minute there is massive bellyaching on a nearby thread caused by the fact that a full-disk condition while writing WAL can PANIC the server, while on this thread, at the very same time, you're arguing that adding more ways for a full disk to cause PANICs won't inconvenience anyone. The other thread is right, and your argument here is wrong. We have been able to - and have taken the time to - fix comparable problems in other cases, and we should do the same thing here.

As for why I fight PANICs so much in general, there are two reasons. First, I believe that to be project policy; I welcome correction if I have misinterpreted our stance in that area. Second, I have encountered a few situations where customers had production servers that repeatedly PANICked due to some bug or other. If I've ever encountered angrier customers, I can't remember when. A PANIC is no big deal when it happens on your development box, but when it happens on a machine with 100 users connected to it, it's a big deal, especially if a single underlying cause makes it happen over and over again.

I think we should be devoting our time to figuring out how to improve this, not whether to improve it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
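For concreteness, here is a minimal sketch of the kind of non-PANIC handling argued for above, using PostgreSQL's standard ereport() error reporting. The path variables are placeholders, and this is an illustration of the general pattern, not code from the patch under discussion:

    /*
     * Illustrative sketch: if rename() fails (full disk, bad permissions,
     * read-only filesystem), report it at ERROR.  The current command
     * aborts and the message reaches the client, but the rest of the
     * server keeps running; a PANIC here would take down every session.
     */
    if (rename(tmppath, path) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not rename file \"%s\" to \"%s\": %m",
                        tmppath, path)));

The same pattern would apply to the unlink() and rmdir() calls mentioned in the quoted text.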