Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS - Mailing list pgsql-hackers
From | Greg Stark
Subject | Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date |
Msg-id | CAM-w4HOHGTVmG6OY2F3aeRgBBAF2jcOiS5R=c+4+q5nBVW1ELg@mail.gmail.com
In response to | Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS (Anthony Iliopoulos <ailiop@altatus.com>)
List | pgsql-hackers
On 9 April 2018 at 11:50, Anthony Iliopoulos <ailiop@altatus.com> wrote:
> On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:
>> On 8 April 2018 at 22:47, Anthony Iliopoulos <ailiop@altatus.com> wrote:
>
> To make things a bit simpler, let us focus on EIO for the moment.
> The contract between the block layer and the filesystem layer is
> assumed to be that of, when an EIO is propagated up to the fs,
> then you may assume that all possibilities for recovering have
> been exhausted in lower layers of the stack.

Well, Postgres is using the filesystem. The interface between the block layer and the filesystem may indeed need to be more complex; I wouldn't know.

But I don't think "all possibilities" is a very useful concept. Neither layer here is going to be perfect. They can only promise that all possibilities that have actually been implemented have been exhausted, and even among those, only to the degree they can be done automatically within the engineering tradeoffs and constraints. There will always be cases like thin-provisioned devices that an operator can expand, or degraded RAID arrays that can be repaired after a long operation, and so on. A network device can't be sure whether a remote server may eventually come back or whether it will have to be reconfigured by a human or a system automation tool to point to a new server or new network configuration.

> Right. This implies though that apart from the kernel having
> to keep around the dirtied-but-unrecoverable pages for an
> unbounded time, that there's further an interface for obtaining
> the exact failed pages so that you can read them back.

No, the interface we have is fsync, which gives us that information with the granularity of a single file. The database could in theory recognize that fsync is not completing on a file, read that file back, and write it to a new file. More likely we would implement a feature Oracle has of writing key files to multiple devices. But currently, in practice, that's not what would happen. What would happen is that a human would recognize that the database has stopped being able to commit, see hardware errors in the log, and would stop the database, take a backup, and restore onto a new working device. The current interface is that there's one error and then Postgres would pretty much have to say, "sorry, your database is corrupt and the data is gone, restore from your backups". Which is pretty dismal.

> There is a clear responsibility of the application to keep
> its buffers around until a successful fsync(). The kernels
> do report the error (albeit with all the complexities of
> dealing with the interface), at which point the application
> may not assume that the write()s where ever even buffered
> in the kernel page cache in the first place.

Postgres cannot just store the entire database in RAM. It writes things to the filesystem all the time. It calls fsync only when it needs a write barrier to ensure consistency. That's only frequent on the transaction log, to ensure it's flushed before data modifications, and then periodically to checkpoint the data files. The amount of data written between checkpoints can be arbitrarily large, and Postgres has no idea how much memory is available as filesystem buffers, how much I/O bandwidth is available, or what other memory pressure there is. What you're suggesting is that the application should have to babysit the filesystem buffer cache and reimplement all of it in user space because the filesystem is free to throw away any data any time it chooses?
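To make the write-barrier usage above concrete, here is a minimal sketch (not PostgreSQL source; the file name, record contents, and exit-on-error policy are purely illustrative) of the pattern being discussed: buffered write()s followed by a single fsync() used as a barrier, where an fsync() failure is treated as unrecoverable because, as argued above, the application cannot assume the dirty pages are still sitting in the page cache waiting to be retried.

/* Illustrative only, not PostgreSQL code: write() + fsync() as a write
 * barrier.  Any fsync() failure is treated as fatal, since after an error
 * the kernel may have dropped or marked clean the pages it failed to write. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void append_and_barrier(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0)
    {
        ssize_t n = write(fd, p, len);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;
            perror("write");
            exit(1);
        }
        p += n;
        len -= (size_t) n;
    }

    /*
     * The barrier: nothing that depends on this data being durable may
     * proceed until fsync() succeeds.  If it fails (e.g. EIO propagated
     * up from the block layer), simply retrying cannot be assumed safe,
     * so the only honest responses are to crash and recover, or to fail
     * over to a copy on another device.
     */
    if (fsync(fd) < 0)
    {
        perror("fsync");
        exit(1);
    }
}

int main(void)
{
    /* "wal_segment" is a stand-in name for this sketch, not a real file. */
    int fd = open("wal_segment", O_WRONLY | O_CREAT | O_APPEND, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    const char record[] = "COMMIT 12345\n";

    append_and_barrier(fd, record, sizeof(record) - 1);
    close(fd);
    return 0;
}

The point of the sketch is only where the check sits: between checkpoints an arbitrary amount of data may have been handed to the kernel with plain write()s, and the single fsync() at the barrier is the only place the application learns whether any of it failed.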
The current interface to throw away the filesystem buffer cache is unmount. It sounds like the kernel would like a more granular way to discard just part of a device, which makes a lot of sense in the age of large network block devices. But I don't think just saying that the filesystem buffer cache is now something every application needs to reimplement in user space really helps with that; they're going to have the same problems to solve.

--
greg