Re: silent data loss with ext4 / all current versions - Mailing list pgsql-hackers
From | Tomas Vondra
Subject | Re: silent data loss with ext4 / all current versions
Date |
Msg-id | 56589A65.4060201@2ndquadrant.com
In response to | Re: silent data loss with ext4 / all current versions (Michael Paquier <michael.paquier@gmail.com>)
Responses | Re: silent data loss with ext4 / all current versions
List | pgsql-hackers
On 11/27/2015 02:18 PM, Michael Paquier wrote:
> On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> So, what's going on? The problem is that while the rename() is atomic,
>> it's not guaranteed to be durable without an explicit fsync on the
>> parent directory. And by default we only do fdatasync on the recycled
>> segments, which may not force fsync on the directory (and ext4 does
>> not do that, apparently).
>
> Yeah, that seems to be the way the POSIX spec clears things.
> "If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
> force all currently queued I/O operations associated with the file
> indicated by file descriptor fildes to the synchronized I/O completion
> state. All I/O operations shall be completed as defined for
> synchronized I/O file integrity completion."
> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
> If I understand that right, it is guaranteed that the rename() will be
> atomic, meaning that there will be only one file even if there is a
> crash, but that we need to fsync() the parent directory as mentioned.
>
>> FWIW this has nothing to do with storage reliability - you may have
>> good drives, RAID controller with BBU, reliable SSDs or whatever, and
>> you're still not safe. This issue is at the filesystem level, not
>> storage.
>
> The POSIX spec authorizes this behavior, so the FS is not to blame,
> clearly. At least that's what I get from it.

The spec seems a bit vague to me (but maybe it's not, I'm not a POSIX
expert), so I think we should be prepared for the less favorable
interpretation.

>
>> I think this issue might also result in various other issues, not just
>> data loss. For example, I wouldn't be surprised by data corruption due
>> to flushing some of the changes in data files to disk (due to
>> contention for shared buffers and reaching vm.dirty_bytes) and then
>> losing the matching WAL segment. Also, while I have only seen 1 to 3
>> segments getting lost, it might be possible that more segments can get
>> lost, possibly making the recovery impossible. And of course, this
>> might cause problems with WAL archiving due to archiving the same
>> segment twice (before and after crash).
>
> Possible, the switch to .done is done after renaming the segment in
> xlogarchive.c. So this could happen in theory.

Yes. That's one of the suspicious places in my notes (I haven't posted
all the details, the message was long enough already).

>> Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty
>> sure this needs to be backpatched to all backbranches. I've also
>> attached a patch that adds pg_current_xlog_flush_location() function,
>> which proved to be quite useful when debugging this issue.
>
> Agreed. We should be sure as well that the calls to fsync_fname get
> issued in a critical section with START/END_CRIT_SECTION(). It does
> not seem to be the case with your patch.

Don't know. I've based that on code from replication/logical/ which does
fsync_fname() on all the interesting places, without the critical
section.

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
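For readers following the thread, here is a minimal sketch of the pattern under discussion: rename() the file, then open and fsync() the parent directory so the new directory entry itself reaches disk. This is not the attached xlog-fsync.patch; the durable_rename() helper name and the WAL segment paths are made up for the example.

/*
 * Illustrative sketch only: make a rename() durable by fsyncing the
 * parent directory afterwards. Error handling is simplified.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void
durable_rename(const char *oldpath, const char *newpath, const char *parentdir)
{
	int			fd;

	/* rename() is atomic, but the new directory entry may not be on disk yet */
	if (rename(oldpath, newpath) < 0)
	{
		perror("rename");
		exit(1);
	}

	/* fsync the parent directory to make the renamed entry durable */
	fd = open(parentdir, O_RDONLY);
	if (fd < 0)
	{
		perror("open(parentdir)");
		exit(1);
	}
	if (fsync(fd) < 0)
	{
		perror("fsync(parentdir)");
		exit(1);
	}
	close(fd);
}

int
main(void)
{
	/* e.g. recycling an old WAL segment under a new name (fictional paths) */
	durable_rename("pg_xlog/000000010000000000000001",
				   "pg_xlog/00000001000000000000000A",
				   "pg_xlog");
	return 0;
}

In the backend the directory fsync is what the proposed fsync_fname() call on the pg_xlog directory provides; the sketch only shows the bare syscall sequence without PostgreSQL's error-handling infrastructure.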