silent data loss with ext4 / all current versions - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | silent data loss with ext4 / all current versions |
Msg-id | 56583BDD.9060302@2ndquadrant.com |
List | pgsql-hackers |
Hi,

I've been doing some power-failure tests (i.e. unexpectedly interrupting power) over the past few days, and I've discovered a fairly serious case of silent data loss on ext3/ext4. Initially I thought it was a filesystem bug, but after further investigation I'm pretty sure it's our fault.

What happens is that when we recycle WAL segments, we rename them and then sync them using fdatasync (the default wal_sync_method on Linux). However, fdatasync does not force an fsync on the parent directory, so in case of power failure the rename may get lost. Recovery then does not realize those segments actually contain changes from the "future" and does not replay them. Hence data loss. Recovery completes as if everything went OK, so the data loss is entirely silent.

Reproducing this is rather trivial. I've prepared a simple C program simulating our WAL recycling, which I originally intended to send to the ext4 mailing list to demonstrate the "ext4 bug" (before I realized it's most likely our bug and not theirs). The example program is called ext4-data-loss.c and is available here (along with other stuff mentioned in this message):

    https://github.com/2ndQuadrant/ext4-data-loss

Compile it, run it (over ssh from another host), interrupt the power, and after restart you should see some of the segments lost (the rename reverted).

The git repo also contains a bunch of Python scripts that I initially used to reproduce this on PostgreSQL - insert.py, update.py and xlog-watch.py. I'm not going to explain the details here; it's a bit more complicated, but the cause is exactly the same as with the C program, just demonstrated in the database. See the README for instructions.

So, what's going on? The problem is that while rename() is atomic, it's not guaranteed to be durable without an explicit fsync on the parent directory. And by default we only do fdatasync on the recycled segments, which may not force an fsync on the directory (and ext4, apparently, does not do it on its own).

This impacts all current kernels (tested on 2.6.32.68, 4.0.5 and 4.4-rc1), and also all supported PostgreSQL versions (tested on 9.1.19, but I believe all versions since spread checkpoints were introduced are vulnerable).

FWIW this has nothing to do with storage reliability - you may have good drives, a RAID controller with BBU, reliable SSDs or whatever, and you're still not safe. The issue is at the filesystem level, not the storage level.

I've done the same tests on xfs, and that seems to be safe - I've been unable to reproduce the issue, so either the issue is not there or it's much harder to hit. I haven't tried other file systems, because ext4 and xfs cover the vast majority of deployments (at least on Linux), and an issue on ext4 alone is serious enough, I believe.

It's possible to make ext3/ext4 safe with respect to this issue by using full journaling (data=journal) instead of the default (data=ordered) mode. However, this comes at a significant performance cost, and pretty much no one uses it with PostgreSQL because data=ordered is believed to be safe.

It's also possible to mitigate this by setting wal_sync_method=fsync, but I don't think I've ever seen that changed at a customer site. It also comes with a significant performance penalty, comparable to data=journal, although it has the advantage that it can be applied without restarting the database (SIGHUP is enough). So pretty much everyone running on Linux + ext3/ext4 is vulnerable.

It's also worth mentioning that the data is not actually lost - it's properly fsynced in the WAL segments; it's just the rename that got lost.
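To make that distinction concrete, below is a minimal C sketch of the pattern involved: fdatasync makes a segment's contents durable, but only an explicit fsync on the parent directory makes the rename itself durable. This is an illustration only, assuming Linux/ext4 semantics - the helper names, error handling and usage are hypothetical, and it is neither the attached xlog-fsync.patch nor the ext4-data-loss.c program from the repo.

```c
/*
 * Minimal sketch: sync a file's contents with fdatasync(), then make a
 * rename durable by fsync()ing the parent directory.  Hypothetical helpers
 * for illustration only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Flush the file's contents - roughly what wal_sync_method=fdatasync does. */
static void
flush_file(const char *path)
{
    int         fd = open(path, O_WRONLY);

    if (fd < 0 || fdatasync(fd) != 0)
    {
        perror(path);
        exit(1);
    }
    close(fd);
}

/*
 * Rename and then fsync the containing directory, so that the new directory
 * entry itself survives a power failure.  Without the directory fsync the
 * rename may silently revert, which is the failure mode described above.
 */
static void
durable_rename(const char *oldpath, const char *newpath, const char *dirpath)
{
    int         dirfd;

    if (rename(oldpath, newpath) != 0)
    {
        perror("rename");
        exit(1);
    }

    dirfd = open(dirpath, O_RDONLY);
    if (dirfd < 0 || fsync(dirfd) != 0)
    {
        perror(dirpath);
        exit(1);
    }
    close(dirfd);
}

int
main(int argc, char **argv)
{
    if (argc != 4)
    {
        fprintf(stderr, "usage: %s old-segment new-segment wal-directory\n",
                argv[0]);
        return 1;
    }

    flush_file(argv[1]);                        /* contents are durable... */
    durable_rename(argv[1], argv[2], argv[3]);  /* ...and now the name is too */
    return 0;
}
```

The directory fsync is the step missing from the recycling path: fdatasync on the segment alone does not guarantee that the new file name survives a power failure.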
So it's possible to survive this without losing data by manually renaming the segments, but that has to happen before starting the cluster, because automatic recovery kicks in and discards the data.

I think this issue might also result in various other problems, not just data loss. For example, I wouldn't be surprised by data corruption caused by flushing some of the changes in data files to disk (due to contention for shared buffers and reaching vm.dirty_bytes) and then losing the matching WAL segment. Also, while I have only seen 1 to 3 segments getting lost, it may be possible for more segments to get lost, possibly making recovery impossible. And of course, this might cause problems with WAL archiving, due to archiving the same segment twice (before and after the crash).

Attached is a proposed fix (xlog-fsync.patch), and I'm pretty sure it needs to be backpatched to all back branches. I've also attached a patch that adds a pg_current_xlog_flush_location() function, which proved quite useful when debugging this issue.

I'd also like to propose adding a "last segment" field to pg_controldata, next to the last checkpoint / restartpoint. We don't need to write it on every commit; once per segment (on the first write) is enough. This would make investigating the issue much easier, and it would also make it possible to terminate recovery with an error if the last segment found does not match the expectation (instead of just assuming we've found all segments, leading to data loss).

Another useful change would be to allow pg_xlogdump to print segments even if the contents do not match the filename. Currently it's impossible to even look at the contents in that case, so renaming the existing segments is mostly guesswork (find segments where pg_xlogdump fails, try renaming them to the next segments).

And finally, I've done a quick review of all the places that might suffer from the same issue - some are not really interesting because the data is ephemeral anyway (pgstat, for example), but there are ~15 places that may need this fix:

* src/backend/access/transam/timeline.c (2 matches)
* src/backend/access/transam/xlog.c (9 matches)
* src/backend/access/transam/xlogarchive.c (3 matches)
* src/backend/postmaster/pgarch.c (1 match)

Some of these places might actually be safe because an fsync happens somewhere immediately after the rename (e.g. in a caller), but I guess better safe than sorry.

I plan to do more power-failure testing soon, with more complex test scenarios. I suspect there might be other similar issues (e.g. when we rename a file before a checkpoint and don't fsync the directory - the rename won't be replayed and will be lost).

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services