Re: [HACKERS] Potential data loss of 2PC files - Mailing list pgsql-hackers
From | Ashutosh Bapat |
---|---|
Subject | Re: [HACKERS] Potential data loss of 2PC files |
Date | |
Msg-id | CAFjFpRfsjKYCCetfrzuUk8MJM1hLAr=9eW9zXT4QbnMH6PwhFA@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Potential data loss of 2PC files (David Steele <david@pgmasters.net>) |
Responses |
Re: [HACKERS] Potential data loss of 2PC files
|
List | pgsql-hackers |
On Thu, Mar 16, 2017 at 10:17 PM, David Steele <david@pgmasters.net> wrote: > On 2/13/17 12:10 AM, Michael Paquier wrote: >> On Tue, Jan 31, 2017 at 11:07 AM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >>> On Mon, Jan 30, 2017 at 10:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >>>> If that can happen, don't we have the same problem in many other places? >>>> Like, all the SLRUs? They don't fsync the directory either. >>> >>> Right, pg_commit_ts and pg_clog enter in this category. >> >> Implemented as attached. >> >>>> Is unlink() guaranteed to be durable, without fsyncing the directory? If >>>> not, then we need to fsync() the directory even if there are no files in it >>>> at the moment, because some might've been removed earlier in the checkpoint >>>> cycle. >>> >>> Hm... I am not an expert in file systems. At least on ext4 I can see >>> that unlink() is atomic, but not durable. So if an unlink() is >>> followed by a power failure, the previously unlinked file could be >>> here if the parent directory is not fsync'd. >> >> So I have been doing more work on this patch, with the following things done: >> - Flush pg_clog, pg_commit_ts and pg_twophase at checkpoint phase to >> ensure their durability. >> - Create a durable_unlink() routine to give a way to perform a durable >> file removal. >> I am now counting 111 calls to unlink() in the backend code, and >> looking at all of them most look fine with plain unlink() if they are >> not made durable as they work on temporary files (see timeline.c for >> example), with some exceptions: >> - In pg_stop_backup, the old backup_label and tablespace_map removal >> should be durable to avoid putting the system in a wrong state after >> power loss. Other calls of unlink() are followed by durable_rename so >> they are fine if let as such. >> - Removal of old WAL segments should be durable as well. There is >> already an effort to rename them durably in case of a segment >> recycled. In case of a power loss, a file that should have been >> removed could remain in pg_xlog. >> >> Looking around, I have bumped as well on the following bug report for >> SQlite which is in the same category of things: >> http://sqlite.1065341.n5.nabble.com/Potential-bug-in-crash-recovery-code-unlink-and-friends-are-not-synchronous-td68885.html >> Scary to see that in this case durability can be a problem at >> transaction commit... > > This patch applies cleanly and compiles at cccbdde. > > Ashutosh, do you know when you'll have a chance to review? The scope of this work has expanded, since last time I reviewed and marked it as RFC. Right now I am busy with partition-wise joins and do not have sufficient time to take a look at the expanded scope. However, I can come back to this after partition-wise join, but that may stretch to the end of the commitfest. Sorry. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
pgsql-hackers by date: