Postgres, fsync, and OSs (specifically linux) - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Postgres, fsync, and OSs (specifically linux) |
Date | |
Msg-id | 20180427222842.in2e4mibx45zdth5@alap3.anarazel.de |
List | pgsql-hackers |
Hi,

I thought I'd send this separately from [0] as the issue has become more general than what was mentioned in that thread, and it went off into various weeds.

I went to LSF/MM 2018 to discuss [0] and related issues. Overall I'd say it was a very productive discussion. I'll first try to recap the current situation, updated with knowledge I gained. Secondly I'll try to discuss the kernel changes that seem to have been agreed upon. Thirdly I'll try to sum up what postgres needs to change.

== Current Situation ==

The fundamental problem is that postgres assumed that any IO error would be reported at fsync time, and that the error would keep being reported until resolved. That's not true in several operating systems, linux included.

There are various judgement calls leading to the current OS (specifically linux, but the concerns are similar in other OSs) behaviour:

- By the time IO errors are treated as fatal, it's unlikely that plain retries attempting to write exactly the same data are going to succeed. There are retries on several layers. Some cases would be resolved by overwriting a larger amount (so device-level remapping functionality can mask dead areas), but plain retries aren't going to get there if they didn't the first time round.

- Retaining all the data necessary for retries would make it quite possible to turn IO errors on some device into out-of-memory errors. This is true to a far lesser degree if only enough information were retained to (re-)report an error, rather than actually retry the write.

- Continuing to re-report an error after one fsync() failed would make it hard to recover from that fact. There'd need to be a way to "clear" a persistent error bit, and that'd obviously be outside of posix.

- Some other databases use direct-IO and thus these paths haven't been exercised under fire that much.

- Actually marking files as persistently failed would require filesystem changes and filesystem metadata IO, which is far from guaranteed to work in failure scenarios.

Before linux v4.13, errors in kernel writeback would be reported at most once, without a guarantee that that'd happen (IIUC memory pressure could lead to the relevant information being evicted) - but it was pretty likely. After v4.13 (see https://lwn.net/Articles/724307/) errors are reported exactly once to all open file descriptors for a file with an error - but never for files that have been opened after the error occurred.

It's worth noting that on linux it's not well defined what contents one would read after a writeback error. IIUC xfs will mark the pagecache contents that triggered an error as invalid, triggering a re-read from the underlying storage (thus either failing or returning old but persistent contents). Whereas some other filesystems (among them ext4, I believe) retain the modified contents of the page cache but mark them as clean (thereby returning new contents until the page cache contents are evicted).

Some filesystems (prominently NFS in many configurations) perform an implicit fsync when closing the file. While postgres checks for an error of close() and reports it, we don't treat it as fatal. It's worth noting that by my reading this means that an fsync error at close() will *not* be re-reported by the time an explicit fsync() is issued. It also means that we'll not react properly to the possible ENOSPC errors that may be reported at close() for NFS. At least the latter isn't just the case in linux.
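To make the error-reporting behaviour described above concrete, here is a minimal standalone C sketch (not postgres code). It only illustrates the call pattern: to actually observe a failure the file would have to live on a filesystem backed by a device that fails writes (e.g. a dm-error/dm-flakey setup, not shown), and the exact outcome depends on kernel version and filesystem; the path is made up.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    /* hypothetical file on a filesystem whose backing device fails writes */
    const char *path = "/mnt/flaky/testfile";
    char buf[8192] = {0};
    int fd1, fd2;

    fd1 = open(path, O_CREAT | O_RDWR, 0600);
    if (fd1 < 0 || write(fd1, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
    {
        perror("open/write");
        return 1;
    }

    /*
     * The first fsync() forces writeback of the dirty page; if that fails,
     * this is where the error is (hopefully) reported, e.g. as EIO.
     */
    printf("first fsync: %s\n",
           fsync(fd1) == 0 ? "success" : strerror(errno));

    /*
     * Retrying on the same descriptor: the error has already been consumed,
     * and the failed page is typically either marked clean or invalidated,
     * so this tends to report success even though the data never reached
     * disk.
     */
    printf("second fsync: %s\n",
           fsync(fd1) == 0 ? "success" : strerror(errno));

    /*
     * A descriptor opened after the failure (think: the checkpointer opening
     * the segment later) does not see the error at all on >= v4.13 kernels;
     * before v4.13 reporting was unreliable as well.
     */
    fd2 = open(path, O_RDWR);
    if (fd2 >= 0)
        printf("fsync on new fd: %s\n",
               fsync(fd2) == 0 ? "success" : strerror(errno));

    return 0;
}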
Proposals for how postgres could deal with this included using syncfs(2) - but that turns out not to work at all currently, because syncfs() basically doesn't return any file-level errors. It'd also imply superfluously flushing temporary files etc.

The second major type of proposal was using direct-IO. That'd generally be a desirable feature, but a) it would require some significant changes to postgres to be performant, and b) it isn't really applicable for the large percentage of installations that aren't tuned reasonably well, because at the moment the OS page cache functions as a memory-pressure-aware extension of postgres' page cache.

Another topic brought up in this thread was the handling of ENOSPC errors that aren't triggered on a filesystem level, but rather are triggered by thin provisioning. On linux that currently apparently leads to page cache contents being lost (and errors "eaten") in a lot of places, including just when doing a write(). In a lot of cases it's pretty much expected that the file system will just hang or react unpredictably upon space exhaustion. My reading is that the block-layer thin provisioning code is still pretty fresh and should only be used with great care. The only way to halfway reliably use it appears to be to change the configuration so space exhaustion blocks until admin intervention (at least dm-thinp allows that).

There's a clear need to automate some more testing in this area so that future behaviour changes don't surprise us.

== Proposed Linux Changes ==

- Matthew Wilcox proposed (and posted a patch) to partially revert behaviour to the pre-v4.13 world, by *also* reporting errors to "newer" file descriptors if the error hasn't previously been reported. That'd still not guarantee that the error is reported (memory pressure could evict the information without an open fd), but in most situations we'll again get the error in the checkpointer. This seems to be largely agreed upon. It's unclear whether it'll go into the stable backports for still-maintained >= v4.13 kernels.

- syncfs() will be fixed so it reports errors properly - that'll likely require passing it an O_PATH file descriptor to have space to store the errseq_t value that allows discerning already-reported and new errors. No patch has appeared yet, but the behaviour seems largely agreed upon. (A speculative sketch of how this might be used follows after this list.)

- Make per-filesystem error counts available in a uniform (i.e. the same for every supporting fs) manner. Right now it's very hard to figure out whether errors occurred. There seemed to be general agreement that exporting knowledge about such errors is desirable. Quite possibly the syncfs() fix above will provide the necessary infrastructure. It's unclear as of yet how the value would be exposed; per-fs /sys/ entries and an ioctl on O_PATH fds have been mentioned. These error counts would not vanish due to memory pressure, and they can be checked even without knowing which files in a specific filesystem have been touched (e.g. when just untar-ing something). There seemed to be fairly widespread agreement that this'd be a good idea; it's much less clear whether somebody will do the work.

- Provide config knobs that allow defining the FS error behaviour in a consistent way across supported filesystems. XFS currently has various knobs controlling what happens in case of metadata errors [1] (retry forever, timeout, return up). It was proposed that this interface be extended to also deal with data errors, and moved into generic support code. While the timeline is unclear, there seemed to be widespread support for the idea. I believe Dave Chinner indicated that he at least has plans to generalize the code.

- Stop inodes with unreported errors from being evicted. This will guarantee that a later fsync (without an open FD) will see the error. The memory pressure concerns here are lower than with keeping all the failed pages in memory, and it could be optimized further. I read some tentative agreement behind this idea, but I think it's by far the most controversial one.
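Purely to illustrate the shape of the interface discussed in the syncfs() item above, here is a speculative C sketch of the kind of per-filesystem error check postgres could do once such a fix exists. The semantics are not settled - in particular, current kernels do not report file-level errors through syncfs() and may reject O_PATH descriptors here - so the mount path and helper name are assumptions, not a working recipe.

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Speculative: assumes the proposed kernel change that makes syncfs() report
 * writeback errors (tracked via an errseq_t sample associated with the
 * passed-in descriptor).  Returns 0 if no new errors were reported.
 */
static int
check_filesystem_errors(const char *mountpoint)
{
    int fd = open(mountpoint, O_PATH);
    int rc;

    if (fd < 0)
    {
        fprintf(stderr, "could not open %s: %s\n",
                mountpoint, strerror(errno));
        return -1;
    }

    /*
     * Under the proposed semantics a failure here would mean that writeback
     * errors occurred somewhere on this filesystem, without having to know
     * which files were affected (e.g. useful right before logging a
     * checkpoint completion record).
     */
    rc = syncfs(fd);
    if (rc != 0)
        fprintf(stderr, "syncfs on %s reported: %s\n",
                mountpoint, strerror(errno));

    close(fd);
    return rc;
}

int
main(void)
{
    /* hypothetical mount point holding the data directory */
    return check_filesystem_errors("/srv/pgdata") == 0 ? 0 : 1;
}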
== Potential Postgres Changes ==

Several operating systems / filesystems behave differently (see e.g. [2], thanks Thomas) than we expected. Even the discussed changes to e.g. linux don't get us to where we thought we were. There's obviously also the question of how to deal with kernels / OSs that have not been updated.

Changes that appear to be necessary, even for kernels with the issues addressed:

- Clearly we need to treat fsync() EIO and ENOSPC errors as a PANIC and retry recovery. While ENODEV (underlying device went away) will be persistent, it probably makes sense to treat it the same, or even just give up and shut down. One question I see here is whether we just want to continue crash-recovery cycles, or whether we want to limit that. (A minimal sketch of this error-handling policy follows after this list.)

- We need more aggressive error checking on close(), for ENOSPC and EIO. In both cases afaics we'll have to trigger a crash recovery cycle. It's entirely possible to end up in a loop on NFS etc., but I don't think there's a way around that. Robert, on IM, wondered whether there'd be a race between some backend doing a close(), triggering a PANIC, and a checkpoint succeeding. I don't *think* so, because the error will only happen if there's outstanding dirty data, and the checkpoint would have flushed that out if it belonged to the current checkpointing cycle.

- The outstanding fsync request queue isn't persisted properly [3]. This means that even if the kernel behaved the way we'd expected, we'd not fail a second checkpoint :(. It's possible that we don't need to deal with this because we'll henceforth PANIC, but I'd argue we should fix that regardless. Seems like a time-bomb otherwise (e.g. after moving to DIO somebody might want to relax the PANIC...).

- It might be a good idea to whitelist expected return codes for write() and PANIC on ones that we did not expect. E.g. when hitting an EIO we should probably PANIC, to get back to a known good state - even though it's likely that we'd see that error again at fsync().

- Docs.
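As a strawman for the first two items above, a minimal standalone sketch of that policy might look as follows. The helper names are made up, and postgres itself would go through ereport(PANIC, ...) and its existing file-access wrappers rather than fprintf()/abort(); this only illustrates the "don't retry, crash and recover" reaction being argued for.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void
fsync_or_panic(int fd, const char *path)
{
    if (fsync(fd) == 0)
        return;

    /*
     * EIO/ENOSPC: the kernel may already have marked the dirty data clean or
     * invalidated it, so retrying the fsync() proves nothing.  The only safe
     * reaction is to crash and run WAL recovery.  ENODEV (device went away)
     * is persistent, so crash-recovery cycles won't fix it, but treating it
     * the same way at least avoids losing data silently.
     */
    fprintf(stderr, "PANIC: fsync of %s failed: %s\n", path, strerror(errno));
    abort();
}

static void
close_or_panic(int fd, const char *path)
{
    if (close(fd) == 0)
        return;

    /*
     * On NFS-like filesystems close() can perform an implicit flush and
     * report ENOSPC/EIO here; as discussed above, that error may never be
     * seen again at fsync() time, so it has to be taken seriously now.
     */
    fprintf(stderr, "PANIC: close of %s failed: %s\n", path, strerror(errno));
    abort();
}

int
main(void)
{
    int fd = open("testfile", O_CREAT | O_RDWR, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    fsync_or_panic(fd, "testfile");
    close_or_panic(fd, "testfile");
    return 0;
}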
I think we also need to audit a few codepaths. I'd be surprised if we PANICed appropriately on all fsync()s, particularly around the SLRUs. I think we need to be particularly careful around the WAL handling; it seems fairly likely that there are cases where we'd write out WAL in one backend and then fsync() in another backend with a file descriptor that has only been opened *after* the write occurred, which means we might miss the error entirely.

Then there's the question of how we want to deal with kernels that haven't been updated with the aforementioned changes. We could say that we expect decent OS support and declare that we just can't handle this - given that at least various linux versions, netbsd, openbsd, and MacOS just silently drop errors, and we'd need different approaches for dealing with each of them, that doesn't seem like an insane approach.

What we could do:

- Forward file descriptors from backends to the checkpointer (using SCM_RIGHTS) when marking a segment dirty. That'd require some optimizations (see [4]) to avoid doing so repeatedly. That'd guarantee correct behaviour in all linux kernels >= 4.13 (possibly backported by distributions?), and I think it'd also make it vastly more likely that errors are reported in earlier kernels. This should be doable without a noticeable performance impact, I believe. I don't think it'd be that hard either, but it'd be a bit of a pain to backport it to all postgres versions, as well as a bit invasive for that. The infrastructure this'd likely end up building (a hashtable of open relfilenodes) would likely be useful for further things, like caching file sizes. (A sketch of the fd-passing mechanics follows after this list.)

- Add a pre-checkpoint hook that checks for filesystem errors *after* fsyncing all the files, but *before* logging the checkpoint completion record. Operating systems, filesystems, etc. all log errors in different formats, but for larger installations it'd not be too hard to write code that checks their specific configuration. While I'm a bit concerned about adding user code before a checkpoint, if we did it as a shell command it seems pretty reasonable - and useful even without concern for the fsync issue itself. Checking for IO errors could e.g. also include checking for read errors; it'd not be unreasonable to not want to complete a checkpoint if there'd been any media errors.

- Use direct IO. Due to architectural performance issues in PG, and the fact that it'd not be applicable for all installations, I don't think this is a reasonable fix for the issue presented here, although it's independently something we should work on. It might be worthwhile to provide a configuration that allows forcing DIO to be enabled for WAL even if replication is turned on.

- magic
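Referring back to the SCM_RIGHTS item above, this is a sketch of just the fd-passing mechanics over an AF_UNIX socket; the deduplication discussed in [4], the receiving side, and all postgres plumbing are omitted, and the function name is made up.

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one open file descriptor to the peer of a connected AF_UNIX socket. */
static int
send_fd(int sock, int fd_to_pass)
{
    struct msghdr msg = {0};
    struct cmsghdr *cmsg;
    char dummy = 'F';           /* at least one byte of real data required */
    struct iovec iov = {.iov_base = &dummy, .iov_len = 1};
    union
    {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;   /* the payload is a file descriptor */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    /*
     * The receiver ends up with a descriptor referring to the same open file
     * description, so writeback errors recorded against the file are
     * reported to its fsync() even if the sending backend later closes its
     * copy.
     */
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

On the receiving side the checkpointer would use recvmsg() with a control buffer of the same CMSG_SPACE(sizeof(int)) size and pull the new descriptor out of CMSG_DATA().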
Greetings,

Andres Freund

[0] https://archives.postgresql.org/message-id/CAMsr+YHh+5Oq4xziwwoEfhoTZgr07vdGG+hu=1adXx59aTeaoQ@mail.gmail.com

[1]
static const struct xfs_error_init xfs_error_meta_init[XFS_ERR_ERRNO_MAX] = {
    { .name = "default",
      .max_retries = XFS_ERR_RETRY_FOREVER,
      .retry_timeout = XFS_ERR_RETRY_FOREVER,
    },
    { .name = "EIO",
      .max_retries = XFS_ERR_RETRY_FOREVER,
      .retry_timeout = XFS_ERR_RETRY_FOREVER,
    },
    { .name = "ENOSPC",
      .max_retries = XFS_ERR_RETRY_FOREVER,
      .retry_timeout = XFS_ERR_RETRY_FOREVER,
    },
    { .name = "ENODEV",
      .max_retries = 0,     /* We can't recover from devices disappearing */
      .retry_timeout = 0,
    },
};

[2] https://wiki.postgresql.org/wiki/Fsync_Errors

[3] https://archives.postgresql.org/message-id/87y3i1ia4w.fsf%40news-spur.riddles.org.uk

[4] https://archives.postgresql.org/message-id/20180424180054.inih6bxfspgowjuc@alap3.anarazel.de