Re: Postgres, fsync, and OSs (specifically linux) - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Postgres, fsync, and OSs (specifically linux)
Msg-id: 20180518210357.yz2w27gkr4sxitnp@alap3.anarazel.de
In response to: Postgres, fsync, and OSs (specifically linux) (Andres Freund <andres@anarazel.de>)
Responses: Re: Postgres, fsync, and OSs (specifically linux)
List: pgsql-hackers
Hi,

On 2018-04-27 15:28:42 -0700, Andres Freund wrote:
> == Potential Postgres Changes ==
>
> Several operating systems / file systems behave differently (see
> e.g. [2], thanks Thomas) than we expected. Even the discussed changes to
> e.g. linux don't get to where we thought we are. There's obviously also
> the question of how to deal with kernels / OSs that have not been
> updated.
>
> Changes that appear to be necessary, even for kernels with the issues
> addressed:
>
> - Clearly we need to treat fsync() EIO, ENOSPC errors as a PANIC and
>   retry recovery. While ENODEV (underlying device went away) will be
>   persistent, it probably makes sense to treat it the same or even just
>   give up and shut down. One question I see here is whether we just
>   want to continue crash-recovery cycles, or whether we want to limit
>   that.

Craig has a patch for this, although I'm not yet 100% happy with it.

> - We need more aggressive error checking on close(), for ENOSPC and
>   EIO. In both cases afaics we'll have to trigger a crash recovery
>   cycle. It's entirely possible to end up in a loop on NFS etc, but I
>   don't think there's a way around that.

This needs to be handled.

> - The outstanding fsync request queue isn't persisted properly [3]. This
>   means that even if the kernel behaved the way we'd expected, we'd not
>   fail a second checkpoint :(. It's possible that we don't need to deal
>   with this because we'll henceforth PANIC, but I'd argue we should fix
>   that regardless. Seems like a time-bomb otherwise (e.g. after moving
>   to DIO somebody might want to relax the PANIC...).
>
> What we could do:
>
> - forward file descriptors from backends to checkpointer (using
>   SCM_RIGHTS) when marking a segment dirty. That'd require some
>   optimizations (see [4]) to avoid doing so repeatedly.
>   That'd guarantee correct behaviour in all linux kernels >= 4.13
>   (possibly backported by distributions?), and I think it'd also make
>   it vastly more likely that errors are reported in earlier kernels.
>
>   This should be doable without a noticeable performance impact, I
>   believe. I don't think it'd be that hard either, but it'd be a bit
>   of a pain to backport it to all postgres versions, as well as a bit
>   invasive for that.
>
>   The infrastructure this'd likely end up building (hashtable of open
>   relfilenodes) would likely be useful for further things (like
>   caching file size).

I've written a patch series for this. Took me quite a bit longer than I
had hoped. The attached patch series consists of a few preparatory
patches:

- a freespace optimization to not call smgrexists() unnecessarily
- a register_dirty_segment() optimization to not queue requests for
  segments that locally are known to already have been dirtied. This
  seems like a good optimization regardless of further changes. It
  doesn't yet deal with the mdsync counter wrapping around (which is
  unlikely to ever happen in practice, it's 32bit).
- some fd.c changes; I don't think they're quite right yet
- new functions to send/recv data over a unix domain socket,
  *including* a file descriptor

The main patch guarantees that fsync requests are forwarded from
backends to the checkpointer, including the file descriptor. As we do so
immediately at mdwrite() time, that guarantees that the fd has been open
since before the write started, and therefore linux will guarantee that
that FD will see errors.

The design of the patch went through a few iterations. I initially
attempted to make the fsync request hashtable shared, but that turned
out to be a lot harder to do reliably *and* fast than I was anticipating
(we'd need to hold a lock for the entirety of mdsync(), and dynahash
doesn't allow iteration while other backends modify).
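[Editor's note: the send/recv-with-fd primitive the preparatory patch refers to can be sketched with plain POSIX sendmsg()/recvmsg() and SCM_RIGHTS ancillary data. The function names below are illustrative, not the patch's actual API:]

```c
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>

/* Send a payload plus one file descriptor over a unix domain socket.
 * Illustrative sketch, not the patch's actual API. */
static ssize_t
send_with_fd(int sock, const void *buf, size_t len, int fd)
{
    struct msghdr msg = {0};
    struct iovec iov = { .iov_base = (void *) buf, .iov_len = len };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct cmsghdr *cmsg;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    /* The fd travels as SCM_RIGHTS ancillary data; the kernel installs a
     * duplicate of it in the receiving process. */
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0);
}

/* Receive a payload; if an fd was attached, store it in *fd (else -1). */
static ssize_t
recv_with_fd(int sock, void *buf, size_t len, int *fd)
{
    struct msghdr msg = {0};
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct cmsghdr *cmsg;
    ssize_t n;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    *fd = -1;
    n = recvmsg(sock, &msg, 0);
    if (n >= 0)
    {
        cmsg = CMSG_FIRSTHDR(&msg);
        if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_RIGHTS)
            memcpy(fd, CMSG_DATA(cmsg), sizeof(int));
    }
    return n;
}
```

The received descriptor refers to the same open file description as the sender's, which is exactly what the linux >= 4.13 error-reporting guarantee keys on.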
So what I instead did was to replace the shared memory fsync request
queue with a unix domain socket (created in postmaster, using
socketpair()). CheckpointerRequest structs are written to that queue,
including the associated file descriptor. The checkpointer absorbs those
requests and updates the pending requests hashtable in local process
memory. To ensure that mdsync() has read all requests from the last
cycle, the checkpointer self-enqueues a token, which allows it to detect
the end of the relevant portion of the queue.

The biggest complication in this scheme is that the checkpointer now
needs to keep a file descriptor open for every segment. That obviously
requires adding a few new fields to the hashtable entry. But the bigger
issue is that it's now possible that pending requests need to be
processed earlier than the next checkpoint, because of file descriptor
limits. To address that, absorbing the fsync request queue will now do
an mdsync()-style pass, doing the necessary fsyncs.

Because mdsync() (or rather its new workhorse mdsyncpass()) will now not
open files itself, there's no need to deal with retries for files that
have been deleted. For the cases where we didn't yet receive an fsync
cancel request, we'll just fsync the fd. That's unnecessary, but
harmless.

Obviously this is currently heavily unix specific (according to my
research, all our unix platforms say that they support sending fds
across unix domain sockets w/ SCM_RIGHTS). It's unclear whether any OS
but linux benefits from not closing file descriptors before fsync(). We
could make this work for windows without *too* much trouble (one can
just open fds in another process, using that process' handle). I think
there's some advantage in using the same approach everywhere: for one,
not maintaining two radically different approaches to complicated
code. It'd also allow us to offload more fsyncs to the checkpointer, not
just the ones for normal relation files, which does seem advantageous.
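[Editor's note: the self-enqueued token trick can be illustrated standalone. Requests are written to one end of a socketpair(); the reader drains until it sees a token it enqueued itself, which bounds the sync pass to requests submitted before it started. The struct and names below are made up for illustration; the real CheckpointerRequest differs:]

```c
#include <sys/socket.h>
#include <unistd.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request record, stand-in for CheckpointerRequest. */
typedef struct
{
    uint32_t segno;      /* segment to sync; unused for the token */
    bool     is_token;   /* self-enqueued end-of-cycle marker */
} SyncRequest;

/* Drain requests up to a self-enqueued token, so that one sync pass only
 * covers requests submitted before it began. Returns the number of real
 * requests absorbed. */
static int
absorb_until_token(int rcv_sock, int snd_sock)
{
    SyncRequest req = { .segno = 0, .is_token = true };
    int absorbed = 0;

    /* Mark the current end of the queue with a token. */
    if (write(snd_sock, &req, sizeof(req)) != sizeof(req))
        return -1;

    for (;;)
    {
        if (read(rcv_sock, &req, sizeof(req)) != (ssize_t) sizeof(req))
            break;          /* socket closed or short read */
        if (req.is_token)
            break;          /* requests beyond this were enqueued later */
        absorbed++;         /* here: record segno in the pending table */
    }
    return absorbed;
}
```

In the patch's setting the token additionally guarantees that all fds sent before the cycle started have been absorbed into the checkpointer's table before mdsync() proceeds.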
Not having ugly retry logic around deleted files in mdsync() also seems
nice. But there are cases where this is likely slower, due to the
potential of having to wait for the checkpointer when the queue is
full. I'll note that I think the new mdsync() is considerably simpler.

Even if we do not decide to use an approach as presented here, I think
we should make some of those changes. Specifically, not unlinking the
pending requests bitmap in mdsync() seems like it both resolves an
existing bug (see upthread) and makes the code simpler.

I plan to switch to working on something else for a day or two next
week, and then polish this further. I'd greatly appreciate comments till
then.

I didn't want to do this now, but I think we should also consider
removing all awareness of segments from the fsync request queue. Instead
it should deal with individual files, and the segmentation should be
handled by md.c. That'd allow us to move all the necessary code to
smgr.c (or checkpointer?); Thomas said that'd be helpful for further
work. I personally think it'd be a lot simpler, because having long
bitmaps with only the last bit set for large append-only relations isn't
a particularly sensible approach imo. The only thing that'd get more
complicated is that file/database unlink requests become more expensive
(as they'd likely need to search the whole table), but that seems like a
sensible tradeoff. A tree structure would be an alternative,
obviously. Personally I was thinking that we should just make the
hashtable keyed by a pathname; that seems most generic.

Greetings,

Andres Freund
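[Editor's note: the pathname-keyed hashtable floated in the last paragraph could look roughly as below. This is a toy open-addressing table standing in for dynahash; all names, the slot count, and the example paths are invented for illustration:]

```c
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

#define PENDING_SLOTS 64        /* toy fixed size; dynahash grows instead */

/* One pending-fsync entry, keyed by file pathname rather than by
 * (relfilenode, segment number). */
typedef struct
{
    char path[256];    /* key; "" marks an empty slot */
    int  fd;           /* fd kept open for the eventual fsync */
} PendingEntry;

static PendingEntry pending[PENDING_SLOTS];

static unsigned
path_hash(const char *path)
{
    unsigned h = 5381;          /* djb2-style string hash */
    while (*path)
        h = h * 33 + (unsigned char) *path++;
    return h % PENDING_SLOTS;
}

/* Find the entry for a path, creating it if requested (linear probing).
 * Returns NULL if absent (or, when creating, if the table is full). */
static PendingEntry *
pending_lookup(const char *path, bool create)
{
    unsigned i = path_hash(path);

    for (int probes = 0; probes < PENDING_SLOTS;
         probes++, i = (i + 1) % PENDING_SLOTS)
    {
        if (pending[i].path[0] == '\0')
        {
            if (!create)
                return NULL;
            snprintf(pending[i].path, sizeof(pending[i].path), "%s", path);
            pending[i].fd = -1;
            return &pending[i];
        }
        if (strcmp(pending[i].path, path) == 0)
            return &pending[i];
    }
    return NULL;
}
```

Note how a file or database unlink request would indeed have to scan every slot to find matching paths, which is the tradeoff the paragraph mentions.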
Attachment
- v1-0001-freespace-Don-t-constantly-close-files-when-readi.patch
- v1-0002-Add-functions-to-send-receive-data-FD-over-a-unix.patch
- v1-0003-Make-FileGetRawDesc-ensure-there-s-an-associated-.patch
- v1-0004-WIP-Allow-to-create-a-transient-file-for-a-previo.patch
- v1-0005-WIP-Allow-more-transient-files-and-allow-to-query.patch
- v1-0006-WIP-Optimize-register_dirty_segment-to-not-repeat.patch
- v1-0007-Heavily-WIP-Send-file-descriptors-to-checkpointer.patch