Re: Refactoring the checkpointer's fsync request queue - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Refactoring the checkpointer's fsync request queue
Date:
Msg-id: CA+hUKGK8yRjDc1ED8bmZavm=g8TYmZKOzQF3QUx_HhOUCSzkZg@mail.gmail.com
In response to: Re: Refactoring the checkpointer's fsync request queue (Andres Freund <andres@anarazel.de>)
Responses: Re: Refactoring the checkpointer's fsync request queue
List: pgsql-hackers
On Thu, Feb 21, 2019 at 12:50 PM Andres Freund <andres@anarazel.de> wrote:
> Thanks for the update!

+1, thanks Shawn. It's interesting that you measure performance
improvements that IIUC can come only from dropping the bitmapset stuff
(or I guess bugs). I don't understand the mechanism for that
improvement yet.

The idea of just including the segment number (in some representation,
possibly opaque to this code) in the hash table key instead of carrying
a segment list seems pretty good from here (and I withdraw the sorted
vector machinery I mentioned earlier as it's redundant for this
project)... except for one detail. In your patch, you still have
FORGET_RELATION_FSYNC and FORGET_DATABASE_FSYNC. That relies on this
sync manager code being able to understand which stuff to drop from the
hash table, which means that it still has knowledge of the key
hierarchy, so that it can match all entries for the relation or
database. That's one of the things that Andres is arguing against. I
can see how to fix that for the relation case: md.c could simply
enqueue a FORGET_REQUEST message for every segment and fork in the
relation, so they can be removed by O(1) hash table lookup. I can't
immediately see how to deal with the database case though. Perhaps in a
tag scheme, you'd have to make the tag mostly opaque, except for a DB
OID field, so you can handle this case? (Or some kind of predicate
callback, so you can say "does this tag cancel this other tag?"; seems
over the top).

> On 2019-02-20 15:27:40 -0800, Shawn Debnath wrote:
> > As promised, here's a patch that addresses the points discussed by
> > Andres and Thomas at FOSDEM. As a result of how we want checkpointer to
> > track what files to fsync, the pending ops table now integrates the
> > forknum and segno as part of the hash key, eliminating the need for the
> > bitmapsets or vectors from the previous iterations. We re-construct the
> > pathnames from the RelFileNode, ForkNumber and SegmentNumber and use
> > PathNameOpenFile to get the file descriptor to use for fsync.
>
> I still object to exposing segment numbers to smgr and above. I think
> that's an implementation detail of the various storage managers, and we
> shouldn't expose smgr.c further than it already is.

Ok by me. The solution to this is probably either raw paths (as Andres
has suggested, but which seem problematic to me -- see below), or some
kind of small fixed-size tag type that is morally equivalent to a path
and can be converted to a path, but is more convenient for shoving
through pipes and into hash tables.

> > Apart from that, this patch moves the system for requesting and
> > processing fsyncs out of md.c into smgr.c, allowing us to call on smgr
> > component specific callbacks to retrieve metadata like relation and
> > segment paths. This allows smgr components to maintain how relfilenodes,
> > forks and segments map to specific files without exposing this knowledge
> > to smgr. It redefines smgrsync() behavior to be closer to that of
> > smgrimmedsync(), i.e., if a regular sync is required for a particular
> > file, enqueue it locally or forward it to checkpointer.
> > smgrimmedsync() retains the existing behavior and fsyncs the file right
> > away. The processing of fsync requests has been moved from mdsync() to a
> > new ProcessFsyncRequests() function.
>
> I think that's also wrong, imo fsyncs are storage detail, and should be
> relegated to *below* md.c. That's not to say the code should live in
> md.c, but the issuing of such calls shouldn't be part of smgr.c -
> consider e.g. developing a storage engine living in non volatile ram.

How about we take all that sync-related stuff, that Shawn has moved
from md.c into smgr.c, and my earlier patch had moved out of md.c into
smgrsync.c, and we put it into a new place,
src/backend/storage/file/sync.c? Or somewhere else, but not under smgr.
It doesn't know anything about smgr concepts, and it can be used to
schedule file sync for any kind of file, not just the smgr
implementations' files. Though they'd be the main customers, I guess.
md.c and undofile.c etc. would call it to register stuff, and
checkpointer.c would call it to actually perform all the fsync calls.

If we do that, the next question is how to represent filenames. One
idea is to use tags that can be converted back to a path. I suppose
there could be a table of function pointers somewhere, and the tag
could be a discriminated union? Or just a discriminator + a small array
of dumb uint32_t of a size big enough for all users, a bit like lock
tags. One reason to want to use small fixed-size tags is to avoid
atomicity problems in the future when it comes to the fd-passing work,
as mentioned up-thread.
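To illustrate, here's roughly the sort of thing I have in mind for the
tag -- just a sketch, all names invented, and the exact fields are of
course up for discussion:

#include "postgres.h"
#include "common/relpath.h"          /* ForkNumber */
#include "storage/relfilenode.h"     /* RelFileNode */

/*
 * Sketch only: a small fixed-size tag identifying a file to sync.  It
 * can be used directly as a hash table key, and it's small enough to
 * send through a pipe atomically alongside an fd later on.
 */
typedef enum SyncHandlerId
{
    SYNC_HANDLER_MD,        /* md.c: regular relation segment files */
    SYNC_HANDLER_UNDO       /* undofile.c: undo log segment files */
} SyncHandlerId;

typedef struct FileSyncTag
{
    int16       handler;    /* SyncHandlerId: who can turn this into a path */
    ForkNumber  forknum;    /* fork, for md.c-style files */
    RelFileNode rnode;      /* includes dbNode, so whole-database entries
                             * could still be matched for the
                             * FORGET_DATABASE_FSYNC case */
    uint32      segno;      /* segment number, opaque to the sync code */
} FileSyncTag;

Everything except the handler (and perhaps the database OID) would be
opaque to the sync machinery itself; only the owning component would
know how to turn the tag back into a pathname.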
Here are some other ideas, to avoid having to use tags:

* send the paths through a shm queue, but the fds through the Unix
  domain socket; the messages carrying fds somehow point to the
  pathname in the shm queue (and deal with slight disorder...)
* send the paths through the socket, but hold an LWLock while doing so
  to make sure it's atomic, no matter what the size
* somehow prove that it's really already atomic even for long paths, on
  every operating system we support, and that it'll never change, so
  there is no problem here

Another problem with variable-sized pathnames, even without the future
fd-passing work, is that it's harder to size the shm queue: the current
code sets max_requests to NBuffers, which makes some kind of sense
because that's a hard upper bound on the number of dirty segments there
could possibly be at a given moment in time (one you'll practically
never hit), after deduplication. It's harder to come up with a decent
size for a new variable-sized-message queue; MAXPGPATH * NBuffers would
be insanely large (it'd be 1/8th the size of the buffer pool, since
MAXPGPATH is 1024 bytes against 8 kB buffers), but if you make it any
smaller there is no guarantee that compacting it can create space.
Perhaps the solution to that is simply to block/wait while the shm
queue is full -- but that might have deadlock problems.
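To make the tag approach a little more concrete, here's a rough sketch
of the interface I'm imagining for the new sync.c, with a per-handler
callback for converting a tag back into a path -- again, all names
invented, nothing final:

/*
 * Hypothetical interface for src/backend/storage/file/sync.c, assuming
 * the FileSyncTag sketched above.
 */
typedef enum SyncRequestType
{
    SYNC_REQUEST,           /* fsync this file at the next checkpoint */
    SYNC_FORGET_REQUEST     /* cancel a previously registered request */
} SyncRequestType;

/*
 * Per-handler callbacks, so sync.c never needs to know how md.c or
 * undofile.c lay out their files; md.c's callback might build the path
 * with relpathperm() plus the usual segment suffix.
 */
typedef struct SyncHandlerOps
{
    void    (*tag_to_path) (const FileSyncTag *tag, char *path /* MAXPGPATH */ );
} SyncHandlerOps;

/* md.c, undofile.c, etc. would register or cancel requests with: */
extern void RegisterSyncRequest(const FileSyncTag *tag, SyncRequestType type);

/* ...and the checkpointer, instead of calling mdsync(), would call: */
extern void ProcessSyncRequests(void);

RegisterSyncRequest() would either remember the request locally (in the
checkpointer) or forward it through the existing shm queue, much like
ForwardFsyncRequest() does today.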
--
Thomas Munro
https://enterprisedb.com