Re: Refactoring the checkpointer's fsync request queue - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Refactoring the checkpointer's fsync request queue
Date:
Msg-id: CA+hUKGK8yRjDc1ED8bmZavm=g8TYmZKOzQF3QUx_HhOUCSzkZg@mail.gmail.com
In response to: Re: Refactoring the checkpointer's fsync request queue (Andres Freund <andres@anarazel.de>)
Responses: Re: Refactoring the checkpointer's fsync request queue
List: pgsql-hackers
On Thu, Feb 21, 2019 at 12:50 PM Andres Freund <andres@anarazel.de> wrote:
> Thanks for the update!

+1, thanks Shawn. It's interesting that you measure performance
improvements that IIUC can come only from dropping the bitmapset stuff
(or I guess bugs). I don't understand the mechanism for that
improvement yet.

The idea of just including the segment number (in some representation,
possibly opaque to this code) in the hash table key instead of carrying
a segment list seems pretty good from here (and I withdraw the sorted
vector machinery I mentioned earlier as it's redundant for this
project)... except for one detail. In your patch, you still have
FORGET_RELATION_FSYNC and FORGET_DATABASE_FSYNC. That relies on this
sync manager code being able to understand which stuff to drop from the
hash table, which means that it still has knowledge of the key
hierarchy, so that it can match all entries for the relation or
database. That's one of the things that Andres is arguing against. I
can see how to fix that for the relation case: md.c could simply
enqueue a FORGET_REQUEST message for every segment and fork in the
relation, so they can be removed by O(1) hash table lookup. I can't
immediately see how to deal with the database case though. Perhaps in a
tag scheme, you'd have to make the tag mostly opaque, except for a DB
OID field, so you can handle this case? (Or some kind of predicate
callback, so you can say "does this tag cancel this other tag?"; seems
over the top).

> On 2019-02-20 15:27:40 -0800, Shawn Debnath wrote:
> > As promised, here's a patch that addresses the points discussed by
> > Andres and Thomas at FOSDEM. As a result of how we want checkpointer to
> > track what files to fsync, the pending ops table now integrates the
> > forknum and segno as part of the hash key, eliminating the need for the
> > bitmapsets or vectors from the previous iterations. We re-construct the
> > pathnames from the RelFileNode, ForkNumber and SegmentNumber and use
> > PathNameOpenFile to get the file descriptor to use for fsync.
>
> I still object to exposing segment numbers to smgr and above. I think
> that's an implementation detail of the various storage managers, and we
> shouldn't expose smgr.c further than it already is.

Ok by me. The solution to this is probably either raw paths (as Andres
has suggested, but which seem problematic to me -- see below), or some
kind of small fixed-size tag type that is morally equivalent to a path
and can be converted to a path, but is more convenient for shoving
through pipes and into hash tables.

> > Apart from that, this patch moves the system for requesting and
> > processing fsyncs out of md.c into smgr.c, allowing us to call on smgr
> > component specific callbacks to retrieve metadata like relation and
> > segment paths. This allows smgr components to maintain how relfilenodes,
> > forks and segments map to specific files without exposing this knowledge
> > to smgr. It redefines smgrsync() behavior to be closer to that of
> > smgrimmedsync(), i.e., if a regular sync is required for a particular
> > file, enqueue it locally or forward it to checkpointer.
> > smgrimmedsync() retains the existing behavior and fsyncs the file right
> > away. The processing of fsync requests has been moved from mdsync() to a
> > new ProcessFsyncRequests() function.
>
> I think that's also wrong, imo fsyncs are storage detail, and should be
> relegated to *below* md.c. That's not to say the code should live in
> md.c, but the issuing of such calls shouldn't be part of smgr.c -
> consider e.g. developing a storage engine living in non volatile ram.

How about we take all that sync-related stuff, that Shawn has moved
from md.c into smgr.c, and my earlier patch had moved out of md.c into
smgrsync.c, and we put it into a new place,
src/backend/storage/file/sync.c? Or somewhere else, but not under smgr.
It doesn't know anything about smgr concepts, and it can be used to
schedule file sync for any kind of file, not just the smgr
implementations' files. Though they'd be the main customers, I guess.
md.c and undofile.c etc. would call it to register stuff, and
checkpointer.c would call it to actually perform all the fsync calls.

If we do that, the next question is how to represent filenames. One
idea is to use tags that can be converted back to a path. I suppose
there could be a table of function pointers somewhere, and the tag
could be a discriminated union? Or just a discriminator + a small array
of dumb uint32_t of a size big enough for all users, a bit like lock
tags. One reason to want to use small fixed-size tags is to avoid
atomicity problems in the future when it comes to the fd-passing work,
as mentioned up-thread.
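To illustrate, here's roughly the sort of thing I have in mind for the
tag -- just a sketch, all names invented, and the exact fields are of
course up for discussion:

#include "postgres.h"
#include "common/relpath.h"          /* ForkNumber */
#include "storage/relfilenode.h"     /* RelFileNode */

/*
 * Sketch only: a small fixed-size tag identifying a file to sync.  It
 * can be used directly as a hash table key, and it's small enough to
 * send through a pipe atomically alongside an fd later on.
 */
typedef enum SyncHandlerId
{
    SYNC_HANDLER_MD,        /* md.c: regular relation segment files */
    SYNC_HANDLER_UNDO       /* undofile.c: undo log segment files */
} SyncHandlerId;

typedef struct FileSyncTag
{
    int16       handler;    /* SyncHandlerId: who can turn this into a path */
    ForkNumber  forknum;    /* fork, for md.c-style files */
    RelFileNode rnode;      /* includes dbNode, so whole-database entries
                             * could still be matched for the
                             * FORGET_DATABASE_FSYNC case */
    uint32      segno;      /* segment number, opaque to the sync code */
} FileSyncTag;

Everything except the handler (and perhaps the database OID) would be
opaque to the sync machinery itself; only the owning component would
know how to turn the tag back into a pathname.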
Here are some other ideas, to avoid having to use tags:

* send the paths through a shm queue, but the fds through the Unix
  domain socket; the messages carrying fds somehow point to the
  pathname in the shm queue (and deal with slight disorder...)
* send the paths through the socket, but hold an LWLock while doing so
  to make sure it's atomic, no matter what the size
* somehow prove that it's really already atomic even for long paths, on
  every operating system we support, and that it'll never change, so
  there is no problem here

Another problem with variable-sized pathnames, even without the future
fd-passing work, is that it's harder to size the shm queue: the current
code sets max_requests to NBuffers, which makes some kind of sense
because that's a hard upper bound on the number of dirty segments there
could possibly be at a given moment in time (one you'll practically
never hit), after deduplication. It's harder to come up with a decent
size for a new variable-sized-message queue; MAXPGPATH * NBuffers would
be insanely large (it'd be 1/8th the size of the buffer pool, since
MAXPGPATH is 1024 bytes against 8 kB buffers), but if you make it any
smaller there is no guarantee that compacting it can create space.
Perhaps the solution to that is simply to block/wait while the shm
queue is full -- but that might have deadlock problems.
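To make the tag approach a little more concrete, here's a rough sketch
of the interface I'm imagining for the new sync.c, with a per-handler
callback for converting a tag back into a path -- again, all names
invented, nothing final:

/*
 * Hypothetical interface for src/backend/storage/file/sync.c, assuming
 * the FileSyncTag sketched above.
 */
typedef enum SyncRequestType
{
    SYNC_REQUEST,           /* fsync this file at the next checkpoint */
    SYNC_FORGET_REQUEST     /* cancel a previously registered request */
} SyncRequestType;

/*
 * Per-handler callbacks, so sync.c never needs to know how md.c or
 * undofile.c lay out their files; md.c's callback might build the path
 * with relpathperm() plus the usual segment suffix.
 */
typedef struct SyncHandlerOps
{
    void    (*tag_to_path) (const FileSyncTag *tag, char *path /* MAXPGPATH */ );
} SyncHandlerOps;

/* md.c, undofile.c, etc. would register or cancel requests with: */
extern void RegisterSyncRequest(const FileSyncTag *tag, SyncRequestType type);

/* ...and the checkpointer, instead of calling mdsync(), would call: */
extern void ProcessSyncRequests(void);

RegisterSyncRequest() would either remember the request locally (in the
checkpointer) or forward it through the existing shm queue, much like
ForwardFsyncRequest() does today.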
--
Thomas Munro
https://enterprisedb.com