Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date:
Msg-id: CA+TgmoYP0vzPw64DfMQT1JHY6SzyAvjogLkj3erMZzzN2f9xLA@mail.gmail.com
In response to: Re: Parallel tuplesort (for parallel B-Tree index creation) (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses:
  Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
  Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
List: pgsql-hackers
On Wed, Sep 21, 2016 at 12:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I find this unification business really complicated. I think it'd be simpler
> to keep the BufFiles and LogicalTapeSets separate, and instead teach
> tuplesort.c how to merge tapes that live on different
> LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single
> LogicalTapeSet can contain tapes from different underlying BufFiles.
>
> What I have in mind is something like the attached patch. It refactors
> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape
> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet
> doesn't have the concept of a tape number anymore; it can contain any
> number of tapes, and you can create more on the fly. With that, it'd be
> fairly easy to make tuplesort.c merge LogicalTapes that came from
> different tape sets, backed by different BufFiles. I think that'd avoid
> much of the unification code.

I just looked at the buffile.c/buffile.h changes in the latest version of
the patch, and I agree with this criticism, though maybe not with the
proposed solution. I actually don't understand what "unification" is
supposed to mean. The patch really doesn't explain that anywhere that I can
see. It says stuff like:

+ * Parallel operations can use an interface to unify multiple worker-owned
+ * BufFiles and a leader-owned BufFile within a leader process.  This relies
+ * on various fd.c conventions about the naming of temporary files.

That comment tells you that unification is a thing you can do -- via an
unspecified interface, for unspecified reasons, using unspecified
conventions -- but it doesn't tell you what its semantics are supposed to
be. For example, if we "unify" several BufFiles, do they then have a shared
seek pointer? Do the existing contents effectively get concatenated in an
unpredictable order, or are they all expected to be empty at the time
unification happens? Or something else? It's fine to make up new words --
indeed, in some sense that is the essence of writing any complex program --
but you have to define them. As far as I can tell, the idea is that we're
somehow magically concatenating the BufFiles into one big super-BufFile,
but I'm fuzzy on exactly what's supposed to be going on there.

It's also hard to understand how something like this avoids leaking
resources. Maybe that's been thought about here, but it isn't very clear to
me how it's supposed to work. In Heikki's proposal, if process A is trying
to read a file owned by process B, and process B dies and removes the file
before process A gets around to reading it, we have got trouble, especially
on Windows, which apparently has low tolerance for such things. Peter's
proposal avoids that - I *think* - by making the leader responsible for all
resource cleanup, but that's inferior to the design we've used for other
sorts of shared resource cleanup (DSM, DSA, shm_mq, lock groups), where the
last process to detach always takes responsibility. That avoids assuming
that we're always dealing with a leader-follower situation, it doesn't
categorically require the leader to be the one who creates the shared
resource, and it doesn't require the leader to be the last process to die.

Imagine a data structure that is stored in dynamic shared memory and
contains space for a filename, a reference count, and a mutex. Let's call
this thing a SharedTemporaryFile or something like that.
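In the DSM segment it might look something like this -- just a sketch to
show the shape of the thing; the spinlock and the fixed-size pathname
buffer are arbitrary choices here, not something the patch dictates:

#include "postgres.h"
#include "storage/s_lock.h"      /* slock_t; MAXPGPATH comes via postgres.h */

typedef struct SharedTemporaryFile
{
    slock_t     mutex;                  /* protects refcnt and pathname */
    int         refcnt;                 /* number of attached processes */
    char        pathname[MAXPGPATH];    /* empty until assigned */
} SharedTemporaryFile;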
It offers these APIs:

extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
extern void SharedTemporaryFileAttach(SharedTemporaryFile *, dsm_segment *seg);
extern void SharedTemporaryFileAssign(SharedTemporaryFile *, char *pathname);
extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);

After setting aside sizeof(SharedTemporaryFile) bytes in your DSM segment,
you call SharedTemporaryFileInitialize() to initialize them. Then, every
process that cares about the file does SharedTemporaryFileAttach(), which
bumps the reference count and sets an on_dsm_detach hook to decrement the
reference count and unlink the file if the count thereby reaches 0. One of
those processes does SharedTemporaryFileAssign(), which fills in the
pathname and clears FD_TEMPORARY. Then, any process that has attached can
call SharedTemporaryFileGetFile() to get a File which can then be accessed
normally.

So, the pattern for parallel sort would be:

- The leader sets aside space and calls SharedTemporaryFileInitialize() and
  SharedTemporaryFileAttach().
- The cooperating worker calls SharedTemporaryFileAttach() and then
  SharedTemporaryFileAssign().
- The leader then calls SharedTemporaryFileGetFile().

Since the leader can attach to the file before the pathname is filled in,
there's no window where the file is at risk of being leaked. Before
SharedTemporaryFileAssign(), the worker is solely responsible for removing
the file; after that call, whichever of the leader and the worker exits
last will remove it.

> That leaves one problem, though: reusing space in the final merge phase.
> If the tapes being merged belong to different LogicalTapeSets, and you
> create one new tape to hold the result, the new tape cannot easily reuse
> the space of the input tapes because they are on different tape sets.

If the worker is always completely finished with the tape before the leader
touches it, couldn't the leader's LogicalTapeSet just "adopt" the tape and
overwrite it like any other?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company