Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date:
Msg-id: CAM3SWZRO5He5y-h1u8ozLgZb=9srOA0f169qYbKYKrhvoWzxQQ@mail.gmail.com
In response to: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) (Robert Haas <robertmhaas@gmail.com>)
Responses:
  Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
  Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
List: pgsql-hackers

On Tue, Dec 20, 2016 at 2:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> What I have in mind is something like the attached patch. It refactors
>> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape
>> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet
>> doesn't have the concept of a tape number anymore, it can contain any number
>> of tapes, and you can create more on the fly. With that, it'd be fairly easy
>> to make tuplesort.c merge LogicalTapes that came from different tape sets,
>> backed by different BufFiles. I think that'd avoid much of the unification
>> code.
>
> I just looked at the buffile.c/buffile.h changes in the latest version
> of the patch and I agree with this criticism, though maybe not with
> the proposed solution. I actually don't understand what "unification"
> is supposed to mean. The patch really doesn't explain that anywhere
> that I can see. It says stuff like:
>
> + * Parallel operations can use an interface to unify multiple worker-owned
> + * BufFiles and a leader-owned BufFile within a leader process. This relies
> + * on various fd.c conventions about the naming of temporary files.

Without meaning to sound glib, unification is the process by which parallel CREATE INDEX has the leader read temp files from workers, sufficient to complete its final on-the-fly merge. So, it's terminology that's a bit like "speculative insertion" was up until UPSERT was committed: a concept that is somewhat in flux, and describes a new low-level mechanism built to support a higher-level operation, which must accord with a higher-level set of requirements (so, for speculative insertion, that would be avoiding "unprincipled deadlocks" and so on). That being the case, maybe "unification" isn't useful as a precise piece of terminology at this point, but that will change.

While I'm fairly confident that I basically have the right idea with this patch, I think that you are better at judging the ins and outs of resource management than I am, not least because of the experience of working on parallel query itself. Also, I'm signed up to review parallel hash join in large part because I think there might be some convergence concerning the sharing of BufFiles among parallel workers. I don't think I'm qualified to judge what a general abstraction like this should look like, but I'm trying to get there.

> That comment tells you that unification is a thing you can do -- via
> an unspecified interface for unspecified reasons using unspecified
> conventions -- but it doesn't tell you what the semantics of it are
> supposed to be. For example, if we "unify" several BufFiles, do they
> then have a shared seek pointer?

No.

> Do the existing contents effectively
> get concatenated in an unpredictable order, or are they all expected
> to be empty at the time unification happens? Or something else?

The order is the same order as ordinal identifiers are assigned to workers within tuplesort.c, which is undefined, with the notable exception of the leader's own identifier (-1) and area of the unified BufFile space (this is only relevant in randomAccess cases, where the leader may write stuff out to its own reserved part of the BufFile space). It only matters that the bit of metadata in shared memory is in that same order, which it clearly will be. So, it's unpredictable, but in the same way that ordinal identifiers are assigned in a not-well-defined order; it doesn't, or at least shouldn't, matter. We can imagine a case where it does matter, and we probably should, but that case isn't parallel CREATE INDEX.
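
To make that concrete (this is only a toy model written for this email, with invented names -- not the structs the patch actually uses), all the ordering really has to buy us is this: each participant's materialized file size is recorded in shared memory in ordinal order, and a position in the unified BufFile space resolves to a (participant, local offset) pair by walking that array:

/*
 * Toy model of "unification" ordering -- NOT the patch's real data
 * structures.  Each participant's temp file size is recorded in ordinal
 * order; a unified offset is resolved to (participant, local offset) by
 * walking the array.  The leader's own reserved space (ordinal -1 in the
 * patch's terms) would simply be one more entry here.
 */
#include <stdio.h>

typedef struct UnifiedSpace
{
    int         nparticipants;
    long        sizes[16];      /* per-participant file size, ordinal order */
} UnifiedSpace;

/* Map a unified offset to a participant and an offset within its file. */
static int
resolve_offset(const UnifiedSpace *u, long unified, long *local)
{
    for (int i = 0; i < u->nparticipants; i++)
    {
        if (unified < u->sizes[i])
        {
            *local = unified;
            return i;
        }
        unified -= u->sizes[i];
    }
    return -1;                  /* past the end of the unified space */
}

int
main(void)
{
    UnifiedSpace u = {3, {100, 250, 50}};
    long        local;
    int         who = resolve_offset(&u, 180, &local);

    /* offset 180 falls in participant 1's file, at local offset 80 */
    printf("participant %d, local offset %ld\n", who, local);
    return 0;
}

As long as the shared-memory metadata is written in the same ordinal order that identifiers were assigned in, the fact that that order is not well defined never becomes visible to the merge.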
> It's fine to make up new words -- indeed, in some sense that is the essence
> of writing any complex problem -- but you have to define them.

I invite you to help me define this new word.

> It's hard to understand how something like this doesn't leak
> resources. Maybe that's been thought about here, but it isn't very
> clear to me how it's supposed to work.

I agree that it would be useful to centrally document what all this unification stuff is about. Suggestions on where that should live are welcome.

> In Heikki's proposal, if
> process A is trying to read a file owned by process B, and process B
> dies and removes the file before process A gets around to reading it,
> we have got trouble, especially on Windows which apparently has low
> tolerance for such things. Peter's proposal avoids that - I *think* -
> by making the leader responsible for all resource cleanup, but that's
> inferior to the design we've used for other sorts of shared resource
> cleanup (DSM, DSA, shm_mq, lock groups) where the last process to
> detach always takes responsibility.

Maybe it's inferior to that, but I think what Heikki proposes is more or less complementary to what I've proposed, and has nothing to do with resource management and plenty to do with making the logtape.c interface look nice, AFAICT. It's also about refactoring/simplifying logtape.c itself, while we're at it. I believe that Heikki has yet to comment either way on my approach to resource management, one aspect of the patch that I was particularly keen on your looking into.

The theory of operation here is that workers own their own BufFiles, and are responsible for deleting them when they die. The assumption, rightly or wrongly, is that it's sufficient that workers flush everything out (write out temp files), and yield control to the leader, which will open their temp files for the duration of the leader's final on-the-fly merge. The resource manager in the leader knows it isn't supposed to ever delete worker-owned files (just close() the FDs), and the leader errors if it cannot find temp files that match what it expects. If there is an error in the leader, it shuts down workers, and they clean up, more than likely. If there is an error in the worker, or if the files cannot be deleted (e.g., if there is a classic hard crash scenario), we should also be okay, because nobody will trip up on some old temp file from some worker, since fd.c has some gumption about what workers need to do (and what the leader needs to avoid) in the event of a hard crash.

I don't see a risk of file descriptor leaks, which may or may not have been part of your concern (please clarify).
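
To spell that division of labor out, here is a deliberately dumbed-down model of the lifecycle, written against plain POSIX calls rather than fd.c/buffile.c (the function names and the hard-coded path are stand-ins invented for this sketch, not what the patch literally does): the worker creates, writes, and flushes its temp file and then only ever close()s it; the leader opens it read-only for the merge and likewise only close()s it, never unlinking; deleting the file remains the owner's job, or post-hard-crash cleanup's.

/*
 * Simplified model of the worker/leader temp file protocol -- stand-in
 * names, plain POSIX instead of fd.c/buffile.c.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Worker: materialize its run(s), flush, and make the path known to the
 * leader (via shared memory, in the real patch).  The worker keeps
 * ownership: only it, or crash cleanup, ever unlinks the file.
 */
static int
worker_materialize(const char *path, const void *data, size_t len)
{
    int         fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);           /* close, but do NOT unlink */
}

/*
 * Leader: open the worker's file read-only for the final on-the-fly
 * merge.  On any exit path it just close()s the FD; it must never unlink
 * a file it does not own.  If the file is missing, the leader errors out.
 */
static int
leader_read_back(const char *path, char *buf, size_t len)
{
    int         fd = open(path, O_RDONLY);
    ssize_t     nread;

    if (fd < 0)
        return -1;
    nread = read(fd, buf, len);
    close(fd);
    return (int) nread;
}

int
main(void)
{
    const char *path = "/tmp/fake_worker_0.tape";  /* made-up name */
    char        buf[32] = {0};

    worker_materialize(path, "sorted run", strlen("sorted run"));
    leader_read_back(path, buf, sizeof(buf) - 1);
    printf("leader merged: %s\n", buf);
    unlink(path);               /* the owner cleans up, not the reader */
    return 0;
}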
> That avoids assuming that we're
> always dealing with a leader-follower situation, it doesn't
> categorically require the leader to be the one who creates the shared
> resource, and it doesn't require the leader to be the last process to
> die.

I have an open mind about that, especially given the fact that I hope to generalize the unification stuff further, but I am not aware of any reason why that is strictly necessary.

> Imagine a data structure that is stored in dynamic shared memory and
> contains space for a filename, a reference count, and a mutex. Let's
> call this thing a SharedTemporaryFile or something like that. It
> offers these APIs:
>
> extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
> extern void SharedTemporaryFileAttach(SharedTemporary File *, dsm_segment *seg);
> extern void SharedTemporaryFileAssign(SharedTemporyFile *, char *pathname);
> extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);

I'm a little bit tired right now, and I have yet to look at Thomas' parallel hash join patch in any detail. I'm interested in what you have to say here, but I think that I need to learn more about its requirements in order to have an informed opinion.
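
That said, just so I'm sure I'm picturing the same shape you are, my reading of the proposal is roughly the following. This is illustrative only: the struct layout is my guess at what you describe, a detach step is implied by the last-to-detach rule even though it isn't in your list, and the pthread mutex plus fixed-size pathname buffer are stand-ins for whatever we'd actually use inside a DSM segment.

/*
 * My reading of the SharedTemporaryFile idea -- illustrative only.  In a
 * real implementation this would live in dynamic shared memory and use
 * the usual shared memory locking primitives; pthread_mutex_t and the
 * fixed-size path buffer are just stand-ins for this sketch.
 */
#include <pthread.h>
#include <string.h>

typedef struct SharedTemporaryFile
{
    char            pathname[1024]; /* name of the underlying temp file */
    int             refcount;       /* number of attached processes */
    pthread_mutex_t mutex;          /* protects refcount and pathname */
} SharedTemporaryFile;

/* Set up the shared state before anyone attaches. */
static void
shared_temp_file_initialize(SharedTemporaryFile *stf)
{
    stf->pathname[0] = '\0';
    stf->refcount = 0;
    pthread_mutex_init(&stf->mutex, NULL);
}

/* Attach: bump the refcount under the mutex. */
static void
shared_temp_file_attach(SharedTemporaryFile *stf)
{
    pthread_mutex_lock(&stf->mutex);
    stf->refcount++;
    pthread_mutex_unlock(&stf->mutex);
}

/*
 * Detach: whoever drops the refcount to zero removes the file, so cleanup
 * doesn't depend on the leader creating the resource or outliving everyone.
 * Returns 1 if the caller should unlink the file.
 */
static int
shared_temp_file_detach(SharedTemporaryFile *stf)
{
    int         last;

    pthread_mutex_lock(&stf->mutex);
    last = (--stf->refcount == 0);
    pthread_mutex_unlock(&stf->mutex);
    return last;
}

int
main(void)
{
    SharedTemporaryFile stf;

    shared_temp_file_initialize(&stf);
    strcpy(stf.pathname, "pgsql_tmp/worker0.0");    /* made-up name */
    shared_temp_file_attach(&stf);  /* worker attaches */
    shared_temp_file_attach(&stf);  /* leader attaches */
    shared_temp_file_detach(&stf);  /* worker exits */
    if (shared_temp_file_detach(&stf))
    {
        /* last detacher (here: the leader) would unlink stf.pathname */
    }
    return 0;
}

If that's about right, then the difference from what I've done is mainly who is allowed to perform the final unlink, not how the file gets written or read.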
>> That leaves one problem, though: reusing space in the final merge phase. If
>> the tapes being merged belong to different LogicalTapeSets, and create one
>> new tape to hold the result, the new tape cannot easily reuse the space of
>> the input tapes because they are on different tape sets.
>
> If the worker is always completely finished with the tape before the
> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
> the tape and overwrite it like any other?

I'll remind you that parallel CREATE INDEX doesn't actually ever need to be randomAccess, and so we are not actually going to ever need to do this as things stand. I wrote the code that way in order to not break the existing interface, which seemed like a blocker to posting the patch. I am open to the idea of such an "adoption" occurring, even though it actually wouldn't help any case that exists in the patch as proposed. I didn't go that far in part because it seemed premature, given that nobody had looked at my work to date at the time, and given the fact that there'd be no initial user-visible benefit, and given how the exact meaning of "unification" was (and is) somewhat in flux. I see no good reason not to do that, although that might change if I actually seriously undertook to teach the leader about this kind of "adoption". I suspect that the interface specification would make for confusing reading, which isn't terribly appealing, but I'm sure I could manage to make it work given time.

--
Peter Geoghegan