Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Msg-id: CAM3SWZS6g0fM=W9rXof_BrVRUpyX0tFFcCBApYoxbW1ieYGDeg@mail.gmail.com
In response to: Re: Parallel tuplesort (for parallel B-Tree index creation) (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-hackers
On Wed, Sep 21, 2016 at 5:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I find this unification business really complicated.

I can certainly understand why you would. As I said, it's the most
complicated part of the patch, which overall is one of the most
ambitious patches I've ever written.

> I think it'd be simpler to keep the BufFiles and LogicalTapeSets
> separate, and instead teach tuplesort.c how to merge tapes that live
> on different LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so
> that a single LogicalTapeSet can contain tapes from different
> underlying BufFiles.
>
> What I have in mind is something like the attached patch. It refactors
> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a
> LogicalTape as argument, instead of LogicalTapeSet and tape number.
> LogicalTapeSet doesn't have the concept of a tape number anymore, it
> can contain any number of tapes, and you can create more on the fly.
> With that, it'd be fairly easy to make tuplesort.c merge LogicalTapes
> that came from different tape sets, backed by different BufFiles. I
> think that'd avoid much of the unification code.

I think that it won't be possible to make a LogicalTapeSet ever use
more than one BufFile without regressing the ability to eagerly reuse
space, which is almost the entire reason for logtape.c existing. The
whole indirect block thing is an idea borrowed from the FS world, of
course, and so logtape.c needs one block-device-like BufFile, with
blocks that can be reclaimed eagerly, but consumed for recycling in
*contiguous* order (which is why they're sorted using qsort() within
ltsGetFreeBlock(); see the simplified sketch further down). You're
going to increase the amount of random I/O by using more than one
BufFile for an entire tapeset, I think.

This patch you posted
("0001-Refactor-LogicalTapeSet-LogicalTape-interface.patch") just keeps
one BufFile, and only changes the interface to expose the tapes
themselves to tuplesort.c, without actually making tuplesort.c do
anything with that capability. I see what you're getting at, I think,
but I don't see how that accomplishes all that much for parallel CREATE
INDEX. I mean, the special case of having multiple tapesets from
workers (rather than one "unified" tapeset created from the workers'
temp files to begin with) now needs special treatment. Haven't you just
moved the complexity around (once your patch is made to care about
parallelism)? Having multiple entire tapesets explicitly from workers,
with their own BufFiles, is not clearly less complicated than managing
delineated ranges of "logical tapeset space" within BufFile fd.c files.
Seems almost equivalent, except that my way doesn't bother tuplesort.c
with any of this.

>> + * As a consequence of only being permitted to write to the leader
>> + * controlled range, parallel sorts that require a final materialized
>> + * tape will use approximately twice the disk space for temp files
>> + * compared to a more or less equivalent serial sort.
>
> I'm slightly worried about that. Maybe it's OK for a first version,
> but it'd be annoying in a query where a sort is below a merge join,
> for example, so that you can't do the final merge on the fly because
> mark/restore support is needed.

My intuition is that we'll *never* end up using this for merge joins. I
think that I could do better here (why should workers really care at
this point?), but just haven't bothered to.
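To make the eager-reuse point concrete, the kind of bookkeeping I'm
talking about looks roughly like this (a simplified, illustrative
sketch; the struct and function names here are made up for the example
rather than taken from logtape.c, and error handling is omitted):

#include <stdlib.h>

typedef struct FreeBlockPool
{
    long   *freeBlocks;      /* block numbers available for reuse */
    int     nFreeBlocks;     /* # of entries in freeBlocks[] */
    int     freeBlocksLen;   /* allocated length of freeBlocks[] */
    int     blocksSorted;    /* is freeBlocks[] currently sorted? */
    long    nFileBlocks;     /* # of blocks ever allocated in the file */
} FreeBlockPool;

/* Sort descending, so the lowest block number ends up last in the array. */
static int
freeBlocks_cmp(const void *a, const void *b)
{
    long        ablk = *((const long *) a);
    long        bblk = *((const long *) b);

    if (ablk < bblk)
        return 1;
    if (ablk > bblk)
        return -1;
    return 0;
}

/* Return a no-longer-needed block to the pool for eager reuse. */
static void
releaseBlock(FreeBlockPool *pool, long blocknum)
{
    if (pool->nFreeBlocks >= pool->freeBlocksLen)
    {
        /* grow the array (realloc failure handling omitted in sketch) */
        pool->freeBlocksLen = pool->freeBlocksLen ? pool->freeBlocksLen * 2 : 32;
        pool->freeBlocks = realloc(pool->freeBlocks,
                                   pool->freeBlocksLen * sizeof(long));
    }
    pool->freeBlocks[pool->nFreeBlocks++] = blocknum;
    pool->blocksSorted = 0;     /* re-sort lazily, on next allocation */
}

/*
 * Allocate a block: hand back the lowest-numbered free block first, so
 * that reuse within the single underlying file stays roughly contiguous;
 * only extend the file when nothing can be recycled.
 */
static long
getFreeBlock(FreeBlockPool *pool)
{
    if (pool->nFreeBlocks > 0)
    {
        if (!pool->blocksSorted)
        {
            qsort(pool->freeBlocks, pool->nFreeBlocks,
                  sizeof(long), freeBlocks_cmp);
            pool->blocksSorted = 1;
        }
        return pool->freeBlocks[--pool->nFreeBlocks];
    }
    return pool->nFileBlocks++;
}

The point is that one pool like this is shared by every tape in the
set, and that only works because there is exactly one underlying
BufFile to allocate blocks from.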
This parallel sort implementation is something written with CREATE
INDEX and CLUSTER in mind only (maybe one or two other things, too). I
believe that for query execution, partitioning is the future [1]. With
merge joins, partitioning is desirable because it lets you push down
*everything* to workers, not just sorting (e.g., by aligning
partitioning boundaries on each side of each merge join sort in the
worker, and having the worker also "synchronize" each side of the join,
all independently and without a dependency on a final merge).

That's why I think it's okay that I use twice as much space for
randomAccess tuplesort.c callers. No real-world caller will ever end up
needing to do this. It just seems like a good idea to support
randomAccess when using this new infrastructure, on general principle.
Forcing myself to support that case during initial development actually
resulted in much cleaner, less invasive changes to tuplesort.c in
general.

[1] https://www.postgresql.org/message-id/flat/CAM3SWZR+ATYAzyMT+hm-Bo=1L1smtJbNDtibwBTKtYqS0dYZVg@mail.gmail.com#CAM3SWZR+ATYAzyMT+hm-Bo=1L1smtJbNDtibwBTKtYqS0dYZVg@mail.gmail.com

--
Peter Geoghegan