Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort" - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort" |
Date | |
Msg-id | CAM3SWZTWTpN2UJG46e4JX3MS_WYdg3cdbgbcouqtosTTj5fd8A@mail.gmail.com Whole thread Raw |
In response to | Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort" (Heikki Linnakangas <hlinnaka@iki.fi>) |
Responses |
Re: Using quicksort and a merge step to significantly
improve on tuplesort's single run "external sort"
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort" |
List | pgsql-hackers |
On Thu, Jul 30, 2015 at 12:00 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Hmm. You don't really need to merge the in-memory array into the tape, as > you know that all the tuples in the in-memory must come after the tuples > already on the tape. You can just return all the tuples from the tape first, > and then all the tuples from the array. It's more complicated than it appears, I think. Tuples may be variable sized. WRITETUP() performs a pfree(), and gives us back a variable amount of availMem. What if we dumped a single, massive, outlier tuple out when a caller passes it and it goes to the root of the heap? We'd dump that massive tuple in one go (this would be an incremental dumptuples() call, which we still do in the patch), making things !LACKMEM() again, but by an usually comfortable margin. We read in a few more regular tuples, but we're done consuming tuples before things ever get LACKMEM() again (no more dumping needed, at least with this patch applied). What prevents the tuple at the top of the in-memory heap at the point of tuplesort_performsort() (say, one of the ones added to the heap as our glut of memory was *partially* consumed) being less than the last/greatest tuple on tape? If the answer is "nothing", a merge step is clearly required. This is not a problem when every single tuple is dumped, but that doesn't happen anymore. I probably should have shown more tests, that tested HeapTuple sorts (not just datum tuple sorts). I agree that things at least usually happen as you describe, FWIW. > I think it'd make sense to structure the code differently, to match the way > I described this optimization above. Instead of adding a new tuplesort state > for this, abstract this in the logical tape code. Add a function to attach > an in-memory "tail" to a tape, and have LogicalTapeRead() read from the tail > after reading the on-disk tape. The rest of the code wouldn't need to care > that sometimes part of the tape is actually in memory. I'll need to think about all of that. Certainly, quicksorting runs in a more general way seems like a very promising idea, and I know that this patch does not go far enough. But it also add this idea of not dumping out most tuples, which seems impossible to generalize further, so maybe it's a useful special case to start from for that reason (and also because it's where the pain is currently felt most often). >> + * Note that there might actually be 2 runs, but only the >> + * contents of one of them went to tape, and so we can >> + * safely "pretend" that there is only 1 run (since we're >> + * about to give up on the idea of the memtuples array being >> + * a heap). This means that if our sort happened to require >> + * random access, the similar "single run" optimization >> + * below (which sets TSS_SORTEDONTAPE) might not be used at >> + * all. This is because dumping all tuples out might have >> + * forced an otherwise equivalent randomAccess case to >> + * acknowledge a second run, which we can avoid. > > > Is that really true? We don't start a second run until we have to, i.e. when > it's time to dump the first tuple of the second run to tape. So I don't > think the case you describe above, where you have two runs but only one of > them has tuples on disk, can actually happen. I think we're talking about two slightly different things. I agree that I am avoiding "starting" a second run because I am avoiding dumping tuples, just as you say (I describe this as avoiding "acknowledging" a second run). But there could still be SortTuples that have a tupindex that is > 0 (they could be 1, to be specific). It's pretty clear from looking at the TSS_BUILDRUNS case within puttuple_common() that this is true. So, if instead you define "starting" a tuple as adding a sortTuple with a tupindex that is > 0, then yes, this comment is true. The important thing is that since we're not dumping every tuple, it doesn't matter whether or not a that TSS_BUILDRUNS case within puttuple_common() ever took the "currentRun + 1" insertion path (which can easily happen), provided things aren't so skewed that it ends up on tape even without dumping all tuples (which seems much less likely). As I've said, this optimization will occur a lot more often then the existing one run optimization (assuming !randomAccess), as a nice side benefit of not dumping every tuple. Quicksort does not use HEAPCOMPARE(), so clearly having multiple runs in that "subrun" is a non-issue. Whether or not we say that a second run "started", or that there was merely the "unfulfilled intent to start a new, second run" is just semantics. While I certainly care about semantics, my point is that we agree that this useful "pretend there is only one run" thing happens (I think). The existing one run optimization only really occurs when the range of values in the set of tuples is well characterized by the tuple values observed during initial heapification, which is bad. Or would be bad, if the existing optimization was good. :-) >> Performance >> ========== > > > Impressive! > >> Predictability >> ========== > > > Even more impressive! Thanks! > As an extra optimization, you could delay quicksorting the in-memory array > until it's time to read the first tuple from it. If the caller reads only > the top-N tuples from the sort for some reason (other than LIMIT, which we > already optimize for), that could avoid a lot of work. Won't comment on that yet, since it's predicated on the merge step being unnecessary. I need to think about this some more. -- Peter Geoghegan
pgsql-hackers by date: