Re: Using quicksort for every external sort run - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Using quicksort for every external sort run
Date:
Msg-id: CAM3SWZS3ttv3_AjGw_BmRRd3QML3YZ9FueN=bwQ0r6dZVS2mDA@mail.gmail.com
In response to: Re: Using quicksort for every external sort run (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Using quicksort for every external sort run; Re: Using quicksort for every external sort run
List: pgsql-hackers
On Thu, Mar 17, 2016 at 1:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> OK, I have now committed 0001

I attach a revision of the external quicksort patch and supplementary small patches, rebased on top of the master branch. Changes:

1. As I mentioned on the "Improve memory management for external sorts" -committers thread, we should protect against currentRun integer overflow. This new revision does so (a sketch of such a guard appears below). I'm not sure whether that change needs to be back-patched; I just don't want to take any risks, and see this as low-cost insurance. Really low workMem sorts are now actually fast enough that this seems like something that could happen on a misconfigured system.

2. Add explicit constants for the special run numbers that replacement selection in particular needs to care about (sketched below). I did this because change 1 reminded me of the "currentRun vs. SortTuple.tupindex" run numbering subtleties. The explicit use of certain run number constants seems to better convey some tricky details, in part by letting me add a few documenting (if obvious) assertions. It's educational to be able to grep for these constants (e.g., the new HEAP_RUN_NEXT constant) to jump to the parts of the code that need to think about replacement selection. As things were, that code relied on too much from too great a distance (arguably this is true even in the master branch). This change in turn led to minor wordsmithing in adjacent areas here and there, most of it subtractive. As an example of where this helps, ISTM that the assertion added to the routine tuplesort_heap_insert() is now self-documenting, which wasn't the case before.

3. There was one very tricky consideration around an edge case that required careful thought. This was an issue within my new function dumpbatch(). It could previously perform what turns out to be a superfluous selectnewtape() call when we take the dumpbatch() "state->memtupcount == 0" early return path (see the last revision for full details of that now-axed code path). Now, we accept that there may on occasion be 0 tuple runs. In other words, we never return early from within dumpbatch() (the resulting control flow is sketched below). There was previously no explanation for why it was okay to have a superfluous selectnewtape() call. However, needing to be certain that any newly selected destTape will go on to receive a run is implied for the general case by this existing selectnewtape() code comment:

 * This is called after finishing a run when we know another run
 * must be started.  This implements steps D3, D4 of Algorithm D.

While the previous revision was correct anyway, I tried to explain why it was correct in comments, and soon realized that that was a really bad idea; the rationale was excessively complicated. Allowing 0 tuple runs in rare cases seems like the simplest solution. After all, mergeprereadone() is expressly prepared for 0 tuple runs. It says "ensure that we have at least one tuple, if any are to be had". There is no reason to assume that it says this only because it imagines that no tuples might be found *only after* the first preread for the merge (by which I mean, I don't think the comment applies only to the case where a final on-the-fly merge reloads tuples from one particular tape after running out of in-memory tuples for that tape/run).

4. I updated the function beginmerge() to acknowledge an inconsistency for pass-by-value datum sorts, which I mentioned in passing on this thread a few days back.
The specific change:

```diff
@@ -2610,7 +2735,12 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 	if (finalMergeBatch)
 	{
-		/* Free outright buffers for tape never actually allocated */
+		/*
+		 * Free outright buffers for tape never actually allocated.  The
+		 * !state->tuples case is actually entitled to have at least this much
+		 * of a refund ahead of its final merge, but we don't go out of our way
+		 * to marginally improve that case.
+		 */
 		FREEMEM(state, (state->maxTapes - activeTapes) * TAPE_BUFFER_OVERHEAD);
```

It's not worth worrying about this case, since the savings are small (especially now that maxTapes is capped). But it's worth acknowledging that the "!state->tuples" case is being "short-changed", in the new spirit of heavily scrutinizing where memory goes in tuplesort.c.

5. I updated the "Add MemoryContextStats() calls for debugging" patch, and now formally propose that this debugging instrumentation be committed. The revised instrumentation patch does not have the system report anything about the memory context just because "trace_sort = on". Rather, it does nothing on ordinary builds, where the macro SHOW_MEMORY_STATS will not be defined (it also respects trace_sort); this is about the same approach seen in postgres.c's finish_xact_command() (an illustration appears below). ISTM that we ought to provide a way of debugging memory use within tuplesort.c, since we now know that it can be very important. Let's not forget where the useful places to look for problems are.

6. Based on your feedback on the batch memory patch (your commit c27033ff), I made a stylistic change: I added similar comments about the newly added quicksort/dumpbatch() MemoryContextReset() call, since it has its own special considerations (a big change in the pattern of allocations occurs after batch memory is used -- we need to be careful about how that could impact the "bucketing by size class"). A sketch of that call appears below.

Thanks

--
Peter Geoghegan
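To make item 1 concrete, here is a minimal sketch of the kind of overflow guard being described. state->currentRun is an existing tuplesort.c field, but the placement, the error message, and the use of INT_MAX as the limit are assumptions rather than the patch's actual code.

```c
/*
 * Sketch only: refuse to start yet another run rather than let the run
 * counter wrap around.  With very small workMem, runs are started often
 * enough that this is cheap insurance against a misconfigured system.
 */
if (state->currentRun == INT_MAX)
	elog(ERROR, "cannot have more than %d runs for an external sort",
		 INT_MAX);
state->currentRun++;
```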
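Item 2's constants might look something like the following. HEAP_RUN_NEXT is the name used above, but RUN_FIRST, the concrete values, and the exact form of the tuplesort_heap_insert() assertion are illustrative assumptions.

```c
#define RUN_FIRST		0		/* the run currently being built in memory */
#define HEAP_RUN_NEXT	INT_MAX /* replacement selection only: tuple belongs
								 * to the next run, whose number is not yet
								 * assigned */

/*
 * Inside tuplesort_heap_insert(), the run-number contract can then be
 * stated directly rather than relied on from a distance:
 */
Assert(!checkIndex || tupleindex == RUN_FIRST);
```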
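The control flow described in item 3 would end dumpbatch() roughly as follows. The helper and field names (markrunend(), destTape, tp_runs, tp_dummy) already exist in tuplesort.c; the exact sequence shown is a sketch, not the patch itself.

```c
/*
 * No early return for state->memtupcount == 0: the run end marker and the
 * Algorithm D bookkeeping happen unconditionally, so a tape selected by
 * selectnewtape() always receives a run, even a zero tuple one, which
 * mergeprereadone() is already prepared to read back.
 */
markrunend(state, state->tp_tapenum[state->destTape]);
state->currentRun++;
state->tp_runs[state->destTape]++;
state->tp_dummy[state->destTape]--; /* per Alg D step D2 */
```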
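For item 5, the instrumentation pattern being described (do nothing unless a developer defines SHOW_MEMORY_STATS, and even then respect trace_sort) would look roughly like this; where exactly such calls are placed within tuplesort.c is an assumption.

```c
#ifdef SHOW_MEMORY_STATS
	/* Developer-only: dump the sort context's memory usage for leak tracking */
	if (trace_sort)
		MemoryContextStats(state->sortcontext);
#endif
```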
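And for item 6, the dumpbatch() reset in question is essentially a one-liner; the comment wording here is illustrative, and it assumes caller tuples live in their own per-sort child context (state->tuplecontext).

```c
/*
 * All of the tuples just written out have been freed; reset the context
 * outright rather than relying on piecemeal pfree()s.  The pattern of
 * allocations is about to change substantially, and a wholesale reset
 * avoids fragmentation from AllocSet's bucketing by size class.
 */
MemoryContextReset(state->tuplecontext);
```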