Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort" - Mailing list pgsql-hackers
From    | Peter Geoghegan
Subject | Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"
Date    |
Msg-id  | CAM3SWZTzLT5Y=VY320NznAyz2z_em3us6x=7rXMEUma9Z9yN6Q@mail.gmail.com
List    | pgsql-hackers
The behavior of external sorts that do not require any merge step due to only having one run (what EXPLAIN ANALYZE output shows as an "external sort", and not a "merge sort") seems like an area that can be significantly improved upon. As noted in code comments, this optimization did not appear in The Art of Computer Programming, Volume III. It's not an unreasonable idea, but it doesn't work well on modern machines, because it uses heapsort, which is known to use the cache ineffectively. It also often implies significant additional I/O for little or no benefit. I suspect that all the reports we've heard of smaller work_mem sizes improving sort performance are actually down to this one-run optimization *hurting* performance.

The existing optimization for this case tries to benefit from avoiding a final N-way merge step; that's what it's all about. (This is not to be confused with the other reason why a sort can end in TSS_SORTEDONTAPE: because random tape access is required, an *on-the-fly* TSS_FINALMERGE merge step cannot happen. Currently, TSS_SORTEDONTAPE is sometimes the "fast" case, and sometimes the slow case taken only because the caller specified randomAccess. I'm improving on only the "fast" case [1], where the caller may or may not have requested randomAccess. I require that they specify !randomAccess to use my improved optimization, though, and I'm not trying to avoid a merge step.)

The existing optimization just dumps all tuples in the memtuples array (which is a heap at that point) to tape, writing one tuple out at a time from the top of the heap, maintaining the heap invariant throughout. Then, with the memtuples array emptied, tuples are fetched from tape/disk for client code, without any merge step occurring on-the-fly (or at all).

Patch -- "quicksort with spillover"
===================================

With the attached patch, I propose to add an additional, better "one run special case" optimization.
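For contrast, the existing one-run path described above can be modeled as follows. This is a simplified Python sketch, not PostgreSQL code; the function name is mine. The point is that every tuple incurs heap maintenance and a round trip to "tape", even though no merge is ever needed:

```python
import heapq

def existing_one_run_path(tuples):
    """Toy model of the existing TSS_SORTEDONTAPE one-run optimization:
    every tuple passes through the heap and is written out to 'tape',
    even though no merge step ever happens."""
    heap = []
    for t in tuples:
        heapq.heappush(heap, t)           # heap invariant maintained per insert
    tape = []
    while heap:
        tape.append(heapq.heappop(heap))  # dump one tuple at a time, from the top
    return tape                           # client code then reads it all back

print(existing_one_run_path([5, 1, 4, 2, 3]))  # [1, 2, 3, 4, 5]
```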
This new special case "splits" the single run into two "subruns". One of them comprises whatever tuples were in memory when the caller finished passing tuples to tuplesort. To sort those, we use quicksort, which in general has various properties that make it much faster than heapsort -- it's a cache-oblivious algorithm, which is important these days. The other "subrun" is whatever tuples were on tape when tuplesort_performsort() was called. This will often be a minority of the total, but certainly never much more than half, and it is already sorted by the time tuplesort_performsort() is reached. This spillover is already inserted at the front of the sorted-on-tape tuples, and so already has reasonably good cache characteristics.

With the patch, we perform an on-the-fly merge that is somewhat similar to the existing (unaffected) "merge sort" TSS_FINALMERGE case, except that one of the runs is in memory, is potentially much larger than the on-tape/disk run (but never much smaller), and is quicksorted. The existing "merge sort" case is similarly only used when the caller specified !randomAccess.

For subtle reasons, the new TSS_MEMTAPEMERGE case will happen significantly more frequently than the existing, comparable TSS_SORTEDONTAPE case currently happens (this applies to !randomAccess callers only, of course). See the comments added to tuplesort_performsort(), and the commit message, for the details. Note that the existing, comparable case was relocated to tuplesort_performsort(), to highlight that it is now a fallback for the new TSS_MEMTAPEMERGE case (also in tuplesort_performsort()).

Performance
===========

This new approach can be much faster.
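The mechanism described above can be sketched like so (a toy Python model, not PostgreSQL code; `list.sort` stands in for quicksort, and the function name is mine):

```python
import heapq

def memtape_merge(on_tape_run, memtuples):
    """Toy sketch of TSS_MEMTAPEMERGE: the on-tape subrun arrives already
    sorted; the (usually larger) in-memory subrun is quicksorted; the two
    sorted subruns are then merged on the fly, so the in-memory tuples are
    never written out to tape at all."""
    memtuples.sort()                     # "quicksort" the in-memory subrun
    # lazy, on-the-fly merge of the two sorted subruns, analogous to the
    # existing TSS_FINALMERGE machinery
    yield from heapq.merge(on_tape_run, memtuples)

# The on-tape subrun (first argument, sorted) is never much more than half:
print(list(memtape_merge([1, 4, 9], [8, 2, 7, 0])))  # [0, 1, 2, 4, 7, 8, 9]
```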
For example:

select setseed(1);
-- 10 million tuple table with low cardinality integer column to sort:
create unlogged table low_cardinality_int4 as
  select (random() * 1000)::int4 s, 'abcdefghijlmn'::text junk
  from generate_series(1, 10000000);
set work_mem = '225MB';

-- test query:
select count(distinct(s)) from low_cardinality_int4;
 count
-------
  1001
(1 row)

On my laptop, a patched Postgres takes about 4.2 seconds to execute this query; master takes about 16 seconds. With this 225MB work_mem setting, the patch sees this case quicksort 9,830,398 tuples out of a total of 10 million.

This is chosen to be a sympathetic case, but it is still quite realistic. We should look at a much less sympathetic case, too. Even when the in-memory "subrun" is about as small as it can be while still having this new special case optimization occur at all, the patch still ends up pretty far ahead of the master branch. With work_mem at 100MB, 4,369,064 tuples are quicksorted by the patch when the above query is executed, which is less than half of the total (10 million). Execution with the patch now takes about 10.2 seconds; master takes about 14.7 seconds with the same work_mem setting (yes, master gets faster as the patch gets slower as work_mem is reduced... but they never come close to meeting). That's still a fairly decent improvement, and it occurs when we're on the verge of not using the optimization at all.

Most users that really care about performance will at least have enough memory to benefit from this optimization when building an index on a large table, because the patch roughly halves the amount of memory you need to get at least some of the benefit of an internal sort.

Performance regressions
-----------------------

I have been unable to find any regressions in the performance of queries with the patch. If you're looking for a query that might have been regressed, I suggest coming up with a case involving a sort with pass-by-reference types that are expensive.
For example, sorting a tuple on many low cardinality text attributes. Only the most extreme such cases seem likely to be regressed, though, because heapsort also has bad temporal locality. My strcoll() result caching patch [2] tries to take advantage of temporal locality rather than spatial locality, which works well with quicksort and mergesort.

Memory use
----------

The new TSS_MEMTAPEMERGE case uses no more memory than the existing "TSS_SORTEDONTAPE due to one run" optimization (actually, it uses very slightly less), if we only worry about the high-water mark. In both cases the high-water mark comes as the work_mem limit is reached. Typically, most memory isn't released until the sort is shut down.

Because we're not dumping all tuples here (only as many as we need to dump to stay within our memory budget), we're also not freeing memory for each and every "tuple proper" as each and every SortTuple is written out (because we usually avoid writing out most tuples, which is an important goal of the patch). Although the memtuples array itself is often tuplesort's dominant consumer of work_mem, it's still possible that the aggregate effect of this patch on some workload's memory consumption is that more memory is used. I doubt it, though; the overall effect on memory usage will probably always be to reduce it. Finishing a sort sooner allows us to make memory available for other operations sooner. Besides, we have broken no promise to the caller with respect to memory use.

Predictability
==============

The patch alters the performance characteristics of tuplesort, but very much for the better. With the patch, as less memory is gradually made available for sorting, performance tends to also degrade gradually, because we more or less have an algorithm that's a kind of hybrid internal/external sort that, roughly speaking, blends from the former to the latter based on the availability of memory.
It's particularly useful that performance doesn't fall off a cliff when we can no longer fit everything in memory because work_mem is slightly too small. The "almost internal" query from my example above takes about 4.2 seconds. An equivalent internal sort (quicksort) takes about 3.5 seconds, which is pretty close (~98% of tuples are quicksorted for the "almost internal" case, but heapification is a price we must pay to spill even one tuple). Furthermore, as work_mem is decreased to the point that even this optimization is no longer used -- when a traditional "merge sort"/TSS_FINALMERGE is used instead -- there is also no big drop.

*Predictable* performance characteristics are a big asset. Decreasing work_mem by 10% (or a moderate increase in the size of a table) should not make the performance of reporting queries tank. Concerns about worst case performance (e.g. particular queries suddenly taking much longer to execute) have certainly prevented me from decreasing work_mem from within postgresql.conf in the past, even though I was fairly confident that it made sense for the average case.

Open Items
==========

There are a few smaller open items, indicated by "XXX" comments. There is a little overlap between this patch and a couple of others in the queue that also affect sorting. For example, I'm considerate of cases that don't exist yet.

Future work
===========

In the future, we should think about optimizing the "merge sort"/TSS_FINALMERGE case, which should be made to sort *every* run using quicksort (the devil is in the details there, but in general I think that runs should be quicksorted wherever possible). For now, what I came up with seems like a relatively simple approach that offers much of the benefit of that more comprehensive project, since the need to do a "merge sort" is increasingly rare due to the enormous main memory sizes that are affordable these days.
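On the heapification cost mentioned above (paid as soon as even one tuple spills): building a heap bottom-up is O(n), whereas inserting elements one at a time is O(n log n). A small Python sketch of the contrast (function names are mine, not tuplesort's):

```python
import heapq
import random

def sift_up_build(items):
    """O(n log n): build the heap by inserting one element at a time,
    the way tuples trickle into tuplesort incrementally."""
    heap = []
    for x in items:
        heapq.heappush(heap, x)
    return heap

def bottom_up_build(items):
    """O(n): Floyd's bottom-up heap construction, which is what
    heapq.heapify does -- the cheaper heapification discussed in [3]."""
    heap = list(items)
    heapq.heapify(heap)
    return heap

data = random.sample(range(10_000), 1_000)
for heap in (sift_up_build(data), bottom_up_build(data)):
    # both satisfy the binary-heap invariant: parent <= each child
    assert all(heap[i] <= heap[2 * i + j]
               for i in range(len(heap))
               for j in (1, 2) if 2 * i + j < len(heap))
```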
Of course, Noah's excellent work on huge repalloc() also contributes to our being able to put large amounts of memory to good use when sorting.

Since heapification is now sometimes a big fraction of the total cost of a sort, even where the heap invariant need not be maintained for any length of time afterwards, it might be worth revisiting the patch to make heap building an O(n) rather than an O(n log n) operation [3]. Not sure about that, but someone should probably look into it. Jeremy Harris is CC'd here; perhaps he will consider picking that work up once more in light of this development. It would be nice to get a sort that quicksorts 99% of all tuples even closer to a conventional internal sort.

[1] It seems bogus to me that EXPLAIN ANALYZE shows state TSS_SORTEDONTAPE as an "external sort" rather than a "merge sort", regardless of whether or not a merge step was actually required (that is, regardless of whether state ended up TSS_SORTEDONTAPE due to using the existing "single run" optimization, or because the caller required randomAccess).

[2] https://commitfest.postgresql.org/6/294/

[3] http://www.postgresql.org/message-id/52F16843.8080001@wizmail.org

--
Peter Geoghegan