Re: Using quicksort for every external sort run - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Using quicksort for every external sort run
Date:
Msg-id: CAM3SWZRiHaF7jdf923ZZ2qhDJiErqP5uU_+JPuMvUmeD0z9fFA@mail.gmail.com
In response to: Using quicksort for every external sort run (Peter Geoghegan <pg@heroku.com>)
Responses: Re: Using quicksort for every external sort run
List: pgsql-hackers
On Wed, Aug 19, 2015 at 7:24 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Let's start by comparing an external sort that uses 1/3 the memory of an internal sort against the master branch. That's completely unfair on the patch, of course, but it is a useful indicator of how well external sorts do overall. Although an external sort surely cannot be as fast as an internal sort, it might be able to approach an internal sort's speed when there is plenty of I/O bandwidth. That's a good thing to aim for, I think.
>
> The patch only takes ~10% more time to execute this query, which seems very good considering that ~1/3 the work_mem has been put to use. Note that the on-tape runs are small relative to CPU costs, so this query is a bit sympathetic (consider the time spent writing batches that trace_sort indicates here). CREATE INDEX would not compare so well with an internal sort, for example, especially if it was a composite index or something.

This is something that I've made great progress on (see "concrete example" below for numbers). The differences in the amount of I/O required between these two cases (due to per-case variability in the width of tuples written to tape for datum sorts and index sorts) did not significantly factor into the differences in performance, it turns out. The big issue was that while a pass-by-value datum sort accidentally has good cache characteristics during the merge step, that is not generally true. I figured out a way of making it generally true, though.

I attach a revised patch series with a new commit that adds an optimization to the merge step, relieving what was a big remaining bottleneck in the CREATE INDEX case (and *every* external sort case that isn't a pass-by-value datum sort, which is most things). There are also a few tweaks to earlier commits, but nothing very interesting.

All of my benchmarks suggest that this most recent revision puts external sorting within a fairly small margin of a fully internal sort on the master branch in many common cases. This holds even when the implementation only makes use of a fraction of the memory required for an internal sort, provided the system is reasonably well balanced. For a single backend, there is an overhead of about 5% - 20% against master's internal sort performance. This margin appears to be fairly robust across a variety of different cases. I particularly care about CREATE INDEX, since that is where most pain is felt in the real world, and I'm happy that I found a way to make CREATE INDEX's external sorts reasonably comparable in run time to internal sorts that consume much more memory. I think it's time to stop talking about this as performance work, and start talking about it as scalability work. With that in mind, I'm mostly going to compare the performance of the new, optimized external sort implementation with the existing internal sort implementation from now on.

New patch -- Sequential memory access
===============================

The trick I hit upon for relieving the merge bottleneck was fairly simple. Prefetching works for internal sorts, but isn't practical for external sorts while merging. OTOH, I can arrange to have runs allocate their "tuple proper" contents into a memory pool, partitioned by final on-the-fly tape number. Today, runs/tapes are slurped from disk sequentially in a staggered fashion, based on the availability of in-memory tuples from each tape while merging.
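To make the idea concrete, here is a rough sketch, and only a sketch -- the names (TapePartition, pool_init(), pool_alloc()) are invented for this email and are not what the attached patches actually use. One big allocation is carved into per-tape partitions, and each partition is consumed circularly, since the merge consumes a tape's tuples in exactly the order in which they were preread:

#include "postgres.h"

/*
 * Illustrative sketch only -- these names do not match the attached
 * patches.  One big allocation is divided into per-tape partitions, and
 * each READTUP()-style routine takes space for its "tuple proper" from
 * its tape's partition instead of doing a retail palloc() per tuple.
 */
typedef struct TapePartition
{
    char       *base;           /* start of this tape's slice of the pool */
    Size        size;           /* slice size, in bytes */
    Size        next;           /* offset of the next free byte */
} TapePartition;

static char *pool_memory;           /* the single big allocation */
static TapePartition *partitions;   /* one entry per active tape */

/* Carve one big allocation into equal slices, one per active tape. */
static void
pool_init(int ntapes, Size total_bytes)
{
    Size        per_tape = total_bytes / ntapes;
    int         i;

    pool_memory = palloc(total_bytes);
    partitions = palloc(ntapes * sizeof(TapePartition));

    for (i = 0; i < ntapes; i++)
    {
        partitions[i].base = pool_memory + i * per_tape;
        partitions[i].size = per_tape;
        partitions[i].next = 0;
    }
}

/*
 * Hand out space for one tuple read from the given tape.  Because the
 * merge consumes a tape's tuples in exactly the order in which they were
 * preread, the slice can simply be reused in a circular fashion once it
 * wraps (the real code must of course verify that the earlier tuples
 * really have been consumed before their space is reused -- omitted here).
 */
static void *
pool_alloc(int tapenum, Size len)
{
    TapePartition *part = &partitions[tapenum];
    void       *ret;

    len = MAXALIGN(len);
    if (part->next + len > part->size)
        part->next = 0;         /* wrap to the start of this tape's slice */

    ret = part->base + part->next;
    part->next += len;
    return ret;
}

The point is just that consecutive tuples from the same tape end up physically adjacent, so each tape's memory is walked sequentially during the merge -- an access pattern that the hardware handles well without any explicit prefetch instructions.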
The new patch is very effective in reducing cache misses by simply making sure that each tape's "tuple proper" (e.g. each IndexTuple) is accessed in memory in the natural, predictable order (the sorted order that runs on tape always have). Unlike with internal sorts (where explicit memory prefetching of each "tuple proper" may be advisable), the final order in which the caller must consume each tape's "tuple proper" is predictable well in advance.

A little rearrangement is required to make what were previously retail palloc() calls during prereading (a palloc() for each "tuple proper", within each READTUP() routine) consume space from the memory pool instead. The pool (a big, once-off memory allocation) is reused in a circular fashion per tape partition. This saves a lot of palloc() overhead. Under this scheme, each tape's next few IndexTuples are all in one cacheline.

This patch has the merge step make better use of available memory bandwidth, rather than attempting to conceal memory latency. Explicit prefetch instructions (which we may independently end up using to do something similar for internal sorts, when fetching tuples following the sort proper) are all about hiding latency.

Concrete example -- performance
---------------------------------------------

I attach a text file describing a practical, reproducible CREATE INDEX example. It shows how CREATE INDEX now compares fairly well with an equivalent operation that has enough maintenance_work_mem to complete its sort internally. I'll just summarize it here: a CREATE INDEX on a single int4 attribute of an unlogged table takes only ~18% longer. This is a 100 million row table that is 4977 MB on disk. On master, CREATE INDEX takes 66.6 seconds in total with an *internal* sort. With the patch series applied, an *external* sort involving a final on-the-fly merge of 6 runs takes 78.5 seconds. Obviously, since there are 6 runs to merge, maintenance_work_mem is only approximately 1/6 of what is required for a fully internal sort.

High watermark memory usage
------------------------------------------

One concern about the patch may be that it increases the high watermark of memory usage during any final on-the-fly merge step. The merge takes full advantage of the availMem allowance at a point where every "tuple proper" has been freed, and availMem has only had SortTuple/memtuples array "slot" memory subtracted (plus overhead). Memory is allocated in bulk once, and partitioned among active tapes, with no particular effort towards limiting memory usage beyond enforcing that we always !LACKMEM(). A lot of the overhead of many retail palloc() calls is removed by simply using one big memory allocation.

In practice, LACKMEM() will rarely become true, because the availability of slots now tends to be the limiting factor. This is partially explained by the number of slots being established while palloc() overhead was still in play, prior to the final merge step.

However, I have concerns about the memory usage of this new approach. With the int4 CREATE INDEX case above, which has a uniform distribution, I noticed that about 40% of each tape's memory space remains unused when slots are exhausted. Ideally, we'd only have allocated enough memory to run out at about the same time that slots are exhausted, since the two would then be balanced. This might be possible for fixed-size tuples.
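FWIW, for fixed-size tuples the balanced sizing really does look like simple arithmetic -- something along these lines (a hypothetical helper, not anything in the attached patches), where each tape only gets enough pool space to back the memtuples slots it could ever have in use at once:

#include "postgres.h"

/*
 * Hypothetical sizing rule for fixed-size tuples only: give each tape
 * just enough pool space to cover its share of the memtuples "slots",
 * so that pool space and slots run out at about the same time.
 */
static Size
balanced_pool_size_per_tape(int nslots, int ntapes, Size tuple_width)
{
    int         slots_per_tape = (nslots + ntapes - 1) / ntapes;   /* ceiling */

    return (Size) slots_per_tape * MAXALIGN(tuple_width);
}

Variable-width tuples would presumably have to be sized against some estimated average width instead, which is much less reliable.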
I have not given each of the final on-the-fly merge's active tapes its own individually managed pool, because while this waste of memory is large enough to be annoying, it's not large enough to be significantly helped by managing a bunch of per-tape buffers and enlarging them as needed geometrically (e.g. starting small, and doubling each time the current buffer size is exhausted, until the per-tape limit is finally reached). The main reason that the high watermark is increased is not this, though. It's mostly just that "tuple proper" memory is not freed until the sort is done, whereas before there were many small pfree() calls to match the many palloc() calls -- calls that occurred early and often.

Note that the availability of "slots" (i.e. the size of the memtuples array, minus one element for each tape's heap item) is currently determined by whatever size the array happened to be at when memtuples stopped growing, which isn't particularly well principled (hopefully this is no worse now).

Optimal memory usage
-------------------------------

In the absence of any clear thing to care about most beyond making sorting faster while still enforcing !LACKMEM(), for now I've kept it simple. I am saving a lot of memory by clawing back palloc() overhead, but may be wasting more than that in another way now, to say nothing of the new high watermark itself. If we're entirely I/O bound, maybe we should avoid wasting memory by simply not allocating as much in the first place (i.e. the extra memory may only help in theory, even when it is written to). But what does it really mean to be I/O bound? The OS cache probably consumes plenty of memory, too.

Finally, let us not forget that even following this work, run size still needs to be optimized using a cost model, rather than simply being determined by how much memory can be made available (work_mem). If we can get a faster sort using far less work_mem, then failing to do that leaves the DBA accidentally wasting huge amounts of memory. As an implementor, it's really hard to balance all of these concerns, or to say that one in particular is most urgent.

Parallel sorting
===========

Simon rightly emphasized the need for joined-up thinking in relation to applying important tuplesort optimizations. We must at least consider parallelism as part of this work. I'm glad that the first consumer of parallel infrastructure is set to be parallel sequential scans, not internal parallel sorts. That's because it seems that overall, a significant cost is actually reading tuples into memtuples to sort -- heap scanning and related costs in the buffer manager (even assuming everything is in shared_buffers), COPYTUP() palloc() calls, and so on. Taken together, they can be a bigger overall cost than the sort proper, even when abbreviated keys are not used. The third bucket that I tend to categorize costs into, "time spent actually writing out finished runs", is small on a well balanced system. Surprisingly small, I would say.

I will sketch a simple implementation of parallel sorting based on the patch series that may be workable, and that requires relatively little implementation effort compared to other ideas that were raised at various times:

* Establish an optimal run size ahead of time using a cost model. We need this for serial external sorts anyway, to relieve the DBA of having to worry about sizing maintenance_work_mem according to obscure considerations around cache efficiency within tuplesort.
Parallelism probably doesn't add much complexity to the cost model, which is not especially complicated to begin with. Note that I have not added this cost model yet (just the ad-hoc, tuplesort-private cost model for using replacement selection to get a "quicksort with spillover"). It may be best if this cost model lives in the optimizer.

* Have parallel workers do a parallel heap scan of the relation until they fill this optimal run size. Use local memory to sort within workers. Write runs out in the usual way. Then, each worker picks up the next run scheduled. If there are no more runs to build, there is no more work for the parallel workers.

* Shut down workers. Do an on-the-fly merge in the parent process. This is the same as with a serial merge, but with a little coordination with worker processes to make sure every run is available, etc. In general, coordination is kept to an absolute minimum.

I tend to think that this really simple approach would get much of the gain of something more complicated -- no need to write shared memory management code, minimal need to handle coordination between workers, and no real changes to the algorithms used for each sub-problem. This makes merging more of a bottleneck again, but that is a bottleneck on I/O and especially on memory bandwidth. Parallelism cannot help much with that anyway (except by compressing runs with offset coding, perhaps, but that isn't specific to parallelism and won't always help). Writing out runs in bulk is very fast here -- certainly much faster than I thought it would be when I started thinking about external sorting. And if that turns out to be a problem for cases that have sufficient memory to do everything internally, it can later be worked on non-invasively.

As I've said in the past, I think parallel sorting only makes sense when memory latency and bandwidth are not huge bottlenecks, which we should bend over backwards to avoid. In a sense, you can't really make use of parallel workers for sorting until you fix that problem first. I am not suggesting that we do this because it's easier than other approaches. I think it's actually most effective not to make parallel sorting too divergent from serial sorting, because keeping the two close together makes speed-ups from localized optimizations cumulative, while at the same time, AFAICT, there isn't anything to recommend extensive specialization for parallel sort. If what I've sketched is also a significantly easier approach, then that's a bonus.

--
Peter Geoghegan
Attachment
- quicksort_external_test.txt
- 0005-Use-tuple-proper-memory-pool-in-tuplesort.patch
- 0004-Prefetch-from-memtuples-array-in-tuplesort.patch
- 0003-Log-requirement-for-multiple-external-sort-passes.patch
- 0002-Further-diminish-role-of-replacement-selection.patch
- 0001-Quicksort-when-performing-external-sorts.patch