Re: Using quicksort for every external sort run - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Using quicksort for every external sort run
Date:
Msg-id: CAM3SWZS4Mr-k6nUFBPcYkCoSQsjHXVJDqDa=PN_56Fiy1dsFgA@mail.gmail.com
In response to: Re: Using quicksort for every external sort run (Peter Geoghegan <pg@heroku.com>)
Responses: Re: Using quicksort for every external sort run
           Re: Using quicksort for every external sort run
List: pgsql-hackers
On Fri, Dec 18, 2015 at 11:57 AM, Peter Geoghegan <pg@heroku.com> wrote:
> BTW, I'm not necessarily determined to make the new special-purpose
> allocator work exactly as proposed. It seemed useful to prioritize
> simplicity, and so currently there is one big "huge palloc()" with
> which we blow our memory budget, and that's it. However, I could
> probably be more clever about "freeing ranges" initially reserved for
> a now-exhausted tape. That kind of thing.

Attached is a revision that significantly overhauls the memory patch, with several smaller changes. We can now grow memtuples to rebalance the size of the array (memtupsize) against the need for memory for tuples. Doing this makes a big difference with a 500MB work_mem setting in this datum sort case, as my newly expanded trace_sort instrumentation shows:

LOG: grew memtuples 1.40x from 9362286 (219429 KB) to 13107200 (307200 KB) for final merge
LOG: tape 0 initially used 34110 KB of 34110 KB batch (1.000) and has 1534 slots remaining
LOG: tape 1 initially used 34110 KB of 34110 KB batch (1.000) and has 1534 slots remaining
LOG: tape 2 initially used 34110 KB of 34110 KB batch (1.000) and has 1535 slots remaining
LOG: tape 3 initially used 34110 KB of 34110 KB batch (1.000) and has 1533 slots remaining
LOG: tape 4 initially used 34110 KB of 34110 KB batch (1.000) and has 1534 slots remaining
LOG: tape 5 initially used 34110 KB of 34110 KB batch (1.000) and has 1535 slots remaining

This is a big improvement. With the new batchmemtuples() call commented out (i.e. no new grow_memtuples() call), the LOG output around the same point is:

LOG: tape 0 initially used 24381 KB of 48738 KB batch (0.500) and has 1 slots remaining
LOG: tape 1 initially used 24381 KB of 48738 KB batch (0.500) and has 1 slots remaining
LOG: tape 2 initially used 24381 KB of 48738 KB batch (0.500) and has 1 slots remaining
LOG: tape 3 initially used 24381 KB of 48738 KB batch (0.500) and has 1 slots remaining
LOG: tape 4 initially used 24381 KB of 48738 KB batch (0.500) and has 1 slots remaining
LOG: tape 5 initially used 24381 KB of 48738 KB batch (0.500) and has 1 slots remaining

(I actually added a bit more detail to what you see here during final clean-up.)

Obviously we're using memory a lot more efficiently here as compared to my last revision (or the master branch -- it always has palloc() overhead, of course). With no grow_memtuples call, we're not wasting ~1530 slots per tape anymore (a tiny fraction of 1% of the total), but we are wasting 50% of all batch memory, or almost 30% of all work_mem.

Note that this improvement is possible despite the fact that memory is still MAXALIGN()'d -- I'm mostly just clawing back what I can, having avoided much STANDARDCHUNKHEADERSIZE overhead for the final on-the-fly merge. I tend to think that the bigger problem here is that we use so many memtuples when merging in the first place (e.g. 60% of work_mem in the above case), because memtuples are much less useful than something like a simple array of pointers when merging; I can certainly see why you'd need 6 memtuples here, one per tape for the merge heap, but the other ~13 million seem mostly unnecessary. Anyway, what I have now is as far as I want to go to accelerate merging for 9.6, since parallel CREATE INDEX is where the next big win will come from. As wasteful as this can be, I think it's of secondary importance.
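To make the rebalancing arithmetic concrete, here is a minimal standalone sketch. To be clear, this is not code from the patch: the function name, its signature, and the assumed 16 bytes of batch memory per tuple are all invented for illustration; only the 24-byte SortTuple size and the figures in the log output come from the case above.

/*
 * Standalone sketch of the slot/memory rebalancing idea behind the new
 * batchmemtuples()/grow_memtuples() call.  Illustrative only: the
 * function name and signature are invented here, not taken from the
 * patch.
 */
#include <stdio.h>

#define SLOT_SIZE 24    /* sizeof(SortTuple) on typical 64-bit builds */

/*
 * Given the memory available for the final merge and the average batch
 * memory needed per tuple read back from tape, pick the memtuples array
 * size that spends the whole budget at a fixed slots-to-tuple-memory
 * ratio.
 */
static long
balanced_slot_count(long avail_mem, long avg_tuple_size, long cur_slots)
{
    long        per_slot = SLOT_SIZE + avg_tuple_size;
    long        balanced = avail_mem / per_slot;

    /* Only ever grow the array; shrinking isn't worth the trouble. */
    return (balanced > cur_slots) ? balanced : cur_slots;
}

int
main(void)
{
    /*
     * The 500MB work_mem datum sort case from the trace_sort output
     * above, assuming (purely for the example) an average of 16 bytes
     * of batch memory per tuple.
     */
    long        slots = balanced_slot_count(500L * 1024 * 1024, 16, 9362286);

    printf("grew memtuples to %ld slots (%ld KB)\n",
           slots, slots * SLOT_SIZE / 1024);
    return 0;
}

Under those assumptions this prints the same 13107200 slot / 307200 KB figures as the trace_sort output; the patch itself of course derives its inputs from tuplesort's existing memory accounting rather than taking them as parameters.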
With this revision, I've given up on the idea of trying to map USEMEM()/FREEMEM() to "logical" allocations and deallocations that consume from each tape's batch. The existing merge code in the master branch is concerned exclusively with making each tape's use of memory fair; each tape only gets so many "slots" (memtuples), and so much memory, and that's it (there is never any shuffling of those resource budgets between tapes). I get the same outcome from simply only allowing tapes to get memory from their own batch allocation, which isn't much complexity, because only READTUP() routines regularly need memory. We detect when memory has been exhausted within mergeprereadone() in a special way, not using LACKMEM() at all -- this seems simpler. (Specifically, we use something called overflow allocations for this purpose. This means that there are still a very limited number of retail palloc() calls. There's a rough sketch of the scheme at the end of this mail.)

This new version somewhat formalizes the idea that batch allocation may one day have uses beyond the final on-the-fly merge phase, which makes a lot of sense; we should really be saving a significant amount of memory when initially sorting runs, too. This revision also pfree()s tape memory early if the tape is exhausted early, which will help a lot when there is a logical/physical correlation.

Overall, I'm far happier with how memory is managed in this revision, mostly because it's easier to reason about. trace_sort now closely monitors where memory goes, and I think that's a good idea in general. That makes production performance problems a lot easier to reason about -- the accounting should be available to expert users (those that enable trace_sort). I'll have little sympathy for the suggestion that this will overwhelm users, because trace_sort is already only suitable for experts. Besides, it isn't that complicated to figure this stuff out, or at least to gain an intuition for what might be going on based on differences seen in a problematic case. Getting a better picture of what "bad" looks like can guide an investigation without the DBA necessarily understanding the underlying algorithms. At worst, it gives them something specific to complain about here.

Other changes:

* No longer use "tuple proper" terminology. Also, memory pools are now referred to as batch memory allocations. This is at the request of Jeff and Robert.

* Fixed a silly bug in the useselection() cost model that caused "quicksort with spillover" to never be used. The cost model is otherwise unchanged, because I didn't come up with any bright ideas about how to do better there. Ideas from other people are very much welcome.

* Cap the maximum number of tapes to 500. I think it's silly that the number of tapes is currently a function of work_mem, without any further consideration of the details of the sort, but capping is a simpler solution than making tuplesort_merge_order() smarter. I previously saw quite a lot of waste with high work_mem settings, with tens of thousands of tapes that will never be used, precisely because we have lots of memory (the justification for having, say, 40k tapes seems to be almost an oxymoron). Tapes (or the accounting for never-allocated tapes) could take almost 10% of all memory. Also, less importantly, we now refund/FREEMEM() unallocated tape memory ahead of final on-the-fly merge preallocation of batch memory.

Note that bounding the number of tapes has been contemplated several times in the past. See the commit message of c65ab0bfa9, a commit from almost a decade ago, for an example of this. That message also describes how "slots" (memtuples) and memory for tuples must be kept in balance while merging, which is very much relevant to my new grow_memtuples() call.
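As promised, here is a rough standalone sketch of the per-tape batch range scheme with overflow allocations. Again, none of these identifiers (TapeBatch, batch_alloc(), and so on) come from the patch, and malloc() stands in for palloc() to keep the sketch self-contained:

#include <stdlib.h>

/*
 * Each tape gets a fixed slice of one big "huge palloc()" block.
 * READTUP()-style callers bump a pointer within their tape's slice.
 */
typedef struct TapeBatch
{
    char       *start;          /* this tape's slice of the big block */
    size_t      used;           /* bytes handed out so far */
    size_t      size;           /* slice size: batch memory / ntapes */
    int         overflowed;     /* last request spilled to a retail alloc */
} TapeBatch;

/* Allocations are still MAXALIGN()'d; round requests up to 8 bytes. */
#define ALIGN8(len) (((len) + 7) & ~(size_t) 7)

static void *
batch_alloc(TapeBatch *tb, size_t len)
{
    len = ALIGN8(len);
    if (tb->used + len <= tb->size)
    {
        void       *chunk = tb->start + tb->used;

        tb->used += len;
        tb->overflowed = 0;
        return chunk;
    }

    /*
     * Overflow allocation: an ordinary retail allocation.  The caller
     * (think mergeprereadone()) checks the flag to learn that this
     * tape's batch is exhausted, instead of consulting LACKMEM().
     */
    tb->overflowed = 1;
    return malloc(len);
}

int
main(void)
{
    TapeBatch   tb = {0};
    void       *spill;

    tb.size = 1024;
    tb.start = malloc(tb.size);

    (void) batch_alloc(&tb, 100);   /* served from the batch slice */
    spill = batch_alloc(&tb, 2000); /* doesn't fit: overflows */

    free(spill);                    /* overflow chunks are freed retail */
    free(tb.start);                 /* batch chunks die with the block */
    return 0;
}

The point of the design is that exhaustion is a local, per-tape condition: a caller just checks the overflow flag after an allocation, with no global LACKMEM() bookkeeping involved.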
-- 
Peter Geoghegan