Re: Using quicksort for every external sort run - Mailing list pgsql-hackers
From: Jeff Janes
Subject: Re: Using quicksort for every external sort run
Date:
Msg-id: CAMkU=1ym2N-v+3=aMm9m-h1A941gQV7CgM9HJuwhRFu-8tHFaw@mail.gmail.com
In response to: Using quicksort for every external sort run (Peter Geoghegan <pg@heroku.com>)
Responses: Re: Using quicksort for every external sort run
List: pgsql-hackers
On Wed, Aug 19, 2015 at 7:24 PM, Peter Geoghegan <pg@heroku.com> wrote:

Hi Peter,

Your most recent versions of this patch series (not the ones on the email I am replying to) give a compiler warning:

tuplesort.c: In function 'mergeruns':
tuplesort.c:2741: warning: unused variable 'memNowUsed'

> Multi-pass sorts
> ---------------------
>
> I believe, in general, that we should consider a multi-pass sort to be
> a kind of inherently suspect thing these days, in the same way that
> checkpoints occurring 5 seconds apart are: not actually abnormal, but
> something that we should regard suspiciously. Can you really not
> afford enough work_mem to only do one pass?

I don't think it is really about the cost of RAM. What people can't afford is spending all of their time personally supervising all the sorts on the system. It is pretty easy for a transient excursion in workload to make a server swap itself to death and fall over. Not just the PostgreSQL server, but the entire OS. Since we can't let that happen, we have to be defensive about work_mem. Yes, we have far more RAM than we used to. We also have far more things demanding access to it at the same time. I agree we don't want to optimize for low memory, but I don't think we should throw it under the bus, either.

Right now we are effectively saying that the CPU-cache problems with the heap start exceeding the larger run size benefits at 64kB (the smallest allowed setting for work_mem). While any number we pick is going to be a guess that won't apply to all hardware, surely we can come up with a guess better than 64kB. Like, 8 MB, say. If available memory for the sort is 8MB or smaller and the predicted size anticipates a multipass merge, then we can use the heap method rather than the quicksort method. Would a rule like that complicate things much?
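Concretely, something along the lines of the sketch below is what I have in mind. This is illustration only, not actual tuplesort.c code: the function name, the 8MB constant, and the run-count estimate are all placeholders.

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Sketch: fall back to the heap (replacement selection) method for
     * building runs when memory is very small and quicksorted runs would
     * be predicted to force a multi-pass merge.  All names are made up.
     */
    #define HEAP_FALLBACK_THRESHOLD ((int64_t) 8 * 1024 * 1024)    /* the 8MB guess */

    static bool
    prefer_heap_for_runs(int64_t availMem, int64_t estimatedInputBytes, int mergeOrder)
    {
        int64_t     predictedRuns;

        if (availMem > HEAP_FALLBACK_THRESHOLD)
            return false;           /* enough memory: quicksort the runs */

        /* quicksorted runs come out at roughly availMem in size */
        predictedRuns = estimatedInputBytes / availMem + 1;

        /* more runs than one merge pass can absorb implies a multi-pass merge */
        return predictedRuns > (int64_t) mergeOrder;
    }

The point is just that the fallback only kicks in when both conditions hold, so any normal-sized work_mem setting still gets the quicksort path.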
It doesn't matter to me personally at the moment, because the smallest work_mem I run on a production system is 24MB. But if for some reason I had to increase max_connections, or had to worry about plans with many more possible concurrent work_mem allocations (like some partitioning), then I might need to rethink that setting downward.

>
> In theory, the answer could be "yes", but it seems highly unlikely.
> Not only is very little memory required to avoid a multi-pass merge
> step, but as described above the amount required grows very slowly
> relative to linear growth in input. I propose to add a
> checkpoint_warning style warning (with a checkpoint_warning style GUC
> to control it).

I'm skeptical about a warning for this. I think it is rather unlike checkpointing, because checkpointing is done in a background process, which greatly limits its visibility, while sorting is a foreground thing. I know if my sorts are slow, without having to go look in the log file. If we do have the warning, shouldn't it use a log-level that gets sent to the front end, where the person running the sort can see it and locally change work_mem?

And if we have a GUC, I think it should be a dial, not a binary. If I have a sort that takes a 2-way merge and then a final 29-way merge, I don't think that is worth reporting. So maybe warning only when the maximum number of runs on a tape exceeds 2 (rather than when it exceeds 1, which is the current behavior with the patch) would be the setting I would want to use, if I were to use it at all.

...

> This patch continues to have tuplesort determine run size based on the
> availability of work_mem only. It does not entirely fix the problem of
> having work_mem sizing impact performance in counter-intuitive ways.
> In other words, smaller work_mem sizes can still be faster. It does
> make that general situation much better, though, because quicksort is
> a cache oblivious algorithm. Smaller work_mem sizes are sometimes a
> bit faster, but never dramatically faster.

Yes, that is what I found as well. I think the main reason it is even that small bit slower at large memory is that writing and sorting are not finely interleaved, like they are with heap selection. Once you sit down to qsort 3GB of data, you are not going to write any more tuples until that qsort is entirely done. I didn't do any testing beyond 3GB of maintenance_work_mem, but I imagine this could get more important if people used dozens or hundreds of GB.

One idea would be to stop and write out a just-sorted partition whenever that partition is contiguous to the already-written portion. If the qsort is tweaked to recurse preferentially into the left partition first, this would result in tuples being written out at a pretty steady pace. If the qsort was unbalanced and the left partition was always the larger of the two, then that approach would have to be abandoned at some point. But I think there are already defenses against that, and at worst you would give up and revert to the sort-them-all then write-them-all behavior. (There is a rough sketch of this idea below my sign-off.)

Overall this is very nice. Doing some real world index builds of short text (~20 bytes ascii) identifiers, I could easily get speed ups of 40% with your patch if I followed the philosophy of "give it as much maintenance_work_mem as I can afford". If I fine-tuned the maintenance_work_mem so that it was optimal for each sort method, then the speed up was quite a bit less, only 22%. But 22% is still very worthwhile, and who wants to spend their time fine-tuning the memory use for every index build?

Cheers,

Jeff
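Below is a rough sketch of the "recurse left first, flush the finished prefix" idea mentioned above. It is illustration only, not code from the patch or from tuplesort.c: SortTuple is a stand-in struct, writetup_prefix stands in for writing tuples to the current run, and there is no defense here against deep recursion on badly unbalanced input.

    #include <stddef.h>

    typedef struct SortTuple
    {
        int     key;                /* stand-in for the real sort key */
    } SortTuple;

    /* number of tuples already written out to the current run */
    static size_t frontier = 0;

    /* stand-in for writing tuples[frontier .. upto] to the run, then advancing */
    static void
    writetup_prefix(SortTuple *tuples, size_t upto)
    {
        /* ... emit tuples[frontier] through tuples[upto] here ... */
        frontier = upto + 1;
    }

    /* Lomuto partition around the last element; returns the pivot's final index */
    static size_t
    partition_lomuto(SortTuple *t, size_t lo, size_t hi)
    {
        SortTuple   pivot = t[hi];
        size_t      i = lo;

        for (size_t j = lo; j < hi; j++)
        {
            if (t[j].key <= pivot.key)
            {
                SortTuple   tmp = t[i];

                t[i] = t[j];
                t[j] = tmp;
                i++;
            }
        }
        t[hi] = t[i];
        t[i] = pivot;
        return i;
    }

    /*
     * Quicksort that always finishes the left partition before the right one,
     * and writes out any prefix that has become contiguous with the portion
     * already written, instead of waiting for the whole array to be sorted.
     */
    static void
    qsort_flush_left(SortTuple *t, size_t lo, size_t hi)
    {
        size_t      p;

        if (lo >= hi)
        {
            /* zero- or one-element range; a single element is already final */
            if (lo == hi && frontier == lo)
                writetup_prefix(t, lo);
            return;
        }

        p = partition_lomuto(t, lo, hi);

        /* sort the left side completely first */
        if (p > lo)
            qsort_flush_left(t, lo, p - 1);

        /*
         * Everything in [lo, p] is now in its final position.  If the write
         * frontier has reached this range, flush up through the pivot.
         */
        if (frontier >= lo && frontier <= p)
            writetup_prefix(t, p);

        /* then sort (and eventually flush) the right side */
        qsort_flush_left(t, p + 1, hi);
    }

For a non-empty array this would be called as qsort_flush_left(tuples, 0, ntuples - 1), and by the time it returns every tuple has been handed to writetup_prefix in sorted order, in a series of prefix flushes rather than one big write at the end.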