Re: Using quicksort for every external sort run - Mailing list pgsql-hackers
From | Peter Geoghegan
---|---
Subject | Re: Using quicksort for every external sort run
Date |
Msg-id | CAM3SWZTYneCG1oZiPwRU=J6ks+VpRxt2Da1ZMmqFBrd5jaSJSA@mail.gmail.com
In response to | Re: Using quicksort for every external sort run (Jeff Janes <jeff.janes@gmail.com>)
Responses | Re: Using quicksort for every external sort run; Re: Using quicksort for every external sort run; Re: Using quicksort for every external sort run
List | pgsql-hackers
On Sat, Nov 28, 2015 at 2:04 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> For me very large sorts (100,000,000 ints) with work_mem below 4MB do
> better with unpatched than with your patch series, by about 5%. Not a
> big deal, but also if it is easy to keep the old behavior then I think
> we should. Yes, it is dumb to do large sorts with work_mem below 4MB,
> but if you have canned apps which do a mixture of workloads it is not
> so easy to micromanage their work_mem. Especially as there are no
> easy tools that let me as the DBA say "if you connect from this IP
> address, you get this work_mem".

I'm not very concerned about a regression that is only seen when
work_mem is set below the (very conservative) postgresql.conf default
value of 4MB when sorting 100 million integers. Thank you for
characterizing the regression, though -- it's good to have a better
idea of how much of a problem it is in practice.

I can still preserve the old behavior with a GUC, but it isn't
completely trivial, and I don't want to complicate things any further
without a real benefit, which I still don't see. I'm still using a
replacement selection style heap, though, and I think that there will
be future uses for the heap (e.g. dynamic duplicate removal within
tuplesort).

>> Other systems expose this explicitly, and, as I said, say in an
>> unqualified way that a multi-pass merge should be avoided. Maybe the
>> warning isn't the right way of communicating that message to the DBA
>> in detail, but I am confident that it ought to be communicated to the
>> DBA fairly clearly.
>
> I'm thinking about how many other places in the code could justify a
> similar type of warning -- "If you just gave me 15% more memory, this
> hash join would be much faster" -- and what that would make the logs
> look like if future work went along with this precedent. If there
> were some mechanism to put the warning in a system view counter
> instead of the log file, that would be much cleaner. Or a way to
> separate the server log file into streams. But since we don't have
> those, I guess I can't really object much to the proposed behavior.

I'm going to let this go, actually. Not because I don't think that
avoiding a multi-pass sort is a good goal for DBAs to have, but
because a multi-pass sort doesn't appear to be a point at which
performance tanks these days, with modern block devices. Also, I just
don't have time to push something non-essential that there is
resistance to.

>>> One idea would be to stop and write out a just-sorted partition
>>> whenever that partition is contiguous to the already-written portion.
>>> If the qsort is tweaked to recurse preferentially into the left
>>> partition first, this would result in tuples being written out at a
>>> pretty steady pace. If the qsort was unbalanced and the left partition
>>> was always the larger of the two, then that approach would have to be
>>> abandoned at some point. But I think there are already defenses
>>> against that, and at worst you would give up and revert to the
>>> sort-them-all then write-them-all behavior.
>>
>> Seems kind of invasive.
>
> I agree, but I wonder if it won't become much more important at 30GB
> of work_mem. Of course if there is no reason to ever set work_mem
> that high, then it wouldn't matter -- but there is always a reason to
> do so, if you have so much memory to spare. So better than that
> invasive work, I guess, would be to make sort use less than work_mem
> if it gets no benefit from using all of it. Anyway, ideas for future
> work, either way.
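For what it's worth, the shape of the idea is simple enough to sketch.
This is only a toy rendering of how I read it -- plain C, with ints
standing in for SortTuples and tmpfile() standing in for the tape;
nothing here is taken from the patch series -- but it shows how
recursing into the left partition first lets the already-final prefix
be streamed out while the rest of the run is still being sorted:

    /*
     * Toy model: quicksort one in-memory run, always recursing into the
     * left partition first, and stream out any prefix that is already in
     * its final order.
     */
    #include <stdio.h>
    #include <stdlib.h>

    static size_t n_flushed = 0;    /* a[0 .. n_flushed) already written */

    static void
    flush_up_to(const int *a, size_t upto, FILE *tape)
    {
        if (upto > n_flushed)
        {
            fwrite(a + n_flushed, sizeof(int), upto - n_flushed, tape);
            n_flushed = upto;
        }
    }

    /* Hoare partition around the middle element; returns p, lo <= p < hi */
    static long
    partition(int *a, long lo, long hi)
    {
        int     pivot = a[lo + (hi - lo) / 2];
        long    i = lo - 1;
        long    j = hi + 1;

        for (;;)
        {
            do { i++; } while (a[i] < pivot);
            do { j--; } while (a[j] > pivot);
            if (i >= j)
                return j;
            int     tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
        }
    }

    static void
    sort_and_spill(int *a, long lo, long hi, FILE *tape)
    {
        long    p;

        if (lo >= hi)
            return;

        p = partition(a, lo, hi);

        /* Left partition first, so the final prefix grows from index 0. */
        sort_and_spill(a, lo, p, tape);

        /*
         * a[lo .. p] is now sorted, and nothing to its right is smaller.
         * If it abuts what has already been written, it is in its final
         * position and can go out to tape before we touch the right side.
         */
        if (n_flushed >= (size_t) lo)
            flush_up_to(a, (size_t) p + 1, tape);

        sort_and_spill(a, p + 1, hi, tape);
    }

    int
    main(void)
    {
        long    n = 1000000;
        int    *a = malloc(n * sizeof(int));
        FILE   *tape = tmpfile();

        if (a == NULL || tape == NULL)
            return 1;
        for (long i = 0; i < n; i++)
            a[i] = rand();

        sort_and_spill(a, 0, n - 1, tape);
        flush_up_to(a, (size_t) n, tape);   /* whatever tail remains */
        printf("streamed out %zu values\n", n_flushed);

        free(a);
        fclose(tape);
        return 0;
    }

A real version would still need the fallback described above,
reverting to the sort-them-all then write-them-all behavior when the
partitioning turns out badly.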
I hope to come up with a fairly robust model for automatically sizing
an "effective work_mem" in the context of external sorts. There should
be a heuristic that balances fan-in against other considerations. I
think that doing this with the existing external sort code would be
completely hopeless. This is a problem that is well understood by the
research community, although balancing things well in the context of
PostgreSQL is a little trickier. I also think it's a little arbitrary
that the final on-the-fly merge step uses a work_mem-ish sized buffer,
much like the sorting of runs, as if there is a good reason to be
consistent. Maybe that's fine, though. There are advantages to
returning tuples earlier in the context of parallelism, which
recommends smaller effective work_mem sizes (provided they're above a
certain threshold). For this reason, having larger runs may not be a
useful goal in general, even without considering the cost in cache
misses paid in pursuit of that goal.

>> Thanks, but I expected better than that. Was it a collated text
>> column? The C collation will put the patch in a much better light
>> (more strcoll() calls are needed with this new approach -- it's still
>> well worth it, but it is a downside that makes collated text not
>> especially sympathetic). Just sorting on an integer attribute is also
>> a good sympathetic case, FWIW.
>
> It was UTF8 encoded (although all characters were actually ASCII), but
> C collated.

In reacting here, I should have considered that you'd hand-optimized
the work_mem setting for each case -- I was at a conference when I
responded. You can show the existing code in a better light by doing
that, as you have, but I think it's all but irrelevant. It isn't even
practical for experts to do that, so the fact that it is possible is
only really a footnote. My choice of work_mem for my tests tended to
be round numbers, like 1GB, because that was the first thing I thought
of.

> I've never seen improvements of 3 fold or more like you saw, under any
> conditions, so I wonder if your test machine doesn't have unusually
> slow main memory.

I think that there is a far simpler explanation. Any time I reported a
figure over ~2.5x, it was for "quicksort with spillover", and with a
temp tablespace on tmpfs to simulate lots of I/O bandwidth (but with
hardly any actual writing to tape -- that's the whole point of that
case). I also think that the heap structure does very badly with low
cardinality sets, which is where the 3.25X - 4X numbers came from. You
haven't tested "quicksort with spillover" here at all, which is fine,
since it is less important. Finally, as I said, I did not give the
master branch the benefit of fine-tuning work_mem (which I think is
fair and representative).

> My largest test, which took my true table and extrapolated it out for
> a few years growth, had about 500,000,000 rows.

Cool.

> At 3GB maintenance_work_mem, it took 13 runs patched and 7 runs
> unpatched to build the index, with timings of 3168.66 sec and 5713.07
> sec.
>
> The final merging is intermixed with whatever other work goes on to
> build the actual index files out of the sorted data, so I don't know
> exactly what the timing of just the merge part was. But it was
> certainly a minority of the time, even if you assume the actual index
> build were free. For the patched code, the majority of the time goes
> to the quick sorting stages.

I'm not sure what you mean here. I agree that the work of (say)
inserting leaf tuples as part of an index build is kind of the same
cost as the merge step itself, or at least doesn't vary markedly
between the CREATE INDEX case and other cases (where there is some
analogous processing of the final sorted output). I would generally
expect the merge phase to take significantly less time than sorting
runs, regardless of how we sort runs, unless parallelism is involved,
where merging could dominate. The master branch has a faster merge
step, at least proportionally, because it has larger runs.
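Incidentally, the run counts themselves are about what you'd expect:
quicksorting produces roughly one run per work_mem-sized batch of
input, while replacement selection classically produces runs averaging
about twice work_mem on random input, which is more or less the
13-versus-7 difference here -- and either way the run count is low
enough at 3GB that a single merge pass suffices. That trade-off
(fewer, longer runs versus cheaper run building) is what the
"effective work_mem" model I mentioned above has to weigh against
merge fan-in. As a toy illustration only -- the per-tape buffer
constant below is made up, and is not tuplesort.c's actual accounting:

    /*
     * Toy model of the memory / merge-pass trade-off.  The per-tape
     * buffer size is an assumption for illustration purposes.
     */
    #include <math.h>
    #include <stdio.h>

    #define TAPE_BUFFER_KB  256.0   /* assumed read buffer per input tape */

    static int
    merge_passes(double input_kb, double work_mem_kb)
    {
        /* quicksort-every-run: about one run per work_mem of input */
        double  nruns = ceil(input_kb / work_mem_kb);

        /* fan-in limited by how many tape buffers fit in work_mem */
        double  fan_in = floor(work_mem_kb / TAPE_BUFFER_KB);

        if (nruns <= 1.0)
            return 0;           /* fits in memory: no merge at all */
        if (fan_in < 2.0)
            fan_in = 2.0;

        /* each merge pass cuts the number of runs by a factor of fan_in */
        return (int) ceil(log(nruns) / log(fan_in));
    }

    int
    main(void)
    {
        double  input_kb = 100.0 * 1024.0 * 1024.0;     /* ~100GB of input */

        for (double wm_mb = 32.0; wm_mb <= 4096.0; wm_mb *= 2.0)
            printf("work_mem %5.0fMB -> %d merge pass(es)\n",
                   wm_mb, merge_passes(input_kb, wm_mb * 1024.0));
        return 0;
    }

The shape of the output is the interesting part: once there is enough
memory for the merge to collapse to a single pass, further increases
in effective work_mem do nothing for the number of passes, which
leaves cache efficiency while sorting runs and how early tuples can be
returned as the remaining considerations.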
> When I test each version of the code at its own most efficient
> maintenance_work_mem, I get 3007.2 seconds at 1GB for patched and
> 3836.46 seconds at 64MB for unpatched.

As I said, it seems a little bit unfair to hand-tune work_mem or
maintenance_work_mem like that. Who can afford to do that? I think you
agree that it's untenable to have DBAs allocate work_mem differently
for cases where an internal sort or external sort is expected;
workloads are just far too complicated and changeable.

> I'm attaching the trace_sort output from the client log for all 4 of
> those scenarios. "sort_0005" means all 5 of your patches were
> applied, "origin" means none of them were.

Thanks for looking at this. This is very helpful. It looks like the
server you used here had fairly decent disks, and that we tended to be
CPU bound more often than not. That's a useful testing ground.

Consider run #7 (of 13 total) with 3GB maintenance_work_mem, for
example (this run was picked at random):

...
LOG: finished writing run 6 to tape 5: CPU 35.13s/1028.44u sec elapsed 1080.43 sec
LOG: starting quicksort of run 7: CPU 38.15s/1051.68u sec elapsed 1108.19 sec
LOG: finished quicksorting run 7: CPU 38.16s/1228.09u sec elapsed 1284.87 sec
LOG: finished writing run 7 to tape 6: CPU 40.21s/1235.36u sec elapsed 1295.19 sec
LOG: starting quicksort of run 8: CPU 42.73s/1257.59u sec elapsed 1321.09 sec
...

So there were 27.76 seconds spent copying tuples into local memory
ahead of the quicksort, 2 minutes 56.68 seconds spent actually
quicksorting, and a trifling 10.32 seconds actually writing the run! I
bet that the quicksort really didn't use up too much memory bandwidth
on the system as a whole, since abbreviated keys are used with a cache
oblivious internal sorting algorithm. This suggests that this case
would benefit rather a lot from parallel workers doing this for each
run at the same time (once my code is adapted to do that, of course).
This is something I'm currently researching.

I think that (roughly speaking) each core on this system is likely
slower than the cores on a 4-core consumer desktop/laptop, which is
very normal, particularly with x86_64 systems. That also makes it more
representative than my previous tests.

--
Peter Geoghegan