Re: Using quicksort for every external sort run - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Using quicksort for every external sort run
Date:
Msg-id: 7142bf2a-9460-b549-61f3-8cb32a3209f9@2ndquadrant.com
In response to: Re: Using quicksort for every external sort run (Peter Geoghegan <pg@heroku.com>)
Responses: Re: Using quicksort for every external sort run
           Re: Using quicksort for every external sort run
List: pgsql-hackers
Hi,

On 03/30/2016 04:53 AM, Peter Geoghegan wrote:
> On Tue, Mar 29, 2016 at 6:02 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
...
>>> If there is ever a regression, it is only really sensible to talk
>>> about it while looking at trace_sort output (and, I guess, the query
>>> plan). I've asked Tomas for trace_sort output in all relevant cases.
>>> There is no point in "flying blind" and speculating what the problem
>>> was from a distance.
>>
>> The updated benchmarks are currently running. I'm out of office until
>> Friday, and I'd like to process the results over the weekend. FWIW I'll
>> have results for these cases:
>>
>> 1) unpatched (a414d96a)
>> 2) patched, default settings
>> 3) patched, replacement_sort_mem=64
>>
>> Also, I'll have trace_sort=on output for all the queries, so we can
>> investigate further.
>
> Thanks! That will tell us a lot more.

So, I do have the results from both machines - I've attached the basic comparison spreadsheets, and the complete summary is available here: https://github.com/tvondra/sort-benchmark

The database log also includes the trace_sort=on output for each query (use the timestamp logged for each query in the spreadsheet to locate the right section of the log).

The benchmark was slightly modified, based on the previous feedback:

* fix the maintenance_work_mem thinko (affects the CREATE INDEX cases)
* use the "SELECT * FROM (... OFFSET 1e10)" pattern instead of the original approach (COPY to /dev/null)
* change the data generation for the "low cardinality" data sets (by mistake they generated mostly the same data as the "high cardinality" ones)

I have not collected explain plans. I guess we'll need EXPLAIN ANALYZE in most cases anyway, and collecting those would increase the duration of the benchmark, so I plan to collect this info for the interesting cases on request.

While it might look like I'm somehow opposed to this patch series, that's mostly because we tend to look only at the few cases that behave poorly. So let me be clear: I do think the patch is a significant performance improvement for most of the queries, and I'm OK with accepting a few regressions (particularly if we agree those are pathological cases, unlikely to happen in real-world workloads). It's quite rare that a patch is a universal win without regressions, so it's important to consider how likely the regressions are and what the net effect of the patch is - and the patch is a significant improvement in most cases, with the regressions limited to pathological or rare corner cases. I don't think those are reasons not to push this into 9.6.

What follows is a rudimentary analysis of the results, and a bit about how the benchmark was constructed (and its representativeness).

rudimentary analysis
--------------------

I haven't done any thorough investigation of the results yet, but in general the results from both machines are quite similar - the numbers differ, but the speedup/slowdown patterns are mostly the same (with some exceptions that I'd guess are due to HW differences). The slowdown/speedup patterns (red/green cells in the spreadsheets) are also similar to those collected originally. Some timings are much lower, presumably thanks to using the "OFFSET 1e10" pattern, but the patterns are the same. CREATE INDEX statements are an obvious exception, of course, due to the maintenance_work_mem thinko in the previous benchmark.
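To make it easier to match the spreadsheet numbers to the trace_sort output in the log, this is roughly what a single run of one query looked like. It's a sketch only - the table and query are made up, not the exact ones from the benchmark - and replacement_sort_mem is a GUC added by the patched builds (value in kB):

    SET trace_sort = on;            -- per-run sort details go to the server log
    SET work_mem = '8MB';           -- 8MB / 32MB / 128MB / ... depending on the run
    SET replacement_sort_mem = 64;  -- only for the third configuration

    SELECT * FROM (
        SELECT * FROM sort_test     -- made-up table name
        ORDER BY a
        OFFSET 1e10                 -- the sort runs, but no rows reach the client
    ) s;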
The one thing that surprised me a bit is that replacement_sort_mem=64 actually made the results considerably worse in many cases. A common pattern is that the slowdown "spreads" to nearby cells - there are many queries where the 8MB case is 1:1 with master and 32MB is 1.5:1 (i.e. takes 1.5x longer), and setting replacement_sort_mem=64 then slows down the 8MB case as well. In general, replacement_sort_mem=64 seems to affect only the 8MB case, and in most cases it results in a ~100% slowdown (i.e. the queries take twice as long).

That being said, I do think the results are quite impressive - there are far more queries with significant speedups (usually ~2x or more) than with slowdowns, and the slowdowns are generally less significant than the speedups.

I mostly agree with Peter that we probably don't need to worry about the slowdown cases with low work_mem settings - if you do sorts with millions of rows, you really need to give the database enough RAM. But there are multiple slowdown cases with work_mem=128MB, and I'd dare to say 128MB is not exactly a low-end work_mem value. So perhaps we should look at least at those cases. It's also interesting that setting replacement_sort_mem=64 makes this much worse - i.e. the number of slowdowns with higher work_mem values increases, and the difference is often quite large. So I'm really not sure what to do with this GUC ...

L2/L3 cache
-----------

I think we're overly optimistic when it comes to the size of the CPU cache - while it's certainly true that modern CPUs have quite a bit of it (modern Xeon E5 CPUs have up to ~45MB per socket), there are two important factors here:

1) The cache is shared by all cores on the socket (on average there's ~2-3 MB per physical core), and thus by all processes running on the CPU. It's possible to run a single process on the CPU (thus getting all the cache), but that effectively turns it into a rather expensive single-core CPU.

2) The cache is shared by all nodes in the query plan, and the executor interleaves execution of the nodes - so while an implementation of a node may be very efficient when executed in isolation, that may not be true when it's executed as part of a larger plan. The sort may be immune to this to some degree, though.

I'm not sure how much this is considered in the 1994 VLDB paper, but I'd be very careful about making claims about how much CPU cache is available today (even on the best server CPUs).

benchmark discussion
--------------------

1) representativeness

Let me explain how I constructed the benchmark - I simply compiled a list of queries executing sorts, and ran them on synthetic data sets with different characteristics (cardinality and initial ordering). And I've done that with different work_mem values, to see how that affects the behavior.

I've done it this way for a few reasons - firstly, I'm extremely lazy and did not want to study the internals of the patch, as I'm not too much into sorting details. Secondly, I did not want to tailor the benchmark too tightly to the patch - it's quite possible some of the queries don't execute the modified code at all, in which case they should be unaffected (no slowdown, no speedup).

So while the benchmark could certainly include additional queries or data sets with different characteristics, I'd dare to claim it's not entirely misguided. Some of the tested combinations may certainly be seen as implausible or pathological, although none of them were constructed on purpose to be that way. I'm perfectly fine with identifying such cases and ignoring them.
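To illustrate the "different characteristics" part, the data sets essentially vary the number of distinct sort keys and the initial ordering, plus the padding discussed in the next section. A simplified sketch of the generation (made-up names, not the exact scripts used for the benchmark):

    -- high cardinality, randomly ordered input
    CREATE TABLE high_cardinality_random AS
    SELECT i AS a,                              -- sort key, all values distinct
           repeat(md5(i::text), 10) AS padding  -- ~320B of padding per row
    FROM generate_series(1, 1000000) s(i)
    ORDER BY random();

    -- low cardinality, perfectly sorted input
    CREATE TABLE low_cardinality_sorted AS
    SELECT (i % 100) AS a,                      -- only 100 distinct sort keys
           repeat(md5(i::text), 10) AS padding
    FROM generate_series(1, 1000000) s(i)
    ORDER BY a;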
2) TOAST overhead

Peter also mentioned that some of the cases have quite a bit of padding, and that the TOAST overhead distorts the results. It's true there's quite a bit of padding (~320B), but I don't quite see why that would make the results bogus - I've intentionally constructed the data like this to see how the sort behaves with wide rows, because:

* many BI queries actually fetch quite a lot of columns, and while 320B may seem a bit high, it's not that difficult to reach with a few NUMERIC columns
* we're getting parallel aggregate in 9.6, which relies on serializing the aggregate state (and the combine phase may then need to do a sort again)

Moreover, while there certainly is TOAST overhead, I don't quite see why it should change with the patch (the padding columns are not used as a sort key). Perhaps the patch results in "moving the tuples around more" (deemphasizing comparisons), but I don't see why that shouldn't be an important metric in general - memory bandwidth seems to be quite an important bottleneck these days. Of course, if this only affects the pathological cases, we may ignore it.

regards

--
Tomas Vondra                    http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services