Re: MCV lists for highly skewed distributions - Mailing list pgsql-hackers
From: Dean Rasheed
Subject: Re: MCV lists for highly skewed distributions
Msg-id: CAEZATCUEmHCZeOHJN8JO5O9LK_VuFeCecy_AxTk7S_2SmLXeyw@mail.gmail.com
In response to: Re: MCV lists for highly skewed distributions (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses: Re: MCV lists for highly skewed distributions
List: pgsql-hackers
On 7 February 2018 at 15:58, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> On 7 February 2018 at 15:25, Robert Haas <robertmhaas@gmail.com> wrote:
>> Do you plan to press forward with this, then, or what's
>> the next step?
>
> I plan to do more testing

TL;DR: I tested this new algorithm using 9 different relative standard error cutoffs (10%, 15%, ... 50%) against a variety of different data distributions, and it looks like a definite improvement over the current algorithm, for a wide range of cutoff values. Overall, it isn't that sensitive to the exact cutoff chosen, but it looks like a value of 20% is a good general-purpose choice.

One of the properties of the new algorithm is that the size of the MCV list responds better to changing the stats target, so I don't think it's worth having a separate knob to control the cutoff percentage. Such a knob would have a similar, but weaker, effect than the stats target knob, and thus doesn't seem worthwhile.

Attached is an updated patch. I decided to move the code that determines the minimum number of occurrences for an MCV list item out to a separate function. It's not much code, but there's a large comment to explain its statistical justification, and I didn't want to duplicate that in 2 different places (possibly to become 3, with the multivariate MCV stats patch).

I think this is ready for commit now, so barring objections, I will do so in the next day or so.

---

I tested using the attached python script, which tests a large number of different data distributions, comparing the results with the patch against those from HEAD. It uses EXPLAIN ANALYSE to compare the estimates against actual row counts, both for values in the MCV list and for non-MCV values, making it possible to tell whether it would have been better if the MCV list had been bigger or smaller. There's a lot of random variation between tests, but by running a lot of them, the general pattern can be seen.

I ran the test 9 times, with different MCV cutoff values in the patched code of 10%, 15%, ... 50%, then collected all the results from the test runs into a single CSV file to analyse using SQL. The columns in the CSV are:

test_name - A textual description of the data distribution used in the test.

mcv_cutoff - The relative standard error cutoff percentage used (10, 15, 20, ... 50), or -10, -15, ... -50 for the corresponding tests against HEAD.

stats_tgt - The stats target used (100 or 1000).

num_mcvs - The size of the resulting MCV list.

avg_mcv_err - The average percentage difference between estimated and actual row counts for values in the MCV list. (There's a bit of a fudge factor in calculating these percentages, to avoid them becoming too large in cases where the row counts are small, but I think it's still useful for comparison purposes.)

max_mcv_err - The maximum percentage difference between estimated and actual row counts for values in the MCV list.

avg_non_mcv_err - The average percentage difference between estimated and actual row counts for a bunch of non-MCV values.

max_non_mcv_err - The maximum percentage difference between estimated and actual row counts for a bunch of non-MCV values.

avg_err - The average percentage difference between estimated and actual row counts for all values tested.

max_err - The maximum percentage difference between estimated and actual row counts for all values tested.

From this, it's possible to look for cases where the MCV list is too large or too small.
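For reference, the queries that follow assume the CSV has been loaded into a "results" table. Here is a minimal sketch of doing that; the table name, file name, and column types are assumptions for illustration, not taken from the test script:

CREATE TABLE results (
    test_name       text,             -- data distribution description
    mcv_cutoff      integer,          -- RSE cutoff %, negated for HEAD runs
    stats_tgt       integer,          -- 100 or 1000
    num_mcvs        integer,          -- size of the resulting MCV list
    avg_mcv_err     double precision, -- percentage errors, as described above
    max_mcv_err     double precision,
    avg_non_mcv_err double precision,
    max_non_mcv_err double precision,
    avg_err         double precision,
    max_err         double precision
);

-- assumes the CSV has a header line; drop HEADER if it doesn't
\copy results FROM 'mcv-test-results.csv' WITH (FORMAT csv, HEADER)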
For example, looking for too-many-mcv cases (generally the more uniform data distributions):

SELECT mcv_cutoff, count(*)
  FROM results
 WHERE max_mcv_err - max_non_mcv_err > 50
 GROUP BY 1
 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -40 |    41
        -50 |    40
        -25 |    39
        -15 |    38
        -20 |    38
        -30 |    38
        -35 |    38
        -45 |    37
        -10 |    37
         50 |    35
         40 |    35
         45 |    34
         35 |    33
         30 |    33
         25 |    25
         20 |    13
         15 |     6
         10 |     3
(18 rows)

SELECT mcv_cutoff, count(*)
  FROM results
 WHERE max_mcv_err - max_non_mcv_err > 100
 GROUP BY 1
 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -30 |    17
        -40 |    16
        -15 |    15
        -25 |    14
        -10 |    14
        -35 |    14
        -45 |    13
        -50 |    13
        -20 |    12
         35 |    12
         45 |    11
         40 |    10
         30 |    10
         50 |    10
         25 |     6
         10 |     2
(16 rows)

(and implicitly:
         15 |     0
         20 |     0
)

So the patched code is generally better at avoiding the too-many-mcvs problem, but an mcv_cutoff of 30% or more may be too high.

Looking for too-few-mcv cases:

SELECT mcv_cutoff, count(*)
  FROM results
 WHERE (max_non_mcv_err - max_mcv_err > 50 AND num_mcvs < stats_tgt)
 GROUP BY 1
 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -20 |   120
        -50 |   116
        -30 |   115
        -35 |   114
        -25 |   114
        -10 |   114
        -45 |   113
         10 |   113
        -40 |   113
        -15 |   111
         15 |    98
         20 |    49
         25 |    44
         40 |    39
         30 |    38
         45 |    37
         35 |    37
         50 |    35
(18 rows)

SELECT mcv_cutoff, count(*)
  FROM results
 WHERE (max_non_mcv_err - max_mcv_err > 100 AND num_mcvs < stats_tgt)
 GROUP BY 1
 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -20 |   119
        -50 |   116
        -30 |   115
        -40 |   113
        -45 |   112
        -25 |   112
        -10 |   112
        -35 |   112
        -15 |   111
         10 |   106
         15 |    51
         20 |    45
         25 |    43
         30 |    38
         40 |    37
         35 |    36
         45 |    34
         50 |    31
(18 rows)

and looking specifically to see to what extent the problem persists when the stats target is increased to 1000:

SELECT mcv_cutoff, count(*)
  FROM results
 WHERE max_non_mcv_err - max_mcv_err > 100
   AND stats_tgt = 1000
 GROUP BY 1
 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -20 |    69
        -30 |    67
        -50 |    67
        -35 |    66
        -40 |    66
        -10 |    66
        -15 |    65
        -25 |    64
        -45 |    63
         10 |    60
         30 |     8
         20 |     8
         45 |     8
         15 |     8
         35 |     7
         50 |     7
         25 |     7
         40 |     6
(18 rows)

So again, the patched code is generally better at avoiding the too-few-mcvs problem, but an mcv_cutoff of 10% is almost certainly too low.

Looking at the effect of changing the stats target from 100 to 1000 in the tests:

SELECT r1.mcv_cutoff, sum(r2.num_mcvs - r1.num_mcvs)
  FROM results r1, results r2
 WHERE r2.test_name = r1.test_name
   AND r2.mcv_cutoff = r1.mcv_cutoff
   AND r1.stats_tgt = 100
   AND r2.stats_tgt = 1000
 GROUP BY 1
 ORDER BY 2 DESC;

 mcv_cutoff |  sum
------------+-------
         15 | 88582
         20 | 87124
         35 | 86828
         10 | 86749
         45 | 86632
         40 | 86552
         50 | 86384
         25 | 85567
         30 | 85352
        -25 | 28596
        -15 | 28567
        -35 | 28559
        -40 | 28517
        -20 | 28509
        -30 | 28483
        -45 | 28464
        -50 | 28452
        -10 | 28411
(18 rows)

it's clear that the patched code is much better at responding to increasing the stats target and producing more MCVs.
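That responsiveness is what the relative standard error cutoff predicts. As a rough sketch of the rule (my reconstruction of the statistics being described, not code from the patch): treating ANALYZE's sample as a simple random sample of n rows from a table of N rows, a value with frequency f has a sample count with hypergeometric standard deviation sqrt(n*f*(1-f)*(N-n)/(N-1)), and requiring the relative standard error stddev/(n*f) to be at most 20% works out to a minimum sample count of n*(N-n) / (N-n + 0.04*n*(N-1)). Plugging in illustrative numbers for a 1,000,000-row table (ANALYZE samples 300 * stats_target rows, so 30,000 or 300,000 here):

-- Illustrative arithmetic only, not output from the patch
SELECT sample_n,
       round(sample_n * (total_n - sample_n)
             / ((total_n - sample_n) + 0.04 * sample_n * (total_n - 1)), 1)
           AS mincount
  FROM (VALUES (30000::numeric, 1000000::numeric),
               (300000::numeric, 1000000::numeric)) AS t(sample_n, total_n);

 sample_n | mincount
----------+----------
    30000 |     24.2
   300000 |     17.5

The absolute count threshold barely moves, but as a fraction of the table it falls from roughly 0.08% to 0.006%, so a larger stats target admits considerably less common values into the MCV list, consistent with the numbers above.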
Interestingly, I noticed while eyeballing the raw test output that with HEAD there are quite a few instances where increasing the stats target actually decreases the size of the MCV list, and these are often cases where the data is more non-uniformly distributed and the MCV list is too small to start with:

SELECT r1.mcv_cutoff, sum(r2.num_mcvs - r1.num_mcvs)
  FROM results r1, results r2
 WHERE r2.test_name = r1.test_name
   AND r2.mcv_cutoff = r1.mcv_cutoff
   AND r1.stats_tgt = 100
   AND r2.stats_tgt = 1000
   AND r2.num_mcvs < r1.num_mcvs
 GROUP BY 1
 ORDER BY 2 DESC;

 mcv_cutoff |  sum
------------+-------
         15 |    -4
         10 |   -11
        -35 | -5475
        -20 | -5493
        -50 | -5507
        -40 | -5510
        -45 | -5510
        -30 | -5511
        -25 | -5514
        -10 | -5528
        -15 | -5532
(11 rows)

I could probably keep going, querying the test result data in all sorts of other ways (and it's attached, should anyone wish to do so, although bear in mind that it's subject to quite large random variations), but my general conclusions are:

- The patched code is better than HEAD at avoiding the too-many-mcvs problem, unless the mcv_cutoff is set too high (maybe around 30% or more).

- The patched code is much better at avoiding the too-few-mcvs problem, unless the mcv_cutoff is set too low (10% seems to be too low).

- The patched code is much better at producing more MCVs as the stats target is increased, making it better able to handle more non-uniform distributions.

- An mcv_cutoff of 20% looks like a good general-purpose value.

Regards,
Dean