Re: MCV lists for highly skewed distributions - Mailing list pgsql-hackers
From | Dean Rasheed
---|---
Subject | Re: MCV lists for highly skewed distributions
Date | 
Msg-id | CAEZATCXmzeHM8h63KM4HOO40XvYVw8jxkU0eBgMRBeaUKPjXfw@mail.gmail.com
In response to | Re: MCV lists for highly skewed distributions (John Naylor <jcnaylor@gmail.com>)
Responses | Re: MCV lists for highly skewed distributions
List | pgsql-hackers
On 4 February 2018 at 12:18, John Naylor <jcnaylor@gmail.com> wrote:
> I did the same basic eyeball testing I did on earlier patches, and
> this is the best implementation so far. I've attached some typical
> pg_stats contents for HEAD and this patch. More rigorous testing,
> including of planner estimates on real data, is still needed of
> course, but I think this is definitely in the right direction.

Thanks for testing. I agree, this new algorithm seems to stand up
pretty well in the testing I've done too. One thing about it that can
be tuned is the cutoff threshold for the relative standard error -- I
chose 10% completely arbitrarily, but it could just as easily be 20%,
30% or even higher (in the patch, s/0.01/0.04/ or s/0.01/0.09/).

Looking at the too-many-MCVs example from [1], the table has 100
million rows, and each distinct value occurs 10 times. With the
default stats target of 100, the MCV list is fully populated with
values, most of which have been seen just two (or occasionally three)
times in the sample. Thus, if you query for one of those MCVs, the
estimate is 6667 or 10,000 rather than 10, which is a pretty bad
over-estimate.

Increasing the stats target improves things, because the sample
frequencies go down correspondingly. The problem is also lessened a
bit if you randomise the order of the values in the second column so
that it isn't correlated with the storage order:

insert into qqq select a, random()*10^7 from generate_series(1, (10^8)::int) a;

However, in a sample of 30,000 rows it remains reasonably likely that
the same value will be seen twice (I typically observed 40 to 50 MCVs
in this case, cf. the birthday paradox), and in a sample of 300,000
rows it goes back to giving 1000 MCVs, each seen 2 or 3 times. So this
isn't really the fault of the non-uniform ANALYSE sample, but rather
of the current algorithm's belief that seeing a value just twice is
sufficient for it to qualify as an MCV.

The new MCV algorithm solves this by demanding that a value be seen
many more times before being included, in cases where the table size
is much larger than the sample size. The exact threshold depends on
the table size, the sample size and the relative error cutoff value.
In this example, with a table size of 100 million rows, the sample
size makes little difference, because it's always going to be much
smaller than the table size, so the number of instances required in
the sample depends mostly on the RSE cutoff chosen:

 rse_cutoff | sample=30000 | sample=300000 | sample=3000000
------------+--------------+---------------+----------------
 10%        |          100 |           100 |             97
 20%        |           25 |            25 |             25
 30%        |           12 |            12 |             11
 40%        |            7 |             7 |              7
 50%        |            4 |             4 |              4

So any of those RSE cutoffs solves the too-many-MCVs problem in this
particular case, giving no MCVs, although 50% is pushing it a bit,
and in any case the theoretical basis for this algorithm breaks down
well before then.

The advantage of a larger RSE cutoff is that it gives more MCVs for a
non-uniform data distribution, where the current algorithm tends to
give too few MCVs. In the attached example, the table has 1 million
rows of integer data in a normal distribution with a standard
deviation of 50. This typically gives somewhere in the region of 430
to 440 distinct values.
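For anyone who wants to try something similar without the attachment,
a table roughly matching that description can be generated along the
following lines. This is only a sketch -- normal_rand() comes from
the tablefunc extension, and "norm_test" is just an illustrative
name, so the actual attached test may well differ:

create extension if not exists tablefunc;

-- ~1 million values from a normal distribution with stddev 50,
-- rounded to the nearest integer ("norm_test" is an arbitrary name;
-- the attached test may be set up differently)
create table norm_test as
  select x::int as x
    from normal_rand(1000000, 0, 50) as x;

-- the per-column stats target can be varied for the comparisons below
alter table norm_test alter column x set statistics 100;
analyze norm_test;

-- distinct-value count should land in the region described above
select count(distinct x) from norm_test;

-- inspect the resulting MCV statistics
select n_distinct, most_common_vals, most_common_freqs
  from pg_stats
 where tablename = 'norm_test' and attname = 'x';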
Running against HEAD and with this patch, for varying sample sizes
and RSE cutoffs, gives the following MCV-list sizes:

 stats_target | mcvs (HEAD) | mcvs (10% RSE) | mcvs (20% RSE) | mcvs (30% RSE)
--------------+-------------+----------------+----------------+----------------
           10 |          10 |              0 |             10 |             10
           20 |          20 |              0 |             20 |             20
           30 |          30 |              0 |             30 |             30
           40 |          40 |             17 |             40 |             40
           50 |          50 |             50 |             50 |             50
          100 |         100 |            100 |            100 |            100
          300 |         132 |            204 |            261 |            290
         1000 |         205 |            266 |            315 |            338
         3000 |         252 |            364 |            400 |            411
        10000 |         296 |            413 |            414 |            412

One thing to note is that the HEAD algorithm never includes more than
around 300 values, because of its requirement that a value be more
common than average. The new algorithm has no such requirement, so it
will include nearly every value in the MCV list (except the most
uncommon ones) if the stats target is set high enough. Also, the
detailed output from the test shows that the resulting estimates
based on those MCVs are pretty decent.

(Note: this example shows that the too-few-MCVs problem can occur in
any non-uniform distribution, not just highly skewed ones.)

I've also tried this out against some real-world data, and in the
testing I've done so far the new algorithm is not actually that
sensitive to the precise RSE cutoff chosen, but I'm leaning towards a
value of 20% because there are cases where 10% is a little too
strict.
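For reference, the per-cutoff thresholds in the earlier table can be
reproduced with a quick back-of-the-envelope calculation. If the
relative standard error of a sample count c, for a sample of n rows
from an N-row table, is taken to be roughly sqrt((1 - n/N) / c), then
the minimum count needed to pass a given cutoff is about
(1 - n/N) / cutoff^2. The following is just a sketch of that
arithmetic, not the code in the patch:

-- back-of-the-envelope minimum sample counts per RSE cutoff, for a
-- 100-million-row table (assumes RSE(c) ~ sqrt((1 - n/N) / c); this
-- is an approximation, not the patch's implementation)
select rse_cutoff,
       ceil((1 - 30000 / 1e8) / (rse_cutoff * rse_cutoff))   as "sample=30000",
       ceil((1 - 300000 / 1e8) / (rse_cutoff * rse_cutoff))  as "sample=300000",
       ceil((1 - 3000000 / 1e8) / (rse_cutoff * rse_cutoff)) as "sample=3000000"
  from unnest(array[0.1, 0.2, 0.3, 0.4, 0.5]) as rse_cutoff;

For a 100-million-row table this matches the 100/25/12/7/4 pattern
shown in the table above.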
Regards,
Dean

[1] https://www.postgresql.org/message-id/flat/20170522132017.29944.48391@wrigleys.postgresql.org

Attachment