Re: MCV lists for highly skewed distributions - Mailing list pgsql-hackers
| From | Dean Rasheed |
|---|---|
| Subject | Re: MCV lists for highly skewed distributions |
| Date | |
| Msg-id | CAEZATCXbC2wX0WYL5YvtiQy9=SDhRT9WHMLftujwEq5HAGosgg@mail.gmail.com |
| In response to | Re: MCV lists for highly skewed distributions (Dean Rasheed <dean.a.rasheed@gmail.com>) |
| Responses | Re: MCV lists for highly skewed distributions; Re: MCV lists for highly skewed distributions |
| List | pgsql-hackers |
On 6 March 2018 at 16:48, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> On 6 March 2018 at 08:51, John Naylor <jcnaylor@gmail.com> wrote:
>> On 3/5/18, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>>> Attached is an updated patch.
>>
>> Nice. The results look good.
>
> Thanks for the review.

So I was about ready to commit this, but decided to do more testing, because I worry that there may be instances that a user could point to where it makes estimates worse. Maybe that's inevitable for any change we might make, and I don't think that should stop us from at least trying to improve things, but it does make me wary about committing this without a lot of testing.

In all the testing I've done so far, this looks to be a definite improvement over the current algorithm, but it looks like it's possible to do better. Consider the following test case:

drop table if exists foo;
create temp table foo(a int);
insert into foo
  select * from
  (
    select x/2 from generate_series(0,19999999) g(x)
    union all
    select 0 from generate_series(0,9999999)
    union all
    select 1 from generate_series(0,999999)
    union all
    select 2 from generate_series(0,99999)
    union all
    select 3 from generate_series(0,9999)
    union all
    select 4 from generate_series(0,999)
    union all
    select 5 from generate_series(0,99)
  ) t
  order by random();
alter table foo alter column a set statistics 10000;
analyse foo;

So the table has around 31 million rows, and the stats target is maxed out -- we're sampling around 10% of the table, and it's not possible to sample more.

Looking at the value a=5, it appears 102 times in the table, so we can expect it to be seen around 10 times in any sample, but the minimum count for the new algorithm in this case is 23, so it won't be included in the MCV list. This leads to it having the same estimate as all other non-MCV values, which turns out to be rows=5, a considerable under-estimate.

By comparison, the current algorithm in HEAD does include this value in the MCV list, so it gets a good estimate. On the other hand, it generates a complete MCV list with 10000 entries, most of which are just random values from the table that appear twice in the table and also happen to appear twice in the sample. So there are nearly 10000 values that are significantly over-estimated in this case.

So arguably, the new algorithm is still better than the current one for this data, but the fact that it doesn't include a=5 in the list bugs me, because the estimate for that single value is consistently worse.

Looking at the distribution of a=5 in the sample, it is a hypergeometric distribution with N=31111100, n=3000000 and K=102. This gives a sample mean of 9.8 and a variance of 8.9 (a standard deviation of 3.0). Also, this value is common enough that its distribution can be reasonably approximated by a normal distribution.

The value is excluded from the MCV list because its spread is deemed too large relative to its mean, meaning that the relative error of the MCV-based estimate would be too high. However, not including it in the MCV list causes it to have an estimate of 5, which would correspond to a sample mean of 0.5, and the observed sample mean is more than 3 standard deviations higher than that. So we do actually have enough information to conclude that the value is almost certainly more common than the non-MCV selectivity would suggest, and I think that's sufficient justification for including it in the MCV list.
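To make that arithmetic easy to check, here is a small standalone program reproducing the numbers above. The hypergeometric mean and variance formulas are standard; the 20% RSE cutoff is an assumption (it matches the quoted minimum count of 23), and the program itself is only illustrative:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double N = 31111100.0;  /* rows in foo */
    double n = 3000000.0;   /* sample size: 300 x stats target 10000 */
    double K = 102.0;       /* occurrences of a=5 in the table */

    /* Hypergeometric distribution of a=5's count in the sample */
    double mean = n * K / N;                                  /* ~9.8 */
    double var  = mean * (1.0 - K / N) * (N - n) / (N - 1.0); /* ~8.9 */
    double sd   = sqrt(var);                                  /* ~3.0 */

    /* Minimum sample count under the RSE algorithm, assuming a 20%
     * RSE cutoff: solve sqrt((1 - n/N) / mean) < 0.2 for the mean. */
    double min_count = (1.0 - n / N) / (0.2 * 0.2);           /* ~22.6 -> 23 */

    /* Sample mean implied by the non-MCV estimate of rows=5 */
    double implied = n * 5.0 / N;                             /* ~0.5 */

    printf("mean=%.1f variance=%.1f sd=%.1f\n", mean, var, sd);
    printf("min count for RSE inclusion=%.1f\n", min_count);
    printf("implied non-MCV sample mean=%.1f (z=%.1f)\n",
           implied, (mean - implied) / sd);
    return 0;
}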
The crux of this is that the relative-standard-error algorithm is making use of two pieces of information -- the sample frequency, and an estimate of its statistical spread. However, there is a third piece of information that we know, which is not being made use of -- the selectivity that would be assigned to the value if it were not included in the MCV list.

Looking at just the first two pieces of information ensures that we get decent estimates for the values in the MCV list (where what constitutes "decent" depends on an arbitrarily chosen RSE cutoff), but it doesn't take into account the fact that the values not in the list may have much worse estimates. Making use of the non-MCV-based estimate allows us to do better -- if the MCV-based estimate is statistically significantly higher than the non-MCV-based estimate, then it would almost certainly be better to include that value in the MCV list, even if its spread is quite large.

This is somewhat similar to the initial idea from Jeff Janes, except that his patch didn't take the spread into account, so it ended up making the too-many-MCVs problem worse for uniform distributions. There's also another subtlety with that patch -- when comparing the frequencies of values not in the MCV list, you really have to start from a fully populated MCV list and work down, rather than starting with an empty one and working up, because if all the common values have about the same frequency, the most common amongst them may not be much more common than the initial average, so you may end up with an empty MCV list.

So attached is an updated patch, based on this new idea. The principle behind the error estimating is similar, but the code ended up looking very different (a rough sketch of the new selection rule is included at the end of this message).

Repeating the previous batch of tests, the results are broadly similar, in that it does quite well at preventing too many or too few MCVs, and it responds well to increasing the stats target. The main difference is that it tends to produce a few more MCVs for each of the non-uniform distributions, making the estimates for the non-MCV values better.

Running the test above with a variety of different stats target values against HEAD, Jeff's original patch (P1), the RSE patch, and this new v2 patch gives the following MCV-list sizes:

 stats_target    HEAD      P1    RSE     v2
        10000   10000   10000      5      6
         8000    8000    8000      5      6
         6000    6000    6000      5      6
         4000    4000    4000      5    5/6
         2000    2000    2000    4/5      5
         1000    ~900    ~850      4      5
          500    ~240    ~250      4    4/5
          250     ~70     ~60    3/4      4
          100     ~20     ~15      3      4

So HEAD and P1 are both producing too many MCVs. The latest two patches cure that problem, but the new v2 patch is better at picking out all of the common values.

I'm moving this back to a status of "Needs review", partly because the code has changed significantly, but also because I want to do more testing, particularly with larger datasets.

Regards,
Dean
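A rough standalone sketch of the selection rule described above, for anyone who wants to experiment with the idea. The function and variable names here are hypothetical, the 2-sigma threshold is an assumption, and the edge-case handling of the real code is omitted -- see the attached patch for the actual implementation:

#include <math.h>

static int
choose_num_mcvs(const int *mcv_counts,  /* candidate sample counts, sorted descending */
                int num_mcv,            /* initial size of the fully populated list */
                double samplerows,      /* rows in the sample */
                double totalrows,       /* rows in the table */
                double ndistinct)       /* estimated distinct values in the table */
{
    double  sumcount = 0.0;
    int     i;

    for (i = 0; i < num_mcv; i++)
        sumcount += mcv_counts[i];

    /* Start from a fully populated list and work down, discarding the
     * least common candidate while it is not significantly more common
     * than the estimate it would get as a non-MCV value. */
    while (num_mcv > 0)
    {
        double  count = mcv_counts[num_mcv - 1];
        double  selec, K, mean, variance, stddev;

        /* Average selectivity a non-MCV value would get if this
         * candidate were dropped from the list (assumes
         * ndistinct > num_mcv - 1; real code needs guards here). */
        selec = (1.0 - (sumcount - count) / samplerows) /
                (ndistinct - (num_mcv - 1));

        /* Hypergeometric mean and variance of this value's sample
         * count, if it really had that non-MCV selectivity. */
        K = selec * totalrows;
        mean = samplerows * K / totalrows;
        variance = mean * (1.0 - K / totalrows) *
                   (totalrows - samplerows) / (totalrows - 1.0);
        stddev = sqrt(variance);

        /* Keep this candidate (and all more common ones) if its
         * observed count is significantly above that mean. */
        if (count > mean + 2.0 * stddev)
            break;

        sumcount -= count;
        num_mcv--;
    }

    return num_mcv;
}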