Re: MCV lists for highly skewed distributions - Mailing list pgsql-hackers
From: Dean Rasheed
Subject: Re: MCV lists for highly skewed distributions
Msg-id: CAEZATCWYP6n18dGCUAgKy3PJLy5WgLzkaf+wTbDyi-F=884TNA@mail.gmail.com
In response to: Re: MCV lists for highly skewed distributions (Simon Riggs <simon@2ndquadrant.com>)
Responses:
  Re: MCV lists for highly skewed distributions
  Re: MCV lists for highly skewed distributions
List: pgsql-hackers
On 1 February 2018 at 13:16, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 25 January 2018 at 22:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In any case, since it looks like the next step is for someone to come
>> up with a new proposal, I'm going to set this to Waiting on Author.
>
> Dean and John's results show that different algorithms work better for
> different cases.
>
> How about we make ANALYZE's MCV algorithm pluggable? And then include,
> say, 2 additional algorithms.

I don't think we've yet proved that that's needed. I'd rather attempt
to come up with a decent general-purpose algorithm, if possible.

The main problem I have with the originally proposed patch is the lack
of any rigorous justification for the approach of changing the
algorithm to include values that are "significantly more common than
average frequency for values which will not be represented in the MCV
list". There is no guarantee that the MCVs produced will be backed by
sufficient evidence, and it risks making the too-many-MCVs case worse.

Of course, the current code suffers from both the too-many-MCVs and
too-few-MCVs problems, depending on the data distribution:

For a reasonably uniform distribution with quite a large number of
distinct values, it is possible to generate MCVs on the basis of
having seen only a few instances of each value. Those few occurrences
are not statistically significant, and extrapolating from them to the
whole table produces wildly inaccurate estimates.

For a highly skewed distribution, there may be hardly any values
(maybe only one) that appear more than 1.25 times the average
frequency, and so lots of otherwise perfectly good common values get
discarded. For such a distribution, the average frequency is not
particularly interesting, and it shouldn't be used to filter the MCV
list.
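To see why the average-frequency cutoff misbehaves under skew, here is a toy illustration (the counts and the 1.25x multiplier mirror the description above; the numbers themselves are invented):

```python
# Sample counts per distinct value: one dominant value plus 100 values
# that are each genuinely common, just far less so than the dominant one.
counts = [5000] + [50] * 100

# The average frequency is inflated by the single dominant value...
avg = sum(counts) / len(counts)        # ~99

# ...so a cutoff of 1.25 * average discards every other common value.
threshold = 1.25 * avg                 # ~124
mcvs = [c for c in counts if c > threshold]
```

Here `mcvs` ends up containing only the single dominant value, even though the 100 values seen 50 times each are well supported by the sample.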
There is also another variant of the too-many-MCVs problem that I
believe is possible -- if the sample contains a large number of NULLs
or too-wide values, values_cnt could be reduced to the point where
maxmincount is quite small, and again MCVs could get included on the
basis of a very small number of occurrences.

I think it would be better to come up with an alternative algorithm
that has a sounder theoretical basis, and then test how it holds up in
practice. With that in mind, attached is a patch based on the idea of
bounding the relative standard error of the sample frequency -- we
include a value in the MCV list if and only if it is seen enough times
in the sample that the standard error of its sample frequency is small
compared to the sample frequency itself, so that the relative error
from extrapolating to the whole table should also be small. In theory,
then, it should never generate too many MCVs, although the
relative-error threshold might need tuning to prevent it from
generating too few.

Note, this is not a finished patch (e.g., I haven't touched
compute_distinct_stats()). It's merely a possible starting point from
which a lot more testing will be required.

Testing it with the example from [1], it does indeed solve the
too-many-MCVs problem in that particular case (it generates no MCVs at
all). Testing it with the highly skewed example at the start of this
thread, it typically generates a couple more MCVs, producing somewhat
better estimates, but to get really good estimates for who=17, you
need to crank up the stats target. At least it no longer appears that
cranking up the stats target has only a weak effect.

Regards,
Dean

[1] https://www.postgresql.org/message-id/flat/20170522132017.29944.48391@wrigleys.postgresql.org
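A minimal sketch of the relative-standard-error criterion described above, assuming a simple binomial model of the sample frequency with a finite-population correction (the function names, the 10% default threshold, and the exact formula are illustrative assumptions on my part, not taken verbatim from the patch):

```python
import math

def rse(count, sample_size, table_size):
    """Relative standard error of the sample frequency of a value seen
    `count` times in a sample of `sample_size` rows drawn without
    replacement from a table of `table_size` rows.  Models the sample
    frequency as a binomial proportion with a finite-population
    correction (an illustrative assumption)."""
    f = count / sample_size
    se = math.sqrt(f * (1 - f) / sample_size
                   * (table_size - sample_size) / (table_size - 1))
    return se / f

def keep_as_mcv(count, sample_size, table_size, max_rse=0.1):
    """Include the value only if its sample frequency is known to
    within max_rse (e.g. 10%) relative error, so extrapolating it to
    the whole table is backed by sufficient evidence."""
    return rse(count, sample_size, table_size) <= max_rse
```

For example, with a 30,000-row sample of a 1,000,000-row table, a value seen 150 times has a relative standard error around 8% and would be kept, while a value seen only 10 times has a relative standard error above 30% and would be dropped -- regardless of how it compares to the average frequency.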
Attachment