Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"? - Mailing list pgsql-hackers
From | Mark Kirkwood
---|---
Subject | Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
Date |
Msg-id | 80a466cb-de24-24d2-2c75-e14751950979@catalyst.net.nz
In response to | Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"? (Gavin Flower <GavinFlower@archidevsys.co.nz>)
Responses | Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
List | pgsql-hackers
On 06/06/17 10:12, Gavin Flower wrote:
> On 06/06/17 09:41, Mark Kirkwood wrote:
>> On 05/06/17 09:30, Tom Lane wrote:
>>
>>> I've been thinking about the behavior discussed in https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org and it seems to me that there are a couple of things we ought to do about it.
>>>
>>> First, I think we need a larger hard floor on the number of occurrences of a value that're required to make ANALYZE decide it is a "most common value". The existing coding is willing to believe that anything that appears at least twice in the sample is a potential MCV, but that design originated when we were envisioning stats samples of just a few thousand rows --- specifically, default_statistics_target was originally just 10, leading to a 3000-row sample size. So accepting two-appearance values as MCVs would lead to a minimum MCV frequency estimate of 1/1500. Now it could be a tenth or a hundredth of that.
>>>
>>> As a round number, I'm thinking that a good floor would be a frequency estimate of 1/1000. With today's typical sample size of 30000 rows, a value would have to appear at least 30 times in the sample to be believed to be an MCV. That seems like it gives us a reasonable margin of error against the kind of sampling noise seen in the above-cited thread.
>>>
>>> Second, the code also has a rule that potential MCVs need to have an estimated frequency at least 25% larger than what it thinks the "average" value's frequency is. A rule of that general form seems like a good idea, but I now think the 25% threshold is far too small to do anything useful. In particular, in any case like this where there are more distinct values than there are sample rows, the "average frequency" estimate will correspond to less than one occurrence in the sample, so that this rule is totally useless to filter anything that we would otherwise consider as an MCV. I wonder if we shouldn't make it be "at least double the estimated average frequency".
>>>
>>
>> Or possibly calculate the sample standard deviation and make use of that to help decide on a more flexible cutoff than twice the avg frequency?
>>
>> Are there any research papers that might help us here (I'm drowning in a sea of barely relevant search results for most phrases I've tried so far)? I recall there were some that Tom referenced when this stuff was originally written.
>>
>> On the other hand I do have access to some mathematicians specializing in statistics - so can get their thoughts on this issue if you feel it would be worthwhile.
>>
>> Cheers
>>
>> Mark
>>
> The standard deviation (sd) is proportional to the square root of the number in the sample in a Normal Distribution.
>
> In a Normal Distribution, about 2/3 of the values will be within plus or minus one sd of the mean.
>
> There seems to be an implicit assumption that the distribution of values follows the Normal Distribution - has this been verified? I suspect that real data will have a skewed distribution of values, and may even be multi-modal (multiple peaks). The Normal Distribution has one central peak with 2 tails of the same shape & size.
>
> So in a sample of 100 the sd is proportional to 10%, for 10,000 the sd is proportional to 1%.
>
> So essentially, the larger the sample size the more reliable is knowledge of the most common values (ignoring pathologically extreme distributions!) - the measure of reliability depends on the distribution.
>
> How about selecting the cut off as the mean plus one sd, or something of that nature? Note that the cut off point may result in no MCVs being selected - especially for small samples.
>
> If practicable, it would be good to sample real datasets. Suggest looking at datasets where the current mechanism looks reasonable, and ones where the estimates are too far off. Also, if possible, try any new selection method on the datasets and see what the difference is.
>
> The above is based on what I remember from my university statistics papers, I took it up to 4th year level many moons ago.
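
First, just to put concrete numbers on Tom's proposed rules, here is a rough sketch (Python, purely illustrative - this is not the analyze.c logic, and the helper name plus the way I estimate the "average" frequency from a distinct-count estimate are just made up for the example):

# Illustrative only - not the actual analyze.c logic. Rule 1 is the
# proposed hard floor of 1/1000 on the estimated frequency; rule 2 is
# "at least double the estimated average frequency" (currently 1.25x).

def passes_proposed_mcv_rules(count, sample_rows, est_distinct, nonnull_frac=1.0):
    est_freq = count / sample_rows            # value's estimated frequency
    avg_freq = nonnull_frac / est_distinct    # rough "average" value frequency
    return est_freq >= 1.0 / 1000 and est_freq >= 2.0 * avg_freq

# With today's default 30000-row sample the floor alone requires
# 0.001 * 30000 = 30 appearances; with the original 3000-row sample,
# two appearances already meant an estimate of 2/3000 = 1/1500.
print(passes_proposed_mcv_rules(count=2, sample_rows=30000, est_distinct=50000))   # False
print(passes_proposed_mcv_rules(count=30, sample_rows=30000, est_distinct=50000))  # True
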
With respect to the assumption of Normal distribution - it is easy to come up with an example that shows it need not hold: consider our Pgbench 'accounts' table. The distribution of 'bid' values is not Normal (it is Uniform). Now I realize I made people think about Normal distributions by mentioning computing the standard deviation of the sample (and I probably should have said 'standard deviation of the *sample frequencies*') - but this was only a suggestion for working out how to (maybe) be less arbitrary about how we decide what values to put in the MCV list (currently 25% different from the mean frequency).

regards

Mark
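
P.S. To show what I mean by 'standard deviation of the *sample frequencies*', here is a minimal sketch (again Python, again only illustrative - the helper and the synthetic data are made up) that keeps values whose sample count exceeds the mean count by some number of standard deviations:

# Illustrative only: one way to read "standard deviation of the
# *sample frequencies*" - compute per-value counts in the sample and
# keep values whose count exceeds the mean count by num_sds sds.
import random
from collections import Counter
from statistics import mean, stdev

def mcv_candidates(sample, num_sds=1.0):
    counts = Counter(sample)
    m = mean(counts.values())
    sd = stdev(counts.values()) if len(counts) > 1 else 0.0
    return sorted(v for v, c in counts.items() if c > m + num_sds * sd)

random.seed(1)

# A Uniform column, shaped roughly like pgbench_accounts.bid with 100
# branches: every value has the same true frequency, so anything
# selected here is pure sampling noise.
uniform = [random.randint(1, 100) for _ in range(30000)]
print(len(mcv_candidates(uniform)))

# A skewed column where one value really is common: only that value
# should clear the cutoff.
skewed = [0 if random.random() < 0.3 else random.randint(1, 100)
          for _ in range(30000)]
print(mcv_candidates(skewed))

(Note that with only one sd a handful of the uniform values still get over the line purely through sampling noise, so the multiplier - and probably a hard floor like the 1/1000 Tom suggests - would still need to be chosen carefully.)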