Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"? - Mailing list pgsql-hackers
From | Mark Kirkwood
---|---
Subject | Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
Date |
Msg-id | 80a466cb-de24-24d2-2c75-e14751950979@catalyst.net.nz
In response to | Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"? (Gavin Flower <GavinFlower@archidevsys.co.nz>)
Responses | Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
List | pgsql-hackers
On 06/06/17 10:12, Gavin Flower wrote:
> On 06/06/17 09:41, Mark Kirkwood wrote:
>> On 05/06/17 09:30, Tom Lane wrote:
>>
>>> I've been thinking about the behavior discussed in https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org and it seems to me that there are a couple of things we ought to do about it.
>>>
>>> First, I think we need a larger hard floor on the number of occurrences of a value that're required to make ANALYZE decide it is a "most common value". The existing coding is willing to believe that anything that appears at least twice in the sample is a potential MCV, but that design originated when we were envisioning stats samples of just a few thousand rows --- specifically, default_statistics_target was originally just 10, leading to a 3000-row sample size. So accepting two-appearance values as MCVs would lead to a minimum MCV frequency estimate of 1/1500. Now it could be a tenth or a hundredth of that.
>>>
>>> As a round number, I'm thinking that a good floor would be a frequency estimate of 1/1000. With today's typical sample size of 30000 rows, a value would have to appear at least 30 times in the sample to be believed to be an MCV. That seems like it gives us a reasonable margin of error against the kind of sampling noise seen in the above-cited thread.
>>>
>>> Second, the code also has a rule that potential MCVs need to have an estimated frequency at least 25% larger than what it thinks the "average" value's frequency is. A rule of that general form seems like a good idea, but I now think the 25% threshold is far too small to do anything useful. In particular, in any case like this where there are more distinct values than there are sample rows, the "average frequency" estimate will correspond to less than one occurrence in the sample, so that this rule is totally useless to filter anything that we would otherwise consider as an MCV. I wonder if we shouldn't make it be "at least double the estimated average frequency".
>>>
>>
>> Or possibly calculate the sample standard deviation and make use of that to help decide on a more flexible cutoff than twice the avg frequency?
>>
>> Are there any research papers that might help us here (I'm drowning in a sea of barely relevant search results for most phrases I've tried so far)? I recall there were some that Tom referenced when this stuff was originally written.
>>
>> On the other hand I do have access to some mathematicians specializing in statistics - so can get their thoughts on this issue if you feel it would be worthwhile.
>>
>> Cheers
>>
>> Mark
>>
> The standard deviation (sd) is proportional to the square root of the number in the sample in a Normal Distribution.
>
> In a Normal Distribution, about 2/3 of the values will be within plus or minus one sd of the mean.
>
> There seems to be an implicit assumption that the distribution of values follows the Normal Distribution - has this been verified? I suspect that real data will have a skewed distribution of values, and may even be multi-modal (multiple peaks). The Normal Distribution has one central peak with 2 tails of the same shape & size.
>
> So in a sample of 100 the sd is proportional to 10%, for 10,000 the sd is proportional to 1%.
>
> So essentially, the larger the sample size the more reliable is knowledge of the most common values (ignoring pathologically extreme distributions!) - the measure of reliability depends on the distribution.
>
> How about selecting the cut off as the mean plus one sd, or something of that nature? Note that the cut off point may result in no MCVs being selected - especially for small samples.
>
> If practicable, it would be good to sample real datasets. Suggest looking at datasets where the current mechanism looks reasonable, and ones where the estimates are too far off. Also, if possible, try any new selection method on the datasets and see what the difference is.
>
> The above is based on what I remember from my university statistics papers, I took it up to 4th year level many moons ago.
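
First, just to put concrete numbers on Tom's proposed rules, here is a rough sketch (Python, purely illustrative - this is not the analyze.c logic, and the helper name plus the way I estimate the "average" frequency from a distinct-count estimate are just made up for the example):

# Illustrative only - not the actual analyze.c logic. Rule 1 is the
# proposed hard floor of 1/1000 on the estimated frequency; rule 2 is
# "at least double the estimated average frequency" (currently 1.25x).

def passes_proposed_mcv_rules(count, sample_rows, est_distinct, nonnull_frac=1.0):
    est_freq = count / sample_rows            # value's estimated frequency
    avg_freq = nonnull_frac / est_distinct    # rough "average" value frequency
    return est_freq >= 1.0 / 1000 and est_freq >= 2.0 * avg_freq

# With today's default 30000-row sample the floor alone requires
# 0.001 * 30000 = 30 appearances; with the original 3000-row sample,
# two appearances already meant an estimate of 2/3000 = 1/1500.
print(passes_proposed_mcv_rules(count=2, sample_rows=30000, est_distinct=50000))   # False
print(passes_proposed_mcv_rules(count=30, sample_rows=30000, est_distinct=50000))  # True
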
With respect to the assumption of Normal distribution - it is easy to come up with an example that shows it need not hold: consider our Pgbench 'accounts' table. The distribution of 'bid' values is not Normal (it is Uniform). Now I realize I made people think about Normal distributions by mentioning computing the standard deviation of the sample (and I probably should have said 'standard deviation of the *sample frequencies*') - but this was only a suggestion for working out how to (maybe) be less arbitrary about how we decide what values to put in the MCV list (currently 25% different from the mean frequency).

regards

Mark
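
P.S. To show what I mean by 'standard deviation of the *sample frequencies*', here is a minimal sketch (again Python, again only illustrative - the helper and the synthetic data are made up) that keeps values whose sample count exceeds the mean count by some number of standard deviations:

# Illustrative only: one way to read "standard deviation of the
# *sample frequencies*" - compute per-value counts in the sample and
# keep values whose count exceeds the mean count by num_sds sds.
import random
from collections import Counter
from statistics import mean, stdev

def mcv_candidates(sample, num_sds=1.0):
    counts = Counter(sample)
    m = mean(counts.values())
    sd = stdev(counts.values()) if len(counts) > 1 else 0.0
    return sorted(v for v, c in counts.items() if c > m + num_sds * sd)

random.seed(1)

# A Uniform column, shaped roughly like pgbench_accounts.bid with 100
# branches: every value has the same true frequency, so anything
# selected here is pure sampling noise.
uniform = [random.randint(1, 100) for _ in range(30000)]
print(len(mcv_candidates(uniform)))

# A skewed column where one value really is common: only that value
# should clear the cutoff.
skewed = [0 if random.random() < 0.3 else random.randint(1, 100)
          for _ in range(30000)]
print(mcv_candidates(skewed))

(Note that with only one sd a handful of the uniform values still get over the line purely through sampling noise, so the multiplier - and probably a hard floor like the 1/1000 Tom suggests - would still need to be chosen carefully.)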