Re: serious under-estimation of n_distinct for clustered distributions - Mailing list pgsql-performance
| From | Stefan Andreatta |
|---|---|
| Subject | Re: serious under-estimation of n_distinct for clustered distributions |
| Date | |
| Msg-id | 50F3B543.8020205@synedra.com |
| In response to | Re: serious under-estimation of n_distinct for clustered distributions (Stefan Andreatta <s.andreatta@synedra.com>) |
| Responses | Re: serious under-estimation of n_distinct for clustered distributions |
| List | pgsql-performance |
A status update on this problem:

1.) Workarounds (setting n_distinct manually) are tested and - as far as
workarounds go - OK.

2.) Source of the problem and possible solution: the source of these
troubles is the sampling method employed in src/backend/commands/analyze.c.
Judging from Tom Lane's comment on the original implementation in 2004, it
was never thought to be perfect.

Does anybody see a chance to improve that part? Should this discussion be
taken elsewhere? Is there any input from my side that could help?

btw: I find this problem to be very frequent in our databases, and
considering the commonplace conditions leading to it, I would expect many
systems to be affected. Yet searching the forums and the web I have hardly
found any references to it - which amazes me to no end.

Best Regards,
Stefan


On 12/30/2012 07:02 PM, Stefan Andreatta wrote:
> On 12/29/2012 10:57 PM, Peter Geoghegan wrote:
>> On 29 December 2012 20:57, Stefan Andreatta <s.andreatta@synedra.com> wrote:
>>> Now, the 2005 discussion goes into great detail on the advantages and
>>> disadvantages of this algorithm, particularly when using small sample
>>> sizes, and several alternatives are discussed. I do not know whether
>>> anything has been changed after that, but I know that the very distinct
>>> problem, which I will focus on here, still persists.
>>
>> It's a really hard problem to solve satisfactorily. It's a problem
>> that has been studied in much detail. Yes, the algorithm used is still
>> the same. See the comments within src/backend/commands/analyze.c (IBM
>> Research Report RJ 10025 is referenced there).
>
> Thanks a lot for this information! I looked through the code a bit. The
> Haas & Stokes formula is fine. The problem really lies with the two-stage
> random selection procedure.
>
> Starting from line 1039, there is a comment:
>  * As of May 2004 we use a new two-stage method: Stage one selects up
>  * to targrows random blocks (or all blocks, if there aren't so many).
>  * Stage two scans these blocks and uses the Vitter algorithm to create
>  * a random sample of targrows rows (or less, if there are less in the
>  * sample of blocks). The two stages are executed simultaneously: each
>  * block is processed as soon as stage one returns its number and while
>  * the rows are read stage two controls which ones are to be inserted
>  * into the sample.
>  *
>  * Although every row has an equal chance of ending up in the final
>  * sample, this sampling method is not perfect: not every possible
>  * sample has an equal chance of being selected. For large relations
>  * the number of different blocks represented by the sample tends to be
>  * too small. We can live with that for now. Improvements are welcome.
>
> Now the problem with clustered data is that the probability of sampling a
> value twice is much higher when the same page is repeatedly sampled. As
> stage one takes a random sample of pages, and stage two samples rows from
> these pages, the probability of visiting the same page twice (or more
> often) is much higher than if random rows were selected from the whole
> table. Hence we get a lot more repeated values for clustered data, and we
> end up with the severe under-estimation we can see in those cases.
>
> Probabilities do my brain in, as usual, but I tested the procedure for my
> test data with a simple python script. There is absolutely nothing wrong
> with the implementation. It seems to be a purely statistical problem.
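For anyone who wants to reproduce the effect without a database, here is a
rough, self-contained sketch along the lines of that test script (not the
actual script - the table geometry, cluster size and sample size below are
made up, and the block-based branch only approximates the analyze.c
procedure, skipping the Vitter details). Both samples are fed through the
Haas & Stokes estimator that analyze.c applies, n*d / (n - f1 + f1*n/N):

```python
import random
from collections import Counter

# Made-up geometry for illustration (not data from the thread): 10 million
# rows, 100 rows per page, each value stored on 20 consecutive rows, so all
# duplicates of a value sit on the same page.
ROWS_PER_PAGE  = 100
N_PAGES        = 100_000
DUPS_PER_VALUE = 20
TARGET         = 3_000            # rows to sample

TOTAL_ROWS     = ROWS_PER_PAGE * N_PAGES
TRUE_NDISTINCT = TOTAL_ROWS // DUPS_PER_VALUE


def value_at(row):
    """Clustered column: rows 0..19 hold value 0, rows 20..39 value 1, ..."""
    return row // DUPS_PER_VALUE


def haas_stokes(sample, total_rows):
    """Duj1 estimator from Haas & Stokes: n*d / (n - f1 + f1*n/N),
    where d = distinct values in the sample and f1 = values seen once."""
    n = len(sample)
    counts = Counter(sample)
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    if f1 == n:                   # no duplicates at all -> looks unique
        return total_rows
    return n * d / (n - f1 + f1 * n / total_rows)


# (a) Ideal: TARGET rows drawn uniformly from the whole table.
uniform_sample = [value_at(random.randrange(TOTAL_ROWS)) for _ in range(TARGET)]

# (b) Two-stage, roughly like analyze.c: pick TARGET distinct pages first,
#     then TARGET rows uniformly from just those pages.
pages = random.sample(range(N_PAGES), TARGET)
candidate_rows = [p * ROWS_PER_PAGE + off
                  for p in pages for off in range(ROWS_PER_PAGE)]
two_stage_sample = [value_at(r) for r in random.sample(candidate_rows, TARGET)]

print("true n_distinct    :", TRUE_NDISTINCT)
print("estimate, uniform  :", round(haas_stokes(uniform_sample, TOTAL_ROWS)))
print("estimate, two-stage:", round(haas_stokes(two_stage_sample, TOTAL_ROWS)))
```

With these made-up numbers the uniform row sample usually lands in the right
order of magnitude, while the block-based sample comes out far too low - the
same qualitative behaviour as the under-estimation discussed in this thread.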
>
> Not everything may be hopeless though ;-) The problem could theoretically
> be avoided if random rows were selected from the whole table. Again, that
> may not be feasible - the two-stage approach was probably not implemented
> for nothing.
>
> Another possible solution would be to avoid much of the resampling (not
> all) in stage two. For that - in theory - every page visited would have to
> get a lower weight, so that revisiting it is no more likely than if rows
> were selected at random from the whole table. That does not sound easy or
> elegant to implement. But perhaps there is some clever algorithm -
> unfortunately I do not know of one.
>
>> The general advice here is:
>>
>> 1) Increase default_statistics_target for the column.
>>
>> 2) If that doesn't help, consider using the following DDL:
>>
>> alter table foo alter column bar set ( n_distinct = 5.0);
>
> Yes, I will probably have to live with that for now - I will come back to
> these workarounds with one or two questions.
>
> Thanks again & Regards,
> Stefan
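As a footnote on workaround 2): below is a minimal sketch of how one might
apply and verify such an override, assuming psycopg2, a reachable database,
and the placeholder names foo/bar from the quoted DDL. It stores the
override as a negative fraction of the row count (PostgreSQL reads
n_distinct > 0 as an absolute count and -1 <= n_distinct < 0 as a fraction
of the rows), and it re-runs ANALYZE because the setting only takes effect
at the next ANALYZE:

```python
import psycopg2

# Placeholder names taken from the quoted DDL; adjust for the real schema.
# Identifiers are interpolated directly for brevity - fine for a one-off
# maintenance script, not for untrusted input.
TABLE, COLUMN = "foo", "bar"

conn = psycopg2.connect("dbname=mydb")    # assumed connection string
cur = conn.cursor()

# Count the real number of distinct values once (expensive but exact).
cur.execute("SELECT count(DISTINCT %s), count(*) FROM %s" % (COLUMN, TABLE))
n_distinct, n_rows = cur.fetchone()

# Store the override as a negative fraction of the row count, so it keeps
# scaling if the table grows; a positive value would pin an absolute count.
cur.execute("ALTER TABLE %s ALTER COLUMN %s SET (n_distinct = %s)"
            % (TABLE, COLUMN, -float(n_distinct) / n_rows))

# The override only takes effect at the next ANALYZE of the table.
cur.execute("ANALYZE %s" % TABLE)
conn.commit()

# Check what the planner will now see.
cur.execute("SELECT n_distinct FROM pg_stats"
            " WHERE tablename = %s AND attname = %s", (TABLE, COLUMN))
print("pg_stats.n_distinct:", cur.fetchone()[0])
conn.close()
```

Whether an absolute count or a fraction is the better choice depends on how
the column grows with the table; both forms are accepted.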