Re: serious under-estimation of n_distinct for clustered distributions - Mailing list pgsql-performance
| From | Stefan Andreatta |
|---|---|
| Subject | Re: serious under-estimation of n_distinct for clustered distributions |
| Date | |
| Msg-id | 50F3B543.8020205@synedra.com |
| In response to | Re: serious under-estimation of n_distinct for clustered distributions (Stefan Andreatta <s.andreatta@synedra.com>) |
| Responses | Re: serious under-estimation of n_distinct for clustered distributions |
| List | pgsql-performance |
A status update on this problem:

1.) Workarounds (setting n_distinct manually) are tested and - as far as
workarounds go - OK.

2.) Source of the problem and possible solution: the source of these
troubles is the sampling method employed in src/backend/commands/analyze.c.
Judging from Tom Lane's comment on the original implementation in 2004, it
was never thought to be perfect.

Does anybody see a chance to improve that part? Should this discussion be
taken elsewhere? Is there any input from my side that could help?

btw: I find this problem to be very frequent in our databases, and
considering the commonplace conditions leading to it, I would expect many
systems to be affected. Yet searching the forums and the web I have hardly
found any references to it - which amazes me to no end.

Best Regards,
Stefan


On 12/30/2012 07:02 PM, Stefan Andreatta wrote:
> On 12/29/2012 10:57 PM, Peter Geoghegan wrote:
>> On 29 December 2012 20:57, Stefan Andreatta <s.andreatta@synedra.com> wrote:
>>> Now, the 2005 discussion goes into great detail on the advantages and
>>> disadvantages of this algorithm, particularly when using small sample
>>> sizes, and several alternatives are discussed. I do not know whether
>>> anything has been changed after that, but I know that the very distinct
>>> problem, which I will focus on here, still persists.
>>
>> It's a really hard problem to solve satisfactorily. It's a problem
>> that has been studied in much detail. Yes, the algorithm used is still
>> the same. See the comments within src/backend/commands/analyze.c (IBM
>> Research Report RJ 10025 is referenced there).
>
> Thanks a lot for this information! I looked through the code a bit. The
> Haas & Stokes formula is fine. The problem really lies with the two-stage
> random selection procedure.
>
> Starting from line 1039, there is a comment:
>  * As of May 2004 we use a new two-stage method: Stage one selects up
>  * to targrows random blocks (or all blocks, if there aren't so many).
>  * Stage two scans these blocks and uses the Vitter algorithm to create
>  * a random sample of targrows rows (or less, if there are less in the
>  * sample of blocks). The two stages are executed simultaneously: each
>  * block is processed as soon as stage one returns its number and while
>  * the rows are read stage two controls which ones are to be inserted
>  * into the sample.
>  *
>  * Although every row has an equal chance of ending up in the final
>  * sample, this sampling method is not perfect: not every possible
>  * sample has an equal chance of being selected. For large relations
>  * the number of different blocks represented by the sample tends to be
>  * too small. We can live with that for now. Improvements are welcome.
>
> Now the problem with clustered data is that the probability of sampling a
> value twice is much higher when the same page is repeatedly sampled. As
> stage one takes a random sample of pages, and stage two samples rows from
> these pages, the probability of visiting the same page twice (or more
> often) is much higher than if random rows were selected from the whole
> table. Hence we get a lot more repeated values for clustered data, and we
> end up with the severe under-estimation we can see in those cases.
>
> Probabilities do my brain in, as usual, but I tested the procedure for my
> test data with a simple python script. There is absolutely nothing wrong
> with the implementation. It seems to be a purely statistical problem.
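For anyone who wants to reproduce the effect without a database, here is a
rough, self-contained sketch along the lines of that test script (not the
actual script - the table geometry, cluster size and sample size below are
made up, and the block-based branch only approximates the analyze.c
procedure, skipping the Vitter details). Both samples are fed through the
Haas & Stokes estimator that analyze.c applies, n*d / (n - f1 + f1*n/N):

```python
import random
from collections import Counter

# Made-up geometry for illustration (not data from the thread): 10 million
# rows, 100 rows per page, each value stored on 20 consecutive rows, so all
# duplicates of a value sit on the same page.
ROWS_PER_PAGE  = 100
N_PAGES        = 100_000
DUPS_PER_VALUE = 20
TARGET         = 3_000            # rows to sample

TOTAL_ROWS     = ROWS_PER_PAGE * N_PAGES
TRUE_NDISTINCT = TOTAL_ROWS // DUPS_PER_VALUE


def value_at(row):
    """Clustered column: rows 0..19 hold value 0, rows 20..39 value 1, ..."""
    return row // DUPS_PER_VALUE


def haas_stokes(sample, total_rows):
    """Duj1 estimator from Haas & Stokes: n*d / (n - f1 + f1*n/N),
    where d = distinct values in the sample and f1 = values seen once."""
    n = len(sample)
    counts = Counter(sample)
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    if f1 == n:                   # no duplicates at all -> looks unique
        return total_rows
    return n * d / (n - f1 + f1 * n / total_rows)


# (a) Ideal: TARGET rows drawn uniformly from the whole table.
uniform_sample = [value_at(random.randrange(TOTAL_ROWS)) for _ in range(TARGET)]

# (b) Two-stage, roughly like analyze.c: pick TARGET distinct pages first,
#     then TARGET rows uniformly from just those pages.
pages = random.sample(range(N_PAGES), TARGET)
candidate_rows = [p * ROWS_PER_PAGE + off
                  for p in pages for off in range(ROWS_PER_PAGE)]
two_stage_sample = [value_at(r) for r in random.sample(candidate_rows, TARGET)]

print("true n_distinct    :", TRUE_NDISTINCT)
print("estimate, uniform  :", round(haas_stokes(uniform_sample, TOTAL_ROWS)))
print("estimate, two-stage:", round(haas_stokes(two_stage_sample, TOTAL_ROWS)))
```

With these made-up numbers the uniform row sample usually lands in the right
order of magnitude, while the block-based sample comes out far too low - the
same qualitative behaviour as the under-estimation discussed in this thread.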
>
> Not everything may be hopeless though ;-) The problem could theoretically
> be avoided if random rows were selected from the whole table. Again, that
> may not be feasible - the two-stage approach was probably not implemented
> for nothing.
>
> Another possible solution would be to avoid much of the resampling (not
> all) in stage two. For that - in theory - every page visited would have to
> get a lower weight, so that revisiting it is no more likely than if rows
> were selected at random from the whole table. That does not sound easy or
> elegant to implement. But perhaps there is some clever algorithm -
> unfortunately I do not know of one.
>
>> The general advice here is:
>>
>> 1) Increase default_statistics_target for the column.
>>
>> 2) If that doesn't help, consider using the following DDL:
>>
>> alter table foo alter column bar set ( n_distinct = 5.0);
>
> Yes, I will probably have to live with that for now - I will come back to
> these workarounds with one or two questions.
>
> Thanks again & Regards,
> Stefan
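As a footnote on workaround 2): below is a minimal sketch of how one might
apply and verify such an override, assuming psycopg2, a reachable database,
and the placeholder names foo/bar from the quoted DDL. It stores the
override as a negative fraction of the row count (PostgreSQL reads
n_distinct > 0 as an absolute count and -1 <= n_distinct < 0 as a fraction
of the rows), and it re-runs ANALYZE because the setting only takes effect
at the next ANALYZE:

```python
import psycopg2

# Placeholder names taken from the quoted DDL; adjust for the real schema.
# Identifiers are interpolated directly for brevity - fine for a one-off
# maintenance script, not for untrusted input.
TABLE, COLUMN = "foo", "bar"

conn = psycopg2.connect("dbname=mydb")    # assumed connection string
cur = conn.cursor()

# Count the real number of distinct values once (expensive but exact).
cur.execute("SELECT count(DISTINCT %s), count(*) FROM %s" % (COLUMN, TABLE))
n_distinct, n_rows = cur.fetchone()

# Store the override as a negative fraction of the row count, so it keeps
# scaling if the table grows; a positive value would pin an absolute count.
cur.execute("ALTER TABLE %s ALTER COLUMN %s SET (n_distinct = %s)"
            % (TABLE, COLUMN, -float(n_distinct) / n_rows))

# The override only takes effect at the next ANALYZE of the table.
cur.execute("ANALYZE %s" % TABLE)
conn.commit()

# Check what the planner will now see.
cur.execute("SELECT n_distinct FROM pg_stats"
            " WHERE tablename = %s AND attname = %s", (TABLE, COLUMN))
print("pg_stats.n_distinct:", cur.fetchone()[0])
conn.close()
```

Whether an absolute count or a fraction is the better choice depends on how
the column grows with the table; both forms are accepted.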