Re: estimating # of distinct values - Mailing list pgsql-hackers
From: tv@fuzzy.cz
Subject: Re: estimating # of distinct values
Date:
Msg-id: f739df2d3193e7b66e5c302c23439817.squirrel@sq.gransy.com
In response to: Re: estimating # of distinct values (Josh Berkus <josh@agliodbs.com>)
Responses: Re: estimating # of distinct values
List: pgsql-hackers
> >> The simple truth is
> >>
> >> 1) sampling-based estimators are a dead-end
>
> The Charikar and Chaudhuri paper does not, in fact, say that it is
> impossible to improve sampling-based estimators as you claim it does. In
> fact, the authors offer several ways to improve sampling-based
> estimators. Further, 2000 was hardly the end of sampling-estimation
> paper publication; there are later papers with newer ideas.

Well, the paper states that there is a lower bound on the possible error of a sampling-based estimator, depending on the sample size. The actual inequality is

    error(d) >= sqrt( (n-r)/(2r) * 1/q )

where error(d) is the "ratio error" (d = estimated number of distinct values, D = actual number of distinct values):

    error(d) = max{ D/d, d/D }

And all this holds with probability q >= e^{-1000} (you can choose this).

Say you have a table with 1,000,000 rows and you use a sample of 1,000 rows to do the estimate. In that case you get

    error(d) >= 99   with q = 0.05
    error(d) >= 70   with q = 0.1
    error(d) >= 31   with q = 0.5

If you can sample 10% of the table, you get

    error(d) >= 9.5  with q = 0.05
    error(d) >= 6.7  with q = 0.1
    error(d) >= 3    with q = 0.5

So even with 10% of the table, there's a 10% probability of getting an estimate that is 7x over- or underestimated, and with lower probability the interval is much wider. (These figures can be reproduced with the first sketch below.)

> For example, I still think we could tremendously improve our current
> sampling-based estimator without increasing I/O by moving to block-based
> estimation*. The accuracy statistics for block-based samples of 5% of
> the table look quite good.

Well, that's certainly possible. But at best you can achieve the error(d) lower bound consistently (for all datasets); you can't do better. And the paper already presents an estimator that does exactly this (called AE - Adaptive Estimator).

> I would agree that it's impossible to get a decent estimate of
> n-distinct from a 1% sample. But there's a huge difference between 5%
> or 10% and "a majority of the table".

Sure, we can do better. But there's a limit we can't cross, no matter which estimator we choose and how large the sample is.

I've seen several post-2000 papers on sample-based estimators, but those are mostly hybrid estimators, i.e. estimators composed of several simple estimators. So in the end it's pretty complicated, you need to gather a lot of different stats, and you still can't get a better error than the lower bound :-(

So yes, we can improve the current estimator (by making it more complex), but that does not actually solve the problem.

> Again, don't let this discourage you from attempting to write a
> stream-based estimator. But do realize that you'll need to *prove* its
> superiority, head-to-head, against sampling-based estimators.
>
> [* http://www.jstor.org/pss/1391058 (unfortunately, no longer
> public-access)]

It's available here:

    http://www.stat.washington.edu/research/reports/1990s/

But it does not present an estimator contradicting the Charikar and Chaudhuri paper - it's still "just" a sample-based estimator, although they draw the sample at the block level.

But yes, that's a good idea, and I think I've already mentioned it in the cross-column stats thread. The question is whether a sample obtained this way will be as good as the current samples. You could get quite separate "clusters" of values, one cluster for each block, even though the values are actually uniformly distributed, and I guess the resulting histogram would be crippled by this too. (The second sketch below illustrates one way this can happen.)
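A minimal sketch of the arithmetic above, assuming nothing more than the inequality exactly as written here (sqrt((n-r)/(2r) * 1/q), which is what the quoted figures follow); see the paper itself for the precise statement of the theorem. It reproduces the numbers above up to rounding:

    # Evaluate the lower bound on the ratio error for the sample sizes and
    # probabilities used in the examples above.
    from math import sqrt

    def error_lower_bound(n, r, q):
        """Lower bound on error(d) = max{D/d, d/D} for a sample of r rows
        out of n, holding with probability q, per the inequality above."""
        return sqrt((n - r) / (2.0 * r) * (1.0 / q))

    n = 1000000
    for r in (1000, 100000):            # 0.1% sample vs. 10% sample
        for q in (0.05, 0.1, 0.5):
            bound = error_lower_bound(n, r, q)
            print("r = %6d, q = %.2f: error(d) >= %.1f" % (r, q, bound))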
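The "clusters" concern is easiest to see with a toy example. The following sketch uses made-up sizes (10,000 distinct values, each stored in 100 consecutive rows, 100-row blocks, a 1% sample) and is not meant to model PostgreSQL's actual sampling; it only shows how a block-level sample can see far fewer distinct values than a row-level sample of the same size once equal values sit next to each other on disk:

    # Toy comparison of a row-level vs a block-level sample when equal values
    # are stored contiguously (e.g. the table was loaded in sorted order).
    import random

    N_DISTINCT = 10000
    DUPS       = 100        # copies of each value, stored next to each other
    BLOCK_SZ   = 100        # rows per block
    SAMPLE     = 10000      # sample size, 1% of the table

    # 1,000,000 rows; the value of row i is i // DUPS
    table = [v for v in range(N_DISTINCT) for _ in range(DUPS)]

    # Row-level sample: rows picked uniformly at random from the whole table.
    row_sample = random.sample(table, SAMPLE)

    # Block-level sample: whole blocks picked at random, same total row count.
    n_blocks = len(table) // BLOCK_SZ
    blocks = random.sample(range(n_blocks), SAMPLE // BLOCK_SZ)
    block_sample = [v for b in blocks
                    for v in table[b * BLOCK_SZ:(b + 1) * BLOCK_SZ]]

    # The row-level sample sees ~6,300 distinct values, the block-level
    # sample only ~100, although both read the same number of rows.
    print("distinct in row-level sample:  ", len(set(row_sample)))
    print("distinct in block-level sample:", len(set(block_sample)))

With values spread randomly across blocks the two samples behave almost the same, so how much this matters depends on how strongly the column correlates with the physical row order.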
But if you know about interesting papers on sample-based estimators (especially post-2000), let me know. I've searched for them, but those I found were not very interesting IMHO.

Tomas