Re: Odd statistics behaviour in 7.2 - Mailing list pgsql-hackers
| From | Tom Lane |
|---|---|
| Subject | Re: Odd statistics behaviour in 7.2 |
| Date | |
| Msg-id | 16868.1013818182@sss.pgh.pa.us |
| In response to | Re: Odd statistics behaviour in 7.2 ("Gordon A. Runkle" <gar@integrated-dynamics.com>) |
| Responses | Re: Odd statistics behaviour in 7.2, Re: Odd statistics behaviour in 7.2 |
| List | pgsql-hackers |
"Gordon A. Runkle" <gar@integrated-dynamics.com> writes:
> [ tale of poor statistical results in a near-unique column ]
After some further digging in the literature I have come across a
number-of-distinct-values estimator that I like better than Chaudhuri's.
It runs like this:

    n*d / (n - f1 + f1*n/N)

where f1 is the number of distinct values that occurred exactly once in our sample of n rows (from a total of N), and d is the total number of distinct values in the sample.
The nice properties this has include:
1. At f1=0 (all values in sample occurred more than once), the estimate
reduces to d, which I had already determined to be a sensible result.
2. At f1=n (all values were distinct, hence also d=n), the estimate
reduces to N, which again is the most reasonable estimate.
3. The equation is numerically stable even when n is much smaller than
N, because the only cancellation is in the term (n - f1) which we can
compute exactly. A lot of the other equations I looked at depend on
series like (1 - n/N)**i which are going to be really nasty when n/N
is tiny.
In particular, point 2 means that this equation should perform
reasonably well for nearly-unique columns (f1 approaching n),
which was the case you were seeing bad results for.
Attached is a patch that implements this revised equation. Would
appreciate any feedback...
regards, tom lane
```diff
*** src/backend/commands/analyze.c.orig	Sat Jan  5 19:37:44 2002
--- src/backend/commands/analyze.c	Fri Feb 15 18:46:45 2002
***************
*** 1009,1018 ****
  	{
  		/*----------
  		 * Estimate the number of distinct values using the estimator
! 		 * proposed by Chaudhuri et al (see citation above).  This is
! 		 *		sqrt(n/r) * max(f1,1) + f2 + f3 + ...
! 		 * where fk is the number of distinct values that occurred
! 		 * exactly k times in our sample of r rows (from a total of n).
  		 * We assume (not very reliably!) that all the multiply-occurring
  		 * values are reflected in the final track[] list, and the other
  		 * nonnull values all appeared but once.  (XXX this usually
--- 1009,1023 ----
  	{
  		/*----------
  		 * Estimate the number of distinct values using the estimator
! 		 * proposed by Haas and Stokes in IBM Research Report RJ 10025:
! 		 *		n*d / (n - f1 + f1*n/N)
! 		 * where f1 is the number of distinct values that occurred
! 		 * exactly once in our sample of n rows (from a total of N),
! 		 * and d is the total number of distinct values in the sample.
! 		 * This is their Duj1 estimator; the other estimators they
! 		 * recommend are considerably more complex, and are numerically
! 		 * very unstable when n is much smaller than N.
! 		 *
  		 * We assume (not very reliably!) that all the multiply-occurring
  		 * values are reflected in the final track[] list, and the other
  		 * nonnull values all appeared but once.  (XXX this usually
***************
*** 1021,1032 ****
  		 *----------
  		 */
  		int			f1 = nonnull_cnt - summultiple;
! 		double		term1;

! 		if (f1 < 1)
! 			f1 = 1;
! 		term1 = sqrt(totalrows / (double) numrows) * f1;
! 		stats->stadistinct = floor(term1 + nmultiple + 0.5);
  	}

  	/*
--- 1026,1044 ----
  		 *----------
  		 */
  		int			f1 = nonnull_cnt - summultiple;
! 		int			d = f1 + nmultiple;
! 		double		numer, denom, stadistinct;

! 		numer = (double) numrows * (double) d;
! 		denom = (double) (numrows - f1) +
! 			(double) f1 * (double) numrows / totalrows;
! 		stadistinct = numer / denom;
! 		/* Clamp to sane range in case of roundoff error */
! 		if (stadistinct < (double) d)
! 			stadistinct = (double) d;
! 		if (stadistinct > totalrows)
! 			stadistinct = totalrows;
! 		stats->stadistinct = floor(stadistinct + 0.5);
  	}

  	/*
***************
*** 1313,1332 ****
  	{
  		/*----------
  		 * Estimate the number of distinct values using the estimator
! 		 * proposed by Chaudhuri et al (see citation above).  This is
! 		 *		sqrt(n/r) * max(f1,1) + f2 + f3 + ...
! 		 * where fk is the number of distinct values that occurred
! 		 * exactly k times in our sample of r rows (from a total of n).
  		 * Overwidth values are assumed to have been distinct.
  		 *----------
  		 */
  		int			f1 = ndistinct - nmultiple + toowide_cnt;
! 		double		term1;

! 		if (f1 < 1)
! 			f1 = 1;
! 		term1 = sqrt(totalrows / (double) numrows) * f1;
! 		stats->stadistinct = floor(term1 + nmultiple + 0.5);
  	}

  	/*
--- 1325,1356 ----
  	{
  		/*----------
  		 * Estimate the number of distinct values using the estimator
! 		 * proposed by Haas and Stokes in IBM Research Report RJ 10025:
! 		 *		n*d / (n - f1 + f1*n/N)
! 		 * where f1 is the number of distinct values that occurred
! 		 * exactly once in our sample of n rows (from a total of N),
! 		 * and d is the total number of distinct values in the sample.
! 		 * This is their Duj1 estimator; the other estimators they
! 		 * recommend are considerably more complex, and are numerically
! 		 * very unstable when n is much smaller than N.
! 		 *
  		 * Overwidth values are assumed to have been distinct.
  		 *----------
  		 */
  		int			f1 = ndistinct - nmultiple + toowide_cnt;
! 		int			d = f1 + nmultiple;
! 		double		numer, denom, stadistinct;

! 		numer = (double) numrows * (double) d;
! 		denom = (double) (numrows - f1) +
! 			(double) f1 * (double) numrows / totalrows;
! 		stadistinct = numer / denom;
! 		/* Clamp to sane range in case of roundoff error */
! 		if (stadistinct < (double) d)
! 			stadistinct = (double) d;
! 		if (stadistinct > totalrows)
! 			stadistinct = totalrows;
! 		stats->stadistinct = floor(stadistinct + 0.5);
  	}

  	/*
```