Re: Odd statistics behaviour in 7.2 - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: Odd statistics behaviour in 7.2
Msg-id: 16868.1013818182@sss.pgh.pa.us
In response to: Re: Odd statistics behaviour in 7.2 ("Gordon A. Runkle" <gar@integrated-dynamics.com>)
Responses: Re: Odd statistics behaviour in 7.2; Re: Odd statistics behaviour in 7.2
List: pgsql-hackers
"Gordon A. Runkle" <gar@integrated-dynamics.com> writes:
> [ tale of poor statistical results in a near-unique column ]

After some further digging in the literature I have come across a
number-of-distinct-values estimator that I like better than Chaudhuri's.
It runs like this:

	n*d / (n - f1 + f1*n/N)

where f1 is the number of distinct values that occurred exactly once in
our sample of n rows (from a total of N), and d is the total number of
distinct values in the sample.

The nice properties this has include:

1. At f1=0 (all values in sample occurred more than once), the estimate
reduces to d, which I had already determined to be a sensible result.

2. At f1=n (all values were distinct, hence also d=n), the estimate
reduces to N, which again is the most reasonable estimate.

3. The equation is numerically stable even when n is much smaller than N,
because the only cancellation is in the term (n - f1), which we can
compute exactly.  A lot of the other equations I looked at depend on
series like (1 - n/N)**i, which are going to be really nasty when n/N
is tiny.

In particular, point 2 means that this equation should perform reasonably
well for nearly-unique columns (f1 approaching n), which was the case you
were seeing bad results for.

Attached is a patch that implements this revised equation.  Would
appreciate any feedback...

			regards, tom lane


*** src/backend/commands/analyze.c.orig	Sat Jan  5 19:37:44 2002
--- src/backend/commands/analyze.c	Fri Feb 15 18:46:45 2002
***************
*** 1009,1018 ****
  	{
  		/*----------
  		 * Estimate the number of distinct values using the estimator
! 		 * proposed by Chaudhuri et al (see citation above).  This is
! 		 *		sqrt(n/r) * max(f1,1) + f2 + f3 + ...
! 		 * where fk is the number of distinct values that occurred
! 		 * exactly k times in our sample of r rows (from a total of n).
  		 * We assume (not very reliably!) that all the multiply-occurring
  		 * values are reflected in the final track[] list, and the other
  		 * nonnull values all appeared but once.  (XXX this usually
--- 1009,1023 ----
  	{
  		/*----------
  		 * Estimate the number of distinct values using the estimator
! 		 * proposed by Haas and Stokes in IBM Research Report RJ 10025:
! 		 *		n*d / (n - f1 + f1*n/N)
! 		 * where f1 is the number of distinct values that occurred
! 		 * exactly once in our sample of n rows (from a total of N),
! 		 * and d is the total number of distinct values in the sample.
! 		 * This is their Duj1 estimator; the other estimators they
! 		 * recommend are considerably more complex, and are numerically
! 		 * very unstable when n is much smaller than N.
! 		 *
  		 * We assume (not very reliably!) that all the multiply-occurring
  		 * values are reflected in the final track[] list, and the other
  		 * nonnull values all appeared but once.  (XXX this usually
***************
*** 1021,1032 ****
  		 *----------
  		 */
  		int			f1 = nonnull_cnt - summultiple;
! 		double		term1;
! 
! 		if (f1 < 1)
! 			f1 = 1;
! 		term1 = sqrt(totalrows / (double) numrows) * f1;
! 		stats->stadistinct = floor(term1 + nmultiple + 0.5);
  	}
  
  	/*
--- 1026,1044 ----
  		 *----------
  		 */
  		int			f1 = nonnull_cnt - summultiple;
! 		int			d = f1 + nmultiple;
! 		double		numer, denom, stadistinct;
! 
! 		numer = (double) numrows * (double) d;
! 		denom = (double) (numrows - f1) +
! 			(double) f1 * (double) numrows / totalrows;
! 		stadistinct = numer / denom;
! 		/* Clamp to sane range in case of roundoff error */
! 		if (stadistinct < (double) d)
! 			stadistinct = (double) d;
! 		if (stadistinct > totalrows)
! 			stadistinct = totalrows;
! 		stats->stadistinct = floor(stadistinct + 0.5);
  	}
  
  	/*
***************
*** 1313,1332 ****
  	{
  		/*----------
  		 * Estimate the number of distinct values using the estimator
! 		 * proposed by Chaudhuri et al (see citation above).  This is
! 		 *		sqrt(n/r) * max(f1,1) + f2 + f3 + ...
! 		 * where fk is the number of distinct values that occurred
! 		 * exactly k times in our sample of r rows (from a total of n).
  		 * Overwidth values are assumed to have been distinct.
  		 *----------
  		 */
  		int			f1 = ndistinct - nmultiple + toowide_cnt;
! 		double		term1;
! 
! 		if (f1 < 1)
! 			f1 = 1;
! 		term1 = sqrt(totalrows / (double) numrows) * f1;
! 		stats->stadistinct = floor(term1 + nmultiple + 0.5);
  	}
  
  	/*
--- 1325,1356 ----
  	{
  		/*----------
  		 * Estimate the number of distinct values using the estimator
! 		 * proposed by Haas and Stokes in IBM Research Report RJ 10025:
! 		 *		n*d / (n - f1 + f1*n/N)
! 		 * where f1 is the number of distinct values that occurred
! 		 * exactly once in our sample of n rows (from a total of N),
! 		 * and d is the total number of distinct values in the sample.
! 		 * This is their Duj1 estimator; the other estimators they
! 		 * recommend are considerably more complex, and are numerically
! 		 * very unstable when n is much smaller than N.
! 		 *
  		 * Overwidth values are assumed to have been distinct.
  		 *----------
  		 */
  		int			f1 = ndistinct - nmultiple + toowide_cnt;
! 		int			d = f1 + nmultiple;
! 		double		numer, denom, stadistinct;
! 
! 		numer = (double) numrows * (double) d;
! 		denom = (double) (numrows - f1) +
! 			(double) f1 * (double) numrows / totalrows;
! 		stadistinct = numer / denom;
! 		/* Clamp to sane range in case of roundoff error */
! 		if (stadistinct < (double) d)
! 			stadistinct = (double) d;
! 		if (stadistinct > totalrows)
! 			stadistinct = totalrows;
! 		stats->stadistinct = floor(stadistinct + 0.5);
  	}
  
  	/*