Re: Odd statistics behaviour in 7.2 - Mailing list pgsql-hackers
From | Tom Lane
Subject | Re: Odd statistics behaviour in 7.2
Date |
Msg-id | 29600.1013623356@sss.pgh.pa.us
In response to | Odd statistics behaviour in 7.2 ("Gordon A. Runkle" <gar@integrated-dynamics.com>)
Responses | Re: Odd statistics behaviour in 7.2
List | pgsql-hackers
Gordon Runkle <gar@www.integrated-dynamics.com> writes:
> You can retrieve the dump of my data at: [snipped]

Thanks.  Indeed, the first time I did an ANALYZE I got:

 tablename         | tom_help
 attname           | tdnr
 null_frac         | 0
 avg_width         | 17
 n_distinct        | 56484
 most_common_vals  | {0557088000957,0557700369880}
 most_common_freqs | {0.000666667,0.000666667}
 histogram_bounds  | {0551000386411,0551504108858,0557011074656,0557050633939,0557111430036,0557151012769,0557179871119,0557698138188,0557750158740,0557783053444,0558980779763}
 correlation       | -0.199108

Now, I happen to know that the default size of ANALYZE's statistical sample is 3000 rows.  What evidently happened here is that the statistical sampling picked up two values appearing twice (namely, 0557088000957 and 0557700369880); the given frequencies for these two values establish that they appeared twice in the 3000-row sample.  Since no other values are mentioned in most_common_vals, the remaining 2996 samples must have been values that appeared only once.

Having computed these raw statistics, ANALYZE has to try to extrapolate the number of distinct values in the whole table.  What it's using for the purpose is an equation I found in

	"Random sampling for histogram construction: how much is enough?"
	by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
	Proceedings of ACM SIGMOD International Conference on Management
	of Data, 1998, Pages 436-447.

namely

	sqrt(n/r) * max(f1,1) + f2 + f3 + ...

where fk is the number of distinct values that occurred exactly k times in our sample of r rows (from a total of n).  And indeed you get 56484 when you plug in these numbers.

So the code is operating as designed, and we can't complain that the sample is an unreasonable sample given the true underlying distribution.  We have to blame the equation: evidently this estimation equation doesn't apply very well to nearly-unique columns.

I had already modified the Chaudhuri approach slightly: if the ANALYZE sample contains no duplicate values at all (ie, f1=r, f2=f3=f4=...=0) then their equation reduces to sqrt(n*r), but actually ANALYZE assumes the column is unique (ie, n distinct values, not sqrt(n*r)), which seems a much better assumption in practice.  The runs where you got n_distinct equal to -1 are presumably those where the ANALYZE sample chanced to contain no duplicates.

I am thinking that we need to find another estimator equation, or at least shade away from their equation when f1 is close to r.  Ideally the estimate for f1=r should be a logical extrapolation of the curve for f1 close to r, but right now it's quite discontinuous.

There was some previous discussion about this, cf. the thread at
http://archives.postgresql.org/pgsql-general/2001-10/msg01032.php
but nothing really emerged on how to do better.  The Chaudhuri paper points out that estimating the total number of distinct values from a sample is inherently a hard problem and subject to large estimation errors, so it may be that we can't do a lot better.  I think we should be wary of ad-hoc answers, anyhow.  Something with a little math behind it would make me feel more comfortable.

Anyone have any thoughts on how to do better?

			regards, tom lane
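[Editorial note: as an illustration of the discontinuity described above, here is a small standalone C sketch. It is not PostgreSQL source code, and the table size n = 1,000,000 rows is an assumed figure, since the message does not give the actual row count of tom_help; it merely plugs sample numbers into the Chaudhuri/Motwani/Narasayya equation and contrasts the result with the value ANALYZE substitutes when the sample has no duplicates at all.]

	/*
	 * Sketch of the distinct-value estimator discussed above.
	 *
	 * f[k] = number of distinct values seen exactly k times in the sample,
	 * r    = sample size, n = total rows in the table.
	 * Estimate = sqrt(n/r) * max(f1,1) + f2 + f3 + ...
	 */
	#include <math.h>
	#include <stdio.h>

	static double
	chaudhuri_estimate(double n, double r, const double *f, int maxk)
	{
		double	est = sqrt(n / r) * (f[1] > 1.0 ? f[1] : 1.0);
		int		k;

		for (k = 2; k <= maxk; k++)
			est += f[k];
		return est;
	}

	int
	main(void)
	{
		double	n = 1000000.0;	/* assumed table size, for illustration only */
		double	r = 3000.0;		/* default ANALYZE sample size */

		/* A sample like Gordon's: 2996 singletons, 2 values seen twice. */
		double	f_nearly_unique[3] = {0, 2996, 2};

		printf("f1=2996, f2=2: estimate = %.0f\n",
			   chaudhuri_estimate(n, r, f_nearly_unique, 2));

		/* A sample with no duplicates: the equation collapses to sqrt(n*r),
		 * whereas ANALYZE instead assumes the column is unique (n distinct). */
		double	f_all_unique[3] = {0, 3000, 0};

		printf("f1=3000, f2=0: equation gives %.0f, ANALYZE assumes %.0f\n",
			   chaudhuri_estimate(n, r, f_all_unique, 2), n);

		return 0;
	}

With the assumed n, losing the last two duplicated values moves the answer from roughly 55,000 to the full 1,000,000, which is the jump at f1=r that the message calls "quite discontinuous".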