RFC: planner statistics in 7.2 - Mailing list pgsql-hackers
From: Tom Lane
Subject: RFC: planner statistics in 7.2
Date:
Msg-id: 23052.987719865@sss.pgh.pa.us
Responses: 4 x "Re: RFC: planner statistics in 7.2"
List: pgsql-hackers
Request for comments:

Overview
--------

The planner is currently badly handicapped by inadequate statistical information. The stats that VACUUM ANALYZE currently computes are:

per-table:
    number of disk pages
    number of tuples

per-column:
    dispersion
    minimum and maximum values (if datatype has a suitable '<' operator)
    most common value
    fraction of entries that are the most common value
    fraction of entries that are NULL

Not only is this an impoverished dataset, but the dispersion and most-common-value are calculated in a very unreliable way, and cannot be trusted. Even though the stats are meager and untrustworthy, they are expensive to get: the per-column data is updated only by VACUUM ANALYZE, which is quite unpleasant to do regularly on large tables.

It is possible to get better statistics with less work. Here is a proposal for revising the stats mechanisms for 7.2.

I believe that it's still a good idea for the stats to be updated by an explicit housekeeping command. Although we could think of updating them on-the-fly, that seems expensive and slow. Another objection is that planning results will become difficult to reproduce if the underlying statistics change constantly. But if we stick to the explicit-command approach then we need to make the command much faster than VACUUM ANALYZE. We can address that in two ways:

(1) The statistics-gathering process should be available as a standalone command, ANALYZE [ tablename ], not only as part of VACUUM. (This was already discussed and agreed to for 7.1, but it never got done.) Note that a pure ANALYZE command needs only a read lock on the target table, not an exclusive lock as VACUUM needs, so it's much more friendly to concurrent transactions.

(2) Statistics should be computed on the basis of a random sample of the target table, rather than a complete scan. According to the literature I've looked at, sampling a few thousand tuples is sufficient to give good statistics even for extremely large tables; so it should be possible to run ANALYZE in a short amount of time regardless of the table size.

If we do both of these then I think that ANALYZE will be painless enough to do frequently, so there's no need to try to fold stats-updating into regular operations.

More extensive statistics
-------------------------

The min, max, and most common value are not enough data, particularly not for tables with heavily skewed data distributions. Instead, pg_statistic should store three small arrays for each column:

1. The K most common values and their frequencies, for some (adjustable) parameter K. This will be omitted if the column appears unique (no duplicate values found).

2. The M boundary values of an equi-depth histogram, ie, the values that divide the data distribution into equal-population sets. For example, if M is 3 this would consist of the min, median, and max values.

K and M might both be about 10 for a typical column. In principle at least, these numbers could be user-settable parameters, to allow trading off estimation accuracy against amount of space used in pg_statistic.

A further refinement is that the histogram should describe the distribution of the data *after excluding the given most-common values* (that is, it's a "compressed histogram"). This allows more accurate representation of the data distribution when there are just a few very-common values. For a column with not many distinct values (consider a boolean column), the most-common-value list might describe the complete dataset, in which case the histogram collapses to nothing. (We'd still have it store the min and max, just so that it's not necessary to scan the most-common-value array to determine those values.)
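To make the intended use concrete, here is a rough sketch in C of how a selectivity estimate for a clause like "col <= const" could be derived from the most-common-value list plus the compressed histogram. This is only an illustration: the struct and function names are invented, and "double" stands in for a Datum compared with the type's '<' operator. The MCV entries are handled exactly, and the histogram is scaled to cover only the remaining, non-MCV fraction of the table:

/*
 * Sketch only (invented names, not backend code).  Estimate the fraction
 * of a table satisfying "col <= constval" from the proposed per-column
 * stats: the K most common values with their frequencies, plus an
 * M-entry equi-depth histogram over the non-MCV population.
 */
typedef struct ColumnStats
{
    int     nmcv;           /* number of most-common values stored (K) */
    double *mcv_vals;       /* the most common values themselves */
    double *mcv_freqs;      /* their frequencies, as fractions of the table */
    int     nhist;          /* number of histogram boundary values (M) */
    double *hist_bounds;    /* equi-depth histogram over non-MCV values */
} ColumnStats;

static double
le_selectivity(const ColumnStats *st, double constval)
{
    double  mcv_sel = 0.0;      /* fraction matched among the MCVs */
    double  mcv_total = 0.0;    /* total fraction of table that is MCVs */
    double  hist_sel = 0.0;     /* fraction matched within the remainder */
    int     i;

    /* The MCV list is exact: sum the frequencies of MCVs <= constval */
    for (i = 0; i < st->nmcv; i++)
    {
        mcv_total += st->mcv_freqs[i];
        if (st->mcv_vals[i] <= constval)
            mcv_sel += st->mcv_freqs[i];
    }

    /*
     * The histogram's nhist boundaries define nhist-1 bins of (nearly)
     * equal population.  Count the whole bins below constval and
     * interpolate linearly within the bin that contains it.
     */
    if (st->nhist >= 2)
    {
        int     nbins = st->nhist - 1;

        if (constval >= st->hist_bounds[st->nhist - 1])
            hist_sel = 1.0;
        else if (constval >= st->hist_bounds[0])
        {
            for (i = 0; i < nbins; i++)
            {
                if (constval < st->hist_bounds[i + 1])
                {
                    double  lo = st->hist_bounds[i];
                    double  hi = st->hist_bounds[i + 1];
                    double  frac = (hi > lo) ? (constval - lo) / (hi - lo) : 0.5;

                    hist_sel = (i + frac) / nbins;
                    break;
                }
            }
        }
    }

    /* Scale the histogram estimate by the fraction of rows it describes */
    return mcv_sel + hist_sel * (1.0 - mcv_total);
}

The gain over the current min/max/most-common-value scheme is that a handful of wildly popular values cannot distort the histogram bins, because they are accounted for separately and exactly.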
I am also inclined to remove the attdispersion parameter in favor of storing an estimate of the number of distinct values in the column. We can compute more accurate selectivities using the most-common-value frequencies and the distinct-values estimate than we could get from dispersion.

Another useful idea is to try to estimate the correlation between physical table order and logical order of any given column --- this would let us account for the effect of clustering when estimating indexscan costs. I don't yet know the appropriate formula to use, but this seems quite doable.

Finally, it would be a good idea to add an estimate of the average width of varlena fields to pg_statistic. This would allow the estimated tuple width computed by the planner to have something to do with reality, which in turn would improve the cost estimation for hash joins (which need to estimate the size of the in-memory hash table).

Computing the statistics
------------------------

The best way to obtain these stats seems to be:

1. Scan the target table and obtain a random sample of R tuples. R is chosen in advance based on target histogram size (M) --- in practice R will be about 3000 or so. If the table contains fewer than that many tuples then the "sample" will be the whole table. The sample can be obtained in one pass using Vitter's algorithm or similar (a sketch of the basic reservoir method follows below). Note that for expected values of R we shouldn't have any trouble storing all the sampled tuples in memory.

2. For each column in the table, extract the column value from each sampled tuple, and sort these values into order. A simple scan of the values then suffices to find the most common values --- we only need to count adjacent duplicates and remember the K highest counts. After we have those, simple arithmetic will let us find the positions that contain the histogram boundary elements. Again, all this can be done in-memory and so should be pretty fast (also sketched below).

The sort step will require detoasting any toasted values. To avoid risk of memory overflow, we may exclude extremely wide toasted values from the sort --- this shouldn't affect the stats much, since such values are unlikely to represent duplicates. Such an exclusion will also prevent wide values from taking up lots of space in pg_statistic, if they happened to be chosen as histogram entries.
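To flesh out step 1, the one-pass sampling could look roughly like the sketch below (invented names; "SampleTuple" stands in for a heap tuple). This is the simple reservoir method: the first R tuples fill the reservoir, and thereafter each tuple replaces a random slot with a steadily shrinking probability. Vitter's algorithm produces the same uniform sample but computes how far to skip ahead between replacements, so that most tuples never need a random number drawn at all:

#include <stdlib.h>

/*
 * Sketch only (invented names, not backend code).  add_sample_tuple() is
 * called once per tuple as the sequential scan produces them; when the
 * scan ends, "reservoir" holds a uniform random sample of min(R, seen)
 * tuples.
 */
typedef struct SampleTuple SampleTuple;     /* stand-in for a heap tuple */

typedef struct SampleState
{
    SampleTuple **reservoir;    /* array of R slots */
    int           R;            /* target sample size, e.g. ~3000 */
    long          seen;         /* number of tuples examined so far */
} SampleState;

static void
add_sample_tuple(SampleState *st, SampleTuple *tup)
{
    if (st->seen < st->R)
    {
        /* the first R tuples always enter the reservoir */
        st->reservoir[st->seen] = tup;
    }
    else
    {
        /*
         * Afterwards, the t'th tuple (counting from 1) replaces a
         * randomly chosen slot with probability R/t, which keeps every
         * tuple seen so far equally likely to be in the final sample.
         */
        long    k = random() % (st->seen + 1);

        if (k < st->R)
            st->reservoir[k] = tup;
    }
    st->seen++;
}

Since R is at most a few thousand, the reservoir (and the per-column value arrays extracted from it in step 2) fits comfortably in memory.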
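For step 2, once a column's sampled values have been sorted, finding the K most common values and the M histogram boundaries is one linear pass plus a little index arithmetic, roughly as below. Again this is just a sketch with invented names; "double" stands in for a Datum compared with the type's '=' and '<' operators, and for brevity the histogram here is taken over the whole sample rather than over the sample minus the MCVs:

/*
 * Sketch only (invented names, not backend code).  "values" is the
 * sorted array of sampled, non-null values for one column; assumes
 * nvalues > 0 and M >= 2.
 */
typedef struct TrackItem
{
    double  value;              /* a candidate most-common value */
    int     count;              /* how often it appeared in the sample */
} TrackItem;

static void
compute_column_stats(const double *values, int nvalues,
                     TrackItem *mcv, int K,         /* out: top-K values */
                     double *hist_bounds, int M)    /* out: M boundaries */
{
    int     ntracked = 0;       /* filled entries in mcv[], kept sorted */
    int     run_start = 0;      /* start of the current run of duplicates */
    int     i;

    /* One pass over the sorted values, counting runs of adjacent duplicates */
    for (i = 1; i <= nvalues; i++)
    {
        if (i == nvalues || values[i] != values[run_start])
        {
            int     count = i - run_start;
            int     j;

            /* insert this value into the top-K list if it qualifies */
            for (j = ntracked; j > 0 && mcv[j - 1].count < count; j--)
            {
                if (j < K)
                    mcv[j] = mcv[j - 1];
            }
            if (j < K)
            {
                mcv[j].value = values[run_start];
                mcv[j].count = count;
                if (ntracked < K)
                    ntracked++;
            }
            run_start = i;
        }
    }

    /*
     * Histogram boundaries: boundary b sits at the value at or below
     * which roughly b/(M-1) of the sample falls, so hist_bounds[0] and
     * hist_bounds[M-1] are the sample min and max.
     */
    for (i = 0; i < M; i++)
    {
        int     pos = (int) ((double) i * (nvalues - 1) / (M - 1));

        hist_bounds[i] = values[pos];
    }
}

Excluding extremely wide toasted values, as suggested above, just means they never enter the sorted array in the first place.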
Issue: for datatypes that have no '<' operator, we of course cannot compute a histogram --- but if there is an '=' then the most-common-value and number-of-distinct-values stats still make sense. Is there a way to derive these without O(R^2) brute-force comparisons? We could still do a scan of the R sample values using something like the existing comparison algorithm (but using S > K work slots); this would cost about S*R rather than R^2 comparisons.

A different approach that's been discussed on pghackers is to make use of btree indexes for columns that have such indexes: we could scan the indexes to visit all the column values in sorted order. I have rejected that approach because (a) it doesn't help for columns without a suitable index; (b) our indexes don't distinguish deleted and live tuples, which would skew the statistics --- in particular, we couldn't tell a frequently-updated single tuple from a commonly repeated value; (c) scanning multiple indexes would likely require more total I/O than just grabbing sample tuples from the main table --- especially if we have to do that anyway to handle columns without indexes.

User-visible changes
--------------------

Aside from implementing the stand-alone ANALYZE statement, it seems that it would be a good idea to allow user control of the target numbers of statistical entries (K and M above). A simple approach would be a SET variable or explicit parameter for ANALYZE. But I am inclined to think that it'd be better to create a persistent per-column state for this, set by, say,

    ALTER TABLE tab SET COLUMN col STATS COUNT n

(better names welcome). The target count for each column could be stored in pg_attribute. Doing it this way would let the dbadmin establish higher or lower targets for especially irregular or simple columns, and then forget about the numbers --- he wouldn't have to tweak his periodic cron script to apply the right parameters to the right tables/columns.

            regards, tom lane