Re: proposal : cross-column stats - Mailing list pgsql-hackers

From:           Tomas Vondra
Subject:        Re: proposal : cross-column stats
Msg-id:         4D0BE040.5070604@fuzzy.cz
In response to: Re: proposal : cross-column stats (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List:           pgsql-hackers
Dne 17.12.2010 22:41, Heikki Linnakangas napsal(a):
> On 17.12.2010 23:13, Tomas Vondra wrote:
>> Dne 17.12.2010 19:58, Robert Haas napsal(a):
>>> I haven't read the paper yet (sorry) but just off the top of my head,
>>> one possible problem here is that our n_distinct estimates aren't
>>> always very accurate, especially for large tables. As we've discussed
>>> before, making them accurate requires sampling a significant
>>> percentage of the table, whereas all of our other statistics can be
>>> computed reasonably accurately by sampling a fixed amount of an
>>> arbitrarily large table. So it's possible that relying more heavily
>>> on n_distinct could turn out worse overall even if the algorithm is
>>> better. Not sure if that's an issue here, just throwing it out
>>> there...
>>
>> Yes, you're right - the paper really is based on (estimates of) the
>> number of distinct values for each of the columns as well as for the
>> group of columns.
>>
>> AFAIK it will work with reasonably precise estimates, but the point is
>> you need an estimate of distinct values for the whole group of columns.
>> So when you want an estimate for queries on columns (a,b), you need
>> the number of distinct value combinations of these two columns.
>>
>> And I think we're not collecting this right now, so this solution
>> requires scanning the table (or some part of it).
>
> Any idea how sensitive it is to the accuracy of that estimate on
> distinct value combinations? If we get that off by a factor of ten or a
> hundred, what kind of an effect does it have on the final cost estimates?

Well, not really - I haven't done any experiments with it. For two
columns the selectivity equation is

   (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))

where A and B are columns, dist(X) is the number of distinct values in
column X and sel(X) is the selectivity of column X.

dependency on dist(A), dist(B) and dist(A,B)
--------------------------------------------
The estimate is proportional to dist(A) and dist(B), and inversely
proportional to dist(A,B). So by increasing dist(A) and dist(B) 10x,
you'll get 10x the original estimate; similarly, by increasing
dist(A,B) 10x, you'll get a 10x lower estimate.

upper bound
-----------
Unless dist(A) or dist(B) is greater than dist(A,B), the estimated
selectivity is bounded by

   (sel(A) + sel(B)) / 2

I guess we can treat the case (sel(A) > sel(A,B)) the same as
sel(A) = sel(A,B), i.e. we can use the heuristic I've already offered
in the PoC.

lower bound
-----------
Not sure what the lower bound is, but it might approach 0 if dist(A)
and dist(B) are very small and/or dist(A,B) is huge.

regards
Tomas
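To get a concrete feel for the scaling described above, here is a minimal
standalone sketch of that equation. The column statistics in it are made-up
numbers, chosen purely for illustration, not taken from any real table:

#include <stdio.h>

/*
 * Sketch of the cross-column selectivity estimate discussed above:
 *
 *   sel(A,B) = (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
 */
static double
cross_sel(double dist_a, double sel_a,
          double dist_b, double sel_b,
          double dist_ab)
{
    return (dist_a * sel_a + dist_b * sel_b) / (2.0 * dist_ab);
}

int
main(void)
{
    double sel_a = 0.01, sel_b = 0.01;  /* per-column selectivities */
    double dist_a = 100, dist_b = 100;  /* per-column ndistinct */
    double dist_ab = 1000;              /* ndistinct of the (A,B) group */

    printf("base estimate:         %g\n",
           cross_sel(dist_a, sel_a, dist_b, sel_b, dist_ab));

    /* 10x the per-column ndistinct => 10x the estimate */
    printf("dist(A), dist(B) * 10: %g\n",
           cross_sel(10 * dist_a, sel_a, 10 * dist_b, sel_b, dist_ab));

    /* 10x the combined ndistinct => 1/10 of the estimate */
    printf("dist(A,B) * 10:        %g\n",
           cross_sel(dist_a, sel_a, dist_b, sel_b, 10 * dist_ab));

    /* upper bound when dist(A), dist(B) <= dist(A,B) */
    printf("upper bound:           %g\n", (sel_a + sel_b) / 2.0);

    return 0;
}

With these example numbers it prints 0.001 for the base case, 0.01 when
dist(A) and dist(B) are 10x higher, and 0.0001 when dist(A,B) is 10x higher,
matching the proportionality described above; the upper bound comes out as
(sel(A) + sel(B)) / 2 = 0.01.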