Re: proposal : cross-column stats - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: proposal : cross-column stats |
| Date | |
| Msg-id | 4D0BE040.5070604@fuzzy.cz |
| In response to | Re: proposal : cross-column stats (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>) |
| Responses | Re: proposal : cross-column stats |
| List | pgsql-hackers |
On 17.12.2010 22:41, Heikki Linnakangas wrote:
> On 17.12.2010 23:13, Tomas Vondra wrote:
>> On 17.12.2010 19:58, Robert Haas wrote:
>>> I haven't read the paper yet (sorry) but just off the top of my head,
>>> one possible problem here is that our n_distinct estimates aren't
>>> always very accurate, especially for large tables. As we've discussed
>>> before, making them accurate requires sampling a significant
>>> percentage of the table, whereas all of our other statistics can be
>>> computed reasonably accurately by sampling a fixed amount of an
>>> arbitrarily large table. So it's possible that relying more heavily
>>> on n_distinct could turn out worse overall even if the algorithm is
>>> better. Not sure if that's an issue here, just throwing it out
>>> there...
>>
>> Yes, you're right - the paper really is based on (estimates of) number
>> of distinct values for each of the columns as well as for the group of
>> columns.
>>
>> AFAIK it will work with reasonably precise estimates, but the point is
>> you need an estimate of distinct values of the whole group of columns.
>> So when you want to get an estimate for queries on columns (a,b), you
>> need the number of distinct value combinations of these two columns.
>>
>> And I think we're not collecting this right now, so this solution
>> requires scanning the table (or some part of it).
>
> Any idea how sensitive it is to the accuracy of that estimate on
> distinct value combinations? If we get that off by a factor of ten or a
> hundred, what kind of an effect does it have on the final cost estimates?
Well, not really - I haven't done any experiments with it. For two
columns, the selectivity equation is
(dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
where A and B are columns, dist(X) is number of distinct values in
column X and sel(X) is selectivity of column X.
dependency on dist(A), dist(B) and dist(A,B)
--------------------------------------------
So the estimate seems to be proportional to dist(A) and dist(B), and
inversely proportional to dist(A,B). When you increase dist(A) and
dist(B) 10x, you get 10x the original estimate. Similarly, increasing
dist(A,B) 10x gives a 10x lower estimate.
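A quick worked example (made-up numbers): with dist(A) = dist(B) = 100,
sel(A) = sel(B) = 0.01 and dist(A,B) = 1000, the equation gives

(100 * 0.01 + 100 * 0.01) / (2 * 1000) = 0.001

Scaling dist(A) and dist(B) up to 1000 gives 0.01 (10x higher), while
scaling dist(A,B) up to 10000 instead gives 0.0001 (10x lower).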
upper bound
-----------
Unless dist(A) or dist(B) is greater than dist(A,B), the estimated
selectivity is bounded by

(sel(A) + sel(B)) / 2

because dist(A) <= dist(A,B) implies dist(A) * sel(A) / (2 * dist(A,B))
<= sel(A) / 2, and the same holds for B. I guess we can say that
(dist(A) > dist(A,B)) is generally the same as (dist(A) = dist(A,B)),
as the true number of distinct combinations can never be lower than the
number of distinct values in either column - it can only happen through
estimation error. I.e. we can use the heuristic I've already offered in
the PoC.
lower bound
-----------
Not sure what the lower bound is, but the estimate can get arbitrarily
close to 0 if dist(A) and dist(B) are very small and/or dist(A,B) is
huge.
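For illustration (made-up numbers again): with dist(A) = dist(B) = 2,
sel(A) = sel(B) = 0.5 and an estimated dist(A,B) = 10000, you get

(2 * 0.5 + 2 * 0.5) / (2 * 10000) = 0.0001

Note the true dist(A,B) can never exceed dist(A) * dist(B) (here 4), so
a combination like this can only come from a badly overestimated
dist(A,B).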
regards
Tomas