Re: proposal : cross-column stats - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: proposal : cross-column stats |
| Date | |
| Msg-id | 4D0BE040.5070604@fuzzy.cz |
| In response to | Re: proposal : cross-column stats (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>) |
| Responses | Re: proposal : cross-column stats |
| List | pgsql-hackers |
On 17.12.2010 22:41, Heikki Linnakangas wrote:
> On 17.12.2010 23:13, Tomas Vondra wrote:
>> On 17.12.2010 19:58, Robert Haas wrote:
>>> I haven't read the paper yet (sorry) but just off the top of my head,
>>> one possible problem here is that our n_distinct estimates aren't
>>> always very accurate, especially for large tables. As we've discussed
>>> before, making them accurate requires sampling a significant
>>> percentage of the table, whereas all of our other statistics can be
>>> computed reasonably accurately by sampling a fixed amount of an
>>> arbitrarily large table. So it's possible that relying more heavily
>>> on n_distinct could turn out worse overall even if the algorithm is
>>> better. Not sure if that's an issue here, just throwing it out
>>> there...
>>
>> Yes, you're right - the paper really is based on (estimates of) number
>> of distinct values for each of the columns as well as for the group of
>> columns.
>>
>> AFAIK it will work with reasonably precise estimates, but the point is
>> you need an estimate of distinct values of the whole group of columns.
>> So when you want to get an estimate for queries on columns (a,b), you
>> need the number of distinct value combinations of these two columns.
>>
>> And I think we're not collecting this right now, so this solution
>> requires scanning the table (or some part of it).
>
> Any idea how sensitive it is to the accuracy of that estimate on
> distinct value combinations? If we get that off by a factor of ten or a
> hundred, what kind of an effect does it have on the final cost estimates?
Well, not really - I haven't done any experiments with it. For two
columns, the selectivity equation is
(dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
where A and B are columns, dist(X) is number of distinct values in
column X and sel(X) is selectivity of column X.
dependency on dist(A), dist(B) and dist(A,B)
--------------------------------------------
So the estimate seems to be proportional to dist(A) and dist(B), and
inversely proportional to dist(A,B). When you increase dist(A) and
dist(B) 10x, you get 10x the original estimate. Similarly, increasing
dist(A,B) 10x gives a 10x lower estimate.
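A quick worked example (made-up numbers): with dist(A) = dist(B) = 100,
sel(A) = sel(B) = 0.01 and dist(A,B) = 1000, the equation gives

(100 * 0.01 + 100 * 0.01) / (2 * 1000) = 0.001

Scaling dist(A) and dist(B) up to 1000 gives 0.01 (10x higher), while
scaling dist(A,B) up to 10000 instead gives 0.0001 (10x lower).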
upper bound
-----------
Unless dist(A) or dist(B) is greater than dist(A,B), the estimated
selectivity is bounded by

(sel(A) + sel(B)) / 2

because dist(A) <= dist(A,B) implies dist(A) * sel(A) / (2 * dist(A,B))
<= sel(A) / 2, and the same holds for B. I guess we can say that
(dist(A) > dist(A,B)) is generally the same as (dist(A) = dist(A,B)),
as the true number of distinct combinations can never be lower than the
number of distinct values in either column - it can only happen through
estimation error. I.e. we can use the heuristic I've already offered in
the PoC.
lower bound
-----------
Not sure what the lower bound is, but the estimate can get arbitrarily
close to 0 if dist(A) and dist(B) are very small and/or dist(A,B) is
huge.
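For illustration (made-up numbers again): with dist(A) = dist(B) = 2,
sel(A) = sel(B) = 0.5 and an estimated dist(A,B) = 10000, you get

(2 * 0.5 + 2 * 0.5) / (2 * 10000) = 0.0001

Note the true dist(A,B) can never exceed dist(A) * dist(B) (here 4), so
a combination like this can only come from a badly overestimated
dist(A,B).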
regards
Tomas