Re: proposal : cross-column stats - Mailing list pgsql-hackers
| From | Florian Pflug |
|---|---|
| Subject | Re: proposal : cross-column stats |
| Date | |
| Msg-id | 8121FDEB-A57E-4A03-9119-055C62D0BA5A@phlo.org |
| In response to | Re: proposal : cross-column stats (Tomas Vondra <tv@fuzzy.cz>) |
| Responses | Re: proposal : cross-column stats; Re: proposal : cross-column stats |
| List | pgsql-hackers |
On Dec 18, 2010, at 17:59, Tomas Vondra wrote:

> It seems to me you're missing one very important thing - this was not
> meant as a new default way to do estimates. It was meant as an option
> when the user (DBA, developer, ...) realizes the current solution gives
> really bad estimates (due to correlation). In that case he could create
> 'cross-column' statistics on those columns, and the optimizer would use
> that info to do the estimates.

I do understand that. I just have the nagging feeling that there is a way to judge from dist(A), dist(B) and dist(A,B) whether it makes sense to apply the uniform bayesian approach or to assume the columns are unrelated.

I played with this for a bit over the weekend, but unfortunately ran out of time. So I'm writing up what I found, to prevent it from getting lost.

I tried to pick up Robert's idea of quantifying "implicativeness" - i.e., finding a number between 0 and 1 that describes how close the pairs (A,B) are to representing a function A -> B.

Observe that

    dist(A), dist(B) <= dist(A,B) <= dist(A)*dist(B)

if the estimates of dist(?) are consistent. From that you easily get

    dist(A,B)/dist(B) <= dist(A) <= dist(A,B)

and

    dist(A,B)/dist(A) <= dist(B) <= dist(A,B)

If dist(A) == dist(A,B), then there is a functional dependency A -> B, and conversely, if dist(B) == dist(A,B), there is a functional dependency B -> A. Note that you can have both at the same time!

On the other hand, if dist(B) == dist(A,B)/dist(A), then B has the smallest number of distinct values possible for a given combination of dist(A,B) and dist(A). This is the anti-function case.

This motivates the definition

    F(A,B) = [ dist(A)*dist(B) - dist(A,B) ] / [ dist(A,B) * ( dist(B) - 1 ) ]

(You can probably drop the "-1"; it doesn't make much of a difference for larger values of dist(B).)

F(A,B) specifies where dist(A) lies relative to dist(A,B)/dist(B) and dist(A,B) - a value of 0 indicates dist(A) == dist(A,B)/dist(B), while a value of 1 indicates that dist(A) == dist(A,B).
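[Editor's note: a small Python sketch, not part of the original mail, illustrating the definition. Here dist(X) is computed exactly from sample data; in PostgreSQL the inputs would instead be the ndistinct estimates gathered by ANALYZE.]

```python
def f_measure(pairs):
    """F(A,B) = [dist(A)*dist(B) - dist(A,B)] / [dist(A,B) * (dist(B) - 1)]

    pairs is a list of (a, b) tuples, one per row of the table (A, B).
    """
    dist_a = len({a for a, _ in pairs})   # distinct values of A
    dist_b = len({b for _, b in pairs})   # distinct values of B
    dist_ab = len(set(pairs))             # distinct (A, B) combinations
    return (dist_a * dist_b - dist_ab) / (dist_ab * (dist_b - 1))

# Functional dependency A -> B: each A value determines its B value,
# so dist(A) == dist(A,B) and F should be 1.
func = [(1, 'x'), (2, 'x'), (3, 'y'), (4, 'y')]

# Independent columns: every combination occurs, so
# dist(A,B) == dist(A)*dist(B) and F should be 0.
indep = [(a, b) for a in (1, 2) for b in ('x', 'y')]

print(f_measure(func))   # 1.0
print(f_measure(indep))  # 0.0
```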
So F(A,B) is a suitable measure of "implicativeness" - it's higher if the table (A,B) looks more like a function A -> B. You might use that to decide if either A -> B or B -> A looks function-like enough to use the uniform bayesian approach. Or you might even go further, and decide *which* bayesian formula to use - the paper you cited always averages P(A=x|B=y)*P(B=y) and P(B=y|A=x)*P(A=x), but they offer no convincing reason for that other than "we don't know which to pick".

I'd like to find a statistical explanation for that definition of F(A,B), but so far I couldn't come up with any.

I created a Maple 14 worksheet while playing around with this - if you happen to have a copy of Maple available, I'd be happy to send it to you.

This is what I got so far - I hope it may prove to be of use somehow.

best regards,
Florian Pflug
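[Editor's note: a hedged sketch, not from the original mail. F is asymmetric in its arguments, so comparing F(A,B) with F(B,A) is one plausible way to judge which direction is more function-like, and hence which of the two factorisations to prefer rather than always averaging them.]

```python
def f(dist_x, dist_y, dist_xy):
    """F(X,Y) as defined above, from the three distinct-value counts."""
    return (dist_x * dist_y - dist_xy) / (dist_xy * (dist_y - 1))

# A -> B is a function here (each A value determines B), but B -> A is
# not: 'x' maps to both 1 and 2.
pairs = [(1, 'x'), (2, 'x'), (3, 'y')]
dist_a = len({a for a, _ in pairs})   # 3
dist_b = len({b for _, b in pairs})   # 2
dist_ab = len(set(pairs))             # 3

f_ab = f(dist_a, dist_b, dist_ab)     # 1.0: A -> B fully function-like
f_ba = f(dist_b, dist_a, dist_ab)     # 0.5: B -> A only partially so

print(f_ab, f_ba)
```

Since f_ab > f_ba here, this heuristic would favour the P(B=y|A=x)*P(A=x) factorisation over averaging the two.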