Re: proposal : cross-column stats - Mailing list pgsql-hackers
| From | Florian Pflug | 
|---|---|
| Subject | Re: proposal : cross-column stats | 
| Date | |
| Msg-id | 8121FDEB-A57E-4A03-9119-055C62D0BA5A@phlo.org Whole thread Raw | 
| In response to | Re: proposal : cross-column stats (Tomas Vondra <tv@fuzzy.cz>) | 
| Responses | Re: proposal : cross-column stats Re: proposal : cross-column stats | 
| List | pgsql-hackers | 
On Dec18, 2010, at 17:59 , Tomas Vondra wrote: > It seems to me you're missing one very important thing - this was not > meant as a new default way to do estimates. It was meant as an option > when the user (DBA, developer, ...) realizes the current solution gives > really bad estimates (due to correlation). In that case he could create > 'cross-column' statistics on those columns, and the optimizer would use > that info to do the estimates. I do understand that. I just have the nagging feeling that there is a way to judge from dist(A), dist(B) and dist(A,B) whether it makes sense to apply the uniform bayesian approach or to assume the columns are unrelated. I play with this for a bit over the weekend, but unfortunately ran out of time. So I'm writing up what I found, to prevent it from getting lost. I tried to pick up Robert's idea of quantifying "Implicativeness" - i.e., finding a number between 0 and 1 that describes how close the (A,B) are to representing a function A -> B. Observe that dist(A),dist(B) <= dist(A,B) <= dist(A)*dist(B) if the estimates of dist(?) are consistent. From that you easily get dist(A,B)/dist(B) <= dist(A) <= dist(A,B) and dist(A,B)/dist(A) <= dist(B) <= dist(A,B) If dist(A) == dist(A,B), then there is a functional dependency A -> B, and conversely if dist(B) == dist(A,B) there is a functional dependency B -> A. Note that you can have both at the same time! On the other hand, if dist(B) = dist(A,B)/dist(A), then B has the smallest number of distinct values possible for a given combination of dist(A,B) and dist(A). This is the anti-function case. This motivates the definition F(A,B) = [ dist(A)*dist(B) - dist(A,B) ] / [ dist(A,B) * ( dist(B) - 1) ] (You can probably drop the "-1", it doesn't make much of a difference for larger values of dist(B). F(A,B) specifies where dist(A) lies relative to dist(A,B)/dist(B) and dist(A,B) - a value of 0 indicates dist(A) = dist(A,B)/dist(B) while a value of 1 indicates that dist(A) == dist(A,B). So F(A,B) is a suitable measure of "Implicativeness" - it's higher if the table (A,B) looks more like a function A -> B. You might use that to decide if either A->B or B->a looks function-like enough to use the uniform bayesian approach. Or you might even go further, and decide *with* bayesian formula to use - the paper you cited always averages P(A=x|B=y)*P(B=y) and P(B=y|A=x)*P(A=x) but they offer no convincing reason for that other than "We don't know which to pick". I'd like to find a statistical explanation for that definition of F(A,B), but so far I couldn't come up with any. I created a Maple 14 worksheet while playing around with this - if you happen to have a copy of Maple available I'd be happy to send it to you.. This is what I got so far - I hope it may prove to be of use somehow. best regards, Florian Pflug
pgsql-hackers by date: