Re: New hashed IN code ignores distinctiveness of subquery - Mailing list pgsql-bugs
From: Tom Lane
Subject: Re: New hashed IN code ignores distinctiveness of subquery
Date:
Msg-id: 19010.1043635398@sss.pgh.pa.us
In response to: Re: New hashed IN code ignores distinctiveness of subquery (Bradley Baetz <bbaetz@acm.org>)
Responses: Re: New hashed IN code ignores distinctiveness of subquery
List: pgsql-bugs
Bradley Baetz <bbaetz@acm.org> writes:
> On Sun, Jan 26, 2003 at 06:51:18PM -0500, Tom Lane wrote:
>> But the DISTINCT case is a different query. The question that's really
>> at hand is why does the planner mistakenly prefer merge to hash join in
>> the non-DISTINCT case.

> Well, its a different input, but the output is the same, so it may as
> well be the same query. IN (1,1,1) is the same as IN (1).

That's a good observation with respect to the problem of estimating the
number of output tuples; but it's irrelevant to the problem of estimating
execution cost. The plan that we're trying to cost here is specifically a
plan that runs a mergejoin *without* first uniq-ifying the righthand
relation. As such, the cost of the mergejoin is nearly the same as the
cost of an ordinary mergejoin of the two relations. We save the costs of
outputting join tuples that aren't wanted by the IN, but we still have to
run the mergejoin over those tuples.

> Well, in this case the hash node could prune duplicates as they went
> into the table :). Not sure if that would be a win, though.

We're already checking that as a separate plan alternative. The
implementation could be improved a little, though --- we could combine
the uniq-ification into the Hash node.

> I don't follow that. We have 50000 outer rows, with a selectivity of say
> 0.1 (to make the maths nicer). That gives the roughly 5000 rows printed
> above. However, thats not the whole story. The result of the join will
> be 5000 rows _for each (unique) tuple in the subselect result_. For the
> 10 distinct values I have, that gives 50000 total rows, which is
> correct.

AFAICS there are two or three different concepts of selectivity being
tossed about here.
If we want to continue defining "selectivity" as the ratio of the number
of output tuples to the cross product size, then you need different
selectivity numbers depending on whether you take the cross product size
to be num_outer * num_inner, num_outer * num_distinct_inner, or just
num_outer. Only in the last case does the valid range of selectivity run
from 0 to 1; if the selectivity estimator returns something approaching 1,
then the first two equations will give very wrong answers.

As an example, consider the case where all the entries in both tables are
the same value (say 42). The present selectivity estimator *will* say
1.0, as it should for an ordinary join. If you multiply by anything but
num_outer you get the wrong answer for the number of output tuples.

It might be that we have to bite the bullet and set up a different
selectivity estimation procedure for IN. I'd prefer to avoid that but
haven't really figured out if it's necessary or not.

In the meantime, I think you are right that it's bogus that
set_joinrel_size_estimates uses different equations for JOIN_IN and
JOIN_UNIQUE_INNER. Whatever the decision is about how to do the
estimation, these should be done the same way.

			regards, tom lane
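[Editorial note: the all-42 example above can be worked through numerically. The following is a minimal sketch with hypothetical row counts, not figures from the original thread; it only illustrates why multiplying a near-1.0 selectivity by anything other than num_outer misestimates the IN's output size.]

```python
# Hypothetical row counts: every value in both tables is 42, so the IN
# returns each outer row exactly once, and the ordinary-join selectivity
# estimator correctly reports 1.0 for this data.
num_outer = 50_000            # rows in the outer relation
num_inner = 10_000            # rows in the inner (subquery) relation

selectivity = 1.0             # ordinary-join estimate when all values match

true_output = num_outer       # IN emits each matching outer row once

# Two candidate "cross product sizes" to multiply the selectivity by:
est_full  = selectivity * num_outer * num_inner  # 500,000,000: badly wrong
est_outer = selectivity * num_outer              # 50,000: matches true_output

print(est_full, est_outer, true_output)
```

(With all inner values identical, num_distinct_inner is 1, so the num_outer * num_distinct_inner variant happens to coincide with num_outer here; it diverges as soon as the inner relation has more distinct values than the IN's output multiplicity warrants.)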