Re: Performance improvement for joins where outer side is unique - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Performance improvement for joins where outer side is unique |
Date | |
Msg-id | 5671FF0D.2010403@2ndquadrant.com Whole thread Raw |
In response to | Re: Performance improvement for joins where outer side is unique (David Rowley <david.rowley@2ndquadrant.com>) |
Responses |
Re: Performance improvement for joins where outer side is unique
|
List | pgsql-hackers |
Hi, On 12/16/2015 11:40 PM, David Rowley wrote: > On 17 December 2015 at 05:02, Tomas Vondra <tomas.vondra@2ndquadrant.com > <mailto:tomas.vondra@2ndquadrant.com>> wrote: > > 0) I know the patch does not tweak costing - any plans in this > > direction? Would it be possible to simply use the costing used by > semijoin? > > > Many thanks for looking at this. > > The patch does tweak the costings so that unique joins are costed in the > same way as semi joins. It's a very small change, so easy to miss. Thanks. I missed that bit somehow. > > For example, see the change in initial_cost_nestloop() > > - if (jointype == JOIN_SEMI || jointype == JOIN_ANTI) > + if (jointype == JOIN_SEMI || > + jointype == JOIN_ANTI || > + unique_inner) > { > > Also see the changes in final_cost_nestloop() and final_cost_hashjoin() > > > 1) nodeHashjoin.c (and other join nodes) > > I've noticed we have this in the ExecHashJoin() method: > > /* > * When the inner side is unique or we're performing a > * semijoin, we'll consider returning the first match, but > * after that we're done with this outer tuple. > */ > if (node->js.unique_inner) > node->hj_JoinState = HJ_NEED_NEW_OUTER; > > That seems a bit awkward because the comment speaks about unique > joins *OR* semijoins, but the check only references unique_inner. > That of course works because we do this in ExecInitHashJoin(): > > switch (node->join.jointype) > { > case JOIN_SEMI: > hjstate->js.unique_inner = true; > /* fall through */ > > Either some semijoins may not be unique - in that case the naming is > misleading at best, and we either need to find a better name or > simply check two conditions like this: > > if (node->js.unique_inner || node->join.type == JOIN_SEMI) > node->hj_JoinState = HJ_NEED_NEW_OUTER; > > Or all semijoins are unique joins, and in that case the comment may > need rephrasing. But more importantly, it begs the question why > we're detecting this in the executor and not in the planner? Because > if we detect it in executor, we can't use this bit in costing, for > example. > > > The reason not to detect that in the planner is simply that unique_join > is meant to mean that the relation on the inner side of the join will at > most contain a single Tuple which matches each outer relation tuple, > based on the join condition. I've added no detection for this in semi > joins, and I'd rather not go blindly setting the flag to true in the > planner as it simply may not be true for the semi join. At the moment > that might not matter as we're only using the unique_join flag as an > optimization in the join nodes, but I'd prefer not to do this as its > likely we'll want to do more with this flag later, and I'd rather we > keep the meaning well defined. You might argue that this is also true > for semi joins, but if down the road somewhere we want to perform some > checks on the inner relation before the join takes place, and in that > case the Tuples of the relation might not have the same properties we > claim they do. > > But you're right that reusing the flag in the join nodes is not ideal, > and the comment is not that great either. I'd really rather go down the > path of either renaming the variable, or explaining this better in the > comment. It seems unnecessary to check both for each tuple being joined. > I'd imagine that might add a little overhead to joins which are not unique. I'd be very surprised it that had any measurable impact. > How about changing the comment to: > > /* > * In the event that the inner side has been detected to be > * unique, as an optimization we can skip searching for any > * subsequent matching inner tuples, as none will exist. > * For semijoins unique_inner will always be true, although > * in this case we don't match another inner tuple as this > * is the required semi join behavior. > */ > > Alternatively or additionally we can rename the variable in the executor > state, although I've not thought of a name yet that I don't find overly > verbose: unique_inner_or_semi_join, match_first_tuple_only. I'd go with match_first_tuple_only. > > > 2) analyzejoins.c > > I see that this code in join_is_removable() > > innerrel = find_base_rel(root, innerrelid); > > if (innerrel->reloptkind != RELOPT_BASEREL) > return false; > > was rewritten like this: > > innerrel = find_base_rel(root, innerrelid); > > Assert(innerrel->reloptkind == RELOPT_BASEREL); > > That suggests that previously the function expected cases where > reloptkind may not be RELOPT_BASEREL, but now it'll error out int > such cases. I haven't noticed any changes in the surrounding code > that'd guarantee this won't happen, but I also haven't been able to > come up with an example triggering the assert (haven't been trying > too hard). How do we know the assert() makes sense? > > > I'd have changed this as this should be covered by the if > (!sjinfo->is_unique_join || a few lines up. mark_unique_joins() only > sets is_unique_join to true if specialjoin_is_unique_join() returns true > and that function tests reloptkind to ensure its set to RELOPT_BASEREL > and return false if the relation is not. Perhaps what is missing is a > comment to explain the function should only be used on RELOPT_BASEREL > type relations. Yeah, I think it'd be good to document this contract somewhere. Either in the comment before the function, or perhaps right above the Assert(). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: