Thread: Projection while performing joins.
Hi, In the merge join as well as in the nested loop join, we do ExecProject() after we have found tuples from the relations involved in the join.For a join query involving just two relations and merge join being used, the outer plan will be NodeSort. Now, NodeSort will create Temp files and keep a sorted version of the relation in these Temp files. (correct?) I was wondering,Why do we not just store the attributes required in the join (i.e. those in the join qual conditions and the ones in the select list) and then perform sorting and retrieval on these tuples rather than on the possibly larger tuples with more attributes which we project out in the end anyway. Is there any reason why this is infeasible to implement or is this simply a wrong approach theoretically? If none, is this being used anywhere? (hope the answers is not postgres... i would have embarrassed myself no limit then) Another way of putting this would be, why don't we push the projection operator as far down as possible. In this case we also add the attributes in the selection operator (of relational algebra) to the projection. And then perform a join on the smaller resulting relation. Regards, Anagh Lal. __________________________________________________ Do you Yahoo!? Yahoo! Shopping - Send Flowers for Valentine's Day http://shopping.yahoo.com
Anagh Lal <anaghlal2001@yahoo.com> writes: > Why do we not just store the attributes required in > the join (i.e. those in the join qual conditions and > the ones in the select list) and then perform sorting > and retrieval on these tuples rather than on the > possibly larger tuples with more attributes which we > project out in the end anyway. We already do --- the scan nodes project out only the needed columns. At least, that's true in released versions. Just a few days ago I put a hack into CVS tip to skip the projection step for a scan node that's not topmost in the plan. As you indicate, that's probably a net loss when there's a Sort node directly above the scan node. Needs more thought... regards, tom lane
Hi, Two parts to the mail 1) > We already do --- the scan nodes project out only > the needed columns. ok..thats great. I tried looking for what you are saying in the source code... [before and after doing a cvs update].. but I am still confused by the following: In /backend/executor/nodeMergeJoin.c in ExecMergeJoin() In the state (the switch case) EXEC_MJ_JOINTUPLES we still do ExecProject(), what does this do? (this is what prompted my incorrect assumption in the first place). I was able to trace the ExecProject() which takes place (due to the SeqScan node) in the ExecScan function of /executor/execScan.c (I made an error in assuming the tuple returned by the function passed as an argument to ExecScan was being returned directly )... so now i see what you were saying in the mail but am not clear about why we have a ExecProject() higher up in the join nodes..its there in the NestedLoop node too. part 2) > As you indicate, that's probably a net loss > when there's a Sort node directly above the scan > node. Needs more thought... Some food for thought, Let's ignore the attributes listed in the select clause and work only with the where clause (join condition) attributes. And as a back reference store the tupleid of the original whole tuple in the "working" tuple. At the final output stage perform a lookup to retrieve the select clause attributes of only the qualifying tuple. Thus enabling us to work with really small sized data. worth trying out? Thanks Anagh Lal __________________________________________________ Do you Yahoo!? Yahoo! Shopping - Send Flowers for Valentine's Day http://shopping.yahoo.com
Anagh Lal <anaghlal2001@yahoo.com> writes: > ... I am still confused by the following: > In /backend/executor/nodeMergeJoin.c > in ExecMergeJoin() > In the state (the switch case) EXEC_MJ_JOINTUPLES > we still do ExecProject(), what does this do? Well, sure. A join node *must* do a projection, no? It can't simply return either the left or the right input tuple (except in the vanishingly-small fraction of cases where you don't actually need any columns from the right or the left respectively; which are cases that we don't currently bother to optimize). To create a tuple that's not exactly one or the other you must project. > Some food for thought, > Let's ignore the attributes listed in the select > clause > and work only with the where clause (join condition) > attributes. And as a back reference store the > tupleid of the original whole tuple in the "working" > tuple. At the final output stage perform a lookup to > retrieve the select clause attributes of only the > qualifying tuple. Thus enabling us to work with really > small sized data. > worth trying out? Not sure there's a lot of traction here. In many cases, the bottom-level scan gets advanced one or more rows before the top-level nodes can pop out a result. (Consider GROUP BY as an example.) I don't see the advantage of adding bookkeeping/copying for past rows in order to avoid copying current-row data around. But feel free to prove me wrong ;-) regards, tom lane