Reimplementing UNION/INTERSECT/EXCEPT - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Reimplementing UNION/INTERSECT/EXCEPT |
Date | |
Msg-id | 4939.970506269@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: Reimplementing UNION/INTERSECT/EXCEPT
|
List | pgsql-hackers |
I've been looking at UNION/INTERSECT/EXCEPT with an eye to making them work in views and subselect-in-FROM (same thing really ;-)). I had first thought that some marginal hacking on the parsetree representation might be enough, but after study I am realizing just how broken this code really is. It turns out that it's not even very close to implementing the SQL spec. SQL92 7.10, general rule 1b says that if a row R has m duplicates in T1 and n duplicates in T2 (m >= 0, n >= 0) then: Set operation: contains this many duplicates of R: T1 UNION T2 1 if m > 0 or n > 0, else 0 T1 INTERSECT T2 1 if m > 0 and n > 0, else 0 T1 EXCEPT T2 1 if m > 0 and n = 0, else 0 T1 UNION ALL T2 m + n T1 INTERSECT ALL T2 min(m, n) T1 EXCEPT ALL T2 max(m - n, 0) We are OK for UNION (which we do as append, sort, unique) and for UNION ALL (which we do as append). We are not OK for INTERSECT and EXCEPT, which the code presently tries to do as SELECT T1 INTERSECT SELECT T2 => SELECT T1 WHERE T1 IN (SELECT T2) SELECT T1 EXCEPT SELECT T2 => SELECT T1 WHERE T1 NOT IN (SELECT T2) This will give the wrong number of duplicates when m > 1. It could be made to give the right answer by adding a DISTINCT to the select, but there's still no expansion path for implementing INTERSECT ALL and EXCEPT ALL. There are a bunch of internal problems too (which is why views didn't support UNION et al to begin with), mostly due to bad choices of data structures. I have come to the conclusion that there's no point in half measures: throwing out this code and rewriting from scratch will take less time than trying to patch it. Here's what I have in mind: Parse-tree data structure (output of gram.y): remove unionClause/intersectClause from SelectStmt. Add a SetOperation node type that indicates the operation type (UNION/INTERSECT/EXCEPT plus an "all" boolean) and links to two component SelectStmts or SetOperations. There will be no loops in this data structure, unlike the present situation. The top-level info (optional ORDER BY) added by the SelectStmt production will be attached to the leftmost SelectStmt leaf in the tree. Query-tree structure (output of parse_analyze): process the leaf SelectStmts into Queries individually, after removing the OrderBy etc info from the leftmost leaf. Create a top-level Query that contains the leaf Queries as subselects-in-FROM. In place of unionClause/intersectClause, Query nodes will have a Node *setOperations field. In the top-level Query this will contain the SetOperation tree emitted by gram.y, but with the leaf SelectStmt nodes replaced by RangeTblRef references to the range table entries occupied by the leaf Queries. (Thus, still no loops or multiple links.) The top-level Query has a dummy targetlist that exists mainly to show the union'd datatype of each output column, and it carries any sortClause, limitClause, etc needed for the output of the entire operation. Note that we do not need to transform the datatypes of the leaf queries' targetlists, which eliminates a large class of bugs that exist presently with cross-datatype UNIONs. Rewriter: need pay no attention at all to setOperation tree. Plan-tree structure (output of planner): UNION/UNION ALL are handled same as now, except that what we are appending together is not directly the top-level plan of each leaf query, but a SubqueryScan plan scanning the output of each leaf query. This gives us one extra level of projection (targetlist evaluation) in which to put conversions to the common union datatype --- without breaking the semantics of GROUP BY, DISTINCT, and so forth in the leaf queries. INTERSECT and EXCEPT will be handled by building SubqueryScan plans that emit the common-datatype columns plus a resjunk boolean column that shows whether the tuple is coming from the left or right input. The outputs of these plans are then appended together, sorted, and fed to a new plan node type that implements INTERSECT(ALL) and EXCEPT(ALL). It will be a simple generalization of the Unique plan type: scan a set of successive tuples that agree in all the non-resjunk columns, counting the number of tuples that came from left and right sides. This gives us 'm' and 'n' for each set, from which the spec-defined behavior can be implemented immediately. Executor: just need new plan-node-type implementation, which is easily derived from nodeUnique. I am not planning to try to implement SQL's "CORRESPONDING" option for 7.1. Whenever we do get to it, it should be a fairly straightforward extension: add the corresponding-column lists to SetOperation nodes, and then in the plan tree, make the SubqueryScan nodes emit just these columns and not the entire targetlists of the leaf Queries. (This is another reason why we need the extra level of targetlist...) Comments? regards, tom lane
pgsql-hackers by date: