proposal: add window function to 8.4 - Mailing list pgsql-hackers
From | H.Harada |
---|---|
Subject | proposal: add window function to 8.4 |
Date | |
Msg-id | e08cc0400806090532p307f0687i3a246aa5f9f293e8@mail.gmail.com Whole thread Raw |
Responses |
Re: proposal: add window function to 8.4
|
List | pgsql-hackers |
This topic has been discussed on this list and many user expect that PostgreSQL implements it. I'd like to work on this feature and hope that we can include it on 8.4. Former discussions are here: http://archives.postgresql.org/pgsql-hackers/2004-11/msg01093.php http://archives.postgresql.org/pgsql-hackers/2007-01/msg00861.php How it works and examples:SELECT dept, empno,RANK() OVER(PARTITION BY dept ORDER BY age) as age_rank,RANK() OVER(PARTITIONBY dept ORDER BY salary) as salary_rank,SUM(salary) OVER(PARTITION BY dept ORDER BY age) as run_totalFROM employeesORDER BY 1, 3, 4; dept empno age_rank salary_rank run_totalENG 2 1 2 40000ENG 1 2 1 90000QA 3 1 2 30000QA 4 2 1 65000(ref.: http://www.gavinsherry.org/talks/window_talk.pdf) My current idea and concept: - add "window function" and treat it specially such like aggregate function and setof function. - some features may be dropped at the first release, considering to support them later. - to formulate and to let it work properly are primary, performance optimization is secondary. From my survey around web and list archive, the points are: - what is "window function" rank(), rank_dense(), lead() and others? First of all, we should define the window function suchlike aggregate function. In my opinion, there are two types of functions in OVER () call context. One of them is aggregate, and the other is window (ranking) function. Sum() in a query like SELECT empno, sum(salary) OVER (PARTITION BY depno) FROM empsalary; is obviously aggregate function. This type of function can be used as it is now. Only executer will change its behavior.Current pgsql feature sets lack window function like rank(). This type of function must 1) have a context such as SETOF functions, 2) return values until executor says "DONE", rather than function itself says "DONE" as in SETOF function, and 3) know about other tuples (mainly ORDER BY clause), as rank() doesn't take any arguments but knows the ranking boundary. I suppose that pgsql have new function system for these types called "window function".Once we can define window function, users have extensibility to this type of function. - do we really need FRAME clause?From my survey, Oracle and DB2 have FRAME clause SELECT empno, sum(salary) OVER (ORDER BY empno ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM empsalary; while MS SQL Server doesn't (note that the literal from "ROWS" to "CURRENT ROW" is called FRAME clause). To implement FRAME clause is much more complicated than only PARTITION and ORDER clause support because we need random access to result tuples. Though we will be expected to support FAME clause in long-term, for the first release it might be able to be put off. Even though rank() doesn't support FRAME clause (only PARTITION and ORDER) it is so useful, more than now at least. - execution orderIn SQL:2003, the execution order is defined as where -> group by -> having -> (windowing) * N -> order by (outer, currently existing one)where windowing ispartition by -> order by -> (framing) * N But Oracle seems that it has another order (windowing) * N -> where -> group by ... and so on. which is better for us? With Oracle's one you can say SELECT empno, rank() OVER (PARTITION BY depno ORDER BY saraly) AS topsalary FROM empsalary WHERE topsaraly < 3to get the top 3 people taking heighest salary. In the SQL standard, you should the nest query.I insist the first (standard) one is better because we may want use the result of normal aggregate in OVER clause. - plan and nodeCurrently in my mind the execution strategy could be: 1. Where & GroupBy & Having|2. SortBy partitionClause, orderByClause|3. Window foreach partition: if not there_is_frame(): aggvalue = null foreach row in partition: aggvalue = agg_trans_func(aggvalue) aggvalue= agg_final_func(aggvalue) foreach row in partition: if has frame clause: aggvalue = null frame = make_frame() foreachrow_in_frame: aggvalue = aggregate_trans_func(aggvalue) aggvalue = aggregate_final_func(aggvalue) set aggvalue to row val = window_func() set val to row goto 2. if another window remained|4. SortBy ORDERBY clause (outer) & Limit|5. Output This pseudo code is quite simple and stupid. We may optimize it by splitting tasks with MergeJoin, etc. or think about process 2. that collect same PARTITION clauses to reduce sort operation. Or to use Hash Agg to create PARTITION. But let's be stupid at first. Optimization is secondary. References:description by Stefan DeBloch http://wwwdvs.informatik.uni-kl.de/courses/NEDM/WS0607/Vorlesungsunterlagen/NEDM.Chapter.06.Windows_and_Query_Functions_in_SQL.pdfvia Wikipedia[Select(SQL)] http://en.wikipedia.org/wiki/Select_(SQL) BTW, what about Bizgres now? I appreciate for any comments and dis -:)
pgsql-hackers by date: