Thread: introduction of WIP window function patch
Hi, As I proposed a month before, I am working on window function. Although this work is at quite early step, I would like to introduce it since a part of it have been finished. If you can afford and are interested in it, please review the document and patch, or compile the applied source to execute an attached sample SQL. http://umitanuki.net/pgsql/wfv01/design.html Currently, only aggregation over window does work. I am planning to work for the combination of window and normal aggregation from now on, which I guess I can manage to do. The problem is, as written in the "Things to discussed" section of the document, how you define window functions (e.g. RANK()). My idea is to treat them as specialized functions such as SET OF functions and mark it in pg_proc. But this doesn't resolve RANK() boundary problem. I am so happy with any kind of comments, reviews or critiques. Regards, -- Hitoshi Harada
On Sat, Jul 05, 2008 at 07:04:29PM +0900, H.Harada wrote: > Hi, > > As I proposed a month before, I am working on window function. Very nice! > http://umitanuki.net/pgsql/wfv01/design.html > > The problem is, as written in the "Things to discussed" section of the > document, how you define window functions (e.g. RANK()). My idea is to > treat them as specialized functions such as SET OF functions and mark > it in pg_proc. But this doesn't resolve RANK() boundary problem. Actually, I would make RANK() and ROW_NUMBER() act more like aggregates. ISTM you have two kinds of window functions: - aggregation: a result is calculated over a set and the result copied across all the rows. - order depenadant: same as above, but the result is different for each row. I think you could make the latter work using the current aggregation setup, just by calling the final_func for each row rather than just once at the end. That would make RANK() a normal aggrgate which returns the number of distinct values seen so far (assuming input is ordered) and ROW_NUMBER() is just an alias for COUNT(). I hope this is clear, let me know if it doesn't make sense. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
On Sat, 2008-07-05 at 16:20 +0200, Martijn van Oosterhout wrote: > On Sat, Jul 05, 2008 at 07:04:29PM +0900, H.Harada wrote: > > http://umitanuki.net/pgsql/wfv01/design.html > > > > The problem is, as written in the "Things to discussed" section of the > > document, how you define window functions (e.g. RANK()). My idea is to > > treat them as specialized functions such as SET OF functions and mark > > it in pg_proc. But this doesn't resolve RANK() boundary problem. > > Actually, I would make RANK() and ROW_NUMBER() act more like > aggregates. ISTM you have two kinds of window functions: > > - aggregation: a result is calculated over a set and the result copied > across all the rows. > - order depenadant: same as above, but the result is different for each > row. > > I think you could make the latter work using the current aggregation > setup, just by calling the final_func for each row rather than just > once at the end. AFAICS there's no overlap between windowed aggregates and normal aggregates, so we can different infrastructure for each. I like the suggestion of doing it very similarly to current aggregates, but I would introduce a new function hook for windowed aggregates, wfunc. i.e. to create a windowed aggregate you would do CREATE AGGREGATE window_func() (sfunc = ...stype = ...wfunc = ...initcond = ) For each row we would execute the transition function (sfunc) then, if there is a window function (wfunc) then we call that to return a value for this tuple (so in that case we execute two functions per tuple in the window). If wfunc is not set then we return the transition datatype itself. Doing it this way * it will be clear which aggregates are windowed and which non-windowed, so we can avoid errors running a windowed aggregate in a non-windowed context * it also allows us to avoid executing two functions when the windowed function is very simple - denserank() for example just returns the number of rows seen so far in the window. Denserank is fairly simple CREATE AGGREGATE denserank() (sfunc = incrementstype = bigintinitcond = 0 ) rank() is fairly complex because the state data must track 3 things: * the number of tuples seen so far (bigint) * the value of the last tuple seen (anyelement) * the rank of the last tuple seen (bigint) sfunc would compare the new value with the last value, if they match then we return the rank of the last tuple. If they don't match then we set the stored value and rank, then return the number of tuples seen so far as the rank. > That would make RANK() a normal aggrgate which returns the number of > distinct values seen so far (assuming input is ordered) and > ROW_NUMBER() is just an alias for COUNT(). > I hope this is clear, let me know if it doesn't make sense. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Hi, 2008/7/6 Simon Riggs <simon@2ndquadrant.com>: > > On Sat, 2008-07-05 at 16:20 +0200, Martijn van Oosterhout wrote: >> On Sat, Jul 05, 2008 at 07:04:29PM +0900, H.Harada wrote: > >> > http://umitanuki.net/pgsql/wfv01/design.html >> > >> > The problem is, as written in the "Things to discussed" section of the >> > document, how you define window functions (e.g. RANK()). My idea is to >> > treat them as specialized functions such as SET OF functions and mark >> > it in pg_proc. But this doesn't resolve RANK() boundary problem. >> >> Actually, I would make RANK() and ROW_NUMBER() act more like >> aggregates. ISTM you have two kinds of window functions: >> >> - aggregation: a result is calculated over a set and the result copied >> across all the rows. >> - order depenadant: same as above, but the result is different for each >> row. >> >> I think you could make the latter work using the current aggregation >> setup, just by calling the final_func for each row rather than just >> once at the end. > > AFAICS there's no overlap between windowed aggregates and normal > aggregates, so we can different infrastructure for each. I like the > suggestion of doing it very similarly to current aggregates, but I would > introduce a new function hook for windowed aggregates, wfunc. I think there are two types of functions for windowed mode. - windowed aggregate this type of function is exactly same as normal aggregate. So we use functions that have been in pgsql already. Actually in my patch above, I didn't introduce any new function. This type of function includes simply sum(), avg(), etc. which returns same values on a partition or a window frame. - windowed function this is the NEW type of function. I guess we should add a new function type to pgsql. This type of function includes rank(), rank_dense(), row_number(), etc. Windowed functions returns different values per tuple. The difference between two types is if the function returns the same value during a partition or different values. So, windowed aggregate and normal aggregate overlap each other. How you know which one is that you see OVER clause in SQL just after the function call. When you see OVER after func(), and pg_proc says it's an aggregate, it's a windowed aggregate. Otherwise, it's a windowed function. If I misunderstood about those definitions please correct me. > Denserank is fairly simple > > CREATE AGGREGATE denserank() > ( > sfunc = increment > stype = bigint > initcond = 0 > ) I think this is ROW_NUMBER(), not DENSE_RANK(), isn't it? > rank() is fairly complex because the state data must track 3 things: > * the number of tuples seen so far (bigint) > * the value of the last tuple seen (anyelement) > * the rank of the last tuple seen (bigint) > > sfunc would compare the new value with the last value, if they match > then we return the rank of the last tuple. If they don't match then we > set the stored value and rank, then return the number of tuples seen so > far as the rank. Yeah, I think RANK() needs to know some or all of tuple of current row. But it seems not to take any argument, which is the "boundary problem" I said. Definitely RANK() or windowed function should know about current tuple so that when to increment rank. But how? As explicit argument? or specialized method? -- Hitoshi Harada
2008/7/5 Martijn van Oosterhout <kleptog@svana.org>: > On Sat, Jul 05, 2008 at 07:04:29PM +0900, H.Harada wrote: >> Hi, >> >> As I proposed a month before, I am working on window function. > > Very nice! > >> http://umitanuki.net/pgsql/wfv01/design.html >> >> The problem is, as written in the "Things to discussed" section of the >> document, how you define window functions (e.g. RANK()). My idea is to >> treat them as specialized functions such as SET OF functions and mark >> it in pg_proc. But this doesn't resolve RANK() boundary problem. > > Actually, I would make RANK() and ROW_NUMBER() act more like > aggregates. ISTM you have two kinds of window functions: > > - aggregation: a result is calculated over a set and the result copied > across all the rows. > - order depenadant: same as above, but the result is different for each > row. So I agree the definition of these two types. > I think you could make the latter work using the current aggregation > setup, just by calling the final_func for each row rather than just > once at the end. How do you know which type of two above is used in the same SQL syntax? -- Hitoshi Harada
On Sun, 2008-07-06 at 03:40 +0900, H.Harada wrote: > Hi, > > 2008/7/6 Simon Riggs <simon@2ndquadrant.com>: > > > > On Sat, 2008-07-05 at 16:20 +0200, Martijn van Oosterhout wrote: > >> On Sat, Jul 05, 2008 at 07:04:29PM +0900, H.Harada wrote: > > > >> > http://umitanuki.net/pgsql/wfv01/design.html > >> > > >> > The problem is, as written in the "Things to discussed" section of the > >> > document, how you define window functions (e.g. RANK()). My idea is to > >> > treat them as specialized functions such as SET OF functions and mark > >> > it in pg_proc. But this doesn't resolve RANK() boundary problem. > >> > >> Actually, I would make RANK() and ROW_NUMBER() act more like > >> aggregates. ISTM you have two kinds of window functions: > >> > >> - aggregation: a result is calculated over a set and the result copied > >> across all the rows. > >> - order depenadant: same as above, but the result is different for each > >> row. > >> > >> I think you could make the latter work using the current aggregation > >> setup, just by calling the final_func for each row rather than just > >> once at the end. > > > > AFAICS there's no overlap between windowed aggregates and normal > > aggregates, so we can different infrastructure for each. I like the > > suggestion of doing it very similarly to current aggregates, but I would > > introduce a new function hook for windowed aggregates, wfunc. > > I think there are two types of functions for windowed mode. > - windowed aggregate > this type of function is exactly same as normal aggregate. So we use > functions that have been in pgsql already. Actually in my patch above, > I didn't introduce any new function. This type of function includes > simply sum(), avg(), etc. which returns same values on a partition or > a window frame. > > - windowed function > this is the NEW type of function. I guess we should add a new function > type to pgsql. This type of function includes rank(), rank_dense(), > row_number(), etc. Windowed functions returns different values per > tuple. > > The difference between two types is if the function returns the same > value during a partition or different values. > > So, windowed aggregate and normal aggregate overlap each other. How > you know which one is that you see OVER clause in SQL just after the > function call. When you see OVER after func(), and pg_proc says it's > an aggregate, it's a windowed aggregate. Otherwise, it's a windowed > function. > > If I misunderstood about those definitions please correct me. Yes, I understand that and I think Martijn does also. I've done some thinking and rooting around on this and I think I have a different proposal for you, different to what we just discussed. SQL2008 specifies window functions as * rank functions * distribution functions: percent_rank() and cume_dist() * rownumber() * ntile() * lead() and lag() * first, last and n-th value functions * inverse distribution functions (similar to n-th value, based upon distribution function results) plus window aggregate functions (the normal aggregates COUNT, SUM etc) Now looking through all of those, I don't see *any* window functions that need access to different datatypes, or actually need to see the values of the attributes. The normal aggregates work with windows identically to the way they do without windows, so no change needed there. AFAICS we could define all of the non-aggregate window functions on the above list *without* defining them as functions in pg_proc. That would be a benefit because the window functions are very powerful and we'd need to give them access to any/all tuples in the window. So that would mean we don't provide a mechanism for user-defined windowed aggregate functions at all. Which solves the discussion about how to pass generic info through to them (at least long enough to get the first implementation done). We do already have such functions in code, e.g. greatest(). Sure they need to be defined in code, but we don't need to come up with a generic API for them. If you disagree, think about how we'd implement lag() or ntile() and what info we'd need to pass them. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
2008/7/6 Simon Riggs <simon@2ndquadrant.com>: >> I think there are two types of functions for windowed mode. >> - windowed aggregate >> this type of function is exactly same as normal aggregate. So we use >> functions that have been in pgsql already. Actually in my patch above, >> I didn't introduce any new function. This type of function includes >> simply sum(), avg(), etc. which returns same values on a partition or >> a window frame. >> >> - windowed function >> this is the NEW type of function. I guess we should add a new function >> type to pgsql. This type of function includes rank(), rank_dense(), >> row_number(), etc. Windowed functions returns different values per >> tuple. >> >> The difference between two types is if the function returns the same >> value during a partition or different values. >> >> So, windowed aggregate and normal aggregate overlap each other. How >> you know which one is that you see OVER clause in SQL just after the >> function call. When you see OVER after func(), and pg_proc says it's >> an aggregate, it's a windowed aggregate. Otherwise, it's a windowed >> function. >> >> If I misunderstood about those definitions please correct me. > > Yes, I understand that and I think Martijn does also. > > I've done some thinking and rooting around on this and I think I have a > different proposal for you, different to what we just discussed. > > SQL2008 specifies window functions as > > * rank functions > * distribution functions: percent_rank() and cume_dist() > * rownumber() > * ntile() > * lead() and lag() > * first, last and n-th value functions > * inverse distribution functions (similar to n-th value, based upon > distribution function results) > > plus window aggregate functions (the normal aggregates COUNT, SUM etc) > > Now looking through all of those, I don't see *any* window functions > that need access to different datatypes, or actually need to see the > values of the attributes. > > The normal aggregates work with windows identically to the way they do > without windows, so no change needed there. > > AFAICS we could define all of the non-aggregate window functions on the > above list *without* defining them as functions in pg_proc. That would > be a benefit because the window functions are very powerful and we'd > need to give them access to any/all tuples in the window. > > So that would mean we don't provide a mechanism for user-defined > windowed aggregate functions at all. Which solves the discussion about > how to pass generic info through to them (at least long enough to get > the first implementation done). > > We do already have such functions in code, e.g. greatest(). Sure they > need to be defined in code, but we don't need to come up with a generic > API for them. > > If you disagree, think about how we'd implement lag() or ntile() and > what info we'd need to pass them. Well, your idea is one of considerable choices. But I like pgsql's extensibility that enables pgsql more powerful DBMS. So, I design it as you propsed though trying to unify the function form somehow. Just idea, how about pass window object to a function? We'll provide window operation API then in the function you take window object through fcinfo: Datum func(PG_FUNCTION_ARGS) { Datum v; WindowObject w = get_window(fcinfo); HeapTuple htup_current = window_current_row(w); HeapTuple htup_prev = window_preceding(w,1); /* do something */ PG_RETURN_DATUM(v); } so that a function access whole the window. APIs include - current row - preceding row - following row - current key - preceding key - following key - iterate for the window where "key" means ORDER BY values in OVER clause. Fortunately, my patch uses tuplestore/tuplesort to create window, which allows random access operation such above. Is there security/performance issue about this? -- Hitoshi Harada
On Sun, 2008-07-06 at 17:39 +0900, H.Harada wrote: > Is there security/performance issue about this? Performance, yes. If we need access to more rows than will fit within work_mem we have a problem and will need to restart sort. Giving random access to all tuples in the current window, whatever its size would be very costly, which is why we have optimized that access for merge joins. So we need to know how far back access is required, if any - think of that as an "access window" definition. For example, rownumber() doesn't need access to prior tuples at all. lag(col, 1) requires access only to the prior row of the current window ntile() needs to know the size of the window before we begin processing In some cases the window itself is redefined for each tuple, e.g. avg() over (order by ... range between 5 preceeding and current row) In that case, we want the tuples no longer in the window to scroll out of memory. We already have the mechanism for this: a dynamic tuplestore (materialize node) in front of the tuplesort (sort node). Most of that tuning can be done after the initial implementation, but my point here is this: there needs to be a mechanism by which the window access requirements can be specified for a function so the executor can understand how to optimise access. So if you go the route of defining an extensible API then you must include this also. I know I rattle on about performance, but with window functions it will be critical to their usability to have them perform well. We can already do the types of analysis that window functions allow, it just requires hand written procedures to do it. So the window functions must perform acceptably well against very large tables (i.e. much bigger than memory). -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
2008/7/6 Simon Riggs <simon@2ndquadrant.com>: > > On Sun, 2008-07-06 at 17:39 +0900, H.Harada wrote: > >> Is there security/performance issue about this? > > Performance, yes. > > If we need access to more rows than will fit within work_mem we have a > problem and will need to restart sort. Giving random access to all > tuples in the current window, whatever its size would be very costly, > which is why we have optimized that access for merge joins. So we need > to know how far back access is required, if any - think of that as an > "access window" definition. Is this about tuplesort, not tuplestore? At a glance, tuplestore seems to be able to access tuples randomly without any performance problem, just fitting its pointer. So I thought planner should always insert Sort node before Window node to let Window not to sort, as I explained in the document. But anyways, I understand some kind of optimization mechanism to scroll in/out window is needed. > I know I rattle on about performance, but with window functions it will > be critical to their usability to have them perform well. We can already > do the types of analysis that window functions allow, it just requires > hand written procedures to do it. So the window functions must perform > acceptably well against very large tables (i.e. much bigger than > memory). I know, the probable use case is against large data set. There's no reason to add this feature if it is slower than self joins or other kludge methods. -- Hitoshi Harada
On Sat, Jul 05, 2008 at 07:04:29PM +0900, H.Harada wrote: > Hi, > > As I proposed a month before, I am working on window function. > Although this work is at quite early step, I would like to introduce > it since a part of it have been finished. If you can afford and are > interested in it, please review the document and patch, or compile the > applied source to execute an attached sample SQL. > > http://umitanuki.net/pgsql/wfv01/design.html > > Currently, only aggregation over window does work. I am planning to > work for the combination of window and normal aggregation from now on, > which I guess I can manage to do. > > The problem is, as written in the "Things to discussed" section of the > document, how you define window functions (e.g. RANK()). My idea is to > treat them as specialized functions such as SET OF functions and mark > it in pg_proc. But this doesn't resolve RANK() boundary problem. > > I am so happy with any kind of comments, reviews or critiques. > > Regards, Many people have been waiting for years for this functionality. Thanks so much for doing this. When will you have another patch that applies against CVS HEAD? Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
2008/7/17 David Fetter <david@fetter.org>: > On Sat, Jul 05, 2008 at 07:04:29PM +0900, H.Harada wrote: >> Hi, >> >> As I proposed a month before, I am working on window function. >> Although this work is at quite early step, I would like to introduce >> it since a part of it have been finished. If you can afford and are >> interested in it, please review the document and patch, or compile the >> applied source to execute an attached sample SQL. >> >> http://umitanuki.net/pgsql/wfv01/design.html >> >> Currently, only aggregation over window does work. I am planning to >> work for the combination of window and normal aggregation from now on, >> which I guess I can manage to do. >> >> The problem is, as written in the "Things to discussed" section of the >> document, how you define window functions (e.g. RANK()). My idea is to >> treat them as specialized functions such as SET OF functions and mark >> it in pg_proc. But this doesn't resolve RANK() boundary problem. >> >> I am so happy with any kind of comments, reviews or critiques. >> >> Regards, > > Many people have been waiting for years for this functionality. > Thanks so much for doing this. > > When will you have another patch that applies against CVS HEAD? > > Cheers, > David. > -- > David Fetter <david@fetter.org> http://fetter.org/ > Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter > Skype: davidfetter XMPP: david.fetter@gmail.com > > Remember to vote! > Consider donating to Postgres: http://www.postgresql.org/about/donate > As you see the patch is not completed yet, lacking RANK() etc. I am now struggling for this design which I assume will finish by the end of this month. Since then, I will apply it against HEAD. So another patch to -hackers will be around first half of August, or by the next Commit Fest at worst. There must be more discussion about the RANK() design after the post. But I hope it will be in 8.4. Regards, -- Hitoshi Harada