Re: Window functions patch v04 for the September commit fest - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Window functions patch v04 for the September commit fest |
Date | |
Msg-id | 48BD36BD.2070209@enterprisedb.com Whole thread Raw |
In response to | Re: Window functions patch v04 for the September commit fest ("Hitoshi Harada" <umi.tanuki@gmail.com>) |
Responses |
Re: Window functions patch v04 for the September commit fest
Re: Window functions patch v04 for the September commit fest |
List | pgsql-hackers |
Hitoshi Harada wrote: > 2008/9/2 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>: > In my understanding, the "Window Frame" is defined > by clauses such like "ROWS BETWEEN ... ", "RANGE BETWEEN ... " or so, > contrast to "Window Partition" defined by "PARTITION BY" clause. A > frame slides within a partition or there's only one frame if those > clauses are not specified. The current patch has not implemented this > yet. I'll update the docs. Yes, that's how I read it as well. Another way to think of it is that there's always a window frame, but if no ROWS BETWEEN or similar clause is given, the window frame spans the whole partition (or from the 1st row of the partition, up to the current row, if ORDER BY is given). > I don't like to call the second type "ranking aggregates" because it > may refer to only ranking functions though there are more types of > function like ntile(), lead() and lag(). But "window functions" > doesn't seem appropriate either since it's ambiguous with the general > name of window expressions. Yep, confusing :-(. The SQL standard says that "a window function is one of: a rank function, a distribution function, a row number function, a window aggregate function, the ntile function, the lead function, the lag function, the first-value function, the last-value function, the nth-value function". So, window aggregate functions are a special class of window functions, and there's no term to refer to all the rest of the window functions excluding window aggregate functions. Your doc SQL spec Window expression Window function Window function Any window function other than a window aggregate function Window aggregate Window aggregate function I tried to coin the term "ranking aggregate" for the SQL2008 term "Any window function other than a window aggregate function", but you're right that that's still confusing, because the SQL2008 term "rank function" includes only RANK() and DENSE_RANK(). The spec calls them "group aggregate functions", when they're used with GROUP BY, rather than as a window function. I think we could use that term. > Your proposal is smarter than the current implementation. But it > doesn't seem complete in some way. From logical point of view, the > window functions should be allowed to access whole of rows in the > frame the current row belongs to (c.f. inverse distribution > functions). By the whole of rows, do you mean a) the chosen value or expression of all rows, or b) all columns of the input rows? Different window functions have different needs. RANK() for example does need to see all columns, to compare them, but it only needs access to the previous and current row. CUME_DIST on the other hand needs access to all columns of all rows, and LAG needs access to a specific column of a fixed number of rows. And row_number needs nothing. The needs of access to the rows are so different that it seems best to me to delegate the buffering to the window function. Actually, looking closer to the ranking functions, they don't really need access to all columns. They just need to be able to compare them, according to the window ordering, so we should probably only provide access to the arguments of the aggregate, evaluated for any row in the frame, and a comparator function, that can determine if any given rows in the frame are <, = or >. > From implementing and performance point of view, there > need to consider such case like mixing window aggregates and ranking > aggregates in the same window, which means it is better that the two > types of functions are processed in the similar flow. Also logically, > SQL spec doesn't forbid ranking aggregates in sliding window frames. It doesn't? Oh, bummer. It seems we need a grand unified theory of ranking and other aggregates then :-(. How does something like RANK work with a window frame? What does it return, if the current row is excluded from the window frame, e.g with EXCLUDE CURRENT ROW? Let's look at the trivial, generic, and slow implementation first, and then figure out what tricks we can do to make it faster. I gather that that generic algorithm, following how the SQL spec defines the window frame, looks like this: 0. Construct the ordered set of tuples, containing all the tuples in the partition. For each row, start with the ordered set constructed in step 0, and: 1. Remove all tuples from the set that are before the start of the sliding frame 2. Remove all tuples from the set that are after the end of the sliding frame 3. Remove all tuples specified by the window-frame-exclusion clause (EXCLUDE CURRENT ROW/GROUP/TIES/NO OTHERS). 4. Run the aggregate over all tuples still in the set. Now, the first obvious optimization is that we don't want to construct the window frame from scratch for each row. Instead: 0. Start with an empty ordered set S. For each row: 1. Read forward until we hit the end of the new sliding frame, and add those rows to S. 2. Remove all tuples from S that are before the start of the sliding frame. 3. Construct a new set T, which is the same as S, but all tuples specified by the window-frame-exclusion clause (EXCLUDE CURRENT ROW/GROUP/TIES/NO OTHERS) are removed. 5. Run the aggregate over all tuples in T. This is applicable to all ranking and window aggregates. Per spec, if no window-frame-clause is given, the start of the frame is always the 1st row in the partition. If an ORDER BY is given, the end of the frame is the the current row. Otherwise it's the end of partition, which means that we can just run the aggregate once and return the same value for all rows in the partition. Now, let's consider how to speed up step 5. There's the difference between ranking and window aggregates. Window aggregates don't care about the position of the current row within the window frame, so the interface I proposed earlier, with the subfunc, would work fine. But for ranking aggregates, the position of the current row matters. Another must-have optimization is that we must be able to eagerly discard rows that we know won't be accessed anymore. And likewise, we mustn't read ahead many rows that we don't need to calculate the aggregate for the current row. I think I'm coming around to your suggestion of passing a Window object to the aggregate function, allowing random access to the current frame. As Simon pointed out, there needs to be a way for the aggregate function to signal when it doesn't need some rows in the frame anymore, for performance. The sliding window frame always moves forward, never backwards. Therefore it's sufficient if the user-defined function can indicate a cutoff point, allowing the executor node to discard any preceding rows. However, the window-frame-exclusion clause complicates that, because the current row and its peers might or might not be in the frame, which means that rows can disappear and reappear into the middle of the window frame. We don't need to implement window-frame-exclusion in this release, but we should consider how that affects the API. The API could be like this (in pseudo-syntax): CREATE AGGREGATE name ( <type> evalfunc(Window object) enterfunc(Window object, int pos) exitfunc(Window object,int pos) ) where evalfunc, enterfunc and exitfunc are functions provided by the aggregate function author. Window object { /* Returns the position of current row within the frame. */ int get_current_pos(); /* Returns the numberof rows in the frame. */ int get_max_pos(); /* Compares two tuple in the window frame, according to window ordering */ int compare(int posa, int posb) /* Returns the value of an aggregate argument for tuple at given position */ Datum get_arg_value(int pos, int argno); /* Promise that we won't call get_arg_value or compare for any pos< cutoff_pos anymore */ void signal_cutoff(int cutoff_pos); } The executor calls enterfunc for each row that enters the window frame, and exitfunc for each row that exits the frame (before removing the tuple from the frame, so that it's still accessible to get_arg_value() and compare()). After calling enter/exitfuncs for all rows that enter/exit the frame, evalfunc is called once, to return the value for the current row. I believe this allows for an efficient implementation of all the window functions we've discussed. It works fine for implementing regular aggregates as well, so there no need to treat ranking and other aggregates any differently. We can completely replace the current method of defining aggregates with this if we want to, as long as we provide some sort of glue for backwards-compatibility with aggregates in external projects. Here's sketches of some aggregate implementations using this API: RANK() ------ enterfunc: if position of the new row is > current row, do nothing. Otherwise, decrement current rank. exitfunc: if position of the new row is > current row, do nothing. Otherwise, increment current rank. LEAD(offset) ------------ evalfunc: Call get_row(get_current_row() + offset). Call signal_cutoff(get_current_row()); LAG(offset) ----------- evalfunc: Call get_row(get_current_row() - offset). Call signal_cutoff(get_current_row() - offset); MIN/MAX ------- enterfunc: if new value is </> the current min/max value, replace it with the new value exitfunc: if removed value = the current min/max value, rescan the current frame for new min/max value, iterating through all values (1..get_max_row()). COUNT ----- enterfunc: increment counter exitfunc: decrement counter CUME_DIST --------- CUME_DIST is defined as NP/NR, where NP is the number of rows preceding or peer with the current row, and NR is the total number of rows in the frame. NR = get_max_pos(). Maintain NP-counter in enterfunc/exitfunc, incrementing it whenever a row preceding current row enters or exits the frame. In evalfunc, increment the counter to reflect the new current row position. > I feel horrible in imagining to manage the combination of functions that > don't support subfunc, ones that does and ranking aggregates but it's > possible. Well, if you handle each window/ranking aggregate in a separate Window node, like you did, it should be much easier to manage. If it gets overwhelmingly complex, you could also split it into two slightly different Window nodes, one for the generic case, for aggregates that don't have subfunc, and one for ones that do. (the point is moot, if we go with the API I proposed above, using a Window object) All in all, this is such a monstrous piece of functionality, that if we try to implement everything in one patch, it'll never get finished. So we clearly have to implement this in phases, and just make sure we don't paint ourselves in the corner with the design. I'd suggest: 1. Implement Window node, with the capability to invoke an aggregate function, using the above API. Implement required parser/planner changes. Implement a few simple ranking aggregates using the API. 2. Implement glue to call normal aggregates through the new API 3. Implement the optimization to drop rows that are no longer needed (signal_cutoff can be a no-op until this phase) 4. Implement window framing (the frame can always be all rows in the partition, or all rows until the current row, until this phase) 5. Expose the new API to user-defined aggregates. It can be an internal API only used by built-in functions until this phase I believe you already have phase 1 in your patch, except for the API changes. >> PS. Have you looked at the "hypothetical set functions" in SQL2003? > > I had a glance but not deeply, since I found I would be lost in design > if deeply diving into it. :) Yeah ;-). A quick glance suggests that it won't affect aggregate the API we come up with, but will require changes to parser/executor. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
pgsql-hackers by date: