Thread: Combining Aggregates
KaiGai, David Rowley and myself have all made mention of various ways we could optimize aggregates.

Following WIP patch adds an extra function called a "combining function", that is intended to allow the user to specify a semantically correct way of breaking down an aggregate into multiple steps.

Gents, is this what you were thinking? If not...

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
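A minimal sketch of the idea (the COMBINEFUNC option name and the my_max aggregate below are hypothetical, not necessarily what the WIP patch implements): for an aggregate such as max(int4), whose transition function already takes two values of the state type, the same function could in principle serve as the combine function, since the maximum of two partial maxima is the overall maximum.

```sql
-- Sketch only: COMBINEFUNC is a hypothetical option name; int4larger is
-- the existing transition function used by the built-in max(int4).
CREATE AGGREGATE my_max(int4) (
    SFUNC       = int4larger,   -- transition: current state vs. next input value
    STYPE       = int4,
    COMBINEFUNC = int4larger    -- combine: one partial state vs. another
);
```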
On Wed, Dec 17, 2014 at 3:23 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
KaiGai, David Rowley and myself have all made mention of various ways
we could optimize aggregates.
Following WIP patch adds an extra function called a "combining
function", that is intended to allow the user to specify a
semantically correct way of breaking down an aggregate into multiple
steps.
Gents, is this what you were thinking? If not...
A quick look at the patch makes me assume that the patch does not handle the problem of combining transvals or move at all in that direction (which is fine, just reconfirming).
So, essentially, we are adding a "grand total" on top of individual sum() or count() operations, right?
Also, should we not have a sanity check for the user function provided?
On 17/12/14 11:02, Atri Sharma wrote:
> On Wed, Dec 17, 2014 at 3:23 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> KaiGai, David Rowley and myself have all made mention of various ways
>> we could optimize aggregates.
>>
>> Following WIP patch adds an extra function called a "combining
>> function", that is intended to allow the user to specify a
>> semantically correct way of breaking down an aggregate into multiple
>> steps.
>
> Also, should we not have a sanity check for the user function provided?

Looking at the code, yes, there seems to be an XXX comment about that.

--
Petr Jelinek                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 17 December 2014 at 10:02, Atri Sharma <atri.jiit@gmail.com> wrote:
> On Wed, Dec 17, 2014 at 3:23 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> KaiGai, David Rowley and myself have all made mention of various ways
>> we could optimize aggregates.
>>
>> Following WIP patch adds an extra function called a "combining
>> function", that is intended to allow the user to specify a
>> semantically correct way of breaking down an aggregate into multiple
>> steps.
>>
>> Gents, is this what you were thinking? If not...
>
> A quick look at the patch makes me assume that the patch does not handle the
> problem of combining transvals or move at all in that direction (which is
> fine, just reconfirming).

There are no applications of the new function at present. Each would
be similar, but unsure as to whether they would be identical.

> So, essentially, we are adding a "grand total" on top of individual sum() or
> count() operations, right?

The idea is to be able to do aggregation in multiple parts. For
example, allowing parallel query to have a plan like this

  Aggregate
    -> Gather   (subPlan is repeated N times on each parallel worker)
        -> Aggregate
            -> ParallelSeqScan   (scans a distinct portion of a table)

or to allow a serial plan where an aggregate was pushed down below a
join, like this

  CURRENT PLAN
  Aggregate
    -> Join
        -> Scan BaseTable1
        -> Scan BaseTable2

  PRE-AGGREGATED PLAN
  Aggregate
    -> Join
        -> PreAggregate (doesn't call finalfn)
            -> Scan BaseTable1
        -> Scan BaseTable2

and also allow the above plan to be replaced by a Matview plan like this

  Aggregate
    -> Join
        -> Scan BaseTable1.matviewA
        -> Scan BaseTable2

where we would assume that the contents of matviewA are
pre-aggregations that could be further combined.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 17 December 2014 at 22:53, Simon Riggs <simon@2ndquadrant.com> wrote:
KaiGai, David Rowley and myself have all made mention of various ways
we could optimize aggregates.
Following WIP patch adds an extra function called a "combining
function", that is intended to allow the user to specify a
semantically correct way of breaking down an aggregate into multiple
steps.
Gents, is this what you were thinking? If not...
Very much so! You must have missed my patch.

http://www.postgresql.org/message-id/CAApHDvrZG5Q9rNxU4WOga8AgvAwQ83bF83CFvMbOQcCg8vk=Zw@mail.gmail.com
The cases I think that may benefit from such infrastructure would be:
1. Parallel queries, where each worker could work on a portion of the tuples being aggregated and then the combine/merge function is called in the end in order to produce the final aggregated result. We currently don't yet have parallel query, so we can't really commit anything yet, for that reason.
2. Queries such as:
SELECT p.name, SUM(s.qty) FROM sales s INNER JOIN product p ON p.product_id = s.product_id GROUP BY p.name;
Such a query could be transformed into:
SELECT p.name, SUM(qty) FROM (SELECT product_id, SUM(qty) AS qty FROM sales GROUP BY product_id) s
INNER JOIN product p ON p.product_id = s.product_id GROUP BY p.name;
Of course the outer query's SUM and GROUP BY would not be required if there happened to be a UNIQUE index on product(name), but assuming there's not then the above should produce the results faster. This of course works ok for SUM(), but for something like AVG() or STDDEV() the combine/merge aggregate functions would be required to process those intermediate aggregate results that were produced by the sub-query.
Regards
David Rowley
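To see why a combine step (rather than simply re-aggregating final values) is needed for something like AVG(), here is a sketch reusing the sales and product tables from the example above; the (count, sum) pair stands in for the real, opaque avg() transition state:

```sql
-- Illustrative only: simulating partial aggregation plus a combine step
-- for avg().  Each partial group keeps (count, sum); "combining" is just
-- component-wise addition, and only the outer query finalizes by dividing.
WITH partial AS (
    SELECT product_id,
           count(qty) AS cnt,       -- partial state, per product
           sum(qty)   AS sum_qty
    FROM   sales
    GROUP  BY product_id
)
SELECT p.name,
       sum(s.sum_qty)::numeric / sum(s.cnt) AS avg_qty   -- combine, then finalize
FROM   partial s
INNER JOIN product p ON p.product_id = s.product_id
GROUP  BY p.name;
```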
On 17 December 2014 at 10:20, David Rowley <dgrowleyml@gmail.com> wrote:
> On 17 December 2014 at 22:53, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> KaiGai, David Rowley and myself have all made mention of various ways
>> we could optimize aggregates.
>>
>> Following WIP patch adds an extra function called a "combining
>> function", that is intended to allow the user to specify a
>> semantically correct way of breaking down an aggregate into multiple
>> steps.
>>
>> Gents, is this what you were thinking? If not...
>
> Very much so! You must have missed my patch.
>
> http://www.postgresql.org/message-id/CAApHDvrZG5Q9rNxU4WOga8AgvAwQ83bF83CFvMbOQcCg8vk=Zw@mail.gmail.com

Very strange that you should post an otherwise unrelated patch on
someone else's thread AND not add the patch to the CommitFest.

Stealth patch submission is a new one on me.

Looks like great minds think alike, at least. Or fools seldom differ.

Please add your patch to the CF app.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon,

Its concept is good to me. I think, the new combined function should be
responsible to take a state data type as argument and update state object
of the aggregate function. In other words, combined function performs like
transition function but can update state object according to the summary
of multiple rows. Right?

It also needs some enhancement around Aggref/AggrefExprState structure to
inform which function shall be called on execution time.
Combined functions are usually no-thank-you. AggrefExprState updates its
internal state using transition function row-by-row. However, once someone
push-down aggregate function across table joins, combined functions have
to be called instead of transition functions.
I'd like to suggest Aggref has a new flag to introduce this aggregate expects
state object instead of scalar value.

Also, I'd like to suggest one other flag in Aggref not to generate final
result, and returns state object instead.

Let me use David's example but little bit adjusted.

original)
SELECT p.name, AVG(s.qty)
  FROM sales s INNER JOIN product p ON s.product_id = s.product_id
 GROUP BY p.name;

modified)
SELECT p.name, AVG(qty)
  FROM (SELECT product_id, AVG(qty) AS qty FROM sales GROUP BY product_id) s
       INNER JOIN product p
       ON p.product_id = s.product_id GROUP BY p_name;

Let's assume the modifier set a flag of use_combine_func on the AVG(qty) of
the main query, and also set a flag of not_generate_final on the AVG(qty) of
the sub-query.
It shall work as we expected.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

> -----Original Message-----
> From: Simon Riggs [mailto:simon@2ndQuadrant.com]
> Sent: Wednesday, December 17, 2014 6:53 PM
> To: Kaigai Kouhei(海外 浩平); David Rowley; PostgreSQL-development; Amit Kapila
> Subject: Combining Aggregates
>
> KaiGai, David Rowley and myself have all made mention of various ways we
> could optimize aggregates.
>
> Following WIP patch adds an extra function called a "combining function",
> that is intended to allow the user to specify a semantically correct way
> of breaking down an aggregate into multiple steps.
>
> Gents, is this what you were thinking? If not...
>
> --
> Simon Riggs                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Dec 17, 2014 at 6:05 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
Simon,
Its concept is good to me. I think, the new combined function should be
responsible to take a state data type as argument and update state object
of the aggregate function. In other words, combined function performs like
transition function but can update state object according to the summary
of multiple rows. Right?
It also needs some enhancement around Aggref/AggrefExprState structure to
inform which function shall be called on execution time.
Combined functions are usually no-thank-you. AggrefExprState updates its
internal state using transition function row-by-row. However, once someone
push-down aggregate function across table joins, combined functions have
to be called instead of transition functions.
I'd like to suggest Aggref has a new flag to introduce this aggregate expects
state object instead of scalar value.
Also, I'd like to suggest one other flag in Aggref not to generate final
result, and returns state object instead.
So are you proposing not calling transfuncs at all and just use combined functions?
That sounds counterintuitive to me. I am not able to see why you would want to avoid transfns totally even for the case of pushing down aggregates that you mentioned.
From Simon's example mentioned upthread:
PRE-AGGREGATED PLAN
Aggregate
-> Join
-> PreAggregate (doesn't call finalfn)
-> Scan BaseTable1
-> Scan BaseTable2
finalfn wouldn't be called. Instead, combined function would be responsible for getting preaggregate results and combining them (unless of course, I am missing something).
Special casing transition state updating in Aggref seems like a bad idea to me. I would think that it would be better if we made it more explicit i.e. add a new node on top that does the combination (it would be primarily responsible for calling combined function).
Not a good source of inspiration, but seeing how SQL Server does it (Exchange operator + Stream Aggregate) seems intuitive to me, and having combination operation as a separate top node might be a cleaner way.
I may be wrong though.
Regards,
Atri
On 17 December 2014 at 12:35, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> Its concept is good to me. I think, the new combined function should be
> responsible to take a state data type as argument and update state object
> of the aggregate function. In other words, combined function performs like
> transition function but can update state object according to the summary
> of multiple rows. Right?

That wasn't how I defined it, but I think your definition is better.

> It also needs some enhancement around Aggref/AggrefExprState structure to
> inform which function shall be called on execution time.
> Combined functions are usually no-thank-you. AggrefExprState updates its
> internal state using transition function row-by-row. However, once someone
> push-down aggregate function across table joins, combined functions have
> to be called instead of transition functions.
> I'd like to suggest Aggref has a new flag to introduce this aggregate expects
> state object instead of scalar value.
>
> Also, I'd like to suggest one other flag in Aggref not to generate final
> result, and returns state object instead.
>
> Let me use David's example but little bit adjusted.
>
> original)
> SELECT p.name, AVG(s.qty)
>   FROM sales s INNER JOIN product p ON s.product_id = s.product_id
>  GROUP BY p.name;
>
> modified)
> SELECT p.name, AVG(qty)
>   FROM (SELECT product_id, AVG(qty) AS qty FROM sales GROUP BY product_id) s
>        INNER JOIN product p
>        ON p.product_id = s.product_id GROUP BY p_name;
>
> Let's assume the modifier set a flag of use_combine_func on the AVG(qty) of
> the main query, and also set a flag of not_generate_final on the AVG(qty) of
> the sub-query.
> It shall work as we expected.

That matches my thinking exactly.

David, if you can update your patch with some docs to explain the
behaviour, it looks complete enough to think about committing it in
early January, to allow other patches that depend upon it to stand a
chance of getting into 9.5. (It is not yet ready, but I see it could
be).

The above example is probably the best description of the need, since
user defined aggregates must also understand this.

Could we please call these "combine functions" or other? MERGE is an
SQL Standard statement type that we will add later, so it will be
confusing if we use the "merge" word in this context.

David, your patch avoids creating any mergefuncs for existing
aggregates. We would need to supply working examples for at least a
few of the builtin aggregates, so we can test the infrastructure. We
can add examples for all cases later.

Is there a way of testing this in existing code? Or do we need to add
something to test it?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Dec 17, 2014 at 7:18 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 17 December 2014 at 12:35, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> Its concept is good to me. I think, the new combined function should be
> responsible to take a state data type as argument and update state object
> of the aggregate function. In other words, combined function performs like
> transition function but can update state object according to the summary
> of multiple rows. Right?
That wasn't how I defined it, but I think your definition is better.
> It also needs some enhancement around Aggref/AggrefExprState structure to
> inform which function shall be called on execution time.
> Combined functions are usually no-thank-you. AggrefExprState updates its
> internal state using transition function row-by-row. However, once someone
> push-down aggregate function across table joins, combined functions have
> to be called instead of transition functions.
> I'd like to suggest Aggref has a new flag to introduce this aggregate expects
> state object instead of scalar value.
>
> Also, I'd like to suggest one other flag in Aggref not to generate final
> result, and returns state object instead.
>
> Let me use David's example but little bit adjusted.
>
> original)
> SELECT p.name, AVG(s.qty)
> FROM sales s INNER JOIN product p ON s.product_id = s.product_id
> GROUP BY p.name;
>
> modified)
> SELECT p.name, AVG(qty)
> FROM (SELECT product_id, AVG(qty) AS qty FROM sales GROUP BY product_id) s
> INNER JOIN product p
> ON p.product_id = s.product_id GROUP BY p_name;
>
> Let's assume the modifier set a flag of use_combine_func on the AVG(qty) of
> the main query, and also set a flag of not_generate_final on the AVG(qty) of
> the sub-query.
> It shall work as we expected.
That matches my thinking exactly.
David, if you can update your patch with some docs to explain the
behaviour, it looks complete enough to think about committing it in
early January, to allow other patches that depend upon it to stand a
chance of getting into 9.5. (It is not yet ready, but I see it could
be).
The above example is probably the best description of the need, since
user defined aggregates must also understand this.
Could we please call these "combine functions" or other? MERGE is an
SQL Standard statement type that we will add later, so it will be
confusing if we use the "merge" word in this context.
David, your patch avoids creating any mergefuncs for existing
aggregates. We would need to supply working examples for at least a
few of the builtin aggregates, so we can test the infrastructure. We
can add examples for all cases later.
I am still missing how and why we skip transfns. What am I missing, please?
On 18 December 2014 at 01:31, Simon Riggs <simon@2ndquadrant.com> wrote:
On 17 December 2014 at 10:20, David Rowley <dgrowleyml@gmail.com> wrote:
> On 17 December 2014 at 22:53, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> KaiGai, David Rowley and myself have all made mention of various ways
>> we could optimize aggregates.
>>
>> Following WIP patch adds an extra function called a "combining
>> function", that is intended to allow the user to specify a
>> semantically correct way of breaking down an aggregate into multiple
>> steps.
>>
>> Gents, is this what you were thinking? If not...
>>
>
> Very much so! You must have missed my patch.
>
> http://www.postgresql.org/message-id/CAApHDvrZG5Q9rNxU4WOga8AgvAwQ83bF83CFvMbOQcCg8vk=Zw@mail.gmail.com
Very strange that you should post an otherwise unrelated patch on
someone else's thread AND not add the patch to the CommitFest.
Stealth patch submission is a new one on me.
Apologies about that, it was a bad decision.
I had thought that it's a bit of a chicken and the egg problem... This is the egg, we just need a chicken to come and lay it.
I had imagined that it would be weird to commit something that's dead in code and not all that testable until someone adds some other code to utilise it.
Regards
David Rowley
On 18 December 2014 at 02:48, Simon Riggs <simon@2ndquadrant.com> wrote:
On 17 December 2014 at 12:35, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> Its concept is good to me. I think, the new combined function should be
> responsible to take a state data type as argument and update state object
> of the aggregate function. In other words, combined function performs like
> transition function but can update state object according to the summary
> of multiple rows. Right?
That wasn't how I defined it, but I think your definition is better.
> It also needs some enhancement around Aggref/AggrefExprState structure to
> inform which function shall be called on execution time.
> Combined functions are usually no-thank-you. AggrefExprState updates its
> internal state using transition function row-by-row. However, once someone
> push-down aggregate function across table joins, combined functions have
> to be called instead of transition functions.
> I'd like to suggest Aggref has a new flag to introduce this aggregate expects
> state object instead of scalar value.
>
> Also, I'd like to suggest one other flag in Aggref not to generate final
> result, and returns state object instead.
>
> Let me use David's example but little bit adjusted.
>
> original)
> SELECT p.name, AVG(s.qty)
> FROM sales s INNER JOIN product p ON s.product_id = s.product_id
> GROUP BY p.name;
>
> modified)
> SELECT p.name, AVG(qty)
> FROM (SELECT product_id, AVG(qty) AS qty FROM sales GROUP BY product_id) s
> INNER JOIN product p
> ON p.product_id = s.product_id GROUP BY p_name;
>
> Let's assume the modifier set a flag of use_combine_func on the AVG(qty) of
> the main query, and also set a flag of not_generate_final on the AVG(qty) of
> the sub-query.
> It shall work as we expected.
That matches my thinking exactly.
David, if you can update your patch with some docs to explain the
behaviour, it looks complete enough to think about committing it in
early January, to allow other patches that depend upon it to stand a
chance of getting into 9.5. (It is not yet ready, but I see it could
be).
Yes I'll add something to it and post here.
The above example is probably the best description of the need, since
user defined aggregates must also understand this.
Could we please call these "combine functions" or other? MERGE is an
SQL Standard statement type that we will add later, so it will be
confusing if we use the "merge" word in this context.
Yeah I think you're right, combine may help remove some confusion when we get MERGE.
David, your patch avoids creating any mergefuncs for existing
aggregates. We would need to supply working examples for at least a
few of the builtin aggregates, so we can test the infrastructure. We
can add examples for all cases later.
I added merge/combine functions for all the aggregates I could do by making use of existing functions. I did all the MAX() and MIN() ones and bit_and(), bit_or(), and a few sum() ones that didn't have a final function.
It felt a bit premature to add new functions to support avg and stddev, since that's probably the bulk of the work.
Is there a way of testing this in existing code? Or do we need to add
something to test it?
I can't think of any way to test it, apart from the create aggregate syntax test I also added.
Standalone calls to the combine/merge functions I don't think would be testing anything new.
That's the reason I thought it wasn't really acceptable until we have a use for this. That's why I posted on the thread about parallel seqscan. I hoped that Amit could add something which needed it and I could get it committed that way.
Regards
David Rowley
On 17 December 2014 at 19:06, David Rowley <dgrowleyml@gmail.com> wrote:
> Standalone calls to the combine/merge functions I don't think would be
> testing anything new.

Guess it's a simple enough API, doesn't really need a specific test.

> That's the reason I thought it wasn't really acceptable until we have a use
> for this. That's why I posted on the thread about parallel seqscan. I hoped
> that Amit could add something which needed it and I could get it committed
> that way.

My view is that it's infrastructure to cover a variety of possibilities.
It's best if it goes in with enough time to get some use cases soon. If
not, it's there at the start of the cycle for the next release.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi Atri,

> So are you proposing not calling transfuncs at all and just use combined
> functions?

No. It is at the discretion of the software component that distributes an
aggregate into multiple partial-aggregates.

> That sounds counterintuitive to me. I am not able to see why you would want
> to avoid transfns totally even for the case of pushing down aggregates that
> you mentioned.
>
> From Simon's example mentioned upthread:
>
> PRE-AGGREGATED PLAN
> Aggregate
>   -> Join
>       -> PreAggregate (doesn't call finalfn)
>           -> Scan BaseTable1
>       -> Scan BaseTable2
>
> finalfn wouldn't be called. Instead, combined function would be responsible
> for getting preaggregate results and combining them (unless of course, I
> am missing something).

In this case, the aggregate distributor is responsible to mark Aggref
correctly. Aggref in the Aggregate node will call the combined-function,
then the final-function to generate the expected result, but no
transition-function shall be called because we expect state data as its
input stream. On the other hand, Aggref in the PreAggregate node will call
the transition-function to accumulate its state-data, but does not call
the final-function because it is expected to return the state-data as is.

> Special casing transition state updating in Aggref seems like a bad idea
> to me. I would think that it would be better if we made it more explicit
> i.e. add a new node on top that does the combination (it would be primarily
> responsible for calling combined function).

I'm neutral towards the above idea. There is no difference in the amount
of information, is there? If we define explicit node types, instead of
special flags, it seems to me the following new nodes are needed:
- Aggref that takes individual rows and populates a state data (trans + no-final)
- Aggref that takes state data and populates a state data (combined + no-final)
- Aggref that takes state data and populates a final result (combined + final)

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On 18 December 2014 at 02:48, Simon Riggs <simon@2ndquadrant.com> wrote:
David, if you can update your patch with some docs to explain the
behaviour, it looks complete enough to think about committing it in
early January, to allow other patches that depend upon it to stand a
chance of getting into 9.5. (It is not yet ready, but I see it could
be).
Sure, I've more or less added the docs from your patch. I still need to trawl through and see if there's anywhere else that needs some additions.
The above example is probably the best description of the need, since
user defined aggregates must also understand this.
Could we please call these "combine functions" or other? MERGE is an
SQL Standard statement type that we will add later, so it will be
confusing if we use the "merge" word in this context.
Ok, changed.
David, your patch avoids creating any mergefuncs for existing
aggregates. We would need to supply working examples for at least a
few of the builtin aggregates, so we can test the infrastructure. We
can add examples for all cases later.
In addition to MIN(), MAX(), BIT_AND(), BIT_OR(), and SUM() for floating point types, cash and interval, I've now added combine functions for count(*) and count(col). It seems that int8pl() is suitable for this.
Do you think it's worth adding any new functions at this stage, or is it best to wait until there's a patch which can use these?
Regards
David Rowley
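As a side note on why existing functions can double as combine functions here: the combine step is just the transition operation applied to two partial states rather than to a state and an input row. A couple of throwaway examples (illustrative only):

```sql
-- count(*) keeps a running int8, so merging two partial counts is plain
-- addition; int8pl(int8, int8) already does exactly that:
SELECT int8pl(3, 4);        -- 7, as if combining two partial counts

-- max(int8) keeps the largest value seen so far, so its existing
-- transition function can also combine two partial maxima:
SELECT int8larger(10, 42);  -- 42
```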
Hi Rowley,

Let me put some comments on this patch.

This patch itself looks good as an infrastructure towards the big picture,
however, we still don't reach the consensus how combined functions are used
instead of usual translation functions.

Aggregate function usually consumes one or more values extracted from a
tuple, then it accumulates its internal state according to the argument.
Exiting transition function performs to update its internal state with
assumption of a function call per records. On the other hand, new combined
function allows to update its internal state with partial aggregated values
which is processed by preprocessor node.

An aggregate function is represented with Aggref node in plan tree, however,
we have no certain way to determine which function shall be called to update
internal state of aggregate.
For example, avg(float) has an internal state with float[3] type for number
of rows, sum of X and X^2. If combined function can update its internal
state with partially aggregated values, its argument should be float[3].
It is obviously incompatible to float8_accum(float) that is transition
function of avg(float).
I think, we need a new flag on Aggref node to inform executor which function
shall be called to update internal state of aggregate. Executor cannot
decide it without this hint.

Also, do you have idea to push down aggregate function across joins?
Even though it is a bit old research, I could find a systematic approach
to push down aggregate across join.
https://cs.uwaterloo.ca/research/tr/1993/46/file.pdf
I think, it is great if core functionality support this query rewriting
feature based on cost estimation, without external modules.

How about your opinions?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of David Rowley
> Sent: Friday, December 19, 2014 8:39 PM
> To: Simon Riggs
> Cc: Kaigai Kouhei(海外 浩平); PostgreSQL-development; Amit Kapila
> Subject: Re: [HACKERS] Combining Aggregates
>
> On 18 December 2014 at 02:48, Simon Riggs <simon@2ndquadrant.com> wrote:
>
>> David, if you can update your patch with some docs to explain the
>> behaviour, it looks complete enough to think about committing it in
>> early January, to allow other patches that depend upon it to stand a
>> chance of getting into 9.5. (It is not yet ready, but I see it could
>> be).
>
> Sure, I've more or less added the docs from your patch. I still need to
> trawl through and see if there's anywhere else that needs some additions.
>
>> The above example is probably the best description of the need, since
>> user defined aggregates must also understand this.
>>
>> Could we please call these "combine functions" or other? MERGE is an
>> SQL Standard statement type that we will add later, so it will be
>> confusing if we use the "merge" word in this context.
>
> Ok, changed.
>
>> David, your patch avoids creating any mergefuncs for existing
>> aggregates. We would need to supply working examples for at least a
>> few of the builtin aggregates, so we can test the infrastructure. We
>> can add examples for all cases later.
>
> In addition to MIN(), MAX(), BIT_AND(), BIT_OR, SUM() for floating point
> types, cash and interval. I've now added combine functions for count(*)
> and count(col). It seems that int8pl() is suitable for this.
>
> Do you think it's worth adding any new functions at this stage, or is it
> best to wait until there's a patch which can use these?
>
> Regards
>
> David Rowley
Hi,

On 18.2.2015 09:13, Kouhei Kaigai wrote:
>> In addition to MIN(), MAX(), BIT_AND(), BIT_OR, SUM() for floating
>> point types, cash and interval. I've now added combine functions
>> for count(*) and count(col). It seems that int8pl() is suitable for
>> this.
>>
>> Do you think it's worth adding any new functions at this stage, or
>> is it best to wait until there's a patch which can use these?

A few comments about this patch:

1) another use case / grouping sets
-----------------------------------

I agree this would be nice to have in-core. I remember discussing this
functionality (combining partial aggregate results) as an alternative
implementation to grouping sets. IIRC the grouping sets patch implements
this by repeatedly sorting the tuples, but in some cases we could compute
the aggregates at the most detailed level, and then build the results by
combining the partial results. Just an idea, at this moment, though.

2) serialize/deserialize functions
----------------------------------

This thread mentions "parallel queries" as a use case, but that means
passing data between processes, and that requires being able to serialize
and deserialize the aggregate state somehow. For actual data types that's
not overly difficult I guess (we can use in/out functions), but what about
aggregates using 'internal' state? I.e. aggregates passing pointers that
we can't serialize? And we do have plenty of those, including things like

  array_agg, avg, cume_dist, dense_rank, json_agg, jsonb_agg,
  jsonb_object_agg, json_object_agg, mode, percentile_cont,
  percentile_disc, percent_rank, rank, stddev, stddev_pop, stddev_samp,
  string_agg, sum, variance, var_pop, var_samp

Those are pretty important aggregates IMHO, and ISTM we won't be able to
use them with this patch. Or am I missing something?

So maybe this patch should include support for serialize/deserialize
functions too? Or maybe a follow-up patch ... I'm not entirely comfortable
with a patch without an actual use case, except for a simple example. But
maybe that's OK.

FWIW the serialize/deserialize functions would be useful also for
implementing a truly 'memory-bounded hash aggregate' (discussed elsewhere,
currently stuck because of difficulty with implementing memory accounting).
So that's yet another use case for this (both the 'combine' function and
the 'serialize/deserialize').

regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
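One way to get a feel for how many aggregates this affects is to list those whose transition state is the pointer-based 'internal' pseudotype; a rough catalog query (sketch only, using the existing pg_aggregate columns):

```sql
-- Sketch: list built-in aggregates whose transition state is 'internal',
-- i.e. those whose state could not be passed between processes without
-- some kind of serialize/deserialize support.
SELECT a.aggfnoid::oid::regprocedure AS aggregate,
       a.aggtranstype::regtype       AS state_type
FROM   pg_aggregate a
WHERE  a.aggtranstype = 'internal'::regtype;
```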
Is there a case where the combining function is different from the transition function, other than for count?
On 20.2.2015 21:01, Peter Eisentraut wrote:
> Is there a case where the combining function is different from the
> transition function, other than for count?

It's different in all the cases when the aggregate state is not identical
to a single value - for example the usual avg(), sum() and stddev()
aggregates keep state which is equal to

  {count(X), sum(X), sum(X*X)}

The 'combine' function gets two such 'state' values, while transition gets
'state' + next value.

I'm inclined to say that 'combinefn == transfn' is a minority case.

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
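A toy illustration of that, with made-up numbers: combining two {count, sum, sum(X*X)} states is just component-wise addition, after which the final step can still compute an exact variance.

```sql
-- Toy example: partial state A covers {2, 4, 6}, partial state B covers
-- {4, 5}.  Combining the two states component-wise and finalizing gives
-- the same answer as var_samp() over all five values (2.2).
WITH states(n, s, ss) AS (
    VALUES (3::float8, 12::float8, 56::float8),   -- state A: count, sum, sum of squares
           (2::float8,  9::float8, 41::float8)    -- state B
), combined AS (
    SELECT sum(n) AS n, sum(s) AS s, sum(ss) AS ss FROM states
)
SELECT (ss - s * s / n) / (n - 1) AS var_samp FROM combined;
```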
On 2/20/15 3:09 PM, Tomas Vondra wrote:
> On 20.2.2015 21:01, Peter Eisentraut wrote:
>> Is there a case where the combining function is different from the
>> transition function, other than for count?
>
> It's different in all the cases when the aggregate state is not
> identical to a single value - for example the usual avg(), sum() and
> stddev() aggregates keep state which is equal to
>
> {count(X), sum(X), sum(X*X)}
>
> The 'combine' function gets two such 'state' values, while transition
> gets 'state' + next value.

That's just because the count is hidden there in an opaque custom
transition function. If, say, we had instead an array of transition
functions {inc, plus, plussq} and we knew that plus and plussq are
associative operators, all we'd need to special case is the count case.
This would avoid a lot of repetitive code for stddev, avg, etc.

(As a bonus, you could use this knowledge to compute count, sum, avg, and
stddev in parallel using only three counters.)
On 20.2.2015 21:23, Peter Eisentraut wrote:
> On 2/20/15 3:09 PM, Tomas Vondra wrote:
>> On 20.2.2015 21:01, Peter Eisentraut wrote:
>>> Is there a case where the combining function is different from the
>>> transition function, other than for count?
>>
>> It's different in all the cases when the aggregate state is not
>> identical to a single value - for example the usual avg(), sum() and
>> stddev() aggregates keep state which is equal to
>>
>> {count(X), sum(X), sum(X*X)}
>>
>> The 'combine' function gets two such 'state' values, while transition
>> gets 'state' + next value.
>
> That's just because the count is hidden there in an opaque custom
> transition function. If, say, we had instead an array of transition
> functions {inc, plus, plussq} and we knew that plus and plussq are
> associative operators, all we'd need to special case is the count
> case. This would avoid a lot of repetitive code for stddev, avg, etc.

Ummm, I'm not entirely sure I understand that, but the main point was
that the current implementation does not work like that. We have no idea
what transition functions are transitive, and we do have opaque aggregate
states.

Also, there are aggregate functions like array_agg() or string_agg() that
make this impossible, just like for many custom aggregates (like
hyperloglog for example). Again, I might not understand the idea
correctly ...

> (As a bonus, you could use this knowledge to compute count, sum, avg,
> and stddev in parallel using only three counters.)

Yeah, sharing the aggregate state by multiple aggregates would be nice.

regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/20/15 3:32 PM, Tomas Vondra wrote:
>> That's just because the count is hidden there in an opaque custom
>> transition function. If, say, we had instead an array of transition
>> functions {inc, plus, plussq} and we knew that plus and plussq are
>> associative operators, all we'd need to special case is the count
>> case. This would avoid a lot of repetitive code for stddev, avg, etc.
>
> Ummm, I'm not entirely sure I understand that, but the main point was
> that the current implementation does not work like that. We have no idea
> what transition functions are transitive, and we do have opaque
> aggregate states.

Well, my point is that you could make it work that way and make your
current patch a lot smaller and simpler.

> Also, there are aggregate functions like array_agg() or string_agg()
> that make this impossible, just like for many custom aggregates (like
> hyperloglog for example). Again, I might not understand the idea
> correctly ...

How would a combining function work for something like array_agg()? I
don't think it would, at least if you want to preserve the ordering
option for the user.
On 2/20/15 3:09 PM, Tomas Vondra wrote:
> The 'combine' function gets two such 'state' values, while transition
> gets 'state' + next value.

I think the combine function is not actually a property of the aggregate,
but a property of the transition function. If two aggregates have the same
transition function, they will also have the same combine function. The
combine function really just says, how do you combine two series of these
function calls. Say maybe this should be put into pg_proc instead. (Or you
make the transition functions transition operators and put the info into
pg_operator instead, which is where function call optimization information
is usually kept.)
On Tue, Feb 24, 2015 at 2:20 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On 2/20/15 3:09 PM, Tomas Vondra wrote:
>> The 'combine' function gets two such 'state' values, while transition
>> gets 'state' + next value.
>
> I think the combine function is not actually a property of the
> aggregate, but a property of the transition function. If two aggregates
> have the same transition function, they will also have the same combine
> function. The combine function really just says, how do you combine two
> series of these function calls. Say maybe this should be put into
> pg_proc instead. (Or you make the transition functions transition
> operators and put the info into pg_operator instead, which is where
> function call optimization information is usually kept.)

This seems like a weird design to me. It's probably true that if the
transition function is the same, the state-combiner function will also be
the same. But the state-combiner function is only going to exist for
aggregate transition functions, not functions or operators in general. So
linking it from pg_proc or pg_operator feels wrong to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Feb 24, 2015 at 2:20 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> I think the combine function is not actually a property of the
>> aggregate, but a property of the transition function. If two aggregates
>> have the same transition function, they will also have the same combine
>> function. The combine function really just says, how do you combine two
>> series of these function calls. Say maybe this should be put into
>> pg_proc instead. (Or you make the transition functions transition
>> operators and put the info into pg_operator instead, which is where
>> function call optimization information is usually kept.)

> This seems like a weird design to me. It's probably true that if the
> transition function is the same, the state-combiner function will also
> be the same. But the state-combiner function is only going to exist
> for aggregate transition functions, not functions or operators in
> general. So linking it from pg_proc or pg_operator feels wrong to me.

FWIW, I agree with Robert. It's better to keep this in pg_aggregate. We
don't need yet another mostly-empty column in pg_proc; and as for
pg_operator, there is no expectation that an aggregate transition function
is in pg_operator.

Also, I don't think it's a foregone conclusion that same transition
function implies same combiner function: that logic falls apart when you
consider that one of them might be polymorphic and the other not.
(Admittedly, that probably means the aggregate creator is missing a bet;
but we should not design the catalog schema in a way that says that only
maximally efficient aggregate designs are allowed.)

			regards, tom lane
On 25 February 2015 at 08:15, Peter Eisentraut <peter_e@gmx.net> wrote:
On 2/20/15 3:32 PM, Tomas Vondra wrote:
> Also, there are aggregate functions like array_agg() or string_agg()
> that make this impossible, just like for many custom aggregates (like
> hyperloglog for example). Again, I might not understand the idea
> correctly ...
How would a combining function work for something like array_agg()? I
don't think it would, at least if you want to preserve the ordering
option for the user.
They just wouldn't work in that case. We'd simply not have a combine function for that aggregate.
In the yet-to-be-written code (say, parallel hash aggregate), the planner would have to ensure that each aggregate function being used had a combine function; if any aggregate in the current query level didn't have one then it would not parallelise the query.
Regards
David Rowley
On 18 February 2015 at 21:13, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
This patch itself looks good as an infrastructure towards
the big picture, however, we still don't reach the consensus
how combined functions are used instead of usual translation
functions.
Thank you for taking the time to look at the patch.
Aggregate function usually consumes one or more values extracted
from a tuple, then it accumulates its internal state according
to the argument. Exiting transition function performs to update
its internal state with assumption of a function call per records.
On the other hand, new combined function allows to update its
internal state with partial aggregated values which is processed
by preprocessor node.
An aggregate function is represented with Aggref node in plan tree,
however, we have no certain way to determine which function shall
be called to update internal state of aggregate.
This is true, there's nothing done in the planner to set any sort of state in the aggregation nodes to tell them whether to call the final function or not. It's quite hard to know how far to go with this patch. It's really only intended to provide the necessary infrastructure for things like parallel query and various other possible usages of aggregate combine functions. I don't think it's really appropriate for this patch to go adding such a property to any nodes, as there would still be nothing in the planner to actually set those properties... The only thing I can think of to get around this is to implement the simplest use for combine aggregate functions; the problem with that is that the simplest case is not at all simple.
For example, avg(float) has an internal state with float[3] type
for number of rows, sum of X and X^2. If combined function can
update its internal state with partially aggregated values, its
argument should be float[3]. It is obviously incompatible to
float8_accum(float) that is transition function of avg(float).
I think, we need a new flag on Aggref node to inform executor
which function shall be called to update internal state of
aggregate. Executor cannot decide it without this hint.
Also, do you have idea to push down aggregate function across
joins? Even though it is a bit old research, I could find
a systematic approach to push down aggregate across join.
https://cs.uwaterloo.ca/research/tr/1993/46/file.pdf
I've not read the paper yet, but I do have a very incomplete WIP patch to do this. I've just not had much time to work on it.
I think, it is great if core functionality support this query
rewriting feature based on cost estimation, without external
modules.
Regards
David Rowley
On 21 February 2015 at 07:16, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
2) serialize/deserialize functions
----------------------------------
This thread mentions "parallel queries" as a use case, but that means
passing data between processes, and that requires being able to
serialize and deserialize the aggregate state somehow. For actual data
types that's not overly difficult I guess (we can use in/out functions),
but what about aggregates using 'internal' state? I.e. aggregates
passing pointers that we can't serialize?
This is a good question. I really don't know the answer to it as I've not looked at the Robert's API for passing data between backends yet.
Maybe Robert or Amit can answer this?
Regards
David Rowley
On Wed, Mar 4, 2015 at 4:41 AM, David Rowley <dgrowleyml@gmail.com> wrote:
>> This thread mentions "parallel queries" as a use case, but that means
>> passing data between processes, and that requires being able to
>> serialize and deserialize the aggregate state somehow. For actual data
>> types that's not overly difficult I guess (we can use in/out functions),
>> but what about aggregates using 'internal' state? I.e. aggregates
>> passing pointers that we can't serialize?
>
> This is a good question. I really don't know the answer to it as I've not
> looked at the Robert's API for passing data between backends yet.
>
> Maybe Robert or Amit can answer this?

I think parallel aggregation will probably require both the infrastructure
discussed here, namely the ability to combine two transition states into a
single transition state, and also the ability to serialize and de-serialize
transition states, which has previously been discussed in the context of
letting hash aggregates spill to disk. My current thinking is that the
parallel plan will look something like this:

  HashAggregateFinish
    -> Funnel
        -> HashAggregatePartial
            -> PartialHeapScan

So the workers will all pull from a partial heap scan and each worker will
aggregate its own portion of the data. Then it'll need to pass the results
of that step back to the master, which will aggregate the partial results
and produce the final results. I'm guessing that if we're grouping on, say,
column a, the HashAggregatePartial nodes will kick out 2-column tuples of
the form (<value for a>, <serialized transition state for group with that
value for a>).

Of course this is all pie in the sky right now, but I think it's a pretty
good bet that partial aggregation is a useful thing to be able to do.
Postgres-XC put a bunch of work into this, too, so there's lots of people
saying "hey, we want that".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
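Purely as an illustration of that shape, using a made-up table t(id, a, x): each CTE below plays the role of one worker producing (group key, partial state) tuples, and the outer query plays the master that combines them.

```sql
-- Sketch only: t(id, a, x) is a hypothetical table; the two CTEs stand in
-- for two workers, each aggregating its own slice of the rows, and the
-- outer query stands in for the master combining the per-group partials.
WITH worker1 AS (
    SELECT a, count(x) AS partial FROM t WHERE id <= 1000 GROUP BY a
), worker2 AS (
    SELECT a, count(x) AS partial FROM t WHERE id > 1000 GROUP BY a
)
SELECT a, sum(partial) AS cnt        -- the "combine" step for count()
FROM  (SELECT * FROM worker1 UNION ALL SELECT * FROM worker2) AS partials
GROUP BY a;
```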
> On Wed, Mar 4, 2015 at 4:41 AM, David Rowley <dgrowleyml@gmail.com> wrote:
> >> This thread mentions "parallel queries" as a use case, but that means
> >> passing data between processes, and that requires being able to
> >> serialize and deserialize the aggregate state somehow. For actual data
> >> types that's not overly difficult I guess (we can use in/out functions),
> >> but what about aggregates using 'internal' state? I.e. aggregates
> >> passing pointers that we can't serialize?
> >
> > This is a good question. I really don't know the answer to it as I've not
> > looked at the Robert's API for passing data between backends yet.
> >
> > Maybe Robert or Amit can answer this?
>
> I think parallel aggregation will probably require both the
> infrastructure discussed here, namely the ability to combine two
> transition states into a single transition state, and also the ability
> to serialize and de-serialize transition states, which has previously
> been discussed in the context of letting hash aggregates spill to
> disk. My current thinking is that the parallel plan will look
> something like this:
>
>   HashAggregateFinish
>     -> Funnel
>         -> HashAggregatePartial
>             -> PartialHeapScan
>
> So the workers will all pull from a partial heap scan and each worker
> will aggregate its own portion of the data. Then it'll need to pass
> the results of that step back to the master, which will aggregate the
> partial results and produce the final results. I'm guessing that if
> we're grouping on, say, column a, the HashAggregatePartial nodes will
> kick out 2-column tuples of the form (<value for a>, <serialized
> transition state for group with that value for a>).

It may not be an urgent topic to be discussed towards v9.5, however,
I'd like to raise a topic about planner and aggregates.

Once we could get the two phase aggregation, planner needs to pay
attention possibility of partial aggregate during plan construction,
even though our current implementation attach Agg node after the
join/scan plan construction.

Probably, a straightforward design is to add FunnelPath with some
child nodes on set_rel_pathlist() or add_paths_to_joinrel().
Its child node may be PartialAggregate node (or some other parallel
safe node of course). In this case, it must inform the planner this
node (for more correctness, targetlist of the node) returns partial
aggregation, instead of the values row-by-row.

Then, planner need to track and care about which type of data shall
be returned to the upper node. Path node may have a flag to show it,
and we also may need to inject dummy PartialAggregate if we try to
join a relation that returns values row-by-row and another one that
returns partial aggregate.

It also leads another problem. The RelOptInfo->targetlist will
depend on the Path node chosen. Even if float datum is expected as
an argument of AVG(), its state to be combined is float[3].
So, we will need to adjust the targetlist of RelOptInfo, once Path
got chosen.

Anyway, I could find out, at least, these complicated issues around
two-phase aggregate integration with planner. Someone can suggest
minimum invasive way for these integration?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On Wed, Mar 4, 2015 at 11:00 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> Anyway, I could find out, at least, these complicated issues around
> two-phase aggregate integration with planner. Someone can suggest
> minimum invasive way for these integration?

I don't think I have the brain space to think about that until we get the
basic parallelism stuff in.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Mar 5, 2015 at 9:30 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> On Wed, Mar 4, 2015 at 4:41 AM, David Rowley <dgrowleyml@gmail.com> wrote:
> >> This thread mentions "parallel queries" as a use case, but that means
> >> passing data between processes, and that requires being able to
> >> serialize and deserialize the aggregate state somehow. For actual data
> >> types that's not overly difficult I guess (we can use in/out functions),
> >> but what about aggregates using 'internal' state? I.e. aggregates
> >> passing pointers that we can't serialize?
> >
> > This is a good question. I really don't know the answer to it as I've not
> > looked at the Robert's API for passing data between backends yet.
> >
> > Maybe Robert or Amit can answer this?
>
> I think parallel aggregation will probably require both the
> infrastructure discussed here, namely the ability to combine two
> transition states into a single transition state, and also the ability
> to serialize and de-serialize transition states, which has previously
> been discussed in the context of letting hash aggregates spill to
> disk. My current thinking is that the parallel plan will look
> something like this:
>
> HashAggregateFinish
> -> Funnel
> -> HashAggregatePartial
> -> PartialHeapScan
>
> So the workers will all pull from a partial heap scan and each worker
> will aggregate its own portion of the data. Then it'll need to pass
> the results of that step back to the master, which will aggregate the
> partial results and produce the final results. I'm guessing that if
> we're grouping on, say, column a, the HashAggregatePartial nodes will
> kick out 2-column tuples of the form (<value for a>, <serialized
> transition state for group with that value for a>).
>
It may not be an urgent topic to be discussed towards v9.5, however,
I'd like to raise a topic about planner and aggregates.
Once we could get the two phase aggregation, planner needs to pay
attention possibility of partial aggregate during plan construction,
even though our current implementation attach Agg node after the
join/scan plan construction.
Probably, a straightforward design is to add FunnelPath with some
child nodes on set_rel_pathlist() or add_paths_to_joinrel().
Its child node may be PartialAggregate node (or some other parallel
safe node of course). In this case, it must inform the planner this
node (for more correctness, targetlist of the node) returns partial
aggregation, instead of the values row-by-row.
Then, planner need to track and care about which type of data shall
be returned to the upper node. Path node may have a flag to show it,
and we also may need to inject dummy PartialAggregate if we try to
join a relation that returns values row-by-row and another one that
returns partial aggregate.
It also leads another problem. The RelOptInfo->targetlist will
depend on the Path node chosen. Even if float datum is expected as
an argument of AVG(), its state to be combined is float[3].
So, we will need to adjust the targetlist of RelOptInfo, once Path
got chosen.
Anyway, I could find out, at least, these complicated issues around
two-phase aggregate integration with planner. Someone can suggest
minimum invasive way for these integration?
Postgres-XC solved this question by creating a plan with two Agg/Group nodes, one for combining transitioned result and one for creating the distributed transition results (one per distributed run per group). So, Agg/Group for combining result had as many Agg/Group nodes as there are distributed/parallel runs. But XC chose this way to reduce the code footprint. In Postgres, we can have different nodes for combining and transitioning as you have specified above. Aggregation is not pathified in current planner, hence XC took the approach of pushing the Agg nodes down the plan tree when there was distributed/parallel execution possible. If we can get aggregation pathified, we can go by path-based approach which might give a better judgement of whether or not to distribute the aggregates itself.
Looking at Postgres-XC might be useful to get ideas. I can help you there.
Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 6 March 2015 at 19:01, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
Postgres-XC solved this question by creating a plan with two Agg/Group nodes, one for combining transitioned result and one for creating the distributed transition results (one per distributed run per group).
So, Agg/Group for combining result had as many Agg/Group nodes as there are distributed/parallel runs.
This sounds quite like the planner must be forcing the executor into having to execute the plan on a fixed number of worker processes.
I really hoped that we could, one day, have a load monitor process that decided what might be the best number of threads to execute a parallel plan on. Otherwise how would we decide how many worker processes to allocate to a plan? Surely there must be times where only utilising half of the processors for a query would be better than trying to use all processors and having many more context switches to perform.
Probably the harder part about dynamically deciding the number of workers would be around the costing. Where maybe the plan will execute the fastest with 32 workers, but if it was only given 2 workers then it might execute better as a non-parallel plan.
Regards
David Rowley
On Fri, Mar 6, 2015 at 12:41 PM, David Rowley <dgrowley@gmail.com> wrote:
> On 6 March 2015 at 19:01, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>> Postgres-XC solved this question by creating a plan with two Agg/Group
>> nodes, one for combining transitioned result and one for creating the
>> distributed transition results (one per distributed run per group).
>> So, Agg/Group for combining result had as many Agg/Group nodes as there
>> are distributed/parallel runs.
>
> This sounds quite like the planner must be forcing the executor into
> having to execute the plan on a fixed number of worker processes.
>
> I really hoped that we could, one day, have a load monitor process that
> decided what might be the best number of threads to execute a parallel
> plan on. Otherwise how would we decide how many worker processes to
> allocate to a plan? Surely there must be times where only utilising half
> of the processors for a query would be better than trying to use all
> processors and having many more context switches to perform.
>
> Probably the harder part about dynamically deciding the number of workers
> would be around the costing. Where maybe the plan will execute the fastest
> with 32 workers, but if it was only given 2 workers then it might execute
> better as a non-parallel plan.
XC does that because it knew how many nodes it had to distribute the aggregation over while creating the plan. To keep that dynamic, we can add a place-holder planner node for producing transitioned results for a given distributed run. At the time of execution, that node creates one executor node (corresponding to the place-holder node) per parallel run. I haven't seen a precedent in PG code for creating more than one executor node for a given planner node, but is that a rule?
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 18 December 2014 at 02:48, Simon Riggs <simon@2ndquadrant.com> wrote:
David, if you can update your patch with some docs to explain the
behaviour, it looks complete enough to think about committing it in
early January, to allow other patches that depend upon it to stand a
chance of getting into 9.5. (It is not yet ready, but I see it could
be).
I've been thinking of bumping this patch to the June commitfest as the patch only exists to provide the basic infrastructure for things like parallel aggregation, aggregate before join, and perhaps auto updating materialised views.
It seems unlikely that any of those things will happen for 9.5.
Does anybody object to me moving this to June's commitfest?
Regards
David Rowley
On Mon, Mar 30, 2015 at 2:08 PM, David Rowley <dgrowleyml@gmail.com> wrote:
On 18 December 2014 at 02:48, Simon Riggs <simon@2ndquadrant.com> wrote:
David, if you can update your patch with some docs to explain the
behaviour, it looks complete enough to think about committing it in
early January, to allow other patches that depend upon it to stand a
chance of getting into 9.5. (It is not yet ready, but I see it could
be).

I've been thinking of bumping this patch to the June commitfest as the patch only exists to provide the basic infrastructure for things like parallel aggregation, aggregate before join, and perhaps auto updating materialised views.

It seems unlikely that any of those things will happen for 9.5.
Yeah, I guess so...
Does anybody object to me moving this to June's commitfest?
Not from my side FWIW. I think it actually makes sense.
--
Michael
On 30 March 2015 at 01:08, David Rowley <dgrowleyml@gmail.com> wrote:
> On 18 December 2014 at 02:48, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> David, if you can update your patch with some docs to explain the
>> behaviour, it looks complete enough to think about committing it in
>> early January, to allow other patches that depend upon it to stand a
>> chance of getting into 9.5. (It is not yet ready, but I see it could
>> be).
>>
>
> I've been thinking of bumping this patch to the June commitfest as the patch
> only exists to provide the basic infrastructure for things like parallel
> aggregation, aggregate before join, and perhaps auto updating materialised
> views.
>
> It seems unlikely that any of those things will happen for 9.5.
> Does anybody object to me moving this to June's commitfest?

If the patch is ready, it should stay in the queue. A global decision to move all uncommitted patches to June might occur later, but that's a different decision. I don't think we should be prejudging that situation, all it would do is hide the extent of the real problem with reviews.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
On Mon, Mar 30, 2015 at 1:28 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> I've been thinking of bumping this patch to the June commitfest as the
>> patch only exists to provide the basic infrastructure for things like
>> parallel aggregation, aggregate before join, and perhaps auto updating
>> materialised views.
>>
>> It seems unlikely that any of those things will happen for 9.5.
>
> Yeah, I guess so...
>
>> Does anybody object to me moving this to June's commitfest?
>
> Not from my side FWIW. I think it actually makes sense.

+1. I'd like to devote some time to looking at this, but I don't have the time right now. The chances that we can do something useful with it in 9.6 seem good.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 04/01/2015 06:28 PM, Robert Haas wrote:
> On Mon, Mar 30, 2015 at 1:28 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>> I've been thinking of bumping this patch to the June commitfest as the
>>> patch only exists to provide the basic infrastructure for things like
>>> parallel aggregation, aggregate before join, and perhaps auto updating
>>> materialised views.
>>>
>>> It seems unlikely that any of those things will happen for 9.5.
>>
>> Yeah, I guess so...
>>
>>> Does anybody object to me moving this to June's commitfest?
>>
>> Not from my side FWIW. I think it actually makes sense.
>
> +1. I'd like to devote some time to looking at this, but I don't have
> the time right now. The chances that we can do something useful with
> it in 9.6 seem good.

And the June commitfest is now in progress.

This patch seems sane to me, as far as it goes. However, there's no planner or executor code to use the aggregate combining for anything. I'm not a big fan of dead code, I'd really like to see something to use this.

The main use case people have been talking about is parallel query, but is there some other case this would be useful right now, without the parallel query feature? You and Simon talked about this case:

> 2. Queries such as:
>
> SELECT p.name, SUM(s.qty) FROM sales s INNER JOIN product p ON s.product_id
> = p.product_id GROUP BY p.name;
>
> Such a query could be transformed into:
>
> SELECT p.name,SUM(qty) FROM (SELECT product_id,SUM(qty) AS qty FROM sales
> GROUP BY product_id) s
> INNER JOIN product p ON p.product_id = s.product_id GROUP BY p.name;
>
> Of course the outer query's SUM and GROUP BY would not be required if there
> happened to be a UNIQUE index on product(name), but assuming there's not
> then the above should produce the results faster. This of course works ok
> for SUM(), but for something like AVG() or STDDEV() the combine/merge
> aggregate functions would be required to process those intermediate
> aggregate results that were produced by the sub-query.

Any chance you could implement that in the planner?

- Heikki
> On 04/01/2015 06:28 PM, Robert Haas wrote:
> > On Mon, Mar 30, 2015 at 1:28 AM, Michael Paquier
> > <michael.paquier@gmail.com> wrote:
> >>> I've been thinking of bumping this patch to the June commitfest as the
> >>> patch only exists to provide the basic infrastructure for things like
> >>> parallel aggregation, aggregate before join, and perhaps auto updating
> >>> materialised views.
> >>>
> >>> It seems unlikely that any of those things will happen for 9.5.
> >>
> >> Yeah, I guess so...
> >>
> >>> Does anybody object to me moving this to June's commitfest?
> >>
> >> Not from my side FWIW. I think it actually makes sense.
> >
> > +1. I'd like to devote some time to looking at this, but I don't have
> > the time right now. The chances that we can do something useful with
> > it in 9.6 seem good.
>
> And the June commitfest is now in progress.
>
> This patch seems sane to me, as far as it goes. However, there's no
> planner or executor code to use the aggregate combining for anything.
> I'm not a big fan of dead code, I'd really like to see something to use
> this.
>
+1, this patch itself looks good for me, but...

> The main use case people have been talking about is parallel query, but
> is there some other case this would be useful right now, without the
> parallel query feature? You and Simon talked about this case:
>
> > 2. Queries such as:
> >
> > SELECT p.name, SUM(s.qty) FROM sales s INNER JOIN product p ON s.product_id
> > = p.product_id GROUP BY p.name;
> >
> > Such a query could be transformed into:
> >
> > SELECT p.name,SUM(qty) FROM (SELECT product_id,SUM(qty) AS qty FROM sales
> > GROUP BY product_id) s
> > INNER JOIN product p ON p.product_id = s.product_id GROUP BY p.name;
> >
> > Of course the outer query's SUM and GROUP BY would not be required if there
> > happened to be a UNIQUE index on product(name), but assuming there's not
> > then the above should produce the results faster. This of course works ok
> > for SUM(), but for something like AVG() or STDDEV() the combine/merge
> > aggregate functions would be required to process those intermediate
> > aggregate results that were produced by the sub-query.
>
> Any chance you could implement that in the planner?
>
It likely needs planner enhancement prior to other applications...
http://www.postgresql.org/message-id/CA+TgmobgWKHfZc09B+s2LxJTwoRD5Ht-avoVDvaQ4+RpwrO4bg@mail.gmail.com

Once planner allowed to have both of normal path and partial aggregation paths to compare according to the cost, it is the straightforward way to do.

Here are various academic research, for example, below is the good starting point to clarify aggregate queries that we can run with 2-phase approach.
http://www.researchgate.net/publication/2715288_Performing_Group-By_before_Join

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On 27 July 2015 at 04:58, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 04/01/2015 06:28 PM, Robert Haas wrote:
On Mon, Mar 30, 2015 at 1:28 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
I've been thinking of bumping this patch to the June commitfest as the
patch only exists to provide the basic infrastructure for things like
parallel aggregation, aggregate before join, and perhaps auto updating
materialised views.
It seems unlikely that any of those things will happen for 9.5.
Yeah, I guess so...
Does anybody object to me moving this to June's commitfest?
Not from my side FWIW. I think it actually makes sense.
+1. I'd like to devote some time to looking at this, but I don't have
the time right now. The chances that we can do something useful with
it in 9.6 seem good.
And the June commitfest is now in progress.
This patch seems sane to me, as far as it goes. However, there's no planner or executor code to use the aggregate combining for anything. I'm not a big fan of dead code, I'd really like to see something to use this.
Thanks for looking. I partially agree with that; it is a little weird to put in code that's yet to be used. I'd certainly agree 95% if this were the final commitfest of 9.5, but we're in the first commitfest of 9.6 and I think there's a very high probability of this getting used in 9.6, and that probability would likely be even higher if the code were already in. Perhaps it's a little bit in the same situation as Robert's parallel worker stuff?
The main use case people have been talking about is parallel query, but is there some other case this would be useful right now, without the parallel query feature? You and Simon talked about this case:

2. Queries such as:
SELECT p.name, SUM(s.qty) FROM sales s INNER JOIN product p ON s.product_id
= p.product_id GROUP BY p.name;
Such a query could be transformed into:
SELECT p.name,SUM(qty) FROM (SELECT product_id,SUM(qty) AS qty FROM sales
GROUP BY product_id) s
INNER JOIN product p ON p.product_id = s.product_id GROUP BY p.name;
Of course the outer query's SUM and GROUP BY would not be required if there
happened to be a UNIQUE index on product(name), but assuming there's not
then the above should produce the results faster. This of course works ok
for SUM(), but for something like AVG() or STDDEV() the combine/merge
aggregate functions would be required to process those intermediate
aggregate results that were produced by the sub-query.
Any chance you could implement that in the planner?
Yes! I'm actually working on it now and so far have it partially working. Quite likely I'll be able to submit for CF2. There are still some costing tweaks to do. So far it just works for GROUP BY with no aggs. I plan to fix that later using this patch.
I don't want to talk too much about it on this thread, but in a test query which is the one in my example, minus the SUM(qty), with 1 million sales records, and 100 products, performance goes from 350ms to 200ms on my machine, so looking good so far.
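To spell out the AVG()/STDDEV() point from the quoted text with a tiny hand-run example (the numbers are made up): averaging the partial averages gives the wrong answer, so the partial stage has to hand over the transition state (count and sum) for the combine step to merge.

-- wrong: take the average of two partial averages (1.0 over 1 row, 100.0 over 3 rows)
SELECT avg(partial_avg) FROM (VALUES (1.0), (100.0)) AS p(partial_avg);        -- 50.5
-- right: combine the transition states and finalize once: (1 + 300) / (1 + 3)
SELECT sum(s) / sum(n) AS avg FROM (VALUES (1, 1.0), (3, 300.0)) AS p(n, s);   -- 75.25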
Regards
David Rowley
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 27 July 2015 at 12:14, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> The main use case people have been talking about is parallel query, but
> is there some other case this would be useful right now, without the
> parallel query feature? You and Simon talked about this case:
>
> > 2. Queries such as:
> >
> > SELECT p.name, SUM(s.qty) FROM sales s INNER JOIN product p ON s.product_id
> > = p.product_id GROUP BY p.name;
> >
> > Such a query could be transformed into:
> >
> > SELECT p.name,SUM(qty) FROM (SELECT product_id,SUM(qty) AS qty FROM sales
> > GROUP BY product_id) s
> > INNER JOIN product p ON p.product_id = s.product_id GROUP BY p.name;
> >
> > Of course the outer query's SUM and GROUP BY would not be required if there
> > happened to be a UNIQUE index on product(name), but assuming there's not
> > then the above should produce the results faster. This of course works ok
> > for SUM(), but for something like AVG() or STDDEV() the combine/merge
> > aggregate functions would be required to process those intermediate
> > aggregate results that were produced by the sub-query.
>
> Any chance you could implement that in the planner?
>
It likely needs planner enhancement prior to other applications...
http://www.postgresql.org/message-id/CA+TgmobgWKHfZc09B+s2LxJTwoRD5Ht-avoVDvaQ4+RpwrO4bg@mail.gmail.com
I had thought this too, but I'm not sure that's 100% correct. As I just said in my previous email on this thread, I am now working on "group by before join". I had started it with the intention of path-ifying the grouping planner, but I soon realised that grouping_planner() does quite a bit more at that final stage than I propose to do to allow "group by before join". This is mostly around handling of DISTINCT, HAVING and LIMIT. I don't think those need to be handled in my patch, perhaps with the exception of DISTINCT without GROUP BY, but not when both are present.
Instead I've started by inventing GroupingPath and I'm now building these new path types once there's enough tables joined for all Vars of the GROUP BY and agg parameters.
I believe this optimisation needs to be costed as pushing the GROUP BY down to happen as early as possible is *not* always a win. Perhaps the JOIN is very selective and eliminates many groups. Hence I've added costing and allowed the planner to choose what it thinks is cheapest.
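As a hedged illustration of the sort of case the costing has to catch (the tables and columns here are invented): if the join is very selective, grouping first does a lot of work that the join then throws away.

-- Grouping sales by customer_id before the join would build one group per
-- customer, but the join predicate keeps only a handful of customers, so
-- most of that grouping work is wasted; the planner should cost both orders.
SELECT c.name, sum(s.qty)
FROM sales s
INNER JOIN customer c ON c.id = s.customer_id
WHERE c.signup_date >= '2016-01-01'
GROUP BY c.name;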
Once planner allowed to have both of normal path and partial aggregation
paths to compare according to the cost, it is the straightforward way to
do.
I've ended up with 2 boolean members in struct GroupingPath, combine_states and finalize_aggs. I plan to modify nodeAgg.c to use these, and EXPLAIN to give a better description of what it's doing.
Here are various academic research, for example, below is the good starting
point to clarify aggregate queries that we can run with 2-phase approach.
http://www.researchgate.net/publication/2715288_Performing_Group-By_before_Join
Thanks, I've been using that very paper.
Regards
David Rowley
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 27 July 2015 at 04:58, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
This patch seems sane to me, as far as it goes. However, there's no planner or executor code to use the aggregate combining for anything. I'm not a big fan of dead code, I'd really like to see something to use this.
I've attached an updated version of the patch. The main change from last time is that I've added executor support and exposed this to the planner via two new parameters in make_agg().
I've also added EXPLAIN support, this will display "Partial [Hash|Group]Aggregate" for cases where the final function won't be called and displays "Finalize [Hash|Group]Aggregate" when combining states and finalizing aggregates.
This patch is currently intended for foundation work for parallel aggregation.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Thu, Dec 3, 2015 at 12:01 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 27 July 2015 at 04:58, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> This patch seems sane to me, as far as it goes. However, there's no
>> planner or executor code to use the aggregate combining for anything. I'm
>> not a big fan of dead code, I'd really like to see something to use this.
>
> I've attached an updated version of the patch. The main change from last
> time is that I've added executor support and exposed this to the planner via
> two new parameters in make_agg().
>
> I've also added EXPLAIN support, this will display "Partial
> [Hash|Group]Aggregate" for cases where the final function won't be called
> and displays "Finalize [Hash|Group]Aggregate" when combining states and
> finalizing aggregates.
>
> This patch is currently intended for foundation work for parallel
> aggregation.

Given Tom's dislike of executor nodes that do more than one thing, I fear he's not going to be very happy about combining Aggregate, PartialAggregate, FinalizeAggregate, GroupAggregate, PartialGroupAggregate, FinalizeGroupAggregate, HashAggregate, PartialHashAggregate, and FinalizeHashAggregate under one roof. However, it's not at all obvious to me what would be any better. nodeAgg.c is already a very big file, and this patch adds a surprisingly small amount of code to it.

I think the parameter in CREATE AGGREGATE ought to be called COMBINEFUNC rather than CFUNC. There are a lot of English words that begin with C, and it's not self-evident which one is meant.

"iif performing the final aggregate stage we'll check the qual" should probably say "If" with a capital letter rather than "iif" without one.

I think it would be useful to have a test patch associated with this patch that generates a FinalizeAggregate + PartialAggregate combo instead of an aggregate, and run that through the regression tests. There will obviously be EXPLAIN differences, but in theory nothing else should blow up. Of course, such tests will become more meaningful as we add more combine-functions, but this would at least be a start.

Generally, I think that this patch will be excellent infrastructure for parallel aggregate and I think we should try to get it committed soon.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 9 December 2015 at 06:18, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Dec 3, 2015 at 12:01 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 27 July 2015 at 04:58, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> This patch seems sane to me, as far as it goes. However, there's no
>> planner or executor code to use the aggregate combining for anything. I'm
>> not a big fan of dead code, I'd really like to see something to use this.
>
> I've attached an updated version of the patch. The main change from last
> time is that I've added executor support and exposed this to the planner via
> two new parameters in make_agg().
>
> I've also added EXPLAIN support, this will display "Partial
> [Hash|Group]Aggregate" for cases where the final function won't be called
> and displays "Finalize [Hash|Group]Aggregate" when combining states and
> finalizing aggregates.
>
> This patch is currently intended for foundation work for parallel
> aggregation.
Given Tom's dislike of executor nodes that do more than one thing, I
fear he's not going to be very happy about combining Aggregate,
PartialAggregate, FinalizeAggregate, GroupAggregate,
PartialGroupAggregate, FinalizeGroupAggregate, HashAggregate,
PartialHashAggregate, and FinalizeHashAggregate under one roof.
However, it's not at all obvious to me what would be any better.
nodeAgg.c is already a very big file, and this patch adds a
surprisingly small amount of code to it.
Hmm yes. I read that too. It's a tricky one. I'm not sure where the line is and when we go over it. At least nodeAgg.c does not need any additional work again for parallel enabling... This should be all that's required for that.
I think the parameter in CREATE AGGREGATE ought to be called
COMBINEFUNC rather than CFUNC. There are a lot of English words that
begin with C, and it's not self-evident which one is meant.
Good idea, I've changed this in the attached.
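For reference, a minimal example with the renamed option, assuming the rest of the patch's CREATE AGGREGATE syntax is unchanged; int4pl serves as both transition and combine function here because adding two partial sums is the same operation as adding an input value to the running sum.

CREATE AGGREGATE mysum (int)
(
    stype = int,
    sfunc = int4pl,
    combinefunc = int4pl
);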
"iif performing the final aggregate stage we'll check the qual" should
probably say "If" with a capital letter rather than "iif" without one.
While building the test code I ended up deciding it's best to not change this part at all and just always check the qual. In the case of partial aggregation I think it should just be up to the planner not to pass the HAVING clause as the node's qual, and only pass this when Finalizing the aggregates. Seem fair?
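A small made-up example of why the qual can only ever be applied at the finalize stage: with HAVING sum(v) > 10, a group whose partial sums are 6 and 7 would be wrongly discarded if the qual ran against the partial results, even though the combined total of 13 passes.

SELECT g, sum(v)
FROM (VALUES ('a', 6), ('a', 7)) AS t(g, v)
GROUP BY g
HAVING sum(v) > 10;   -- returns ('a', 13); neither partial sum alone would pass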
I think it would be useful to have a test patch associated with this
patch that generates a FinalizeAggregate + PartialAggregate combo
instead of an aggregate, and run that through the regression tests.
There will obviously be EXPLAIN differences, but in theory nothing
else should blow up. Of course, such tests will become more
meaningful as we add more combine-functions, but this would at least
be a start.
I've been working on this, and I've attached what I've got so far. The plans will look something like:
# explain select sum(a) from test;
QUERY PLAN
--------------------------------------------------------------------
Finalize Aggregate (cost=18.53..18.54 rows=1 width=4)
-> Partial Aggregate (cost=18.51..18.52 rows=1 width=4)
-> Seq Scan on test (cost=0.00..16.01 rows=1001 width=4)
I seem to have got all of this working with the test patch, and the only regression tests which failed were due to EXPLAIN output changing, rather than results changing. I *was* all quite happy with it until I thought that I'd better write an aggregate combine function which has an INTERNAL state. I had thought that I could get this working by insisting that the combine function either update the existing state, or, in the case where there's no existing state, just write all the values from the combining state into a newly created state which is allocated in the correct memory context. I tried this by implementing a combinefn for SUM(NUMERIC) and all seems to work fine for hash aggregates, but it falls flat for sorted or plain aggregates.
# select x,sum(x::numeric) from generate_series(1,3) x(x) group by x;
x | sum
---+-----
1 | 1
3 | 3
2 | 2
(3 rows)
This one works ok using hash aggregate.
But sorted and plain aggregates fail:
# select sum(x::numeric) from generate_series(1,3) x(x);
ERROR: invalid memory alloc request size 18446744072025250716
The reason that this happens is down to the executor always thinking that Aggref returns the aggtype, but in cases where we're not finalizing the aggregate the executor needs to know that we're actually returning aggtranstype instead. Hash aggregates appear to work as the trans value is just stuffed into a hash table, but with plain and sorted aggregates this gets formed into a Tuple again, and forming a tuple naturally requires the types to be set correctly! I'm not sure exactly what the best way to fix this will be. I've hacked something up in the attached test patch which gets around the problem by adding a new aggtranstype to Aggref and also an 'aggskipfinalize' field which I manually set to true in a bit of a hacky way inside the grouping planner. Then in exprType() for Aggref I conditionally return the aggtype or aggtranstype based on the aggskipfinalize setting. This is most likely not the way to properly fix this, but I'm running out of steam today to think of how it should be done, so I'm currently very open to ideas on this.
I should also mention that the setrefs stuff in the test patch is borrowed from Haribabu's Parallel Aggregate patch. I'm not quite clear on which patch that part should go into at the moment.
Generally, I think that this patch will be excellent infrastructure
for parallel aggregate and I think we should try to get it committed
soon.
Thanks.
Attachment
On 18 December 2015 at 01:28, David Rowley <david.rowley@2ndquadrant.com> wrote:
# select sum(x::numeric) from generate_series(1,3) x(x);
ERROR: invalid memory alloc request size 18446744072025250716

The reason that this happens is down to the executor always thinking that Aggref returns the aggtype, but in cases where we're not finalizing the aggregate the executor needs to know that we're actually returning aggtranstype instead. Hash aggregates appear to work as the trans value is just stuffed into a hash table, but with plain and sorted aggregates this gets formed into a Tuple again, and forming a tuple naturally requires the types to be set correctly! I'm not sure exactly what the best way to fix this will be. I've hacked something up in the attached test patch which gets around the problem by adding a new aggtranstype to Aggref and also an 'aggskipfinalize' field which I manually set to true in a bit of a hacky way inside the grouping planner. Then in exprType() for Aggref I conditionally return the aggtype or aggtranstype based on the aggskipfinalize setting. This is most likely not the way to properly fix this, but I'm running out of steam today to think of how it should be done, so I'm currently very open to ideas on this.
Ok, so it seems that my mindset was not in parallel process space when I was thinking about this. I think having the pointer in the Tuple is probably going to be fine for this multiple stage aggregation when that's occurring in a single backend process, but obviously if the memory that the pointer points to belongs to a worker process in a parallel aggregate situation, then bad things will happen.
Now, there has been talk of this previously, on various threads, but I don't believe any final decisions were made on how exactly it should be done. At the moment I plan to make changes as follows:
- Add 3 new columns to pg_aggregate: aggserialfn, aggdeserialfn and aggserialtype. These will only be required when aggtranstype is INTERNAL. Perhaps we should disallow CREATE AGGREGATE from accepting them for any other type... The return type of aggserialfn should be aggserialtype, and it should take a single parameter of aggtranstype. aggdeserialfn will be the reverse of that.
- Add a new bool field to nodeAgg's state named serialStates. If this field is set to true then, when we're in finalizeAgg = false mode, we'll call the serialfn on the agg state instead of the finalfn. This will allow the serialized state to be stored in the tuple and sent off to the main backend. The combine agg node should also be set to serialStates = true, so that it knows to deserialize instead of just assuming that the agg state is of type aggtranstype.
I believe this should allow us to avoid the performance regression we'd otherwise get by moving away from INTERNAL agg states. It should also be very efficient for internal states such as Int8TransTypeData, as this struct merely has 2 int64 fields which should be very simple to stuff into a bytea varlena type. We don't need to mess around converting the ->count and ->sum into strings or anything.
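To sketch what this might look like at the SQL level: the option spellings serialfunc and deserialfunc below are assumptions mirroring the proposed catalog columns (aggserialtype would presumably get a matching option as well), and the numeric_avg_* names are simply the sort of support functions the patch would have to provide.

CREATE AGGREGATE myavg (numeric)
(
    stype = internal,
    sfunc = numeric_avg_accum,
    finalfunc = numeric_avg,
    combinefunc = numeric_avg_combine,
    serialfunc = numeric_avg_serialize,
    deserialfunc = numeric_avg_deserialize
);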
Then in order for the planner to allow parallel aggregation all aggregates must:
- Not have a DISTINCT or ORDER BY clause
- Have a combinefn
- If aggtranstype = INTERNAL, must have an aggserialfn and aggdeserialfn.
We can relax the third requirement if we're using 2-stage aggregation within a single process, but not for parallel aggregation.
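In catalog terms (column names as proposed above; aggcombinefn is assumed as the name of the combine function column), the per-aggregate part of that check might boil down to something like:

-- aggregates eligible for parallel aggregation; the DISTINCT/ORDER BY
-- restriction is a property of each call site, not of the catalog entry
SELECT aggfnoid
FROM pg_aggregate
WHERE aggcombinefn <> 0
  AND (aggtranstype <> 'internal'::regtype
       OR (aggserialfn <> 0 AND aggdeserialfn <> 0));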
Any objections, or better ideas?
That just leaves me to figure out how to set the correct Aggref->aggtype during planning, as now there's 3 possible types:
if (finalizeAggs == false)
{
    if (serialStates == true)
        type = aggserialtype;   /* partial aggregation, serialized state output */
    else
        type = aggtranstype;    /* partial aggregation, raw transition state output */
}
else
    type = prorettype;          /* normal case: the finalfn's result type */
--
On Mon, Dec 21, 2015 at 4:02 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Ok, so it seems that my mindset was not in parallel process space when I
> was thinking about this. I think having the pointer in the Tuple is
> probably going to be fine for this multiple stage aggregation when that's
> occurring in a single backend process, but obviously if the memory that the
> pointer points to belongs to a worker process in a parallel aggregate
> situation, then bad things will happen.
>
> Now, there has been talk of this previously, on various threads, but I don't
> believe any final decisions were made on how exactly it should be done. At
> the moment I plan to make changes as follows:
>
> Add 3 new columns to pg_aggregate, aggserialfn, aggdeserialfn and
> aggserialtype These will only be required when aggtranstype is INTERNAL.
> Perhaps we should disallow CREATE AGGREAGET from accepting them for any
> other type... The return type of aggserialfn should be aggserialtype, and it
> should take a single parameter of aggtranstype. aggdeserialfn will be the
> reverse of that.
>
> Add a new bool field to nodeAgg's state named serialStates. If this is field
> is set to true then when we're in finalizeAgg = false mode, we'll call the
> serialfn on the agg state instead of the finalfn. This will allow the
> serialized state to be stored in the tuple and sent off to the main backend.
> The combine agg node should also be set to serialStates = true, so that it
> knows to deserialize instead of just assuming that the agg state is of type
> aggtranstype.
>
> I believe this should allow us to not cause any performance regressions by
> moving away from INTERNAL agg states. It should also be very efficient for
> internal states such as Int8TransTypeData, as this struct merely has 2 int64
> fields which should be very simple to stuff into a bytea varlena type. We
> don't need to mess around converting the ->count and ->sum into a strings or
> anything.
>
> Then in order for the planner to allow parallel aggregation all aggregates
> must:
>
> Not have a DISTINCT or ORDER BY clause
> Have a combinefn
> If aggtranstype = INTERNAL, must have a aggserialfn and aggdeserialfn.
>
> We can relax the requirement on 3 if we're using 2-stage aggregation, but
> not parallel aggregation.
>
> Any objections, or better ideas?

Can we use Tom's expanded-object stuff instead of introducing aggserialfn and aggdeserialfn? In other words, if you have a aggtranstype = INTERNAL, then what we do is:

1. Create a new data type that represents the transition state.
2. Use expanded-object notation for that data type when we're just within a single process, and flatten it when we need to send it between processes.

One thing to keep in mind is that we also want to be able to support a plan that involves having one or more remote servers do partial aggregation, send us the partial values, combine them across servers and possibly also with locally computed values, and then finalize the aggregation. So it would be nice if there were a way to invoke the aggregate function from SQL and get a transition value back rather than a final value.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 22 December 2015 at 01:30, Robert Haas <robertmhaas@gmail.com> wrote:
Can we use Tom's expanded-object stuff instead of introducing aggserialfn and aggdeserialfn? In other words, if you have a aggtranstype = INTERNAL, then what we do is:
1. Create a new data type that represents the transition state.
2. Use expanded-object notation for that data type when we're just
within a single process, and flatten it when we need to send it
between processes.
I'd not seen this before, but on looking at it I'm not sure it will be practical to use for this. I may have missed something, but it seems that after each call of the transition function, I'd need to ensure that the INTERNAL state was in the varlena format. This might be ok for a state like Int8TransTypeData, since that struct has no pointers, but I don't see how that could be done efficiently for NumericAggState, which has two NumericVars, which will have pointers to other memory. The trans function also has no idea whether it'll be called again for this state, so it does not seem possible to delay the conversion until the final call of the trans function.
One thing to keep in mind is that we also want to be able to support a
plan that involves having one or more remote servers do partial
aggregation, send us the partial values, combine them across servers
and possibly also with locally computed-values, and the finalize the
aggregation. So it would be nice if there were a way to invoke the
aggregate function from SQL and get a transition value back rather
than a final value.
This will be possible with what I proposed. The Agg node will just need to be set up with finalizeAggs=false, serialState=true. That way the returned aggregate values will be the states converted into the serial type, on which we can call the output function and send them wherever we like.
On 21 December 2015 at 22:02, David Rowley <david.rowley@2ndquadrant.com> wrote:
At the moment I plan to make changes as follows:
- Add 3 new columns to pg_aggregate, aggserialfn, aggdeserialfn and aggserialtype These will only be required when aggtranstype is INTERNAL. Perhaps we should disallow CREATE AGGREAGET from accepting them for any other type... The return type of aggserialfn should be aggserialtype, and it should take a single parameter of aggtranstype. aggdeserialfn will be the reverse of that.
- Add a new bool field to nodeAgg's state named serialStates. If this is field is set to true then when we're in finalizeAgg = false mode, we'll call the serialfn on the agg state instead of the finalfn. This will allow the serialized state to be stored in the tuple and sent off to the main backend. The combine agg node should also be set to serialStates = true, so that it knows to deserialize instead of just assuming that the agg state is of type aggtranstype.
I've attached an updated patch which implements this.
combine_aggs_test_v3.patch can be applied on top of the other patch to have the planner generate partial aggregate plans. It is also possible to further hack this test patch to remove the combine Agg node from planner.c in order to have the intermediate states output. I've also invented a serialize and deserialize function for AVG(NUMERIC) and SUM(NUMERIC). Quite possibly this is what we'd want a remote server to send us if we get remote aggregation on a server grid one day in the future. With the patch modified to not generate the combine aggs node we get:
# select sum(x::numeric) from generate_series(1,1000) x(x);
----------------------
1000 500500 0 1000 0
(1 row)
This is just my serial format that I've come up with for NumericAggState, which is basically: "N sumX maxScale maxScaleCount NaNcount". Perhaps we can come up with something better, maybe just packing the ints and int64s into a bytea type and putting the text version of sumX on the end... I'm sure we can think of something more efficient between us, but I think the serial state should definitely be cross platform e.g if we do the bytea thing, then the ints should be in network byte order so that a server cluster can have a mix of little and big-endian processors.
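Just to spell out what the five fields mean, here's a throwaway split_part() reading of that output (a real deserialfn would of course do this in C, and the format itself is still up for debate):

SELECT split_part(s, ' ', 1)::int8    AS n,
       split_part(s, ' ', 2)::numeric AS sumx,
       split_part(s, ' ', 2)::numeric / split_part(s, ' ', 1)::int8 AS avg
FROM (VALUES ('1000 500500 0 1000 0')) AS v(s);
-- n = 1000, sumx = 500500, avg = 500.5; the remaining fields are maxScale,
-- maxScaleCount and NaNcount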
I've also left a failing regression test because I'm not quite sure how to deal with it:
--- 283,292 ----
WHERE p1.prorettype = 'internal'::regtype AND NOT
'internal'::regtype = ANY (p1.proargtypes);
oid | proname
! ------+-------------------------
! 3319 | numeric_avg_deserialize
2304 | internal_in
This is basically making sure that there's only 1 function that takes zero internal types as parameters, but returns an INTERNAL type. I've made it so that numeric_avg_deserialize() can only be called from an aggregate context, which hopefully makes that safe again, but I'm perhaps not imagining hard enough for ways that this could be abused, so I've held off for now in updated the expected output on that to include the new function.
One other part that I'm not too sure on how to deal with is how to set the data type for the Aggrefs when we're not performing finalization on the aggregate node. The return type for the Aggref in this case will be either the transtype, or the serialtype, depending on if we're serializing the states or not. To do this, I've so far just come up with set_partialagg_aggref_types() which is called during setrefs. The only other time that I can think to do this return type update would be when building the partial agg node's target list. I'm open to better ideas on this part.
Apart from the problems I listed above, I'm reasonably happy with the patch now, and I'm ready for someone else to take a serious look at it.
Many thanks
Attachment
On Wed, Dec 23, 2015 at 7:50 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> One other part that I'm not too sure on how to deal with is how to set the
> data type for the Aggrefs when we're not performing finalization on the
> aggregate node. The return type for the Aggref in this case will be either
> the transtype, or the serialtype, depending on if we're serializing the
> states or not. To do this, I've so far just come up with
> set_partialagg_aggref_types() which is called during setrefs. The only other
> time that I can think to do this return type update would be when building
> the partial agg node's target list. I'm open to better ideas on this part.

Thanks for the patch. I am not sure about the proper place of this change, but changing it to the transtype will make all float4 and float8 aggregates work in parallel. Most of these aggregates' return type is typbyval and the transition type is variable length. We may need to write better combine functions for these types to avoid wrong results in parallel.

Regards,
Hari Babu
Fujitsu Australia
On Wed, Dec 23, 2015 at 7:50 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> This is just my serial format that I've come up with for NumericAggState,
> which is basically: "N sumX maxScale maxScaleCount NaNcount". Perhaps we can
> come up with something better, maybe just packing the ints and int64s into a
> bytea type and putting the text version of sumX on the end... I'm sure we
> can think of something more efficient between us, but I think the serial
> state should definitely be cross platform e.g if we do the bytea thing, then
> the ints should be in network byte order so that a server cluster can have a
> mix of little and big-endian processors.

Instead of adding serial and de-serial functions to all aggregates which have a transition type of internal, how about adding these functions as send and recv functions for the internal type, usable only in an aggregate context? The data could then be sent and received in a similar way to other types. Is that possible?

Regards,
Hari Babu
Fujitsu Australia
On 23 December 2015 at 21:50, David Rowley <david.rowley@2ndquadrant.com> wrote:
Apart from the problems I listed above, I'm reasonably happy with the patch now, and I'm ready for someone else to take a serious look at it.
I forgot to mention another situation where this patch might need a bit more thought. This does not affect any of the built in aggregate functions, but if someone created a user defined aggregate such as:
CREATE AGGREGATE sumplusten (int)
(
stype = int,
sfunc = int4pl,
combinefunc = int4pl,
initcond = '10'
);
Note the initcond which initialises the state to 10. Then let's say we later add the ability to push aggregates down below an Append.
create table t1 as select 10 as value from generate_series(1,10);
create table t2 as select 10 as value from generate_series(1,10);
With the following we get the correct result:
# select sumplusten(value) from (select * from t1 union all select * from t2) t;
sumplusten
------------
210
(1 row)
But if we simulate how it would work in the aggregate push down situation:
# select sum(value) from (select sumplusten(value) as value from t1 union all select sumplusten(value) as value from t2) t;
sum
-----
220
(1 row)
Here I'm using sum() as a mock up of the combine function, but you get the idea... Since we initialize the state twice, we get the wrong result.
Now I'm not too sure if anyone would have an aggregate defined like this in the real world, but I don't think that'll give us a good enough excuse to go returning wrong results.
In the patch I could fix this by changing partial_aggregate_walker() to disable partial aggregation if the aggregate has an initvalue, but that would exclude a number of built-in aggregates unnecessarily. Alternatively if there was some way to analyze the initvalue to see if it was "zero"... I'm just not quite sure how we might do that at the moment.
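For a sense of scale, the conservative rule would knock out every aggregate that has any initcond at all; count(*), for instance, has initcond '0' and would be excluded despite being perfectly safe to combine. The catalog makes it easy to see how many that is:

SELECT count(*) FROM pg_aggregate WHERE agginitval IS NOT NULL;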
Any thoughts on a good way to fix this that does not exclude built-in aggregates with an initvalue are welcome.
On 24 December 2015 at 14:07, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Wed, Dec 23, 2015 at 7:50 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> This is just my serial format that I've come up with for NumericAggState,
> which is basically: "N sumX maxScale maxScaleCount NaNcount". Perhaps we can
> come up with something better, maybe just packing the ints and int64s into a
> bytea type and putting the text version of sumX on the end... I'm sure we
> can think of something more efficient between us, but I think the serial
> state should definitely be cross platform e.g if we do the bytea thing, then
> the ints should be in network byte order so that a server cluster can have a
> mix of little and big-endian processors.
Instead of adding serial and de-serial functions to all aggregates which have
transition type as internal, how about adding these functions as send and
recv functions for internal type? can be used only in aggregate context.
The data can send and receive similar like other functions. Is it possible?
The requirement that needs to be met is that we have the ability to completely represent the INTERNAL aggregate state inside a tuple. It's these tuples that are sent to the main process in the parallel aggregate situation, and also in Sorted and Plain Aggregate too, although in the case of Sorted and Plain Aggregate we can simply put the pointer to the INTERNAL state in the tuple, and this can be dereferenced.
Perhaps I'm lacking imagination, but right now I'm not sure how send and recv functions can be used to help us here. Although we perhaps could make serial and deserial functions skip on type conversions by using similar methods as to what's used in send and recv.
On 24 December 2015 at 13:55, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Wed, Dec 23, 2015 at 7:50 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> One other part that I'm not too sure on how to deal with is how to set the
> data type for the Aggrefs when we're not performing finalization on the
> aggregate node. The return type for the Aggref in this case will be either
> the transtype, or the serialtype, depending on if we're serializing the
> states or not. To do this, I've so far just come up with
> set_partialagg_aggref_types() which is called during setrefs. The only other
> time that I can think to do this return type update would be when building
> the partial agg node's target list. I'm open to better ideas on this part.
Thanks for the patch. I am not sure about the proper place of this change,
but changing it with transtype will make all float4 and float8 aggregates to
work in parallel. Most of these aggregates return type is typbyval and
transition type is a variable length.
we may need to write better combine functions for these types to avoid wrong
results because of parallel.
I might be misunderstanding you here, but yeah, well, if by "write better" you mean "write some", then yeah :) I only touched sum(), min() and max() so far as I didn't need to do anything special with these. I'm not quite sure what you mean with the "wrong results" part. Could you explain more?
On Thu, Dec 24, 2015 at 1:12 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 24 December 2015 at 13:55, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>> On Wed, Dec 23, 2015 at 7:50 PM, David Rowley
>> <david.rowley@2ndquadrant.com> wrote:
>> > One other part that I'm not too sure on how to deal with is how to set the
>> > data type for the Aggrefs when we're not performing finalization on the
>> > aggregate node. The return type for the Aggref in this case will be either
>> > the transtype, or the serialtype, depending on if we're serializing the
>> > states or not. To do this, I've so far just come up with
>> > set_partialagg_aggref_types() which is called during setrefs. The only other
>> > time that I can think to do this return type update would be when building
>> > the partial agg node's target list. I'm open to better ideas on this part.
>>
>> Thanks for the patch. I am not sure about the proper place of this change,
>> but changing it with transtype will make all float4 and float8 aggregates to
>> work in parallel. Most of these aggregates return type is typbyval and
>> transition type is a variable length.
>>
>> we may need to write better combine functions for these types to avoid wrong
>> results because of parallel.
>
> I might be misunderstanding you here, but yeah, well, if by "write better"
> you mean "write some", then yeah :) I only touched sum(), min() and max()
> so far as I didn't need to do anything special with these. I'm not quite
> sure what you mean with the "wrong results" part. Could you explain more?

Sorry for not providing clear information. To check whether the change of replacing aggtype with aggtranstype works fine for float8 avg, I tried adding a float8_combine_accum function for float8 avg, similar to float8_accum but adding the two transvalues instead of adding newval to the existing transvalue, as follows:

N += transvalues1[0];
sumX += transvalues1[1];
sumX2 += transvalues1[2];

But the result came out different compared to the normal aggregate. The data in the column is 1.1.

postgres=# select avg(f3) from tbl where f1 < 1001 group by f3;
          avg
-----------------------
 2.16921537594434e-316
(1 row)

postgres=# set enable_parallelagg = false;
SET
postgres=# select avg(f3) from tbl where f1 < 1001 group by f3;
       avg
------------------
 1.10000000000001
(1 row)

First I thought it was because of a combine function problem, but it seems the problem is somewhere else in the parallel aggregate code. Sorry for the noise.

Regards,
Hari Babu
Fujitsu Australia
On Mon, Dec 21, 2015 at 4:53 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 22 December 2015 at 01:30, Robert Haas <robertmhaas@gmail.com> wrote:
>> Can we use Tom's expanded-object stuff instead of introducing
>> aggserialfn and aggdeserialfn? In other words, if you have a
>> aggtranstype = INTERNAL, then what we do is:
>>
>> 1. Create a new data type that represents the transition state.
>> 2. Use expanded-object notation for that data type when we're just
>> within a single process, and flatten it when we need to send it
>> between processes.
>
> I'd not seen this before, but on looking at it I'm not sure if using it will
> be practical to use for this. I may have missed something, but it seems that
> after each call of the transition function, I'd need to ensure that the
> INTERNAL state was in the varlena format.

No, the idea I had in mind was to allow it to continue to exist in the expanded format until you really need it in the varlena format, and then serialize it at that point. You'd actually need to do the opposite: if you get an input that is not in expanded format, expand it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 25 December 2015 at 14:10, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Dec 21, 2015 at 4:53 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 22 December 2015 at 01:30, Robert Haas <robertmhaas@gmail.com> wrote:
>> Can we use Tom's expanded-object stuff instead of introducing
>> aggserialfn and aggdeserialfn? In other words, if you have a
>> aggtranstype = INTERNAL, then what we do is:
>>
>> 1. Create a new data type that represents the transition state.
>> 2. Use expanded-object notation for that data type when we're just
>> within a single process, and flatten it when we need to send it
>> between processes.
>>
>
> I'd not seen this before, but on looking at it I'm not sure if using it will
> be practical to use for this. I may have missed something, but it seems that
> after each call of the transition function, I'd need to ensure that the
> INTERNAL state was in the varlana format.
No, the idea I had in mind was to allow it to continue to exist in the
expanded format until you really need it in the varlena format, and
then serialize it at that point. You'd actually need to do the
opposite: if you get an input that is not in expanded format, expand
it.
Admittedly I'm struggling to see how this can be done. I've spent a good bit of time analysing how the expanded object stuff works.
Hypothetically let's say we can make it work like:
1. During partial aggregation (finalizeAggs = false), in finalize_aggregates(), where we'd normally call the final function, instead flatten INTERNAL states and store the flattened Datum instead of the pointer to the INTERNAL state.
2. During combining aggregation (combineStates = true) have all the combine functions written in such a ways that the INTERNAL states expand the flattened states before combining the aggregate states.
Does that sound like what you had in mind?
If so, I can't quite seem to wrap my head around 1, as I'm really not quite sure how, from finalize_aggregates(), we'd flatten the INTERNAL pointer. I mean, how do we know which flatten function to call here? From reading the expanded-object code I see that it's used in expand_array(). In this case we know we're working with arrays, so it just always uses the EA_methods globally scoped struct to get the function pointers it requires for flattening the array. For the case of finalize_aggregates(), the best I can think of here is to have a bunch of global structs and then have a giant case statement to select the correct one. That's clearly horrid, and not commit worthy, and it does nothing to help user defined aggregates which use INTERNAL types. Am I missing something here?
As of the most recent patch I posted, having the serial and deserial functions in the catalogs allows user defined aggregates with INTERNAL states to work just fine. Admittedly I'm not all that happy that I've had to add 4 new columns to pg_aggregate to support this, but if I could think of how to make it work without doing that, then I'd likely go and do that instead.
If your problem with the serialize and deserialize stuff is around the serialized format, then can see no reason why we couldn't just invent some composite types for the current INTERNAL aggregate states, and have the serialfn convert the INTERNAL state into one of those, then have the deserialfn perform the opposite. Likely this would be neater than what I have at the moment with just converting the INTERNAL state into text.
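As a purely illustrative shape for that idea, the avg(numeric) state could map onto a composite type along these lines (the type and field names are invented), with the serialfn returning it and the deserialfn consuming it:

CREATE TYPE numeric_avg_serial AS (
    n             int8,     -- number of input values
    sumx          numeric,  -- running sum
    maxscale      int4,
    maxscalecount int8,
    nancount      int8
);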
Please let me know what I'm missing with the expanded-object code.
--
I've attached some re-based patches on current master. This is just to fix a duplicate OID problem.
Attachment
On 8 January 2016 at 22:43, David Rowley <david.rowley@2ndquadrant.com> wrote:
I've attached some re-based patches on current master. This is just to fix a duplicate OID problem.
Attachment
On Fri, Jan 15, 2016 at 3:34 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 8 January 2016 at 22:43, David Rowley <david.rowley@2ndquadrant.com> wrote:
>>
>> I've attached some re-based patches on current master. This is just to fix
>> a duplicate OID problem.
>>
>
> I've attached two updated patches to fix a conflict with a recent change to
> planner.c

I am getting the following compilation error and warning with the latest patch, because of a function prototype mismatch.

aggregatecmds.c: In function ‘DefineAggregate’:
aggregatecmds.c:93:8: warning: variable ‘serialTypeType’ set but not used [-Wunused-but-set-variable]
  char serialTypeType = 0;
       ^
clauses.c:434:1: error: conflicting types for ‘partial_aggregate_walker’
 partial_aggregate_walker(Node *node, partial_agg_context *context)
 ^
clauses.c:100:13: note: previous declaration of ‘partial_aggregate_walker’ was here
 static bool partial_aggregate_walker(Node *node, void *context);
             ^

Regards,
Hari Babu
Fujitsu Australia
On 15 January 2016 at 17:46, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Fri, Jan 15, 2016 at 3:34 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 8 January 2016 at 22:43, David Rowley <david.rowley@2ndquadrant.com>
> wrote:
>>
>> I've attached some re-based patched on current master. This is just to fix
>> a duplicate OID problem.
>>
>
> I've attached two updated patched to fix a conflict with a recent change to
> planner.c
I am getting following compilation error and warning with the latest patch,
because of a function prototype mismatch.
aggregatecmds.c: In function ‘DefineAggregate’:
aggregatecmds.c:93:8: warning: variable ‘serialTypeType’ set but not
used [-Wunused-but-set-variable]
char serialTypeType = 0;
^
clauses.c:434:1: error: conflicting types for ‘partial_aggregate_walker’
partial_aggregate_walker(Node *node, partial_agg_context *context)
^
clauses.c:100:13: note: previous declaration of
‘partial_aggregate_walker’ was here
static bool partial_aggregate_walker(Node *node, void *context);
^
Thanks for noticing that. The compiler I used didn't seem to mind that... Which is a bit of a worry.
I've attached an updated patch, and the same test patch again.
--
Attachment
Man, I really shouldn't go on vacation for Christmas or, like, ever. My email responses get way too slow. Sorry.

On Tue, Dec 29, 2015 at 7:39 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> No, the idea I had in mind was to allow it to continue to exist in the
>> expanded format until you really need it in the varlena format, and
>> then serialize it at that point. You'd actually need to do the
>> opposite: if you get an input that is not in expanded format, expand
>> it.
>
> Admittedly I'm struggling to see how this can be done. I've spent a good bit
> of time analysing how the expanded object stuff works.
>
> Hypothetically let's say we can make it work like:
>
> 1. During partial aggregation (finalizeAggs = false), in
> finalize_aggregates(), where we'd normally call the final function, instead
> flatten INTERNAL states and store the flattened Datum instead of the pointer
> to the INTERNAL state.
> 2. During combining aggregation (combineStates = true) have all the combine
> functions written in such a ways that the INTERNAL states expand the
> flattened states before combining the aggregate states.
>
> Does that sound like what you had in mind?

More or less. But what I was really imagining is that we'd get rid of the internal states and replace them with new datatypes built to purpose. So, for example, for string_agg(text, text) you could make a new datatype that is basically a StringInfo. In expanded form, it really is a StringInfo. When you flatten it, you just get the string. When somebody expands it again, they again have a StringInfo. So the RW pointer to the expanded form supports append cheaply.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 16 January 2016 at 03:03, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Dec 29, 2015 at 7:39 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
>> No, the idea I had in mind was to allow it to continue to exist in the
>> expanded format until you really need it in the varlena format, and
>> then serialize it at that point. You'd actually need to do the
>> opposite: if you get an input that is not in expanded format, expand
>> it.
>
> Admittedly I'm struggling to see how this can be done. I've spent a good bit
> of time analysing how the expanded object stuff works.
>
> Hypothetically let's say we can make it work like:
>
> 1. During partial aggregation (finalizeAggs = false), in
> finalize_aggregates(), where we'd normally call the final function, instead
> flatten INTERNAL states and store the flattened Datum instead of the pointer
> to the INTERNAL state.
> 2. During combining aggregation (combineStates = true) have all the combine
> functions written in such a ways that the INTERNAL states expand the
> flattened states before combining the aggregate states.
>
> Does that sound like what you had in mind?
More or less. But what I was really imagining is that we'd get rid of
the internal states and replace them with new datatypes built to
purpose. So, for example, for string_agg(text, text) you could make a
new datatype that is basically a StringInfo. In expanded form, it
really is a StringInfo. When you flatten it, you just get the string.
When somebody expands it again, they again have a StringInfo. So the
RW pointer to the expanded form supports append cheaply.
That sounds fine in theory, but where and how do you suppose we determine which expand function to call? Nothing exists in the catalogs to decide this currently.
On Sat, Jan 16, 2016 at 12:00 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 16 January 2016 at 03:03, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Tue, Dec 29, 2015 at 7:39 PM, David Rowley >> <david.rowley@2ndquadrant.com> wrote: >> >> No, the idea I had in mind was to allow it to continue to exist in the >> >> expanded format until you really need it in the varlena format, and >> >> then serialize it at that point. You'd actually need to do the >> >> opposite: if you get an input that is not in expanded format, expand >> >> it. >> > >> > Admittedly I'm struggling to see how this can be done. I've spent a good >> > bit >> > of time analysing how the expanded object stuff works. >> > >> > Hypothetically let's say we can make it work like: >> > >> > 1. During partial aggregation (finalizeAggs = false), in >> > finalize_aggregates(), where we'd normally call the final function, >> > instead >> > flatten INTERNAL states and store the flattened Datum instead of the >> > pointer >> > to the INTERNAL state. >> > 2. During combining aggregation (combineStates = true) have all the >> > combine >> > functions written in such a ways that the INTERNAL states expand the >> > flattened states before combining the aggregate states. >> > >> > Does that sound like what you had in mind? >> >> More or less. But what I was really imagining is that we'd get rid of >> the internal states and replace them with new datatypes built to >> purpose. So, for example, for string_agg(text, text) you could make a >> new datatype that is basically a StringInfo. In expanded form, it >> really is a StringInfo. When you flatten it, you just get the string. >> When somebody expands it again, they again have a StringInfo. So the >> RW pointer to the expanded form supports append cheaply. > > > That sounds fine in theory, but where and how do you suppose we determine > which expand function to call? Nothing exists in the catalogs to decide this > currently. I am thinking of transition function returns and accepts the StringInfoData instead of PolyNumAggState internal data for int8_avg_accum for example. The StringInfoData is formed with the members of the PolyNumAggState structure data. The input given StringInfoData is transformed into PolyNumAggState data and finish the calculation and again form the StringInfoData and return. Similar changes needs to be done for final functions input type also. I am not sure whether this approach may have some impact on performance? Regards, Hari Babu Fujitsu Australia
On 18 January 2016 at 14:36, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Sat, Jan 16, 2016 at 12:00 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 16 January 2016 at 03:03, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Tue, Dec 29, 2015 at 7:39 PM, David Rowley
>> <david.rowley@2ndquadrant.com> wrote:
>> >> No, the idea I had in mind was to allow it to continue to exist in the
>> >> expanded format until you really need it in the varlena format, and
>> >> then serialize it at that point. You'd actually need to do the
>> >> opposite: if you get an input that is not in expanded format, expand
>> >> it.
>> >
>> > Admittedly I'm struggling to see how this can be done. I've spent a good
>> > bit
>> > of time analysing how the expanded object stuff works.
>> >
>> > Hypothetically let's say we can make it work like:
>> >
>> > 1. During partial aggregation (finalizeAggs = false), in
>> > finalize_aggregates(), where we'd normally call the final function,
>> > instead
>> > flatten INTERNAL states and store the flattened Datum instead of the
>> > pointer
>> > to the INTERNAL state.
>> > 2. During combining aggregation (combineStates = true) have all the
>> > combine
>> > functions written in such a ways that the INTERNAL states expand the
>> > flattened states before combining the aggregate states.
>> >
>> > Does that sound like what you had in mind?
>>
>> More or less. But what I was really imagining is that we'd get rid of
>> the internal states and replace them with new datatypes built to
>> purpose. So, for example, for string_agg(text, text) you could make a
>> new datatype that is basically a StringInfo. In expanded form, it
>> really is a StringInfo. When you flatten it, you just get the string.
>> When somebody expands it again, they again have a StringInfo. So the
>> RW pointer to the expanded form supports append cheaply.
>
>
> That sounds fine in theory, but where and how do you suppose we determine
> which expand function to call? Nothing exists in the catalogs to decide this
> currently.
I am thinking of transition function returns and accepts the StringInfoData
instead of PolyNumAggState internal data for int8_avg_accum for example.
hmm, so wouldn't that mean that the transition function would need to (for each input tuple):
1. Parse that StringInfo into tokens.
2. Create a new aggregate state object.
3. Populate the new aggregate state based on the tokenised StringInfo, this would perhaps require that various *_in() functions are called on each token.
4. Add the new tuple to the aggregate state.
5. Build a new StringInfo based on the aggregate state modified in 4.
?
Currently the transition function only does 4, and performs 2 only if it's the first Tuple.
Is that what you mean? If so, I'd say that would slow things down significantly!
To get a gauge on how much more CPU work that would be for some aggregates, have a look at how simple int8_avg_accum() is currently when we have HAVE_INT128 defined. For the case of AVG(BIGINT) we just really have:
state->sumX += newval;
state->N++;
The above code is step 4 only. So unless I've misunderstood you, that would need to turn into steps 1-5 above. Step 4 here is probably just a handful of instructions right now, but adding code for steps 1,2,3 and 5 would turn that into hundreds.
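To put rough shape on that, here is a tiny standalone sketch (toy 64-bit types and a space-separated text state, purely for illustration; this is not PostgreSQL code) of step 4 on its own versus the full 1-5 round trip that a flat text state would force on every row:

#include <stdio.h>

/* Toy model of the aggregate state, simplified to 64-bit for illustration. */
typedef struct
{
	long long sumX;
	long long N;
} ToyAvgState;

/* Step 4 only: what int8_avg_accum effectively does per input row today. */
static void
transition(ToyAvgState *state, long long newval)
{
	state->sumX += newval;
	state->N++;
}

/* What a flat text state would force on every row: parse the old state
 * (steps 1-3), add the new value (step 4), rebuild the text (step 5). */
static void
transition_via_text(char *textstate, size_t bufsize, long long newval)
{
	ToyAvgState state;

	sscanf(textstate, "%lld %lld", &state.sumX, &state.N);
	transition(&state, newval);
	snprintf(textstate, bufsize, "%lld %lld", state.sumX, state.N);
}

int
main(void)
{
	ToyAvgState state = {0, 0};
	char		textstate[64] = "0 0";
	long long	i;

	for (i = 1; i <= 5; i++)
	{
		transition(&state, i);
		transition_via_text(textstate, sizeof(textstate), i);
	}
	printf("struct state: sum=%lld n=%lld; text state: \"%s\"\n",
		   state.sumX, state.N, textstate);
	return 0;
}

Even in this toy form the text route replaces two arithmetic instructions with a parse and a format per row, which is where the "hundreds of instructions" concern comes from.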
I've been trying to avoid any overhead by adding the serializeStates flag to make_agg() so that we can maintain the same performance when we're just passing internal states around in the same process. This keeps the conversions between internal state and serialised state to a minimum.
The StringInfoData is formed with the members of the PolyNumAggState
structure data. The input given StringInfoData is transformed into
PolyNumAggState data and finish the calculation and again form the
StringInfoData and return. Similar changes needs to be done for final
functions input type also. I am not sure whether this approach may have
some impact on performance?
On Sun, Jan 17, 2016 at 9:26 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > hmm, so wouldn't that mean that the transition function would need to (for > each input tuple): > > 1. Parse that StringInfo into tokens. > 2. Create a new aggregate state object. > 3. Populate the new aggregate state based on the tokenised StringInfo, this > would perhaps require that various *_in() functions are called on each > token. > 4. Add the new tuple to the aggregate state. > 5. Build a new StringInfo based on the aggregate state modified in 4. > > ? I don't really know what you mean by parse the StringInfo into tokens. The whole point of the expanded-object interface is to be able to keep things in an expanded internal form so that you *don't* have to repeatedly construct and deconstruct internal data structures. I worked up an example of this approach using string_agg(), which I attach here. This changes the transition type of string_agg() from internal to text. The same code would work for bytea_string_agg(), which would allow removal of some other code, but this patch doesn't do that, because the point of this is to elucidate the approach. In my tests, this seems to be slightly slower than what we're doing today; worst of all, it adds a handful of cycles to advance_transition_function() even when the aggregate is not an expanded object or, indeed, not even pass-by-reference. Some of this might be able to be fixed by a little massaging - in particular, DatumIsReadWriteExpandedObject() seems like it could be partly or entirely inlined, and maybe there's some other way to improve the coding here. Generally, I think finding a way to pass expanded objects through nodeAgg.c would be a good thing to pursue, if we can make it work. The immediate impetus for changing things this way would be that we wouldn't need to add a mechanism for serializing and deserializing internal functions just to pass around partial aggregates. But there's another advantage, too: right now, advance_transition_function() does a lot of data copying to move data from per-call context to the per-aggregate context. When an expanded object is in use, this can be skipped. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On 18 January 2016 at 16:44, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Jan 17, 2016 at 9:26 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> hmm, so wouldn't that mean that the transition function would need to (for
> each input tuple):
>
> 1. Parse that StringInfo into tokens.
> 2. Create a new aggregate state object.
> 3. Populate the new aggregate state based on the tokenised StringInfo, this
> would perhaps require that various *_in() functions are called on each
> token.
> 4. Add the new tuple to the aggregate state.
> 5. Build a new StringInfo based on the aggregate state modified in 4.
>
> ?
I don't really know what you mean by parse the StringInfo into tokens.
The whole point of the expanded-object interface is to be able to keep
things in an expanded internal form so that you *don't* have to
repeatedly construct and deconstruct internal data structures.
That was a response to Haribabu's proposal, although perhaps I misunderstood it. However, I'm not sure how a PolyNumAggState could be converted to a string and back again without doing any string parsing.
I worked up an example of this approach using string_agg(), which I
attach here. This changes the transition type of string_agg() from
internal to text. The same code would work for bytea_string_agg(),
which would allow removal of some other code, but this patch doesn't
do that, because the point of this is to elucidate the approach.
Many thanks for working up that patch. I was clearly missing what you meant previously. I understand it much better now. Thank you for taking the time on that.
In my tests, this seems to be slightly slower than what we're doing
today; worst of all, it adds a handful of cycles to
advance_transition_function() even when the aggregate is not an
expanded object or, indeed, not even pass-by-reference. Some of this
might be able to be fixed by a little massaging - in particular,
DatumIsReadWriteExpandedObject() seems like it could be partly or
entirely inlined, and maybe there's some other way to improve the
coding here.
It also seems that an expanded object requires a new memory context which must be malloc()d and free()d. This has added quite an overhead in my testing. I assume that we must be doing that so that we can ensure that all memory is properly free()d once we're done with the expanded-object.
create table ab(a int, b text);
insert into ab select x,'aaaaaaaaaaaaaaaaaaaaaaaaaaa' from generate_Series(1,1000000) x(x);
set work_mem='1GB';
vacuum analyze ab;
explain analyze select a%1000000,length(string_agg(b,',')) from ab group by 1;
Patched
1521.045 ms
1515.905 ms
1530.920 ms
Master
932.457 ms
959.285 ms
991.021 ms
Although performance of the patched version is closer to master when we have fewer groups, I felt it necessary to show the extreme case. I feel this puts a bit of a dampener on the future of this idea; I've previously had patches rejected for adding 2-5% to planning time alone, so adding over 50% to execution time seems like a dead-end.
I've run profiles on this, and with the patched version malloc() and free() are both at the top of the profile, with about 30% CPU time each.
Generally, I think finding a way to pass expanded objects through
nodeAgg.c would be a good thing to pursue, if we can make it work.
The immediate impetus for changing things this way would be that we
wouldn't need to add a mechanism for serializing and deserializing
internal functions just to pass around partial aggregates. But
there's another advantage, too: right now,
advance_transition_function() does a lot of data copying to move data
from per-call context to the per-aggregate context. When an expanded
object is in use, this can be skipped.
I also witnessed another regression with your patch when testing on another machine. It caused the plan to change to a HashAgg instead of GroupAgg causing a significant slowdown.
Unpatched
# explain analyze select a%1000000,length(string_agg(b,',')) from ab group by 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=119510.84..144510.84 rows=1000000 width=32) (actual time=538.938..1015.278 rows=1000000 loops=1)
Group Key: ((a % 1000000))
-> Sort (cost=119510.84..122010.84 rows=1000000 width=32) (actual time=538.917..594.194 rows=1000000 loops=1)
Sort Key: ((a % 1000000))
Sort Method: quicksort Memory: 102702kB
-> Seq Scan on ab (cost=0.00..19853.00 rows=1000000 width=32) (actual time=0.016..138.964 rows=1000000 loops=1)
Planning time: 0.146 ms
Execution time: 1047.511 ms
Patched
# explain analyze select a%1000000,length(string_agg(b,',')) from ab group by 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=24853.00..39853.00 rows=1000000 width=32) (actual time=8072.346..144424.872 rows=1000000 loops=1)
Group Key: (a % 1000000)
-> Seq Scan on ab (cost=0.00..19853.00 rows=1000000 width=32) (actual time=0.025..481.332 rows=1000000 loops=1)
Planning time: 0.164 ms
Execution time: 263288.332 ms
--
On Mon, Jan 18, 2016 at 10:32 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 18 January 2016 at 16:44, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Sun, Jan 17, 2016 at 9:26 PM, David Rowley >> <david.rowley@2ndquadrant.com> wrote: >> > hmm, so wouldn't that mean that the transition function would need to >> > (for >> > each input tuple): >> > >> > 1. Parse that StringInfo into tokens. >> > 2. Create a new aggregate state object. >> > 3. Populate the new aggregate state based on the tokenised StringInfo, >> > this >> > would perhaps require that various *_in() functions are called on each >> > token. >> > 4. Add the new tuple to the aggregate state. >> > 5. Build a new StringInfo based on the aggregate state modified in 4. >> > >> > ? >> >> I don't really know what you mean by parse the StringInfo into tokens. >> The whole point of the expanded-object interface is to be able to keep >> things in an expanded internal form so that you *don't* have to >> repeatedly construct and deconstruct internal data structures. > > > That was a response to Haribabu proposal, although perhaps I misunderstood > that. However I'm not sure how a PolyNumAggState could be converted to a > string and back again without doing any string parsing. I just thought like direct mapping of the structure with text pointer. something like the below. result = PG_ARGISNULL(0) ? NULL : (text *) PG_GETARG_POINTER(0); state = (PolyNumAggState *)VARDATA(result); To handle the big-endian or little-endian, we may need some extra changes. Instead of adding 3 new columns to the pg_aggregate catalog table to handle the internal types, either something like the above to handle the internal types or some other way is better IMO. Regards, Hari Babu Fujitsu Australia
On Mon, Jan 18, 2016 at 6:32 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: >> In my tests, this seems to be slightly slower than what we're doing >> today; worst of all, it adds a handful of cycles to >> advance_transition_function() even when the aggregate is not an >> expanded object or, indeed, not even pass-by-reference. Some of this >> might be able to be fixed by a little massaging - in particular, >> DatumIsReadWriteExpandedObject() seems like it could be partly or >> entirely inlined, and maybe there's some other way to improve the >> coding here. > > It also seems that an expanded object requires a new memory context which > must be malloc()d and free()d. This has added quite an overhead in my > testing. I assume that we must be doing that so that we can ensure that all > memory is properly free()d once we're done with the expanded-object. Yeah. The API contract for an expanded object took me quite a while to puzzle out, but it seems to be this: if somebody hands you an R/W pointer to an expanded object, you're entitled to assume that you can "take over" that object and mutate it however you like. But the object might be in some other memory context, so you have to move it into your own memory context. That's implementing by reparenting the object's context under your context. This is nice because it's O(1) - whereas copying would be O(n) - but it's sort of aggravating, too. In this case, the transition function already knows that it wants everything in the per-agg state, but it can't just create everything there and be done with it, because nodeAgg.c has to content with the possibility that some other piece of code will return an expanded object that *isn't* parented to the aggcontext. And in fact one of the regression tests does exactly that, which caused me lots of head-banging yesterday. Making it worse, the transition function isn't required to have the same behavior every time: the one in the regression tests sometimes returns an expanded-object pointer, and sometimes doesn't, and if nodeAgg.c reparents that pointer to the aggcontext, the next call to the transition function reparents the pointer BACK to the per-tuple context, so you can't even optimize away the repeated set-parent calls. Uggh. > create table ab(a int, b text); > insert into ab select x,'aaaaaaaaaaaaaaaaaaaaaaaaaaa' from > generate_Series(1,1000000) x(x); > set work_mem='1GB'; > vacuum analyze ab; > > explain analyze select a%1000000,length(string_agg(b,',')) from ab group by > 1; > > Patched > 1521.045 ms > 1515.905 ms > 1530.920 ms > > Master > 932.457 ms > 959.285 ms > 991.021 ms > > Although performance of the patched version is closer to master when we have > less groups, I felt it necessary to show the extreme case. I feel this puts > a bit of a dampener on the future of this idea, as I've previously had > patches rejected for adding 2-5% on planning time alone, adding over 50% > execution time seems like a dead-end. Thanks for working up this test case. I certainly agree that adding 50% execution time is a dead-end, but I wonder if that problem can be fixed somehow. It would be a shame to find out that expanded-objects can't be effectively used for anything other than the purposes for which they were designed, namely speeding up PL/pgsql array operations. > seems neater. I just don't want to waste my time writing all the code > required to replace all INTERNAL aggregate states when I know the > performance regression is going to be unacceptable. No argument there. 
If the performance regression isn't fixable, this approach is doomed. > I also witnessed another regression with your patch when testing on another > machine. It caused the plan to change to a HashAgg instead of GroupAgg > causing a significant slowdown. > > Unpatched > > # explain analyze select a%1000000,length(string_agg(b,',')) from ab group > by 1; > QUERY PLAN > --------------------------------------------------------------------------------------------------------------------------- > GroupAggregate (cost=119510.84..144510.84 rows=1000000 width=32) (actual > time=538.938..1015.278 rows=1000000 loops=1) > Group Key: ((a % 1000000)) > -> Sort (cost=119510.84..122010.84 rows=1000000 width=32) (actual > time=538.917..594.194 rows=1000000 loops=1) > Sort Key: ((a % 1000000)) > Sort Method: quicksort Memory: 102702kB > -> Seq Scan on ab (cost=0.00..19853.00 rows=1000000 width=32) > (actual time=0.016..138.964 rows=1000000 loops=1) > Planning time: 0.146 ms > Execution time: 1047.511 ms > > > Patched > # explain analyze select a%1000000,length(string_agg(b,',')) from ab group > by 1; > QUERY PLAN > ------------------------------------------------------------------------------------------------------------------------ > HashAggregate (cost=24853.00..39853.00 rows=1000000 width=32) (actual > time=8072.346..144424.872 rows=1000000 loops=1) > Group Key: (a % 1000000) > -> Seq Scan on ab (cost=0.00..19853.00 rows=1000000 width=32) (actual > time=0.025..481.332 rows=1000000 loops=1) > Planning time: 0.164 ms > Execution time: 263288.332 ms Well, that's pretty odd. I guess the plan change must be a result of switching the transition type from internal to text, although I'm not immediately certain why that would make a difference. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
>
> # explain analyze select a%1000000,length(string_agg(b,',')) from ab group
> by 1;
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------
> GroupAggregate (cost=119510.84..144510.84 rows=1000000 width=32) (actual
> time=538.938..1015.278 rows=1000000 loops=1)
> Group Key: ((a % 1000000))
> -> Sort (cost=119510.84..122010.84 rows=1000000 width=32) (actual
> time=538.917..594.194 rows=1000000 loops=1)
> Sort Key: ((a % 1000000))
> Sort Method: quicksort Memory: 102702kB
> -> Seq Scan on ab (cost=0.00..19853.00 rows=1000000 width=32)
> (actual time=0.016..138.964 rows=1000000 loops=1)
> Planning time: 0.146 ms
> Execution time: 1047.511 ms
>
>
> Patched
> # explain analyze select a%1000000,length(string_agg(b,',')) from ab group
> by 1;
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------
> HashAggregate (cost=24853.00..39853.00 rows=1000000 width=32) (actual
> time=8072.346..144424.872 rows=1000000 loops=1)
> Group Key: (a % 1000000)
> -> Seq Scan on ab (cost=0.00..19853.00 rows=1000000 width=32) (actual
> time=0.025..481.332 rows=1000000 loops=1)
> Planning time: 0.164 ms
> Execution time: 263288.332 ms
Well, that's pretty odd. I guess the plan change must be a result of
switching the transition type from internal to text, although I'm not
immediately certain why that would make a difference.
It is strange, why hashaggregate is too slow?
Pavel
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Robert Haas <robertmhaas@gmail.com> writes: > Yeah. The API contract for an expanded object took me quite a while > to puzzle out, but it seems to be this: if somebody hands you an R/W > pointer to an expanded object, you're entitled to assume that you can > "take over" that object and mutate it however you like. But the > object might be in some other memory context, so you have to move it > into your own memory context. Only if you intend to keep it --- for example, a function that is mutating and returning an object isn't required to move it somewhere else, if the input is R/W, and I think it generally shouldn't. In the context here, I'd think it is the responsibility of nodeAgg.c not individual datatype functions to make sure that expanded objects live where it wants them to. > Well, that's pretty odd. I guess the plan change must be a result of > switching the transition type from internal to text, although I'm not > immediately certain why that would make a difference. I'm pretty sure there's something in the planner that special-cases its estimate of space consumed by hashtable entries when the data type is "internal". You'd be wanting to fool with that estimate anyway for something like this ... regards, tom lane
On Mon, Jan 18, 2016 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Yeah. The API contract for an expanded object took me quite a while >> to puzzle out, but it seems to be this: if somebody hands you an R/W >> pointer to an expanded object, you're entitled to assume that you can >> "take over" that object and mutate it however you like. But the >> object might be in some other memory context, so you have to move it >> into your own memory context. > > Only if you intend to keep it --- for example, a function that is mutating > and returning an object isn't required to move it somewhere else, if the > input is R/W, and I think it generally shouldn't. > > In the context here, I'd think it is the responsibility of nodeAgg.c > not individual datatype functions to make sure that expanded objects > live where it wants them to. That's how I did it in my prototype, but the problem with that is that spinning up a memory context for every group sucks when there are many groups with only a small number of elements each - hence the 50% regression that David Rowley observed. If we're going to use expanded objects here, which seems like a good idea in principle, that's going to have to be fixed somehow. We're flogging the heck out of malloc by repeatedly creating a context, doing one or two allocations in it, and then destroying the context. I think that, in general, the memory context machinery wasn't really designed to manage lots of small contexts. The overhead of spinning up a new context for just a few allocations is substantial. That matters in some other situations too, I think - I've commonly seen AllocSetContextCreate taking several percent of runtime in profiles. But here it's much exacerbated. I'm not sure whether it's better to attack that problem at the root and try to make AllocSetContextCreate cheaper, or whether we should try to figure out some change to the expanded-object machinery to address the issue. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 19 January 2016 at 02:44, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Mon, Jan 18, 2016 at 10:32 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
I just thought like direct mapping of the structure with text pointer.
something like
the below.
result = PG_ARGISNULL(0) ? NULL : (text *) PG_GETARG_POINTER(0);
state = (PolyNumAggState *)VARDATA(result);
To handle the big-endian or little-endian, we may need some extra changes.
Instead of adding 3 new columns to the pg_aggregate catalog table to handle
the internal types, either something like the above to handle the internal types
or some other way is better IMO.
The problem with that is that most of these internal structs for the aggregate states have pointers to other memory, so even if we laid those bytes down into a bytea or something, doing so would not follow the pointers to the other memory, and when we dereference those pointers in the other process we'll have problems, as those addresses belong to the originating process.
For example PolyNumAggState is defined as:
typedef NumericAggState PolyNumAggState;
and NumericAggState has:
NumericVar sumX; /* sum of processed numbers */
NumericVar sumX2; /* sum of squares of processed numbers */
And NumericVar has:
NumericDigit *buf; /* start of palloc'd space for digits[] */
NumericDigit *digits; /* base-NBASE digits */
Both of these point to other memory which won't be in the varlena type.
Serialization is the process of collecting up everything these pointers reference into some consecutive bytes.
Of course, that's not to say there are no aggregate state structs without pointers (I've not checked), and in those cases we could perhaps just make the serialize and deserialize functions a simple memcpy() into a bytea. In reality though, as you mentioned, we'd likely want to agree on some format that's cross-platform for different byte orders, as we'll probably one day want to forward these values to some other server to finish off the aggregation.
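As a small standalone illustration of that point (toy types, not the real NumericAggState), a serializer cannot just copy the struct; it has to follow the pointers and append the bytes they reference into one flat, self-contained buffer:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-ins for a state holding pointers, the way NumericAggState's
 * NumericVar members point at separately allocated digit buffers. */
typedef struct
{
	int		 ndigits;
	int16_t *digits;		/* separately allocated, like a palloc'd buffer */
} ToyVar;

typedef struct
{
	int64_t N;
	ToyVar	sumX;
} ToyAggState;

/* memcpy() of ToyAggState would only copy the pointer value, which means
 * nothing in another process; instead append the pointed-to data too. */
static size_t
serialize(const ToyAggState *state, unsigned char *out)
{
	unsigned char *p = out;

	memcpy(p, &state->N, sizeof(state->N));
	p += sizeof(state->N);
	memcpy(p, &state->sumX.ndigits, sizeof(state->sumX.ndigits));
	p += sizeof(state->sumX.ndigits);
	memcpy(p, state->sumX.digits, sizeof(int16_t) * state->sumX.ndigits);
	p += sizeof(int16_t) * state->sumX.ndigits;
	return (size_t) (p - out);
}

int
main(void)
{
	int16_t		digits[3] = {1, 2, 3};
	ToyAggState state = {42, {3, digits}};
	unsigned char flat[64];

	printf("flat size: %zu bytes (sizeof(ToyAggState) = %zu)\n",
		   serialize(&state, flat), sizeof(ToyAggState));
	return 0;
}

The deserializer on the receiving side does the reverse: allocate fresh buffers in its own memory and rebuild the pointers from the flat bytes.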
On 19 January 2016 at 06:03, Pavel Stehule <pavel.stehule@gmail.com> wrote:
--
>
> # explain analyze select a%1000000,length(string_agg(b,',')) from ab group
> by 1;
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------
> GroupAggregate (cost=119510.84..144510.84 rows=1000000 width=32) (actual
> time=538.938..1015.278 rows=1000000 loops=1)
> Group Key: ((a % 1000000))
> -> Sort (cost=119510.84..122010.84 rows=1000000 width=32) (actual
> time=538.917..594.194 rows=1000000 loops=1)
> Sort Key: ((a % 1000000))
> Sort Method: quicksort Memory: 102702kB
> -> Seq Scan on ab (cost=0.00..19853.00 rows=1000000 width=32)
> (actual time=0.016..138.964 rows=1000000 loops=1)
> Planning time: 0.146 ms
> Execution time: 1047.511 ms
>
>
> Patched
> # explain analyze select a%1000000,length(string_agg(b,',')) from ab group
> by 1;
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------
> HashAggregate (cost=24853.00..39853.00 rows=1000000 width=32) (actual
> time=8072.346..144424.872 rows=1000000 loops=1)
> Group Key: (a % 1000000)
> -> Seq Scan on ab (cost=0.00..19853.00 rows=1000000 width=32) (actual
> time=0.025..481.332 rows=1000000 loops=1)
> Planning time: 0.164 ms
> Execution time: 263288.332 ms
Well, that's pretty odd. I guess the plan change must be a result of
switching the transition type from internal to text, although I'm not
immediately certain why that would make a difference.
It is strange, why hashaggregate is too slow?
Good question. I looked at this and found my VM was swapping like crazy. Upon investigation it appears that's because, since the patch creates a memory context per aggregated group, and in this case I've got 1 million of them, we create 1 million contexts, each ALLOCSET_SMALL_INITSIZE (1KB) in size, which means about 1GB of memory, which is more than my VM likes.
set work_mem = '130MB' does coax the planner into a GroupAggregate plan, which is faster, but only because it avoids hash aggregation; the hash agg executor code gives no regard to work_mem, so if I set work_mem to 140MB (which is more realistic for this VM), the same swapping problems occur again. Probably setting aggtransspace for this aggregate to 1024 would help the costing problem, but it would also cause hashagg to be chosen less often during planning.
On Mon, Jan 18, 2016 at 12:34 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Jan 18, 2016 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> Yeah. The API contract for an expanded object took me quite a while >>> to puzzle out, but it seems to be this: if somebody hands you an R/W >>> pointer to an expanded object, you're entitled to assume that you can >>> "take over" that object and mutate it however you like. But the >>> object might be in some other memory context, so you have to move it >>> into your own memory context. >> >> Only if you intend to keep it --- for example, a function that is mutating >> and returning an object isn't required to move it somewhere else, if the >> input is R/W, and I think it generally shouldn't. >> >> In the context here, I'd think it is the responsibility of nodeAgg.c >> not individual datatype functions to make sure that expanded objects >> live where it wants them to. > > That's how I did it in my prototype, but the problem with that is that > spinning up a memory context for every group sucks when there are many > groups with only a small number of elements each - hence the 50% > regression that David Rowley observed. If we're going to use expanded > objects here, which seems like a good idea in principle, that's going > to have to be fixed somehow. We're flogging the heck out of malloc by > repeatedly creating a context, doing one or two allocations in it, and > then destroying the context. > > I think that, in general, the memory context machinery wasn't really > designed to manage lots of small contexts. The overhead of spinning > up a new context for just a few allocations is substantial. That > matters in some other situations too, I think - I've commonly seen > AllocSetContextCreate taking several percent of runtime in profiles. > But here it's much exacerbated. I'm not sure whether it's better to > attack that problem at the root and try to make AllocSetContextCreate > cheaper, or whether we should try to figure out some change to the > expanded-object machinery to address the issue. Here is a patch that helps a good deal. I changed things so that when we create a context, we always allocate at least 1kB. If that's more than we need for the node itself and the name, then we use the rest of the space as a sort of keeper block. I think there's more that can be done to improve this, but I'm wimping out for now because it's late here. I suspect something like this is a good idea even if we don't end up using it for aggregate transition functions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Robert Haas <robertmhaas@gmail.com> writes: > Here is a patch that helps a good deal. I changed things so that when > we create a context, we always allocate at least 1kB. That's going to kill performance in some other cases; subtransactions in particular rely on the subtransaction's TransactionContext not causing any actual malloc unless something gets put into the TransactionContext. regards, tom lane
Hi, On 01/19/2016 05:00 AM, David Rowley wrote: > On 19 January 2016 at 06:03, Pavel Stehule <pavel.stehule@gmail.com > <mailto:pavel.stehule@gmail.com>> wrote: > ... > > It is strange, why hashaggregate is too slow? > > > Good question. I looked at this and found my VM was swapping like crazy. > Upon investigation it appears that's because, since the patch creates a > memory context per aggregated group, and in this case I've got 1 million > of them, it means we create 1 million context, which are > ALLOCSET_SMALL_INITSIZE (1KB) in size, which means about 1GB of memory, > which is more than my VM likes. Really? Where do we create the memory context? IIRC string_agg uses the aggcontext directly, and indeed that's what I see in string_agg_transfn and makeStringAggState. Perhaps you mean that initStringInfo() allocates 1kB buffers by default? > > set work_mem = '130MB' does coax the planner into a GroupAggregate plan, > which is faster, but due to the the hash agg executor code not giving > any regard to work_mem. If I set work_mem to 140MB (which is more > realistic for this VM), it does cause the same swapping problems to > occur. Probably setting aggtransspace for this aggregate to 1024 would > help the costing problem, but it would also cause hashagg to be a less > chosen option during planning. I'm not quite sure I understand - the current code ends up using 8192 for the transition space (per count_agg_clauses_walker). Are you suggesting lowering the value, despite the danger of OOM issues? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 01/19/2016 06:04 AM, Tomas Vondra wrote: > Hi, > > On 01/19/2016 05:00 AM, David Rowley wrote: >> Good question. I looked at this and found my VM was swapping like crazy. >> Upon investigation it appears that's because, since the patch creates a >> memory context per aggregated group, and in this case I've got 1 million >> of them, it means we create 1 million context, which are >> ALLOCSET_SMALL_INITSIZE (1KB) in size, which means about 1GB of memory, >> which is more than my VM likes. > > Really? Where do we create the memory context? IIRC string_agg uses the > aggcontext directly, and indeed that's what I see in string_agg_transfn > and makeStringAggState. > > Perhaps you mean that initStringInfo() allocates 1kB buffers by default? ... > I'm not quite sure I understand - the current code ends up using 8192 > for the transition space (per count_agg_clauses_walker). Are you > suggesting lowering the value, despite the danger of OOM issues? Meh, right after sending the message I realized that perhaps this relates to Robert's patch and that maybe it changes this. And indeed, it changes the type from internal to text, and thus the special case in count_agg_clauses_walker no longer applies. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 19 January 2016 at 18:04, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,
On 01/19/2016 05:00 AM, David Rowley wrote:
On 19 January 2016 at 06:03, Pavel Stehule <pavel.stehule@gmail.com
<mailto:pavel.stehule@gmail.com>> wrote:
...
It is strange, why hashaggregate is too slow?
Good question. I looked at this and found my VM was swapping like crazy.
Upon investigation it appears that's because, since the patch creates a
memory context per aggregated group, and in this case I've got 1 million
of them, it means we create 1 million context, which are
ALLOCSET_SMALL_INITSIZE (1KB) in size, which means about 1GB of memory,
which is more than my VM likes.
Really? Where do we create the memory context? IIRC string_agg uses the aggcontext directly, and indeed that's what I see in string_agg_transfn and makeStringAggState.
Yeah, all this talk is relating to Robert's expandedstring-v1.patch which changes string_agg to use text and expanded-objects. This also means that a memory context is created per group, which is rather a big overhead.
On 01/19/2016 05:14 AM, Robert Haas wrote: > On Mon, Jan 18, 2016 at 12:34 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Mon, Jan 18, 2016 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Robert Haas <robertmhaas@gmail.com> writes: >>>> Yeah. The API contract for an expanded object took me quite a while >>>> to puzzle out, but it seems to be this: if somebody hands you an R/W >>>> pointer to an expanded object, you're entitled to assume that you can >>>> "take over" that object and mutate it however you like. But the >>>> object might be in some other memory context, so you have to move it >>>> into your own memory context. >>> >>> Only if you intend to keep it --- for example, a function that is >>> mutating and returning an object isn't required to move it >>> somewhere else, if the input is R/W, and I think it generally >>> shouldn't. >>> >>> In the context here, I'd think it is the responsibility of >>> nodeAgg.c not individual datatype functions to make sure that >>> expanded objects live where it wants them to. >> >> That's how I did it in my prototype, but the problem with that is that >> spinning up a memory context for every group sucks when there are many >> groups with only a small number of elements each - hence the 50% >> regression that David Rowley observed. If we're going to use expanded >> objects here, which seems like a good idea in principle, that's going >> to have to be fixed somehow. We're flogging the heck out of malloc by >> repeatedly creating a context, doing one or two allocations in it, and >> then destroying the context. >> >> I think that, in general, the memory context machinery wasn't really >> designed to manage lots of small contexts. The overhead of spinning >> up a new context for just a few allocations is substantial. That >> matters in some other situations too, I think - I've commonly seen >> AllocSetContextCreate taking several percent of runtime in profiles. >> But here it's much exacerbated. I'm not sure whether it's better to >> attack that problem at the root and try to make AllocSetContextCreate >> cheaper, or whether we should try to figure out some change to the >> expanded-object machinery to address the issue. > > Here is a patch that helps a good deal. I changed things so that when > we create a context, we always allocate at least 1kB. If that's more > than we need for the node itself and the name, then we use the rest of > the space as a sort of keeper block. I think there's more that can be > done to improve this, but I'm wimping out for now because it's late > here. > > I suspect something like this is a good idea even if we don't end up > using it for aggregate transition functions. I dare to claim that if expanded objects require a separate memory context per aggregate state (I don't see why would be the case), it's a dead end. Not so long ago we've fixed exactly this issue in array_agg(), which addressed a bunch of reported OOM issues and improved array_agg() performance quite a bit. It'd be rather crazy introducing the same problem to all aggregate functions. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 19 January 2016 at 17:14, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jan 18, 2016 at 12:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 18, 2016 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> Yeah. The API contract for an expanded object took me quite a while
>>> to puzzle out, but it seems to be this: if somebody hands you an R/W
>>> pointer to an expanded object, you're entitled to assume that you can
>>> "take over" that object and mutate it however you like. But the
>>> object might be in some other memory context, so you have to move it
>>> into your own memory context.
>>
>> Only if you intend to keep it --- for example, a function that is mutating
>> and returning an object isn't required to move it somewhere else, if the
>> input is R/W, and I think it generally shouldn't.
>>
>> In the context here, I'd think it is the responsibility of nodeAgg.c
>> not individual datatype functions to make sure that expanded objects
>> live where it wants them to.
>
> That's how I did it in my prototype, but the problem with that is that
> spinning up a memory context for every group sucks when there are many
> groups with only a small number of elements each - hence the 50%
> regression that David Rowley observed. If we're going to use expanded
> objects here, which seems like a good idea in principle, that's going
> to have to be fixed somehow. We're flogging the heck out of malloc by
> repeatedly creating a context, doing one or two allocations in it, and
> then destroying the context.
>
> I think that, in general, the memory context machinery wasn't really
> designed to manage lots of small contexts. The overhead of spinning
> up a new context for just a few allocations is substantial. That
> matters in some other situations too, I think - I've commonly seen
> AllocSetContextCreate taking several percent of runtime in profiles.
> But here it's much exacerbated. I'm not sure whether it's better to
> attack that problem at the root and try to make AllocSetContextCreate
> cheaper, or whether we should try to figure out some change to the
> expanded-object machinery to address the issue.
Here is a patch that helps a good deal. I changed things so that when
we create a context, we always allocate at least 1kB. If that's more
than we need for the node itself and the name, then we use the rest of
the space as a sort of keeper block. I think there's more that can be
done to improve this, but I'm wimping out for now because it's late
here.
I suspect something like this is a good idea even if we don't end up
using it for aggregate transition functions.
Thanks for writing this up, but I think the key issue is not addressed, which is the memory context per aggregate group just being a performance killer. I ran the test query again with the attached patch and the hashagg query still took 300 seconds on my VM with 4GB ram.
On Mon, Jan 18, 2016 at 11:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Here is a patch that helps a good deal. I changed things so that when >> we create a context, we always allocate at least 1kB. > > That's going to kill performance in some other cases; subtransactions > in particular rely on the subtransaction's TransactionContext not causing > any actual malloc unless something gets put into the TransactionContext. Sorry, I guess I wasn't clear: it increases the allocation for the context node itself to 1kB and uses the extra space to store a few allocations. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 19, 2016 at 12:22 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > I dare to claim that if expanded objects require a separate memory context > per aggregate state (I don't see why would be the case), it's a dead end. > Not so long ago we've fixed exactly this issue in array_agg(), which > addressed a bunch of reported OOM issues and improved array_agg() > performance quite a bit. It'd be rather crazy introducing the same problem > to all aggregate functions. Expanded objects require a separate memory context per Datum. I think I know how to rejigger things so that the overhead of that is minimized as much as possible, but it's still 200 bytes for the AllocSetContext + ~16 bytes for the name + 32 bytes for an AllocBlock + 48 bytes for an ExpandedObjectHeader. That's about 300 bytes of overhead per expanded datum, which means on this test case the theoretical minimum memory burn for this approach is about 300 MB (and in practice somewhat more). Ouch. The upshot seems to be that expanded objects aren't going to actually be useful in any situation where you might have very many of them, because the memory overhead is just awful. That's rather disappointing. Even if you could get rid of the requirement for a memory context per Datum - and it seems like you could if you added a method defined to reparent the object to a designated context, which looks like a totally reasonable innovation - somebody's probably going to say, not without some justification, that 48 bytes of ExpandedObjectHeader per object is an unaffordable luxury here. The structure has 8 bytes of unnecessary padding in it that we could elide easily enough, but after that there's not really much way to squeeze it down without hurting performance in some case cases where this mechanism is fast today. So I guess I'm going to agree that this approach is doomed. Rats. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[ rewinding to here from the detour I led us on ] On Mon, Dec 21, 2015 at 4:02 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: > Now, there has been talk of this previously, on various threads, but I don't > believe any final decisions were made on how exactly it should be done. At > the moment I plan to make changes as follows: > > Add 3 new columns to pg_aggregate, aggserialfn, aggdeserialfn and > aggserialtype These will only be required when aggtranstype is INTERNAL. Check. > Perhaps we should disallow CREATE AGGREAGET from accepting them for any > other type... Well, we should definitely not accept them and have them be meaningless. We should either reject them or accept them and then use them. I can't immediately think of a reason to serialize one non-internal type as another, but maybe there is one. > The return type of aggserialfn should be aggserialtype, and it > should take a single parameter of aggtranstype. aggdeserialfn will be the > reverse of that. Check. > Add a new bool field to nodeAgg's state named serialStates. If this is field > is set to true then when we're in finalizeAgg = false mode, we'll call the > serialfn on the agg state instead of the finalfn. This will allow the > serialized state to be stored in the tuple and sent off to the main backend. > The combine agg node should also be set to serialStates = true, so that it > knows to deserialize instead of just assuming that the agg state is of type > aggtranstype. I'm not quite sure, but it sounds like you might be overloading serialStates with two different meanings here. > I believe this should allow us to not cause any performance regressions by > moving away from INTERNAL agg states. It should also be very efficient for > internal states such as Int8TransTypeData, as this struct merely has 2 int64 > fields which should be very simple to stuff into a bytea varlena type. We > don't need to mess around converting the ->count and ->sum into a strings or > anything. I think it would be more user-friendly to emit these as 2-element integer arrays rather than bytea. Sure, bytea is fine when PostgreSQL is talking to itself, but if you are communicating with an external system, it's a different situation. If the remote system is PostgreSQL then you are again OK, except for the data going over the wire being incomprehensible and maybe byte-order-dependent, but what if you want some other database system to do partial aggregation and then further aggregate the result? You don't want the intermediate state to be some kooky thing that only another PostgreSQL database has a chance of generating correctly. > Then in order for the planner to allow parallel aggregation all aggregates > must: > > Not have a DISTINCT or ORDER BY clause > Have a combinefn > If aggtranstype = INTERNAL, must have a aggserialfn and aggdeserialfn. > > We can relax the requirement on 3 if we're using 2-stage aggregation, but > not parallel aggregation. When would we do that? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 19, 2016 at 11:56 AM, Robert Haas <robertmhaas@gmail.com> wrote: > [ rewinding to here from the detour I led us on ] > > On Mon, Dec 21, 2015 at 4:02 AM, David Rowley > <david.rowley@2ndquadrant.com> wrote: >> Now, there has been talk of this previously, on various threads, but I don't >> believe any final decisions were made on how exactly it should be done. At >> the moment I plan to make changes as follows: Oh, one more point: is there any reason why all of this needs to be a single (giant) patch? I feel like the handling of internal states could be a separate patch from the basic patch to allow partial aggregation and aggregate-combining, and that the latter could be committed first if we had a reasonably final version of it. That seems like it would be easier from a review standpoint, and might allow more development to proceed in parallel, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 20 January 2016 at 05:56, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Dec 21, 2015 at 4:02 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> Now, there has been talk of this previously, on various threads, but I don't
> believe any final decisions were made on how exactly it should be done. At
> the moment I plan to make changes as follows:
>
> Add 3 new columns to pg_aggregate, aggserialfn, aggdeserialfn and
> aggserialtype These will only be required when aggtranstype is INTERNAL.
Check.
> Perhaps we should disallow CREATE AGGREAGET from accepting them for any
> other type...
Well, we should definitely not accept them and have them be
meaningless. We should either reject them or accept them and then use
them. I can't immediately think of a reason to serialize one
non-internal type as another, but maybe there is one.
In the latest patch I'm disallowing serial and deserial functions for non INTERNAL state aggregates during CREATE AGGREGATE.
I think it's best to keep this disabled for now, until we discover some reason that we might want to enable it. If we enabled it, and later decided that was a dumb idea, then it'll be much harder to add the restriction later, since it may cause errors for users who have created their own aggregates.
> Add a new bool field to nodeAgg's state named serialStates. If this is field
> is set to true then when we're in finalizeAgg = false mode, we'll call the
> serialfn on the agg state instead of the finalfn. This will allow the
> serialized state to be stored in the tuple and sent off to the main backend.
> The combine agg node should also be set to serialStates = true, so that it
> knows to deserialize instead of just assuming that the agg state is of type
> aggtranstype.
I'm not quite sure, but it sounds like you might be overloading
serialStates with two different meanings here.
Hmm, only in the sense that serialStates means "serialise" and "deserialise". The only time that both could occur in the same node is if combineStates=true and finalizeAggs=false (in other words 3 or more aggregation stages).
Let's say we are performing aggregation in 3 stages. Stage 1 operates on normal values and uses the transfn on those values, but does not finalise the states. If serialStates = true here then, if one exists, we call the serialfn, passing in the aggregated state, and pass the return value of that function up to the next node. Now, this next (middle) node is important: serialStates must also be true here so that the executor knows to deserialise the previously serialised states. This node uses the combinefn to merge the states as required, and then, since serialStates is true, it also (re)serialises those new states again. If there was some reason that this middle node should deserialise, but not re-serialise, those states, then we may need an extra parameter to instruct it to do so. I guess this may be required if we were to perform some partial aggregation and combining within a single process (in which case we'd not need to serialise INTERNAL states, we could just pass the pointer to them in the tuple), but then still require the states to be serialised in order to hand them over from a worker process to the main process.
I can imagine cases where we might want to do this in the future, so perhaps it is worth it:
Imagine:
SELECT COUNT(*) FROM (SELECT * FROM (SELECT * FROM a UNION ALL SELECT * FROM b) AS ab UNION ALL (SELECT * FROM c UNION ALL SELECT * FROM d)) abcd;
Here the planner may decide to have one worker perform count(*) on a and then on b, combining those into ab, while some other worker process does the same for c and d.
> I believe this should allow us to not cause any performance regressions by
> moving away from INTERNAL agg states. It should also be very efficient for
> internal states such as Int8TransTypeData, as this struct merely has 2 int64
> fields which should be very simple to stuff into a bytea varlena type. We
> don't need to mess around converting the ->count and ->sum into a strings or
> anything.
I think it would be more user-friendly to emit these as 2-element
integer arrays rather than bytea. Sure, bytea is fine when PostgreSQL
is talking to itself, but if you are communicating with an external
system, it's a different situation. If the remote system is
PostgreSQL then you are again OK, except for the data going over the
wire being incomprehensible and maybe byte-order-dependent, but what
if you want some other database system to do partial aggregation and
then further aggregate the result? You don't want the intermediate
state to be some kooky thing that only another PostgreSQL database has
a chance of generating correctly.
If we do that then we also need to invent composite database types for the more complicated internal states such as NumericAggState.
We should probably also consider performance here. I had been considering just using the *send() functions to build a bytea.
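To sketch what that might look like for a state as simple as Int8TransTypeData (two int64 fields), here is a rough, hypothetical serial/deserial pair built on the existing pq_* helpers. The function and struct names are invented for illustration and this is not taken from the patch; a real pair would also check that it is being called in an aggregate context. A nice side effect of pqformat is that the values go out in network byte order, which covers the cross-platform concern above.

#include "postgres.h"
#include "fmgr.h"
#include "lib/stringinfo.h"
#include "libpq/pqformat.h"

/* Illustrative stand-in for a two-field transition state. */
typedef struct
{
	int64		count;
	int64		sum;
} Int8TransStateSketch;

/* Hypothetical aggserialfn: internal state in, bytea out. */
Datum
int8_avg_serialize_sketch(PG_FUNCTION_ARGS)
{
	Int8TransStateSketch *state = (Int8TransStateSketch *) PG_GETARG_POINTER(0);
	StringInfoData buf;

	pq_begintypsend(&buf);
	pq_sendint64(&buf, state->count);
	pq_sendint64(&buf, state->sum);
	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
}

/* Hypothetical aggdeserialfn: bytea in, internal state out. */
Datum
int8_avg_deserialize_sketch(PG_FUNCTION_ARGS)
{
	bytea	   *sstate = PG_GETARG_BYTEA_PP(0);
	Int8TransStateSketch *state = palloc(sizeof(Int8TransStateSketch));
	StringInfoData buf;

	initStringInfo(&buf);
	appendBinaryStringInfo(&buf, VARDATA_ANY(sstate), VARSIZE_ANY_EXHDR(sstate));
	state->count = pq_getmsgint64(&buf);
	state->sum = pq_getmsgint64(&buf);
	pq_getmsgend(&buf);
	pfree(buf.data);
	PG_RETURN_POINTER(state);
}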
> Then in order for the planner to allow parallel aggregation all aggregates
> must:
>
> Not have a DISTINCT or ORDER BY clause
> Have a combinefn
> If aggtranstype = INTERNAL, must have a aggserialfn and aggdeserialfn.
>
> We can relax the requirement on 3 if we're using 2-stage aggregation, but
> not parallel aggregation.
When would we do that?
+ * states as input rather than input tuples. This mode facilitates multiple
+ * aggregate stages which allows us to support pushing aggregation down
+ * deeper into the plan rather than leaving it for the final stage. For
+ * example with a query such as:
+ *
+ * SELECT count(*) FROM (SELECT * FROM a UNION ALL SELECT * FROM b);
+ *
+ * with this functionality the planner has the flexibility to generate a
+ * plan which performs count(*) on table a and table b separately and then
+ * add a combine phase to combine both results. In this case the combine
+ * function would simply add both counts together.
There's also the possibility of using this functionality later to allow GROUP BY to occur before the final plan stage, e.g. before joining to some relation.
On 20 January 2016 at 05:58, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Dec 21, 2015 at 4:02 AM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> Now, there has been talk of this previously, on various threads, but I don't
>> believe any final decisions were made on how exactly it should be done. At
>> the moment I plan to make changes as follows:
Oh, one more point: is there any reason why all of this needs to be a
single (giant) patch? I feel like the handling of internal states
could be a separate patch from the basic patch to allow partial
aggregation and aggregate-combining, and that the latter could be
committed first if we had a reasonably final version of it. That
seems like it would be easier from a review standpoint, and might
allow more development to proceed in parallel, too.
I didn't ever really imagine that I'd include any actual new combinefns or serialfn/deserialfn in this patch. One set has of course now ended up in there; I can move these off to the test patch for now.
Did you imagine that the first patch in the series would only add the aggcombinefn column to pg_aggregate and leave the aggserialfn stuff until later? I thought it seemed better to get the infrastructure committed in one hit, then add a bunch of new combinefn, serialfn, deserialfns for existing aggregates in follow-on commits.
--
On Tue, Jan 19, 2016 at 4:50 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: >> Oh, one more point: is there any reason why all of this needs to be a >> single (giant) patch? I feel like the handling of internal states >> could be a separate patch from the basic patch to allow partial >> aggregation and aggregate-combining, and that the latter could be >> committed first if we had a reasonably final version of it. That >> seems like it would be easier from a review standpoint, and might >> allow more development to proceed in parallel, too. > > I didn't ever really imagine that I'd include any actual new combinefns or > serialfn/deserialfn in this patch. One set has of course now ended up in > there, I can move these off to the test patch for now. > > Did you imagine that the first patch in the series would only add the > aggcombinefn column to pg_aggregate and leave the aggserialfn stuff until > later? I think that would be a sensible way to proceed. > I thought it seemed better to get the infrastructure committed in one > hit, then add a bunch of new combinefn, serialfn, deserialfns for existing > aggregates in follow-on commits. To my mind, priority #1 ought to be putting this fine new functionality to some use. Expanding it to every aggregate we've got seems like a distinctly second priority. That's not to say that it's absolutely gotta go down that way, but those would be my priorities. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 20 January 2016 at 10:54, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jan 19, 2016 at 4:50 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
>> Oh, one more point: is there any reason why all of this needs to be a
>> single (giant) patch? I feel like the handling of internal states
>> could be a separate patch from the basic patch to allow partial
>> aggregation and aggregate-combining, and that the latter could be
>> committed first if we had a reasonably final version of it. That
>> seems like it would be easier from a review standpoint, and might
>> allow more development to proceed in parallel, too.
>
> I didn't ever really imagine that I'd include any actual new combinefns or
> serialfn/deserialfn in this patch. One set has of course now ended up in
> there, I can move these off to the test patch for now.
>
> Did you imagine that the first patch in the series would only add the
> aggcombinefn column to pg_aggregate and leave the aggserialfn stuff until
> later?
I think that would be a sensible way to proceed.
> I thought it seemed better to get the infrastructure committed in one
> hit, then add a bunch of new combinefn, serialfn, deserialfns for existing
> aggregates in follow-on commits.
To my mind, priority #1 ought to be putting this fine new
functionality to some use. Expanding it to every aggregate we've got
seems like a distinctly second priority. That's not to say that it's
absolutely gotta go down that way, but those would be my priorities.
Agreed. So I've attached a version of the patch which does not have any of the serialise/deserialise stuff in it.
I've also attached a test patch which modifies the grouping planner to add a Partial Aggregate node, and a final aggregate node when it's possible. Running the regression tests with this patch only shows up variances in the EXPLAIN outputs, which is of course expected.
Attachment
On Wed, Jan 20, 2016 at 7:38 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: >> To my mind, priority #1 ought to be putting this fine new >> functionality to some use. Expanding it to every aggregate we've got >> seems like a distinctly second priority. That's not to say that it's >> absolutely gotta go down that way, but those would be my priorities. > > Agreed. So I've attached a version of the patch which does not have any of > the serialise/deserialise stuff in it. > > I've also attached a test patch which modifies the grouping planner to add a > Partial Aggregate node, and a final aggregate node when it's possible. > Running the regression tests with this patch only shows up variances in the > EXPLAIN outputs, which is of course expected. That seems great as a test, but what's the first patch that can put this to real and permanent use? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 21 January 2016 at 01:44, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jan 20, 2016 at 7:38 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
>> To my mind, priority #1 ought to be putting this fine new
>> functionality to some use. Expanding it to every aggregate we've got
>> seems like a distinctly second priority. That's not to say that it's
>> absolutely gotta go down that way, but those would be my priorities.
>
> Agreed. So I've attached a version of the patch which does not have any of
> the serialise/deserialise stuff in it.
>
> I've also attached a test patch which modifies the grouping planner to add a
> Partial Aggregate node, and a final aggregate node when it's possible.
> Running the regression tests with this patch only shows up variances in the
> EXPLAIN outputs, which is of course expected.
That seems great as a test, but what's the first patch that can put
this to real and permanent use?
In this patch I've changed aggregates_allow_partial() so that it properly determines what is possible based on the aggregates which are in the query. This now, of course, restricts aggregates whose state type is INTERNAL to "internal only" partial aggregation, providing there's a combine function.
Parallel aggregate should work with all the MAX() and MIN() functions and a handful of other ones; I've managed to borrow various existing functions as the combine function for many aggregates:
--
# select count(*) from pg_aggregate where aggcombinefn <> 0;
count
-------
58
(1 row)
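As an illustration of the kind of borrowing meant here: for MAX(int4) the existing int4larger() function already does exactly what a combine function needs to do, since merging two partial maxima is just a matter of taking the larger of the two:

-- combining two partial MAX(int4) states is simply taking the larger value
SELECT int4larger(10, 42);   -- returns 42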
On Wed, Jan 20, 2016 at 7:53 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 21 January 2016 at 01:44, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Wed, Jan 20, 2016 at 7:38 AM, David Rowley >> <david.rowley@2ndquadrant.com> wrote: >> >> To my mind, priority #1 ought to be putting this fine new >> >> functionality to some use. Expanding it to every aggregate we've got >> >> seems like a distinctly second priority. That's not to say that it's >> >> absolutely gotta go down that way, but those would be my priorities. >> > >> > Agreed. So I've attached a version of the patch which does not have any >> > of >> > the serialise/deserialise stuff in it. >> > >> > I've also attached a test patch which modifies the grouping planner to >> > add a >> > Partial Aggregate node, and a final aggregate node when it's possible. >> > Running the regression tests with this patch only shows up variances in >> > the >> > EXPLAIN outputs, which is of course expected. >> >> That seems great as a test, but what's the first patch that can put >> this to real and permanent use? > > There's no reason why parallel aggregates can't use the > combine_aggregate_state_d6d480b_2016-01-21.patch patch. I agree. Are you going to work on that? Are you expecting me to work on that? Do you think we can use Haribabu's patch? What other applications are in play in the near term, if any? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 20, 2016 at 7:38 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: > Agreed. So I've attached a version of the patch which does not have any of > the serialise/deserialise stuff in it. I re-reviewed this and have committed most of it with only minor kibitizing. A few notes: - I changed the EXPLAIN code, since it failed to display the aggregate node's mode of operation in non-text format. - I removed some of the planner support, since I'm not sure that it's completely correct and I'm very sure that it contains more code duplication than I'm comfortable with (set_combineagg_references in particular). Please feel free to resubmit this part, perhaps in combination with code that actually uses it for something. But please also think about whether there's a way to reduce the duplication, and maybe consider adding some comments, too. - I'm not sure that you made the right call regarding reusing transfn_oid and transfn for combine functions, vs. adding separate fields. It is sort of logical and has the advantage of keeping the data structure smaller, but it might also be thought confusing or inappropriate on other grounds. But I think we can change it later if that comes to seem like the right thing to do, so I've left it the way you had it for now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 21 January 2016 at 04:59, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jan 20, 2016 at 7:53 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 21 January 2016 at 01:44, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Wed, Jan 20, 2016 at 7:38 AM, David Rowley
>> <david.rowley@2ndquadrant.com> wrote:
>> >> To my mind, priority #1 ought to be putting this fine new
>> >> functionality to some use. Expanding it to every aggregate we've got
>> >> seems like a distinctly second priority. That's not to say that it's
>> >> absolutely gotta go down that way, but those would be my priorities.
>> >
>> > Agreed. So I've attached a version of the patch which does not have any
>> > of
>> > the serialise/deserialise stuff in it.
>> >
>> > I've also attached a test patch which modifies the grouping planner to
>> > add a
>> > Partial Aggregate node, and a final aggregate node when it's possible.
>> > Running the regression tests with this patch only shows up variances in
>> > the
>> > EXPLAIN outputs, which is of course expected.
>>
>> That seems great as a test, but what's the first patch that can put
>> this to real and permanent use?
>
> There's no reason why parallel aggregates can't use the
> combine_aggregate_state_d6d480b_2016-01-21.patch patch.
I agree. Are you going to work on that? Are you expecting me to work
on that? Do you think we can use Haribabu's patch? What other
applications are in play in the near term, if any?
At the moment I think everything which will use this is queued up behind the pathification of the grouping planner which Tom is working on. I think naturally Parallel Aggregate makes sense to work on first, given all the other parallel stuff in this release. I plan on working on that by either assisting Haribabu, or... whatever else it takes.
The other two usages which I have thought of are:
1) Aggregating before UNION ALL, which might be fairly simple after the grouping planner changes, as it may just be a matter of considering another "grouping path" which partially aggregates before the UNION ALL, and performs the final grouping stage after UNION ALL. At this stage it's hard to say how that will work as I'm not sure how far changes to the grouping planner will go. Perhaps Tom can comment?
2) Group before join. e.g select p.description,sum(s.qty) from sale s inner join s.product_id = p.product_id group by s.product_id group by p.description; I have a partial patch which implements this, although I was a bit stuck on if I should invent the concept of "GroupingPaths", or just inject alternative subquery relations which are already grouped by the correct clause. The problem with "GroupingPaths" was down to the row estimates currently come from the RelOptInfo and is set in set_baserel_size_estimates() which always assumes the ungrouped number of rows, which is not what's needed if the grouping is already performed. I was holding off to see how Tom does this in the grouping planner changes.
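(For clarity, the intended query is presumably something along these lines; the "product" table name here is only a guess, as it isn't spelled out above:)

select p.description, sum(s.qty)
from sale s
inner join product p on s.product_id = p.product_id
group by p.description;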
--
On 21 January 2016 at 08:06, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jan 20, 2016 at 7:38 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> Agreed. So I've attached a version of the patch which does not have any of
> the serialise/deserialise stuff in it.
I re-reviewed this and have committed most of it with only minor
kibitizing. A few notes:
Great! Thank you for committing.
- I changed the EXPLAIN code, since it failed to display the aggregate
node's mode of operation in non-text format.
Oops, thanks for fixing.
- I removed some of the planner support, since I'm not sure that it's
completely correct and I'm very sure that it contains more code
duplication than I'm comfortable with (set_combineagg_references in
particular). Please feel free to resubmit this part, perhaps in
combination with code that actually uses it for something. But please
also think about whether there's a way to reduce the duplication, and
maybe consider adding some comments, too.
That part is a gray area; I wasn't that sure it belonged to this patch either. The problem is that without it an Aggref's return type is wrong when the node is in combine mode, which might not be a problem now as we don't have a planner yet that generates these nodes.
- I'm not sure that you made the right call regarding reusing
transfn_oid and transfn for combine functions, vs. adding separate
fields. It is sort of logical and has the advantage of keeping the
data structure smaller, but it might also be thought confusing or
inappropriate on other grounds. But I think we can change it later if
that comes to seem like the right thing to do, so I've left it the way
you had it for now.
Sharing the variable, as horrid as that might seem, does simplify some code. For example, if you look at find_compatible_pertrans(), you see:
    if (aggtransfn != pertrans->transfn_oid ||
        aggtranstype != pertrans->aggtranstype)
        continue;
This would need to be changed to something like:
    if (!aggstate->combineStates &&
        (aggtransfn != pertrans->transfn_oid ||
         aggtranstype != pertrans->aggtranstype))
        continue;
    else if (aggstate->combineStates &&
             (aggcombinefn != pertrans->combinefn_oid ||
              aggtranstype != pertrans->aggtranstype))
        continue;
And we'd then have to pass aggcombinefn to that function too.
What might be better would be to rename the variable to some name that screams something a bit more generic.
--
On Thu, Jan 21, 2016 at 12:32 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 21 January 2016 at 04:59, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Wed, Jan 20, 2016 at 7:53 AM, David Rowley >> <david.rowley@2ndquadrant.com> wrote: >> > On 21 January 2016 at 01:44, Robert Haas <robertmhaas@gmail.com> wrote: >> >> >> >> On Wed, Jan 20, 2016 at 7:38 AM, David Rowley >> >> <david.rowley@2ndquadrant.com> wrote: >> >> >> To my mind, priority #1 ought to be putting this fine new >> >> >> functionality to some use. Expanding it to every aggregate we've >> >> >> got >> >> >> seems like a distinctly second priority. That's not to say that >> >> >> it's >> >> >> absolutely gotta go down that way, but those would be my priorities. >> >> > >> >> > Agreed. So I've attached a version of the patch which does not have >> >> > any >> >> > of >> >> > the serialise/deserialise stuff in it. >> >> > >> >> > I've also attached a test patch which modifies the grouping planner >> >> > to >> >> > add a >> >> > Partial Aggregate node, and a final aggregate node when it's >> >> > possible. >> >> > Running the regression tests with this patch only shows up variances >> >> > in >> >> > the >> >> > EXPLAIN outputs, which is of course expected. >> >> >> >> That seems great as a test, but what's the first patch that can put >> >> this to real and permanent use? >> > >> > There's no reason why parallel aggregates can't use the >> > combine_aggregate_state_d6d480b_2016-01-21.patch patch. >> >> I agree. Are you going to work on that? Are you expecting me to work >> on that? Do you think we can use Haribabu's patch? What other >> applications are in play in the near term, if any? > > > At the moment I think everything which will use this is queued up behind the > pathification of the grouping planner which Tom is working on. I think > naturally Parallel Aggregate makes sense to work on first, given all the > other parallel stuff in this release. I plan on working on that that by > either assisting Haribabu, or... whatever else it takes. > Here I attached updated patch of parallel aggregate based on the latest changes in master. Still it lack of cost comparison of normal aggregate to parallel aggregate because of difficulty. This cost comparison is required in parallel aggregate as this is having some regression when the number of groups are less in the query plan. Regards, Hari Babu Fujitsu Australia
Attachment
On Thu, Jan 21, 2016 at 1:33 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote: > > Here I attached updated patch of parallel aggregate based on the latest > changes in master. Still it lack of cost comparison of normal aggregate to > parallel aggregate because of difficulty. This cost comparison is required > in parallel aggregate as this is having some regression when the number > of groups are less in the query plan. > Updated patch is attached after removing a warning in building group aggregate path. Regards, Hari Babu Fujitsu Australia
Attachment
On 21 January 2016 at 08:06, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jan 20, 2016 at 7:38 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> Agreed. So I've attached a version of the patch which does not have any of
> the serialise/deserialise stuff in it.
I re-reviewed this and have committed most of it with only minor
kibitizing. A few notes:
I've attached the re-based remainder, which includes the serial/deserial again.
I'll submit this part to March 'fest, where hopefully we'll also have something to utilise it.
--
Attachment
On 21 January 2016 at 15:53, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Thu, Jan 21, 2016 at 1:33 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
> Here I attached updated patch of parallel aggregate based on the latest
> changes in master. Still it lack of cost comparison of normal aggregate to
> parallel aggregate because of difficulty. This cost comparison is required
> in parallel aggregate as this is having some regression when the number
> of groups are less in the query plan.
>
Updated patch is attached after removing a warning in building group
aggregate path.
Hi,
Thanks for updating the patch. I'd like to look at this with priority, but can you post it on the Parallel Agg thread? That way anyone following there can chime in over there rather than here. I've still got a bit of work to do (in the not too distant future) on the serial/deserial part, so it would be better to keep this thread for discussion on that.
On Wed, Jan 20, 2016 at 8:32 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > At the moment I think everything which will use this is queued up behind the > pathification of the grouping planner which Tom is working on. I think > naturally Parallel Aggregate makes sense to work on first, given all the > other parallel stuff in this release. I plan on working on that that by > either assisting Haribabu, or... whatever else it takes. Great! > The other two usages which I have thought of are; > > 1) Aggregating before UNION ALL, which might be fairly simple after the > grouping planner changes, as it may just be a matter of considering another > "grouping path" which partially aggregates before the UNION ALL, and > performs the final grouping stage after UNION ALL. At this stage it's hard > to say how that will work as I'm not sure how far changes to the grouping > planner will go. Perhaps Tom can comment? I hope he will, but in the meantime let me ask how this does us any good. UNION ALL turns into an Append node. Pushing aggregation through an Append doesn't make anything faster AFAICS *unless* you can further optimize beginning at that point. For example, if one or both sides of the Append node have a Gather under them, and you can push the partial aggregation down into the Gather; or if you hit a Foreign Scan and can have it do partial aggregation on the remote side, you win. But if you just have: Finalize Aggregate -> Append -> Partial Aggregate -> Thing One -> Partial Aggregate -> Thing Two Instead of: Aggregate -> Append -> Thing One -> Thing Two ...then I don't see the point, at least not for a single-group Aggregate or HashAggregate. For a GroupAggregate, Thing One and Thing Two might need to be sorted, and sorting two or more smaller data sets might be faster than sorting one larger data set, but I can't see us winning anything very big here. To be clear, I'm not saying we shouldn't do this. I just don't think it does much *by itself*. If you combine it with other optimizations that let the aggregation be pushed further down the plan tree, it could win big. > 2) Group before join. e.g select p.description,sum(s.qty) from sale s inner > join s.product_id = p.product_id group by s.product_id group by > p.description; I have a partial patch which implements this, although I was > a bit stuck on if I should invent the concept of "GroupingPaths", or just > inject alternative subquery relations which are already grouped by the > correct clause. The problem with "GroupingPaths" was down to the row > estimates currently come from the RelOptInfo and is set in > set_baserel_size_estimates() which always assumes the ungrouped number of > rows, which is not what's needed if the grouping is already performed. I was > holding off to see how Tom does this in the grouping planner changes. Your sample query has two GROUP BY clauses; I think you meant to include only the second one. This seems like it can win big, because it's possible that joining first means joining against a much larger relation, whereas aggregating first might reduce "sale" to a volume of data that fits in work_mem, after which you can hash join. On the other hand, if you add WHERE p.description LIKE '%snuffleupagus%' then the aggregate-first strategy figures to lose, assuming things are indexed properly. On the substantive question you raise, I hope Tom will weigh in here. However, I'll give you my take. 
It seems to me that you are likely to run into problems not only with the row counts but also with target lists and width estimates. All of those things change once you aggregate. It seems clear that you are going to need a GroupingPath here, but it also seems really clear that you can't take a path where some aggregation has been done and use add_path to toss it on the pile for the same RelOptInfo as unaggregated paths to which it is not comparable. Right now, there's a RelOptInfo corresponding to each subset of the joinrels for which we build paths; but in this new world, I think you'll end up with multiple RelOptInfos for each such subset, one per set of aggregations that has been applied. In your example above, you'd end up with RelOptInfos for (1) {s} with no aggregation, (2) {p} with no aggregation, (3) {s p} with no aggregation, (4) {s} aggregated by s.product_id and (5) {s p} aggregated by p.description. You can generate paths for (5) either by joining (2) to (4) or by aggregating (3), and the cheapest path for (5) is the overall winner. It seems a bit tricky to determine which aggregations are worth considering, and how to represent that in the RelOptInfo, but overall that feels like the right idea. I'm not sure how much this is going to overlap with the work Tom is already doing. I think his principal goal is to let the planning of a subquery return multiple candidate paths. That's almost certain to involve creating something like GroupingPath, though I can't predict the details, and I suspect it's also likely that he'll create some new RelOptInfo that represents the output of the grouping stage. However, I'm not sure that'll have the aggregation that has been performed represented in any principled way that would lend itself to having more than one RelOptInfo of that kind. That's probably another problem altogether. However, my Tom-predictive capabilities are far from perfect. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 20, 2016 at 9:33 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote: > Here I attached updated patch of parallel aggregate based on the latest > changes in master. Still it lack of cost comparison of normal aggregate to > parallel aggregate because of difficulty. This cost comparison is required > in parallel aggregate as this is having some regression when the number > of groups are less in the query plan. Please keep this on its own thread rather than colonizing this one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 22 January 2016 at 06:56, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jan 20, 2016 at 8:32 PM, David Rowley > <david.rowley@2ndquadrant.com> wrote: >> The other two usages which I have thought of are; >> >> 1) Aggregating before UNION ALL, which might be fairly simple after the >> grouping planner changes, as it may just be a matter of considering another >> "grouping path" which partially aggregates before the UNION ALL, and >> performs the final grouping stage after UNION ALL. At this stage it's hard >> to say how that will work as I'm not sure how far changes to the grouping >> planner will go. Perhaps Tom can comment? > > I hope he will, but in the meantime let me ask how this does us any > good. UNION ALL turns into an Append node. Pushing aggregation > through an Append doesn't make anything faster AFAICS *unless* you can > further optimize beginning at that point. For example, if one or both > sides of the Append node have a Gather under them, and you can push > the partial aggregation down into the Gather; or if you hit a Foreign > Scan and can have it do partial aggregation on the remote side, you > win. But if you just have: > > Finalize Aggregate > -> Append > -> Partial Aggregate > -> Thing One > -> Partial Aggregate > -> Thing Two > > Instead of: > > Aggregate > -> Append > -> Thing One > -> Thing Two > > ...then I don't see the point, at least not for a single-group > Aggregate or HashAggregate. For a GroupAggregate, Thing One and Thing > Two might need to be sorted, and sorting two or more smaller data sets > might be faster than sorting one larger data set, but I can't see us > winning anything very big here. > > To be clear, I'm not saying we shouldn't do this. I just don't think > it does much *by itself*. If you combine it with other optimizations > that let the aggregation be pushed further down the plan tree, it > could win big. Yes, I agree, it's not a big win, at least not in the case of a serial plan. If each branch of the UNION ALL could be processed in parallel, then that's different. It's quite simple to test how much of a win it'll be in the serial case today, and yes, it's not much, but it's a bit. create table t1 as select x from generate_series(1,1000000) x(x); vacuum analyze t1; select count(*) from (select * from t1 union all select * from t1) t; count ---------2000000 (1 row) Time: 185.793 ms -- Mock up pushed down aggregation by using sum() as a combine function for count(*) select sum(c) from (select count(*) c from t1 union all select count(*) from t1) t; sum ---------2000000 (1 row) Time: 162.076 ms Not particularly incredible, but we don't normally turn our noses up at a 14% improvement, so let's just see how complex it will be to implement, once the upper planner changes are done. But as you mention about lack of ability to make use of pre-sorted Path for each branch of the UNION ALL; I was really hoping Tom's patch will improve that part by allowing the planner to choose a pre-sorted Path and perform a MergeAppend instead of an Append, which would allow pre-sorted input into a GroupAggregate node. -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jan 21, 2016 at 3:52 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 21 January 2016 at 15:53, Haribabu Kommi <kommi.haribabu@gmail.com> > wrote: >> >> On Thu, Jan 21, 2016 at 1:33 PM, Haribabu Kommi >> <kommi.haribabu@gmail.com> wrote: >> > >> > Here I attached updated patch of parallel aggregate based on the latest >> > changes in master. Still it lack of cost comparison of normal aggregate >> > to >> > parallel aggregate because of difficulty. This cost comparison is >> > required >> > in parallel aggregate as this is having some regression when the number >> > of groups are less in the query plan. >> > >> >> Updated patch is attached after removing a warning in building group >> aggregate path. > > > Hi, > > Thanks for updating the patch. I'd like to look at this with priority, but > can you post it on the Parallel Agg thread? that way anyone following there > can chime in over there rather than here. I've still got a bit of work to > do (in the not too distant future) on the serial/deserial part, so would be > better to keep this thread for discussion on that. Thanks for the details. Sorry for sending parallel aggregate patch in this thread. I will take care of it from next time onward. Regards, Hari Babu Fujitsu Australia
On Thu, Jan 21, 2016 at 4:08 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > It's quite simple to test how much of a win it'll be in the serial > case today, and yes, it's not much, but it's a bit. > > create table t1 as select x from generate_series(1,1000000) x(x); > vacuum analyze t1; > select count(*) from (select * from t1 union all select * from t1) t; > count > --------- > 2000000 > (1 row) > > Time: 185.793 ms > > -- Mock up pushed down aggregation by using sum() as a combine > function for count(*) > select sum(c) from (select count(*) c from t1 union all select > count(*) from t1) t; > sum > --------- > 2000000 > (1 row) > > Time: 162.076 ms > > Not particularly incredible, but we don't normally turn our noses up > at a 14% improvement, so let's just see how complex it will be to > implement, once the upper planner changes are done. Mumble mumble. Why is that even any faster? Just because we avoid the projection overhead of the Append node, which is a no-op here anyway? If so, a planner change is one thought, but perhaps we also ought to look at whether we can't reduce the execution-time overhead. > But as you mention about lack of ability to make use of pre-sorted > Path for each branch of the UNION ALL; I was really hoping Tom's patch > will improve that part by allowing the planner to choose a pre-sorted > Path and perform a MergeAppend instead of an Append, which would allow > pre-sorted input into a GroupAggregate node. I won't hazard a guess on that point... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 20, 2016 at 11:06 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jan 20, 2016 at 7:38 AM, David Rowley > <david.rowley@2ndquadrant.com> wrote: >> Agreed. So I've attached a version of the patch which does not have any of >> the serialise/deserialise stuff in it. > > I re-reviewed this and have committed most of it with only minor > kibitizing. A few notes: This commit has broken pg_dump. At least, I think this is the thread referencing this commit: commit a7de3dc5c346e07e0439275982569996e645b3c2 Author: Robert Haas <rhaas@postgresql.org> Date: Wed Jan 20 13:46:50 2016 -0500 Support multi-stage aggregation. If I add an aggregate to an otherwise empty 9.4.5 database: CREATE OR REPLACE FUNCTION first_agg ( anyelement, anyelement ) RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$ SELECT $1; $$; -- And then wrap an aggregate around it -- aggregates do not support create or replace, alas. CREATE AGGREGATE first ( sfunc = first_agg, basetype = anyelement, stype = anyelement ); And then run pg_dump from 9.6.this on it, I get: pg_dump: column number -1 is out of range 0..17 Segmentation fault (core dumped) Program terminated with signal 11, Segmentation fault. #0 0x0000000000416b0b in dumpAgg (fout=0x1e551e0, agginfo=0x1e65ec0) at pg_dump.c:12670 12670 if (strcmp(aggcombinefn, "-") != 0) (gdb) bt #0 0x0000000000416b0b in dumpAgg (fout=0x1e551e0, agginfo=0x1e65ec0) at pg_dump.c:12670 #1 0x000000000041df7a in main (argc=<value optimized out>, argv=<value optimized out>) at pg_dump.c:810 Cheers, Jeff
On 23 January 2016 at 09:17, Jeff Janes <jeff.janes@gmail.com> wrote: > On Wed, Jan 20, 2016 at 11:06 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Jan 20, 2016 at 7:38 AM, David Rowley >> <david.rowley@2ndquadrant.com> wrote: >>> Agreed. So I've attached a version of the patch which does not have any of >>> the serialise/deserialise stuff in it. >> >> I re-reviewed this and have committed most of it with only minor >> kibitizing. A few notes: > > > This commit has broken pg_dump. At least, I think this is the thread > referencing this commit: > ... > pg_dump: column number -1 is out of range 0..17 > Segmentation fault (core dumped) > > Program terminated with signal 11, Segmentation fault. > #0 0x0000000000416b0b in dumpAgg (fout=0x1e551e0, agginfo=0x1e65ec0) > at pg_dump.c:12670 > 12670 if (strcmp(aggcombinefn, "-") != 0) > (gdb) bt > #0 0x0000000000416b0b in dumpAgg (fout=0x1e551e0, agginfo=0x1e65ec0) > at pg_dump.c:12670 > #1 0x000000000041df7a in main (argc=<value optimized out>, > argv=<value optimized out>) at pg_dump.c:810 Yikes :-( I will look at this with priority in the next few hours. Thanks for the report. -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 23 January 2016 at 09:44, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 23 January 2016 at 09:17, Jeff Janes <jeff.janes@gmail.com> wrote: >> On Wed, Jan 20, 2016 at 11:06 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Wed, Jan 20, 2016 at 7:38 AM, David Rowley >>> <david.rowley@2ndquadrant.com> wrote: >>>> Agreed. So I've attached a version of the patch which does not have any of >>>> the serialise/deserialise stuff in it. >>> >>> I re-reviewed this and have committed most of it with only minor >>> kibitizing. A few notes: >> >> >> This commit has broken pg_dump. At least, I think this is the thread >> referencing this commit: >> > ... >> pg_dump: column number -1 is out of range 0..17 >> Segmentation fault (core dumped) >> >> Program terminated with signal 11, Segmentation fault. >> #0 0x0000000000416b0b in dumpAgg (fout=0x1e551e0, agginfo=0x1e65ec0) >> at pg_dump.c:12670 >> 12670 if (strcmp(aggcombinefn, "-") != 0) >> (gdb) bt >> #0 0x0000000000416b0b in dumpAgg (fout=0x1e551e0, agginfo=0x1e65ec0) >> at pg_dump.c:12670 >> #1 0x000000000041df7a in main (argc=<value optimized out>, >> argv=<value optimized out>) at pg_dump.c:810 > > Yikes :-( I will look at this with priority in the next few hours. > > Thanks for the report. It seems that I must have mistakenly believed that non-existing columns for previous versions were handled using the power of magic. Turns out that I was wrong, and they need to be included as dummy columns in the queries for previous versions. The attached fixes the problem. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Fri, Jan 22, 2016 at 4:53 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > > It seems that I must have mistakenly believed that non-existing > columns for previous versions were handled using the power of magic. > Turns out that I was wrong, and they need to be included as dummy > columns in the queries for previous versions. > > The attached fixes the problem. Yep, that fixes it. Thanks, Jeff
On Sat, Jan 23, 2016 at 6:26 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Fri, Jan 22, 2016 at 4:53 PM, David Rowley > <david.rowley@2ndquadrant.com> wrote: >> >> It seems that I must have mistakenly believed that non-existing >> columns for previous versions were handled using the power of magic. >> Turns out that I was wrong, and they need to be included as dummy >> columns in the queries for previous versions. >> >> The attached fixes the problem. > > Yep, that fixes it. Committed. Apologies for the delay. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 21, 2016 at 3:42 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 21 January 2016 at 08:06, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Wed, Jan 20, 2016 at 7:38 AM, David Rowley >> <david.rowley@2ndquadrant.com> wrote: >> > Agreed. So I've attached a version of the patch which does not have any >> > of >> > the serialise/deserialise stuff in it. >> >> I re-reviewed this and have committed most of it with only minor >> kibitizing. A few notes: > > > I've attached the re-based remainder, which includes the serial/deserial > again. > > I'll submit this part to March 'fest, where hopefully we'll also have > something to utilise it. >

While testing parallel aggregate with float4 and float8 types based on the latest patch, I found the following problems:

+ /*
+  * For partial aggregation we must adjust the return types of
+  * the Aggrefs
+  */
+ if (!aggplan->finalizeAggs)
+     set_partialagg_aggref_types(root, plan,
+                                 aggplan->serialStates);

[...]

+     aggref->aggtype = aggform->aggserialtype;
+ else
+     aggref->aggtype = aggform->aggtranstype;

Changing the aggref->aggtype to aggtranstype or aggserialtype only changes it in the partial aggregate plan, as set_upper_references starts from the top plan and works downwards. Because of this, the targetlists of the nodes below the finalize aggregate still point to the original type. To fix this problem, I tried updating the targetlist aggref->aggtype with the transtype during the aggregate plan itself, but that leads to a problem in setting upper plan references. This is because, while fixing the aggregate references of the upper plans above the partial aggregate, the aggref at the upper plan nodes doesn't match the aggref coming from the partial aggregate node, because of the aggtype difference in the _equalAggref function:

COMPARE_SCALAR_FIELD(aggtype);

Temporarily I corrected it to compare against aggtranstype and aggserialtype as well, and then it works fine, but I don't see that change as the correct approach. Do you have any better idea to solve this problem?

Regards, Hari Babu Fujitsu Australia
On 1/20/16 11:42 PM, David Rowley wrote: > On 21 January 2016 at 08:06, Robert Haas <robertmhaas@gmail.com > <mailto:robertmhaas@gmail.com>> wrote: > > On Wed, Jan 20, 2016 at 7:38 AM, David Rowley > <david.rowley@2ndquadrant.com <mailto:david.rowley@2ndquadrant.com>> > wrote: > > Agreed. So I've attached a version of the patch which does not have any of > > the serialise/deserialise stuff in it. > > I re-reviewed this and have committed most of it with only minor > kibitizing. A few notes: > > > I've attached the re-based remainder, which includes the serial/deserial > again. > > I'll submit this part to March 'fest, where hopefully we'll also have > something to utilise it. As far as I can tell this is the most recent version of the patch but it doesn't apply on master. I'm guessing that's mostly because of Tom's UPP patch. This is a long and confusing thread and I should know since I just read through the entire thing. I'm setting this back to "waiting for author" and I'd like see a rebased patch and a summary of what's in the patch before setting it back to "needs review". -- -David david@pgmasters.net
On 12 March 2016 at 06:28, David Steele <david@pgmasters.net> wrote: > As far as I can tell this is the most recent version of the patch but it > doesn't apply on master. I'm guessing that's mostly because of Tom's > UPP patch. > > This is a long and confusing thread and I should know since I just read > through the entire thing. I'm setting this back to "waiting for author" > and I'd like see a rebased patch and a summary of what's in the patch > before setting it back to "needs review".

Many thanks for taking the time to read through this thread. Let me try to summarise the current status.

Overview: Combining aggregate states is required to allow aggregate functions to be calculated in multiple stages. The primary intended use for this, as of now, is allowing parallel aggregate. A simple combine function for count(*) would perform something very similar to sum() of all of the partially calculated aggregate states. In this case "partially" means aggregated states without the final function called.

Status: As of [1] we have basic infrastructure in place to allow this, but that commit only added the infrastructure for aggregate functions which don't have an INTERNAL state type. Many aggregates do have an internal state type, and for the case of parallel aggregate, since these internal states are just passed around as pointers to memory which belongs to the process in which the memory was allocated, we're unable to pass these back to the main process for the final aggregate stage. This patch allows this to happen by way of serializing these internal states into some type that can be copied into the main backend process. Robert was quite right to leave this part out of the commit, as at the time the method I was using was under debate, and there was no reason to hold things up, as the simpler case could be committed to allow work to continue on features which would use this new infrastructure. These threads are a bit of a mess as there's often confusion (from me too) about which code belongs in which patch. Quite possibly the setrefs.c stuff of the parallel aggregate patch actually does belong here, but it's probably not worth shuffling things around as I think the parallel agg stuff should go in before the additional serial states stuff.

Current patch: I've now updated the patch to base it on top of the parallel aggregate patch in [2]. To apply the attached, you must apply [2] first! The changes to numeric.c are not really intended for commit for now. I plan on spending a bit of time first on looking at using send functions and bytea serial types, which, if that works, should be far more efficient than converting the serialized output into a string. The example combine function in numeric.c is just intended to show that this does actually work. I've also left one opr_sanity test failing for now. This test is complaining that the de-serialisation function returns an INTERNAL type and does not accept one (which would disallow it from being called from the SQL interface). I think this is safe to allow as I've coded the function to fail if it's not called in aggregate context. Perhaps I'm wrong though, so I left the test failing to highlight that I'd like someone else to look and maybe think about this a bit. It seems very important not to get this part wrong.

Comments and reviews etc all welcome. Please remember to apply [2] before the attached.
[1] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a7de3dc5c346e07e0439275982569996e645b3c2 [2] http://www.postgresql.org/message-id/CAKJS1f9Tr-+9aKWZ1XuHUnEJZ7GKKfo45b+4fCNj8DkrDZYK4g@mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
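For reference, a minimal sketch of what declaring such an aggregate might look like under the proposed CREATE AGGREGATE options. All of the function names here are hypothetical placeholders, and the option names are only those being proposed in this patch, so may well change:

CREATE AGGREGATE myavg(numeric) (
    SFUNC        = myavg_trans,       -- per-row transition function, builds the internal state
    STYPE        = internal,
    COMBINEFUNC  = myavg_combine,     -- merges two partial internal states
    SERIALFUNC   = myavg_serialize,   -- internal state -> something that can cross process boundaries
    DESERIALFUNC = myavg_deserialize, -- and back again
    FINALFUNC    = myavg_final        -- produces the final numeric result
);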
Attachment
On 14 March 2016 at 15:20, David Rowley <david.rowley@2ndquadrant.com> wrote: > Current patch: > I've now updated the patch to base it on top of the parallel aggregate > patch in [2]. To apply the attached, you must apply [2] first! ... > [2] http://www.postgresql.org/message-id/CAKJS1f9Tr-+9aKWZ1XuHUnEJZ7GKKfo45b+4fCNj8DkrDZYK4g@mail.gmail.com The attached fixes a small thinko in apply_partialaggref_nodes(). -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi, On 03/14/2016 05:45 AM, David Rowley wrote: > On 14 March 2016 at 15:20, David Rowley <david.rowley@2ndquadrant.com> wrote: >> Current patch: >> I've now updated the patch to base it on top of the parallel aggregate >> patch in [2]. To apply the attached, you must apply [2] first! > > ... >> [2] http://www.postgresql.org/message-id/CAKJS1f9Tr-+9aKWZ1XuHUnEJZ7GKKfo45b+4fCNj8DkrDZYK4g@mail.gmail.com > > The attached fixes a small thinko in apply_partialaggref_nodes(). After looking at the parallel aggregate patch, I also looked at this one, as it's naturally related. Sadly I haven't found any issue that I could nag about ;-) The patch seems well baked, as it was in the oven for quite a long time already. The one concern I do have is that it only adds (de)serialize functions for SUM(numeric) and AVG(numeric). I think it's reasonable not to include that into the patch, but it will look a bit silly if that's all that gets into 9.6. It's true plenty of aggregates is supported out of the box, as they use fixed-length data types for the aggregate state. But imagine users happy that their aggregate queries get parallelized, only to find out that sum(int8) disables that :-/ So I think we should try to support as many of the aggregates using internal as possible - it seems quite straight-forward to do that at least for the aggregates using the same internal representation (avg, sum, var_samp, var_pop, variance, ...). That's 26 aggregates out of the 45 with aggtranstype=INTERNAL. For the remaining aggregates (jsonb_*. json_*, string_agg, array_agg, percentile_) it's probably more complicated to serialize the state, and maybe even impossible to do that correctly (e.g. string_agg accumulates the strings in the order as received, but partial paths would build private lists - not sure if this can actually happen in practice). I think it would be good to actually document which aggregates support parallel execution, and which don't. Currently the docs don't mention this at all, and it's tricky for regular users to deduce that as it's related to the type of the internal state and not to the input types. An example of that is the SUM(bigint) example mentioned above. That's actually the one thing I think the patch is missing. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
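A rough way for a user to see which aggregates would qualify, assuming the proposed aggcombinefn/aggserialfn/aggdeserialfn catalog columns, is a catalog query along these lines (only a sketch):

-- aggregates with a combine function, where an INTERNAL transition state
-- also has serialisation/deserialisation functions available
SELECT aggfnoid
FROM pg_aggregate
WHERE aggcombinefn <> 0
  AND (aggtranstype <> 'internal'::regtype
       OR (aggserialfn <> 0 AND aggdeserialfn <> 0));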
On 16 March 2016 at 06:39, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > After looking at the parallel aggregate patch, I also looked at this one, as > it's naturally related. Sadly I haven't found any issue that I could nag > about ;-) The patch seems well baked, as it was in the oven for quite a long > time already.

Thanks for looking at this.

> The one concern I do have is that it only adds (de)serialize functions for > SUM(numeric) and AVG(numeric). I think it's reasonable not to include that > into the patch, but it will look a bit silly if that's all that gets into > 9.6.

Yes me too, so I spent several hours yesterday writing all of the combine functions and serialisation/deserialisation that are required for all of SUM(), AVG() and STDDEV*(). I also noticed that I had missed using some existing functions for bool_and() and bool_or() so I added those to pg_aggregate.h. I'm just chasing down a crash bug on HAVE_INT128 enabled builds, so should be posting a patch quite soon. I didn't touch the FLOAT4 and FLOAT8 aggregates as I believe Haribabu has a patch for that over on the parallel aggregate thread. I've not looked at it in detail yet.

If Haribabu's patch does all that's required for the numerical aggregates for floating point types then the status of covered aggregates is (in order of pg_aggregate.h):

* AVG() complete coverage
* SUM() complete coverage
* MAX() complete coverage
* MIN() complete coverage
* COUNT() complete coverage
* STDDEV + friends complete coverage
* regr_*, covar_pop, covar_samp, corr not touched
* bool*() complete coverage
* bitwise aggs complete coverage
* Remaining are not touched. I see diminishing returns with making these parallel for now. I think it might not be worth pushing myself any harder to make these ones work.

Does what I have done + floating point aggs from Haribabu seem reasonable for 9.6?

> I think it would be good to actually document which aggregates support > parallel execution, and which don't. Currently the docs don't mention this > at all, and it's tricky for regular users to deduce that as it's related to > the type of the internal state and not to the input types. An example of > that is the SUM(bigint) example mentioned above.

I agree. I will look for a suitable place.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Mar 16, 2016 at 8:34 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 16 March 2016 at 06:39, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> After looking at the parallel aggregate patch, I also looked at this one, as >> it's naturally related. Sadly I haven't found any issue that I could nag >> about ;-) The patch seems well baked, as it was in the oven for quite a long >> time already. > > Thanks for looking at this. > >> The one concern I do have is that it only adds (de)serialize functions for >> SUM(numeric) and AVG(numeric). I think it's reasonable not to include that >> into the patch, but it will look a bit silly if that's all that gets into >> 9.6. > > Yes me too, so I spent several hours yesterday writing all of the > combine functions and serialisation/deserialisation that are required > for all of SUM(), AVG() STDDEV*(). I also noticed that I had missed > using some existing functions for bool_and() and bool_or() so I added > those to pg_aggregate.h. I'm just chasing down a crash bug on > HAVE_INT128 enabled builds, so should be posting a patch quite soon. I > didn't touch the FLOAT4 and FLOAT8 aggregates as I believe Haribabu > has a patch for that over on the parallel aggregate thread. I've not > looked at it in detail yet. The additional combine function patch that I posted handles all float4 and float8 aggregates. There is an OID conflict with the latest source code, I will update the patch and post it in that thread. Regards, Hari Babu Fujitsu Australia
On 16 March 2016 at 10:34, David Rowley <david.rowley@2ndquadrant.com> wrote: > If Haribabu's patch does all that's required for the numerical > aggregates for floating point types then the status of covered > aggregates is (in order of pg_aggregate.h): > > * AVG() complete coverage > * SUM() complete coverage > * MAX() complete coverage > * MIN() complete coverage > * COUNT() complete coverage > * STDDEV + friends complete coverage > * regr_*,covar_pop,covar_samp,corr not touched these. > * bool*() complete coverage > * bitwise aggs. complete coverage > * Remaining are not touched. I see diminishing returns with making > these parallel for now. I think I might not be worth pushing myself > any harder to make these ones work. > > Does what I have done + floating point aggs from Haribabu seem > reasonable for 9.6? I've attached a series of patches. Patch 1: This is the parallel aggregate patch, not intended for review here. However, all further patches are based on this, and this adds the required planner changes to make it possible to test patches 2 and 3. Patch 2: This adds the serial/deserial aggregate infrastructure, pg_dump support, CREATE AGGREGATE changes, and nodeAgg.c changes to have it serialise and deserialise aggregate states when instructed to do so. Patch 3: This adds a boat load of serial/deserial functions, and combine functions for most of the built-in numerical aggregate functions. It also contains some regression tests which should really be in patch 2, but I with patch 2 there's no suitable serialisation or de-serialisation functions to test CREATE AGGREGATE with. I think having them here is ok, as patch 2 is quite useless without patch 3 anyway. Another thing to note about this patch is that I've gone and created serial/de-serial functions for when PolyNumAggState both require sumX2, and don't require sumX2. I had thought about perhaps putting an extra byte in the serial format to indicate if a sumX2 is included, but I ended up not doing it this way. I don't really want these serial formats getting too complex as we might like to do fun things like pass them along to sharded servers one day, so it might be nice to keep them simple. Patch 4: Adds a bunch of opr_sanity regression tests. This could be part of patch 3, but 3 was quite big already. Patch 5: Adds some documentation to indicate which aggregates allow partial mode. Comments welcome. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
- 0001-Allow-aggregation-to-happen-in-parallel_2016-03-16.patch
- 0002-Allow-INTERNAL-state-aggregates-to-participate-in-pa_2016-03-16.patch
- 0003-Add-various-aggregate-combine-and-serialize-de-seria_2016-03-16.patch
- 0004-Add-sanity-regression-tests-for-new-aggregate-serial_2016-03-16.patch
- 0005-Add-documents-to-explain-which-aggregates-support-pa_2016-03-16.patch
On 16 March 2016 at 23:54, Haribabu Kommi <kommi.haribabu@gmail.com> wrote: > On Wed, Mar 16, 2016 at 8:34 AM, David Rowley > <david.rowley@2ndquadrant.com> wrote: >> Yes me too, so I spent several hours yesterday writing all of the >> combine functions and serialisation/deserialisation that are required >> for all of SUM(), AVG() STDDEV*(). I also noticed that I had missed >> using some existing functions for bool_and() and bool_or() so I added >> those to pg_aggregate.h. I'm just chasing down a crash bug on >> HAVE_INT128 enabled builds, so should be posting a patch quite soon. I >> didn't touch the FLOAT4 and FLOAT8 aggregates as I believe Haribabu >> has a patch for that over on the parallel aggregate thread. I've not >> looked at it in detail yet. > > The additional combine function patch that I posted handles all float4 and > float8 aggregates. There is an OID conflict with the latest source code, > I will update the patch and post it in that thread. Thanks! I just send a series of patches which add a whole host of serial/deserial functions, and a patch which adds some documentation. Maybe you could base your patch on the 0005 patch, and update the documents to remove the "All types apart from floating-point types" text and replace that with "Yes". -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 03/16/2016 12:03 PM, David Rowley wrote: > On 16 March 2016 at 10:34, David Rowley <david.rowley@2ndquadrant.com> wrote: >> If Haribabu's patch does all that's required for the numerical >> aggregates for floating point types then the status of covered >> aggregates is (in order of pg_aggregate.h): >> >> * AVG() complete coverage >> * SUM() complete coverage >> * MAX() complete coverage >> * MIN() complete coverage >> * COUNT() complete coverage >> * STDDEV + friends complete coverage >> * regr_*,covar_pop,covar_samp,corr not touched these. >> * bool*() complete coverage >> * bitwise aggs. complete coverage >> * Remaining are not touched. I see diminishing returns with making >> these parallel for now. I think I might not be worth pushing myself >> any harder to make these ones work. >> >> Does what I have done + floating point aggs from Haribabu seem >> reasonable for 9.6? > > I've attached a series of patches. Thanks. Haven't really done much testing of this, aside from reading through the patches and "it compiles". > > Patch 1: > This is the parallel aggregate patch, not intended for review here. > However, all further patches are based on this, and this adds the > required planner changes to make it possible to test patches 2 and 3. > > Patch 2: > This adds the serial/deserial aggregate infrastructure, pg_dump > support, CREATE AGGREGATE changes, and nodeAgg.c changes to have it > serialise and deserialise aggregate states when instructed to do so. > > Patch 3: > This adds a boat load of serial/deserial functions, and combine > functions for most of the built-in numerical aggregate functions. It > also contains some regression tests which should really be in patch 2, > but I with patch 2 there's no suitable serialisation or > de-serialisation functions to test CREATE AGGREGATE with. I think > having them here is ok, as patch 2 is quite useless without patch 3 > anyway. I don't see how you could move the tests into #2 when the functions are defined in #3? IMHO this is the right place for the regression tests. > > Another thing to note about this patch is that I've gone and created > serial/de-serial functions for when PolyNumAggState both require > sumX2, and don't require sumX2. I had thought about perhaps putting an > extra byte in the serial format to indicate if a sumX2 is included, > but I ended up not doing it this way. I don't really want these serial > formats getting too complex as we might like to do fun things like > pass them along to sharded servers one day, so it might be nice to > keep them simple. Hmmm, I've noticed that while eyeballing the diffs, and I'm not sure if it's worth the additional complexity at this point. I mean, one byte is hardly going to make a measurable difference - we're probably wasting more than that due to alignment, for example. As you've mentioned sharded servers - how stable should the serialized format be? I've assumed it to be transient, i.e. in the extreme case it might change after restarting a database - for the parallel aggregate that's clearly sufficient. But if we expect this to eventually work in a sharded environment, that's going to be way more complicated. I guess in these cases we could rely on implicit format versioning, via the major the version (doubt we'd tweak the format in a minor version anyway). I'm not sure the sharding is particularly useful argument at this point, considering we don't really know if the current format is actually a good match for that. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 16, 2016 at 10:08 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 16 March 2016 at 23:54, Haribabu Kommi <kommi.haribabu@gmail.com> wrote: >> On Wed, Mar 16, 2016 at 8:34 AM, David Rowley >> <david.rowley@2ndquadrant.com> wrote: >>> Yes me too, so I spent several hours yesterday writing all of the >>> combine functions and serialisation/deserialisation that are required >>> for all of SUM(), AVG() STDDEV*(). I also noticed that I had missed >>> using some existing functions for bool_and() and bool_or() so I added >>> those to pg_aggregate.h. I'm just chasing down a crash bug on >>> HAVE_INT128 enabled builds, so should be posting a patch quite soon. I >>> didn't touch the FLOAT4 and FLOAT8 aggregates as I believe Haribabu >>> has a patch for that over on the parallel aggregate thread. I've not >>> looked at it in detail yet. >> >> The additional combine function patch that I posted handles all float4 and >> float8 aggregates. There is an OID conflict with the latest source code, >> I will update the patch and post it in that thread. > > Thanks! I just send a series of patches which add a whole host of > serial/deserial functions, and a patch which adds some documentation. > Maybe you could base your patch on the 0005 patch, and update the > documents to remove the "All types apart from floating-point types" > text and replace that with "Yes". Here I attached updated float aggregates patch based on 0005 patch. Regards, Hari Babu Fujitsu Australia
Attachment
On 17 March 2016 at 16:30, Haribabu Kommi <kommi.haribabu@gmail.com> wrote: > On Wed, Mar 16, 2016 at 10:08 PM, David Rowley > <david.rowley@2ndquadrant.com> wrote: >> On 16 March 2016 at 23:54, Haribabu Kommi <kommi.haribabu@gmail.com> wrote: >>> On Wed, Mar 16, 2016 at 8:34 AM, David Rowley >>> <david.rowley@2ndquadrant.com> wrote: >>>> Yes me too, so I spent several hours yesterday writing all of the >>>> combine functions and serialisation/deserialisation that are required >>>> for all of SUM(), AVG() STDDEV*(). I also noticed that I had missed >>>> using some existing functions for bool_and() and bool_or() so I added >>>> those to pg_aggregate.h. I'm just chasing down a crash bug on >>>> HAVE_INT128 enabled builds, so should be posting a patch quite soon. I >>>> didn't touch the FLOAT4 and FLOAT8 aggregates as I believe Haribabu >>>> has a patch for that over on the parallel aggregate thread. I've not >>>> looked at it in detail yet. >>> >>> The additional combine function patch that I posted handles all float4 and >>> float8 aggregates. There is an OID conflict with the latest source code, >>> I will update the patch and post it in that thread. >> >> Thanks! I just send a series of patches which add a whole host of >> serial/deserial functions, and a patch which adds some documentation. >> Maybe you could base your patch on the 0005 patch, and update the >> documents to remove the "All types apart from floating-point types" >> text and replace that with "Yes". > > Here I attached updated float aggregates patch based on 0005 patch. Great! Thanks for sending that. I just had a quick skim over the patch and noticed the naming convention you're using for the combine function is *_pl, and you have float8_pl. There's already a function named float8pl() which is quite close to what you have. I've been sticking to *_combine() for these, so maybe float8_combine() and float8_regr_combine() are better names. -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 03/17/2016 12:53 PM, David Rowley wrote:> ...> > I just had a quick skim over the patch and noticed the naming > convention you're using for the combine function is *_pl, and you have > float8_pl. There's already a function named float8pl() which is quite > close to what you have. I've been sticking to *_combine() for these, > so maybe float8_combine() and float8_regr_combine() are better names. +1 to the _combine naming convention. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Mar 17, 2016 at 10:59 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > Hi, > > On 03/17/2016 12:53 PM, David Rowley wrote: >> > ... >> >> >> I just had a quick skim over the patch and noticed the naming >> convention you're using for the combine function is *_pl, and you have >> float8_pl. There's already a function named float8pl() which is quite >> close to what you have. I've been sticking to *_combine() for these, >> so maybe float8_combine() and float8_regr_combine() are better names. > > > +1 to the _combine naming convention. Thanks for the input. Makes sense, updated patch is attached with the changes. Regards, Hari Babu Fujitsu Australia
Attachment
On 18 March 2016 at 13:39, Haribabu Kommi <kommi.haribabu@gmail.com> wrote: > On Thu, Mar 17, 2016 at 10:59 PM, Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: >> Hi, >> >> On 03/17/2016 12:53 PM, David Rowley wrote: >>> >> ... >>> >>> >>> I just had a quick skim over the patch and noticed the naming >>> convention you're using for the combine function is *_pl, and you have >>> float8_pl. There's already a function named float8pl() which is quite >>> close to what you have. I've been sticking to *_combine() for these, >>> so maybe float8_combine() and float8_regr_combine() are better names. >> >> >> +1 to the _combine naming convention. > > Thanks for the input. Makes sense, updated patch is attached with > the changes. I've had a look over this. I had to first base it on the 0005 patch, as it seemed like the pg_aggregate.h changes didn't include the serialfn and deserialfn changes, and an OID was being consumed by another function I added in patch 0003. On testing I also noticed some wrong results, which on investigation, are due to the wrong array elements being added together. For example: postgres=# select stddev(num) from f; stddev ------------------28867.5149028984 (1 row) postgres=# set max_parallel_degree=8; SET postgres=# select stddev(num) from f;stddev -------- 0 (1 row) + N += transvalues2[0]; + sumX += transvalues2[1]; + CHECKFLOATVAL(sumX, isinf(transvalues1[1]) || isinf(transvalues2[1]), true); + sumX2 += transvalues1[2]; The last line should use transvalues2[2], not transvalues1[2]. There's also quite a few mistakes in float8_regr_combine() + sumX2 += transvalues2[2]; + CHECKFLOATVAL(sumX2, isinf(transvalues1[2]) || isinf(transvalues2[1]), true); Wrong array element on isinf() check + sumY2 += transvalues2[4]; + CHECKFLOATVAL(sumY2, isinf(transvalues1[4]) || isinf(transvalues2[3]), true); Wrong array element on isinf() check + sumXY += transvalues2[5]; + CHECKFLOATVAL(sumXY, isinf(transvalues1[5]) || isinf(transvalues2[1]) || + isinf(transvalues2[3]), true); Wrong array element on isinf() check, and the final isinf(transvalues2[3]) check does not need to be there. I've re-based the patch and fixed these already, so will send updated patches shortly. -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 17 March 2016 at 14:25, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On 03/16/2016 12:03 PM, David Rowley wrote: >> Patch 2: >> This adds the serial/deserial aggregate infrastructure, pg_dump >> support, CREATE AGGREGATE changes, and nodeAgg.c changes to have it >> serialise and deserialise aggregate states when instructed to do so. >> >> Patch 3: >> This adds a boat load of serial/deserial functions, and combine >> functions for most of the built-in numerical aggregate functions. It >> also contains some regression tests which should really be in patch 2, >> but I with patch 2 there's no suitable serialisation or >> de-serialisation functions to test CREATE AGGREGATE with. I think >> having them here is ok, as patch 2 is quite useless without patch 3 >> anyway. > > I don't see how you could move the tests into #2 when the functions are > defined in #3? IMHO this is the right place for the regression tests. Yeah, but the CREATE AGGREGATE changes which are being tested in 0003 were actually added in 0002. I think 0002 is the right place to test the changes to CREATE AGGREGATE, but since there's a complete lack of functions to use, then I've just delayed until 0003. >> Another thing to note about this patch is that I've gone and created >> serial/de-serial functions for when PolyNumAggState both require >> sumX2, and don't require sumX2. I had thought about perhaps putting an >> extra byte in the serial format to indicate if a sumX2 is included, >> but I ended up not doing it this way. I don't really want these serial >> formats getting too complex as we might like to do fun things like >> pass them along to sharded servers one day, so it might be nice to >> keep them simple. > > > Hmmm, I've noticed that while eyeballing the diffs, and I'm not sure if it's > worth the additional complexity at this point. I mean, one byte is hardly > going to make a measurable difference - we're probably wasting more than > that due to alignment, for example. I don't think any alignment gets done here. Look at pq_sendint(). Did you mean the complexity of having extra functions, or having the extra byte to check in the deserial func? > As you've mentioned sharded servers - how stable should the serialized > format be? I've assumed it to be transient, i.e. in the extreme case it > might change after restarting a database - for the parallel aggregate that's > clearly sufficient. > > But if we expect this to eventually work in a sharded environment, that's > going to be way more complicated. I guess in these cases we could rely on > implicit format versioning, via the major the version (doubt we'd tweak the > format in a minor version anyway). > > I'm not sure the sharding is particularly useful argument at this point, > considering we don't really know if the current format is actually a good > match for that. Perhaps you're right. At this stage I've no idea if we'd want to support shards on varying major versions. I think probably not, since the node write functions might not be compatible with the node read functions on the other server. I've attached another series of patches: 0001: This is the latest Parallel Aggregate Patch, not intended for review here, but is required for the remaining patches. This patch has changed quite a bit from the previous one that I posted here, and the remaining patches needed re-based due to those changes. 0002: Adds serial/de-serial function support to CREATE AGGREGATE, contains minor fix-ups from last version. 
0003: Adds various combine/serial/de-serial functions for the standard set of aggregates, apart from most float8 aggregates. 0004: Adds regression tests for 0003 pg_aggregate.h changes. 0005: Documents to mention which aggregate functions support partial mode. 0006: Re-based version of Haribabu's floating point aggregate support, containing some fixes by me. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
- 0001-Allow-aggregation-to-happen-in-parallel_2016-03-20.patch
- 0002-Allow-INTERNAL-state-aggregates-to-participate-in-pa_2016-03-20.patch
- 0003-Add-various-aggregate-combine-and-serialize-de-seria_2016-03-20.patch
- 0004-Add-sanity-regression-tests-for-new-aggregate-serial_2016-03-20.patch
- 0005-Add-documents-to-explain-which-aggregates-support-pa_2016-03-20.patch
- 0006-Add-combine-functions-for-various-floating-point-agg_2016-03-20.patch
Hi, On 03/20/2016 04:48 AM, David Rowley wrote: > On 17 March 2016 at 14:25, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> On 03/16/2016 12:03 PM, David Rowley wrote: >>> Patch 2: >>> This adds the serial/deserial aggregate infrastructure, pg_dump >>> support, CREATE AGGREGATE changes, and nodeAgg.c changes to have it >>> serialise and deserialise aggregate states when instructed to do so. >>> >>> Patch 3: >>> This adds a boat load of serial/deserial functions, and combine >>> functions for most of the built-in numerical aggregate functions. It >>> also contains some regression tests which should really be in patch 2, >>> but I with patch 2 there's no suitable serialisation or >>> de-serialisation functions to test CREATE AGGREGATE with. I think >>> having them here is ok, as patch 2 is quite useless without patch 3 >>> anyway. >> >> I don't see how you could move the tests into #2 when the functions are >> defined in #3? IMHO this is the right place for the regression tests. > > Yeah, but the CREATE AGGREGATE changes which are being tested in 0003 > were actually added in 0002. I think 0002 is the right place to test > the changes to CREATE AGGREGATE, but since there's a complete lack of > functions to use, then I've just delayed until 0003. > >>> Another thing to note about this patch is that I've gone and created >>> serial/de-serial functions for when PolyNumAggState both require >>> sumX2, and don't require sumX2. I had thought about perhaps putting an >>> extra byte in the serial format to indicate if a sumX2 is included, >>> but I ended up not doing it this way. I don't really want these serial >>> formats getting too complex as we might like to do fun things like >>> pass them along to sharded servers one day, so it might be nice to >>> keep them simple. >> >> >> Hmmm, I've noticed that while eyeballing the diffs, and I'm not >> sure if it's worth the additional complexity at this point. I mean, >> one byte is hardly going to make a measurable difference - we're >> probably wasting more than that due to alignment, for example. > > I don't think any alignment gets done here. Look at pq_sendint(). Did > you mean the complexity of having extra functions, or having the > extra byte to check in the deserial func? I haven't realized alignment does not apply here, but I still think the extra byte would be negligible in the bigger scheme of things. For example we're serializing the state into a bytea, which adds up to 4B header. So I don't think adding 1B would make any measurable difference. But that assumes adding the 1B header would actually make the code simpler. Now that I think about it, that may not the case - in the end you'd probably get a function with two branches, with about the same size as the two functions combined. So not really an improvement. >> As you've mentioned sharded servers - how stable should the >> serialized format be? I've assumed it to be transient, i.e. in the >> extreme case it might change after restarting a database - for the >> parallel aggregate that's clearly sufficient. >> >> But if we expect this to eventually work in a sharded environment, >> that's going to be way more complicated. I guess in these cases we >> could rely on implicit format versioning, via the major the >> version (doubt we'd tweak the format in a minor version anyway). >> >> I'm not sure the sharding is particularly useful argument at this >> point, considering we don't really know if the current format is >> actually a good match for that. > > Perhaps you're right. 
> At this stage I've no idea if we'd want to > support shards on varying major versions. I think probably not, > since the node write functions might not be compatible with the node > read functions on the other server. OK, so let's ignore the sharded setup for now. > I've attached another series of patches: > > 0001: This is the latest Parallel Aggregate Patch, not intended for > review here, but is required for the remaining patches. This patch has > changed quite a bit from the previous one that I posted here, and the > remaining patches needed re-based due to those changes. > > 0002: Adds serial/de-serial function support to CREATE AGGREGATE, > contains minor fix-ups from last version. > > 0003: Adds various combine/serial/de-serial functions for the standard > set of aggregates, apart from most float8 aggregates. > > 0004: Adds regression tests for 0003 pg_aggregate.h changes. > > 0005: Documents to mention which aggregate functions support partial mode. > > 0006: Re-based version of Haribabu's floating point aggregate support, > containing some fixes by me. I went through the changes and found nothing suspicious, except maybe for the wording in one of the doc patches: "All types apart from floating-point types". It may not be entirely clear to readers, but this does not include the "numeric" data type, only float4 and float8. But I don't think this really matters unless we fail to commit the last patch adding functions even for those data types. Also, I think it's probably the right time to fix the failures in the opr_sanity regression tests. After all, we'll only whitelist a bunch of serialize functions, so the tests will still detect any unexpected functions dealing with the 'internal' data type. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
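For reference, the opr_sanity rule being alluded to is, roughly, a catalog query of the shape below (a sketch from memory; the actual regression test may differ in detail). It flags any function that returns "internal" without also accepting an "internal" argument, which is why deserialization functions of this design take an extra "internal" parameter and why the new serialization-related functions have to be accounted for in the test.

SELECT p.oid, p.proname
FROM pg_proc AS p
WHERE p.prorettype = 'internal'::regtype
  AND NOT 'internal'::regtype = ANY (p.proargtypes);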
On Sun, Mar 20, 2016 at 2:23 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > > I've had a look over this. I had to first base it on the 0005 patch, > as it seemed like the pg_aggregate.h changes didn't include the > serialfn and deserialfn changes, and an OID was being consumed by > another function I added in patch 0003. > > On testing I also noticed some wrong results, which on investigation, > are due to the wrong array elements being added together. > > For example: > > postgres=# select stddev(num) from f; > stddev > ------------------ > 28867.5149028984 > (1 row) > > > postgres=# set max_parallel_degree=8; > SET > postgres=# select stddev(num) from f; > stddev > -------- > 0 > (1 row) > > + N += transvalues2[0]; > + sumX += transvalues2[1]; > + CHECKFLOATVAL(sumX, isinf(transvalues1[1]) || isinf(transvalues2[1]), true); > + sumX2 += transvalues1[2]; > > The last line should use transvalues2[2], not transvalues1[2]. Thanks. > There's also quite a few mistakes in float8_regr_combine() > > + sumX2 += transvalues2[2]; > + CHECKFLOATVAL(sumX2, isinf(transvalues1[2]) || isinf(transvalues2[1]), true); > > Wrong array element on isinf() check > > + sumY2 += transvalues2[4]; > + CHECKFLOATVAL(sumY2, isinf(transvalues1[4]) || isinf(transvalues2[3]), true); > > Wrong array element on isinf() check > > + sumXY += transvalues2[5]; > + CHECKFLOATVAL(sumXY, isinf(transvalues1[5]) || isinf(transvalues2[1]) || > + isinf(transvalues2[3]), true); > > Wrong array element on isinf() check, and the final > isinf(transvalues2[3]) check does not need to be there. Thanks for the changes, I just followed the float8_regr_accum function while writing float8_regr_combine function. Now I understood isinf proper usage. Regards, Hari Babu Fujitsu Australia
On Sat, Mar 19, 2016 at 11:48 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > 0002: Adds serial/de-serial function support to CREATE AGGREGATE, > contains minor fix-ups from last version. This looks pretty good, but don't build_aggregate_serialfn_expr and build_aggregate_deserialfn_expr compile down to identical machine code? Keeping two copies of the same code with different parameter names is a degree of neatnik-ism I'm not sure I can swallow. The only caller to make_partialgroup_input_target() passes true for the additional argument. That doesn't seem right. Maybe error messages should refer to "aggregate serialization function" and "aggregate deserialization function" instead of "aggregate serial function" and "aggregate de-serial function". - * Other behavior is also supported and is controlled by the 'combineStates' - * and 'finalizeAggs'. 'combineStates' controls whether the trans func or - * the combine func is used during aggregation. When 'combineStates' is - * true we expect other (previously) aggregated states as input rather than - * input tuples. This mode facilitates multiple aggregate stages which - * allows us to support pushing aggregation down deeper into the plan rather - * than leaving it for the final stage. For example with a query such as: + * Other behavior is also supported and is controlled by the 'combineStates', + * 'finalizeAggs' and 'serialStates' parameters. 'combineStates' controls + * whether the trans func or the combine func is used during aggregation. + * When 'combineStates' is true we expect other (previously) aggregated + * states as input rather than input tuples. This mode facilitates multiple + * aggregate stages which allows us to support pushing aggregation down + * deeper into the plan rather than leaving it for the final stage. For + * example with a query such as: I'd omit this hunk. The serialStates thing is really separate from what's being talked about here, and you discuss it further down. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 20 March 2016 at 16:48, David Rowley <david.rowley@2ndquadrant.com> wrote: > I've attached another series of patches: > > 0001: This is the latest Parallel Aggregate Patch, not intended for > review here, but is required for the remaining patches. This patch has > changed quite a bit from the previous one that I posted here, and the > remaining patches needed re-based due to those changes. > > 0002: Adds serial/de-serial function support to CREATE AGGREGATE, > contains minor fix-ups from last version. > > 0003: Adds various combine/serial/de-serial functions for the standard > set of aggregates, apart from most float8 aggregates. > > 0004: Adds regression tests for 0003 pg_aggregate.h changes. > > 0005: Documents to mention which aggregate functions support partial mode. > > 0006: Re-based version of Haribabu's floating point aggregate support, > containing some fixes by me. Now that parallel aggregate has been committed, 0002 no longer applies. Please find attached a new 0001 patch, intended to replace the previous 0001 and 0002. The previous 0003, 0004, 0005 and 0006 should still apply onto the new 0001. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 22 March 2016 at 05:27, Robert Haas <robertmhaas@gmail.com> wrote: > On Sat, Mar 19, 2016 at 11:48 PM, David Rowley > <david.rowley@2ndquadrant.com> wrote: >> 0002: Adds serial/de-serial function support to CREATE AGGREGATE, >> contains minor fix-ups from last version. > > This looks pretty good, but don't build_aggregate_serialfn_expr and > build_aggregate_deserialfn_expr compile down to identical machine > code? Keeping two copies of the same code with different parameter > names is a degree of neatnik-ism I'm not sure I can swallow. Good point. I've removed the deserial one, and renamed the arguments to arg_input_type and arg_output_type. > The only caller to make_partialgroup_input_target() passes true for > the additional argument. That doesn't seem right. Yeah, I coded that with non-parallel use cases in mind. I very much thing we'll end up using that flag in 9.7, but perhaps that's not a good enough reason to be adding it in 9.6. I've pulled it out of the patch for now, but I could also go further and pull the flag from the whole patch, as we could just serialize the states in nodeAgg.c when finalizeAggs == false and deserialize when combineStates == true. > Maybe error messages should refer to "aggregate serialization > function" and "aggregate deserialization function" instead of > "aggregate serial function" and "aggregate de-serial function". That's probably a good idea. > - * Other behavior is also supported and is controlled by the > 'combineStates' > - * and 'finalizeAggs'. 'combineStates' controls whether the trans func or > - * the combine func is used during aggregation. When 'combineStates' is > - * true we expect other (previously) aggregated states as input > rather than > - * input tuples. This mode facilitates multiple aggregate stages which > - * allows us to support pushing aggregation down deeper into > the plan rather > - * than leaving it for the final stage. For example with a query such as: > + * Other behavior is also supported and is controlled by the > 'combineStates', > + * 'finalizeAggs' and 'serialStates' parameters. 'combineStates' controls > + * whether the trans func or the combine func is used during aggregation. > + * When 'combineStates' is true we expect other (previously) aggregated > + * states as input rather than input tuples. This mode > facilitates multiple > + * aggregate stages which allows us to support pushing aggregation down > + * deeper into the plan rather than leaving it for the final stage. For > + * example with a query such as: > > I'd omit this hunk. The serialStates thing is really separate from > what's being talked about here, and you discuss it further down. Removed. I've attached 2 of the patches which are affected by the changes. Thanks for looking over this. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 21 January 2016 at 08:06, Robert Haas <robertmhaas@gmail.com> wrote: > I re-reviewed this and have committed most of it with only minor > kibitizing. A few notes: I realised today that the combinefunc is rather undocumented. I've attached a patch which aims to fix this. Comments are welcome. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Thu, Mar 24, 2016 at 5:22 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 21 January 2016 at 08:06, Robert Haas <robertmhaas@gmail.com> wrote: >> I re-reviewed this and have committed most of it with only minor >> kibitizing. A few notes: > > I realised today that the combinefunc is rather undocumented. I've > attached a patch which aims to fix this. > > Comments are welcome. Looks good to me. Committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
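For readers following along, here is a minimal sketch of what the now-documented combine function looks like from the user's side. The aggregate name my_max is invented for illustration; int4larger (the transition function used by the built-in max(int4)) can serve as its own combine function, since combining two partial maxima just means taking the larger of the two. The PARALLEL option is assumed to be available as per the 9.6 CREATE AGGREGATE syntax:

CREATE AGGREGATE my_max (int4) (
    SFUNC = int4larger,
    STYPE = int4,
    COMBINEFUNC = int4larger,
    PARALLEL = SAFE
);

Aggregates whose state type is internal additionally need the serialization and deserialization functions discussed in this thread before their states can be passed between parallel workers.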
On Mon, Mar 21, 2016 at 2:18 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > I've attached 2 of the patches which are affected by the changes. I think the documentation for 0001 needs some work yet. The additional paragraph that you've added... (1) doesn't seem to appear at a very logical place in the documentation - I think it should be much further down, as it's a minor detail. Maybe document this the same way as the documentation patch you just sent for the combine-function stuff does it; and (2) isn't indented consistently with the surrounding paragraphs; and (3) is missing a closing </para> tag Also, I'd just cut this: + This is required due to + the process model being unable to pass references to <literal>INTERNAL + </literal> types between different <productname>PostgreSQL</productname> + processes. Instead, I'd change the earlier sentence in the paragraph, which currently reads: + These + functions are required in order to allow parallel aggregation for aggregates + with an <replaceable class="PARAMETER">stype</replaceable> of <literal> + INTERNAL</>. I'd replace the period at end with a comma and add "since <literal>INTERNAL</> values represent arbitrary in-memory data structures which can't be passed between processes". I think that's a bit smoother. I'm going to read through the code again now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Mar 24, 2016 at 1:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I'm going to read through the code again now. OK, I noticed another documentation problem: you need to update catalogs.sgml for these new columns. + * Validate the serial function, if present. We must ensure that the return + * Validate the de-serial function, if present. We must ensure that the I think that you should refer to these consistently in the comments as the "serialization function" and the "deserialization function", even though the SQL syntax is different. And unhyphenated. + /* check that we also got a serial type */ + if (!OidIsValid(aggSerialType)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_FUNCTION_DEFINITION), + errmsg("must specify serialization type when specifying serialization function"))); I think that in parallel cases, this check is in DefineAggregate(), not here. See, e.g. "aggregate mfinalfunc must not be specified without mstype". Existing type parameters to CREATE AGGREGATE have IsPolymorphicType() checks to enforce sanity in various ways, but you seem not to have added that for the serial type. + /* don't call a strict serial function with NULL input */ + if (pertrans->serialfn.fn_strict && + pergroupstate->transValueIsNull) + continue; Shouldn't this instead set aggnulls[aggno] = true? And doesn't the hunk in combine_aggregates() have the same problem? + /* + * serial and de-serial functions must match, if present. Remember that + * these will be InvalidOid if they're not required for this agg node + */ Explain WHY they need to match. And maybe update the overall comment for the function. + "'-' AS aggdeserialfn,aggmtransfn, aggminvtransfn, " Whitespace. In your pg_aggregate.h changes, avg(numeric) sets aggserialtype but no aggserialfn or aggdeserialfn. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 25 March 2016 at 06:17, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Mar 21, 2016 at 2:18 PM, David Rowley > <david.rowley@2ndquadrant.com> wrote: >> I've attached 2 of the patches which are affected by the changes. > > I think the documentation for 0001 needs some work yet. The > additional paragraph that you've added... > > (1) doesn't seem to appear at a very logical place in the > documentation - I think it should be much further down, as it's a > minor detail. Maybe document this the same way as the documentation > patch you just sent for the combine-function stuff does it; and Thanks. I also realised this when writing the combine documents fix. > (2) isn't indented consistently with the surrounding paragraphs; and > > (3) is missing a closing </para> tag > > Also, I'd just cut this: > > + This is required due to > + the process model being unable to pass references to <literal>INTERNAL > + </literal> types between different <productname>PostgreSQL</productname> > + processes. > > Instead, I'd change the earlier sentence in the paragraph, which > currently reads: > > + These > + functions are required in order to allow parallel aggregation for aggregates > + with an <replaceable class="PARAMETER">stype</replaceable> of <literal> > + INTERNAL</>. > > I'd replace the period at end with a comma and add "since > <literal>INTERNAL</> values represent arbitrary in-memory data > structures which can't be passed between processes". I think that's a > bit smoother. > In my rewrite I've incorporated these words. Thanks for checking over this. A patch will follow in my response to the next email. -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 25 March 2016 at 07:26, Robert Haas <robertmhaas@gmail.com> wrote: > OK, I noticed another documentation problem: you need to update > catalogs.sgml for these new columns. Thanks. Done. > > + * Validate the serial function, if present. We must ensure > that the return > + * Validate the de-serial function, if present. We must ensure that the > I think that you should refer to these consistently in the comments as > the "serialization function" and the "deserialization function", even > though the SQL syntax is different. And unhyphenated. That's probably better. Fixed. > + /* check that we also got a serial type */ > + if (!OidIsValid(aggSerialType)) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_FUNCTION_DEFINITION), > + errmsg("must specify serialization > type when specifying serialization function"))); > > I think that in parallel cases, this check is in DefineAggregate(), > not here. See, e.g. "aggregate mfinalfunc must not be specified > without mstype". Good point. I've fixed this. > Existing type parameters to CREATE AGGREGATE have IsPolymorphicType() > checks to enforce sanity in various ways, but you seem not to have > added that for the serial type. Added check for this to disallow pseudo types that are not polymorphic. > + /* don't call a strict serial function with > NULL input */ > + if (pertrans->serialfn.fn_strict && > + pergroupstate->transValueIsNull) > + continue; > > Shouldn't this instead set aggnulls[aggno] = true? And doesn't the > hunk in combine_aggregates() have the same problem? hmm, no, I can't see a problem there. aggvalues[aggno] is what's going to be set after that condition. > + /* > + * serial and de-serial functions must match, if > present. Remember that > + * these will be InvalidOid if they're not required > for this agg node > + */ > > Explain WHY they need to match. And maybe update the overall comment > for the function. I've added: /* * The serialization and deserialization functions must match, if * present, as we're unable to share the trans state for aggregates * which will serialize or deserialize into different formats. Remember * that these will be InvalidOid if they're not required for this agg * node. */ This is for the code which shares transition states between aggregate functions which have the same transition function, for example SUM(NUMERIC) and AVG(NUMERIC). The serialization functions must match too. Perhaps, providing the serial/deserial function does its job properly, then it does not matter, but it does seem better to be safe here. I can't really imagine why anyone would want to have the same transition function but have different serialization functions. > + "'-' AS > aggdeserialfn,aggmtransfn, aggminvtransfn, " > > Whitespace. Fixed. > In your pg_aggregate.h changes, avg(numeric) sets aggserialtype but no > aggserialfn or aggdeserialfn. Fixed. I've also added some checks to ensure a combine function being set on an aggregate with an INTERNAL state is not-strict. I realised when writing the documentation for combine functions that it needed to be this way, as, if the function was strict, then advance_combine_function() could incorrectly assign the memory address of the 2nd state to the transValue. 
Actually, this seems a little tricky to optimise well, as in the case when we're serialising states, the deserialisation function should allocate a new state in the correct memory context, so technically the combine function could be strict in this case, as it would be fine to use state2 if state1 was null, but parallel aggregation is probably unique and could be the only time we'll need to use the serialisation/deserialisation functions, so without serialStates = true, the memory address could still belong to some other aggregate node's memory context which is certainly not right, so the code as of this patch is playing it safe. You could say that this is technically a bug in master, but it should be impossible to hit it now due to lack of ability to combine internal aggregate states. The checks for the correct strict property in the combine function are 3-fold. 1. CREATE AGGREGATE check. 2. executor startup for Agg nodes (must be here too, since someone could modify the strict property after CREATE AGGREGATE). 3. Regression test for built-in aggregates. ... long pause... Ok, so on further look at this I've decided to make changes and have it so the serialisation function can be dumb about memory contexts in the same way as finalize_aggregate() allows the final function to be dumb... notice at the end of the function it copies byref types into the correct memory context, if they're not already. So in the attached the serialisation function call code now does the same thing. In fact I decided to move all that code off into a function called finalize_partialaggregate(). As for the deserialisation function... there's not much we can do there in regards to copying the value back to the correct memory context, as we always expect the return value of that function to be INTERNAL, which is byval (pointer). So I've just set the code to aggstate->tmpcontext->ecxt_per_tuple_memory, and leave it up to the combine function to do the right thing, and make sure the new state is built properly in the aggregate memory context, which is analogous to how INTERNAL state transition functions work. As for the optimisation I mentioned above; what I have now is not perfect, as when we Gather the first state per group from the first worker to return it, we deserialise that serialised state into a per tuple context, next the combine function pulls that state apart and rebuilds it again into another newly built state in the correct memory context. I can't really see a way around this. I'd really like it if someone could look over this and make sure its all sane, as I'm not sure if I've fully wrapped my head around the expected life time of each context used here... Status of other patched: 0002: * Changed the new incorrectly defined combine functions so they're no longer strict. * Removed all memory context switches from serialisation/deserialisation functions. 0003: I've added a few new regression tests to test for strictness of combine functions, and also threw in some tests to ensure the serialisation/deserialisation functions are all strict, since the current ones can't handle NULLs, and it would be wasteful to make them trouble themselves with that ability. 0004: No change from last time. 0005: Haribabu's patch; no change from last time. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
- 0001-Allow-INTERNAL-state-aggregates-to-participate-in-pa_2016-03-26.patch
- 0002-Add-various-aggregate-combine-and-serialize-de-seria_2016-03-26.patch
- 0003-Add-sanity-regression-tests-for-new-aggregate-serial_2016-03-26.patch
- 0004-Add-documents-to-explain-which-aggregates-support-pa_2016-03-26.patch
- 0005-Add-combine-functions-for-various-floating-point-agg_2016-03-26.patch
On 26 March 2016 at 15:07, David Rowley <david.rowley@2ndquadrant.com> wrote: > Ok, so on further look at this I've decided to make changes and have > it so the serialisation function can be dumb about memory contexts in > the same way as finalize_aggregate() allows the final function to be > dumb... notice at the end of the function it copies byref types into > the correct memory context, if they're not already. So in the attached > the serialisation function call code now does the same thing. In fact > I decided to move all that code off into a function called > finalize_partialaggregate(). Please disregard the previous 0001 patch in favour of the attached one. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 26 March 2016 at 15:07, David Rowley <david.rowley@2ndquadrant.com> wrote: Many thanks Robert for committing the serialize states portion. > 0005: > Haribabu's patch; no change from last time. Just in case you jump ahead. I just wanted to re-highlight something Haribabu mentioned a while ago about combining floating point states [1]. This discussion did not happen on this thread, so to make sure it does not get lost... As of today aggregate calculations for floating point types can vary depending on the order in which values are aggregated. For example:

create table f8 (num float8);
insert into f8 select x/100.0 from generate_series(1,10000) x(x);

select stddev(num order by num) from f8;
      stddev
------------------
 28.8689567990717
(1 row)

select stddev(num order by num desc) from f8;
      stddev
------------------
 28.8689567990716
(1 row)

select stddev(num order by random()) from f8;
      stddev
------------------
 28.8689567990715
(1 row)

And of course the execution plan can determine the order in which rows are aggregated, even if the underlying data does not change. Parallelising these aggregates increases the chances of seeing these variations as the number of rows aggregated in each worker is going to vary on each run, so the numerical anomalies will also vary between each run. I wrote in [1]: > We do also warn about this in the manual: "Inexact means that some > values cannot be converted exactly to the internal format and are > stored as approximations, so that storing and retrieving a value might > show slight discrepancies. Managing these errors and how they > propagate through calculations is the subject of an entire branch of > mathematics and computer science and will not be discussed here, > except for the following points:" [1] > [1] http://www.postgresql.org/docs/devel/static/datatype-numeric.html Does this text in the documents stretch as far as the variable results from parallel aggregate for floating point types? or should we be more careful and not parallelise these, similar to how we didn't end up with inverse aggregate transition functions for floating point types? I'm personally undecided, and would like to hear what others think. [1] http://www.postgresql.org/message-id/CAKJS1f_HpLFhkd2yLfrsrmUMBZWQkGvJCWX21B_xg1A-0pzwHw@mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Mar 29, 2016 at 11:14 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > Many thanks Robert for committing the serialize states portion. yw, sorry I didn't get an email out about that. >> 0005: >> Haribabu's patch; no change from last time. So what's the distinction between 0002 and 0005? And what is the correct author/reviewer attribution for each? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 31 March 2016 at 00:48, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Mar 29, 2016 at 11:14 PM, David Rowley > <david.rowley@2ndquadrant.com> wrote: >>> 0005: >>> Haribabu's patch; no change from last time. > > So what's the distinction between 0002 and 0005? And what is the > correct author/reviewer attribution for each? I wrote 0002 - 0004, these were reviewed by Tomas. 0005 is Haribabu's patch: Reviewed by Tomas and I. The reason 0002 and 0005 are separate patches is just down to them having different authors. -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Mar 30, 2016 at 3:34 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > I wrote 0002 - 0004, these were reviewed by Tomas. > 0005 is Haribabu's patch: Reviewed by Tomas and I. I think it might be a good idea if these patches made less use of bytea and exposed the numeric transition values as, say, a 2-element array of numeric. Maybe this doesn't really matter, but it's not impossible that these values could be exposed somewhere, and a numeric is an awful lot more user-friendly than an essentially opaque bytea. One of the things I eventually want to figure out how to do with this is distributed aggregation across multiple shards, and I think it might be better to have the value being sent over the wire be something like {26.6,435.12} rather than \x1234... Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 5 April 2016 at 05:54, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 30, 2016 at 3:34 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> I wrote 0002 - 0004, these were reviewed by Tomas.
>> 0005 is Haribabu's patch: Reviewed by Tomas and I.
> I think it might be a good idea if these patches made less use of
> bytea and exposed the numeric transition values as, say, a 2-element
> array of numeric. Maybe this doesn't really matter, but it's not
> impossible that these values could be exposed somewhere, and a numeric
> is an awful lot more user-friendly than an essentially opaque bytea.
> One of the things I eventually want to figure out how to do with this
> is distributed aggregation across multiple shards, and I think it
> might be better to have the value being sent over the wire be
> something like {26.6,435.12} rather than \x1234...
> Thoughts?

Rewriting something that works fine just before the deadline isn't a good plan.

"Might be better" isn't enough.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 5 April 2016 at 16:54, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Mar 30, 2016 at 3:34 PM, David Rowley > <david.rowley@2ndquadrant.com> wrote: >> I wrote 0002 - 0004, these were reviewed by Tomas. >> 0005 is Haribabu's patch: Reviewed by Tomas and I. > > I think it might be a good idea if these patches made less use of > bytea and exposed the numeric transition values as, say, a 2-element > array of numeric. Well, if you have a look at NumericAggState you can see it's not quite as simple as an array of numeric, unless of course you'd be willing to spend the extra cycles, use the extra memory, and bandwidth to convert those int64's to numeric too, then it could be made to work. To do are you describe properly, we'd need a composite type. > Maybe this doesn't really matter, but it's not > impossible that these values could be exposed somewhere, and a numeric > is an awful lot more user-friendly than an essentially opaque bytea. hmm, isn't that why we have a deserialisation functions? Do you see a use case where these won't be available? > One of the things I eventually want to figure out how to do with this > is distributed aggregation across multiple shards, and I think it > might be better to have the value being sent over the wire be > something like {26.6,435.12} rather than \x1234... > > Thoughts? I've not yet read the design spec for sharding in Postgres. If there's something I can read over to get an idea of why you think this won't work, then maybe we can come to a good conclusion that way. But if I take a guess, then I'd have imagined that we'd not support sharding over different major versions, and if we really needed to change the serialisation format later, then we could do so. We could even put a note in the documents that the serialisation format may change between major versions. To be really honest, I'm quite worried that if I go and make this change then my time might be wasted as I really think making that change this late in the day is just setting this up for failure. I really don't think we can bat this patch over the Pacific Ocean too many times before we find ourselves inside the feature freeze. Of course, if you really think it's no good, that's different, it can't be committed, but "I think it might be better" does not seem like a solid enough argument for me to want to risk trying this and delaying further for that. But either way, we should come try to come to a consensus quickly. I don't want to alienate you because you think I don't want to listen to you. We need to act quickly or we'll only have a handful of the useful aggregates working in parallel for 9.6. I'd rather we had them all. I hope you do too. Thanks for looking over the patch. -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
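As an aside, if the state ever were to be exposed rather than serialized to an opaque bytea, the composite type David alludes to might look something like the sketch below. The type and column names are invented purely for illustration and mirror roughly the fields of NumericAggState:

CREATE TYPE numeric_agg_state AS (
    n               bigint,   -- count of non-NULL inputs
    sum_x           numeric,  -- sum of inputs
    sum_x2          numeric,  -- sum of squares, only needed for stddev/variance
    max_scale       integer,  -- maximum scale (digits after the decimal point) seen
    max_scale_count bigint,   -- number of inputs seen at that maximum scale
    nan_count       bigint    -- number of NaN inputs
);

That is noticeably more than a 2-element numeric array, which is the point being made above.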
On Tue, Apr 5, 2016 at 3:55 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: >> I think it might be a good idea if these patches made less use of >> bytea and exposed the numeric transition values as, say, a 2-element >> array of numeric. > > Well, if you have a look at NumericAggState you can see it's not quite > as simple as an array of numeric, unless of course you'd be willing to > spend the extra cycles, use the extra memory, and bandwidth to convert > those int64's to numeric too, then it could be made to work. To do are > you describe properly, we'd need a composite type. Uggh, yeah. Yuck. > hmm, isn't that why we have a deserialisation functions? Do you see a > use case where these won't be available? ... > I've not yet read the design spec for sharding in Postgres. If there's > something I can read over to get an idea of why you think this won't > work, then maybe we can come to a good conclusion that way. But if I > take a guess, then I'd have imagined that we'd not support sharding > over different major versions, and if we really needed to change the > serialisation format later, then we could do so. We could even put a > note in the documents that the serialisation format may change between > major versions. Well, OK, so here was my thought. For the sake of simplicity, let's suppose that creating a sharded table works more or less like what you can already do today: create a parent table with a non-inherited CHECK (false) constraint and then create some inheritance children that are foreign tables on various remote servers. Give those children CHECK constraints that explicate the partitioning scheme. This is probably not actually how we want this to work in detail (e.g. we probably want declarative partitioning) but the details don't matter very much for purposes of what I'm trying to explain here so let's just ignore them for the moment. Now, let's suppose that the user sets up a sharded table and then says: SELECT a, SUM(b), AVG(c) FROM sometab. At this point, what we'd like to have happen is that for each child foreign table, we go and fetch partially aggregated results. Those children might be running any version of PostgreSQL - I was not assuming that we'd insist on matching major versions, although of course that could be done - and there would probably need to be a minimum version of PostgreSQL anyway. They could even be running some other database. As long as they can spit out partial aggregates in a format that we can understand, we can deserialize those aggregates and run combine functions on them. But if the remote side is, say, MariaDB, it's probably much easier to get it to spit out something that looks like a PostgreSQL array than it is to make it spit out some bytea blob that's in an entirely PostgreSQL-specific format. Now, maybe that doesn't matter. Even getting something like this working with PostgreSQL as the remote side is going to be hard. Moreover, for this to have any chance of working with anything other than a compatible PostgreSQL server on the remote side, the FDW is going to have to write bespoke code for each aggregate anyway, and that code can always construct the requisite bytea blobs internally. So who cares? I can't point to any really tangible advantage of having serialized transition states be human-readable, so maybe there isn't one. But I was thinking about it, for fear that there might be some advantage that I'm missing. 
> To be really honest, I'm quite worried that if I go and make this > change then my time might be wasted as I really think making that > change this late in the day is just setting this up for failure. I > really don't think we can bat this patch over the Pacific Ocean too > many times before we find ourselves inside the feature freeze. Of > course, if you really think it's no good, that's different, it can't > be committed, but "I think it might be better" does not seem like a > solid enough argument for me to want to risk trying this and delaying > further for that. I think I agree. Certainly, with what we've got here today, these are not user-exposed, so I think we could certainly change them next time around if need be. If they ever become user-exposed, then maybe we should think a little harder. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
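To make the sharded-table setup Robert sketches above a bit more concrete, here is a minimal illustration using today's inheritance plus postgres_fdw machinery. The server name, host, remote table name and the partitioning CHECK constraint are all invented for the example, and as noted above a real design would more likely use declarative partitioning:

-- Parent table that holds no rows of its own.
CREATE TABLE sometab (
    a int,
    b numeric,
    c numeric,
    CHECK (false) NO INHERIT
);

CREATE EXTENSION postgres_fdw;

CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.example.com', dbname 'appdb');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard1;

-- One child per shard; the CHECK constraint explicates the partitioning scheme.
CREATE FOREIGN TABLE sometab_shard1 (
    a int,
    b numeric,
    c numeric,
    CHECK (a >= 0 AND a < 1000)
) INHERITS (sometab) SERVER shard1
  OPTIONS (table_name 'sometab');

-- Ideally a query like this would fetch partially aggregated states from
-- each shard and run only the combine/final functions locally:
SELECT a, sum(b), avg(c) FROM sometab GROUP BY a;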
On 6 April 2016 at 01:01, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Apr 5, 2016 at 3:55 AM, David Rowley >> To be really honest, I'm quite worried that if I go and make this >> change then my time might be wasted as I really think making that >> change this late in the day is just setting this up for failure. I >> really don't think we can bat this patch over the Pacific Ocean too >> many times before we find ourselves inside the feature freeze. Of >> course, if you really think it's no good, that's different, it can't >> be committed, but "I think it might be better" does not seem like a >> solid enough argument for me to want to risk trying this and delaying >> further for that. > > I think I agree. Certainly, with what we've got here today, these are > not user-exposed, so I think we could certainly change them next time > around if need be. If they ever become user-exposed, then maybe we > should think a little harder. I understand your concern. Painting ourselves into a corner would be pretty bad. Although, I'm not so sure we'd want to go down the route of trying to stabilise the format, even into something human readable. As far as I know we've never mentioned anything about what's stored in the internal aggregate states in the documentation, and we've made some fairly hefty changes to these over the last few releases. For example; 959277a4, where internal int128 support was added to improve performance of various aggregates, like sum(bigint). I think if we're going to head down the route of trying to keep this format stable, then we're going to struggle to make improvements to aggregates in the future. It's also not all that clear that all of the internal fields really make sense to other databases, for example NumericAggState's maxScale and maxScaleCount. These really only are needed for moving aggregates. Would we want some other database to have to magic up some numbers for these fields because our format requires it? -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Apr 5, 2016 at 9:30 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 6 April 2016 at 01:01, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Apr 5, 2016 at 3:55 AM, David Rowley >>> To be really honest, I'm quite worried that if I go and make this >>> change then my time might be wasted as I really think making that >>> change this late in the day is just setting this up for failure. I >>> really don't think we can bat this patch over the Pacific Ocean too >>> many times before we find ourselves inside the feature freeze. Of >>> course, if you really think it's no good, that's different, it can't >>> be committed, but "I think it might be better" does not seem like a >>> solid enough argument for me to want to risk trying this and delaying >>> further for that. >> >> I think I agree. Certainly, with what we've got here today, these are >> not user-exposed, so I think we could certainly change them next time >> around if need be. If they ever become user-exposed, then maybe we >> should think a little harder. > > I understand your concern. Painting ourselves into a corner would be pretty bad. > > Although, I'm not so sure we'd want to go down the route of trying to > stabilise the format, even into something human readable. As far as I > know we've never mentioned anything about what's stored in the > internal aggregate states in the documentation, and we've made some > fairly hefty changes to these over the last few releases. For example; > 959277a4, where internal int128 support was added to improve > performance of various aggregates, like sum(bigint). I think if we're > going to head down the route of trying to keep this format stable, > then we're going to struggle to make improvements to aggregates in the > future. That might be true, but it's much more likely to be true if the internal format is an opaque blob. If the serialized transition state for average is {count,sum} then we can switch between having the internal representation being numeric and having it be int128 and nothing breaks. With the opaque blob, maybe nothing breaks, or maybe something does. > It's also not all that clear that all of the internal fields really > make sense to other databases, for example NumericAggState's maxScale > and maxScaleCount. These really only are needed for moving aggregates. > Would we want some other database to have to magic up some numbers for > these fields because our format requires it? Well, if we want to do cross-node aggregation with that database on the other side, those numbers are going to have to get magicked up someplace along the line. I'm going to concede the point that this shouldn't really be a priority for 9.6, but I might want to come back to it later. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Apr 5, 2016 at 10:19 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I'm going to concede the point that this shouldn't really be a > priority for 9.6, but I might want to come back to it later. It seems to me that if two aggregates are using the same transition function, they ought to also be using the same combine, serialization, and deserialization functions. I wrote a query to find cases where that wasn't so and found a few, which I changed before committing.

DATA(insert ( 2100 n 0 int8_avg_accum numeric_poly_avg int8_avg_combine int8_avg_serialize int8_avg_deserialize int8_avg_accum int8_avg_accum_inv numeric_poly_avg f f 0 2281 17 48 2281 48 _null_ _null_ ));
DATA(insert ( 2107 n 0 int8_avg_accum numeric_poly_sum numeric_poly_combine int8_avg_serialize int8_avg_deserialize int8_avg_accum int8_avg_accum_inv numeric_poly_sum f f 0 2281 17 48 2281 48 _null_ _null_ ));

I changed the second of these from numeric_poly_combine to int8_avg_combine.

DATA(insert ( 2103 n 0 numeric_avg_accum numeric_avg numeric_avg_combine numeric_avg_serialize numeric_avg_deserialize numeric_avg_accum numeric_accum_inv numeric_avg f f 0 2281 17 128 2281 128 _null_ _null_ ));
DATA(insert ( 2114 n 0 numeric_avg_accum numeric_sum numeric_combine numeric_avg_serialize numeric_avg_deserialize numeric_avg_accum numeric_accum_inv numeric_sum f f 0 2281 17 128 2281 128 _null_ _null_ ));

I changed the second of these from numeric_combine to numeric_avg_combine. I also added a regression test for this. Please let me know if I'm off-base here. Committed 0002+0003 with those changes, some minor cosmetic stuff, and of course the obligatory catversion bump. Oh, and fixed an OID conflict with the patch Magnus just committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
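The sanity query itself isn't shown in the message; something of the following shape would find pairs of aggregates that share a transition function but disagree on the combine, serialization or deserialization function. This is only a sketch, and the regression test that was actually committed may be written differently:

SELECT a.aggfnoid, b.aggfnoid
FROM pg_aggregate a
JOIN pg_aggregate b
  ON a.aggtransfn = b.aggtransfn
 AND a.aggfnoid::oid < b.aggfnoid::oid
WHERE a.aggcombinefn <> b.aggcombinefn
   OR a.aggserialfn <> b.aggserialfn
   OR a.aggdeserialfn <> b.aggdeserialfn;

An empty result means every group of aggregates sharing a transition function also shares its partial-aggregation support functions, which is what lets nodeAgg.c safely share transition state between them.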
Robert Haas wrote: > Now, let's suppose that the user sets up a sharded table and then > says: SELECT a, SUM(b), AVG(c) FROM sometab. At this point, what we'd > like to have happen is that for each child foreign table, we go and > fetch partially aggregated results. Those children might be running > any version of PostgreSQL - I was not assuming that we'd insist on > matching major versions, although of course that could be done - and > there would probably need to be a minimum version of PostgreSQL > anyway. They could even be running some other database. As long as > they can spit out partial aggregates in a format that we can > understand, we can deserialize those aggregates and run combine > functions on them. But if the remote side is, say, MariaDB, it's > probably much easier to get it to spit out something that looks like a > PostgreSQL array than it is to make it spit out some bytea blob that's > in an entirely PostgreSQL-specific format. Basing parts of the Postgres sharding mechanism on FDWs sounds acceptable. Trying to design things so that *any* FDW can be part of a shard, so that you have some shards in Postgres and other shards in MariaDB, seems ludicrous to me. Down that path lies madness. In fact, trying to ensure cross-major-version compatibility already sounds like asking for too much. Requiring matching major versions sounds like a perfectly acceptable restriction to me. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 5, 2016 at 2:52 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> Now, let's suppose that the user sets up a sharded table and then >> says: SELECT a, SUM(b), AVG(c) FROM sometab. At this point, what we'd >> like to have happen is that for each child foreign table, we go and >> fetch partially aggregated results. Those children might be running >> any version of PostgreSQL - I was not assuming that we'd insist on >> matching major versions, although of course that could be done - and >> there would probably need to be a minimum version of PostgreSQL >> anyway. They could even be running some other database. As long as >> they can spit out partial aggregates in a format that we can >> understand, we can deserialize those aggregates and run combine >> functions on them. But if the remote side is, say, MariaDB, it's >> probably much easier to get it to spit out something that looks like a >> PostgreSQL array than it is to make it spit out some bytea blob that's >> in an entirely PostgreSQL-specific format. > > Basing parts of the Postgres sharding mechanism on FDWs sounds > acceptable. Trying to design things so that *any* FDW can be part of a > shard, so that you have some shards in Postgres and other shards in > MariaDB, seems ludicrous to me. Down that path lies madness. I'm doubtful that anyone wants to do the work to make that happen, but I don't agree that we shouldn't care about whether it's possible. Extensibility is a feature of the system that we've worked hard for, and we shouldn't casually surrender it. For example, postgres_fdw now implements join pushdown, and I suspect few other FDW authors will care to do the work to add similar support to their implementations. But some may, and it's good that the code is structured in such a way that they have the option. Actually, though, MariaDB is a bad example. What somebody's much more likely to want to do is have PostgreSQL as a frontend accessing data that's actually stored in Hadoop. There are lots of SQL interfaces to Hadoop already, so it's clearly a thing people want, and our SQL is the best SQL (right?) so if you could put that on top of Hadoop somebody'd probably like it. I'm not planning to try it myself, but I think it would be cool if somebody else did. I have been very pleased to see that many of the bits and pieces of infrastructure that I created for parallel query (parallel background workers, DSM, shm_mq) have attracted quite a bit of interest from other developers for totally unrelated purposes, and I think anything we do around horizontal scalability should be designed the same way: the goal should be to work with PostgreSQL on the other side, but the bits that can be made reusable for other purposes should be so constructed. > In fact, trying to ensure cross-major-version compatibility already > sounds like asking for too much. Requiring matching major versions > sounds like a perfectly acceptable restriction to me. I disagree. One of the motivations (not the only one, by any means) for wanting logical replication in PostgreSQL is precisely to get around the fact that physical replication requires matching major versions. That restriction turns out to be fairly onerous, not least because when you've got a cluster of several machines you'd rather upgrade them one at a time rather than all at once. That's going to be even more true with a sharded cluster, which will probably tend to involve more machines than a replication cluster, maybe a lot more. 
If you say that the user has got to shut the entire thing down, upgrade all the machines, and turn it all back on again, and just hope it works, that's going to be really painful. I think that we should treat this more like we do with libpq, where each major release can add new capabilities that new applications can use, but the old stuff has got to keep working essentially forever. Maybe the requirements here are not quite so tight, because it's probably reasonable to say, e.g. that you must upgrade every machine in the cluster to at least release 11.1 before upgrading any machine to 11.3 or higher, but the fewer such requirements we have, the better. Getting people to upgrade to new major releases is already fairly hard, and anything that makes it harder is an unwelcome development from my point of view. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 5 April 2016 at 19:33, Robert Haas <robertmhaas@gmail.com> wrote:
--
Committed 0002+0003 with those changes, some minor cosmetic stuff, and
of course the obligatory catversion bump. Oh, and fixed an OID
conflict with the patch Magnus just committed.
Is that everything now? I don't see anything about combining aggs in the git log and this is still showing as UnCommitted in the CF app.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 7 April 2016 at 09:25, Simon Riggs <simon@2ndquadrant.com> wrote: > On 5 April 2016 at 19:33, Robert Haas <robertmhaas@gmail.com> wrote: > >> >> Committed 0002+0003 with those changes, some minor cosmetic stuff, and >> of course the obligatory catversion bump. Oh, and fixed an OID >> conflict with the patch Magnus just committed. > > > Is that everything now? I don't see anything about combining aggs in the git > log and this is still showing as UnCommitted in the CF app. There's just a documentation patch and two combine functions for floating point aggs left now (Haribabu's patch) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
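For context on those remaining combine functions: the float aggregates accumulate a three-element transition array of the form {N, sum(X), sum(X*X)}, so combining two partial states amounts to element-wise addition. A purely illustrative SQL version follows (the real combine functions in Haribabu's patch are C functions in the backend; demo_float8_combine is a made-up name):

-- Illustrative sketch only: merge two partial {N, Sx, Sxx} transition states
-- of the kind kept by the float8 avg/variance/stddev aggregates.
CREATE FUNCTION demo_float8_combine(a float8[], b float8[])
RETURNS float8[] LANGUAGE sql IMMUTABLE STRICT AS
$$ SELECT ARRAY[a[1] + b[1], a[2] + b[2], a[3] + b[3]] $$;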
On 6 April 2016 at 22:28, David Rowley <david.rowley@2ndquadrant.com> wrote:
--
On 7 April 2016 at 09:25, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 5 April 2016 at 19:33, Robert Haas <robertmhaas@gmail.com> wrote:
>
>>
>> Committed 0002+0003 with those changes, some minor cosmetic stuff, and
>> of course the obligatory catversion bump. Oh, and fixed an OID
>> conflict with the patch Magnus just committed.
>
>
> Is that everything now? I don't see anything about combining aggs in the git
> log and this is still showing as UnCommitted in the CF app.
There's just a documentation patch and two combine functions for
floating point aggs left now (Haribabu's patch)
Ah, I see, it was the commit about "multi-stage aggregation".
So SELECT sum(x), avg(x) is now optimized?
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
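For context on what the multi-stage aggregation commit enables: together with the parallel aggregation planner work, an aggregate can now be split into partial and finalize steps across workers. A rough sketch of the resulting plan shape (the table name and worker count are invented for illustration):

EXPLAIN (COSTS OFF) SELECT sum(x), avg(x) FROM big_table;

--  Finalize Aggregate
--    ->  Gather
--          Workers Planned: 2
--          ->  Partial Aggregate
--                ->  Parallel Seq Scan on big_table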
On Wed, Apr 6, 2016 at 5:28 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: >> Is that everything now? I don't see anything about combining aggs in the git >> log and this is still showing as UnCommitted in the CF app. > > There's just a documentation patch and two combine functions for > floating point aggs left now (Haribabu's patch) I'll go have a look at that now. Hopefully it will be in good shape and committable; if not, it'll have to wait for 9.7. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Apr 8, 2016 at 11:57 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Apr 6, 2016 at 5:28 PM, David Rowley > <david.rowley@2ndquadrant.com> wrote: >>> Is that everything now? I don't see anything about combining aggs in the git >>> log and this is still showing as UnCommitted in the CF app. >> >> There's just a documentation patch and two combine functions for >> floating point aggs left now (Haribabu's patch) > > I'll go have a look at that now. Hopefully it will be in good shape > and committable; if not, it'll have to wait for 9.7. And... it's pretty hard to argue with any of this. Committed. I am *so* glad to be done with this patch series. Man, that was a lot of work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 9 April 2016 at 05:49, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Apr 8, 2016 at 11:57 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Apr 6, 2016 at 5:28 PM, David Rowley >> <david.rowley@2ndquadrant.com> wrote: >>>> Is that everything now? I don't see anything about combining aggs in the git >>>> log and this is still showing as UnCommitted in the CF app. >>> >>> There's just a documentation patch and two combine functions for >>> floating point aggs left now (Haribabu's patch) >> >> I'll go have a look at that now. Hopefully it will be in good shape >> and committable; if not, it'll have to wait for 9.7. > > And... it's pretty hard to argue with any of this. Committed. > > I am *so* glad to be done with this patch series. Man, that was a lot of work. Great! Thanks for committing all these, Robert. I really appreciate it, and I know I'm not the only one. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services