Re: Parallel Aggregates for string_agg and array_agg - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Parallel Aggregates for string_agg and array_agg |
Date | |
Msg-id | 1b6ee15c-c218-761e-0120-c1ed438e2db2@2ndquadrant.com Whole thread Raw |
In response to | Re: Parallel Aggregates for string_agg and array_agg (David Rowley <david.rowley@2ndquadrant.com>) |
Responses |
Re: Parallel Aggregates for string_agg and array_agg
|
List | pgsql-hackers |
On 03/11/2018 07:31 AM, David Rowley wrote: > On 11 March 2018 at 12:11, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> On 03/05/2018 04:51 AM, David Rowley wrote: >>> On 5 March 2018 at 04:54, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >>> Consider the following slightly backward-looking case; >>> >>> select string_agg(',', x::text) from generate_Series(1,10)x; >>> string_agg >>> ---------------------- >>> ,2,3,4,5,6,7,8,9,10, >>> >>> Here the delimiter is the number, not the ','. Of course, the >>> delimiter for the first aggregated result is not shown. The previous >>> implementation of the transfn for this aggregate just threw away the >>> first delimiter, but that's no good for parallel aggregates as the >>> transfn may be getting called in a parallel worker, in which case >>> we'll need that delimiter in the combine function to properly join to >>> the other partial aggregated results matching the same group. >>> >> >> Hmmm, you're right I haven't considered the delimiter might be variable. >> But isn't just yet another hint that while StringInfo was suitable for >> aggregate state of non-parallel string_agg, it may not be for the >> parallel version? >> >> What if the parallel string_agg instead used something like this: >> >> struct StringAggState >> { >> char *delimiter; /* first delimiter */ >> StringInfoData str; /* partial aggregate state */ >> } StringAggState; >> >> I think that would make the code cleaner, compared to using the cursor >> field for that purpose. But maybe it'd make it not possible to share >> code with the non-parallel aggregate, not sure. > > It would be possible to use something like that. The final function > would just need to take 'str' and ignore 'delimiter', whereas the > combine function would need both. However, you could optimize the > above to just have a single StringInfoData and have a pointer to the > start of the actual data (beyond the first delimiter), that would save > us a call to palloc and also allow the state's data to exist in one > contiguous block of memory, which will be more cache friendly when it > comes to reading it back again. The pointer here would, of course, > have to be an offset into the data array, since repallocs would cause > problems as the state grew. > > This is kinda the option I was going for with using the cursor. I > didn't think that was going to be a problem. It seems that cursor was > invented so that users of StringInfo could store a marker somehow > along with the string. The comment for cursor read: > > * cursor is initialized to zero by makeStringInfo or initStringInfo, > * but is not otherwise touched by the stringinfo.c routines. > * Some routines use it to scan through a StringInfo. > > The use of the cursor field does not interfere with pqformat.c's use > of it as in the serialization functions we're creating a new > StringInfo for the 'result'. If anything tries to send the internal > state, then that's certainly broken. It needs to be serialized before > that can happen. > > I don't quite see how inventing a new struct would make the code > cleaner. It would make the serialization and deserialization functions > have to write and read more fields for the lengths of the data > contained in the state. > > I understand that the cursor field is used in pqformat.c quite a bit, > but I don't quite understand why it should get to use that field the > way it wants, but the serialization and deserialization functions > shouldn't. I'd understand that if we were trying to phase out the > cursor field from StringInfoData, but I can't imagine that's going to > happen. > > Of course, with that all said. If you really strongly object to how > I've done this, then I'll change it to use a new struct type. I had > just tried to make the patch footprint as small as possible, and the > above text is me just explaining my reasoning behind this, not me > objecting to your request for me to change it. Let me know your > thoughts after reading the above. > Hmmm, I guess you're right, I didn't really thought that through. I don't object to sticking to the current approach. > In the meantime, I've attached an updated patch. The only change it > contains is updating the "Partial Mode" column in the docs from "No" > to "Yes". > Yeah, seems fine to me. I wonder what else would be needed before switching the patch to RFC. I plan to do that after a bit more testing sometime early next week, unless someone objects. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: