Re: Parallel Aggregates for string_agg and array_agg - Mailing list pgsql-hackers
From | David Rowley |
---|---|
Subject | Re: Parallel Aggregates for string_agg and array_agg |
Date | |
Msg-id | CAKJS1f8LV7AT=AAhdYGKtGrGkSkEgO6C_SW2Ztz1sR3encisqw@mail.gmail.com Whole thread Raw |
In response to | Re: Parallel Aggregates for string_agg and array_agg (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Responses |
Re: Parallel Aggregates for string_agg and array_agg
|
List | pgsql-hackers |
On 11 March 2018 at 12:11, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On 03/05/2018 04:51 AM, David Rowley wrote: >> On 5 March 2018 at 04:54, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> Consider the following slightly backward-looking case; >> >> select string_agg(',', x::text) from generate_Series(1,10)x; >> string_agg >> ---------------------- >> ,2,3,4,5,6,7,8,9,10, >> >> Here the delimiter is the number, not the ','. Of course, the >> delimiter for the first aggregated result is not shown. The previous >> implementation of the transfn for this aggregate just threw away the >> first delimiter, but that's no good for parallel aggregates as the >> transfn may be getting called in a parallel worker, in which case >> we'll need that delimiter in the combine function to properly join to >> the other partial aggregated results matching the same group. >> > > Hmmm, you're right I haven't considered the delimiter might be variable. > But isn't just yet another hint that while StringInfo was suitable for > aggregate state of non-parallel string_agg, it may not be for the > parallel version? > > What if the parallel string_agg instead used something like this: > > struct StringAggState > { > char *delimiter; /* first delimiter */ > StringInfoData str; /* partial aggregate state */ > } StringAggState; > > I think that would make the code cleaner, compared to using the cursor > field for that purpose. But maybe it'd make it not possible to share > code with the non-parallel aggregate, not sure. It would be possible to use something like that. The final function would just need to take 'str' and ignore 'delimiter', whereas the combine function would need both. However, you could optimize the above to just have a single StringInfoData and have a pointer to the start of the actual data (beyond the first delimiter), that would save us a call to palloc and also allow the state's data to exist in one contiguous block of memory, which will be more cache friendly when it comes to reading it back again. The pointer here would, of course, have to be an offset into the data array, since repallocs would cause problems as the state grew. This is kinda the option I was going for with using the cursor. I didn't think that was going to be a problem. It seems that cursor was invented so that users of StringInfo could store a marker somehow along with the string. The comment for cursor read: * cursor is initialized to zero by makeStringInfo or initStringInfo, * but is not otherwise touched by the stringinfo.c routines. * Some routines use it to scan through a StringInfo. The use of the cursor field does not interfere with pqformat.c's use of it as in the serialization functions we're creating a new StringInfo for the 'result'. If anything tries to send the internal state, then that's certainly broken. It needs to be serialized before that can happen. I don't quite see how inventing a new struct would make the code cleaner. It would make the serialization and deserialization functions have to write and read more fields for the lengths of the data contained in the state. I understand that the cursor field is used in pqformat.c quite a bit, but I don't quite understand why it should get to use that field the way it wants, but the serialization and deserialization functions shouldn't. I'd understand that if we were trying to phase out the cursor field from StringInfoData, but I can't imagine that's going to happen. Of course, with that all said. If you really strongly object to how I've done this, then I'll change it to use a new struct type. I had just tried to make the patch footprint as small as possible, and the above text is me just explaining my reasoning behind this, not me objecting to your request for me to change it. Let me know your thoughts after reading the above. In the meantime, I've attached an updated patch. The only change it contains is updating the "Partial Mode" column in the docs from "No" to "Yes". -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
pgsql-hackers by date: