Re: [HACKERS] Parallel Append implementation - Mailing list pgsql-hackers
From: Amit Khandekar
Subject: Re: [HACKERS] Parallel Append implementation
Msg-id: CAJ3gD9e-_qFOBkoQCR7OV6PjJ2uMjftEPSCHMBNPT7ZohSzCow@mail.gmail.com
In response to: Re: [HACKERS] Parallel Append implementation (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: [HACKERS] Parallel Append implementation
List: pgsql-hackers
Attached is the updated patch that handles the changes for all the
comments except the cost changes part. Details about the specific
changes come after the cost-related points discussed below.

>> I wanted to take into account per-subpath parallel_workers for the
>> total cost of Append. Suppose the partial subpaths have per-worker
>> total costs (3, 3, 3) and their parallel_workers are (2, 8, 4), with
>> 2 Append workers available. So according to what you say, the total
>> cost is 9. With per-subplan parallel_workers taken into account, the
>> total cost = (3*2 + 3*8 + 3*4)/2 = 21.

> But that case never happens, because the parallel workers for the
> append is always at least as large as the number of workers for any
> single child.

Yeah, that's right. I will use this approach for partial paths. For
non-partial paths, I was considering the following 3 options :

Option 1. Just take the sum of the total non-partial child costs and
divide it by the number of workers. It seems to get close to the
actual cost.

Option 2. Calculate the exact cost by the algorithm I mentioned
before, pasted below for reference :

Per-subpath costs : 20 16 10 8 3 1, with 3 workers.
After 10 time units (the minimum of the first 3, i.e. 20, 16, 10), the
times remaining are :
10 6 0 8 3 1
After 6 units (minimum of 10, 6, 8), the times remaining are :
4 0 0 2 3 1
After 2 units (minimum of 4, 2, 3), the times remaining are :
2 0 0 0 1 1
After 1 unit (minimum of 2, 1, 1), the times remaining are :
1 0 0 0 0 0
After 1 unit (minimum of 1, 0, 0), the times remaining are :
0 0 0 0 0 0
Now add up the above time chunks : 10 + 6 + 2 + 1 + 1 = 20

Option 3. Get some approximation formula like you suggested. I am also
looking for such a formula, just that some things are not clear to me.
The discussion of the same is below ...

>>> 2. Next, estimate the cost of the nonpartial paths. To do this, make
>>> an array of Cost of that length and initialize all the elements to
>>> zero, then add the total cost of each nonpartial plan in turn to the
>>> element of the array with the smallest cost, and then take the maximum
>>> of the array elements as the total cost of the nonpartial plans. Add
>>> this to the result from step 1 to get the total cost.
>>
>> So with costs (8, 5, 2), add 8 and 5 to 2 so that it becomes (8, 5,
>> 15), and so the max is 15 ? I surely am misinterpreting this.

> No. If you have costs 8, 5, and 2 and only one process, cost is 15.
> If you have two processes then for costing purposes you assume worker
> 1 will execute the first path (cost 8) and worker 2 will execute the
> other two (cost 5 + 2 = 7), so the total cost is 8. If you have three
> workers, the cost will still be 8, because there's no way to finish
> the cost-8 path in less than 8 units of work.

So the part you suggested about adding each total cost in turn to the
smallest array element : does that apply only to the 1-worker case ?
For more than one worker, are you suggesting some algorithm similar to
the one I suggested in option 2 above ? If yes, it would be great if
you could describe again how that works for multiple workers. Or were
you suggesting some simple approximate arithmetic that applies to
multiple workers ? Like I mentioned, I would be happy to have such
simple approximation arithmetic for the multiple-worker case. The
logic I suggested in option 2 is something we can keep as the last
option, and option 1 is also an approximation, but we would like a
better one. So I wanted to clear my queries regarding option 3.
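As a minimal standalone sketch of the quoted array-based rule (the
names here are illustrative, not code from the attached patch): each
non-partial subpath's cost is added to the currently least-loaded
worker slot, and the busiest slot decides the total.

#include <stdio.h>

typedef double Cost;

static Cost
append_nonpartial_cost(const Cost *costs, int ncosts, int num_workers)
{
	Cost	worker_cost[num_workers];	/* C99 VLA, for brevity */
	Cost	max_cost;
	int		i,
			j;

	for (i = 0; i < num_workers; i++)
		worker_cost[i] = 0;

	/* hand each cost to the currently least-loaded worker slot */
	for (i = 0; i < ncosts; i++)
	{
		int		min_idx = 0;

		for (j = 1; j < num_workers; j++)
			if (worker_cost[j] < worker_cost[min_idx])
				min_idx = j;
		worker_cost[min_idx] += costs[i];
	}

	/* the busiest worker determines the total elapsed cost */
	max_cost = worker_cost[0];
	for (i = 1; i < num_workers; i++)
		if (worker_cost[i] > max_cost)
			max_cost = worker_cost[i];
	return max_cost;
}

int
main(void)
{
	Cost	ex1[] = {8, 5, 2};
	Cost	ex2[] = {20, 16, 10, 8, 3, 1};

	printf("%g %g %g\n",
		   append_nonpartial_cost(ex1, 3, 1),	/* 15 */
		   append_nonpartial_cost(ex1, 3, 2),	/* 8 */
		   append_nonpartial_cost(ex1, 3, 3));	/* 8 */
	printf("%g\n", append_nonpartial_cost(ex2, 6, 3));	/* 20 */
	return 0;
}

With one worker this degenerates to the plain sum (15). With the costs
fed in descending order it also reproduces the Option 2 simulation (20
with 3 workers): adding the next cost to the smallest accumulator is
equivalent to the worker that finishes first picking up the next
subplan.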
----------

Details about all the remaining changes in the updated patch are
below ...

On 20 March 2017 at 17:29, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 17, 2017 at 1:12 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> - The substantive changes in add_paths_to_append_rel don't look right
>>> either. It's not clear why accumulate_partialappend_subpath is
>>> getting called even in the non-enable_parallelappend case. I don't
>>> think the logic for the case where we're not generating a parallel
>>> append path needs to change at all.
>>
>> When accumulate_partialappend_subpath() is called for a childrel with
>> a partial path, it works just like accumulate_append_subpath() when
>> enable_parallelappend is false. That's why, for a partial child path,
>> the same function is called irrespective of the parallel-append or
>> non-parallel-append case. Maybe mentioning this in the comments would
>> suffice here ?
>
> I don't get it. If you can get the same effect by changing something
> or not changing it, presumably it'd be better to not change it. We
> try not to change things just because we can; the change should be an
> improvement in some way.
>
>>> - When parallel append is enabled, I think add_paths_to_append_rel
>>> should still consider all the same paths that it does today, plus one
>>> extra. The new path is a parallel append path where each subpath is
>>> the cheapest subpath for that childrel, whether partial or
>>> non-partial. If !enable_parallelappend, or if all of the cheapest
>>> subpaths are partial, then skip this. (If all the cheapest subpaths
>>> are non-partial, it's still potentially useful.)
>>
>> In the case of all-partial childrels, the paths are *exactly* the
>> same as those that would have been created for
>> enable_parallelappend=off. The extra path is there for
>> enable_parallelappend=on only when one or more of the child rels do
>> not have partial paths. Does this make sense ?
>
> No, I don't think so. Imagine that we have three children, A, B, and
> C. The cheapest partial paths have costs of 10,000 each. A, however,
> has a non-partial path with a cost of 1,000. Even though A has a
> partial path, we still want to consider a parallel append using the
> non-partial path because it figures to be hugely faster.

Right. Now that we want to consider both the cheapest partial and the
cheapest non-partial path, I get what you were saying about having an
extra path for parallel append. I have done all of the above changes.
Now we have an extra path for enable_parallelappend=true, besides the
non-parallel partial append path.

> - You've added a GUC (which is good) but not documented it (which is
> bad) or added it to postgresql.conf.sample (also bad).

Done.

> - You've used a loop inside a spinlock-protected critical section,
> which is against project policy. Use an LWLock; define and document a
> new builtin tranche ID.

Done. Used an LWLock for the parallel append synchronization. But I am
not sure what "document the new builtin tranche ID" means; I didn't
find a README which documents tranche IDs.

For setting pa_finished=true when a partial plan finishes, the code
was earlier using a spinlock, but now it uses no synchronization at
all. The spinlock was only needed earlier because pa_finished was
updated together with another field, num_workers; since num_workers is
gone, nothing else needs to be kept consistent with the flag. I was
considering whether to use the atomic read and write APIs in atomics.c
for pa_finished, but from what I understand, a plain read/write is
already atomic; we require atomics only for compound operations like
increment, exchange, etc.
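As a rough sketch of the arrangement described above (hypothetical
names and layout, not the patch's actual declarations): picking the
next subplan is a read-modify-write of shared state, so it runs under
the LWLock, while marking a subplan finished is a single aligned store
of a bool and takes no lock.

#include "postgres.h"
#include "storage/lwlock.h"

/* Illustrative shared-memory state for Parallel Append. */
typedef struct ParallelAppendState
{
	LWLock		pa_lock;		/* serializes choice of the next subplan */
	int			pa_next_plan;	/* next subplan a worker should pick up */
	bool		pa_finished[FLEXIBLE_ARRAY_MEMBER]; /* per-subplan flags */
} ParallelAppendState;

/* Read-modify-write of shared state: done under the LWLock. */
static int
choose_next_subplan(ParallelAppendState *pastate, int nplans)
{
	int			whichplan = -1;
	int			i;

	LWLockAcquire(&pastate->pa_lock, LW_EXCLUSIVE);
	for (i = 0; i < nplans; i++)
	{
		int			candidate = (pastate->pa_next_plan + i) % nplans;

		if (!pastate->pa_finished[candidate])
		{
			whichplan = candidate;
			pastate->pa_next_plan = (candidate + 1) % nplans;
			break;
		}
	}
	LWLockRelease(&pastate->pa_lock);

	return whichplan;			/* -1 means all subplans are done */
}

/*
 * A plain aligned store of a bool is atomic on all platforms
 * PostgreSQL supports, so no lock (and no atomics.c API) is needed.
 */
static void
mark_plan_finished(ParallelAppendState *pastate, int whichplan)
{
	pastate->pa_finished[whichplan] = true;
}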
> - The comment for pa_finished claims that it is the number of workers
> executing the subplan, but it's a bool, not a count; I think this
> comment is just out of date.

Done.

> - paths_insert_sorted_by_cost() is a hand-coded insertion sort. Can't
> we find a way to use qsort() for this instead of hand-coding a slower
> algorithm? I think we could just create an array of the right length,
> stick each path into it from add_paths_to_append_rel, and then qsort()
> the array based on <is-partial, total-cost>. Then the result can be
> turned into a list.

Done. Added a new function list_qsort() in list.c so that it can also
be reused in the future (a comparator sketch is below).

> - Maybe the new helper functions in nodeAppend.c could get names
> starting with exec_append_, to match the style of
> exec_append_initialize_next().

Done.

> - There's a superfluous whitespace change in add_paths_to_append_rel.

I didn't find exactly which one, but I guess the attached latest patch
does not have it.

>>> - In get_append_num_workers, instead of the complicated formula with
>>> log() and 0.693, just add the list lengths and call fls() on the
>>> result. Integer arithmetic FTW!
>>
>> Yeah, fls() could be used. BTW I just found that costsize.c already
>> has this defined in the same way I did:
>> #define LOG2(x) (log(x) / 0.693147180559945)
>> Maybe we need to move this to some common header file.
>
> LOG2() would make sense if you're working with a value represented as
> a double, but if you have an integer input, I think fls() is better.

Used fls() now.
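Two of the review points above lend themselves to short sketches
(assumed names throughout, and the descending-cost order within each
group is an assumption about the intended sort; this is not the actual
patch code): the qsort() comparator over <is-partial, total-cost>, and
the fls()-based worker count.

#include <stdbool.h>
#include <strings.h>			/* fls(); PostgreSQL carries a
								 * src/port fallback for platforms
								 * that lack it */

/* Assumed flat entry; the real code would work with Path nodes. */
typedef struct PathSortEntry
{
	bool		is_partial;		/* partial paths sort after non-partial */
	double		total_cost;		/* higher cost first within each group */
} PathSortEntry;

static int
append_path_cmp(const void *a, const void *b)
{
	const PathSortEntry *pa = (const PathSortEntry *) a;
	const PathSortEntry *pb = (const PathSortEntry *) b;

	if (pa->is_partial != pb->is_partial)
		return pa->is_partial ? 1 : -1;
	if (pa->total_cost != pb->total_cost)
		return (pa->total_cost < pb->total_cost) ? 1 : -1;
	return 0;
}

/*
 * fls(n) is the 1-based position of the highest set bit, i.e.
 * floor(log2(n)) + 1, so it replaces log(x)/0.693 with integer
 * arithmetic. Clamping against a caller-supplied maximum is my
 * assumption, not necessarily what the patch does.
 */
static int
get_append_num_workers_sketch(int num_subpaths, int max_workers)
{
	int			workers = fls(num_subpaths);

	return (workers > max_workers) ? max_workers : workers;
}

A caller would then do qsort(entries, n, sizeof(PathSortEntry),
append_path_cmp) on the array before turning it back into a list,
which is presumably what list_qsort() wraps.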
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers