Re: [HACKERS] Parallel Append implementation - Mailing list pgsql-hackers
| From | Amit Khandekar |
|---|---|
| Subject | Re: [HACKERS] Parallel Append implementation |
| Date | |
| Msg-id | CAJ3gD9fzuBxS8he5fxcpzdoGb=b6A+pV4keZzDY15T6JkS=R-Q@mail.gmail.com |
| In response to | Re: [HACKERS] Parallel Append implementation (Robert Haas <robertmhaas@gmail.com>) |
| Responses | Re: [HACKERS] Parallel Append implementation |
| List | pgsql-hackers |
On 23 March 2017 at 05:55, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 22, 2017 at 4:49 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> Attached is the updated patch that handles the changes for all the
>> comments except the cost changes part. Details about the specific
>> changes are after the cost-related points discussed below.
>>
>> For non-partial paths, I was checking the following 3 options :
>>
>> Option 1. Just take the sum of total non-partial child costs and
>> divide it by the number of workers. It seems to be getting close to
>> the actual cost.
>
> If the costs for all children are about equal, then that works fine.
> But when they are very unequal, then it's highly misleading.
>
>> Option 2. Calculate the exact cost by an algorithm which I mentioned
>> before, which is pasted below for reference :
>> Per-subpath cost : 20 16 10 8 3 1, with 3 workers.
>> After 10 time units (this is the minimum of the first 3, i.e. 20, 16,
>> 10), the times remaining are :
>> 10 6 0 8 3 1
>> After 6 units (minimum of 10, 6, 8), the times remaining are :
>> 4 0 0 2 3 1
>> After 2 units (minimum of 4, 2, 3), the times remaining are :
>> 2 0 0 0 1 1
>> After 1 unit (minimum of 2, 1, 1), the times remaining are :
>> 1 0 0 0 0 0
>> After 1 unit (minimum of 1, 0, 0), the times remaining are :
>> 0 0 0 0 0 0
>> Now add up the above time chunks : 10 + 6 + 2 + 1 + 1 = 20
>
> This gives the same answer as what I was proposing

Ah, I see.

> but I believe it's more complicated to compute.

Yes, a bit, particularly because in my algorithm I would have to do
'n' subtractions each time for 'n' workers. But it looked more natural
because it follows exactly the way we would calculate it manually.

> The way my proposal would work in this
> case is that we would start with an array C[3] (since there are three
> workers), with all entries 0. Logically C[i] represents the amount of
> work to be performed by worker i. We add each path in turn to the
> worker whose array entry is currently smallest; in the case of a tie,
> just pick the first such entry.
>
> So in your example we do this:
>
> C[0] += 20;
> C[1] += 16;
> C[2] += 10;
> /* C[2] is smaller than C[0] or C[1] at this point, so we add the next
> path to C[2] */
> C[2] += 8;
> /* after the previous line, C[1] is now the smallest, so add to that
> entry next */
> C[1] += 3;
> /* now we've got C[0] = 20, C[1] = 19, C[2] = 18, so add to C[2] */
> C[2] += 1;
> /* final result: C[0] = 20, C[1] = 19, C[2] = 19 */
>
> Now we just take the highest entry that appears in any array, which in
> this case is C[0], as the total cost.

Wow. The way your final result exactly tallies with my algorithm's
result is very interesting. This looks like some maths or computer
science theory that I am not aware of.

I am currently coding the algorithm using your method. Meanwhile,
attached is a patch that takes care of your other comments, details of
which are below...

> In my previous review, I said that you should "define and document a
> new builtin tranche ID"; you did the first but not the second. See
> the table in monitoring.sgml.

Yeah, I tried to search for how TBM did it in the source, but I guess
I didn't run the "git grep" commands correctly, so the results did not
include monitoring.sgml, and I thought maybe you meant something else
by "document". Added changes in monitoring.sgml now.

> Definition of exec_append_goto_next_plan should have a line break
> after the return type, per usual PostgreSQL style rules.

Oops. Done.
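Coming back to the cost estimate: for reference, here is a small
standalone sketch of the greedy assignment you describe (illustration
only, not patch code; the function and variable names are made up).
For the example costs above it ends up with per-worker totals of 20,
19 and 19, and therefore a total cost of 20, matching both
computations:

    #include <stdio.h>

    #define MAX_WORKERS 1024        /* arbitrary bound for this sketch */

    /*
     * Greedy estimate: hand each non-partial subpath (in the given order)
     * to the worker with the smallest accumulated cost so far, then take
     * the largest per-worker total as the overall cost.
     */
    static double
    estimate_nonpartial_cost(const double *subpath_costs, int nsubpaths,
                             int num_workers)
    {
        double  work[MAX_WORKERS] = {0};    /* per-worker accumulated cost */
        double  max_cost = 0;
        int     i;

        for (i = 0; i < nsubpaths; i++)
        {
            int     min_idx = 0;
            int     w;

            /* find the least-loaded worker; ties go to the first one */
            for (w = 1; w < num_workers; w++)
                if (work[w] < work[min_idx])
                    min_idx = w;

            work[min_idx] += subpath_costs[i];
        }

        for (i = 0; i < num_workers; i++)
            if (work[i] > max_cost)
                max_cost = work[i];

        return max_cost;
    }

    int
    main(void)
    {
        /* the example from this thread: costs 20 16 10 8 3 1, 3 workers */
        double  costs[] = {20, 16, 10, 8, 3, 1};

        printf("%g\n", estimate_nonpartial_cost(costs, 6, 3)); /* prints 20 */
        return 0;
    }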
> - *      initialize to scan first subplan
> + *      In case it's a sequential Append, initialize to scan first subplan.
>
> This comment is confusing because the code is executed whether it's
> parallel or not. I think it might be better to write something like
> "initialize to scan first subplan (but note that we'll override this
> later in the case of a parallel append)"

Done.

> + /*
> +  * Check if we are already finished plans from parallel append. This
> +  * can happen if all the subplans are finished when this worker
> +  * has not even started returning tuples.
> +  */
> + if (node->as_padesc && node->as_whichplan == PA_INVALID_PLAN)
> +     return ExecClearTuple(node->ps.ps_ResultTupleSlot);
>
> There seems to be no reason why this couldn't be hoisted out of the
> loop. Actually, I think Ashutosh pointed this out before, but I
> didn't understand at that time what his point was. Looking back, I
> see that he also pointed out that the as_padesc test isn't necessary,
> which is also true.

I am assuming both your and Ashutosh's concern is that this check will
be executed for *each* tuple returned, and that this needs to be
avoided. Actually, just moving it out of the loop is not going to
solve the runs-for-each-tuple issue; it will still execute for each
tuple. But on second thought, I now agree it can be taken out of the
loop anyway -- not to solve the per-tuple issue, but because it need
not run on each iteration of that loop, which is only there to advance
to the next subplan.

When a worker tries to choose a plan to execute at the very beginning
(i.e. in ExecAppendInitializeWorker()), it sometimes finds there is no
plan to execute, because all of them have already been taken by the
other workers and have finished, or they are all non-partial plans. In
short, for all subplans pa_finished = true, so as_whichplan has to be
PA_INVALID_PLAN.

To get rid of the extra check in ExecAppend(), we could, in
ExecAppendInitializeWorker(), assign as_whichplan to a partial plan
that has already finished if all plans are finished, so that
ExecAppend() would just execute this finished subplan and return NULL.
But if all plans are non-partial, we cannot do that. And when
ExecAppend() is called, there is no way to know whether this is the
first time ExecProcNode() is executed or not, so we have to keep
checking the node->as_whichplan == PA_INVALID_PLAN condition.

My earlier response to Ashutosh's feedback on this same point is pasted
below, where some possible improvements are discussed :

  The way ExecProcNode() gets called, there is no special code that
  gets called instead of ExecProcNode() when a tuple is fetched for the
  first time. I mean, we cannot prevent ExecProcNode() from getting
  called when as_whichplan is invalid right from the beginning. One
  thing we can do is have a special slot in AppendState->as_plan[]
  which has some dummy execution node that just returns a NULL tuple,
  and initially make as_whichplan point to this slot. But I think it is
  not worth doing this. We can instead reduce the if condition to :

      if (node->as_whichplan == PA_INVALID_PLAN)
      {
          Assert(node->as_padesc != NULL);
          return ExecClearTuple(node->ps.ps_ResultTupleSlot);
      }

  BTW, the loop which you mentioned that returns tuples ... the loop
  is not for returning tuples; the loop is for iterating to the next
  subplan. Even if we take the condition out and keep it at the
  beginning of ExecAppend(), the issue will remain.
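To make the control flow concrete, the shape being discussed looks
roughly like the sketch below (simplified, not the actual patch code;
the advance helper is hypothetical and stands in for the
exec_append_parallel_next()/exec_append_seq_next() logic, and
PA_INVALID_PLAN is the patch's marker for "no subplan assigned"):

    #include "postgres.h"

    #include "executor/executor.h"
    #include "nodes/execnodes.h"

    /* hypothetical helper: pick the next subplan; false if none is left */
    extern bool exec_append_advance_sketch(AppendState *node);

    static TupleTableSlot *
    ExecAppendSketch(AppendState *node)
    {
        /*
         * Checked once per call, before the advance loop: a parallel
         * worker may find at startup that every subplan is already taken
         * or finished, in which case as_whichplan stays PA_INVALID_PLAN.
         */
        if (node->as_whichplan == PA_INVALID_PLAN)
            return ExecClearTuple(node->ps.ps_ResultTupleSlot);

        for (;;)
        {
            PlanState  *subnode = node->appendplans[node->as_whichplan];
            TupleTableSlot *result = ExecProcNode(subnode);

            if (!TupIsNull(result))
                return result;

            /* current subplan is exhausted; move on, or give up */
            if (!exec_append_advance_sketch(node))
                return ExecClearTuple(node->ps.ps_ResultTupleSlot);
        }
    }

The point is just that the PA_INVALID_PLAN test still runs once per
call, but no longer on every iteration of the advance loop.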
>
> +     if (node->as_padesc)
> +         node->as_padesc->pa_finished[node->as_whichplan] = true;
>
> I think you should move this logic inside exec_append_parallel_next.
> That would avoid testing node->pa_desc an extra time for non-parallel
> append.

Actually, exec_append_parallel_next() is also called from other
places, for which we cannot set pa_finished to true inside
exec_append_parallel_next(). But I have done the changes another way :
I have taken exec_append_parallel_next() out of
exec_append_initialize_next(), and put two different conditional code
blocks in ExecAppend(), one of which calls set_finished() followed by
exec_append_parallel_next(), while the other calls
exec_append_initialize_next() (now renamed to exec_append_seq_next()).
One thing to note is that this condition is not executed for each
tuple; it runs only while going to the next subplan.

> I note that the comment doesn't explain why it's safe to do
> this without taking the lock. I think we could consider doing it with
> the lock held, but it probably is safe, because we're only setting it
> from false to true. If someone else does the same thing, that won't
> hurt anything, and if someone else fails to see our update, then the
> worst-case scenario is that they'll try to execute a plan that has no
> chance of returning any more rows. That's not so bad. Actually,
> looking further, you do have a comment explaining that, but it's in
> exec_append_parallel_next() where the value is used, rather than here.

Yes, right.

> + memset(padesc->pa_finished, 0, sizeof(bool) * node->as_nplans);
> +
> + shm_toc_insert(pcxt->toc, node->ps.plan->plan_node_id, padesc);
> + node->as_padesc = padesc;
>
> Putting the shm_toc_insert call after we fully initialize the
> structure seems better than putting it after we've done some of the
> initialization and before we've done the rest.

Done. Also found out that I was memset()ing only pa_finished[]. Now
there is a memset for the whole ParallelAppendDesc structure.

> + /* Choose the optimal subplan to be executed. */
>
> I think the word "first" would be more accurate than "optimal". We
> can only hope to pick the optimal one, but whichever one we pick is
> definitely the one we're executing first!

Done.

> I think the loop in exec_append_parallel_next() is a bit confusing.
> It has three exit conditions, one checked at the top of the loop and
> two other ways to exit via break statements. Sometimes it exits
> because whichplan == PA_INVALID_PLAN was set by
> exec_append_goto_next_plan(), and other times it exits because
> whichplan == initial_plan

Yeah, we cannot move the (whichplan == initial_plan) check up into the
for(;;) condition, because initially whichplan is initial_plan and we
have to execute the loop at least once (unless whichplan is invalid).
And we cannot move the (whichplan != PA_INVALID_PLAN) condition down
out of the for(;;), because whichplan can be invalid right at the
beginning, if pa_next_plan itself is PA_INVALID_PLAN.

> and then it sets whichplan == PA_INVALID_PLAN itself.

It sets it to PA_INVALID_PLAN only when it does not find any next plan
to execute. This is essential particularly because initially, when
ExecAppendInitialize[Worker/DSM]() is called, it cannot set
as_whichplan to any valid value.
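To make that loop structure a bit more concrete, here is a sketch of
the shape being discussed (illustration only -- it is simplified
relative to the patch: the locking, the update of the shared
pa_next_plan, and the optimization of skipping finished non-partial
plans on wrap-around are all left out, and in the patch the next-index
helper can itself return PA_INVALID_PLAN, which the top-of-loop check
then catches). The helper it calls is sketched further below.

    #include <stdbool.h>

    #define PA_INVALID_PLAN (-1)    /* assumed value, for the sketch only */

    /* returns the index that follows 'whichplan' in circular order */
    extern int get_next_plan_index(int whichplan, int nplans);

    static int
    choose_subplan_sketch(const bool *pa_finished, int initial_plan,
                          int nplans)
    {
        int     whichplan = initial_plan;   /* may be PA_INVALID_PLAN */

        while (whichplan != PA_INVALID_PLAN)    /* exit 1: nothing to try */
        {
            if (!pa_finished[whichplan])
                break;                          /* exit 2: runnable subplan */

            whichplan = get_next_plan_index(whichplan, nplans);

            if (whichplan == initial_plan)
            {
                /* wrapped all the way around without finding anything */
                whichplan = PA_INVALID_PLAN;    /* exit 3 */
                break;
            }
        }

        return whichplan;
    }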
> I feel like this whole function could be written more simply somehow.

Yeah, the main reason it is a bit complicated is that we are simulating
a circular array structure, and that too with an optimization that lets
us skip the finished non-partial plans while wrapping around to the
next plan in the circular array. I have tried to add a couple more
comments. Also renamed exec_append_goto_next_plan() to
exec_append_get_next_plan(), since it does not actually shift any
counter; it just returns what the next counter is.
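Continuing the earlier sketch, the next-index helper it assumed would
look roughly like this (again only an illustration; the patch's
function additionally knows how to skip over the finished non-partial
plans when it wraps around, which is the part left out here):

    /* PA_INVALID_PLAN as in the earlier sketch, assumed to be -1 */

    /*
     * Return the index that follows 'whichplan' in the circular order.
     * This only computes and returns the next counter; it does not
     * advance any shared state -- that is up to the caller.
     */
    int
    get_next_plan_index(int whichplan, int nplans)
    {
        if (nplans <= 0)
            return PA_INVALID_PLAN;         /* nothing to scan at all */

        return (whichplan + 1) % nplans;    /* plain circular wrap-around */
    }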