Re: [HACKERS] Parallel Append implementation - Mailing list pgsql-hackers
From | Robert Haas
---|---
Subject | Re: [HACKERS] Parallel Append implementation
Date |
Msg-id | CA+TgmoZ3UGPn-1=kkFn4AVD-Zi=6i_T1H=mrsTvHzcgY7-3x3Q@mail.gmail.com
In response to | Re: [HACKERS] Parallel Append implementation (Robert Haas <robertmhaas@gmail.com>)
Responses | Re: [HACKERS] Parallel Append implementation
List | pgsql-hackers
On Sat, Oct 28, 2017 at 5:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> No, because the Append node is *NOT* getting copied into shared
> memory. I have pushed a comment update to the existing functions; you
> can use the same comment for this patch.

I spent the last several days working on this patch, which had a number of problems, both cosmetic and functional. I think the attached is in better shape now, but it could certainly use some more review and testing, since I only just finished modifying it, and I modified it pretty heavily. Changes:

- I fixed the "morerows" entries in the documentation. If you had built the documentation the way you had it and loaded it up in a web browser, you would have seen that it was not correct.

- I moved T_AppendState to a different position in the switch inside ExecParallelReInitializeDSM, so as to keep that switch in the same order as all of the other switch statements in that file.

- I rewrote the comment for pa_finished. It previously began with "workers currently executing the subplan", which is not an accurate description. I suspect this was a holdover from a previous version of the patch in which this was an array of integers rather than an array of type bool. I also fixed the comment in ExecAppendEstimate and added, removed, or rewrote various other comments as well.

- I renamed PA_INVALID_PLAN to INVALID_SUBPLAN_INDEX, which I think is clearer and allows for the possibility that this sentinel value might someday be used for non-parallel-aware Append plans.

- I largely rewrote the code for picking the next subplan. A superficial problem with the way you had it is that you had renamed exec_append_initialize_next to exec_append_seq_next but had not updated the header comment to match. Also, the logic was spread out all over the file. There are three cases: not parallel-aware, leader, and worker. You had the code for the first case at the top of the file and the other two cases at the bottom, and used multiple "if" statements to pick the right one in each case. I replaced all that with a function pointer stored in the AppendState, moved the code so it's all together, and rewrote it in a way that I find easier to understand. I also changed the naming convention. (See the first sketch after this list.)

- I renamed pappend_len to pstate_len and ParallelAppendDescData to ParallelAppendState. I think the use of the word "descriptor" is a carryover from the concept of a scan descriptor. There's nothing really wrong with inventing the concept of an "append descriptor", but it seems clearer to just refer to shared state.

- I fixed ExecAppendReInitializeDSM not to reset node->as_whichplan. Per commit 41b0dd987d44089dc48e9c70024277e253b396b7, that's wrong; instead, local state should be reset in ExecReScanAppend. I installed what I believe to be the correct logic in that function instead.

- I fixed list_qsort() so that it copies the node type of the old list into the new list. Otherwise, sorting a list of type T_IntList or T_OidList would turn it into a plain T_List, which is wrong. (See the second sketch after this list.)

- I removed get_append_num_workers and integrated the logic into the callers. This function was coded quite strangely: it assigned the return value of fls() to a double and then eventually rounded the result back to an integer. But fls() returns an integer, so this doesn't make much sense. On a related note, I made it use fls(# of subpaths) instead of fls(# of subpaths)+1. Adding 1 doesn't make sense to me here, because it leads to a decision to use 2 workers for a single, non-partial subpath. I suspect both of these mistakes stem from thinking that fls() returns the base-2 logarithm, but in fact it doesn't, quite: log2(1) = 0.0 but fls(1) = 1. (See the third sketch after this list.)
- In the process of making the changes described in the previous point, I added a couple of assertions, one of which promptly failed. It turns out the reason is that your patch didn't update accumulate_append_subpaths(), which can result in flattening non-partial paths from a Parallel Append into a parent Append's list of partial paths, which is bad. The easiest way to fix that would be to just teach accumulate_append_subpaths() not to flatten a Parallel Append into a parent Append or MergeAppend node, but it seemed to me that there was a fair amount of duplication between accumulate_partialappend_subpath() and accumulate_append_subpaths(), so what I did instead was to fold all of the necessary logic directly into accumulate_append_subpaths(). This approach also avoids some assumptions that your code made, such as the assumption that we will never have a partial MergeAppend path.

- I changed create_append_path() so that it uses the parallel_aware argument as the only determinant of whether the output path is flagged as parallel-aware. Your version also considered whether parallel_workers > 0, but I think that's not a good idea; the caller should pass the correct value for parallel_aware rather than relying on this function to fix it. Possibly you indirectly encountered the problem mentioned in the previous point and worked around it like this, or maybe there was some other reason, but it doesn't seem to be necessary.

- I changed things around to enforce the rule that all partial paths added to an appendrel must use the same row count estimate. (This is not a new coding rule, but this patch provides a new way to violate it.) I did that by forcing the row count for any parallel append mixing partial and non-partial paths to match the row count of the paths already added. I also changed the way the row count is calculated in the case where the only parallel append path mixes partial and non-partial plans; I think this way is more consistent with what we've done elsewhere. This amounts to the assumption that we're trying to estimate the average number of rows per worker rather than the largest possible number; I'm not sure what the best thing to do here is in theory, but one advantage of this approach is that I think it will produce answers closer to the value we get for an all-partial-paths append. That's good, because we don't want the row-count estimate to change precipitously based on whether an all-partial-paths append is possible. (See the fourth sketch after this list.)

- I fixed some whitespace problems by running pgindent on various files and manually breaking some long lines.
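A few of the points above are easier to see in code, so here are four small standalone sketches; all of them are simplified illustrations of mine, not code from the patch. First, the function-pointer dispatch for picking the next subplan. MyAppendState and the placeholder leader/worker bodies below are hypothetical; the real logic would consult state in shared memory:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified state struct -- not the patch's actual code. */
typedef struct MyAppendState
{
    int   as_whichplan;   /* index of the subplan currently being run */
    int   nplans;         /* total number of subplans */
    bool (*choose_next_subplan) (struct MyAppendState *);
} MyAppendState;

/* Non-parallel-aware case: just walk the subplans in order. */
static bool
choose_next_subplan_locally(MyAppendState *s)
{
    return ++s->as_whichplan < s->nplans;
}

/* The leader and worker variants would consult shared memory; those
 * details are elided here, so these are placeholder bodies. */
static bool
choose_next_subplan_for_leader(MyAppendState *s)
{
    return choose_next_subplan_locally(s);
}

static bool
choose_next_subplan_for_worker(MyAppendState *s)
{
    return choose_next_subplan_locally(s);
}

int
main(void)
{
    MyAppendState s = { -1, 3, NULL };
    bool parallel_aware = false;
    bool am_leader = false;

    /* Pick the strategy once at startup, instead of scattering "if"
     * tests over every place that advances to the next subplan. */
    if (!parallel_aware)
        s.choose_next_subplan = choose_next_subplan_locally;
    else if (am_leader)
        s.choose_next_subplan = choose_next_subplan_for_leader;
    else
        s.choose_next_subplan = choose_next_subplan_for_worker;

    while (s.choose_next_subplan(&s))
        printf("executing subplan %d\n", s.as_whichplan);

    return 0;
}
```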
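Second, the list_qsort() node-tag fix. MyList is a deliberately simplified stand-in for PostgreSQL's List structure (the real one differs); the point is only the marked assignment:

```c
#include <stdlib.h>

/* Simplified stand-ins; T_IntList/T_OidList mirror the node tags
 * mentioned above. */
typedef enum { T_List, T_IntList, T_OidList } NodeTag;

typedef struct MyList
{
    NodeTag type;
    int     length;
    int    *elems;      /* elements flattened into an array for brevity */
} MyList;

static int
cmp_int(const void *a, const void *b)
{
    return *(const int *) a - *(const int *) b;
}

/* Return a sorted copy of "list".  Without the marked line, sorting a
 * T_IntList or T_OidList would silently hand back a plain T_List. */
static MyList *
my_list_qsort(const MyList *list)
{
    MyList *result = malloc(sizeof(MyList));

    result->type = list->type;      /* <-- the fix described above */
    result->length = list->length;
    result->elems = malloc(list->length * sizeof(int));
    for (int i = 0; i < list->length; i++)
        result->elems[i] = list->elems[i];
    qsort(result->elems, result->length, sizeof(int), cmp_int);
    return result;
}

int
main(void)
{
    int     nums[] = { 3, 1, 2 };
    MyList  l = { T_IntList, 3, nums };
    MyList *sorted = my_list_qsort(&l);

    /* sorted->type is still T_IntList, as callers expect */
    free(sorted->elems);
    free(sorted);
    return 0;
}
```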
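Third, the fls() arithmetic: fls(x) is the 1-based position of the highest set bit, so fls(1) = 1 while log2(1) = 0. my_fls below is a portable stand-in, since not every libc provides fls():

```c
#include <stdio.h>

/* Portable stand-in for the BSD fls(): the 1-based position of the
 * most significant set bit, so my_fls(0) == 0 and my_fls(1) == 1. */
static int
my_fls(unsigned int x)
{
    int pos = 0;

    while (x != 0)
    {
        pos++;
        x >>= 1;
    }
    return pos;
}

int
main(void)
{
    int nsubpaths;

    for (nsubpaths = 1; nsubpaths <= 8; nsubpaths *= 2)
        printf("subpaths = %d: fls = %d, fls + 1 = %d\n",
               nsubpaths, my_fls(nsubpaths), my_fls(nsubpaths) + 1);

    /* Output starts with "subpaths = 1: fls = 1, fls + 1 = 2", i.e.
     * the +1 version would ask for two workers to run one subpath. */
    return 0;
}
```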
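Fourth, the rows-per-worker arithmetic. The helper name and parameters here are hypothetical, and this is just my reading of the averaging assumption described above, not code from the patch:

```c
#include <stdio.h>

/* Hypothetical helper: partial subpaths already report per-worker row
 * counts, while non-partial subpaths report total rows, so an "average
 * rows per worker" estimate spreads the non-partial rows evenly across
 * the planned workers. */
static double
append_rows_per_worker(double partial_rows,     /* sum over partial subpaths */
                       double nonpartial_rows,  /* sum over non-partial ones */
                       int parallel_workers)
{
    return partial_rows + nonpartial_rows / parallel_workers;
}

int
main(void)
{
    /* 1000 per-worker rows from partial subpaths, plus 3000 total rows
     * of non-partial work shared by 3 workers => 2000 rows per worker */
    printf("%.0f rows per worker\n",
           append_rows_per_worker(1000.0, 3000.0, 3));
    return 0;
}
```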
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers