Re: [HACKERS] Parallel Append implementation - Mailing list pgsql-hackers
From: Amit Khandekar
Subject: Re: [HACKERS] Parallel Append implementation
Msg-id: CAJ3gD9cohvXTUCBeyMVu+1JtEt3L6aUSGRWFXU2vsuzniaZwfg@mail.gmail.com
In response to: Re: [HACKERS] Parallel Append implementation (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Thanks a lot, Robert, for the patch. I will have a look. I quickly
tried to test some aggregate queries with a partitioned
pgbench_accounts table, and it is crashing. Will get back with the
fix, and any other review comments.

Thanks
-Amit Khandekar

On 9 November 2017 at 23:44, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Oct 28, 2017 at 5:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> No, because the Append node is *NOT* getting copied into shared
>> memory. I have pushed a comment update to the existing functions; you
>> can use the same comment for this patch.
>
> I spent the last several days working on this patch, which had a
> number of problems, both cosmetic and functional. I think the
> attached is in better shape now, but it could certainly use some more
> review and testing, since I only just finished modifying it, and I
> modified it pretty heavily. Changes:
>
> - I fixed the "morerows" entries in the documentation. If you had
> built the documentation the way you had it and loaded it up in a web
> browser, you would have seen that it was not correct.
>
> - I moved T_AppendState to a different position in the switch inside
> ExecParallelReInitializeDSM, so as to keep that switch in the same
> order as all of the other switch statements in that file.
>
> - I rewrote the comment for pa_finished. It previously began with
> "workers currently executing the subplan", which is not an accurate
> description. I suspect this was a holdover from a previous version of
> the patch, in which this was an array of integers rather than an
> array of type bool. I also fixed the comment in ExecAppendEstimate
> and added, removed, or rewrote various other comments as well.
>
> - I renamed PA_INVALID_PLAN to INVALID_SUBPLAN_INDEX, which I think
> is clearer and allows for the possibility that this sentinel value
> might someday be used for non-parallel-aware Append plans.
>
> - I largely rewrote the code for picking the next subplan. A
> superficial problem with the way you had it is that you had renamed
> exec_append_initialize_next to exec_append_seq_next but not updated
> the header comment to match. Also, the logic was spread out all over
> the file. There are three cases: not parallel-aware, leader, and
> worker. You had the code for the first case at the top of the file
> and the other two cases at the bottom of the file, and used multiple
> "if" statements to pick the right one in each case. I replaced all
> that with a function pointer stored in the AppendState, moved the
> code so it's all together, and rewrote it in a way that I find easier
> to understand. I also changed the naming convention.
>
> - I renamed pappend_len to pstate_len and ParallelAppendDescData to
> ParallelAppendState. I think the use of the word "descriptor" is a
> carryover from the concept of a scan descriptor. There's nothing
> really wrong with inventing the concept of an "append descriptor",
> but it seems clearer to just refer to shared state.
>
> - I fixed ExecAppendReInitializeDSM not to reset node->as_whichplan.
> Per commit 41b0dd987d44089dc48e9c70024277e253b396b7, that's wrong;
> instead, local state should be reset in ExecReScanAppend. I installed
> what I believe to be the correct logic in that function instead.
>
> - I fixed list_qsort() so that it copies the type of the old list
> into the new list. Otherwise, sorting a list of type T_IntList or
> T_OidList would turn it into just plain T_List, which is wrong.
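To make the pa_finished, INVALID_SUBPLAN_INDEX, and subplan-choosing
changes above concrete, here is a condensed sketch of the shared state
and dispatch shape they describe (assuming PostgreSQL's internal
headers; field and function names follow the renames above, but the
patch's exact layout may differ):

    /* Shared-memory state for a parallel-aware Append (sketch). */
    typedef struct ParallelAppendState
    {
        LWLock  pa_lock;        /* serializes choosing the next subplan */
        int     pa_next_plan;   /* next subplan any process should consider */

        /*
         * pa_finished[i] is true once no more processes should pick
         * subplan i: immediately for a non-partial subplan (only one
         * process may run it), or once some process has run a partial
         * subplan to completion.
         */
        bool    pa_finished[FLEXIBLE_ARRAY_MEMBER];
    } ParallelAppendState;

    #define INVALID_SUBPLAN_INDEX  -1

    /*
     * One function pointer in AppendState, set once at initialization,
     * replaces scattered "if" tests among the three cases.
     */
    typedef struct AppendState AppendState;
    struct AppendState
    {
        /* ... other executor fields elided ... */
        int     as_whichplan;   /* index of the current subplan */
        bool    (*choose_next_subplan) (AppendState *node);
    };

    static bool choose_next_subplan_locally(AppendState *node);  /* not parallel-aware */
    static bool choose_next_subplan_for_leader(AppendState *node);
    static bool choose_next_subplan_for_worker(AppendState *node);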
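The list_qsort() fix just described is a one-line change in spirit; a
sketch of the relevant part, with the cell copying and relinking
elided:

    List *
    list_qsort(const List *list, list_qsort_comparator cmp)
    {
        int     len = list_length(list);
        List   *new_list;

        /* ... copy the ListCells into an array and qsort() it ... */

        new_list = (List *) palloc(sizeof(List));
        new_list->type = list->type;    /* the fix: previously hard-coded
                                         * T_List, so T_IntList and
                                         * T_OidList inputs lost their tag */
        new_list->length = len;

        /* ... relink the sorted cells into new_list ... */

        return new_list;
    }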
>
> - I removed get_append_num_workers and integrated the logic into the
> callers. This function was coded quite strangely: it assigned the
> return value of fls() to a double and then eventually rounded the
> result back to an integer. But fls() returns an integer, so this
> doesn't make much sense. On a related note, I made it use fls(# of
> subpaths) instead of fls(# of subpaths) + 1. Adding 1 doesn't make
> sense to me here, because it leads to a decision to use 2 workers for
> a single, non-partial subpath. I suspect both of these mistakes stem
> from thinking that fls() returns the base-2 logarithm, but in fact it
> doesn't, quite: log2(1) = 0.0, but fls(1) = 1.
>
> - In the process of making the changes described in the previous
> point, I added a couple of assertions, one of which promptly failed.
> It turns out the reason is that your patch didn't update
> accumulate_append_subpaths(), which can result in flattening
> non-partial paths from a Parallel Append into a parent Append's list
> of partial paths, which is bad. The easiest way to fix that would be
> to just teach accumulate_append_subpaths() not to flatten a Parallel
> Append into a parent Append or MergeAppend node, but it seemed to me
> that there was a fair amount of duplication between
> accumulate_partialappend_subpath() and accumulate_append_subpaths(),
> so what I did instead was fold all of the necessary logic directly
> into accumulate_append_subpaths(). This approach also avoids some
> assumptions that your code made, such as the assumption that we will
> never have a partial MergeAppend path.
>
> - I changed create_append_path() so that it uses the parallel_aware
> argument as the only determinant of whether the output path is
> flagged as parallel-aware. Your version also considered whether
> parallel_workers > 0, but I think that's not a good idea; the caller
> should pass the correct value for parallel_aware rather than relying
> on this function to fix it. Possibly you indirectly encountered the
> problem mentioned in the previous point and worked around it like
> this, or maybe there was some other reason, but it doesn't seem to be
> necessary.
>
> - I changed things around to enforce the rule that all partial paths
> added to an appendrel must use the same row-count estimate. (This is
> not a new coding rule, but this patch provides a new way to violate
> it.) I did that by forcing the row count for any parallel append
> mixing partial and non-partial paths to match the row count of the
> paths already added. I also changed the way the row count is
> calculated in the case where the only parallel append path mixes
> partial and non-partial plans; I think this way is more consistent
> with what we've done elsewhere. This amounts to the assumption that
> we're trying to estimate the average number of rows per worker rather
> than the largest possible number; I'm not sure what the best thing to
> do here is in theory, but one advantage of this approach is that I
> think it will produce answers closer to the value we get for an
> all-partial-paths append. That's good, because we don't want the
> row-count estimate to change precipitously based on whether an
> all-partial-paths append is possible.
>
> - I fixed some whitespace problems by running pgindent on various
> files and manually breaking some long lines.
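For the fls() point above, a small illustration; the helper name and
the clamp against max_parallel_workers_per_gather are mine, added for
context, and fls() is a BSD function for which PostgreSQL carries a
fallback in src/port on platforms that lack it:

    #include <strings.h>        /* fls() on BSD-style platforms */

    /*
     * fls(x) is the 1-based position of the most significant set bit,
     * not floor(log2(x)):
     *
     *     fls(1) = 1    log2(1) = 0
     *     fls(2) = 2    log2(2) = 1
     *     fls(4) = 3    log2(4) = 2
     *
     * So fls(# of subpaths) already yields 1 worker for a single
     * subpath; the removed "+ 1" would have asked for 2.
     */
    static int
    append_parallel_workers(int nsubpaths, int max_per_gather)
    {
        /* Min() as in PostgreSQL's c.h */
        return Min(fls(nsubpaths), max_per_gather);
    }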
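And a condensed sketch of the flattening rule that the failed
assertion pointed at, as the fold-in is described above (the
first_partial_path field name is an assumption; MergeAppend handling
and child ordering are elided):

    static void
    accumulate_append_subpaths(Path *path, List **subpaths,
                               List **special_subpaths)
    {
        if (IsA(path, AppendPath))
        {
            AppendPath *apath = (AppendPath *) path;

            /*
             * A child Append may be flattened into the parent only if
             * it is not parallel-aware or has no non-partial subpaths;
             * otherwise its non-partial subpaths would wrongly end up
             * in a parent's list of partial paths.
             */
            if (!apath->path.parallel_aware || apath->first_partial_path == 0)
            {
                *subpaths = list_concat(*subpaths,
                                        list_copy(apath->subpaths));
                return;
            }

            /* ... else route non-partial children to *special_subpaths ... */
        }

        /* Anything else is kept as a single subpath. */
        *subpaths = lappend(*subpaths, path);
    }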
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company