Re: [HACKERS] parallelize queries containing initplans - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: [HACKERS] parallelize queries containing initplans |
Date | |
Msg-id | CAA4eK1K4Q-aMeJ0nezBoiw2=Vn-QbdyM3exBQnE6KJt8uVzaQw@mail.gmail.com Whole thread Raw |
In response to | [HACKERS] parallelize queries containing initplans (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: [HACKERS] parallelize queries containing initplans
|
List | pgsql-hackers |
On Wed, Dec 28, 2016 at 5:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > To start > with let us see the plan of TPC-H query (Q-22) and understand how it > can be improved. > > Limit > InitPlan 1 (returns $0) > -> Finalize Aggregate > -> Gather > Workers Planned: 2 > -> Partial Aggregate > -> Parallel Seq Scan on customer customer_1 > Filter: (...) > -> GroupAggregate > Group Key: ("substring"((customer.c_phone)::text, 1, 2)) > -> Sort > Sort Key: ("substring"((customer.c_phone)::text, 1, 2)) > -> Nested Loop Anti Join > -> Seq Scan on customer > Filter: ((c_acctbal > $0) AND (...))) > -> Index Only Scan using idx_orders_custkey on orders > Index Cond: (o_custkey = customer.c_custkey) > > > In the above plan, we can see that the join on customer and orders > table (Nested Loop Anti Join) is not parallelised even though we have > the capability to parallelize Nested Loop Joins. The reason for not > choosing the parallel plan is that one of the nodes (Seq Scan on > customer) is referring to initplan and we consider such nodes as > parallel-restricted which means they can't be parallelised. Now, I > could see three ways of parallelizing such a query. The first way is > that we just push parallel-safe initplans to workers and allow them to > execute it, the drawback of this approach is that it won't be able to > push initplans in cases as shown above where initplan is > parallel-unsafe (contains Gather node) and second is we will lose the > expectation of single evaluation. The second way is that we always > execute the initplan in the master backend and pass the resultant > value to the worker, this will allow above form of plans to push > initplans to workers and hence can help in enabling parallelism for > other nodes in plan tree. > I have used the second way to parallelize queries containing initplans as that can help in cases where initplans in itself also uses parallelism and it will also retain an existing expectation of single evaluation for initplans. The basic idea as mentioned in above mail is to evaluate the initplans at Gather node and pass the value to worker backends which can use it as required. The patch has used *plan->allParam* bitmapset to evaluate the initplans at Gather node (we traverse the planstate tree to find params at each node and we take care to avoid multiple evaluations of same initplan). To identify initplan params among other params in *allParams*, the patch has added an additional bool variable (isinitplan) in ParamExecData. We can do it in some other way as well if there is any better suggestion. The patch has also changed the explain output of queries where initplan param is evaluated at Gather node. For ex. postgres=# explain (costs off) select t1.i from t1, t2 where t1.j=t2.j and t1.k < (select max(k) from t3) and t1.k < (select max(k) from t3); QUERY PLAN -------------------------------------------------------- Hash Join Hash Cond: (t2.j = t1.j) InitPlan 1 (returns $0) -> Finalize Aggregate -> Gather Workers Planned: 1 -> Partial Aggregate -> Parallel Seq Scan on t3 InitPlan 2 (returns $1) -> Finalize Aggregate -> Gather Workers Planned: 1 -> Partial Aggregate -> Parallel Seq Scan on t3 t3_1 -> Gather Workers Planned: 1 -> Parallel Seq Scan on t2 -> Hash -> Gather Workers Planned: 1 Params Evaluated: $0, $1 -> Parallel Seq Scan on t1 Filter: ((k < $0) AND (k < $1)) (23 rows) In the above plan, you can notice a line (Params Evaluated: $0, $1) which indicates the params evaluated at Gather node. As of now, explain just uses the *allParam* params present at the Gather node, but we need to traverse the planstate tree as we do during execution. This patch gives 2.5~3x performance benefit for Q-22 of TPC-H. > The drawback of the second approach is > that we need to evaluate the initplan before it is actually required > which means that we might evaluate it even when it is not required. I > am not sure if it is always safe to assume that we can evaluate the > initplan before pushing it to workers especially for the cases when it > is far enough down in the plan tree which we are parallelizing, > I think we can always pull up un-correlated initplans at Gather node, however, if there is a correlated initplan, then it is better not to allow such initplans for being pushed below gather. Ex. of correlated initplans: postgres=# explain (costs off) select * from t1 where t1.i in (select t2.i from t2 where t1.k = (select max(k) from t3 where t3.i=t1.i)); QUERY PLAN ---------------------------------------------- Seq Scan on t1 Filter: (SubPlan 2) SubPlan 2 -> Gather Workers Planned: 1 Params Evaluated: $1 InitPlan 1 (returns $1) -> Aggregate -> Seq Scan on t3 Filter: (i = t1.i) -> Result One-Time Filter: (t1.k = $1) -> Parallel Seq Scan on t2 (13 rows) It might be safe to allow above plan, but in general, such plans should not be allowed, because it might not be feasible to compute such initplan references at Gather node. I am still thinking on the best way to deal with such initplans. Thoughts? Thanks to Kuntal who is a co-author of this patch for doing the investigation along with me of different plans which contain references to initplans. Note - This patch needs to be applied on top of subplan patches [1][2]. [1] - https://www.postgresql.org/message-id/CAA4eK1KYQjQzQMpEz%2BQRA2fmim386gQLQBEf%2Bp2Wmtqjh1rjwg%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CAA4eK1LK3NjNY4ghHUOwYfBFa%2BAb2SvccTKAxTHbOdW1NhUjvQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
pgsql-hackers by date: