Re: [DESIGN] ParallelAppend - Mailing list pgsql-hackers

From:           Robert Haas
Subject:        Re: [DESIGN] ParallelAppend
Msg-id:         CA+Tgmob18VZEhghSzeWqqEHBPQ89j7B9gJadytN=v0OEBtNM1A@mail.gmail.com
In response to: Re: [DESIGN] ParallelAppend (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
Responses:      Re: [DESIGN] ParallelAppend
                Re: [DESIGN] ParallelAppend
List:           pgsql-hackers
On Thu, Nov 12, 2015 at 12:09 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> I'm now designing the parallel feature of Append...
>
> Here is one challenge. How do we determine whether each sub-plan
> allows execution in the background worker context?

I've been thinking about these questions for a bit now, and I think we
can work on improving Append in multiple phases. The attached patch
shows what I have in mind for phase 1. Currently, if you set up an
inheritance hierarchy, you might get an Append some of whose children
are Gather nodes with Parallel Seq Scans under them, like this:

Append
-> Gather
  -> Parallel Seq Scan
-> Gather
  -> Parallel Seq Scan
-> Seq Scan

This is a crappy plan: each Gather node will try to launch its own
bunch of workers. The attached patch lets you get this plan instead:

Gather
-> Append
  -> Partial Seq Scan
  -> Partial Seq Scan
  -> Partial Seq Scan

That's a much better plan.

To make this work, the patch introduces a couple of new concepts. Each
RelOptInfo gets a new partial_pathlist field, which stores paths that
need to be gathered to produce a workable plan. Once we've populated
the partial_pathlist with whatever partial paths we know how to
generate, we can either push a Gather node on top of one of those
partial paths to create a real path, or we can use those partial paths
to build more partial paths.

The current patch handles only the simplest case: we can build a
partial path for an appendrel by appending a partial path for each
member rel. But eventually I hope to handle joinrels this way as well:
you can join a partial path with an ordinary path to form a partial
path for the joinrel.

This requires some way of figuring out how many workers to request for
the append-path, so this patch also adds a parallel_degree field to
the path object, which is 0 for non-parallel things and otherwise the
number of workers that the path believes to be ideal. Right now, it
just percolates the highest worker count of any child up to the
appendrel, which might not be right, especially once the append node
knows how to balance workers, but it seems like a reasonable place to
start.
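To make that concrete, the appendrel case could be wired up along
these lines (just a sketch: add_partial_path() and the extra
create_append_path() argument are stand-ins, not necessarily what the
attached patch does):

/*
 * Sketch only: build a partial append path for an appendrel by taking
 * the cheapest partial path from each live child rel.  The append
 * path's parallel_degree is just the highest child value for now.
 * Assumes the usual optimizer context (rel, live_childrels).
 */
List     *partial_subpaths = NIL;
int       parallel_degree = 0;
bool      all_partial = true;
ListCell *lc;

foreach(lc, live_childrels)
{
    RelOptInfo *childrel = (RelOptInfo *) lfirst(lc);
    Path       *subpath;

    if (childrel->partial_pathlist == NIL)
    {
        all_partial = false;    /* this child has no partial path */
        break;
    }

    /* take the cheapest partial path for this child */
    subpath = (Path *) linitial(childrel->partial_pathlist);
    partial_subpaths = lappend(partial_subpaths, subpath);
    parallel_degree = Max(parallel_degree, subpath->parallel_degree);
}

if (all_partial)
    add_partial_path(rel, (Path *)
                     create_append_path(rel, partial_subpaths,
                                        NULL, parallel_degree));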
> Type-A) It can be performed on background worker context and
>   picked up by multiple worker processes concurrently.
>   (e.g: Parallel SeqScan)
> Type-B) It can be performed on background worker context but
>   cannot be picked up by multiple worker processes.
>   (e.g: non-parallel aware node)
> Type-C) It should not be performed on background worker context.
>   (e.g: plan/path node with volatile functions)

At the time that we're forming an AppendPath, we can identify what
you're calling type-A paths very easily: they're the ones coming from
the partial_pathlist. We can distinguish between type-B and type-C
paths coming from the childrel's pathlist based on the childrel's
consider_parallel flag: if it's false, the path is type-C; otherwise,
type-B. At some point, we might need a per-path flag to distinguish
cases where a particular path is type-C even though some other plan
for that relation might be type-B, but right now that case doesn't
arise.
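In code form, the test at AppendPath-creation time could be as simple
as this (again a sketch; the enum and function are invented for
illustration):

/*
 * Sketch only: classifying a child path when the AppendPath is formed.
 */
typedef enum ChildPathType
{
    CHILD_PARALLEL,     /* type-A: partial path, shareable by workers */
    CHILD_SAFE,         /* type-B: parallel-safe, single process only */
    CHILD_RESTRICTED    /* type-C: must run in the leader */
} ChildPathType;

static ChildPathType
classify_child_path(RelOptInfo *childrel, Path *path)
{
    if (list_member_ptr(childrel->partial_pathlist, path))
        return CHILD_PARALLEL;      /* it came from partial_pathlist */
    else if (childrel->consider_parallel)
        return CHILD_SAFE;          /* ordinary path, but parallel-safe */
    else
        return CHILD_RESTRICTED;    /* e.g. volatile functions somewhere */
}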
Now, of course, it's not good enough to have this information
available when we're generating the AppendPath; it has to survive
until execution time. But that doesn't mean the paths need to be
self-identifying. We could, of course, decide to make them so, and
maybe that's the best design, but I'm thinking about another approach:
suppose the Append node itself, instead of having one list of child
plans, has a list of type-A plans, a list of type-B plans, and a list
of type-C plans. This actually seems quite convenient, because at
execution time you presumably want the leader to prioritize type-C
plans over any of the others, since the leader is the only process
that can execute them, and you want the workers to prioritize type-B
plans, since each of those can absorb only one worker and is thus
prone to be the limiting factor on finishing the whole Append. Having
the plans divided in advance between these three lists (say,
restricted_plans, safe_plans, parallel_plans) makes that easy to
implement.
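A sketch of what that might look like, with invented names
(claim_next() stands in for whatever mechanism hands out subplans that
still have work left):

/*
 * Sketch only: an Append plan whose children are divided in advance,
 * plus the per-process scheduling bias described above.
 */
typedef struct Append
{
    Plan    plan;
    List   *restricted_plans;   /* type-C: only the leader may run these */
    List   *safe_plans;         /* type-B: any one process may run each */
    List   *parallel_plans;     /* type-A: partial plans, shared by all */
} Append;

static Plan *
choose_next_subplan(Append *node, bool is_leader)
{
    /* The leader first drains plans nobody else may execute. */
    if (is_leader && node->restricted_plans != NIL)
        return claim_next(node->restricted_plans);

    /*
     * Workers prefer single-process plans: each can absorb only one
     * worker, so they tend to be what the whole Append waits on.
     */
    if (!is_leader && node->safe_plans != NIL)
        return claim_next(node->safe_plans);

    /* Otherwise, fall back to whatever this process is allowed to run. */
    if (node->parallel_plans != NIL)
        return claim_next(node->parallel_plans);
    if (node->safe_plans != NIL)
        return claim_next(node->safe_plans);

    return NULL;                /* nothing left for this process */
}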
Incidentally, I think it's subtly wrong to think of the parallel_aware
flag as telling you whether the plan can absorb multiple workers.
That's not really what it's for. It's to tell you whether the plan is
doing *something* parallel-aware - that is, whether its Estimate,
InitializeDSM, and InitializeWorker callbacks should do anything. For
SeqScan, flipping parallel_aware actually does split the input among
all the workers, but for Append it's probably just load balancing, and
for other nodes it might be something else again.

The term I'm using to indicate a path/plan that returns only a subset
of the results in each worker is "partial". Whether or not a path is
partial is, in the design embodied in this patch, indicated both by
whether path->parallel_degree > 0 and by whether the path is in
rel->pathlist or rel->partial_pathlist.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company