Re: Parallel Seq Scan - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Parallel Seq Scan
Msg-id: 20150210140838.GI21017@alap3.anarazel.de
In response to: Re: Parallel Seq Scan (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Parallel Seq Scan
List: pgsql-hackers
On 2015-02-10 08:52:09 -0500, Robert Haas wrote:
> On Tue, Feb 10, 2015 at 2:48 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Note that I'm not saying that Amit's patch is right - I haven't read it
> > - but that I don't think a 'scan this range of pages' heapscan API
> > would be a bad idea. Not even just for parallelism, but for a bunch of
> > usecases.
>
> We do have that, already. heap_setscanlimits(). I'm just not
> convinced that that's the right way to split up a parallel scan.
> There's too much risk of ending up with a very-uneven distribution of
> work.

If you make the chunks small enough, and then coordinate only the chunk
distribution, not really.

> >> Regarding tuple flow between backends, I've thought about that before,
> >> I agree that we need it, and I don't think I know how to do it. I can
> >> see how to have a group of processes executing a single node in
> >> parallel, or a single process executing a group of nodes we break off
> >> from the query tree and push down to it, but what you're talking about
> >> here is a group of processes executing a group of nodes jointly.
> >
> > I don't think it really is that. I think you'd do it essentially by
> > introducing a couple more nodes. Something like
> >
> >               SomeUpperLayerNode
> >                       |
> >                       |
> >                AggCombinerNode
> >               /               \
> >              /                 \
> > PartialHashAggNode   PartialHashAggNode ... PartialHashAggNode
> >          |                    |
> >          |                    |
> >   PartialSeqScan        PartialSeqScan
> >
> > The only thing that'd potentially need to end up working jointly
> > would be the block selection of the individual PartialSeqScans
> > to avoid having to wait for stragglers for too long. E.g. each might
> > just ask for a range of 16 megabytes or so that it scans sequentially.
> >
> > In such a plan - a pretty sensible and not that uncommon thing for
> > parallelized aggregates - you'd need to be able to tell the heap scans
> > which blocks to scan. Right?
> For this case, what I would imagine is that there is one parallel heap
> scan, and each PartialSeqScan attaches to it. The executor says "give
> me a tuple" and heapam.c provides one. Details like the chunk size
> are managed down inside heapam.c, and the executor does not know about
> them. It just knows that it can establish a parallel scan and then
> pull tuples from it.

I think that's a horrible approach that'll end up with far more
entangled pieces than what you're trying to avoid. Unless the tuple
flow is organized to only happen in the necessary cases, the
performance will be horrible. And good chunk sizes et al depend on
higher layers, selectivity estimates and such. That's planner/executor
work, not the physical layer (which heapam.c pretty much is).

An individual heap scan's state lives in process-private memory. And if
the results inside the separate workers should directly be used in
these workers without shipping over the network, it'd be horrible to
have the logic in the heapscan. How would you otherwise model an
executor tree that does the seqscan and aggregation combined in
multiple processes at the same time?

> >> Maybe we designate nodes as can-generate-multiple-tuple-streams (seq
> >> scan, mostly, I would think) and can-absorb-parallel-tuple-streams
> >> (sort, hash, materialize), or something like that, but I'm really
> >> fuzzy on the details.
> >
> > I don't think we really should have individual nodes that produce
> > multiple streams - that seems like it'd end up being really
> > complicated. I'd more say that we have distinct nodes (like the
> > PartialSeqScan ones above) that do a teensy bit of coordination about
> > which work to perform.
>
> I think we're in violent agreement here, except for some
> terminological confusion. Are there N PartialSeqScan nodes, one
> running in each node, or is there one ParallelSeqScan node, which is
> copied and run jointly across N nodes?
> You can talk about either way
> and have it make sense, but we haven't had enough conversations about
> this on this list to have settled on a consistent set of vocabulary
> yet.

I pretty strongly believe that it has to be independent scan nodes.
Both from an implementation and a conversational POV. They might have
some very light cooperation between them (e.g. coordinating block
ranges or such), but everything else should be separate.

From an implementation POV it seems pretty awful to have an executor
node that's accessed by multiple separate backends - that'd mean it'd
have to be concurrency safe, have state in shared memory and
everything.

Now, there'll be a node that needs to do some parallel magic - but in
the above example that should be the AggCombinerNode, which would not
only ask for tuples from one of the children at a time, but ask
multiple ones in parallel. But even then it doesn't have to deal with
concurrency around its own state.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services