Re: Parallel Seq Scan - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Parallel Seq Scan
Msg-id: 20150210074841.GA21017@alap3.anarazel.de
In response to: Re: Parallel Seq Scan (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Parallel Seq Scan
List: pgsql-hackers
On 2015-02-07 17:16:12 -0500, Robert Haas wrote:
> On Sat, Feb 7, 2015 at 4:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > [ criticism of Amit's heapam integration ]
> > I'm not convinced that that reasoning is generally valid. While it may
> > work out nicely for seqscans - which might be useful enough on its own -
> > the more stuff we parallelize, the *more* the executor will have to know
> > about it to make it sane. To actually scale nicely, e.g. a parallel sort
> > will have to execute the nodes below it on each backend, instead of
> > doing that in one backend as a separate step, ferrying over all tuples to
> > individual backends through queues, and only then parallelizing the
> > sort.
> >
> > Now, none of that is likely to matter immediately, but I think starting
> > to build the infrastructure at the points where we'll later need it does
> > make some sense.
>
> Well, I agree with you, but I'm not really sure what that has to do
> with the issue at hand. I mean, if we were to apply Amit's patch,
> we'd be in a situation where, for a non-parallel heap scan, heapam.c
> decides the order in which blocks get scanned, but for a parallel heap
> scan, nodeParallelSeqscan.c makes that decision. Maybe I'm an old
> fuddy-duddy[1] but that seems like an abstraction violation to me. I
> think the executor should see a parallel scan as a stream of tuples
> that streams into a bunch of backends in parallel, without really
> knowing how heapam.c is dividing up the work. That's how it's
> modularized today, and I don't see a reason to change it. Do you?

I don't really agree. Normally heapam just sequentially scans the heap in
one go; there's not much logic to that. Ok, then there's also the
synchronized seqscan stuff - which just about every user of heapscans but
the executor promptly disables again. I don't think a heap_scan_page() or
similar API would constitute a relevant layering violation over what we
already have.
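To make the "scan this range of pages" idea concrete, here is a minimal sketch in Python standing in for the C heapam layer. All names here (`heap_scan_range`, the list-of-pages heap) are invented for illustration and are not PostgreSQL's actual API; the point is only that a plain sequential scan and a per-worker partial scan can share one range-based primitive:

```python
# Hypothetical sketch: a "scan this block range" primitive, with the heap
# modeled as a list of pages, each page a list of tuples. Names are
# illustrative, not PostgreSQL's real heapam interface.

def heap_scan_range(heap, start_block, nblocks):
    """Yield every tuple on pages [start_block, start_block + nblocks)."""
    for blockno in range(start_block, min(start_block + nblocks, len(heap))):
        for tup in heap[blockno]:
            yield tup

def seq_scan(heap):
    """A plain sequential scan is just the full block range."""
    return heap_scan_range(heap, 0, len(heap))

# A toy 4-page heap, two tuples per page.
heap = [[(b, i) for i in range(2)] for b in range(4)]

# The full scan and a per-page walk see the same tuples.
assert list(seq_scan(heap)) == [t for page in heap for t in page]

# A "parallel worker" asked to scan only pages 1-2 sees just those tuples.
assert list(heap_scan_range(heap, 1, 2)) == [(1, 0), (1, 1), (2, 0), (2, 1)]
```

Under this shape, heapam still owns the mechanics of walking pages; the executor (or a parallel coordinator) only decides which ranges each backend asks for.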
Note that I'm not saying that Amit's patch is right - I haven't read it -
but that I don't think a 'scan this range of pages' heapscan API would be
a bad idea. Not even just for parallelism, but for a bunch of use cases.

> Regarding tuple flow between backends, I've thought about that before,
> I agree that we need it, and I don't think I know how to do it. I can
> see how to have a group of processes executing a single node in
> parallel, or a single process executing a group of nodes we break off
> from the query tree and push down to it, but what you're talking about
> here is a group of processes executing a group of nodes jointly.

I don't think it really is that. I think you'd do it essentially by
introducing a couple more nodes. Something like:

         SomeUpperLayerNode
                 |
           AggCombinerNode
              /      \
             /        \
  PartialHashAggNode   PartialHashAggNode ... PartialHashAggNode ...
         |                    |
         |                    |
   PartialSeqScan        PartialSeqScan

The only thing that potentially might need to end up working jointly
would be the block selection of the individual PartialSeqScans, to avoid
having to wait too long for stragglers. E.g. each might just ask for a
range of 16 megabytes or so that it scans sequentially.

In such a plan - a pretty sensible and not that uncommon thing for
parallelized aggregates - you'd need to be able to tell the heap scans
which blocks to scan. Right?

> That seems like an excellent idea, but I don't know how to design it.
> Actually routing the tuples between whichever backends we want to
> exchange them between is easy enough, but how do we decide whether to
> generate such a plan? What does the actual plan tree look like?

I described above roughly how I think it'd look. Whether to generate it
would probably depend on the cardinality (not much point in doing the
above if all groups are distinct) and possibly on the aggregates in use
(whether we have a parallelizable sum/count/avg etc.).
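The plan shape above can be sketched as a small simulation, hedged accordingly: this is Python rather than executor C code, and every name (`make_chunk_allocator`, `partial_hash_agg`, `agg_combiner`) is invented for illustration. Each simulated PartialSeqScan claims the next chunk of blocks from a shared allocator (the "16 megabytes or so" idea), feeds a per-worker partial hash aggregate (here a count per group), and a combiner step merges the partial results:

```python
from collections import Counter
from itertools import count

CHUNK_BLOCKS = 2  # stand-in for ~16MB worth of 8KB pages

def make_chunk_allocator(nblocks):
    """Hand out successive [start, end) block ranges; None when exhausted.

    In a real executor this would be a counter in shared memory that
    concurrent workers advance; here it is a closure over an iterator.
    """
    it = count(0, CHUNK_BLOCKS)
    def claim():
        start = next(it)
        if start >= nblocks:
            return None
        return start, min(start + CHUNK_BLOCKS, nblocks)
    return claim

def partial_hash_agg(heap, claim):
    """One worker: scan claimed chunks, build a partial group -> count map."""
    partial = Counter()
    while (chunk := claim()) is not None:
        for blockno in range(*chunk):
            for group_key in heap[blockno]:
                partial[group_key] += 1
    return partial

def agg_combiner(partials):
    """Merge per-worker partial aggregates into the final group counts."""
    final = Counter()
    for p in partials:
        final.update(p)
    return final

# Toy heap: each page holds group keys. Real workers would interleave
# their claim() calls concurrently; run sequentially here, the first
# worker happens to drain the allocator, which the combiner tolerates.
heap = [["a", "b"], ["b"], ["a", "a"], ["c"]]
claim = make_chunk_allocator(len(heap))
partials = [partial_hash_agg(heap, claim) for _ in range(2)]  # two workers
assert agg_combiner(partials) == Counter({"a": 3, "b": 2, "c": 1})
```

The design point this illustrates: the only shared state between the PartialSeqScans is the chunk allocator, so the coordination surface stays tiny regardless of how many workers run the subtree.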
> Maybe we designate nodes as can-generate-multiple-tuple-streams (seq
> scan, mostly, I would think) and can-absorb-parallel-tuple-streams
> (sort, hash, materialize), or something like that, but I'm really
> fuzzy on the details.

I don't think we really should have individual nodes that produce
multiple streams - that seems like it'd end up being really complicated.
I'd more say that we have distinct nodes (like the PartialSeqScan ones
above) that do a teensy bit of coordination about which work to perform.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services