Re: Parallel Seq Scan - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Parallel Seq Scan
Msg-id: 20150210205653.GM21017@alap3.anarazel.de
In response to: Re: Parallel Seq Scan (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Parallel Seq Scan; Re: Parallel Seq Scan
List: pgsql-hackers
On 2015-02-10 09:23:02 -0500, Robert Haas wrote:
> On Tue, Feb 10, 2015 at 9:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > And good chunk sizes et al depend on higher layers,
> > selectivity estimates and such. And that's planner/executor work, not
> > the physical layer (which heapam.c pretty much is).
>
> If it's true that a good chunk size depends on the higher layers, then
> that would be a good argument for doing this differently, or at least
> exposing an API for the higher layers to tell heapam.c what chunk size
> they want. I hadn't considered that possibility - can you elaborate
> on why you think we might want to vary the chunk size?

Because things like chunk size depend on the shape of the entire plan.
If you have a 1TB table and want to sequentially scan it in parallel
with 10 workers, you'd better use some rather large chunks. That way
readahead will be efficient in a cpu/socket-local manner, i.e. pages
get read directly into the memory attached to that cpu. That's
important for performance on a NUMA system, otherwise you'll constantly
have everything go over the shared bus. But if you instead have a plan
where the sequential scan goes over a 1GB table, perhaps with some
relatively expensive filters, you'll really want a small chunk size to
avoid waiting. The chunk size will also depend on what other nodes are
doing, at least if they can run in the same worker.

Even without things like NUMA and readahead I'm pretty sure that you'll
want a chunk size a good bit above one page. The locks we acquire for
the buffercache lookup and for reading the page are already quite bad
for performance/scalability, even if we don't always/often hit the same
lock. Making 20 processes that scan pages in parallel acquire yet
another lock (that's shared between all of them!) for every single page
won't be fun, especially with no filters, or with only fast ones.

> >> For this case, what I would imagine is that there is one parallel heap
> >> scan, and each PartialSeqScan attaches to it. The executor says "give
> >> me a tuple" and heapam.c provides one. Details like the chunk size
> >> are managed down inside heapam.c, and the executor does not know about
> >> them. It just knows that it can establish a parallel scan and then
> >> pull tuples from it.
> >
> > I think that's a horrible approach that'll end up with far more
> > entangled pieces than what you're trying to avoid. Unless the tuple flow
> > is organized to only happen in the necessary cases the performance will
> > be horrible.
>
> I can't understand this at all. A parallel heap scan, as I've coded
> it up, involves no tuple flow at all. All that's happening at the
> heapam.c layer is that we're coordinating which blocks to scan. Not
> to be disrespectful, but have you actually looked at the patch?

No, and I said so upthread. I started commenting because you argued
that architecturally parallelism belongs in heapam.c instead of the
upper layers, and I can't agree with that.

I have now, and it looks less bad than I had assumed, sorry.
Unfortunately I still think it's the wrong approach, also sorry.

As pointed out above (moved there after reading the patch...), I don't
think a chunk size of 1, or any other constant size, can make sense. I
don't even believe it'll necessarily be constant across an entire query
execution (big initially, small at the end). Now, we could move
determining that to before query execution, i.e. into executor
initialization, but then we won't yet know how many workers we're going
to get.
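To make the "big initially, small at the end" point a bit more
concrete, here's a purely illustrative heuristic - the function and its
constants are invented for this mail, they're not from the patch:

#include "postgres.h"
#include "storage/block.h"

/*
 * Invented example, not from the patch: pick how many blocks to hand a
 * worker based on relation size, worker count and how much of the scan
 * is left.  Big chunks while most of the table remains (readahead/NUMA
 * friendly), shrinking towards the end so no worker is left chewing on
 * a long tail while the others are idle.
 */
static BlockNumber
choose_chunk_size(BlockNumber nblocks, int nworkers, BlockNumber blocks_left)
{
	BlockNumber chunk = nblocks / (nworkers * 64);	/* aim for ~64 chunks per worker */

	/* towards the end, never hand out more than a fair share of what's left */
	if (chunk > blocks_left / nworkers)
		chunk = blocks_left / nworkers;
	if (chunk < 1)
		chunk = 1;

	return chunk;
}

The exact shape obviously doesn't matter; the point is just that the
answer depends on plan-level information that the physical layer
doesn't have on its own.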
We could add a function to set the chunking behaviour at runtime, but
that'd mix up responsibilities quite a bit.

I also can't agree with having a static snapshot in shared memory, put
there by the initialization function. For one, it's quite awkward to
end up with several equivalent snapshots at various places in shared
memory. Right now the entire query execution can share one snapshot;
this way we'd end up with several of them. Imo for actual parallel
query execution the plan should be shared once and then be reused for
everything done in the name of the query.

Without the need to do that, you end up with pretty much only the
infrastructure setup required so that heap_parallelscan_nextpage can be
called. How about instead renaming heap_beginscan_internal() to
_extended and offering an option to provide a callback + state that
determines the next page? Additionally we could provide some separate
functions managing a simple implementation of such a callback + state.

Btw, using an atomic uint32 you'd end up without the spinlock and with
just about the same amount of code... Just do a
atomic_fetch_add_until32(var, 1, InvalidBlockNumber)... ;)

> >> I think we're in violent agreement here, except for some
> >> terminological confusion. Are there N PartialSeqScan nodes, one
> >> running in each node, or is there one ParallelSeqScan node, which is
> >> copied and run jointly across N nodes? You can talk about it either
> >> way and have it make sense, but we haven't had enough conversations
> >> about this on this list to have settled on a consistent set of
> >> vocabulary yet.
> >
> > I pretty strongly believe that it has to be independent scan nodes. Both
> > from an implementation and a conversational POV. They might have some
> > very light cooperation between them (e.g. coordinating block ranges or
> > such), but everything else should be separate. From an implementation
> > POV it seems pretty awful to have an executor node that's accessed by
> > multiple separate backends - that'd mean it has to be concurrency safe,
> > have state in shared memory and everything.
>
> I don't agree with that, but again I think it's a terminological
> dispute. I think what will happen is that you will have a single node
> that gets copied into multiple backends, and in some cases a small
> portion of its state will live in shared memory. That's more or less
> what you're thinking of too, I think.

Well, let me put it this way: I think the tuple flow has to be pretty
much like I ascii-art'ed earlier, and only very few nodes will need to
coordinate between query execution happening in different workers. With
that I mean it has to be possible to have queries like:

     ParallelismDrivingNode
               |
     ---------------- Parallelism boundary
               |
            NestLoop
           /        \
      CSeqScan    IndexScan

where the 'coordinated seqscan' scans a relation so that each tuple
eventually gets returned once across all nodes, but the nested loop
(and through it the index scan) will just run normally, without any
coordination and parallelism. Everything below the --- boundary would,
however, happen in multiple nodes.

If you agree, yes, then we're in violent agreement ;). The "single node
that gets copied" bit above makes me a bit unsure whether we are,
though.

To me, given the existing executor code, it seems easiest to achieve
that by having the ParallelismDrivingNode above have a dynamic number
of nestloop children in different backends, and by pointing the
coordinated seqscan to some shared state.
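To illustrate the "some shared state" bit (and the atomic-instead-of-
spinlock aside from above) - a sketch only, with invented names, not
what the patch actually does - the shared part can be as small as an
atomically advanced block counter:

#include "postgres.h"
#include "port/atomics.h"
#include "storage/block.h"

/* invented name: the bit of shared memory all cooperating scan nodes point at */
typedef struct CoordinatedScanState
{
	BlockNumber		nblocks;	/* relation size at scan start */
	pg_atomic_uint32	next_block;	/* next block to hand out */
} CoordinatedScanState;

/*
 * Claim the next block without any spinlock.  A real version would want
 * the fetch-add-until variant mentioned above, so the counter can never
 * run into InvalidBlockNumber; for a sketch a plain fetch-add is close
 * enough.
 */
static BlockNumber
coordinated_scan_next_block(CoordinatedScanState *state)
{
	uint32		block = pg_atomic_fetch_add_u32(&state->next_block, 1);

	if (block >= state->nblocks)
		return InvalidBlockNumber;	/* every block has been handed out */

	return (BlockNumber) block;
}

Chunking would then just mean adding more than one per call, and the
callback + state approach above would keep heapam.c from having to care
where that state actually lives.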
As you point out, the number of these nestloop children cannot be known
for certain at plan time (only targeted for); that puts a certain limit
on how independent they are. But since large parts of them are
independent between workers, it seems awkward to generally treat them
as being the same node across workers. Maybe that's just an issue with
my mental model, though.

> But what I don't want is - if we've got a parallel scan-and-aggregate
> happening in N nodes, EXPLAIN shows N copies of all of that - not only
> because it's display clutter, but also because a plan to do that thing
> with 3 workers is fundamentally the same as a plan to do it with 30
> workers. Those plans shouldn't look different, except perhaps for a
> line some place that says "Number of Workers: N".

I'm really not concerned with what EXPLAIN is going to show. We can do
quite some fudging there - it's not like it's a 1:1 representation of
the query plan. I think we're getting to the point where having a
unique mapping from the plan to the execution tree is proving to be
rather limiting anyway; see for example the discussion about join
removal.

But even for current code, showing only the custom plans for the first
five EXPLAIN EXECUTEs is pretty nasty (try explaining that to somebody
who doesn't know the pg internals - their looks are worth gold and can
kill you at the same time) and should be done differently. And I can
actually very well imagine that you'd want an option to show the
different execution statistics for every worker in the ANALYZE case.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services