Thread: parallel joins, and better parallel explain
Attached find a patch that does (mostly) two things. First, it allows the optimizer to generate plans where a Nested Loop or Hash Join appears below a Gather node. This is a big improvement on what we have today, where only a sequential scan can be parallelized; with this patch, entire join problems can be parallelized, as long as they don't need a Merge Join (see below for more on this). Second, it improves the output of EXPLAIN when parallel workers are used. With this patch, EXPLAIN (ANALYZE, VERBOSE) displays not only totals for all workers, as it does currently in master, but also per-worker statistics. Putting these two things together, you can get spiffy stuff like this - thanks to Thom Brown for the query and sample data:

rhaas=# explain (analyze, verbose, costs off) SELECT count(*) FROM contacts NATURAL JOIN countries WHERE continent = 'Africa';
                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=602.527..602.527 rows=1 loops=1)
   Output: count(*)
   ->  Gather (actual time=0.243..531.129 rows=1185951 loops=1)
         Number of Workers: 2
         ->  Hash Join (actual time=0.206..396.106 rows=395317 loops=3)
               Hash Cond: (contacts.country = countries.country)
               Worker 0: actual time=0.260..485.785 rows=486461 loops=1
               Worker 1: actual time=0.260..483.459 rows=480065 loops=1
               ->  Parallel Seq Scan on public.contacts (actual time=0.034..143.824 rows=1666667 loops=3)
                     Output: contacts.id, contacts.first_name, contacts.last_name, contacts.age, contacts.country
                     Worker 0: actual time=0.035..176.784 rows=2051492 loops=1
                     Worker 1: actual time=0.038..174.175 rows=2021506 loops=1
               ->  Hash (actual time=0.064..0.064 rows=59 loops=3)
                     Output: countries.country
                     Buckets: 1024  Batches: 1  Memory Usage: 11kB
                     Worker 0: actual time=0.070..0.070 rows=59 loops=1
                     Worker 1: actual time=0.069..0.069 rows=59 loops=1
                     ->  Seq Scan on public.countries (actual time=0.019..0.051 rows=59 loops=3)
                           Output: countries.country
                           Filter: (countries.continent = 'Africa'::text)
                           Rows Removed by Filter: 190
                           Worker 0: actual time=0.025..0.056 rows=59 loops=1
                           Worker 1: actual time=0.025..0.058 rows=59 loops=1
 Planning time: 0.247 ms
 Execution time: 603.285 ms
(25 rows)

The general theory of operation of this patch is that a Parallel Seq Scan can be thought of as a "partial" path - that is, it can be run in multiple workers and will produce part of the results in each worker; when Gather is performed on those results, we get a complete result set. For reasons that should be fairly clear on short reflection, a join between a partial path for one of the two relations and an ordinary path for the other produces a partial path for the result; joining two partial paths would produce wrong answers. Thus, we proceed by generating partial paths for each baserel and joinrel, which can either be gathered to employ parallelism at that level, or used to build partial paths for higher-level joinrels which can then be gathered in turn.

As mentioned above, this patch doesn't try to generate Merge Join paths at present. That could be changed, but the plans would probably not be very good in most cases; the problem is of course that only one of the two paths can be partial. So, if you had a sort on both sides, each worker would have to sort part of one relation (which sounds fine) and all of the other one (which sounds bad). You might conceivably win with a Sort on one side and an Index Scan on the other side, but probably not very often. The Gather at the top of the plan tree is order-destroying, so it doesn't even help for the merge ordering to match the final query ordering. I'll put some code in to try partial Merge Join plans if there is a big hue and cry for it, but personally my feeling is that it would be smarter to forget it for now and write the code once we have some better options on the executor side. See the "parallelism + sorting" thread for some discussion of what kinds of executor nodes would be useful in being able to generate better parallel merge joins. Some of that work might even get done in time for 9.6, which would be nice.

I thought for a while that I might need some code to prevent the generation of parallel plans in some cases that involve volatile functions. For example, consider this query:

select (length(continent) - 3) % 10, count(*) from contacts natural join countries where (length(continent) - 3) % 10 = substr(timeofday()::text, 23, 1)::integer group by 1;

Without parallelism, this gets evaluated (on my machine, anyway) as a hash join with countries on the inner side; the volatile but parallel-safe filter condition is applied to the seq scan that feeds the hash join. Thus the decision as to whether each row of the countries table is in or out gets made just once. If this gets run with a parallel hash join, each worker will make its own decision about which rows to include in the hash table, and they probably won't all make the same decisions. This means that the parallel hash join could return a combination of rows that could never be returned by any serial plan. That sounds bad, until you realize that if you disable hash and merge joins and materialization, this can also be run as a nested loop plan, in which case - if you can persuade the optimizer to put the countries table on the inner side of the join - you can get the executor to evaluate the filter condition on countries once for every row in the contacts table, and of course there's nothing at all that will make it give the same answer for each row each time.

After mulling it over a bit and studying the documentation, I realized that marking a function "volatile" guarantees that it won't be executed *less than once per base table row*. The use of a parallel plan here doesn't violate that rule. "volatile" never guarantees that the function won't be evaluated *more than once per base table row*. So now I think this is all fine. If we're in a serial plan, a volatile filter condition will get evaluated as little as once per row, but maybe as many times as some nested loop around that scan executes. In a parallel plan, it's the same thing, except that the number of loops might be based on the number of parallel workers rather than the number of rows in some outer table. That doesn't seem like an important distinction.

This patch expands on and subsumes the patch I posted on the Parallel Append thread. It therefore also does what was discussed over there: pulls up Gather on top of any Append nodes generated by inheritance expansion, which is a good idea for all the reasons already discussed on that thread. Also as discussed on that thread, the cost model here is still not particularly smart and probably needs improvement in a variety of ways. Amit Kapila and others at EnterpriseDB will be looking further into that, and I hope for more feedback from other interested parties as well, but I don't think this patch needs to fix everything that's wrong with parallel costing itself. Even identifying those problems will probably take a fair amount of time and research, and not having this functionality will not make finding those problems any easier - probably the opposite.

The EXPLAIN portion of the patch should perhaps be separated out and reviewed/committed separately. I developed it together because I found the current EXPLAIN output inadequate for understanding how work was divided up between the leader and workers. The EXPLAIN changes made it possible to figure that out. More improvements are possible, but I think this is a big step up over what we have now, so I'm anxious to press forward with it. Let me know if it's helpful to split this part out for separate review; I've left it together for now both because it makes testing the rest of the patch easier and because it's 9:30pm on the day before Thanksgiving. (Happy Thanksgiving, BTW, if you live someplace where that's a thing, as I do!)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
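To make the partial-path rule above concrete, here is a minimal standalone C sketch; the DemoPath type and make_join_path() function are invented purely for illustration and are not the planner's actual API:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative stand-in for a planner path; not PostgreSQL's Path struct. */
typedef struct DemoPath
{
    const char *desc;
    bool        partial;    /* produces only part of the result per worker? */
} DemoPath;

/*
 * Joining a partial path to a complete path yields a partial path: each
 * worker joins its share of the partial input against the whole other
 * input.  Joining two partial paths would lose rows, so it is refused.
 */
static DemoPath *
make_join_path(const DemoPath *outer, const DemoPath *inner)
{
    DemoPath   *join;

    if (outer->partial && inner->partial)
        return NULL;        /* would produce wrong answers */

    join = malloc(sizeof(DemoPath));
    join->desc = "join";
    join->partial = outer->partial || inner->partial;
    return join;
}

int
main(void)
{
    DemoPath    partial_scan = {"Parallel Seq Scan", true};
    DemoPath    full_scan = {"Seq Scan", false};
    DemoPath   *join = make_join_path(&partial_scan, &full_scan);

    /* A partial joinrel path still has to be finished off with a Gather. */
    printf("join is %s\n",
           join && join->partial ? "partial (needs Gather)" : "complete");
    return 0;
}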
On Thu, Nov 26, 2015 at 3:45 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Sounds like good progress.

Thanks.

> This gives us multiple copies of the hash table, which means we must either
> use N * work_mem, or we must limit the hash table to work_mem / N per
> partial plan.

We use N * work_mem in this patch. The other option would be a lot more expensive in terms of planning time, because we'd have to generate one set of hash join paths (or at least hash join costs) for use in serial plans and another set for use in parallel plans. As it is, the parallel stuff just piggybacks on the plans we're generating anyway. We might have to change that at some point, but I think we'll do well to put that point off as long as possible.

> How can the partial paths share a hash table?

I'm glad you asked that question. For that, we need an allocator that can work with dynamic shared memory segments, and a hash table built on top of that allocator. It's relatively complicated because the same DSM segments can be mapped at different addresses in different processes, so you can't use native pointers. However, I'm pleased to report that my colleague Thomas Munro is working on this problem, and that he will submit this work to pgsql-hackers when it's ready, which may be soon enough that we can consider including this in 9.6 if the design meets with general approval.

As you may recall, I previously proposed an allocator framework, somewhat of a work in progress at that time, and the reaction here was a bit lukewarm, which is why I shifted from parallel sort to parallel seq scan as a first project. I now think that was a good decision, and as a result of Peter Geoghegan's work on sorting and my own experience further developing the parallelism code, I no longer think that allocator is the right solution for parallel sort anyway. But I still think it might be the right solution for a parallel hash join with a shared hash table.

My idea is that you'd end up with a plan like this:

Gather
  -> Hash Join
       -> Parallel Seq Scan
       -> Parallel Hash
            -> Parallel Seq Scan

Not only does this build only one copy of the hash table instead of N copies, but we can parallelize the hash table construction itself by having all workers insert in parallel, which is pretty cool. However, I don't expect this to be better than an unshared hash table in all cases. We have a fair amount of evidence that accessing backend-private data structures can sometimes be much faster than accessing shared data structures - cf. not only the syscaches and relcache, but also the use of Materialize nodes by the planner in certain cases even when there are no filtering quals underneath. So there's probably going to be a need to consider both types of plans and decide between them based on memory usage and expected runtime. The details are not all clear to me yet, and most likely we'll have to postpone some of the decisions until the code is written and can be performance-tested to see in which cases one strategy performs better or worse than the other. What I can confirm at this point is that I've thought about the problem you're asking about here, and that EnterpriseDB intends to contribute code to address it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
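A minimal standalone C sketch of the relative-pointer idea behind such an allocator - storing offsets from the segment base instead of native pointers - with invented names; this is not the actual DSM or allocator API:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* An offset from the start of the segment stands in for a pointer. */
typedef size_t relptr;

static void *
relptr_access(char *segment_base, relptr off)
{
    return off == 0 ? NULL : segment_base + off;
}

static relptr
relptr_store(char *segment_base, void *ptr)
{
    return ptr == NULL ? 0 : (size_t) ((char *) ptr - segment_base);
}

int
main(void)
{
    /*
     * Pretend this malloc'd block is a DSM segment.  Each backend may see
     * the segment at a different virtual address, but the offsets agree,
     * so they can be stored inside shared data structures safely.
     */
    char       *seg = malloc(1024);
    char       *entry = seg + 128;
    relptr      stored;

    strcpy(entry, "bucket entry");
    stored = relptr_store(seg, entry);      /* what goes into shared memory */

    printf("offset %zu -> %s\n", stored, (char *) relptr_access(seg, stored));
    free(seg);
    return 0;
}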
On Mon, Nov 30, 2015 at 4:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Not only does this build only one copy of the hash table instead of N
> copies, but we can parallelize the hash table construction itself by
> having all workers insert in parallel, which is pretty cool.

Hm. The case where you don't want parallel building of the hash table might be substantially simpler. You could build a hash table and then copy it into shared memory as a single contiguous read-only data structure optimized for lookups. I have an inkling that there are even ways of marking the memory as being read-only and not needing cache synchronization.

--
greg
On Mon, Nov 30, 2015 at 12:01 PM, Greg Stark <stark@mit.edu> wrote:
> Hm. The case where you don't want parallel building of the hash table
> might be substantially simpler. You could build a hash table and then
> copy it into shared memory as a single contiguous read-only data
> structure optimized for lookups. I have an inkling that there are even
> ways of marking the memory as being read-only and not needing cache
> synchronization.

Yes, that's another approach that we could consider. I suspect it's not really a lot better than the parallel-build case. If the inner table is small, then it's probably best to have every worker build its own unshared copy of the table rather than having one worker build the table and everybody else wait, which might lead to stalls during the build phase and additional traffic on the memory bus during the probe phase (though, as you say, giving the kernel a hint could help in some cases). If the inner table is big, then having everybody wait for a single process to perform the build probably sucks.

But it's not impossible that there could be cases when it trumps every other strategy. For example, if you're going to be doing a huge number of probes, you could try building the hash table with several different hash functions until you find one that is collision-free or nearly so, and then use that one. The extra effort spent during the build phase might speed up the probe phase enough to win. You can't do that sort of thing so easily in a parallel build. Even apart from that, if you build the hash table locally first and then copy it into shared memory afterwards, you can free up any extra memory and use only the minimal amount that you really need, which could be beneficial in some cases. I'm just not sure that's appealing enough to justify carrying a third system for building hash tables for hash joins.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
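The "several hash functions until nearly collision-free" idea can be sketched in standalone C as follows; the toy hash function, seed range, and 5% threshold are arbitrary illustrations, not anything the patch actually does:

#include <stdint.h>
#include <stdio.h>

#define NKEYS    1000
#define NBUCKETS 4096

/* Toy seeded hash; a real implementation would use something stronger. */
static uint32_t
hash_key(uint32_t key, uint32_t seed)
{
    uint32_t    h = (key * 2654435761u) ^ seed;

    h ^= h >> 16;
    return h;
}

/* Count how many keys land in an already-occupied bucket for this seed. */
static int
count_collisions(const uint32_t *keys, uint32_t seed)
{
    static char used[NBUCKETS];
    int         collisions = 0;

    for (int i = 0; i < NBUCKETS; i++)
        used[i] = 0;
    for (int i = 0; i < NKEYS; i++)
    {
        uint32_t    b = hash_key(keys[i], seed) % NBUCKETS;

        if (used[b])
            collisions++;
        used[b] = 1;
    }
    return collisions;
}

int
main(void)
{
    uint32_t    keys[NKEYS];
    uint32_t    best_seed = 0;
    int         best = NKEYS + 1;

    for (int i = 0; i < NKEYS; i++)
        keys[i] = i * 7 + 3;

    /* Spend extra effort at build time to make the probe phase cheaper. */
    for (uint32_t seed = 0; seed < 64; seed++)
    {
        int         c = count_collisions(keys, seed);

        if (c < best)
        {
            best = c;
            best_seed = seed;
        }
        if (best <= NKEYS / 20)     /* "nearly collision-free" */
            break;
    }
    printf("seed %u gives %d collisions out of %d keys\n", best_seed, best, NKEYS);
    return 0;
}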
On Thu, Nov 26, 2015 at 8:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Attached find a patch that does (mostly) two things.

I have started looking into this and would like to share few findings with you:

- There seems to be some inconsistency in Explain's output when multiple workers are used.

run_cost = run_cost / (path->parallel_degree + 0.5);

Above and changes in add_path() makes planner not to select parallel path for seq scan where earlier it was possible. I think you want to change the costing of parallel plans based on rows selected instead of total_cost, but there seems to be some problem in the logic (I think gather node is not taking into account the reduced cost).
On Tue, Dec 1, 2015 at 7:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Above and changes in add_path() makes planner not to select parallel path
> for seq scan where earlier it was possible. I think you want to change the
> costing of parallel plans based on rows selected instead of total_cost,
> but there seems to be some problem in the logic (I think gather node is not
> taking into account the reduced cost).

Oops. The new version I've attached should fix this. The reason why I needed to make a change there is because previously the number of rows estimated for the Parallel Seq Scan was the total number of rows, not the number of rows per worker. That doesn't really matter when we're only doing Parallel Seq Scan, but if you push a join below the Gather, then the cost of the join won't be computed correctly unless the row count is the number of rows per worker.

> - There seems to be some inconsistency in Explain's output when
> multiple workers are used.

What is going on here is a bit confusing, but in fact I believe it to be more correct than what we get with unpatched master. The problem has to do with the way that the instrumentation counts loops, and the way that EXPLAIN displays that information.

In unpatched master, InstrEndLoop() is not called before the worker instrumentation data is aggregated to the leader. Therefore, each node under the Gather ends up with a loop count of 1. Unless, of course, it was executed multiple times in one of the workers, for example because it was on the inside of a nested loop. In that case, it ends up with a loop count equal to the number of times it was executed *minus the number of workers*. Therefore, if there are 4 workers and a leader, and between those 5 processes they executed the inner side of a nested loop 1000 times, the final loop count is 996.

With the patch, the loop count is equal to the number of times that the nodes were actually executed. Typically, this ends up being equal to one more than the number of workers, because the leader executes it and so do all the workers, but it can end up being less if not all workers execute a particular node. Of course, it can also be more. If the node is executed repeatedly, the final loop count is equal to the total number of times that the node was executed across the leader and all workers. So, in the above example, the inner side of a nested loop would be 1000, not 996, which has the noteworthy advantage of being correct.

What makes the output a tad confusing is that some but not all fields in EXPLAIN output are shown as per-loop values. The startup cost, total cost, and row counts are divided by the number of iterations. I have always thought this was a terrible idea: when EXPLAIN tells me about a nested loop with an inner index scan, I want to know the TOTAL time spent on that index scan and the TOTAL number of rows returned, but what I get is the result of dividing those values by the number of loops and rounded off to a number of decimal places that almost entirely eliminate the possibility of extracting useful information from the results. However, I expect to be told that other people (especially Tom Lane) don't want to change this, and in any case if we were going to change it I think that would properly be a separate patch.

So the net result of this is that the times and row counts are *averages* across all of the loop iterations. In the case of the inner side of a nested loop, this means you can read the data just as you would in a non-parallel plan. For nodes executed exactly once per worker plus once in the master, the value displayed ends up being a per-process average of the amount of time spent, and a per-process average of the number of rows. On the other hand, values for buffers are NOT divided by the loop count, so those values are absolute totals. Once you understand this, I think the data is pretty easy to read.

>  ->  Gather  (cost=1000.00..46203.83 rows=9579 width=0) (actual time=33.983..33592.030 rows=9999 loops=1)
>        Output: c1, c2
>        Number of Workers: 4
>        Buffers: shared hit=548 read=142506
>        ->  Parallel Seq Scan on public.tbl_parallel_test  (cost=0.00..44245.93 rows=2129 width=0) (actual time=13.447..33354.099 rows=2000 loops=5)
>              Output: c1, c2
>              Filter: (tbl_parallel_test.c1 < 10000)
>              Rows Removed by Filter: 198000
>              Buffers: shared hit=352 read=142506
>              Worker 0: actual time=18.422..33322.132 rows=2170 loops=1
>                Buffers: shared hit=4 read=30765
>              Worker 1: actual time=0.803..33283.979 rows=1890 loops=1
>                Buffers: shared hit=1 read=26679
>              Worker 2: actual time=0.711..33360.007 rows=1946 loops=1
>                Buffers: shared hit=197 read=30899
>              Worker 3: actual time=15.057..33252.605 rows=2145 loops=1
>                Buffers: shared hit=145 read=25433
>  Planning time: 0.217 ms
>  Execution time: 33612.964 ms
> (22 rows)
>
> I am not able to understand how buffer usage add upto what is
> shown at Gather node.

It doesn't, of course. But I'm not sure it should, and I don't think this patch has changed anything about that either way. The patch only affects the nodes that run in the workers, and Gather doesn't.

> - I think it would be better if we add some explanation to Explain -
> Verbose section and an Example on the same page in documentation.
> This can help users to understand this feature.
>
> It would be better if we can split this patch into multiple patches like
> Explain related changes, Append pushdown related changes, Join
> Push down related changes. You can choose to push the patches as
> you prefer, but splitting can certainly help in review/verification of the
> code.

I don't think it really makes sense to split the append push-down changes from the join push-down changes; those share a great deal of code. But I've now split out the EXPLAIN changes. See attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
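Here is a small standalone C sketch of the loop-count bookkeeping described above (a simplified stand-in, not the backend's actual Instrumentation code): each execution of a node accumulates into running totals, EXPLAIN divides the time and row totals by nloops, and buffer counts are left as absolute totals.

#include <stdio.h>

/* Simplified stand-in for the executor's per-node instrumentation. */
typedef struct DemoInstr
{
    double      total_time;     /* seconds across all executions */
    double      ntuples;        /* rows across all executions */
    long        buffers_read;   /* NOT averaged by EXPLAIN */
    double      nloops;         /* number of times the node was executed */
} DemoInstr;

/* Called once per execution of the node, in leader or worker alike. */
static void
demo_end_loop(DemoInstr *instr, double elapsed, double rows, long buffers)
{
    instr->total_time += elapsed;
    instr->ntuples += rows;
    instr->buffers_read += buffers;
    instr->nloops += 1;
}

int
main(void)
{
    DemoInstr   instr = {0, 0, 0, 0};

    /* Leader plus two workers each execute the node once. */
    demo_end_loop(&instr, 0.40, 400000, 1000);
    demo_end_loop(&instr, 0.48, 486461, 1100);
    demo_end_loop(&instr, 0.48, 480065, 1050);

    /* What EXPLAIN prints: averages for time/rows, totals for buffers. */
    printf("actual time=%.3f rows=%.0f loops=%.0f, buffers read=%ld\n",
           instr.total_time / instr.nloops,
           instr.ntuples / instr.nloops,
           instr.nloops,
           instr.buffers_read);
    return 0;
}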
On Fri, Dec 4, 2015 at 3:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Do you think it will be useful to display in a similar way if worker
> is not able to execute plan (like before it starts execution, the other
> workers have already finished the work)?

Maybe, but it would clutter the output a good deal. I think we should instead have the Gather node itself display the number of workers that it actually managed to launch, and then the user can infer that any execution nodes that don't mention those workers were not executed by them.

> Other than that parallel-explain-v2.patch looks good.

OK, I'm going to go ahead and commit that part.

> I think the problem is at Gather node, the number of buffers (read + hit)
> are greater than the number of pages in relation. The reason why it
> is doing so is that in Workers (ParallelQueryMain()), it starts the buffer
> usage accumulation before ExecutorStart() whereas in master backend
> it always calculate it for ExecutorRun() phase, on changing it to accumulate
> for ExecutorRun() phase above problem is fixed. Attached patch fixes the
> problem.

Why is it a bad thing to capture the cost of doing ExecutorStart() in the worker? I can see there's an argument that changing this would be more consistent, but I'm not totally convinced. The overhead of ExecutorStart() in the leader isn't attributable to any specific worker, but the overhead of ExecutorStart() in the worker can fairly be blamed on Gather, I think. I'm not dead set against this change but I'm not totally convinced, either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
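The ExecutorStart()-versus-ExecutorRun() question can be illustrated with a standalone C sketch; the phase functions and the counter below are stand-ins for the real executor and buffer-usage machinery, which work differently in detail:

#include <stdio.h>

/* Stand-in for the process-wide buffer usage counter. */
static long pg_buffer_hits = 0;

static void demo_executor_start(void) { pg_buffer_hits += 25; }    /* startup work, e.g. catalog lookups */
static void demo_executor_run(void)   { pg_buffer_hits += 1000; }  /* the scan itself */

int
main(void)
{
    /*
     * Option 1 (current behaviour): snapshot before ExecutorStart, so the
     * worker's startup buffer accesses are charged to Gather as well.
     */
    long        before_start = pg_buffer_hits;

    demo_executor_start();

    /*
     * Option 2 (proposed change): snapshot after ExecutorStart, so only
     * the ExecutorRun phase is counted, matching what the leader reports.
     */
    long        before_run = pg_buffer_hits;

    demo_executor_run();

    printf("charged including startup: %ld\n", pg_buffer_hits - before_start);
    printf("charged for run phase only: %ld\n", pg_buffer_hits - before_run);
    return 0;
}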
On Wed, Dec 2, 2015 at 1:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Oops. The new version I've attached should fix this.

I've been trying to see if parallel join has any effect on PostGIS spatial join queries, which are commonly CPU bound. (My tests [1] on simple parallel scan were very positive, though quite limited in that they only parallelized such a small part of the work.)

Like Amit, I found the current patches are missing a change to src/include/nodes/relation.h, but just adding in "Relids extra_lateral_rels" to JoinPathExtraData allowed a warning-free build.

The assumptions on parallel code in general are that setting up parallel workers is very costly compared to the work to be done, so to get PostGIS code to parallelize I've been reduced to shoving parallel_setup_cost very low (1) and parallel_tuple_cost as well. Otherwise I just end up with ordinary plans. I did redefine all the relevant functions as "parallel safe" and upped their declared costs significantly.

I set up an 8000 record spatial table, with a spatial index, and did a self-join on it:

explain analyze
select a.gid, b.gid from vada a join vada b
on st_intersects(a.geom, b.geom)
where a.gid != b.gid;

With no parallelism, I got this:

set max_parallel_degree = 0;

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.15..227332.48 rows=1822243 width=8) (actual time=0.377..5528.461 rows=52074 loops=1)
  -> Seq Scan on vada a (cost=0.00..1209.92 rows=8792 width=1189) (actual time=0.027..5.004 rows=8792 loops=1)
  -> Index Scan using vada_gix on vada b (cost=0.15..25.71 rows=1 width=1189) (actual time=0.351..0.622 rows=6 loops=8792)
       Index Cond: (a.geom && geom)
       Filter: ((a.gid <> gid) AND _st_intersects(a.geom, geom))
       Rows Removed by Filter: 3
Planning time: 3.976 ms
Execution time: 5533.573 ms

With parallelism, I got this:

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.15..226930.05 rows=1822243 width=8) (actual time=0.840..5462.029 rows=52074 loops=1)
  -> Gather (cost=0.00..807.49 rows=8792 width=1189) (actual time=0.335..39.326 rows=8792 loops=1)
       Number of Workers: 1
       -> Parallel Seq Scan on vada a (cost=0.00..806.61 rows=5861 width=1189) (actual time=0.015..10.167 rows=4396 loops=2)
  -> Index Scan using vada_gix on vada b (cost=0.15..25.71 rows=1 width=1189) (actual time=0.353..0.609 rows=6 loops=8792)
       Index Cond: (a.geom && geom)
       Filter: ((a.gid <> gid) AND _st_intersects(a.geom, geom))
       Rows Removed by Filter: 3
Planning time: 4.019 ms
Execution time: 5467.126 ms

Given that it's a CPU-bound process, I was hoping for something closer to the results of the sequence tests, about 50% time reduction, based on the two cores in my test machine.

In general either the parallel planner is way too conservative (it seems), or we need to radically increase the costs of our PostGIS functions (right now, most "costly" functions are cost 100, but I had to push costs up into the 100000 range to get parallelism to kick in sometimes). Some guidelines on cost setting would be useful, something that says, "this function run against this kind of data is cost level 1; compare the time your function takes on 'standard' data to the baseline function to get a cost ratio to use in the function definition."

ATB,
P.

[1] https://gist.github.com/pramsey/84e7a39d83cccae692f8
On Mon, Dec 14, 2015 at 8:38 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> set enable_hashjoin=off;
> set enable_mergejoin=off;

[ ... ]

> Now here the point to observe is that non-parallel case uses both less
> Execution time and Planning time to complete the statement. There
> is a considerable increase in planning time without any benefit in
> execution.

So, you forced the query planner to give you a bad plan, and then you're complaining that the plan is bad? That's not a very surprising result.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Dec 16, 2015 at 6:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> In any case, I have done some more investigation of the patch and found
> that even without changing query planner related parameters, it seems to
> give bad plans (as in example below [1]). I think here the costing of
> rework each worker has to do seems to be missing which is causing bad
> plans to be selected over good plans.

I have done some more testing using TPC-H benchmark (for some of the queries, specially for Parallel Hash Join), and the results summary is as below.

Planning Time (ms)
  Query       Base     Patch
  TPC-H Q2    2.2      2.4
  TPC-H Q3    0.67     0.71
  TPC-H Q5    3.17     2.3
  TPC-H Q7    2.43     2.4

Execution Time (ms)
  Query       Base     Patch
  TPC-H Q2    2826     766
  TPC-H Q3    23473    24271
  TPC-H Q5    21357    1432
  TPC-H Q7    6779     1138

All test files and detailed plan output are attached in mail: q2.sql, q3.sql, q5.sql and q7.sql are the TPC-H benchmark's 2nd, 3rd, 5th and 7th queries, and results with base and parallel join are attached in q*_base.out and q*_parallel.out respectively.

Summary: With TPC-H queries, wherever a Hash Join is pushed under a Gather node, significant improvement is visible; with Q2, using 3 workers, the time consumed is almost 1/3 of the base.

I observed one problem: with Q5 and Q7, some relation and snapshot references are leaked and I am getting the below warnings; haven't yet looked into the issue.
WARNING: relcache reference leak: relation "customer" not closed
WARNING: relcache reference leak: relation "customer" not closed
WARNING: relcache reference leak: relation "customer" not closed
WARNING: Snapshot reference leak: Snapshot 0x2d1fee8 still referenced
WARNING: Snapshot reference leak: Snapshot 0x2d1fee8 still referenced
WARNING: Snapshot reference leak: Snapshot 0x2d1fee8 still referenced
WARNING: relcache reference leak: relation "customer" not closed
CONTEXT: parallel worker, PID 123413
WARNING: Snapshot reference leak: Snapshot 0x2d1fee8 still referenced
CONTEXT: parallel worker, PID 123413
WARNING: relcache reference leak: relation "customer" not closed
CONTEXT: parallel worker, PID 123412
WARNING: Snapshot reference leak: Snapshot 0x2d1fee8 still referenced
CONTEXT: parallel worker, PID 123412
WARNING: relcache reference leak: relation "customer" not closed
CONTEXT: parallel worker, PID 123411
WARNING: Snapshot reference leak: Snapshot 0x2d1fee8 still referenced
CONTEXT: parallel worker, PID 123411
psql:q7.sql:40: WARNING: relcache reference leak: relation "customer" not closed
psql:q7.sql:40: WARNING: Snapshot reference leak: Snapshot 0x2d1fee8 still referenced
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 15, 2015 at 7:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Dec 14, 2015 at 8:38 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > set enable_hashjoin=off;
> > set enable_mergejoin=off;
>
> [ ... ]
>
>
> > Now here the point to observe is that non-parallel case uses both less
> > Execution time and Planning time to complete the statement. There
> > is a considerable increase in planning time without any benefit in
> > execution.
>
> So, you forced the query planner to give you a bad plan, and then
> you're complaining that the plan is bad?
Oh no, I wanted to check the behaviour of parallel vs. non-parallel execution of joins. I think even if hash and merge join are set to off, it should have picked up a non-parallel NestLoop plan. In any case, I have done some more investigation of the patch and found that even without changing query planner related parameters, it seems to give bad plans (as in example below [1]). I think here the costing of rework each worker has to do seems to be missing, which is causing bad plans to be selected over good plans. Another point is that with the patch, the number of paths that we explore to get the cheapest path has increased; do you think we should try to evaluate it? One way is we run some queries where there are more number of joins and see the impact on planner time, and the other is we try to calculate the increase in number of paths that the planner explores.

[1] -
CREATE TABLE t1(c1, c2) AS SELECT g, repeat('x', 5) FROM generate_series(1, 10000000) g;
CREATE TABLE t2(c1, c2) AS SELECT g, repeat('x', 5) FROM generate_series(1, 3000000) g;
Analyze t1;
Analyze t2;

Non-parallel case

postgres=# Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND 200;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=261519.93..261519.94 rows=1 width=0) (actual time=2779.965..2779.965 rows=1 loops=1)
  -> Hash Join (cost=204052.91..261519.92 rows=1 width=0) (actual time=2017.241..2779.947 rows=101 loops=1)
       Hash Cond: (t2.c1 = t1.c1)
       -> Seq Scan on t2 (cost=0.00..46217.00 rows=3000000 width=4) (actual time=0.073..393.479 rows=3000000 loops=1)
       -> Hash (cost=204052.90..204052.90 rows=1 width=4) (actual time=2017.130..2017.130 rows=101 loops=1)
            Buckets: 1024 Batches: 1 Memory Usage: 12kB
            -> Seq Scan on t1 (cost=0.00..204052.90 rows=1 width=4) (actual time=0.038..2017.105 rows=101 loops=1)
                 Filter: ((c1 >= 100) AND (c1 <= 200))
                 Rows Removed by Filter: 9999899
Planning time: 0.113 ms
Execution time: 2780.000 ms
(11 rows)

Parallel case

set max_parallel_degree=4;
postgres=# Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND 200;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=100895.52..100895.53 rows=1 width=0) (actual time=67871.443..67871.443 rows=1 loops=1)
  -> Gather (cost=1000.00..100895.52 rows=1 width=0) (actual time=0.653..67871.287 rows=101 loops=1)
       Number of Workers: 4
       -> Nested Loop (cost=0.00..99895.42 rows=1 width=0) (actual time=591.408..16455.731 rows=20 loops=5)
            Join Filter: (t1.c1 = t2.c1)
            Rows Removed by Join Filter: 60599980
            -> Parallel Seq Scan on t1 (cost=0.00..45345.09 rows=0 width=4) (actual time=433.350..433.386 rows=20 loops=5)
                 Filter: ((c1 >= 100) AND (c1 <= 200))
                 Rows Removed by Filter: 1999980
            -> Seq Scan on t2 (cost=0.00..46217.00 rows=3000000 width=4) (actual time=0.005..395.480 rows=3000000 loops=101)
Planning time: 0.114 ms
Execution time: 67871.584 ms
(12 rows)

Without patch, parallel case

set max_parallel_degree=4;
Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND 200;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=103812.21..103812.22 rows=1 width=0) (actual time=1207.043..1207.043 rows=1 loops=1)
  -> Hash Join (cost=46345.20..103812.21 rows=1 width=0) (actual time=428.632..1207.027 rows=101 loops=1)
       Hash Cond: (t2.c1 = t1.c1)
       -> Seq Scan on t2 (cost=0.00..46217.00 rows=3000000 width=4) (actual time=0.034..375.670 rows=3000000 loops=1)
       -> Hash (cost=46345.19..46345.19 rows=1 width=4) (actual time=428.557..428.557 rows=101 loops=1)
            Buckets: 1024 Batches: 1 Memory Usage: 13kB
            -> Gather (cost=1000.00..46345.19 rows=1 width=4) (actual time=0.287..428.476 rows=101 loops=1)
                 Number of Workers: 4
                 -> Parallel Seq Scan on t1 (cost=0.00..45345.09 rows=1 width=4) (actual time=340.139..425.591 rows=20 loops=5)
                      Filter: ((c1 >= 100) AND (c1 <= 200))
                      Rows Removed by Filter: 1999980
Planning time: 0.116 ms
Execution time: 1207.196 ms
(13 rows)
On Wed, Dec 16, 2015 at 9:55 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Summary: With TPC-H queries, wherever a Hash Join is pushed under a Gather
> node, significant improvement is visible; with Q2, using 3 workers, the
> time consumed is almost 1/3 of the base.
>
> I observed one problem: with Q5 and Q7, some relation and snapshot
> references are leaked and I am getting the below warnings; haven't yet
> looked into the issue.

While looking at the plans of Q5 and Q7, I have observed that a Gather is pushed below another Gather node, for which we don't have an appropriate way of dealing. I think that could be the reason why you are seeing the errors.

Also, I think it would be good if you can once check the plan/execution time with max_parallel_degree=0, as that can give us base reference data without parallelism. Also, I am wondering if you have changed any other parallel cost related parameter?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 17, 2015 at 12:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> While looking at plans of Q5 and Q7, I have observed that Gather is
> pushed below another Gather node for which we don't have appropriate
> way of dealing. I think that could be the reason why you are seeing
> the errors.

Uh oh. That's not supposed to happen. A GatherPath is supposed to have parallel_safe = false, which should prevent the planner from using it to form new partial paths. Is this with the latest version of the patch? The plan output suggests that we're somehow reaching try_partial_hashjoin_path() with inner_path being a GatherPath, but I don't immediately see how that's possible, because create_gather_path() sets parallel_safe to false unconditionally, and hash_inner_and_outer() never sets cheapest_safe_inner to a path unless that path is parallel_safe.

Do you have a self-contained test case that reproduces this, or any insight as to how it's happening here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
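A standalone C sketch of the rule Robert describes, with invented types (the real logic lives in hash_inner_and_outer() and related code): when choosing an inner path to go under a partial join, only parallel_safe paths are eligible, which is exactly why a GatherPath is created with parallel_safe = false.

#include <stdbool.h>
#include <stdio.h>

typedef struct DemoPath
{
    const char *desc;
    double      total_cost;
    bool        parallel_safe;
} DemoPath;

/* Return the cheapest inner path that is safe to run inside a worker. */
static const DemoPath *
cheapest_safe_inner(const DemoPath *paths, int npaths)
{
    const DemoPath *best = NULL;

    for (int i = 0; i < npaths; i++)
    {
        if (!paths[i].parallel_safe)
            continue;           /* e.g. a GatherPath: never reused below Gather */
        if (best == NULL || paths[i].total_cost < best->total_cost)
            best = &paths[i];
    }
    return best;
}

int
main(void)
{
    DemoPath    inner[] = {
        {"Gather over Parallel Seq Scan", 90.0, false},
        {"Seq Scan", 100.0, true},
        {"Index Scan", 120.0, true},
    };
    const DemoPath *pick = cheapest_safe_inner(inner, 3);

    printf("inner path for the partial join: %s\n", pick ? pick->desc : "none");
    return 0;
}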
On Fri, Dec 18, 2015 at 7.59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Uh oh. That's not supposed to happen. A GatherPath is supposed to
> have parallel_safe = false, which should prevent the planner from
> using it to form new partial paths. Is this with the latest version
> of the patch? The plan output suggests that we're somehow reaching
> try_partial_hashjoin_path() with inner_path being a GatherPath, but I
> don't immediately see how that's possible, because
> create_gather_path() sets parallel_safe to false unconditionally, and
> hash_inner_and_outer() never sets cheapest_safe_inner to a path unless
> that path is parallel_safe.

Yes, you are right that create_gather_path() sets parallel_safe to false unconditionally, but whenever we are building a non partial path, that time we should carry forward the parallel_safe state to its parent, and it seems like that part is missing here. For example, in create_nestloop_path():

create_nestloop_path()
{
    ...
    pathnode->path.param_info =
        get_joinrel_parampathinfo(root,
                                  joinrel,
                                  outer_path,
                                  inner_path,
                                  sjinfo,
                                  required_outer,
                                  &restrict_clauses);
    pathnode->path.parallel_aware = false;
    /* The joinrel may be parallel-safe even though this particular pair of
     * paths is not, so we need to take this from inner_path and outer_path: */
    pathnode->path.parallel_safe = joinrel->consider_parallel;
    /* If any of the child paths is parallel-unsafe, mark the parent as
     * parallel-unsafe as well: */
    pathnode->path.parallel_safe =
        (inner_path->parallel_safe && outer_path->parallel_safe);
    ...
}

> Do you have a self-contained test case that reproduces this, or any
> insight as to how it's happening here?

This is TPC-H benchmark case; we can set it up like this:
1. git clone https://tkejser@bitbucket.org/tkejser/tpch-dbgen.git
2. compile using make
3. ./dbgen -v -s 5
4. ./qgen
On Fri, Dec 18, 2015 at 3:54 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Fri, Dec 18, 2015 at 7.59 AM Robert Haas <robertmhaas@gmail.com> Wrote,
>> Uh oh. That's not supposed to happen. A GatherPath is supposed to
>> have parallel_safe = false, which should prevent the planner from
>> using it to form new partial paths. Is this with the latest version
>> of the patch? The plan output suggests that we're somehow reaching
>> try_partial_hashjoin_path() with inner_path being a GatherPath, but I
>> don't immediately see how that's possible, because
>> create_gather_path() sets parallel_safe to false unconditionally, and
>> hash_inner_and_outer() never sets cheapest_safe_inner to a path unless
>> that path is parallel_safe.
>
> Yes, you are right, that create_gather_path() sets parallel_safe to false
> unconditionally but whenever we are building a non partial path, that time
> we should carry forward the parallel_safe state to its parent, and it seems
> like that part is missing here..

Ah, right. Woops. I can't exactly replicate your results, but I've attempted to fix this in a systematic way in the new version attached here (parallel-join-v3.patch).

>> Do you have a self-contained test case that reproduces this, or any
>> insight as to how it's happening here?
>
> This is TPC-H benchmark case:
> we can setup like this..
> 1. git clone https://tkejser@bitbucket.org/tkejser/tpch-dbgen.git
> 2. compile using make
> 3. ./dbgen -v -s 5
> 4. ./qgen

Thanks. After a bit of fiddling I was able to get this to work. I'm attaching two other patches that seem to help this case quite considerably. The first (parallel-reader-order-v1) causes Gather to read from the same worker repeatedly until it can't get another tuple from that worker without blocking, and only then move on to the next worker. With 4 workers, this seems to be drastically more efficient than what's currently in master - I saw the time for Q5 drop from over 17 seconds to about 6 (this was an assert-enabled build running with EXPLAIN ANALYZE, though, so take those numbers with a grain of salt).

The second (gather-disuse-physical-tlist.patch) causes Gather to force underlying scan nodes to project, which is a good idea here for reasons very similar to why it's a good idea for the existing node types that use disuse_physical_tlist: forcing extra data through the Gather node is bad. That shaved another half second off this query.

The exact query I was using for testing was:

explain (analyze, verbose) select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue from customer, orders, lineitem, supplier, nation, region where c_custkey = o_custkey and l_orderkey = o_orderkey and l_suppkey = s_suppkey and c_nationkey = s_nationkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'EUROPE' and o_orderdate >= date '1995-01-01' and o_orderdate < date '1995-01-01' + interval '1' year group by n_name order by revenue desc;

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
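The reader-order change can be sketched in standalone C; the queues and functions below are illustrative stand-ins rather than the real shm_mq/tuple-queue API:

#include <stdbool.h>
#include <stdio.h>

#define NWORKERS 3

/* Stand-in for a per-worker tuple queue: a countdown of buffered tuples. */
static int  queued[NWORKERS] = {4, 2, 3};

/* Returns true and "reads a tuple" if one is available without blocking. */
static bool
try_read(int worker, int *tuple)
{
    if (queued[worker] == 0)
        return false;           /* would block: nothing buffered right now */
    queued[worker]--;
    *tuple = worker;            /* dummy payload */
    return true;
}

int
main(void)
{
    int         current = 0;
    int         remaining = 4 + 2 + 3;
    int         tuple;

    while (remaining > 0)
    {
        if (try_read(current, &tuple))
        {
            /* Stick with this worker while it keeps producing. */
            printf("tuple from worker %d\n", tuple);
            remaining--;
        }
        else
        {
            /* Only move on once the current worker has nothing ready. */
            current = (current + 1) % NWORKERS;
        }
    }
    return 0;
}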
On Fri, Dec 18, 2015 at 8:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Ah, right. Woops. I can't exactly replicate your results, but I've
> attempted to fix this in a systematic way in the new version attached
> here (parallel-join-v3.patch).

I have tested with the latest patch, and the problem is solved.

During my testing I observed one more behaviour in the hash join, where parallel hash join is taking more time compared to normal hash join. The test setup:

create table t2 (c1 int, c2 int, c3 text);
insert into t1 values(generate_series(1,10000000), generate_series(1,10000000), repeat('x', 1000));
insert into t2 values(generate_series(1,3000000), generate_series(1,3000000), repeat('x', 5));
On Tue, Dec 22, 2015 at 4:14 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Fri, Dec 18, 2015 at 8:47 PM Robert Wrote,
>>> Yes, you are right, that create_gather_path() sets parallel_safe to false
>>> unconditionally but whenever we are building a non partial path, that
>>> time we should carry forward the parallel_safe state to its parent, and
>>> it seems like that part is missing here..
>
>> Ah, right. Woops. I can't exactly replicate your results, but I've
>> attempted to fix this in a systematic way in the new version attached
>> here (parallel-join-v3.patch).
>
> I Have tested with the latest patch, problem is solved..
>
> During my testing i observed one more behaviour in the hash join, where
> Parallel hash join is taking more time compared to Normal hash join,

I think the gather-reader-order patch will fix this. Here's a test with all three patches.

rhaas=# SET max_parallel_degree=0;SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.192 ms
  count
---------
 2999950
(1 row)
Time: 11331.425 ms

rhaas=# SET max_parallel_degree=1;SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.185 ms
  count
---------
 2999950
(1 row)
Time: 8796.190 ms

rhaas=# SET max_parallel_degree=2;SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.192 ms
  count
---------
 2999950
(1 row)
Time: 8153.258 ms

rhaas=# SET max_parallel_degree=3;SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.187 ms
  count
---------
 2999950
(1 row)
Time: 6198.163 ms

rhaas=# SET max_parallel_degree=4;SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.190 ms
  count
---------
 2999950
(1 row)
Time: 7511.970 ms

rhaas=# SET max_parallel_degree=5;SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.152 ms
  count
---------
 2999950
(1 row)
Time: 7651.862 ms

rhaas=# SET max_parallel_degree=6;SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
SET
Time: 0.195 ms
  count
---------
 2999950
(1 row)
Time: 7584.073 ms

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On Tue, Dec 22, 2015 at 4:14 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> During my testing i observed one more behaviour in the hash join, where
>> Parallel hash join is taking more time compared to Normal hash join,
>
> I think the gather-reader-order patch will fix this. Here's a test
> with all three patches.

Yeah right, after applying all three patches this problem is fixed; now parallel hash join is faster than normal hash join.

I have tested one more case which Amit mentioned. I can see in that case the parallel plan (parallel degree >= 3) is still slow: in the normal case it selects "Hash Join", but with parallel degree >= 3 it selects a parallel "Nest Loop Join", which makes it costlier.
postgres=# set max_parallel_degree=0;
SET
postgres=# Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND 200;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=223208.93..223208.94 rows=1 width=0) (actual time=2148.840..2148.841 rows=1 loops=1)
-> Hash Join (cost=204052.91..223208.92 rows=1 width=0) (actual time=1925.309..2148.812 rows=101 loops=1)
Hash Cond: (t2.c1 = t1.c1)
-> Seq Scan on t2 (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.025..104.028 rows=1000000 loops=1)
-> Hash (cost=204052.90..204052.90 rows=1 width=4) (actual time=1925.219..1925.219 rows=101 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Seq Scan on t1 (cost=0.00..204052.90 rows=1 width=4) (actual time=0.029..1925.196 rows=101 loops=1)
Filter: ((c1 >= 100) AND (c1 <= 200))
Rows Removed by Filter: 9999899
Planning time: 0.470 ms
Execution time: 2148.928 ms
(11 rows)
postgres=# set max_parallel_degree=3;
SET
postgres=# Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND 200;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=78278.36..78278.37 rows=1 width=0) (actual time=19944.113..19944.113 rows=1 loops=1)
-> Gather (cost=1000.00..78278.36 rows=1 width=0) (actual time=0.682..19943.928 rows=101 loops=1)
Number of Workers: 3
-> Nested Loop (cost=0.00..77278.26 rows=1 width=0) (actual time=690.633..6556.201 rows=25 loops=4)
Join Filter: (t1.c1 = t2.c1)
Rows Removed by Join Filter: 25249975
-> Parallel Seq Scan on t1 (cost=0.00..58300.83 rows=0 width=4) (actual time=619.198..619.262 rows=25 loops=4)
Filter: ((c1 >= 100) AND (c1 <= 200))
Rows Removed by Filter: 2499975
-> Seq Scan on t2 (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.008..105.757 rows=1000000 loops=101)
Planning time: 0.206 ms
Execution time: 19944.748 ms
postgres=# set max_parallel_degree=1;
SET
postgres=# Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND 200;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=156191.39..156191.40 rows=1 width=0) (actual time=1336.401..1336.401 rows=1 loops=1)
-> Hash Join (cost=137035.38..156191.39 rows=1 width=0) (actual time=1110.562..1336.386 rows=101 loops=1)
Hash Cond: (t2.c1 = t1.c1)
-> Seq Scan on t2 (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.025..101.659 rows=1000000 loops=1)
-> Hash (cost=137035.37..137035.37 rows=1 width=4) (actual time=1110.486..1110.486 rows=101 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Gather (cost=1000.00..137035.37 rows=1 width=4) (actual time=0.493..1110.445 rows=101 loops=1)
Number of Workers: 1
-> Parallel Seq Scan on t1 (cost=0.00..136035.27 rows=1 width=4) (actual time=553.212..1107.992 rows=50 loops=2)
Filter: ((c1 >= 100) AND (c1 <= 200))
Rows Removed by Filter: 4999950
Planning time: 0.211 ms
Execution time: 1336.618 ms
(13 rows)
postgres=# set max_parallel_degree=2;
SET
postgres=# Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND 200;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=101777.29..101777.29 rows=1 width=0) (actual time=1014.506..1014.507 rows=1 loops=1)
-> Hash Join (cost=82621.27..101777.28 rows=1 width=0) (actual time=796.628..1014.493 rows=101 loops=1)
Hash Cond: (t2.c1 = t1.c1)
-> Seq Scan on t2 (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.023..99.313 rows=1000000 loops=1)
-> Hash (cost=82621.26..82621.26 rows=1 width=4) (actual time=796.552..796.552 rows=101 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Gather (cost=1000.00..82621.26 rows=1 width=4) (actual time=0.435..796.499 rows=101 loops=1)
Number of Workers: 2
-> Parallel Seq Scan on t1 (cost=0.00..81621.16 rows=0 width=4) (actual time=528.052..793.243 rows=34 loops=3)
Filter: ((c1 >= 100) AND (c1 <= 200))
Rows Removed by Filter: 3333300
Planning time: 0.200 ms
Execution time: 1014.672 ms
(13 rows)
On Wed, Dec 23, 2015 at 2:34 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> I think the gather-reader-order patch will fix this. Here's a test
>> with all three patches.
>
> Yeah right, After applying all three patches this problem is fixed, now
> parallel hash join is faster than normal hash join.

Thanks. I've committed the two smaller patches; it seems fairly clear
that those are good changes independent of the parallel join stuff.

> I have tested one more case which Amit mentioned, I can see in that case
> parallel plan (parallel degree>= 3) is still slow, In Normal case it selects
> "Hash Join" but in case of parallel worker > 3 it selects Parallel "Nest
> Loop Join" which is making it costlier.

Hmm, I'm not sure why that is happening. I'll poke at it a bit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Dec 23, 2015 at 2:34 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Yeah right, After applying all three patches this problem is fixed, now
> parallel hash join is faster than normal hash join.
>
> I have tested one more case which Amit mentioned, I can see in that case
> parallel plan (parallel degree>= 3) is still slow, In Normal case it selects
> "Hash Join" but in case of parallel worker > 3 it selects Parallel "Nest
> Loop Join" which is making it costlier.

While investigating this problem, I discovered that I can produce a
regression even on unpatched master:

rhaas=# set max_parallel_degree = 0;
SET
rhaas=# explain select sum(1) from t1;
                             QUERY PLAN
---------------------------------------------------------------------
 Aggregate (cost=1553572.00..1553572.01 rows=1 width=0)
   -> Seq Scan on t1 (cost=0.00..1528572.00 rows=10000000 width=0)
(2 rows)

rhaas=# set max_parallel_degree = 3;
SET
rhaas=# explain select sum(1) from t1;
                                     QUERY PLAN
-----------------------------------------------------------------------------------
 Aggregate (cost=1462734.86..1462734.87 rows=1 width=0)
   -> Gather (cost=1000.00..1437734.86 rows=10000000 width=0)
         Number of Workers: 3
         -> Parallel Seq Scan on t1 (cost=0.00..436734.86 rows=10000000 width=0)
(4 rows)

The problem here is that the planner imagines that the sequential scan
is going to parallelize perfectly, which is not the case. A Gather node
is ten times as expensive per tuple as a sequential scan, but a Gather
node doesn't pay any per-page cost, and under the existing costing the
parallel sequential scan's per-page cost gets split across the workers
along with everything else, so if you crank the number of workers up
high enough, the cost per tuple appears to drop until it eventually
gets low enough that paying the cost of a Gather node looks worthwhile.
I tweaked cost_seqscan() so that it spreads out the CPU cost among all
of the workers but assumes the disk cost has to be paid regardless, and
that fixes this problem. It doesn't fix your example, though.
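As a back-of-the-envelope check, the estimates above can be reproduced with plain SQL arithmetic (assumptions: the default parallel_tuple_cost of 0.1 per tuple charged by Gather, its default 1000-unit startup cost, and an effective parallel divisor of about 3.5 for 3 workers, inferred from 1528572.00 / 436734.86 in the plans above):

SELECT round(1528572.00 / 3.5, 2)                         AS parallel_seqscan_cost,
       round(1000 + 1528572.00 / 3.5 + 0.1 * 10000000, 2) AS gather_total_cost;
-- parallel_seqscan_cost = 436734.86 and gather_total_cost = 1437734.86,
-- matching the plan: the entire scan cost, per-page I/O included, is being
-- divided by the worker count, so adding workers makes the scan look
-- arbitrarily cheap even though Gather charges ten times cpu_tuple_cost
-- for every tuple it forwards.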
Even with the costing changes mentioned above, the planner still thinks
a nested loop over two seqscans has something to recommend it:

rhaas=# Explain (Analyze, verbose) SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND 200;
                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=161755.97..161755.98 rows=1 width=0) (actual time=41164.506..41164.507 rows=1 loops=1)
   Output: count(*)
   -> Gather (cost=1000.00..161755.97 rows=1 width=0) (actual time=0.436..41164.388 rows=101 loops=1)
         Number of Workers: 3
         -> Nested Loop (cost=0.00..160755.87 rows=1 width=0) (actual time=329.227..12123.414 rows=25 loops=4)
               Join Filter: (t1.c1 = t2.c1)
               Rows Removed by Join Filter: 75749975
               Worker 0: actual time=439.924..439.924 rows=0 loops=1
               Worker 1: actual time=440.776..440.776 rows=0 loops=1
               Worker 2: actual time=436.100..6449.041 rows=15 loops=1
               -> Parallel Seq Scan on public.t1 (cost=0.00..102442.10 rows=0 width=4) (actual time=220.185..220.228 rows=25 loops=4)
                     Output: t1.c1, t1.c2
                     Filter: ((t1.c1 >= 100) AND (t1.c1 <= 200))
                     Rows Removed by Filter: 2499975
                     Worker 0: actual time=439.922..439.922 rows=0 loops=1
                     Worker 1: actual time=440.773..440.773 rows=0 loops=1
                     Worker 2: actual time=0.016..0.055 rows=15 loops=1
               -> Seq Scan on public.t2 (cost=0.00..46217.00 rows=3000000 width=4) (actual time=0.007..235.143 rows=3000000 loops=101)
                     Output: t2.c1, t2.c2
                     Worker 2: actual time=0.012..215.711 rows=3000000 loops=15
 Planning time: 0.150 ms
 Execution time: 41164.597 ms

But this is not entirely the fault of the parallel query code. If you
force a seqscan-over-seqscan plan in the non-parallel case, it
estimates the join cost as 287772.00, only slightly more than the
261522.02 cost units it thinks a non-parallel hash join will cost. In
fact, however, the non-parallel hash join runs in 1.2 seconds and the
non-parallel nested loop takes 46 seconds. So the first problem here is
that a plan that the query planner thinks is only 10% more expensive
actually runs for almost 40 times longer. If the planner had accurately
estimated the real cost of the nested loop, this plan wouldn't have
been chosen.
If you set enable_nestloop=false, then you get this plan:

rhaas=# set enable_nestloop=false;
SET
rhaas=# set max_parallel_degree=3;
SET
rhaas=# Explain (Analyze, verbose) SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND 200;
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=160909.22..160909.23 rows=1 width=0) (actual time=647.010..647.011 rows=1 loops=1)
   Output: count(*)
   -> Hash Join (cost=103442.21..160909.22 rows=1 width=0) (actual time=234.397..646.985 rows=101 loops=1)
         Hash Cond: (t2.c1 = t1.c1)
         -> Seq Scan on public.t2 (cost=0.00..46217.00 rows=3000000 width=4) (actual time=0.033..197.595 rows=3000000 loops=1)
               Output: t2.c1, t2.c2
         -> Hash (cost=103442.20..103442.20 rows=1 width=4) (actual time=234.235..234.235 rows=101 loops=1)
               Output: t1.c1
               Buckets: 1024 Batches: 1 Memory Usage: 12kB
               -> Gather (cost=1000.00..103442.20 rows=1 width=4) (actual time=0.289..234.199 rows=101 loops=1)
                     Output: t1.c1
                     Number of Workers: 3
                     -> Parallel Seq Scan on public.t1 (cost=0.00..102442.10 rows=0 width=4) (actual time=171.667..230.080 rows=25 loops=4)
                           Output: t1.c1
                           Filter: ((t1.c1 >= 100) AND (t1.c1 <= 200))
                           Rows Removed by Filter: 2499975
                           Worker 0: actual time=228.628..228.628 rows=0 loops=1
                           Worker 1: actual time=228.432..228.432 rows=0 loops=1
                           Worker 2: actual time=229.566..229.566 rows=0 loops=1
 Planning time: 0.160 ms
 Execution time: 647.133 ms
(21 rows)

And that's a good plan. The parallel nested loop case also suffers from
the fact that workers 0 and 1 don't happen to find any of the
interesting rows in t1 at all, and worker 2 only finds 15 of them. The
leader finds the other 85 and thus has to run most of the iterations of
the scan on t2 itself. If the work were divided equally, the parallel
nested loop would probably run significantly faster, although it would
still be ten times slower than the non-parallel hash join.

In the long term, I think the way to fix the uneven work distribution
that happens here is to construct the hash table in parallel, as
already discussed with Simon upthread. Then we could have a Gather node
on top of a Hash Join both inputs to which are Parallel Seq Scans, and
now there's basically no risk of a skewed work distribution. While that
would be nice to have, I think the big thing to focus on here is how
inaccurate the nested loop costing is - as already mentioned, it thinks
the non-parallel nested loop is 10% slower than the hash join when it's
really forty times slower. The main reason for that is that
((t1.c1 >= 100) AND (t1.c1 <= 200)) actually matches 100 rows, but the
planner expects it to match just one. In a real table, there would
probably be a unique index on t1 (c1), and that also fixes the problem.
If I add that, the non-parallel query runs in 422 ms (with EXPLAIN
ANALYZE, on a debug build) and the parallel query runs in 125 ms, and
the row count estimates are correct, too. Even if I disable actually
using the index, the fact that it fixes the cardinality estimates
causes the query to choose a good (parallel!) plan.

Updated patch attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
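A minimal sketch of the experiment described in the last paragraph, assuming the same t1/t2 schema used throughout this thread (the index name is illustrative, and turning off the index-scan GUCs is only one way to approximate "disable actually using the index"):

CREATE UNIQUE INDEX t1_c1_uidx ON t1 (c1);  -- tells the planner c1 has no duplicates
ANALYZE t1;
-- Check that the better plan comes from the improved row estimate alone,
-- not from the index being scanned:
SET enable_indexscan = off;
SET enable_indexonlyscan = off;
SET enable_bitmapscan = off;
EXPLAIN (ANALYZE)
SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND 200;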
create table t2 (c1 int, c2 int, c3 text);
insert into t1 values(generate_series(1,100000000), generate_series(1,100000000), repeat('x', 100));
insert into t2 values(generate_series(1,48000000), generate_series(1,48000000), repeat('x', 5));
analyze t1;
analyze t2;
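Before looking at the timings, a quick size check with standard catalog functions (not part of the original post) helps confirm that the inner table really does exceed the 1GB of RAM on the first VM:

SELECT pg_size_pretty(pg_table_size('t1')) AS t1_size,
       pg_size_pretty(pg_table_size('t2')) AS t2_size;
-- t2 holds 48 million rows of two integers plus a short text value, so it
-- should come out on the order of a couple of GB; t1, with a 100-character
-- text column repeated 100 million times, is far larger still.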
Test with: 1GB RAM
-----------------------------
postgres=# set max_parallel_degree=0;
SET
postgres=# explain analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=12248485.55..12248485.56 rows=1 width=0) (actual time=147490.455..147490.455 rows=1 loops=1)
-> Hash Join (cost=1526963.25..12208485.47 rows=16000033 width=0) (actual time=26652.871..143368.989 rows=47999950 loops=1)
Hash Cond: (t1.c1 = t2.c1)
Join Filter: ((t2.c2 + t1.c1) > 100)
Rows Removed by Join Filter: 50
-> Seq Scan on t1 (cost=0.00..2742412.72 rows=100005072 width=4) (actual time=130.580..40127.004 rows=100000000 loops=1)
-> Hash (cost=739461.00..739461.00 rows=48000100 width=8) (actual time=26500.439..26500.439 rows=48000000 loops=1)
Buckets: 131072 Batches: 1024 Memory Usage: 2856kB
-> Seq Scan on t2 (cost=0.00..739461.00 rows=48000100 width=8) (actual time=0.039..11402.343 rows=48000000 loops=1)
Planning time: 0.410 ms
Execution time: 147490.553 ms
(11 rows)
postgres=# set max_parallel_degree=6;
SET
postgres=# explain analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=4969933.98..4969933.99 rows=1 width=0) (actual time=386024.487..386024.488 rows=1 loops=1)
-> Gather (cost=1527963.25..4929933.89 rows=16000033 width=0) (actual time=199190.138..379487.861 rows=47999950 loops=1)
Number of Workers: 6
-> Hash Join (cost=1526963.25..3328930.59 rows=16000033 width=0) (actual time=178885.161..320724.381 rows=6857136 loops=7)
Hash Cond: (t1.c1 = t2.c1)
Join Filter: ((t2.c2 + t1.c1) > 100)
Rows Removed by Join Filter: 7
-> Parallel Seq Scan on t1 (cost=0.00..421909.65 rows=15385396 width=4) (actual time=106.403..11735.643 rows=14285714 loops=7)
-> Hash (cost=739461.00..739461.00 rows=48000100 width=8) (actual time=177959.433..177959.433 rows=48000000 loops=7)
Buckets: 131072 Batches: 1024 Memory Usage: 2856kB
-> Seq Scan on t2 (cost=0.00..739461.00 rows=48000100 width=8) (actual time=0.022..20778.693 rows=48000000 loops=7)
Planning time: 0.372 ms
Execution time: 386025.056 ms
Test with 8GB RAM:
---------------------------
postgres=# set max_parallel_degree=0;
SET
postgres=# explain analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=12229853.83..12229853.84 rows=1 width=0) (actual time=111113.286..111113.286 rows=1 loops=1)
-> Hash Join (cost=1526963.25..12189853.75 rows=16000033 width=0) (actual time=15830.319..108557.658 rows=47999950 loops=1)
Hash Cond: (t1.c1 = t2.c1)
Join Filter: ((t2.c2 + t1.c1) > 100)
Rows Removed by Join Filter: 50
-> Seq Scan on t1 (cost=0.00..2724138.00 rows=100000000 width=4) (actual time=3.515..43207.798 rows=100000000 loops=1)
-> Hash (cost=739461.00..739461.00 rows=48000100 width=8) (actual time=15436.088..15436.088 rows=48000000 loops=1)
Buckets: 131072 Batches: 1024 Memory Usage: 2856kB
-> Seq Scan on t2 (cost=0.00..739461.00 rows=48000100 width=8) (actual time=0.677..6290.310 rows=48000000 loops=1)
Planning time: 0.287 ms
Execution time: 111113.358 ms
(11 rows)
postgres=# set max_parallel_degree=6;
SET
postgres=# explain analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=6538149.22..6538149.23 rows=1 width=0) (actual time=172636.184..172636.184 rows=1 loops=1)
-> Gather (cost=1527963.25..6498149.14 rows=16000033 width=0) (actual time=40952.576..168973.552 rows=47999950 loops=1)
Number of Workers: 6
-> Hash Join (cost=1526963.25..4897145.84 rows=16000033 width=0) (actual time=41109.818..151129.893 rows=6857136 loops=7)
Hash Cond: (t1.c1 = t2.c1)
Join Filter: ((t2.c2 + t1.c1) > 100)
Rows Removed by Join Filter: 7
-> Parallel Seq Scan on t1 (cost=0.00..1890804.67 rows=16666667 width=4) (actual time=0.492..86241.998 rows=14285714 loops=7)
-> Hash (cost=739461.00..739461.00 rows=48000100 width=8) (actual time=40936.920..40936.920 rows=48000000 loops=7)
Buckets: 131072 Batches: 1024 Memory Usage: 2856kB
-> Seq Scan on t2 (cost=0.00..739461.00 rows=48000100 width=8) (actual time=0.024..22644.484 rows=48000000 loops=7)
Planning time: 2.668 ms
Execution time: 172636.647 ms
(13 rows)
On Mon, Jan 4, 2016 at 4:50 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I tried to create a inner table such that, inner table data don't fit in RAM
> (I created VM with 1GB Ram).
> Purpose of this is to make Disk scan dominant,
> and since parallel join is repeating the Disk Scan and hash table building
> of inner table, so there will be lot of Parallel I/O and it has to pay
> penalty.
>
> I think even though, inner table scanning and hash table building is
> parallel, but there will be lot of parallel I/O which will become
> bottleneck.

Hmm. Because only 1/1024th of the hash table fits in work_mem, the
executor is going to have to write out all of the tuples that don't
belong to the first batch to a temporary file and then read them back
in. So each backend is going to write essentially the entirety of t2
out to disk and then read it all back in again. The non-parallel case
will also write most of the table contents and then read them back in,
but at least it will only be doing that once rather than 7 times, so
it's not as bad. Also, with fewer backends running, the non-parallel
case will have a bit more memory free for caching purposes.

> Do we need to consider the cost for parallel i/o also, i am not sure can we
> really do that... ?

It seems to me that the problem here is that you've set
max_parallel_degree to an unrealistically high value. The query planner
is entitled to assume that the user has set max_parallel_degree to a
value which is small enough that the workers won't be fighting too
viciously with each other over resources. It doesn't really matter
whether those resources are CPU resources or I/O resources. I'm
wondering if your 1GB VM really even has as many as 7 vCPUs, because
that would seem to be something of an unusual configuration - and if it
doesn't, then setting max_parallel_degree to a value that high is
certainly user error. Even if it does, it's still not right to set the
value as high as six unless the system also has enough I/O bandwidth to
accommodate the amount of I/O that you expect your queries to generate,
and here it seems like it probably doesn't.

To put that another way, you can always make parallel query perform
badly by telling it to use too many workers relative to the size of the
machine you have. This is no different than getting bad query plans by
configuring work_mem or effective_cache_size or any other query planner
GUC to a value that doesn't reflect the actual execution environment. I
would only consider this to be a problem with the parallel join patch
if the chosen plan is slower even on a machine that's big enough to
justify setting max_parallel_degree=6 in the first place.

While studying this example, I thought about whether we could fix this
case by generating a partial hash join path only if we expect a single
batch, which would then cause the query planner to plan this query some
other way. But after some thought I don't think that's the right
approach. Multi-batch hash joins are in general quite a lot slower than
single-batch hash joins - and initial_cost_hashjoin knows that - but if
the system has adequate I/O bandwidth, that problem shouldn't be any
worse for a parallel hash join than it is for a non-parallel hash join.
I think the reason you're losing here is because the system either
doesn't have as many vCPUs as the number of worker processes you are
giving it, or it has a very limited amount of I/O bandwidth that can't
handle multiple processes doing sequential I/O at the same time - e.g.
a single spinning disk, or a single SSD plus a bunch of virtualization
overhead. But that need not be the case. On a system where temporary
files are written to a filesystem backed by an array of disks, you
might well get some I/O parallelism. Of course if we experiment and
find that doesn't work out well for some reason, then we've got a
problem, but it doesn't seem implausible that it might be just fine.

Another interesting question about this particular query is whether a
merge join would have been faster, especially given all Peter
Geoghegan's work to improve sort performance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
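To make the batching arithmetic concrete, and to try the merge-join question empirically, here is a rough sketch; the work_mem figure is only an extrapolation from the 2856kB-per-batch, 1024-batch numbers shown earlier, and the enable_* settings are standard planner GUCs used as a hypothetical experiment, not part of the patch:

-- Approximate work_mem a single-batch hash of t2 would need, extrapolated
-- from "Buckets: 131072 Batches: 1024 Memory Usage: 2856kB":
SELECT pg_size_pretty(2856::bigint * 1024 * 1024);  -- roughly 2.8 GB

-- Force a merge join for comparison with the plans above:
SET enable_hashjoin = off;
SET enable_nestloop = off;
EXPLAIN ANALYZE
SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;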
So this time I have configured 8 processors and measured performance again with a lower parallel degree.
8-processor VM (Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz) --> this machine is an i7, so I doubt it is really using 8 physical cores; hence I tested with lower parallel degrees.
postgres=# set max_parallel_degree=3;
postgres=# explain analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=7920946.47..7920946.48 rows=1 width=0) (actual time=162329.829..162329.829 rows=1 loops=1)
-> Gather (cost=1527963.25..7880946.39 rows=16000033 width=0) (actual time=58233.106..159140.629 rows=47999950 loops=1)
Number of Workers: 3
-> Hash Join (cost=1526963.25..6279943.09 rows=16000033 width=0) (actual time=58346.087..144309.987 rows=11999988 loops=4)
Hash Cond: (t1.c1 = t2.c1)
Join Filter: ((t2.c2 + t1.c1) > 100)
Rows Removed by Join Filter: 12
-> Parallel Seq Scan on t1 (cost=0.00..2064959.01 rows=32259701 width=4) (actual time=98.514..27003.566 rows=25000000 loops=4)
-> Hash (cost=739461.00..739461.00 rows=48000100 width=8) (actual time=58012.228..58012.228 rows=48000000 loops=4)
Buckets: 131072 Batches: 1024 Memory Usage: 2856kB
-> Seq Scan on t2 (cost=0.00..739461.00 rows=48000100 width=8) (actual time=3.524..9634.181 rows=48000000 loops=4)
Planning time: 1.945 ms
Execution time: 162330.657 ms
postgres=# set max_parallel_degree=2;
SET
postgres=# explain analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=8744354.81..8744354.82 rows=1 width=0) (actual time=133715.245..133715.245 rows=1 loops=1)
-> Gather (cost=1527963.25..8704354.73 rows=16000033 width=0) (actual time=49240.892..130699.685 rows=47999950 loops=1)
Number of Workers: 2
-> Hash Join (cost=1526963.25..7103351.43 rows=16000033 width=0) (actual time=48916.074..116934.088 rows=15999983 loops=3)
Hash Cond: (t1.c1 = t2.c1)
Join Filter: ((t2.c2 + t1.c1) > 100)
Rows Removed by Join Filter: 17
-> Parallel Seq Scan on t1 (cost=0.00..2159049.80 rows=41668780 width=4) (actual time=106.882..22650.646 rows=33333333 loops=3)
-> Hash (cost=739461.00..739461.00 rows=48000100 width=8) (actual time=48670.370..48670.370 rows=48000000 loops=3)
Buckets: 131072 Batches: 1024 Memory Usage: 2856kB
-> Seq Scan on t2 (cost=0.00..739461.00 rows=48000100 width=8) (actual time=0.618..7908.589 rows=48000000 loops=3)
Planning time: 0.380 ms
Execution time: 133715.932 ms
(13 rows)
postgres=# set max_parallel_degree=0;
SET
postgres=# explain analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t2.c2 + t1.c1 > 100;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=12248485.55..12248485.56 rows=1 width=0) (actual time=92297.234..92297.234 rows=1 loops=1)
-> Hash Join (cost=1526963.25..12208485.47 rows=16000033 width=0) (actual time=15739.911..89627.652 rows=47999950 loops=1)
Hash Cond: (t1.c1 = t2.c1)
Join Filter: ((t2.c2 + t1.c1) > 100)
Rows Removed by Join Filter: 50
-> Seq Scan on t1 (cost=0.00..2742412.72 rows=100005072 width=4) (actual time=127.260..24826.175 rows=100000000 loops=1)
-> Hash (cost=739461.00..739461.00 rows=48000100 width=8) (actual time=15560.002..15560.002 rows=48000000 loops=1)
Buckets: 131072 Batches: 1024 Memory Usage: 2856kB
-> Seq Scan on t2 (cost=0.00..739461.00 rows=48000100 width=8) (actual time=0.834..6199.727 rows=48000000 loops=1)
Planning time: 0.244 ms
Execution time: 92298.000 ms
(11 rows)
On Mon, Jan 4, 2016 at 8:52 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> One strange behaviour, after increasing number of processor for VM,
> max_parallel_degree=0; is also performing better.

So, you went from 6 vCPUs to 8? In general, adding more CPUs means that
there is less contention for CPU time, but if you already had 6 CPUs
and nothing else running, I don't know why the backend running the
query would have had a problem getting a whole CPU to itself. If you
previously only had 1 or 2 CPUs then there might have been some CPU
competition with background processes, but if you had 6 then I don't
know why the max_parallel_degree=0 case got faster with 8.

Anyway, I humbly suggest that this query isn't the right place to put
our attention. There's no reason why we can't improve things further in
the future, and if it turns out that lots of people have problems with
the cost estimates on multi-batch parallel hash joins, then we can
revise the cost model. We wouldn't treat a single query where a
non-parallel multi-batch hash join runs slower than the costing would
suggest as a reason to revise the cost model for that case, and I don't
think this patch should be held to a higher standard. In this
particular case, you can easily make the problem go away by tuning
configuration parameters, which seems like an acceptable answer for
people who run into this, unless it becomes clear that this particular
problem is widespread and can't be solved without configuration changes
that introduce other issues at the same time.

Keep in mind that, right now, the patch is doing just about the
simplest thing possible, and that's working pretty well. Anything we
change at this point is going to be in the direction of adding more
complexity than what I've got right now and more than we've got in the
corresponding non-parallel case. That's OK, but I think it's
appropriate that we only do that if we're pretty sure that those
changes are going to be an improvement. And I think, by and large, that
we don't have enough perspective on this to know that at this point.
Until this gets some wider testing, which probably isn't going to
happen very much until this gets committed, it's hard to say which
problems are just things we're artificially creating in the lab and
which ones are going to be annoyances in the real world.

Barring strenuous objections or discovery of more serious problems with
this than have turned up so far, I'm inclined to go ahead and commit it
fairly soon, so that it attracts some more eyeballs while there's still
a little time left in the development cycle to do something about
whatever the systematic problems turn out to be.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I have done further testing to observe the planning time, using TPC-H queries and some other many-table join queries (7-8 tables).
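A minimal sketch of how planning time can be isolated for such tests (an assumption about the methodology, since the actual queries are not shown; plain EXPLAIN reports "Planning time" without executing the query):

SET max_parallel_degree = 3;
EXPLAIN
SELECT count(*)
FROM lineitem l
     JOIN orders o ON l.l_orderkey = o.o_orderkey
     JOIN customer c ON o.o_custkey = c.c_custkey;
-- Only the "Planning time" line of the output matters for this comparison;
-- repeating the EXPLAIN at different max_parallel_degree settings shows how
-- much overhead the extra partial paths add to planning.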