Re: parallel joins, and better parallel explain - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: parallel joins, and better parallel explain |
Date | |
Msg-id | CA+TgmobcXyxs+_EmUG56Pt34cmpS1iWrhdBZ-4SUQ1aPuQsG-w@mail.gmail.com |
In response to | Re: parallel joins, and better parallel explain (Amit Kapila <amit.kapila16@gmail.com>) |
Responses | Re: parallel joins, and better parallel explain |
| Re: parallel joins, and better parallel explain |
| Re: parallel joins, and better parallel explain |
List | pgsql-hackers |
On Tue, Dec 1, 2015 at 7:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Above and changes in add_path() makes planner not to select parallel path
> for seq scan where earlier it was possible. I think you want to change the
> costing of parallel plans based on rows selected instead of total_cost,
> but there seems to be some problem in the logic (I think gather node is not
> taking into account the reduced cost).

Oops. The new version I've attached should fix this. The reason why I needed to make a change there is that previously the number of rows estimated for the Parallel Seq Scan was the total number of rows, not the number of rows per worker. That doesn't really matter when we're only doing a Parallel Seq Scan, but if you push a join below the Gather, then the cost of the join won't be computed correctly unless the row count is the number of rows per worker.

> - There seems to be some inconsistency in Explain's output when
> multiple workers are used.

What is going on here is a bit confusing, but in fact I believe it to be more correct than what we get with unpatched master. The problem has to do with the way that the instrumentation counts loops, and the way that EXPLAIN displays that information.

In unpatched master, InstrEndLoop() is not called before the worker instrumentation data is aggregated to the leader. Therefore, each node under the Gather ends up with a loop count of 1. Unless, of course, it was executed multiple times in one of the workers, for example because it was on the inside of a nested loop. In that case, it ends up with a loop count equal to the number of times it was executed *minus the number of workers*. Therefore, if there are 4 workers and a leader, and between those 5 processes they executed the inner side of a nested loop 1000 times, the final loop count is 996.

With the patch, the loop count is equal to the number of times that the nodes were actually executed. Typically, this ends up being one more than the number of workers, because the leader executes a node and so do all the workers, but it can end up being less if not all workers execute a particular node. Of course, it can also be more: if the node is executed repeatedly, the final loop count is equal to the total number of times that the node was executed across the leader and all the workers. So, in the above example, the inner side of the nested loop would show a loop count of 1000, not 996, which has the noteworthy advantage of being correct.

What makes the output a tad confusing is that some, but not all, fields in EXPLAIN output are shown as per-loop values. The startup cost, total cost, and row counts are divided by the number of iterations. I have always thought this was a terrible idea: when EXPLAIN tells me about a nested loop with an inner index scan, I want to know the TOTAL time spent on that index scan and the TOTAL number of rows returned, but what I get is the result of dividing those values by the number of loops and rounding off to a number of decimal places that almost entirely eliminates the possibility of extracting useful information from the results. However, I expect to be told that other people (especially Tom Lane) don't want to change this, and in any case, if we were going to change it, I think that would properly be a separate patch.

So the net result of this is that the times and row counts are *averages* across all of the loop iterations. In the case of the inner side of a nested loop, this means you can read the data just as you would in a non-parallel plan.
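To make that arithmetic concrete, here is a rough Python sketch of the relationships described above (purely illustrative; this is not the planner's or the instrumentation's actual code, and the total row count is made up):

    # Illustrative only: how EXPLAIN-style per-loop averages and the old vs.
    # new loop counts relate to the totals discussed above.
    workers = 4
    processes = workers + 1           # 4 workers plus the leader
    total_executions = 1000           # inner side of the nested loop, all processes
    total_rows = 50000                # hypothetical total rows returned by that node

    # Unpatched master: InstrEndLoop() is not called before the worker data is
    # aggregated, so the reported loop count comes out short by the number of workers.
    old_loop_count = total_executions - workers    # 996
    new_loop_count = total_executions              # 1000, i.e. actually correct

    # EXPLAIN divides row counts (and times) by the loop count, so what is
    # printed is a per-loop average, not a total.
    displayed_rows = total_rows / new_loop_count       # 50.0 rows per loop
    recovered_total = displayed_rows * new_loop_count  # multiply back to get 50000

    print(old_loop_count, new_loop_count, displayed_rows, recovered_total)

The point is simply that once the loop count is the real execution count, multiplying the displayed per-loop figures back by the loop count recovers the true totals.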
For nodes executed exactly once per worker plus once in the master, the value displayed ends up being a per-process average of the amount of time spent, and a per-process average of the number of rows. On the other hand, values for buffers are NOT divided by the loop count, so those values are absolute totals. Once you understand this, I think the data is pretty easy to read.

>    ->  Gather  (cost=1000.00..46203.83 rows=9579 width=0) (actual time=33.983..33592.030 rows=9999 loops=1)
>          Output: c1, c2
>          Number of Workers: 4
>          Buffers: shared hit=548 read=142506
>          ->  Parallel Seq Scan on public.tbl_parallel_test  (cost=0.00..44245.93 rows=2129 width=0) (actual time=13.447..33354.099 rows=2000 loops=5)
>                Output: c1, c2
>                Filter: (tbl_parallel_test.c1 < 10000)
>                Rows Removed by Filter: 198000
>                Buffers: shared hit=352 read=142506
>                Worker 0: actual time=18.422..33322.132 rows=2170 loops=1
>                  Buffers: shared hit=4 read=30765
>                Worker 1: actual time=0.803..33283.979 rows=1890 loops=1
>                  Buffers: shared hit=1 read=26679
>                Worker 2: actual time=0.711..33360.007 rows=1946 loops=1
>                  Buffers: shared hit=197 read=30899
>                Worker 3: actual time=15.057..33252.605 rows=2145 loops=1
>                  Buffers: shared hit=145 read=25433
>  Planning time: 0.217 ms
>  Execution time: 33612.964 ms
> (22 rows)
>
> I am not able to understand how buffer usage add upto what is
> shown at Gather node.

It doesn't, of course. But I'm not sure it should, and I don't think this patch has changed anything about that either way. The patch only affects the nodes that run in the workers, and Gather doesn't.

> - I think it would be better if we add some explanation to Explain -
> Verbose section and an Example on the same page in documentation.
> This can help users to understand this feature.
>
> It would be better if we can split this patch into multiple patches like
> Explain related changes, Append pushdown related changes, Join
> Push down related changes. You can choose to push the patches as
> you prefer, but splitting can certainly help in review/verification of the
> code.

I don't think it really makes sense to split the append push-down changes from the join push-down changes; those share a great deal of code. But I've now split out the EXPLAIN changes. See attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company