Re: parallel joins, and better parallel explain - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: parallel joins, and better parallel explain |
Date | |
Msg-id | CA+TgmobcXyxs+_EmUG56Pt34cmpS1iWrhdBZ-4SUQ1aPuQsG-w@mail.gmail.com |
In response to | Re: parallel joins, and better parallel explain (Amit Kapila <amit.kapila16@gmail.com>) |
Responses | Re: parallel joins, and better parallel explain |
| Re: parallel joins, and better parallel explain |
| Re: parallel joins, and better parallel explain |
List | pgsql-hackers |
On Tue, Dec 1, 2015 at 7:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Above and changes in add_path() makes planner not to select parallel path
> for seq scan where earlier it was possible. I think you want to change the
> costing of parallel plans based on rows selected instead of total_cost,
> but there seems to be some problem in the logic (I think gather node is not
> taking into account the reduced cost).

Oops. The new version I've attached should fix this. The reason why I needed to make a change there is that previously the number of rows estimated for the Parallel Seq Scan was the total number of rows, not the number of rows per worker. That doesn't really matter when we're only doing a Parallel Seq Scan, but if you push a join below the Gather, then the cost of the join won't be computed correctly unless the row count is the number of rows per worker.

> - There seems to be some inconsistency in Explain's output when
> multiple workers are used.

What is going on here is a bit confusing, but in fact I believe it to be more correct than what we get with unpatched master. The problem has to do with the way that the instrumentation counts loops, and the way that EXPLAIN displays that information.

In unpatched master, InstrEndLoop() is not called before the worker instrumentation data is aggregated to the leader. Therefore, each node under the Gather ends up with a loop count of 1. Unless, of course, it was executed multiple times in one of the workers, for example because it was on the inside of a nested loop. In that case, it ends up with a loop count equal to the number of times it was executed *minus the number of workers*. Therefore, if there are 4 workers and a leader, and between those 5 processes they executed the inner side of a nested loop 1000 times, the final loop count is 996.

With the patch, the loop count is equal to the number of times that the nodes were actually executed. Typically, this ends up being one more than the number of workers, because the leader executes a node and so do all the workers, but it can end up being less if not all workers execute a particular node. Of course, it can also be more: if the node is executed repeatedly, the final loop count is equal to the total number of times that the node was executed across the leader and all the workers. So, in the above example, the inner side of the nested loop would show a loop count of 1000, not 996, which has the noteworthy advantage of being correct.

What makes the output a tad confusing is that some, but not all, fields in EXPLAIN output are shown as per-loop values. The startup cost, total cost, and row counts are divided by the number of iterations. I have always thought this was a terrible idea: when EXPLAIN tells me about a nested loop with an inner index scan, I want to know the TOTAL time spent on that index scan and the TOTAL number of rows returned, but what I get is the result of dividing those values by the number of loops and rounding off to a number of decimal places that almost entirely eliminates the possibility of extracting useful information from the results. However, I expect to be told that other people (especially Tom Lane) don't want to change this, and in any case, if we were going to change it, I think that would properly be a separate patch.

So the net result of this is that the times and row counts are *averages* across all of the loop iterations. In the case of the inner side of a nested loop, this means you can read the data just as you would in a non-parallel plan.
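To make that arithmetic concrete, here is a rough Python sketch of the relationships described above (purely illustrative; this is not the planner's or the instrumentation's actual code, and the total row count is made up):

    # Illustrative only: how EXPLAIN-style per-loop averages and the old vs.
    # new loop counts relate to the totals discussed above.
    workers = 4
    processes = workers + 1           # 4 workers plus the leader
    total_executions = 1000           # inner side of the nested loop, all processes
    total_rows = 50000                # hypothetical total rows returned by that node

    # Unpatched master: InstrEndLoop() is not called before the worker data is
    # aggregated, so the reported loop count comes out short by the number of workers.
    old_loop_count = total_executions - workers    # 996
    new_loop_count = total_executions              # 1000, i.e. actually correct

    # EXPLAIN divides row counts (and times) by the loop count, so what is
    # printed is a per-loop average, not a total.
    displayed_rows = total_rows / new_loop_count       # 50.0 rows per loop
    recovered_total = displayed_rows * new_loop_count  # multiply back to get 50000

    print(old_loop_count, new_loop_count, displayed_rows, recovered_total)

The point is simply that once the loop count is the real execution count, multiplying the displayed per-loop figures back by the loop count recovers the true totals.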
For nodes executed exactly once per worker plus once in the master, the value displayed ends up being a per-process average of the amount of time spent, and a per-process average of the number of rows. On the other hand, values for buffers are NOT divided by the loop count, so those values are absolute totals. Once you understand this, I think the data is pretty easy to read.

>    ->  Gather  (cost=1000.00..46203.83 rows=9579 width=0) (actual time=33.983..33592.030 rows=9999 loops=1)
>          Output: c1, c2
>          Number of Workers: 4
>          Buffers: shared hit=548 read=142506
>          ->  Parallel Seq Scan on public.tbl_parallel_test  (cost=0.00..44245.93 rows=2129 width=0) (actual time=13.447..33354.099 rows=2000 loops=5)
>                Output: c1, c2
>                Filter: (tbl_parallel_test.c1 < 10000)
>                Rows Removed by Filter: 198000
>                Buffers: shared hit=352 read=142506
>                Worker 0: actual time=18.422..33322.132 rows=2170 loops=1
>                  Buffers: shared hit=4 read=30765
>                Worker 1: actual time=0.803..33283.979 rows=1890 loops=1
>                  Buffers: shared hit=1 read=26679
>                Worker 2: actual time=0.711..33360.007 rows=1946 loops=1
>                  Buffers: shared hit=197 read=30899
>                Worker 3: actual time=15.057..33252.605 rows=2145 loops=1
>                  Buffers: shared hit=145 read=25433
>  Planning time: 0.217 ms
>  Execution time: 33612.964 ms
> (22 rows)
>
> I am not able to understand how buffer usage add upto what is
> shown at Gather node.

It doesn't, of course. But I'm not sure it should, and I don't think this patch has changed anything about that either way. The patch only affects the nodes that run in the workers, and Gather doesn't.

> - I think it would be better if we add some explanation to Explain -
> Verbose section and an Example on the same page in documentation.
> This can help users to understand this feature.
>
> It would be better if we can split this patch into multiple patches like
> Explain related changes, Append pushdown related changes, Join
> Push down related changes. You can choose to push the patches as
> you prefer, but splitting can certainly help in review/verification of the
> code.

I don't think it really makes sense to split the append push-down changes from the join push-down changes; those share a great deal of code. But I've now split out the EXPLAIN changes. See attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company