Thread: Planner question

Planner question

From

Tom Raney

Date:

05 September 2008, 15:21:55

Why does the planner consider both input variations of each symmetric merge join?  The README says "there is not a lot
ofdifference" between the two options.  When are there any differences?
 

-Tom Raney

Re: Planner question

From

Jeff Davis

Date:

05 September 2008, 15:41:34

On Fri, 2008-09-05 at 11:21 -0700, Tom Raney wrote:
> Why does the planner consider both input variations of each symmetric merge join?  The README says "there is not a
lotof difference" between the two options.  When are there any differences?

> 
> -Tom Raney
> 

http://archives.postgresql.org/pgsql-general/2008-08/msg00967.php

My understanding from that thread is that if one table has high
ndistinct and the other has low ndistinct, one plan may require more
re-scanning than the other.

Regards,Jeff Davis

Re: Planner question

From

Tom Lane

Date:

05 September 2008, 15:57:33

Tom Raney <raneyt@cecs.pdx.edu> writes:
> Why does the planner consider both input variations of each symmetric merge join?  The README says "there is not a
lotof difference" between the two options.  When are there any differences?
 

The righthand side needs to support mark/restore, the left doesn't;
so depending on plan types one way might need a helper Materialize
node that the other way doesn't.  Also, duplicated values are a bit
cheaper to process on the left than the right.
        regards, tom lane

Re: Planner question

From

Tom Raney

Date:

10 September 2008, 15:21:59

Tom Lane wrote:
> Tom Raney <raneyt@cecs.pdx.edu> writes:
>> Why does the planner consider both input variations of each symmetric merge join?  The README says "there is not a
lotof difference" between the two options.  When are there any differences?
 
> 
> The righthand side needs to support mark/restore, the left doesn't;
> so depending on plan types one way might need a helper Materialize
> node that the other way doesn't.  Also, duplicated values are a bit
> cheaper to process on the left than the right.
> 
>             regards, tom lane
> 

Thank you for the explanation.

On a somewhat related issue, I am a bit stumped on the way path keys 
function.

In the following query and debug data, why would an index scan on a 
single relation contain a path key from a different relation?

optimizer/README says, "The PathKeys data structure represents what is 
known about the sort order of the tuples generated by a particular 
Path.  A path's pathkeys field is a list of PathKey nodes, where the 
n'th item represents the n'th sort key of the result."

Why does the index scan for tenk1 include a path key from 
onek.unique2?  Is it implying an equivalence there?

-Tom Raney


bench=# explain select * from tenk1 JOIN onek ON 
tenk1.unique2=onek.unique2;

RELOPTINFO (tenk1): rows=10000 width=244        path list:        SeqScan(tenk1) rows=10000 cost=0.00..434.00
IdxScan(tenk1)rows=10000 cost=0.00..583.25          pathkeys: ((tenk1.unique2, onek.unique2))  <---
 
        cheapest startup path:        SeqScan(tenk1) rows=10000 cost=0.00..434.00
        cheapest total path:        SeqScan(tenk1) rows=10000 cost=0.00..434.00

RELOPTINFO (onek): rows=1000 width=244        path list:        SeqScan(onek) rows=1000 cost=0.00..44.00
IdxScan(onek)rows=1000 cost=0.00..72.25          pathkeys: ((tenk1.unique2, onek.unique2))
 
        cheapest startup path:        SeqScan(onek) rows=1000 cost=0.00..44.00
        cheapest total path:        SeqScan(onek) rows=1000 cost=0.00..44.00

RELOPTINFO (tenk1/onek): rows=1000 width=488        path list:        MergeJoin(tenk1/onek) rows=1000 cost=0.52..144.24
        clauses: tenk1.unique2 = onek.unique2                IdxScan(tenk1) rows=10000 cost=0.00..583.25
 pathkeys: ((tenk1.unique2, onek.unique2))                IdxScan(onek) rows=1000 cost=0.00..72.25
pathkeys:((tenk1.unique2, onek.unique2))        NestLoop(tenk1/onek) rows=1000 cost=0.00..1756.96          clauses:
tenk1.unique2= onek.unique2                SeqScan(onek) rows=1000 cost=0.00..44.00                IdxScan(tenk1)
rows=10000cost=0.00..1.70
 
        cheapest startup path:        NestLoop(tenk1/onek) rows=1000 cost=0.00..1756.96          clauses: tenk1.unique2
=onek.unique2                SeqScan(onek) rows=1000 cost=0.00..44.00                IdxScan(tenk1) rows=10000
cost=0.00..1.70
        cheapest total path:        MergeJoin(tenk1/onek) rows=1000 cost=0.52..144.24          clauses: tenk1.unique2 =
onek.unique2               IdxScan(tenk1) rows=10000 cost=0.00..583.25                  pathkeys: ((tenk1.unique2,
onek.unique2))               IdxScan(onek) rows=1000 cost=0.00..72.25                  pathkeys: ((tenk1.unique2,
onek.unique2))
                                       QUERY PLAN
----------------------------------------------------------------------------------------- Merge Join
(cost=0.52..144.24rows=1000 width=488)   Merge Cond: (tenk1.unique2 = onek.unique2)   ->  Index Scan using
tenk1_unique2on tenk1  (cost=0.00..583.25
 
rows=10000 width=244)   ->  Index Scan using onek_unique2 on onek  (cost=0.00..72.25 rows=1000
width=244)
(4 rows)

bench=#

Re: Planner question

From

Tom Lane

Date:

10 September 2008, 15:42:55

Tom Raney <raneyt@cecs.pdx.edu> writes:
> Why does the index scan for tenk1 include a path key from 
> onek.unique2?  Is it implying an equivalence there?

> bench=# explain select * from tenk1 JOIN onek ON 
> tenk1.unique2=onek.unique2;

Yes, for an example like that the planner knows that tenk1.unique2 and
onek.unique2 will have equal values in any valid join row, so it's okay
to suppose that a sort by one is the same as a sort by the other.  So
the pathkey items actually reference sets of variables{tenk1.unique2, onek.unique2}
not just individual variables.

> RELOPTINFO (tenk1): rows=10000 width=244
>          path list:
>          SeqScan(tenk1) rows=10000 cost=0.00..434.00
>          IdxScan(tenk1) rows=10000 cost=0.00..583.25
>            pathkeys: ((tenk1.unique2, onek.unique2))  <---

>          cheapest startup path:
>          SeqScan(tenk1) rows=10000 cost=0.00..434.00

>          cheapest total path:
>          SeqScan(tenk1) rows=10000 cost=0.00..434.00

Hm, I don't recognize this output format, is it coming from some custom
code?
        regards, tom lane

Re: Planner question

From

Tom Raney

Date:

10 September 2008, 17:16:37

Tom Lane wrote:
> Tom Raney <raneyt@cecs.pdx.edu> writes:
>> Why does the index scan for tenk1 include a path key from 
>> onek.unique2?  Is it implying an equivalence there?
> 
>> bench=# explain select * from tenk1 JOIN onek ON 
>> tenk1.unique2=onek.unique2;
> 
> Yes, for an example like that the planner knows that tenk1.unique2 and
> onek.unique2 will have equal values in any valid join row, so it's okay
> to suppose that a sort by one is the same as a sort by the other.  So
> the pathkey items actually reference sets of variables
>     {tenk1.unique2, onek.unique2}
> not just individual variables.

Thanks.

> 
>> RELOPTINFO (tenk1): rows=10000 width=244
>>          path list:
>>          SeqScan(tenk1) rows=10000 cost=0.00..434.00
>>          IdxScan(tenk1) rows=10000 cost=0.00..583.25
>>            pathkeys: ((tenk1.unique2, onek.unique2))  <---
> 
>>          cheapest startup path:
>>          SeqScan(tenk1) rows=10000 cost=0.00..434.00
> 
>>          cheapest total path:
>>          SeqScan(tenk1) rows=10000 cost=0.00..434.00
> 
> Hm, I don't recognize this output format, is it coming from some custom
> code?

Yes, it is.  I thought it was easier to read the OPTIMIZER_DEBUG 
output with the relation names instead of the relation indexes.  I 
will post a patch against CVS HEAD if you think it will help others.

-Tom

Re: Planner question

From

Bruce Momjian

Date:

23 September 2008, 10:26:23

Tom Raney wrote:
> >> RELOPTINFO (tenk1): rows=10000 width=244
> >>          path list:
> >>          SeqScan(tenk1) rows=10000 cost=0.00..434.00
> >>          IdxScan(tenk1) rows=10000 cost=0.00..583.25
> >>            pathkeys: ((tenk1.unique2, onek.unique2))  <---
> > 
> >>          cheapest startup path:
> >>          SeqScan(tenk1) rows=10000 cost=0.00..434.00
> > 
> >>          cheapest total path:
> >>          SeqScan(tenk1) rows=10000 cost=0.00..434.00
> > 
> > Hm, I don't recognize this output format, is it coming from some custom
> > code?
> 
> Yes, it is.  I thought it was easier to read the OPTIMIZER_DEBUG 
> output with the relation names instead of the relation indexes.  I 
> will post a patch against CVS HEAD if you think it will help others.

Personally I would like to see optimizer debug become a configuration
parameter rather than a compile-time parameter.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +