Home > mailing lists

Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours! - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!
Date	June 1, 2017 18:40:41
Msg-id	3185.1496331641@sss.pgh.pa.us Whole thread Raw
In response to	Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours! (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses	Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!
List	pgsql-hackers

Tree view

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Which brings me to the slightly suspicious bit. On 9.5, there's no 
> difference between GROUP and GROUP+LIKE cases - the estimates are 
> exactly the same in both cases. This is true too, but only without the 
> foreign key between "partsupp" and "part", i.e. the two non-grouped 
> relations in the join. And what's more, the difference (1737 vs. 16) is 
> pretty much exactly 100x, which is the estimate for the LIKE condition.
> So it kinda seems 9.5 does not apply this condition for semi-joins, 
> while >=9.6 does that.

I got around to looking into this finally.  I assume that when you say
you added a foreign key from partsupp to part, you also created a unique
index on part.p_partkey --- there is no such index in dbt3's version of
the schema, anyway, so I had to add one to create a foreign key.
The unique index itself, never mind the foreign key, changes things
substantially for this query in HEAD, as a result of commit 92a43e485:
the semijoin to part becomes a plain join, and so we go through entirely
different code paths to estimate selectivity.  But without that, there's
clearly something odd going on, because the LIKE condition ought to have
some effect on the estimate of number of matched rows.

I poked into it and found that the problem stems from the fact that the
initial estimate of the top join relation's size is estimated using
agg_lineitem.agg_partkey = part.p_partkey as the join condition, not
the original partsupp.ps_partkey = part.p_partkey qual.  We can get
the former from the latter via equivalence-clause deductions, and
I think it's just bad luck that the join is formed first in that
direction.  What that means is that eqjoinsel_semi() finds no statistics
for the LHS variable, although it does have stats for the RHS.  There
is logic in eqjoinsel_semi() that attempts to reduce the semijoin
selectivity estimate proportionally to whatever restriction clauses were
applied to the RHS ... but if you look closely, that code has no effect
if we lack stats for the LHS.  We'll fall past both the MCV-dependent
calculation and the nd1-vs-nd2 test and end up taking the selectivity
as just 0.5, independently of anything else.

It seems reasonable to me that we ought to derate that default estimate
by whatever we'd concluded the restriction selectivity on the inner rel
is.  So I experimented with the attached patch and verified that it
results in the LIKE affecting the final rowcount estimate in this
situation.  I've not tested it much further than that, other than checking
that we still pass regression tests.

I'm not sure if we ought to think about applying this now.  It will likely
make things worse not better for the Q20 query, because it will cause the
underestimate for the full query to be even further off.  Still, in
principle it seems like it ought to be an improvement in most cases.

The thing that would actually have a chance of improving matters for Q20
would be if we could see our way to looking through the aggregation
subquery and applying the foreign key constraint for lineitem.  That
seems like a research project though; it's surely not happening for v10.

I'm also wondering idly if we could fix things so that the join size
estimate gets formed from a join condition that we have stats for rather
than one we lack them for.  But I have no clear ideas on ways to go
about that that wouldn't be horrid kluges.

            regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 6e491bb..b875d5d 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*************** eqjoinsel_semi(Oid operator,
*** 2603,2610 ****
           *
           * Crude as the above is, it's completely useless if we don't have
           * reliable ndistinct values for both sides.  Hence, if either nd1 or
!          * nd2 is default, punt and assume half of the uncertain rows have
!          * join partners.
           */
          if (!isdefault1 && !isdefault2)
          {
--- 2603,2609 ----
           *
           * Crude as the above is, it's completely useless if we don't have
           * reliable ndistinct values for both sides.  Hence, if either nd1 or
!          * nd2 is default, we can't use this.
           */
          if (!isdefault1 && !isdefault2)
          {
*************** eqjoinsel_semi(Oid operator,
*** 2616,2622 ****
--- 2615,2635 ----
                  uncertainfrac = nd2 / nd1;
          }
          else
+         {
+             /*
+              * In this situation, we basically assume half of the uncertain
+              * rows have join partners.  However, we'd still like to respond
+              * to restriction clauses applied to the inner rel, so what we
+              * really do is assume half of the uncertain rows have partners in
+              * the underlying inner rel, then reduce that fraction by the
+              * previously-determined selectivity of the inner restrictions.
+              */
              uncertainfrac = 0.5;
+             if (vardata2->rel &&
+                 vardata2->rel->rows > 0 &&
+                 vardata2->rel->rows < vardata2->rel->tuples)
+                 uncertainfrac *= vardata2->rel->rows / vardata2->rel->tuples;
+         }
          uncertain = 1.0 - matchfreq1 - nullfrac1;
          CLAMP_PROBABILITY(uncertain);
          selec = matchfreq1 + uncertainfrac * uncertain;
*************** eqjoinsel_semi(Oid operator,
*** 2624,2631 ****
      else
      {
          /*
!          * Without MCV lists for both sides, we can only use the heuristic
!          * about nd1 vs nd2.
           */
          double        nullfrac1 = stats1 ? stats1->stanullfrac : 0.0;

--- 2637,2644 ----
      else
      {
          /*
!          * Without MCV lists for both sides, we can only use the heuristics
!          * described above about nd1 vs nd2 and inner restriction clauses.
           */
          double        nullfrac1 = stats1 ? stats1->stanullfrac : 0.0;

*************** eqjoinsel_semi(Oid operator,
*** 2637,2643 ****
--- 2650,2662 ----
                  selec = (nd2 / nd1) * (1.0 - nullfrac1);
          }
          else
+         {
              selec = 0.5 * (1.0 - nullfrac1);
+             if (vardata2->rel &&
+                 vardata2->rel->rows > 0 &&
+                 vardata2->rel->rows < vardata2->rel->tuples)
+                 selec *= vardata2->rel->rows / vardata2->rel->tuples;
+         }
      }

      free_attstatsslot(&sslot1);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pgsql-hackers by date:

From: Masahiko Sawada
Date: 01 June 2017, 18:32:36
Subject: Re: [HACKERS] An incomplete comment sentence in subtrans.c

From: Dilip Kumar
Date: 01 June 2017, 18:41:32
Subject: Re: [HACKERS] <> join selectivity estimate question

Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours! - Mailing list pgsql-hackers

Previous

Next