Re: BUG #15160: planner overestimates number of rows in join whenthere are more than 200 rows coming from CTE - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: BUG #15160: planner overestimates number of rows in join whenthere are more than 200 rows coming from CTE
Date
Msg-id CAAKRu_Y+d3SKkofNK7znhwgLauPUCVyZU=H0mvJH1LG7+tMyXg@mail.gmail.com
Whole thread Raw
In response to Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BUG #15160: planner overestimates number of rows in join when there are more than 200 rows coming from CTE
List pgsql-hackers
Thanks for the quick responses. I've put some inline follow-up questions.

On a separate note, I had one additional code clarity feedback. I felt that
eqjoinsel could be reorganized a bit for readability/clarity for the reader.
For example, eqjoinsel_inner uses only the AttStatsSlots up until here and then
suddenly uses the original stats object and the ndvs which we passed in:

    else
    {
        ...
        double        nullfrac1 = stats1 ? stats1->stanullfrac : 0.0;
        double        nullfrac2 = stats2 ? stats2->stanullfrac : 0.0;
 
        selec = (1.0 - nullfrac1) * (1.0 - nullfrac2);
        if (nd1 > nd2)
            selec /= nd1;
       else
            selec /= nd2;
    }
 
It would make the process of calculating selectivity for an equijoin more clear
to the reader if the nullfraction calculation was pulled out into the main
eqjoinsel function.
 
Having a clear set of steps in eqjoinsel would be helpful. Basically, my
understanding of an overview of the steps is the following:
 
    1) get NDVs
    2) get nullfrac
    3) get MCVs
    4) calculate selectivity
 
Based on this assumption, I've attached a patch with a rough idea for an
alternative structure that I think would be more clear to the reader.
 
> I could not devise an example in which the previous method of calculating
> selectivity would have produced a better estimate. However, one question I have
> after thinking through the optimization is the following:
> ...
> To summarize:
> Selectivity Type                  |  if nd1 <= nd2 |   if nd1 > nd2 |
> ----------------------------------|----------------|-----------------
> inner-join selectivity * ntuples2 | ntuples2 / nd2 | ntuples2 / nd1 |
> semi-join selectivity             |              1 |      nd2 / nd1 |

Um, mumble.  Those functions could be using different values of nd2
thanks to the clamping logic near the head of eqjoinsel_semi, so I'm
not sure that the comparison you're making really holds.

That's a good point. Taking another look at that clamping logic, I realized
that I don't really understand why that clamping would be done for a semi-join
and not for an inner join. It seems like for an inner join it is also true that
the the nd1 cannot be greater than outer rel estimated tuples and nd2 could not
be greater than inner rel estimated tuples.

Also, I don't understand when vardata2->rel->rows and inner_rel->rows would be
different. I thought the point of doing this clamping was that, if you have a
restriction, like the predicate in this subquery select * from foo where a in
(select b from bar where b > 10); your row estimate for bar and your row
estimate for the rows out for that subquery would be different. However, I
looked at the RelOptInfos for vardata2->rel and inner_rel for this query and it
seems like they are referencing the same relation and have the same rows
estimate, so I'm confused when the rows would be different.

> If there is a reason to keep the existing formula, then I have an additional
> question about the proposed selectivity calculation:
>     selec = Min(selec, nd2 * selec_inner);
> When would it be incorrect to instead multiply by inner side NDVs?

I'm confused ... isn't that exactly what this is doing?

Sorry, typo, I was asking why
selec = Min(selec, nd2 * selec_inner);
could not be used instead of what is in the patch
selec = Min(selec, inner_rel->rows * selec_inner);

Thanks,
Melanie
Attachment

pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Constraint documentation
Next
From: Tom Lane
Date:
Subject: Re: Problem while updating a foreign table pointing to a partitioned table on foreign server