Home > mailing lists

RE: Improve selectivity estimate for range queries - Mailing list pgsql-hackers

From	Yuzuko Hosoya
Subject	RE: Improve selectivity estimate for range queries
Date	December 21, 2018 08:21:43
Msg-id	009c01d49906$350ee890$9f2cb9b0$@lab.ntt.co.jp Whole thread Raw
In response to	Re: Improve selectivity estimate for range queries (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses	Re: Improve selectivity estimate for range queries
List	pgsql-hackers

Tree view

Hi,

Thanks for the comments.
I attach the v2 patch.

> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> Sent: Friday, December 21, 2018 12:25 PM
> 
> Hello.
> 
> At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in
> <008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>
> > In my environment, the selectivity for id > 0 was 0.99990000000000001,
> > and the selectivity for id < 10000 was 0.33333333333333331. Then, the
> > value of rqlist->hibound and rqlist->lobound are set respectively.
> > On the other hand, DEFAULT_INEQ_SEL is 0.3333333333333333.  As a
> > result, the final selectivity is computed according to
> > DEFAULT_RANGE_INEQ_SEL, since the condition of the second if statement
> > was satisfied. However, these selectivities were computed according to
> > the statistics, so the final selectivity had to be calculated from rqlist->hibound +
rqlist->lobound
> - 1.0.
> > My test case might be uncommon, but I think it would cause some problems.
> 
> Yeah, its known behavior as the comment just above. If in your example query, the table weer very
large
> and had an index it could run faultly an index scan over a fair amount of tuples. But I'm not sure
> how much it matters and your example doesn't show that.
> 
> The behvavior discussed here is introduced by this commit.
> 
> | commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a
> | Author: Tom Lane <tgl@sss.pgh.pa.us>
> | Date:   Tue Nov 9 00:34:46 2004 +0000
> |
> |     Use a hopefully-more-reliable method of detecting default selectivity
> |     estimates when combining the estimates for a range query.  As pointed out
> |     by Miquel van Smoorenburg, the existing check for an impossible combined
> |     result would quite possibly fail to detect one default and one non-default
> |     input.  It seems better to use the default range query estimate in such
> |     cases.  To do so, add a check for an estimate of exactly DEFAULT_INEQ_SEL.
> |     This is a bit ugly because it introduces additional coupling between
> |     clauselist_selectivity and scalarltsel/scalargtsel, but it's not like
> |     there wasn't plenty already...
> 
> > To handle such cases I've thought up of an idea based on a previous
> > commit[1] which solved a similar problem related to
> > DEFAULT_NUM_DISTINCT.  Just like this modification, I added a flag
> > which shows whether the selectivity
> 
> The commit [1] added almost no complexity, but this does. Affects many functions to introduce
tighter
> coupling between operator-selectivity functions and clause selectivity functions.
> isdefault travells alone too long just to bind remote functions. We might need more pricipled way
to
> handle that.
>
Yes, indeed.  But I have not found better way than this approach yet.
 
> > was calculated according to the statistics or not to
> > clauselist_selectivity, and used it as a condition of the if statement
> > instead of
> > rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL.
> > I am afraid that changes were more than I had expected, but we can
> > estimate selectivity correctly.
> >
> > Patch attached.  Do you have any thoughts?
> 
> I've just run over the patch, but some have comments.
> 
> +            if (isdefault)
> +                rqelem->lobound_isdefault = true;
> 
> This is taking the disjunction of lobounds ever added, I suppose it should be the last lobounds'
> isdefault.
> 
Indeed.  I fixed it.

> As you see, four other instances of "selec ==/!= DEFAULT_*" exist in mergejoinsel. Don't they lead
> to similar kind of problem?
>
I didn't encounter similar problems, but as you said, such problems would be occurred due to the
condition, selec != DEFAULT_INEQ_SEL.  I changed these conditions into "!isdefault".

> 
> 
>      Selectivity lobound;        /* Selectivity of a var > something clause */
>      Selectivity hibound;        /* Selectivity of a var < something clause */
> +    bool        lobound_isdefault;    /* lobound is a default selectivity? */
> +    bool        hibound_isdefault;    /* hibound is a default selectivity? */
>  } RangeQueryClause;
> 
> Around the [1], there was a discussion about introducing the notion of uncertainty to selectivity.
> The isdefualt is a kind of uncertainty value indicating '0/100% uncertain'. So my feeling is
saying
> that it's better that Selectivity has certinty component then building a selectivity arithmetics
> involving uncertainty, though I don't have a clear picture.
>
I see.  Indeed if we would change Selectivity so that it includes 
information about default/non-default, such problems I mentioned 
would be solved.  But I'm afraid that this modification would lead
to a big impact, since there are lots of functions that calculate
selectivity.  I also don't have a concreate idea, so I will think 
about it.  Besides that, I'd like to ask community whether such
modification would be acceptable or not.

Best regards,

Yuzuko Hosoya
NTT Open Source Software Center

> 
> > [1]
> > https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2
> > 777d0b733220d9029f78817af8ce
> 
> regards.
> 
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center

Attachment

v2_improve_selectivity_estimate_for_range_queries.patch

pgsql-hackers by date:

From: Edmund Horner
Date: 21 December 2018, 08:04:58
Subject: Re: Tid scan improvements

From: "Nagaura, Ryohei"
Date: 21 December 2018, 08:34:18
Subject: RE: [suggestion]support UNICODE host variables in ECPG

RE: Improve selectivity estimate for range queries - Mailing list pgsql-hackers

Attachment

Previous

Next