RE: Improve selectivity estimate for range queries - Mailing list pgsql-hackers
From | Yuzuko Hosoya |
---|---|
Subject | RE: Improve selectivity estimate for range queries |
Date | |
Msg-id | 009c01d49906$350ee890$9f2cb9b0$@lab.ntt.co.jp Whole thread Raw |
In response to | Re: Improve selectivity estimate for range queries (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
Responses |
Re: Improve selectivity estimate for range queries
|
List | pgsql-hackers |
Hi, Thanks for the comments. I attach the v2 patch. > From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp] > Sent: Friday, December 21, 2018 12:25 PM > > Hello. > > At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in > <008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp> > > In my environment, the selectivity for id > 0 was 0.99990000000000001, > > and the selectivity for id < 10000 was 0.33333333333333331. Then, the > > value of rqlist->hibound and rqlist->lobound are set respectively. > > On the other hand, DEFAULT_INEQ_SEL is 0.3333333333333333. As a > > result, the final selectivity is computed according to > > DEFAULT_RANGE_INEQ_SEL, since the condition of the second if statement > > was satisfied. However, these selectivities were computed according to > > the statistics, so the final selectivity had to be calculated from rqlist->hibound + rqlist->lobound > - 1.0. > > My test case might be uncommon, but I think it would cause some problems. > > Yeah, its known behavior as the comment just above. If in your example query, the table weer very large > and had an index it could run faultly an index scan over a fair amount of tuples. But I'm not sure > how much it matters and your example doesn't show that. > > The behvavior discussed here is introduced by this commit. > > | commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a > | Author: Tom Lane <tgl@sss.pgh.pa.us> > | Date: Tue Nov 9 00:34:46 2004 +0000 > | > | Use a hopefully-more-reliable method of detecting default selectivity > | estimates when combining the estimates for a range query. As pointed out > | by Miquel van Smoorenburg, the existing check for an impossible combined > | result would quite possibly fail to detect one default and one non-default > | input. It seems better to use the default range query estimate in such > | cases. To do so, add a check for an estimate of exactly DEFAULT_INEQ_SEL. > | This is a bit ugly because it introduces additional coupling between > | clauselist_selectivity and scalarltsel/scalargtsel, but it's not like > | there wasn't plenty already... > > > To handle such cases I've thought up of an idea based on a previous > > commit[1] which solved a similar problem related to > > DEFAULT_NUM_DISTINCT. Just like this modification, I added a flag > > which shows whether the selectivity > > The commit [1] added almost no complexity, but this does. Affects many functions to introduce tighter > coupling between operator-selectivity functions and clause selectivity functions. > isdefault travells alone too long just to bind remote functions. We might need more pricipled way to > handle that. > Yes, indeed. But I have not found better way than this approach yet. > > was calculated according to the statistics or not to > > clauselist_selectivity, and used it as a condition of the if statement > > instead of > > rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL. > > I am afraid that changes were more than I had expected, but we can > > estimate selectivity correctly. > > > > Patch attached. Do you have any thoughts? > > I've just run over the patch, but some have comments. > > + if (isdefault) > + rqelem->lobound_isdefault = true; > > This is taking the disjunction of lobounds ever added, I suppose it should be the last lobounds' > isdefault. > Indeed. I fixed it. > As you see, four other instances of "selec ==/!= DEFAULT_*" exist in mergejoinsel. Don't they lead > to similar kind of problem? > I didn't encounter similar problems, but as you said, such problems would be occurred due to the condition, selec != DEFAULT_INEQ_SEL. I changed these conditions into "!isdefault". > > > Selectivity lobound; /* Selectivity of a var > something clause */ > Selectivity hibound; /* Selectivity of a var < something clause */ > + bool lobound_isdefault; /* lobound is a default selectivity? */ > + bool hibound_isdefault; /* hibound is a default selectivity? */ > } RangeQueryClause; > > Around the [1], there was a discussion about introducing the notion of uncertainty to selectivity. > The isdefualt is a kind of uncertainty value indicating '0/100% uncertain'. So my feeling is saying > that it's better that Selectivity has certinty component then building a selectivity arithmetics > involving uncertainty, though I don't have a clear picture. > I see. Indeed if we would change Selectivity so that it includes information about default/non-default, such problems I mentioned would be solved. But I'm afraid that this modification would lead to a big impact, since there are lots of functions that calculate selectivity. I also don't have a concreate idea, so I will think about it. Besides that, I'd like to ask community whether such modification would be acceptable or not. Best regards, Yuzuko Hosoya NTT Open Source Software Center > > > [1] > > https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2 > > 777d0b733220d9029f78817af8ce > > regards. > > -- > Kyotaro Horiguchi > NTT Open Source Software Center
Attachment
pgsql-hackers by date: