Re: estimating # of distinct values - Mailing list pgsql-hackers
From: Nathan Boley
Subject: Re: estimating # of distinct values
Date:
Msg-id: AANLkTi=wNuG0p4gvLMXo+uXhCvspHPq340r0y98tNofj@mail.gmail.com
In response to: Re: estimating # of distinct values (Tomas Vondra <tv@fuzzy.cz>)
Responses: Re: estimating # of distinct values
List: pgsql-hackers
> Not true. You're missing the goal of this effort - to get an
> ndistinct estimate for a combination of multiple columns, which is
> usually pathological in the case of dependent columns.

<snip>

> Again, don't think about a single column (although even in that case
> there are known fail cases). Think about multiple columns and the
> number of distinct combinations. With the attribute value
> independence assumption, this is usually significantly
> underestimated. And that's a problem, as it often leads to an index
> scan instead of a sequential scan (or something like that).

My point was that, in the case of single columns, we've done pretty
well despite the poor ndistinct estimates. I suspect the same will be
true in the case of multiple columns; good heuristics will trump
minimax estimators.

>> ( as I've advocated for before ) this means parameterizing the
>> distribution of values ( ie, find that the values are roughly
>> uniform ), maybe this means estimating the error of our statistics
>> and using the most robust rather than the best plan, or maybe it
>> means something totally different. But: ( and the literature seems
>> to support me )
>
> Which literature? I've read an awful lot of papers on this topic
> lately, and I don't remember anything like that. If there's an
> interesting paper, I'd like to read it. Yes, all the papers state
> estimating ndistinct is a particularly tricky task, but that's
> somehow expected?

It is definitely expected - non-parametric minimax results are always
*very* weak. The papers that you've cited seem to confirm this;
bounding ndistinct estimation error is tricky in the general case
( even with very simple loss functions, which we do not have ). The
same is true for other non-parametric methods. Think about kernel
density estimation: how many data points do you need to estimate the
density of a normal distribution well? What if you *know* that the
data is normal ( or even that it comes from some larger parametric
family )?

> No, I'm not trying to do this just to improve the ndistinct
> estimate. The goal is to improve the cardinality estimate in the
> case of dependent columns, and one of the papers depends on good
> ndistinct estimates.
>
> And actually it does not depend on ndistinct for the individual
> columns only, it depends on ndistinct estimates for the combination
> of columns. So improving the ndistinct estimates for columns is just
> a necessary first step (and only if it works reasonably well can we
> do the next step).

I think that any approach which depends on precise estimates of
ndistinct is impractical. I am very happy that you've spent so much
time on this, and I'm sorry if my previous email came off as
combative. My point was only that simple heuristics have served us
well in the past and, before we go to the effort of building new,
complicated schemes, we should see how well similar heuristics work
in the multiple-column case. I am worried that if the initial plan is
too complicated then nothing will happen and, even if something does
happen, it will be tough to get it committed ( check the archives for
cross-column stats threads - there are a lot ).

Best,
Nathan Boley
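
P.S. To make the dependent-columns point concrete, here is a toy
sketch ( Python, purely illustrative - nothing to do with the actual
planner code ) of how the attribute value independence assumption
goes wrong when one column is a function of another:

    # Toy data: two perfectly dependent columns, e.g. zip code and
    # city ( here city is just a deterministic function of zip ).
    import random

    N = 100_000
    zips = [random.randrange(1000) for _ in range(N)]
    cities = [z // 10 for z in zips]

    d1 = len(set(zips))                 # ~1000
    d2 = len(set(cities))               # ~100
    d12 = len(set(zip(zips, cities)))   # ~1000, not d1*d2 = 100000

    # Independence predicts N/(d1*d2) = ~1 row per (zip, city) pair;
    # the truth is N/d12 = ~100 rows. Underestimates like that are
    # exactly what push the planner toward index scans.
    print(d1, d2, d1 * d2, d12)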
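
On the kernel density analogy, the same experiment in code ( a
minimal sketch using numpy/scipy; the point is the rate, not the
exact numbers ) shows the non-parametric estimate needing far more
data than the parametric fit that *knows* the family:

    import numpy as np
    from scipy.stats import gaussian_kde, norm

    rng = np.random.default_rng(0)
    grid = np.linspace(-4, 4, 200)
    true_pdf = norm.pdf(grid)

    for n in (50, 500, 5000):
        x = rng.standard_normal(n)
        # Non-parametric: kernel density estimate.
        kde_err = np.abs(gaussian_kde(x)(grid) - true_pdf).max()
        # Parametric: we "know" it's normal, so just fit mean and sd.
        par_err = np.abs(norm.pdf(grid, x.mean(), x.std()) - true_pdf).max()
        print(f"n={n}: kde {kde_err:.4f}  parametric {par_err:.4f}")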
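
And for what I mean by "simple heuristics have served us well": the
single-column estimator is ( if memory serves ) the Duj1 estimator
from Haas & Stokes, which is only a few lines. A sketch, with made-up
data:

    import random
    from collections import Counter

    def duj1(sample, total_rows):
        # Haas & Stokes Duj1: D = n*d / ( n - f1 + f1*(n/N) ), where
        # d = distinct values in the sample and f1 = values seen
        # exactly once in the sample.
        n = len(sample)
        counts = Counter(sample)
        d = len(counts)
        f1 = sum(1 for c in counts.values() if c == 1)
        return n * d / (n - f1 + f1 * (n / total_rows))

    N = 1_000_000
    uniform = [random.randrange(50_000) for _ in range(N)]
    print(duj1(random.sample(uniform, 30_000), N))   # ~50000, decent

    # Skewed data ( heavy head, long tail of singletons ) is the
    # classic failure mode - f1 explodes and the estimate swings.
    skewed = [min(int(1.0 / (1.0 - random.random())), 50_000)
              for _ in range(N)]
    print(duj1(random.sample(skewed, 30_000), N),
          "vs actual", len(set(skewed)))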