| From | Nathan Boley |
|---|---|
| Subject | Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics |
| Msg-id | 6fa3b6e20806091051j67347c3aj6658920e95855ef8@mail.gmail.com |
| In response to | Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics (Tom Lane <tgl@sss.pgh.pa.us>) |
| List | pgsql-hackers |
> Your argument seems to consider only columns having a normal
> distribution.

My example was based upon normally distributed data because people usually know what it is, and it is reasonably common.

> How badly does it fall apart for non-normal distributions?

This should work to the extent that there is a correlation between the number of distinct values in a histogram bin and the 'size' of the bin. I know that this isn't a very good answer, but it's a hard problem in the general case. However, for any non-uniform probability distribution there exists a measure under which there is a non-zero correlation between the two. I wrote up a hand-wavy, semi-formal argument at the end, but it's not tough to see intuitively. Just think about the graph of the CDF. The histogram boundaries are horizontal lines that are equally spaced from top to bottom. If we rescale the x-axis such that there is a fixed distance between each distinct value, and define the distance as (ndistinct in a given interval)/(total ndistinct), then the only place where this doesn't hold is when the CDF is f(x) = x, i.e. the distribution is uniform. And even then we just get a 0 coefficient, which is exactly what we are always assuming now.

Obviously we run into problems when:

a) we have a poor estimate for ndistinct - but then we have worse problems
b) our length measure doesn't correspond well with ndistinct in an interval

Expanding on b): your mention of Zipfian distributions is a particularly good example of where this could be poor. Right now (someone correct me if I'm wrong) words are sorted alphabetically. However, if we wanted this estimator to do a good job, we would sort them by their length or, better yet, frequency in the English language (which is highly correlated with length).

> (For instance, Zipfian distributions seem to be pretty common in
> database work, from what I've seen.)

This should work well for any power law distribution. If people are curious, I can rerun my previous example using a power law distribution instead of normal (see the simulation sketch after the formal note below).

However, the easy way of thinking about all of this is that we're building a linear model between ndistinct and histogram bucket width. It's intuitive to expect there to be a correlation between the two, and so the model should have some predictive power.

-Nathan

<somewhat formal - probably will be difficult to read without some basic probability theory>

To see this, first note that we can alternatively define the uniform distribution on [a,b] as the distribution whose CDF is a straight line that passes through both (a,0) and (b,1) (just integrate the PDF). So any non-uniform distribution will have a CDF whose slope is both below and above 1/(b-a) at some set of points, implying the existence of an interval [i1, i2] such that ( CDF(i2) - CDF(i1) ) < ( i2 - i1 )/(b-a). Then, under the constraints of the probability measure, there exists a second, disjoint interval such that ( CDF(i2') - CDF(i1') ) > ( i2' - i1' )/(b-a). In other words, the CDF rises more slowly than the uniform CDF over one interval and more quickly over another.

Next, assume that the number of potential distinct values in our interval scales linearly with the length of the interval. Although it seems as if this assumption could be somewhat burdensome, there always exists a measure under which this is true for sets with a defined order relation (as remarked earlier by Tom, we are already making this assumption). To see this, consider defining length(i1, i2) as ( the number of distinct values in [i1, i2] )/( total number of distinct values ), where the number of distinct values is the size of the set { v | v >= i1 and v <= i2 }.
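The same two observations, stated compactly (the notation here is mine, not from the thread: F is the CDF on [a,b] and D is the total number of distinct values):

```latex
% Compact restatement of the claims above; F is the CDF on [a,b] and
% D is the total number of distinct values (notation chosen here).
\[
  \exists\ \text{disjoint intervals } [i_1, i_2],\, [i_1', i_2'] \subset [a,b] \text{ such that}
\]
\[
  F(i_2) - F(i_1) < \frac{i_2 - i_1}{b - a}
  \quad\text{and}\quad
  F(i_2') - F(i_1') > \frac{i_2' - i_1'}{b - a},
\]
\[
  \text{with length measure}\quad
  \mathrm{len}(i_1, i_2) = \frac{\lvert \{\, v \text{ distinct} : i_1 \le v \le i_2 \,\} \rvert}{D}.
\]
```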
Next, note that the joint distribution of identically distributed, independent random variables is multinomial, with cell probabilities given by the value of the PDF at each distinct point. Next, I'll state without proof that for IID random variables the expected number of distinct values is maximized by a uniform distribution (this is pretty easy to see: think about the binomial case - would you rather your cell probabilities be (1.0, 0) or (0.5, 0.5)?).

Finally, note that the expected number of distinct values decreases faster than linearly in the length of the interval. This is pretty clear when we consider the sparse case. As the number of potential entries (in this case, the interval length) approaches infinity, the probability of a new entry being distinct approaches 1. This means that, in this limit, every new entry ends up being distinct, i.e. the number of distinct values scales linearly in the number of new entries. As the interval shrinks, new entries have some probability of being repeats. As the interval shrinks to 0, there is a zero probability of new entries being unique. Since

a) there doesn't exist a linear relationship that contains the two boundary points,
b) the multinomial distribution of the PDF is continuous, and
c) the relationship is clearly decreasing,

we can surmise that it is sub-linear.

Therefore, we have two intervals whose sub- and super-linear slopes cancel one another. However, the change in ndistinct for the super-linear slope scales super-linearly, while the change in ndistinct for the sub-linear slope scales sub-linearly. So there is a net correlation.

</somewhat formal>
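For anyone who wants to poke at the power-law case before I rerun my example, here is a minimal sketch of the check being described, assuming numpy's Zipf sampler as the data source (this is not my earlier script; the sampler parameter, bucket count, and tail trimming are illustrative choices). It builds an equal-frequency histogram, analogous to pg_stats.histogram_bounds, and reports the correlation between per-bucket width and per-bucket ndistinct:

```python
# Minimal sketch (not the earlier script from this thread): check whether
# per-bucket ndistinct correlates with bucket width for an equal-frequency
# histogram over power-law (Zipf) data.  The sampler parameter, bucket
# count, and tail trimming below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.zipf(a=1.5, size=100_000)      # power-law distributed integers
sample = sample[sample < 10_000]            # drop the extreme tail

n_buckets = 100
# Equal-frequency boundaries, analogous to pg_stats.histogram_bounds.
bounds = np.quantile(sample, np.linspace(0.0, 1.0, n_buckets + 1))

widths, ndistinct = [], []
for lo, hi in zip(bounds[:-1], bounds[1:]):
    in_bucket = sample[(sample >= lo) & (sample < hi)]
    if hi <= lo or len(in_bucket) == 0:     # skip degenerate buckets from repeated bounds
        continue
    widths.append(hi - lo)
    ndistinct.append(len(np.unique(in_bucket)))

widths = np.array(widths)
ndistinct = np.array(ndistinct, dtype=float)

# A clearly non-zero correlation is what the argument above predicts;
# a uniform distribution would give (roughly) equal-width buckets instead.
r = np.corrcoef(widths, ndistinct)[0, 1]
print(f"buckets kept: {len(widths)}, corr(width, ndistinct) = {r:.3f}")

# The simplest predictive use: a least-squares line ndistinct ~ width.
slope, intercept = np.polyfit(widths, ndistinct, 1)
print(f"ndistinct ~= {slope:.3f} * width + {intercept:.3f}")
```

The point of the sketch is just the last two numbers: if the correlation is materially non-zero, the fitted line is the "linear model between ndistinct and histogram bucket width" described above.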