Re: benchmarking the query planner - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: benchmarking the query planner
Date:
Msg-id: 603c8f070812120414g2daba35egb10cc5d7cc2d7dee@mail.gmail.com
In response to: Re: benchmarking the query planner (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: benchmarking the query planner
List: pgsql-hackers
On Thu, Dec 11, 2008 at 10:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Robert Haas" <robertmhaas@gmail.com> writes:
>> I had this idle thought too, but I didn't write it down because...
>
>>> ought to be, but it seems like it ought to be possible to determine
>>> that given a desired maximum error in the overall estimate. I'm also
>>> not very clear on what the "total frequency" computations (matchfreq2
>>> and unmatchfreq2 in the current code) ought to look like if we are using
>>> a variable subset of the inner list.
>
>> ...of this exact concern, which I think is an insurmountable problem.
>
> Maybe so. If we stick to the other design (end both lists at a preset
> frequency threshold) then the math clearly goes through the same as
> before, just with num_mcvs that are determined differently. But can
> we prove anything about the maximum error added from that?

I don't think so, because in that design it's entirely possible that
you'll throw away the entire MCV list if all of the entries are below
the threshold (as in the example we were just benchmarking, supposing a
threshold of 0.001).

An alternative is to pick a threshold T for the maximum number of
equality probes that you're willing to suffer through.  Then, given two
MCV lists of lengths M1 and M2 such that M1 * M2 > T, pick the largest N
such that MIN(M1, N) * MIN(M2, N) <= T.  This guarantees that you always
use at least T^(1/2) MCVs.

If you compare this approach with T = 10^6 against simply chopping off
the MCV list at p = 0.001, this approach will be more accurate at the
cost of more comparisons.  For example, in our test case where all of
the comparisons fail, chopping off the MCV list at p = 0.001 results in
ignoring it completely, whereas with this approach we use all 1000
entries just as before.  So it might be appropriate to choose a lower
threshold, say T = 10^5, since otherwise we're not actually saving any
computation.  (I suppose you could even make this a GUC, since any
choice of value is going to be pretty arbitrary...)

I'm not sure to what extent we can bound the amount of error with this
approach, but it's definitely better than a frequency cutoff.  As we've
seen, a frequency cutoff can throw away the entire MCV list; this
approach always retains at least T^(1/2) entries, and more if the other
list happens to be shorter than T^(1/2).  So we can say that the result
will never be worse than it would have been had you set the statistics
target to T^(1/2).

...Robert
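
A minimal standalone C sketch of the cutoff rule described above (the
helper name mcv_cutoff and its structure are hypothetical illustrations,
not the planner's actual selfuncs.c code).  It relies on the fact that
N = floor(sqrt(T)) always satisfies MIN(M1, N) * MIN(M2, N) <= T, which
is why the rule never uses fewer than T^(1/2) MCVs:

#include <math.h>
#include <stdio.h>

/*
 * Hypothetical helper: given MCV lists of lengths m1 and m2 and a probe
 * budget T, return the largest N such that min(m1, N) * min(m2, N) <= T.
 */
static int
mcv_cutoff(int m1, int m2, double budget)
{
    int     lo = (m1 < m2) ? m1 : m2;
    int     hi = (m1 < m2) ? m2 : m1;

    if ((double) lo * hi <= budget)
        return hi;                  /* both lists fit; no truncation */
    if ((double) lo * lo <= budget)
        return (int) (budget / lo); /* only the longer list is cut */
    return (int) sqrt(budget);      /* both lists are cut at ~sqrt(T) */
}

int
main(void)
{
    /* The cases discussed above: two 1000-entry MCV lists. */
    printf("T = 10^6: N = %d\n", mcv_cutoff(1000, 1000, 1e6)); /* 1000 */
    printf("T = 10^5: N = %d\n", mcv_cutoff(1000, 1000, 1e5)); /* 316 */
    /* A short list against a long one: only the long one is truncated. */
    printf("T = 10^4: N = %d\n", mcv_cutoff(10, 5000, 1e4));   /* 1000 */
    return 0;
}

Compile with -lm for sqrt().  With T = 10^6 and two 1000-entry lists the
sketch keeps all 1000 entries, matching the example above; with T = 10^5
it falls back to 316 entries per list.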