Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers
From | Dean Rasheed |
---|---|
Subject | Re: [HACKERS] PATCH: multivariate histograms and MCV lists |
Date | |
Msg-id | CAEZATCUtvWzQ8BPgNcY8u6tnyAf6cvVoVC6RRdqbMuDNHfuS2g@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Responses |
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Re: [HACKERS] PATCH: multivariate histograms and MCV lists |
List | pgsql-hackers |
On 13 July 2018 at 18:27, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > I'm not so sure. The issue is that a lot of the MCV deductions depends > on whether we can answer questions like "Is there a single match?" or > "If we got a match in MCV, do we need to look at the non-MCV part?" This > is not very different from the single-column estimates, except of course > here we need to look at multiple columns. > > The top-level clauses allow us to make such deductions, with deeper > clauses it's much more difficult (perhaps impossible). Because for > example with (a=1 AND b=1) there can be just a single match, so if we > find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR > b=2)) it's not that simple, because there may be multiple combinations > and so a match in MCV does not guarantee anything. Actually, it guarantees a lower bound on the overall selectivity, and maybe that's the best that we can do in the absence of any other stats. I did wonder if maybe we could do better by tracking allowed value counts. E.g., with a clause like ((a=1 OR a=2) AND (b=1 OR b=2)) it would be fairly simple to see that there are 2 allowed values of a, and 2 allowed values of b, so 4 allowed values overall. If we had, say, 3 MCV matches, we'd then know to factor in something extra for the 1 non-MCV match. I'm not sure what to do with non-equality clauses though. >> I think perhaps it might be better not to attempt to combine the >> *overall* selectivity returned by mcv_clauselist_selectivity() with >> that returned by clauselist_selectivity(), but rather merge the logic >> of these two functions together into a single traversal of the >> clause-tree. That way, the various individual selectivities can be >> combined on a clause-by-clause basis to give the best running total >> based on the available information. Even when the multivariate stats >> are incomplete, they may still provide a useful lower bound on the >> selectivity. > > I don't follow. The example you presented above showed multivariate > stats producing over-estimates, so how would it be helpful to use that > as lower boundary for anything? No, the multivariate MCV stats were producing good estimates, even for the complex clauses, because they were all common values in my example. The problem was that the good MCV estimate was then being ruined by adding on extra factors because at the top-level it didn't appear to be a full match. >> If/when all MCV columns have been matched exactly, that >> lower bound might turn into the appropriate overall result, if there >> is a matching MCV entry. > > Isn't the problem illustrated by the examples that we don't know if the > MCV matches represent all matches, or if there may be matches in the > histogram? The example illustrated a case where the MCV matches represented all the matches, but we failed to recognise that. Now we could fix that to reliably detect cases where the MCV matches represented all the matches, but it's still not entirely obvious what to do when they don't. What I'm considering is an algorithm where we simultaneously compute 3 things: simple_sel - The result we would get without multivariate stats (*) mcv_sel - The multivariate MCV result hist_sel - The multivariate histogram result (*) except that at each stage where we add a new clause to the simple_sel value, we improve upon that estimate by factoring in a lower bound from the multivariate stats so far, so that even if the multivariate stats fail to generate anything at the end, we've managed to account for some of the non-independence of the columns. If the MCV matches represented all the matches, then at the end we would have simple_sel = mcv_sel, and hist_sel = 0, and we'd be done. Otherwise, we'd have simple_sel >= mcv_sel, and a possibly non-zero hist_sel, but if we had managed to factor in both mcv_sel and hist_sel to simple_sel at each stage as we went along, then simple_sel is the best overall estimate we can return. Perhaps this is not so very different from what you're currently doing, except that with this approach we might also end up with mcv_sel = 0 and hist_sel = 0, but still have an improved simple_sel estimate that had accounted for some the multivariate stats available along the way. Regards, Dean
pgsql-hackers by date: