Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: [HACKERS] PATCH: multivariate histograms and MCV lists |
Date | |
Msg-id | 7d12b5d1-ac77-67e3-c67e-7fcbd3753e9c@2ndquadrant.com Whole thread Raw |
In response to | Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Dean Rasheed <dean.a.rasheed@gmail.com>) |
Responses |
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
|
List | pgsql-hackers |
On 03/29/2018 02:27 AM, Dean Rasheed wrote: > On 28 March 2018 at 15:50, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> After thinking about this a bit more, I'm not sure if updating the info >> based on recursive calls makes sense. The fullmatch flag was supposed to >> answer a simple question - can there be just a single matching item? >> >> If there are equality conditions on all columns, there can be just a >> single matching item - if we have found it in the MCV (i.e. s1 > 0.0), >> then we don't need to inspect the non-MCV part. >> >> But handling this in recursive manner breaks this entirely, because with >> something like >> >> (a=1) AND (b=1 OR b=2) >> >> you suddenly can have multiple matching items. Which makes the fullmatch >> flag somewhat useless. >> >> So I think we should be looking at top-level equality clauses only, just >> like number_of_groups() does. >> > > I'm not quite sure what you mean by that, but it sounds a bit limiting > in terms of the kinds of user queries that would be supported. > Let me explain. The question is "Can there be just a single combination of values matching the conditions?" which (if true) allows us to produce better estimates. If we found a match in the MCV, we don't need to look at the non-MCV part. If not found in the MCV, we can compute an average selectivity as 1/ndistinct (possibly using the ndistinct coefficients). If we can't deduce the existence of a single possible match, we have to compute an estimate in a more generic way. With (a=1 AND b=1) and stats on (a,b) there's just a single possible match (1,1), so that's fine. But it does not work once we start looking for equalities nested deeper - for example (a=1 AND (b=1 OR b=2)) can be translated as ((a=1 AND b=1) OR (a=1 AND b=2)) so technically there's an equality on each column, but there are two possible matches (1,1) and (1,2). So the optimization does not work. Does that clarify what I meant? Although, perhaps we could improve this by deducing number of possible matches and then track matching items in the MCV list. But that seems quite a bit harder. (Of course, we need to consider the non-equality clauses in both cases, the WIP patch does not do that yet.) > >> I think we can remove the fullmatch flag from mcv_update_bitmap >> entirely. All we need to know is the presence of equality clauses and >> whether there was a match in MCV (which we know from s1 > 0.0). >> > > I agree with removing the fullmatch flag, but I don't think we > actually need to know about the presence of equality clauses: > > The way that mcv_update_bitmap() recursively computes the set of > matching MCVs seems to be correct. That gives us a value (call it > mcv_matchsel) for the proportion of the table's rows that are in the > MCV list and satisfy the clauses in stat_clauses. > Sure, but the extra bit of information allows us to (a) ignore the non-MCV part and (b) apply the 1/ndistinct estimate. > We can also estimate that there are (1-mcv_totalsel)*N rows that are > not in the MCV list, for which the MCV stats therefore tell us > nothing. The best way to estimate those rows would seem to be to use > the logic from the guts of clauselist_selectivity(), without > consulting any extended MCV stats (but still consulting other extended > stats, I think). Doing that would return a selectivity value (call it > nonmcv_sel) for those remaining rows. Then a reasonable estimate for > the overall selectivity would seem to be > > mcv_matchsel + (1-mcv_totalsel) * nonmcv_sel > > and there would be no need for mcv_update_bitmap() to track eqmatches > or return fullmatch, and it wouldn't actually matter whether or not we > had equality clauses or if all the MCV columns were used. > Right, although I'm not sure about fallback to clauselist_selectivity() which kinda throws away the statistical dependency. That's why I think we should use 1/ndistinct for equality clauses, and then perhaps leverage the MCV for non-equality clauses somehow. It just occurred we can apply the 1/ndistinct estimate for equalities even when we it's not a 'fullmatch'. So what I propose is roughly this 1) compute selectivity "mcv_sel" using MCV 2) see if there can be just a single match, and (mcv_sel > 0) - if yes, we're done and we don't need to look at non-MCV part 3) split the clauses into top-level equality clauses and the rest 4) estimate "equal_sel" for equality clauses using 1/ndistinct 5) estimate the "inequal_sel" for remaining clauses using MCV (assumes the selectivity will be the same on non-MCV part) 6) total selectivity is mcv_sel + (1 - mcv_totalsel) * equal_sel * inequal_sel We may need to fall back to clauselist_selectivity() in some cases, of course, but I think we should leverage the MCV as much as possible. Another thing is that some of this will change once the histograms are considered, which helps with estimating the non-MCV part. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: