Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date:
Msg-id: d6e5e15d-3f53-239a-7064-bb99f037a0da@2ndquadrant.com
In response to: Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
List: pgsql-hackers
On 07/13/2018 01:19 PM, Dean Rasheed wrote:
> On 24 June 2018 at 20:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> Attached is a rebased version of this patch series, mostly just fixing
>> the breakage caused by reworked format of initial catalog data.
>>
>> Aside from that, the MCV building now adopts the logic introduced by
>> commit b5db1d93d2 for single-column MCV lists. The new algorithm seems
>> pretty good and I don't see why multi-column MCV lists should use
>> something special.
>
> Agreed.
>
>> I'm sure there are plenty of open questions to discuss, particularly
>> stuff related to combining the various types of statistics to the final
>> estimate (a lot of that was already improved based on Dean's reviews).
>
> Yes, that's definitely one of the trickier parts of this. I don't
> think that the current algorithm is ideal as it stands. In particular,
> the way that it attempts to handle complex combinations of clauses
> doesn't look right. I think mcv_clauselist_selectivity() and
> histogram_clauselist_selectivity() are plausibly correct, but the way
> that the resulting selectivities are combined in
> statext_clauselist_selectivity() doesn't seem right. In particular,
> the use of estimate_equality_groups() to count "nmatches" and
> "fullmatch" only takes into account top-level equality clauses, so it
> will fail to recognise other cases like (a=1 AND (b=1 OR b=2)) which
> might be fully covered by, say, the MCV stats. Consider, for example,
> the following simple test case:
>
> create table foo(a int, b int);
> insert into foo select 1,1 from generate_series(1,50000);
> insert into foo select 1,2 from generate_series(1,40000);
> insert into foo select 1,x/10 from generate_series(30,250000) g(x);
> insert into foo select 2,1 from generate_series(1,30000);
> insert into foo select 2,2 from generate_series(1,20000);
> insert into foo select 2,x/10 from generate_series(30,500000) g(x);
> insert into foo select 3,1 from generate_series(1,10000);
> insert into foo select 3,2 from generate_series(1,5000);
> insert into foo select 3,x from generate_series(3,600000) g(x);
> insert into foo select x,x/10 from generate_series(4,750000) g(x);
>
> create statistics foo_mcv_ab (mcv) on a,b from foo;
> analyse foo;
>
> explain analyse select * from foo where a=1 and b=1;
> -- Actual rows: 50000, Estimated: 52690 (14149 without MV-stats)
>
> explain analyse select * from foo where a=1 and b=2;
> -- Actual rows: 40000, Estimated: 41115 (10534 without MV-stats)
>
> explain analyse select * from foo where a=1 and (b=1 or b=2);
> -- Actual rows: 90000, Estimated: 181091 (24253 without MV-stats)
>
> explain analyse select * from foo where (a=1 or a=2) and (b=1 or b=2);
> -- Actual rows: 140000, Estimated: 276425 (56716 without MV-stats)
>
> In the first 2 queries the multivariate MCV stats help a lot and give
> good estimates, but in the last 2 queries the estimates are around
> twice as large as they should be, even though good MCV stats are
> available on those specific values.
>

I'm not so sure. The issue is that a lot of the MCV deductions depend
on whether we can answer questions like "Is there a single match?" or
"If we got a match in the MCV, do we need to look at the non-MCV part?"
This is not very different from the single-column estimates, except
that here we need to look at multiple columns. The top-level clauses
allow us to make such deductions; with deeper clauses it's much more
difficult (perhaps impossible).

Because, for example, with (a=1 AND b=1) there can be just a single
match, so if we find it in the MCV we're done. With clauses like
((a=1 OR a=2) AND (b=1 OR b=2)) it's not that simple, because there
may be multiple combinations, and so a match in the MCV does not
guarantee anything. I don't think there's a way around this. The
single-dimensional case does not have this issue, of course.
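To illustrate the difference with a toy example (standalone C, made-up
two-column MCV items, nothing from the actual patch): with equalities on
all statistics columns at most one MCV item can ever match, so a hit
settles the estimate, while the OR form can match several items and
further matching rows may sit outside the MCV list entirely.

/*
 * Toy illustration only -- made-up MCV frequencies, not patch code.
 */
#include <stdio.h>
#include <stdbool.h>

typedef struct { int a; int b; double freq; } MCVItem;

static const MCVItem mcv[] = {
    {1, 1, 0.10}, {1, 2, 0.08}, {2, 1, 0.06}, {2, 2, 0.04},
};

/* sum frequencies of MCV items whose values appear in the given sets */
static double
mcv_selectivity(const int *avals, int na, const int *bvals, int nb,
                int *nmatches)
{
    double  sel = 0.0;

    *nmatches = 0;
    for (int i = 0; i < (int) (sizeof(mcv) / sizeof(mcv[0])); i++)
    {
        bool    a_ok = false;
        bool    b_ok = false;

        for (int j = 0; j < na; j++)
            a_ok |= (mcv[i].a == avals[j]);
        for (int j = 0; j < nb; j++)
            b_ok |= (mcv[i].b == bvals[j]);

        if (a_ok && b_ok)
        {
            sel += mcv[i].freq;
            (*nmatches)++;
        }
    }
    return sel;
}

int
main(void)
{
    int     one[] = {1};
    int     onetwo[] = {1, 2};
    int     n;
    double  sel;

    /* a=1 AND b=1: only one combination can exist, so a hit is the answer */
    sel = mcv_selectivity(one, 1, one, 1, &n);
    printf("a=1 AND b=1: sel=%.2f matches=%d\n", sel, n);

    /* (a=1 OR a=2) AND (b=1 OR b=2): several combinations can match, and
     * more rows with these values may hide outside the MCV list, so the
     * MCV sum is only a lower bound */
    sel = mcv_selectivity(onetwo, 2, onetwo, 2, &n);
    printf("(a=1 OR a=2) AND (b=1 OR b=2): sel=%.2f matches=%d\n", sel, n);

    return 0;
}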
> The tricky thing is to work out how to correctly combine the various
> stats that are available. In the above examples, the selectivity
> returned by mcv_clauselist_selectivity() would have been basically
> correct, but since it will have not been identified as a fullmatch and
> some non-equality clauses will have been seen at the top-level (the OR
> clauses), it ends up adding on additional selectivities from
> clauselist_selectivity().
>
> I think perhaps it might be better not to attempt to combine the
> *overall* selectivity returned by mcv_clauselist_selectivity() with
> that returned by clauselist_selectivity(), but rather merge the logic
> of these two functions together into a single traversal of the
> clause-tree. That way, the various individual selectivities can be
> combined on a clause-by-clause basis to give the best running total
> based on the available information. Even when the multivariate stats
> are incomplete, they may still provide a useful lower bound on the
> selectivity.

I don't follow. The example you presented above showed the multivariate
stats producing over-estimates, so how would it be helpful to use them
as a lower bound for anything?

> If/when all MCV columns have been matched exactly, that
> lower bound might turn into the appropriate overall result, if there
> is a matching MCV entry.

Isn't the problem illustrated by the examples that we don't know whether
the MCV matches represent all matches, or whether there may be further
matches in the histogram?

> For example, suppose that there are MCV stats
> on 3 columns a,b,c and a WHERE clause like (a=1 AND b=2 AND c=3). You
> might process that something like:
>
> * Get sel(a=1) using the normal univariate stats
> * Update the multivariate MCV stats match bitmap based on a=1
> * Get sel(b=2) using the normal univariate stats
> * Compute the total selectivity assuming independence:
>     total_sel = sel(a=1)*sel(b=2)
> * Update the multivariate MCV stats match bitmap based on b=2
> * Compute the multivariate MCV selectivity so far:
>     mcv_sel = Sum of MCV frequencies that match so far
> * Use that as a lower bound:
>     total_sel = Max(total_sel, mcv_sel)
> * Get sel(c=3) using the normal univariate stats
> * Compute the new total selectivity assuming independence:
>     total_sel *= sel(c=3)
> * Update the multivariate MCV stats match bitmap based on c=3
> * Compute the new multivariate MCV selectivity:
>     mcv_sel = Sum of MCV frequencies that now match
> * Use that as a new lower bound:
>     total_sel = Max(total_sel, mcv_sel)
>
> so it becomes simpler to merge the selectivities, because it need only
> worry about one clause at a time, and it still makes use of partial
> information.
>

I'm not sure how this makes it any simpler? It's pretty much what we do
now -- we update the bitmaps clause-by-clause. We can probably make
better use of the univariate estimates, though, using them to deduce
upper/lower bounds in various places (because the multivariate stats
are generally coarser than the univariate ones).
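Just to check I'm reading the proposal the same way, this is how I'd
write that loop as a standalone toy (plain C, made-up per-clause
selectivities and MCV items, nothing from the actual patch):

/*
 * Not patch code -- a self-contained sketch of the running total for
 * (a=1 AND b=2 AND c=3), with invented numbers.
 */
#include <stdio.h>
#include <stdbool.h>

#define NCLAUSES    3
#define NMCV        4

/* hypothetical univariate selectivities of a=1, b=2, c=3 */
static const double clause_sel[NCLAUSES] = {0.30, 0.20, 0.10};

/* hypothetical MCV items: frequency plus per-clause match flags */
static const double clause_freq[NMCV] = {0.05, 0.04, 0.03, 0.02};
static const bool mcv_match[NMCV][NCLAUSES] = {
    {true, true, true},     /* e.g. (1,2,3) */
    {true, true, false},    /* e.g. (1,2,9) */
    {true, false, true},    /* e.g. (1,7,3) */
    {false, true, true},    /* e.g. (5,2,3) */
};

int
main(void)
{
    bool    matches[NMCV] = {true, true, true, true};
    double  total_sel = 1.0;

    for (int c = 0; c < NCLAUSES; c++)
    {
        double  mcv_sel = 0.0;

        /* univariate estimate for this clause, combined under independence */
        total_sel *= clause_sel[c];

        /* update the MCV match bitmap and sum still-matching frequencies */
        for (int i = 0; i < NMCV; i++)
        {
            matches[i] = matches[i] && mcv_match[i][c];
            if (matches[i])
                mcv_sel += clause_freq[i];
        }

        /* frequencies of the matching MCV items act as a lower bound */
        if (mcv_sel > total_sel)
            total_sel = mcv_sel;

        printf("after clause %d: total_sel = %.4f\n", c + 1, total_sel);
    }

    return 0;
}

The interesting step is the Max() clamp -- as soon as the frequencies of
the still-matching MCV items exceed the independence-based product, the
MCV estimate takes over, and with a full match that is what comes out at
the end.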
> If there was no MCV entry for (a=1,b=2,c=3), it will still have made
> use of any MCV frequencies for (a=1,b=2) to give a somewhat better
> estimate, and it will have made use of any available univariate stats,
> which might be better under some circumstances.
>

IMHO it's quite dangerous to use the MCV list like this. The values
that get into MCV lists are often/usually somewhat exceptional, and
using them to estimate probability distributions on the non-MCV part
is tricky.

> I think this approach generalises quite simply to arbitrary AND/OR
> combinations, and as discussed before, I don't think that it needs to
> handle NOT clauses except in the special case of a "NOT bool_var"
> clause.
>
> One drawback of this approach is that the result will depend on the
> order the clauses are processed, but maybe that's OK, given that we
> can't reasonably try all possible combinations.
>
>> One thing that occurred to me while comparing the single-column logic
>> (as implemented in selfuncs.c) and the new multi-column stuff, is
>> dealing with partially-matching histogram buckets.
>>
>> In the single-column case, we pretty much assume uniform distribution
>> in each bucket, and linearly interpolate the selectivity. So for a
>> bucket with boundaries [0, 10] and condition "x <= 5" we return 0.5,
>> for "x < 7" we return 0.7 and so on. This is what convert_to_scalar()
>> does.
>>
>> In the multi-column case, we simply count each matching bucket as 0.5,
>> without any attempt to linearly interpolate. It would not be difficult
>> to call "convert_to_scalar" for each condition (essentially repeating
>> the linear interpolation for each column), but then what? We could
>> simply compute a product of those results, of course, but that only
>> works assuming independence. And that's exactly the wrong thing to
>> assume here, considering the extended statistics are meant for cases
>> where the columns are not independent.
>>
>> So I'd argue the 0.5 estimate for partially-matching buckets is the
>> right thing to do here, as it's minimizing the average error.
>
> Hmm, that seems a bit dubious to me.
>
> I think anything that tried to interpolate a value between 0 and the
> bucket frequency ought to be better, at least in cases where nearly
> all or nearly none of the bucket is matched. So then it just becomes a
> question of how best to interpolate.
>
> As you say, if the columns were independent, simply taking the product
> would probably be right, but if the columns were fully dependent on
> one another, it's not at all obvious what the best interpolation is,
> because the bucket may be long and thin, with most of the data at one
> end. However, in the absence of any other information, a reasonable
> approach might be just to take the geometric mean -- i.e., the nth
> root of the product.
>
> So perhaps a reasonable interpolation algorithm would be to take the
> product to some power, determined from an estimate of the degree of
> dependence in the histogram. I think there's enough information in the
> histogram data to get an estimate of that -- the bucket's size
> relative to the total data extents vs the bucket frequency.
>

That's an interesting idea. I'll explore doing something like that.
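To have something concrete to experiment with, a toy version of that
interpolation could look like this (standalone C, not patch code; the
degree of dependence is just a made-up input here, 0.0 for independent
columns and 1.0 for fully dependent ones -- estimating it from the
bucket size vs. its frequency is the part that would need real work):

#include <stdio.h>
#include <math.h>

/*
 * fractions[] are the per-column linear interpolations (what
 * convert_to_scalar() gives for each condition); dependence slides the
 * result from the plain product (independent columns) to the geometric
 * mean, i.e. the nth root of the product (fully dependent columns).
 */
static double
bucket_fraction(const double *fractions, int ncolumns, double dependence)
{
    double  product = 1.0;
    double  exponent;

    for (int i = 0; i < ncolumns; i++)
        product *= fractions[i];

    /* exponent goes from 1.0 (independent) down to 1/n (dependent) */
    exponent = (1.0 - dependence) + dependence / ncolumns;

    return pow(product, exponent);
}

int
main(void)
{
    double  fractions[2] = {0.5, 0.7};  /* made-up per-column fractions */

    printf("independent:     %.3f\n", bucket_fraction(fractions, 2, 0.0));
    printf("fully dependent: %.3f\n", bucket_fraction(fractions, 2, 1.0));
    printf("in between:      %.3f\n", bucket_fraction(fractions, 2, 0.5));

    return 0;
}

With per-column fractions 0.5 and 0.7 this gives 0.35 for independent
columns and about 0.59 (the geometric mean) for fully dependent ones, so
the open question is really how to derive the exponent from the
histogram data.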
> On a different note, reading another recent thread [1] made me realise
> there's another thing this patch needs to worry about -- the new code
> needs to be calling statistic_proc_security_check() to determine
> whether it's OK to be using these statistics -- c.f. commit
> e2d4ef8de8.
>
> Similarly, I think that access to pg_statistic_ext should be
> restricted in the same way that access to pg_statistic is, with a SB
> view on top. It's probably OK as it is now with just ndistinct and
> dependency degree stats, since they don't reveal actual data values,
> but the addition of MCV stats changes that.
>

Phew! Who needs security? ;-)

> That's it for now. I hope some of that was useful.
>

Certainly. Thanks for sharing the thoughts.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services