Re: multivariate statistics / patch v7 - Mailing list pgsql-hackers
From:           Tomas Vondra
Subject:        Re: multivariate statistics / patch v7
Msg-id:         559C2BD7.6040508@2ndquadrant.com
In response to: Re: multivariate statistics / patch v7 (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses:      Re: multivariate statistics / patch v7
List:           pgsql-hackers
Hi,

On 07/07/2015 08:05 AM, Kyotaro HORIGUCHI wrote:
> Hi, Tomas. I'll kick the gas pedal.
>
>>> Thank you, it looks clearer. I have some comments from a brief look
>>> at this. This patchset is relatively large, so I will comment on a
>>> "per-notice" basis, which means I'll send comments before examining
>>> the entire patchset. Sorry in advance for the desultory comments.
>>
>> Sure. If you run into something that's not clear enough, I'm happy to
>> explain it (I tried to cover all the important details in the
>> comments, but it's a large patch, indeed.)
>
>>> - Single-variate stats have a mechanism to inject arbitrary
>>>   values as statistics, that is, get_relation_stats_hook and
>>>   similar stuff. I want a similar mechanism for multivariate
>>>   statistics, too.
>>
>> Fair point, although I'm not sure where we should place the hook,
>> how exactly it should be defined and how useful it would be in
>> the end. Can you give an example of how you'd use such a hook?
>
> ...
>
> We should carefully design the API to be able to point to the
> pertinent stats in every situation. Mv stats are based on the
> correlation of multiple columns, so I think relid and attribute list
> are enough as the parameters.
>
> | if (st.relid == param.relid && bms_equal(st.attnums, param.attnums))
> |     /* This is the stats to be wanted */
>
> If we can filter the appropriate stats from all the stats using the
> clause list, we can definitely build the appropriate parameter (column
> set) prior to retrieving mv statistics. Isn't that correct?

Let me briefly explain how the current clauselist_selectivity implementation works:

(1) Check if there are multivariate statistics on the table - if not, skip the multivariate parts altogether (the point of this is to minimize the impact on users who don't use the new feature).

(2) See if there are clauses compatible with multivariate stats - this only checks "general compatibility" without actually checking the existing stats (the point is to terminate early if the clauses are not compatible somehow - e.g. if the clauses reference only a single attribute, use unsupported operators, etc.).

(3) If there are multivariate stats and compatible clauses, the function choose_mv_statistics tries to find the best combination of multivariate stats with respect to the clauses (details later).

(4) The clauses are estimated using the stats; the remaining clauses are estimated using the current (single-attribute) statistics.

The only way to reliably inject new stats is by calling a hook before (1), allowing it to arbitrarily modify the list of stats. Based on the use cases you provided, I don't think it makes much sense to add additional hooks in the other phases.

At that point, however, it's not yet known which clauses are compatible with multivariate stats, or which attributes they reference. It might be possible to simply call pull_varattnos() and pass the result to the hook, except that does not work with RestrictInfo :-/

Or maybe we should not put the hook into clauselist_selectivity but somewhere else? Say, into get_relation_info, where we actually read the list of stats for the relation?
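For illustration, a hook called before (1) might look something like the sketch below. This is only a sketch with assumed names - mvstats_list_hook and fetch_mv_statistics() are made up here for the example, they are not part of the patch:

/*
 * Hypothetical sketch only.  A hook invoked before step (1) receives
 * the list of multivariate statistics found for the relation and may
 * return a modified list (adding, replacing or removing entries).
 */
typedef List *(*mvstats_list_hook_type) (PlannerInfo *root,
                                         RelOptInfo *rel,
                                         List *mvstats);

mvstats_list_hook_type mvstats_list_hook = NULL;

/* ... and at the very beginning of the multivariate part: */
static List *
get_mv_statistics(PlannerInfo *root, RelOptInfo *rel)
{
    List   *stats = fetch_mv_statistics(root, rel);  /* hypothetical lookup */

    /* let an extension inject / replace / remove stats entries */
    if (mvstats_list_hook)
        stats = (*mvstats_list_hook) (root, rel, stats);

    return stats;   /* NIL means "skip the multivariate parts" */
}

That would also work naturally if the lookup happens in get_relation_info instead, since the hook only transforms the list, wherever it is built.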
>>> 0001:
>>>
>>> - I also don't think it is the right thing for expression_tree_walker
>>>   to recognize RestrictInfo, since it is not a part of an expression.
>>
>> Yes. In my working git repo, I've reworked this to use the second
>> option, i.e. adding RestrictInfo to pull_(varno|varattno)_walker:
>>
>> https://github.com/tvondra/postgres/commit/2dc79b914c759d31becd8ae670b37b79663a595f
>>
>> Do you think this is the correct solution? If not, how to fix it?
>
> The reason why I think it is not appropriate is that RestrictInfo
> is not a part of an expression.
>
> Correcting the selectivity of a condition by column correlation
> occurs only for a set of conjunctive clauses. An OR operation divides
> the sets. Is that agreeable? RestrictInfos can be nested in each
> other and we should be aware of the AND/OR operators. This is what
> expression_tree_walker doesn't do.

I still don't understand why you think we need to differentiate between AND and OR operators. There's nothing wrong with estimating OR clauses using multivariate statistics.

> Perhaps we should provide a dedicated function such as
> find_conjunctive_attr_set which does this,

Perhaps. The reason why I added support for RestrictInfo into the existing walker implementations is that it seemed like the easiest way to fix the issue. But if there are reasons why that's incorrect, then inventing a new function is probably the right way.

> - Check the type of the top expression of the clause.
>
> - If it is a RestrictInfo, check clause_relids, then check the clause.
>
> - If it is a bool OR, stop the search and return an empty set of
>   attributes.
>
> - If it is a bool AND, check the components further. A list of
>   RestrictInfos should be treated as an AND connection.
>
> - If it is an operator expression, collect the used relids and attrs
>   by walking the expression tree.
>
> I might be missing something, but I think the outline is correct.

As I said before, there's nothing wrong with estimating OR clauses using multivariate statistics, so OR and AND should be handled exactly the same.

I think you're missing the fact that it's not enough to look at the relids from the RestrictInfo - we need to actually check the clauses used inside. That's because only some of the clauses are compatible with multivariate stats, and only if all the clauses of the BoolExpr are "compatible" can we estimate the clause as a whole. If it's a mix of supported and unsupported clauses, we simply pass it to clauselist_selectivity, which will repeat the whole process with the nested clauses.

> In addition to that, we should carefully avoid duplicate corrections
> using the same mv statistics.

Sure. That's what choose_mv_statistics does.

> I haven't understood precisely what choose_mv_statistics does, but I
> suppose what this function does could be split into a 'making the
> parameter to find stats' part and a 'matching the parameter with
> stats in order to retrieve the desired stats' part. Could you
> reconstruct this process into such a form?

The goal of choose_mv_statistics is very simple - given a list of clauses, it tries to find the best combination of statistics, exploiting as much information as possible.

So let's say you have clauses

    WHERE a=1 AND b=1 AND c=1 AND d=1

but you only have statistics on [a,b], [b,c] and [b,c,d].

The simplest approach would be to use the 'largest' statistics, covering the most columns from the clauses - in this case [b,c,d]. This is what the initial patches do.

The last patch improves this significantly, by combining the statistics using conditional probability. In this case it'd probably use all three statistics, effectively decomposing the selectivity like this:

    P(a=1,b=1,c=1,d=1) = P(a=1,b=1) * P(c=1|b=1) * P(d=1|b=1,c=1)
                            [a,b]        [b,c]         [b,c,d]

And each of those probabilities can be estimated using one of the stats.
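To make the arithmetic concrete, here's a toy walk-through of that decomposition with made-up selectivities (pure illustration, not anything the patch actually computes):

#include <stdio.h>

int main(void)
{
    /* Made-up selectivities, illustrating
     * P(a,b,c,d) = P(a,b) * P(c|b) * P(d|b,c). */
    double p_ab  = 0.20;    /* P(a=1,b=1)     from the [a,b] stats   */
    double p_b   = 0.25;    /* P(b=1)         from per-column stats  */
    double p_bc  = 0.10;    /* P(b=1,c=1)     from the [b,c] stats   */
    double p_bcd = 0.04;    /* P(b=1,c=1,d=1) from the [b,c,d] stats */

    /* conditional probabilities derived from the overlapping columns */
    double p_c_given_b  = p_bc / p_b;     /* P(c=1|b=1)     = 0.40 */
    double p_d_given_bc = p_bcd / p_bc;   /* P(d=1|b=1,c=1) = 0.40 */

    double sel = p_ab * p_c_given_b * p_d_given_bc;

    printf("combined selectivity: %f\n", sel);   /* 0.032 */
    return 0;
}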
> I feel it is too invasive, or excessively intermixed(?).

I don't think it really fits your model - the hook has to be called much sooner, effectively at the very beginning of clauselist_selectivity or even before that. Otherwise it might not get called at all (e.g. if there are no multivariate stats on the table, this whole part will be skipped).

>> Why should it stop at disjunctions? There's nothing wrong with using
>> multivariate stats to estimate OR-clauses, IMHO.
>
> Mv statistics represent how often *every combination of the column
> values* occurs. Is that correct? The combination can be read as
> coexistence, that is, AND. For example MV-MCV:
>
> (a, b, c)  freq
> (1, 2, 3)   100
> (1, 2, 5)    50
> (1, 3, 8)    20
> (1, 7, 2)     5
> ===============
> total       175
>
> | select * from t where a = 1 and b = 2 and c = 3;
> | SELECT 100
>
> This is correct,
>
> | select * from t where a = 1 and b = 2 or c = 3;
> | SELECT 100
>
> This is *not* correct. The correct number of tuples is 150.
> This is a simple example where OR breaks the MV stats assumption.

No, it does not. I'm not sure where the numbers are coming from, though.

So let's see how this actually works with multivariate statistics. I'll create a table with the 4 combinations you used in your example, but with 1000x more rows, to make the estimates a bit more accurate:

CREATE TABLE t (a INT, b INT, c INT);

INSERT INTO t SELECT 1, 2, 3 FROM generate_series(1,100000);
INSERT INTO t SELECT 1, 2, 5 FROM generate_series(1,50000);
INSERT INTO t SELECT 1, 3, 8 FROM generate_series(1,20000);
INSERT INTO t SELECT 1, 7, 2 FROM generate_series(1,5000);

ALTER TABLE t ADD STATISTICS (mcv) ON (a,b,c);

ANALYZE t;

And now let's see the two queries:

EXPLAIN select * from t where a = 1 and b = 2 and c = 3;
                         QUERY PLAN
----------------------------------------------------------
 Seq Scan on t  (cost=0.00..4008.50 rows=100403 width=12)
   Filter: ((a = 1) AND (b = 2) AND (c = 3))
(2 rows)

EXPLAIN select * from t where a = 1 and b = 2 or c = 3;
                         QUERY PLAN
----------------------------------------------------------
 Seq Scan on t  (cost=0.00..4008.50 rows=150103 width=12)
   Filter: (((a = 1) AND (b = 2)) OR (c = 3))
(2 rows)

So the first query estimates 100k rows, the second one 150k rows. Exactly as expected, because MCV lists are discrete, match the data perfectly, and behave exactly like your mental model.

If you try this with histograms though, you'll get the same estimate in both cases:

ALTER TABLE t DROP STATISTICS ALL;
ALTER TABLE t ADD STATISTICS (histogram) ON (a,b,c);
ANALYZE t;

EXPLAIN select * from t where a = 1 and b = 2 and c = 3;
                         QUERY PLAN
---------------------------------------------------------
 Seq Scan on t  (cost=0.00..4008.50 rows=52707 width=12)
   Filter: ((a = 1) AND (b = 2) AND (c = 3))
(2 rows)

EXPLAIN select * from t where a = 1 and b = 2 or c = 3;
                         QUERY PLAN
---------------------------------------------------------
 Seq Scan on t  (cost=0.00..4008.50 rows=52707 width=12)
   Filter: (((a = 1) AND (b = 2)) OR (c = 3))
(2 rows)

That's unfortunate, but it has nothing to do with any assumptions of multivariate statistics. The "problem" is that histograms are naturally fuzzy, and both conditions hit the same bucket. The solution is simple - don't use histograms for such discrete data.
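To see why AND and OR both work fine with an MCV list, consider this simplified sketch - MCVItem and mcv_item_matches() are hypothetical names assumed for illustration, not the patch's actual code. The estimator just sums the frequencies of the items for which the clause tree evaluates to true, and nothing in that loop cares whether the top-level operator is AND or OR:

/*
 * Hypothetical sketch.  mcv_item_matches() is assumed to evaluate an
 * (arbitrarily nested AND/OR) clause tree against a single item.
 */
typedef struct MCVItem
{
    Datum  *values;      /* one value per column of the statistics */
    double  frequency;   /* fraction of the table this item covers */
} MCVItem;

extern bool mcv_item_matches(MCVItem *item, Node *clauses);  /* assumed */

static double
mcv_clauselist_selectivity(MCVItem *items, int nitems, Node *clauses)
{
    double  sel = 0.0;

    for (int i = 0; i < nitems; i++)
    {
        if (mcv_item_matches(&items[i], clauses))
            sel += items[i].frequency;
    }

    return sel;
}

With a clause tree like ((a = 1) AND (b = 2)) OR (c = 3), this sums the (1,2,3) and (1,2,5) items - each item counted once - which is exactly the 150k estimate above.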
>>> ====
>>> =# CREATE TABLE t1 (a int, b int, c int);
>>> =# INSERT INTO t1 (SELECT a, a * 2, a * 3 FROM generate_series(0, 9999) a);
>>> =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
>>>    Seq Scan on t1  (cost=0.00..230.00 rows=1 width=12)
>>> =# ALTER TABLE t1 ADD STATISTICS (HISTOGRAM) ON (a, b, c);
>>> =# ANALYZE t1;
>>> =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
>>>    Seq Scan on t1  (cost=0.00..230.00 rows=268 width=12)
>>> ====
>>> Rows changed unwantedly.
>>
>> That has nothing to do with OR clauses, but rather with using a
>> type of statistics that does not fit the data and queries.
>> Histograms are quite inaccurate for discrete data and equality
>> conditions - in this case the clauses probably match one bucket,
>> and so we use 1/2 the bucket as an estimate. There's nothing wrong
>> with that.
>>
>> So let's use MCV instead:
>
> Hmm, the problem is not what specific number is displayed as rows.
> What is crucial is the fact that the rows estimate has changed even
> though it shouldn't have, as I demonstrated above.

Again, that has nothing to do with any assumptions, and it certainly does not demonstrate that OR clauses should not be handled by multivariate statistics. In this case, you're observing two effects:

(1) The natural inaccuracy of histograms when used for discrete data, especially in combination with equality conditions (because that's impossible to estimate accurately with histograms).

(2) The original estimate (without multivariate statistics) is only seemingly accurate, because it falsely assumes independence. It simply assumes that each condition matches 1/10000 of the table and multiplies that, getting a ~0.00001 row estimate. This is rounded up to 1, which is accidentally the exact value.

Let me demonstrate this on two examples - one with discrete data, one with a continuous distribution.
1) discrete data

CREATE TABLE t (a INT, b INT, c INT);

INSERT INTO t SELECT i/1000, 2*(i/1000), 3*(i/1000)
  FROM generate_series(1, 1000000) s(i);

ANALYZE t;

-- no multivariate stats (so assumption of independence)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 and c = 3;
 Seq Scan on t  (cost=0.00..22906.00 rows=1 width=12)
                (actual time=0.290..59.120 rows=1000 loops=1)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 3;
 Seq Scan on t  (cost=0.00..22906.00 rows=966 width=12)
                (actual time=0.434..117.643 rows=1000 loops=1)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 6;
 Seq Scan on t  (cost=0.00..22906.00 rows=966 width=12)
                (actual time=0.433..96.956 rows=2000 loops=1)

-- now let's add a histogram

ALTER TABLE t ADD STATISTICS (histogram) on (a,b,c);
ANALYZE t;

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 and c = 3;
 Seq Scan on t  (cost=0.00..22906.00 rows=817 width=12)
                (actual time=0.268..116.318 rows=1000 loops=1)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 3;
 Seq Scan on t  (cost=0.00..22906.00 rows=30333 width=12)
                (actual time=0.435..93.232 rows=1000 loops=1)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 6;
 Seq Scan on t  (cost=0.00..22906.00 rows=30333 width=12)
                (actual time=0.434..122.930 rows=2000 loops=1)

-- now let's use an MCV list

ALTER TABLE t DROP STATISTICS ALL;
ALTER TABLE t ADD STATISTICS (mcv) on (a,b,c);
ANALYZE t;

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 and c = 3;
 Seq Scan on t  (cost=0.00..22906.00 rows=767 width=12)
                (actual time=0.268..70.604 rows=1000 loops=1)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 3;
 Seq Scan on t  (cost=0.00..22906.00 rows=767 width=12)
                (actual time=0.268..70.604 rows=1000 loops=1)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 6;
 Seq Scan on t  (cost=0.00..22906.00 rows=1767 width=12)
                (actual time=0.428..100.607 rows=2000 loops=1)

The default estimate for the AND query is rather bad. For the OR clauses it's not that bad (OR selectivity happens to be less sensitive to the dependency here, but it's not difficult to construct counter-examples).

The histogram is not that good - for the OR queries it often results in over-estimates (for equality conditions on discrete data). But the MCV estimates are very accurate. The slight under-estimate is probably caused by the block sampling we're using to get sample rows.
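As a back-of-the-envelope check (plain arithmetic, not actual planner code), this is where the rows=1 estimate for the AND query comes from:

#include <stdio.h>

int main(void)
{
    /* Each column takes each value in exactly 1000 of the 1000000 rows,
     * so every per-column equality clause has selectivity 0.001. */
    double ntuples = 1000000.0;
    double sel_a = 0.001;   /* P(a = 1) */
    double sel_b = 0.001;   /* P(b = 2) */
    double sel_c = 0.001;   /* P(c = 3) */

    /* Under the independence assumption the selectivities are simply
     * multiplied, yielding 0.001 rows, which the planner clamps to 1.
     * The columns are perfectly correlated, so the truth is 1000 rows. */
    double rows = ntuples * sel_a * sel_b * sel_c;

    printf("estimated rows: %f\n", rows);
    return 0;
}

Under independence the three clauses multiply down to practically nothing, even though they are in fact perfectly correlated - which is exactly the gap the multivariate statistics close.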
2) continuous data (I'll only show histograms)

CREATE TABLE t (a FLOAT, b FLOAT, c FLOAT);

INSERT INTO t SELECT r, r + r*(random() - 0.5)/2, r + r*(random() - 0.5)/2
  FROM (SELECT random() as r FROM generate_series(1,1000000)) foo;

ANALYZE t;

-- no multivariate stats

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 and c < 0.3;
 Seq Scan on t  (cost=0.00..23870.00 rows=28768 width=24)
                (actual time=0.026..323.383 rows=273897 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c < 0.3;
 Seq Scan on t  (cost=0.00..23870.00 rows=372362 width=24)
                (actual time=0.026..375.005 rows=317533 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.9;
 Seq Scan on t  (cost=0.00..23870.00 rows=192979 width=24)
                (actual time=0.026..431.376 rows=393528 loops=1)

-- histograms

ALTER TABLE t ADD STATISTICS (histogram) on (a,b,c);
ANALYZE t;

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 and c < 0.3;
 Seq Scan on t  (cost=0.00..23870.00 rows=267033 width=24)
                (actual time=0.021..330.487 rows=273897 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.3;
 Seq Scan on t  (cost=0.00..23870.00 rows=14317 width=24)
                (actual time=0.027..906.321 rows=966870 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.9;
 Seq Scan on t  (cost=0.00..23870.00 rows=20367 width=24)
                (actual time=0.028..452.494 rows=393528 loops=1)

This seems wrong, because the estimate for the OR queries should not be lower than the estimate for the first query (with just the ANDs), and it should not increase as the boundary increases. I'd bet this is a bug in how the inequalities are handled with histograms, or how the AND/OR clauses are combined. I'll look into that. But once again, there's nothing that would make OR clauses somehow incompatible with multivariate stats.

kind regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services