Re: multivariate statistics / patch v7 - Mailing list pgsql-hackers
From:           Tomas Vondra
Subject:        Re: multivariate statistics / patch v7
Msg-id:         559C2BD7.6040508@2ndquadrant.com
In response to: Re: multivariate statistics / patch v7 (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses:      Re: multivariate statistics / patch v7
List:           pgsql-hackers
Hi,

On 07/07/2015 08:05 AM, Kyotaro HORIGUCHI wrote:
> Hi, Tomas. I'll kick the gas pedal.
>
>>> Thank you, it looks clearer. I have some comments from a brief look
>>> at this. This patchset is relatively large, so I will comment on a
>>> "per-notice" basis, which means I'll send comments before examining
>>> the entire patchset. Sorry in advance for the desultory comments.
>>
>> Sure. If you run into something that's not clear enough, I'm happy to
>> explain it (I tried to cover all the important details in the
>> comments, but it's a large patch, indeed.)
>
>>> - Single-variate stats have a mechanism to inject arbitrary
>>>   values as statistics, that is, get_relation_stats_hook and
>>>   similar stuff. I want a similar mechanism for multivariate
>>>   statistics, too.
>>
>> Fair point, although I'm not sure where we should place the hook,
>> how exactly it should be defined and how useful it would be in
>> the end. Can you give an example of how you'd use such a hook?
>
> ...
>
> We should carefully design the API to be able to point to the
> pertinent stats in every situation. Mv stats are based on the
> correlation of multiple columns, so I think relid and attribute list
> are enough as the parameters.
>
> | if (st.relid == param.relid && bms_equal(st.attnums, param.attnums))
> |     /* This is the stats to be wanted */
>
> If we can filter the appropriate stats from all the stats using the
> clause list, we can definitely build the appropriate parameter (column
> set) prior to retrieving mv statistics. Isn't that correct?

Let me briefly explain how the current clauselist_selectivity implementation works:

(1) Check if there are multivariate statistics on the table - if not, skip the multivariate parts altogether (the point of this is to minimize the impact on users who don't use the new feature).

(2) See if there are clauses compatible with multivariate stats - this only checks "general compatibility" without actually checking the existing stats (the point is to terminate early if the clauses are not compatible somehow - e.g. if the clauses reference only a single attribute, use unsupported operators, etc.).

(3) If there are multivariate stats and compatible clauses, the function choose_mv_statistics tries to find the best combination of multivariate stats with respect to the clauses (details later).

(4) The clauses are estimated using the stats; the remaining clauses are estimated using the current (single-attribute) statistics.

The only way to reliably inject new stats is by calling a hook before (1), allowing it to arbitrarily modify the list of stats. Based on the use cases you provided, I don't think it makes much sense to add additional hooks in the other phases.

At that point, however, it's not yet known which clauses are compatible with multivariate stats, or which attributes they reference. It might be possible to simply call pull_varattnos() and pass the result to the hook, except that does not work with RestrictInfo :-/

Or maybe we should not put the hook into clauselist_selectivity but somewhere else? Say, into get_relation_info, where we actually read the list of stats for the relation?
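For illustration, a hook called before (1) might look something like the sketch below. This is only a sketch with assumed names - mvstats_list_hook and fetch_mv_statistics() are made up here for the example, they are not part of the patch:

/*
 * Hypothetical sketch only.  A hook invoked before step (1) receives
 * the list of multivariate statistics found for the relation and may
 * return a modified list (adding, replacing or removing entries).
 */
typedef List *(*mvstats_list_hook_type) (PlannerInfo *root,
                                         RelOptInfo *rel,
                                         List *mvstats);

mvstats_list_hook_type mvstats_list_hook = NULL;

/* ... and at the very beginning of the multivariate part: */
static List *
get_mv_statistics(PlannerInfo *root, RelOptInfo *rel)
{
    List   *stats = fetch_mv_statistics(root, rel);  /* hypothetical lookup */

    /* let an extension inject / replace / remove stats entries */
    if (mvstats_list_hook)
        stats = (*mvstats_list_hook) (root, rel, stats);

    return stats;   /* NIL means "skip the multivariate parts" */
}

That would also work naturally if the lookup happens in get_relation_info instead, since the hook only transforms the list, wherever it is built.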
>>> 0001:
>>>
>>> - I also don't think it is the right thing for expression_tree_walker
>>>   to recognize RestrictInfo, since it is not a part of an expression.
>>
>> Yes. In my working git repo, I've reworked this to use the second
>> option, i.e. adding RestrictInfo to pull_(varno|varattno)_walker:
>>
>> https://github.com/tvondra/postgres/commit/2dc79b914c759d31becd8ae670b37b79663a595f
>>
>> Do you think this is the correct solution? If not, how to fix it?
>
> The reason why I think it is not appropriate is that RestrictInfo
> is not a part of an expression.
>
> Correcting the selectivity of a condition by column correlation
> occurs only for a set of conjunctive clauses. An OR operation divides
> the sets. Is that agreeable? RestrictInfos can be nested in each
> other and we should be aware of the AND/OR operators. This is what
> expression_tree_walker doesn't do.

I still don't understand why you think we need to differentiate between AND and OR operators. There's nothing wrong with estimating OR clauses using multivariate statistics.

> Perhaps we should provide a dedicated function such as
> find_conjunctive_attr_set which does this,

Perhaps. The reason why I added support for RestrictInfo into the existing walker implementations is that it seemed like the easiest way to fix the issue. But if there are reasons why that's incorrect, then inventing a new function is probably the right way.

> - Check the type of the top expression of the clause.
>
> - If it is a RestrictInfo, check clause_relids, then check the clause.
>
> - If it is a bool OR, stop the search and return an empty set of
>   attributes.
>
> - If it is a bool AND, check the components further. A list of
>   RestrictInfos should be treated as an AND connection.
>
> - If it is an operator expression, collect the used relids and attrs
>   by walking the expression tree.
>
> I might be missing something, but I think the outline is correct.

As I said before, there's nothing wrong with estimating OR clauses using multivariate statistics, so OR and AND should be handled exactly the same.

I think you're missing the fact that it's not enough to look at the relids from the RestrictInfo - we need to actually check the clauses used inside. That's because only some of the clauses are compatible with multivariate stats, and only if all the clauses of the BoolExpr are "compatible" can we estimate the clause as a whole. If it's a mix of supported and unsupported clauses, we simply pass it to clauselist_selectivity, which will repeat the whole process with the nested clauses.

> In addition to that, we should carefully avoid duplicate corrections
> using the same mv statistics.

Sure. That's what choose_mv_statistics does.

> I haven't understood precisely what choose_mv_statistics does, but I
> suppose what this function does could be split into a 'making the
> parameter to find stats' part and a 'matching the parameter with
> stats in order to retrieve the desired stats' part. Could you
> reconstruct this process into such a form?

The goal of choose_mv_statistics is very simple - given a list of clauses, it tries to find the best combination of statistics, exploiting as much information as possible.

So let's say you have clauses

    WHERE a=1 AND b=1 AND c=1 AND d=1

but you only have statistics on [a,b], [b,c] and [b,c,d].

The simplest approach would be to use the 'largest' statistics, covering the most columns from the clauses - in this case [b,c,d]. This is what the initial patches do.

The last patch improves this significantly, by combining the statistics using conditional probability. In this case it'd probably use all three statistics, effectively decomposing the selectivity like this:

    P(a=1,b=1,c=1,d=1) = P(a=1,b=1) * P(c=1|b=1) * P(d=1|b=1,c=1)
                            [a,b]        [b,c]         [b,c,d]

And each of those probabilities can be estimated using one of the stats.
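To make the arithmetic concrete, here's a toy walk-through of that decomposition with made-up selectivities (pure illustration, not anything the patch actually computes):

#include <stdio.h>

int main(void)
{
    /* Made-up selectivities, illustrating
     * P(a,b,c,d) = P(a,b) * P(c|b) * P(d|b,c). */
    double p_ab  = 0.20;    /* P(a=1,b=1)     from the [a,b] stats   */
    double p_b   = 0.25;    /* P(b=1)         from per-column stats  */
    double p_bc  = 0.10;    /* P(b=1,c=1)     from the [b,c] stats   */
    double p_bcd = 0.04;    /* P(b=1,c=1,d=1) from the [b,c,d] stats */

    /* conditional probabilities derived from the overlapping columns */
    double p_c_given_b  = p_bc / p_b;     /* P(c=1|b=1)     = 0.40 */
    double p_d_given_bc = p_bcd / p_bc;   /* P(d=1|b=1,c=1) = 0.40 */

    double sel = p_ab * p_c_given_b * p_d_given_bc;

    printf("combined selectivity: %f\n", sel);   /* 0.032 */
    return 0;
}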
> I feel it is too invasive, or excessively intermixed(?).

I don't think it really fits your model - the hook has to be called much sooner, effectively at the very beginning of clauselist_selectivity or even before that. Otherwise it might not get called at all (e.g. if there are no multivariate stats on the table, this whole part will be skipped).

>> Why should it stop at disjunctions? There's nothing wrong with using
>> multivariate stats to estimate OR-clauses, IMHO.
>
> Mv statistics represent how often *every combination of the column
> values* occurs. Is that correct? The combination can be read as
> coexistence, that is, AND. For example MV-MCV:
>
> (a, b, c)  freq
> (1, 2, 3)   100
> (1, 2, 5)    50
> (1, 3, 8)    20
> (1, 7, 2)     5
> ===============
> total       175
>
> | select * from t where a = 1 and b = 2 and c = 3;
> | SELECT 100
>
> This is correct,
>
> | select * from t where a = 1 and b = 2 or c = 3;
> | SELECT 100
>
> This is *not* correct. The correct number of tuples is 150.
> This is a simple example where OR breaks the MV stats assumption.

No, it does not. I'm not sure where the numbers are coming from, though.

So let's see how this actually works with multivariate statistics. I'll create a table with the 4 combinations you used in your example, but with 1000x more rows, to make the estimates a bit more accurate:

CREATE TABLE t (a INT, b INT, c INT);

INSERT INTO t SELECT 1, 2, 3 FROM generate_series(1,100000);
INSERT INTO t SELECT 1, 2, 5 FROM generate_series(1,50000);
INSERT INTO t SELECT 1, 3, 8 FROM generate_series(1,20000);
INSERT INTO t SELECT 1, 7, 2 FROM generate_series(1,5000);

ALTER TABLE t ADD STATISTICS (mcv) ON (a,b,c);

ANALYZE t;

And now let's see the two queries:

EXPLAIN select * from t where a = 1 and b = 2 and c = 3;
                         QUERY PLAN
----------------------------------------------------------
 Seq Scan on t  (cost=0.00..4008.50 rows=100403 width=12)
   Filter: ((a = 1) AND (b = 2) AND (c = 3))
(2 rows)

EXPLAIN select * from t where a = 1 and b = 2 or c = 3;
                         QUERY PLAN
----------------------------------------------------------
 Seq Scan on t  (cost=0.00..4008.50 rows=150103 width=12)
   Filter: (((a = 1) AND (b = 2)) OR (c = 3))
(2 rows)

So the first query estimates 100k rows, the second one 150k rows. Exactly as expected, because MCV lists are discrete, match the data perfectly, and behave exactly like your mental model.

If you try this with histograms though, you'll get the same estimate in both cases:

ALTER TABLE t DROP STATISTICS ALL;
ALTER TABLE t ADD STATISTICS (histogram) ON (a,b,c);
ANALYZE t;

EXPLAIN select * from t where a = 1 and b = 2 and c = 3;
                         QUERY PLAN
---------------------------------------------------------
 Seq Scan on t  (cost=0.00..4008.50 rows=52707 width=12)
   Filter: ((a = 1) AND (b = 2) AND (c = 3))
(2 rows)

EXPLAIN select * from t where a = 1 and b = 2 or c = 3;
                         QUERY PLAN
---------------------------------------------------------
 Seq Scan on t  (cost=0.00..4008.50 rows=52707 width=12)
   Filter: (((a = 1) AND (b = 2)) OR (c = 3))
(2 rows)

That's unfortunate, but it has nothing to do with any assumptions of multivariate statistics. The "problem" is that histograms are naturally fuzzy, and both conditions hit the same bucket. The solution is simple - don't use histograms for such discrete data.
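To see why AND and OR both work fine with an MCV list, consider this simplified sketch - MCVItem and mcv_item_matches() are hypothetical names assumed for illustration, not the patch's actual code. The estimator just sums the frequencies of the items for which the clause tree evaluates to true, and nothing in that loop cares whether the top-level operator is AND or OR:

/*
 * Hypothetical sketch.  mcv_item_matches() is assumed to evaluate an
 * (arbitrarily nested AND/OR) clause tree against a single item.
 */
typedef struct MCVItem
{
    Datum  *values;      /* one value per column of the statistics */
    double  frequency;   /* fraction of the table this item covers */
} MCVItem;

extern bool mcv_item_matches(MCVItem *item, Node *clauses);  /* assumed */

static double
mcv_clauselist_selectivity(MCVItem *items, int nitems, Node *clauses)
{
    double  sel = 0.0;

    for (int i = 0; i < nitems; i++)
    {
        if (mcv_item_matches(&items[i], clauses))
            sel += items[i].frequency;
    }

    return sel;
}

With a clause tree like ((a = 1) AND (b = 2)) OR (c = 3), this sums the (1,2,3) and (1,2,5) items - each item counted once - which is exactly the 150k estimate above.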
>>> ====
>>> =# CREATE TABLE t1 (a int, b int, c int);
>>> =# INSERT INTO t1 (SELECT a, a * 2, a * 3 FROM generate_series(0, 9999) a);
>>> =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
>>>    Seq Scan on t1  (cost=0.00..230.00 rows=1 width=12)
>>> =# ALTER TABLE t1 ADD STATISTICS (HISTOGRAM) ON (a, b, c);
>>> =# ANALYZE t1;
>>> =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
>>>    Seq Scan on t1  (cost=0.00..230.00 rows=268 width=12)
>>> ====
>>> Rows changed unwantedly.
>>
>> That has nothing to do with OR clauses, but rather with using a
>> type of statistics that does not fit the data and queries.
>> Histograms are quite inaccurate for discrete data and equality
>> conditions - in this case the clauses probably match one bucket,
>> and so we use 1/2 the bucket as an estimate. There's nothing wrong
>> with that.
>>
>> So let's use MCV instead:
>
> Hmm, the problem is not what specific number is displayed as rows.
> What is crucial is the fact that the rows estimate has changed even
> though it shouldn't have, as I demonstrated above.

Again, that has nothing to do with any assumptions, and it certainly does not demonstrate that OR clauses should not be handled by multivariate statistics. In this case, you're observing two effects:

(1) The natural inaccuracy of histograms when used for discrete data, especially in combination with equality conditions (because that's impossible to estimate accurately with histograms).

(2) The original estimate (without multivariate statistics) is only seemingly accurate, because it falsely assumes independence. It simply assumes that each condition matches 1/10000 of the table and multiplies that, getting a ~0.00001 row estimate. This is rounded up to 1, which is accidentally the exact value.

Let me demonstrate this on two examples - one with discrete data, one with a continuous distribution.
1) discrete data

CREATE TABLE t (a INT, b INT, c INT);

INSERT INTO t SELECT i/1000, 2*(i/1000), 3*(i/1000)
  FROM generate_series(1, 1000000) s(i);

ANALYZE t;

-- no multivariate stats (so assumption of independence)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 and c = 3;
 Seq Scan on t  (cost=0.00..22906.00 rows=1 width=12)
                (actual time=0.290..59.120 rows=1000 loops=1)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 3;
 Seq Scan on t  (cost=0.00..22906.00 rows=966 width=12)
                (actual time=0.434..117.643 rows=1000 loops=1)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 6;
 Seq Scan on t  (cost=0.00..22906.00 rows=966 width=12)
                (actual time=0.433..96.956 rows=2000 loops=1)

-- now let's add a histogram

ALTER TABLE t ADD STATISTICS (histogram) on (a,b,c);
ANALYZE t;

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 and c = 3;
 Seq Scan on t  (cost=0.00..22906.00 rows=817 width=12)
                (actual time=0.268..116.318 rows=1000 loops=1)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 3;
 Seq Scan on t  (cost=0.00..22906.00 rows=30333 width=12)
                (actual time=0.435..93.232 rows=1000 loops=1)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 6;
 Seq Scan on t  (cost=0.00..22906.00 rows=30333 width=12)
                (actual time=0.434..122.930 rows=2000 loops=1)

-- now let's use an MCV list

ALTER TABLE t DROP STATISTICS ALL;
ALTER TABLE t ADD STATISTICS (mcv) on (a,b,c);
ANALYZE t;

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 and c = 3;
 Seq Scan on t  (cost=0.00..22906.00 rows=767 width=12)
                (actual time=0.268..70.604 rows=1000 loops=1)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 3;
 Seq Scan on t  (cost=0.00..22906.00 rows=767 width=12)
                (actual time=0.268..70.604 rows=1000 loops=1)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 6;
 Seq Scan on t  (cost=0.00..22906.00 rows=1767 width=12)
                (actual time=0.428..100.607 rows=2000 loops=1)

The default estimate for the AND query is rather bad. For the OR clauses it's not that bad (OR selectivity happens to be less sensitive to the dependency here, but it's not difficult to construct counter-examples).

The histogram is not that good - for the OR queries it often results in over-estimates (for equality conditions on discrete data). But the MCV estimates are very accurate. The slight under-estimate is probably caused by the block sampling we're using to get sample rows.
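As a back-of-the-envelope check (plain arithmetic, not actual planner code), this is where the rows=1 estimate for the AND query comes from:

#include <stdio.h>

int main(void)
{
    /* Each column takes each value in exactly 1000 of the 1000000 rows,
     * so every per-column equality clause has selectivity 0.001. */
    double ntuples = 1000000.0;
    double sel_a = 0.001;   /* P(a = 1) */
    double sel_b = 0.001;   /* P(b = 2) */
    double sel_c = 0.001;   /* P(c = 3) */

    /* Under the independence assumption the selectivities are simply
     * multiplied, yielding 0.001 rows, which the planner clamps to 1.
     * The columns are perfectly correlated, so the truth is 1000 rows. */
    double rows = ntuples * sel_a * sel_b * sel_c;

    printf("estimated rows: %f\n", rows);
    return 0;
}

Under independence the three clauses multiply down to practically nothing, even though they are in fact perfectly correlated - which is exactly the gap the multivariate statistics close.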
2) continuous data (I'll only show histograms)

CREATE TABLE t (a FLOAT, b FLOAT, c FLOAT);

INSERT INTO t SELECT r, r + r*(random() - 0.5)/2, r + r*(random() - 0.5)/2
  FROM (SELECT random() as r FROM generate_series(1,1000000)) foo;

ANALYZE t;

-- no multivariate stats

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 and c < 0.3;
 Seq Scan on t  (cost=0.00..23870.00 rows=28768 width=24)
                (actual time=0.026..323.383 rows=273897 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c < 0.3;
 Seq Scan on t  (cost=0.00..23870.00 rows=372362 width=24)
                (actual time=0.026..375.005 rows=317533 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.9;
 Seq Scan on t  (cost=0.00..23870.00 rows=192979 width=24)
                (actual time=0.026..431.376 rows=393528 loops=1)

-- histograms

ALTER TABLE t ADD STATISTICS (histogram) on (a,b,c);
ANALYZE t;

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 and c < 0.3;
 Seq Scan on t  (cost=0.00..23870.00 rows=267033 width=24)
                (actual time=0.021..330.487 rows=273897 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.3;
 Seq Scan on t  (cost=0.00..23870.00 rows=14317 width=24)
                (actual time=0.027..906.321 rows=966870 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.9;
 Seq Scan on t  (cost=0.00..23870.00 rows=20367 width=24)
                (actual time=0.028..452.494 rows=393528 loops=1)

This seems wrong, because the estimate for the OR queries should not be lower than the estimate for the first query (with just the ANDs), and it should not increase as the boundary increases. I'd bet this is a bug in how the inequalities are handled with histograms, or how the AND/OR clauses are combined. I'll look into that. But once again, there's nothing that would make OR clauses somehow incompatible with multivariate stats.

kind regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services