Home > mailing lists

Re: multivariate statistics v14 - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: multivariate statistics v14
Date	March 22, 2016 12:37:33
Msg-id	10036d4a-f52d-6366-1251-367ba2cf51e9@2ndquadrant.com Whole thread Raw
In response to	Re: multivariate statistics v14 (Tatsuo Ishii <ishii@postgresql.org>)
Responses	Re: multivariate statistics v14
List	pgsql-hackers

Tree view

Hi,

On 03/22/2016 11:41 AM, Tatsuo Ishii wrote:
>>> Hum. So without 0006 or beyond, there's not much benefit for the
>>> PostgreSQL users, and you are not too confident about 0006 or
>>> beyond. Then I would think it is a little bit hard to justify in
>>> putting 000[2-5] into 9.6. I really like this feature and would
>>> like to see in PostgreSQL someday, but I'm not sure if we should
>>> put the patches (0002-0005) into PostgreSQL now. Please let me
>>> know if there's some reaons we should put the patches into
>>> PostgreSQL now.
>>
>> I don't think so. While being able to combine multiple statistics
>> is certainly useful, I'm convinced that the initial patched add
>> enough
>
> Can you please elaborate a little bit more how combining multiple
> statistics is useful?

Sure.

The goal of multivariate statistics is to approximate a probability 
distribution on a group of columns. The larger the number of columns, 
the less accurate the statistics will be (with respect to individual 
columns), assuming fixed size of the sample in ANALYZE, and fixed 
statistics size.

For example, if you add a column to multivariate histogram, you'll do 
some "bucket splits" by this dimension, thus reducing the accuracy for 
the other columns. You may of course allow larger statistics (e.g. 
histograms with more buckets), but that also requires larger samples, 
and so on.

Now, let's  assume you have a query like this:
    WHERE (a=1) AND (b=2) AND (c=3) AND (d=4)

and that "a" and "b" are correlated, and "c" and "d" are correlated, but 
that otherwise the columns are independent. It'd be a bit silly to 
require building statistics on (a,b,c,d), when two statistics on each of 
the column pairs would be cheaper and also more accurate.

That's of course a trivial case - independent groups of correlated 
columns. But I'd say this is actually a pretty common case, and I do 
believe there's not much controversy that we should support it.

Another reason to allow multiple statistics is that columns in one group 
may be a good fit for MCV list (which works well for discrete values), 
while the other group may be a good candidate for histogram (which works 
well for continuous values). This can't be solved by first building a 
MCV and then a histogram on the group.

The question of course is what to do if the groups are not independent. 
The patch does that by assuming the statistics overlap, and uses 
conditions on the columns included in both statistics to combine them 
using conditional probabilities. I do believe this works quite well, but 
this is perhaps the part that needs further discussion. There are other 
ways to combine the statistics, but I do expect them to be considerably 
more expensive.

Is this a sufficient explanation?

Of course, there's a fair amount of additional complexity that I have 
not mentioned here (e.g. selecting the right combination of stats).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pgsql-hackers by date:

From: Fabrízio de Royes Mello
Date: 22 March 2016, 12:35:29
Subject: Re: NOT EXIST for PREPARE

From: Tatsuo Ishii
Date: 22 March 2016, 12:46:22
Subject: Re: multivariate statistics v14

Re: multivariate statistics v14 - Mailing list pgsql-hackers

Previous

Next