Re: multivariate statistics / patch v6 - Mailing list pgsql-hackers
From | Kyotaro HORIGUCHI |
---|---|
Subject | Re: multivariate statistics / patch v6 |
Date | |
Msg-id | 20150515.152936.83796179.horiguchi.kyotaro@lab.ntt.co.jp |
In response to | Re: multivariate statistics / patch v6 (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Responses | Re: multivariate statistics / patch v6, Re: multivariate statistics / patch v6 |
List | pgsql-hackers |
Hello,

At Thu, 14 May 2015 12:35:50 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <55547A86.8020400@2ndquadrant.com>
>
> On 05/13/15 10:31, Kyotaro HORIGUCHI wrote:
> > Hello, this might be somewhat out of place but strongly related
> > to this patch, so I'll propose it here.
> >
> > This is a proposal of a new feature for this patch, or a request
> > for your approval of my moving on with it as a different (but very
> > close) project.
> >
> > ===
> >
> >> Attached is v6 of the multivariate stats, with a number of
> >> improvements:
> > ...
> >> 2) fix of pg_proc issues (reported by Jeff)
> >>
> >> 3) rebase to current master
> >
> > Unfortunately, the v6 patch suffers some system OID conflicts
> > with recently added ones. And what is more unfortunate for me is
> > that the code for functional dependencies looks undone :)
>
> I'll fix the OID conflicts once the CF completes, which should be in a
> few days I guess. Until then you can apply it on top of master from
> about May 6 (that's when the v6 was created, and there should be no
> conflicts).

I applied it with further fixing. It wasn't a problem :)

> Regarding the functional dependencies - you're right there's room for
> improvement. For example it only works with dependencies between pairs
> of columns, not multi-column dependencies. Is this what you mean by
> incomplete?

No, it overruns dependencies->deps, because build_mv_dependencies stores
many elements into dependencies->deps[n] although it really has room for
only one element. I suppose that you paused writing it when you noticed
that the number of required elements is unknown before finishing the
walk through all pairs of values. palloc'ing for numattrs^2 elements is
reasonable enough as POC code for now. Am I looking at the wrong version
of the patch?

- dependencies = (MVDependencies)palloc0(sizeof(MVDependenciesData))
+ dependencies = (MVDependencies)palloc0(sizeof(MVDependenciesData) +
+                sizeof(MVDependency) * numattrs * numattrs);

> > I mention this because I recently had an issue from strong
> > correlation between two columns in the dbt3 benchmark. Two columns in
> > some table are in strong correlation but not in functional
> > dependencies; there are too many values and their distribution is
> > very uniform, so MCV is no use for the table (and a histogram
> > has nothing to do with equality conditions). As the result, the
> > planner estimates the number of rows largely wrong, as expected,
> > especially for joins.
>
> I think the other statistics types (esp. histograms) might be more
> useful here, but I assume you haven't tried that because of the
> conflicts.
>
> The current patch does not handle joins at all, though.

Well, that's one of the reasons. But I understood that any deterministic
estimation cannot be applied to such a distribution when I saw what made
the wrong estimation. eqsel and eqjoinsel finally rely on the
random-match assumption over a uniform distribution when the value is
not found in the MCV list. And the functional dependencies stuff in your
old patch (which works) (rightfully) failed to find such a relationship
between the problematic columns. So I tried ndistinct, which is not
contained in your patch, to see how well it works.

> > I then had a try at calculating the ratio between the product of
> > the distinctness of every column and the distinctness of the set of
> > the columns, call it the multivariate coefficient here, and found
> > that it looks greatly useful for its small storage space, little
> > calculation, and simple code.
>
> So when you have two columns A and B, you compute this:
>
>    ndistinct(A) * ndistinct(B)
>    ---------------------------
>          ndistinct(A,B)

Yes, I used the reciprocal of that, though.

> where ndistinct(...) means number of distinct values in the column(s)?

Yes.
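(Just as an illustration, not code from the POC patch; the names below
are made up.) Given per-column and whole-group ndistinct estimates, the
coefficient amounts to something like this; as said above, the POC
actually uses the reciprocal:

/* Illustration only -- hypothetical names, not the POC code. */
#include <stdio.h>

/*
 * Multivariate coefficient: product of per-column ndistinct divided by
 * the ndistinct of the column group.  A value near 1 means nearly
 * independent columns; larger values mean stronger correlation.
 */
static double
mv_coefficient(const double *ndistinct, int ncols, double ndistinct_group)
{
    double      product = 1.0;
    int         i;

    for (i = 0; i < ncols; i++)
        product *= ndistinct[i];

    return product / ndistinct_group;
}

int
main(void)
{
    double      nd[2] = {1000.0, 1000.0};

    /* independent columns: ndistinct(A,B) ~ 1,000,000 -> coefficient ~1 */
    printf("%g\n", mv_coefficient(nd, 2, 1000000.0));

    /* strongly correlated: ndistinct(A,B) = 1,000 -> coefficient 1000 */
    printf("%g\n", mv_coefficient(nd, 2, 1000.0));

    return 0;
}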
> > The attached first is a script to generate problematic tables.
> > And the second is a patch to make use of the mv coef on current
> > master. The patch is a very primitive POC so no syntactical
> > interfaces are involved.
> > ...
> > Make use of mv coefficient.
> >
> >> =# insert into pg_mvcoefficient values ('t'::regclass, 1, 2, 3, 0);
> >> =# analyze t;
> >> =# explain analyze select * from t where a = 1 and b = 1 and c = 1;
> >>  Seq Scan on t  (cost=0.00..22906.00 rows=9221 width=12)
> >>                 (actual time=3.740..242.330 rows=10000 loops=1)
> >
> > Row number estimation was largely improved.
>
> With my patch:
>
> alter table t add statistics (mcv) on (a,b,c);
> ...
> Seq Scan on t  (cost=0.00..22906.00 rows=9533 width=12)

Yes, your MV-MCV list should hold one third of all possible (sets of)
values, so it works fine, I guess. But my original problem occurred
under the condition that the (single-column) MCVs contain under 1% of
the possible values. MCV would not work for such cases, but the very
uniform distribution helps the random assumption to work.

> $ perl gentbl.pl 200000 | psql postgres   <takes a while..>
> postgres=# alter table t1 add statistics (mcv true) on (a, b);
> postgres=# analyze t1;
> postgres=# explain analyze select * from t1 where a = 1 and b = 2501;
>  Seq Scan on t1  (cost=0.00..124319.00 rows=1 width=8)
>                  (actual time=0.051..1250.773 rows=8 loops=1)

The estimate "rows=1" is internally 2.4e-11, 3.33e+11 times smaller than
the real number. This will result in roughly the same order of error for
joins. This is because the MV-MCV holds too small a part of the domain
and the rest is then calculated using the random assumption. This won't
be saved by increasing statistics_target to any sane amount.

> alter table t drop statistics all;
> alter table t add statistics (histogram) on (a,b,c);
> ...
> Seq Scan on t  (cost=0.00..22906.00 rows=9667 width=12)
>
> So both the MCV list and histogram do quite a good work here,

I understand how you calculate the selectivity of equality clauses using
a histogram. And it calculates the result rows as 2.3e-11, which is
almost the same as with MV-MCV; it comes from the same cause and then
yields the same result for joins.

> but there are certainly cases when that does not work and the
> mvcoefficient works better.

The condition where the mv-coef is effective is, as mentioned above,
where MV-MCV or MV-HISTOGRAM cannot hold a sufficient part of the
domain. The appropriate combination of MV-MCV and mv-coef would be the
same as var_eq_(non_)const/eqjoinsel_inner for a single column, that is,
applying the mv-coef to the part of the selectivity corresponding to
values not in the MV-MCV. I have no idea how to combine it with
MV-HISTOGRAM right now.
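To sketch what I mean (again just an illustration, not code from either
patch; all the names here are hypothetical), the part of the selectivity
not covered by the MV-MCV list could be handled the way var_eq_const()
handles it for a single column, except that the number of remaining
value combinations is estimated through the mv-coef instead of the
independence assumption:

/* Illustration only -- hypothetical names, not code from either patch. */
#include <stdio.h>

/*
 * Selectivity of "A = const1 AND B = const2" when the combination is
 * not in the MV-MCV list: spread the frequency not covered by the MCV
 * list evenly over the distinct combinations outside it, estimating
 * the number of combinations with the multivariate coefficient.
 */
static double
mv_eq_sel_not_in_mcv(double nd_a, double nd_b, /* per-column ndistinct */
                     double mvcoef,        /* ndist(A)*ndist(B)/ndist(A,B) */
                     double mcv_totalsel,  /* frequency covered by MV-MCV */
                     int nmcv_items)       /* entries in the MV-MCV list */
{
    double      nd_ab = (nd_a * nd_b) / mvcoef;
    double      otherdistinct = nd_ab - nmcv_items;

    if (otherdistinct < 1.0)
        otherdistinct = 1.0;

    return (1.0 - mcv_totalsel) / otherdistinct;
}

int
main(void)
{
    /*
     * Strongly correlated columns (200,000 distinct pairs) and an MCV
     * list covering only 1% of the rows: the estimate stays near
     * 1/200,000 instead of collapsing towards 1/(200,000^2).
     */
    printf("%g\n",
           mv_eq_sel_not_in_mcv(200000.0, 200000.0, 200000.0, 0.01, 100));
    return 0;
}

This is only the rough shape; fitting it into the actual clause and MCV
matching code will of course need more care.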
> The current patch does not handle joins, but it's one of the TODO
> items.

Yes, but the result on very large tables can be deduced from the
discussion above.

> > I think the result above shows that the multivariate coefficient
> > is significant for improving estimates when correlated columns are
> > involved.
>
> Yes, it looks interesting. I'm wondering what are the "failure cases"
> when the coefficient approach does not work. It seems to me it relies
> on an assumption of consistency for all the ndistinct values.
>
> For example, let's assume you have two columns - A and B, each with
> 1000 distinct values, and that each value in A has 100 matching values
> in B, so the coefficient is ~10
>
>    1,000 * 1,000 / 100,000 = 10
>
> Now, let's assume the distribution looks different - with the first
> 100 values in A matching all 1000 values of B, and the remaining 900
> values just a single B value. Then
>
>    1,000 * 1,000 / (100,000 + 900) = ~9.9
>
> So a very different distribution, but almost the same coefficient.
>
> Are there any other assumptions like this?

I think not, for now. Just as the current var_eq_(non_)const and
eqjoinsel_inner do, since no clue about *the true* distribution is
available, we have no choice other than to stand on the random (uniform
distribution) assumption. And it gives not-so-bad estimates for
not-so-extreme distributions. It's of course not perfect, but good
enough.

> Also, does the coefficient work only for equality conditions?

The mvcoef is a parallel of ndistinct (it is a bit of a weird
expression, though). So I guess it is applicable to the current
estimation code wherever ndistinct is used; almost all of those places
look to relate to equality comparison.

> > Would you consider this in your patch? Otherwise I will move on with
> > this as a different project from yours, if you don't mind. Except for
> > the user interface, it won't conflict with yours, I suppose. But
> > finally they will need some labor of consolidation.
>
> I think it's a neat idea, and I think it might be added to the
> patch. It would fit in quite nicely, actually - I already do have
> other kinds of stats for addition, but I'm not going to work on that
> in the near future. It will require changes in some parts of the patch
> (selecting the stats for a list of clauses) and I'd like to complete
> the current patch first, and then add features in follow-up patches.

I see. Let's work on this for now.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center