Re: multivariate statistics (v25) - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: multivariate statistics (v25) |
Date | |
Msg-id | 56f40b20-c464-fad2-ff39-06b668fac47c@2ndquadrant.com Whole thread Raw |
In response to | Re: [HACKERS] multivariate statistics (v25) (Alvaro Herrera <alvherre@2ndquadrant.com>) |
Responses |
Re: multivariate statistics (v25)
Re: multivariate statistics (v25) |
List | pgsql-hackers |
On 04/04/2017 09:55 AM, David Rowley wrote: > On 1 April 2017 at 04:25, David Rowley <david.rowley@2ndquadrant.com> wrote: >> I've attached an updated patch. > > I've made another pass at this and ended up removing the tryextstats > variable. We now only try to use extended statistics when > clauselist_selectivity() is given a valid RelOptInfo with rtekind == > RTE_RELATION, and of course, it must also have some extended stats > defined too. > > I've also cleaned up a few more comments, many of which I managed to > omit updating when I refactored how the selectivity estimates ties > into clauselist_selectivity() > > I'm quite happy with all of this now, and would also be happy for > other people to take a look and comment. > > As a reviewer, I'd be marking this ready for committer, but I've moved > a little way from just reviewing this now, having spent two weeks > hacking at it. > > The latest patch is attached. > Thanks David, I agree the reworked patch is much cleaner that the last version I posted. Thanks for spending your time on it. Two minor comments: 1) DEPENDENCY_MIN_GROUP_SIZE I'm not sure we still need the min_group_size, when evaluating dependencies. It was meant to deal with 'noisy' data, but I think it after switching to the 'degree' it might actually be a bad idea. Consider this: create table t (a int, b int); insert into t select 1, 1 from generate_series(1, 10000) s(i); insert into t selecti, i from generate_series(2, 20000) s(i); create statistics s with (dependencies) on (a,b) from t; analyze t; select stadependencies from pg_statistic_ext ; stadependencies -------------------------------------------- [{1 => 2 : 0.333344}, {2 => 1 : 0.333344}] (1 row) So the degree of the dependency is just ~0.333 although it's obviously a perfect dependency, i.e. a knowledge of 'a' determines 'b'. The reason is that we discard 2/3 of rows, because those groups are only a single row each, except for the one large group (1/3 of rows). Without the mininum group size limitation, the dependencies are: test=# select stadependencies from pg_statistic_ext ; stadependencies -------------------------------------------- [{1 => 2 : 1.000000}, {2 => 1 : 1.000000}] (1 row) which seems way more reasonable, I think. 2) A minor detail is that instead of this if (estimatedclauses != NULL && bms_is_member(listidx, estimatedclauses)) continue; perhaps we should do just this: if (bms_is_member(listidx, estimatedclauses)) continue; bms_is_member does the same NULL check right at the beginning, so I don't think this might make a measurable difference. kind regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: