Home > mailing lists

Re: WIP: multivariate statistics / proof of concept - Mailing list pgsql-hackers

From	Kyotaro HORIGUCHI
Subject	Re: WIP: multivariate statistics / proof of concept
Date	March 20, 2015 08:33:33
Msg-id	20150320.173307.249113775.horiguchi.kyotaro@lab.ntt.co.jp Whole thread Raw
In response to	Re: WIP: multivariate statistics / proof of concept (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses	Re: WIP: multivariate statistics / proof of concept
List	pgsql-hackers

Tree view

Hello,


Patch 0001 needs changes for OIDs since my patch was
committed. The attached is compatible with current master.

And I tried this like this, and got the following error on
analyze. But unfortunately I don't have enough time to
investigate it now.

postgres=# create table t1 (a int, b int, c int);
insert into t1 (select a/ 10000, a / 10000, a / 10000 from generate_series(0, 99999) a);
postgres=# analyze t1;
ERROR:  invalid memory alloc request size 1485176862

regards,


At Sat, 24 Jan 2015 21:21:39 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<54C3FED3.1060600@2ndquadrant.com>
> Hi,
> 
> attached is an updated version of the multivariate stats patch. This is
> going to be a bit longer mail, so I'll put here a small ToC ;-)
> 
> 1) patch split into 4 parts
> 2) where to start / documentation
> 3) state of the code
> 4) main changes/improvements
> 5) remaining limitations
> 
> The motivation and design ideas, explained in the first message of this
> thread are still valid. It might be a good idea to read it first:
> 
>   http://www.postgresql.org/message-id/flat/543AFA15.4080608@fuzzy.cz
> 
> BTW if you happen to go to FOSDEM [PGDay], I'll gladly give you an intro
> into the patch in person, or discuss the patch in general.
> 
> 
> 1) Patch split into 4 parts
> ---------------------------
> Firstly, the patch got broken into the following four pieces, to make
> the reviews somewhat easier:
> 
> 1) 0001-shared-infrastructure-and-functional-dependencies.patch
> 
>    - infrastructure, shared by all the kinds of stats added
>      in the following patches (catalog, ALTER TABLE, ANALYZE ...)
> 
>    - implementation of a simple statistics, tracking functional
>      dependencies between columns (previously called "associative
>      rules", but that's incorrect for several reasons)
> 
>    - this does not modify the optimizer in any way
> 2) 0002-clause-reduction-using-functional-dependencies.patch
> 
>    - applies the functional dependencies to optimizer (i.e. considers
>      the rules in clauselist_selectivity())
> 
> 3) 0003-multivariate-MCV-lists.patch
> 
>    - multivariate MCV lists (both ANALYZE and optimizer parts)
> 
> 4) 0004-multivariate-histograms.patch
> 
>    - multivariate histograms (both ANALYZE and optimizer parts)
> 
> 
> You may look at the patches at github here:
> 
>   https://github.com/tvondra/postgres/tree/multivariate-stats-squashed
> 
> The branch is not stable, i.e. I'll rebase / squash / force-push changes
> in the future. (There's also multivariate-stats development branch with
> unsquashed changes, but you don't want to look at that, trust me.)
> 
> The patches are not exactly small (being in the 50-100 kB range), but
> that's mostly because of the amount of comments explaining the goals and
> implementation details.
> 
> 
> 2) Where to start / documentation
> ---------------------------------
> I strived to document all the pieces properly, mostly in the form of
> comments. There's no sgml documentation at this point, which should
> obviously change in the future.
> 
> Anyway, I'd suggest reading the first e-mail in this thread, explaining
> the ideas, and then these comments:
> 
> 1) functional dependencies (patch 0001)
>    - src/backend/utils/mvstats/dependencies.c
> 
> 2) MCV lists (patch 0003)
>    - src/backend/utils/mvstats/mcv.c
> 
> 3) histograms (patch 0004)
>    - src/backend/utils/mvstats/mcv.c
> 
>    - also see clauselist_mv_selectivity_mcvlist() in clausesel.c
>    - also see clauselist_mv_selectivity_histogram() in clausesel.c
> 
> 4) selectivity estimation (patches 0002-0004)
>    - all in src/backend/optimizer/path/clausesel.c
>    - clauselist_selectivity() - overview of how the stats are applied
>    - clauselist_apply_dependencies() - functional dependencies reduction
>    - clauselist_mv_selectivity_mcvlist() - MCV list estimation
>    - clauselist_mv_selectivity_histogram() - histogram estimation
> 
> 
> 3) State of the code
> --------------------
> I've spent a fair amount of time testing the patches, and while I
> believe there are no segfaults or so, I know parts of the code need a
> bit more love.
> 
> The part most in need of improvements / comments is probably the code in
> clausesel.c - that seems a bit quirky. Reviews / comments regarding this
> part of the code are very welcome - I'm sure there are many ways to
> improve this part.
> 
> There are a few FIXMEs elsewhere (e.g. about memory allocation in the
> (de)serialization code), but those are mostly well-defined issues that I
> know how to address (at least I believe so).
> 
> 
> 4) Main changes/improvements
> ----------------------------
> There are many significant improvements. The previous patch version was
> in the 'proof of concept' category (missing pieces, knowingly broken in
> some areas), the current patch should 'mostly work'.
> 
> The patch fixes two most annoying limitations of the first version:
> 
>   (a) support for all data types (not just those passed by value)
>   (b) handles NULL values properly
>   (c) adds support for IS [NOT] NULL clauses
> 
> Aside from that the code was significantly improved, there are proper
> regression tests and plenty of comments explaining the details.
> 
> 
> 5) Remaining limitations
> ------------------------
> 
>   (a) limited to stats on 8 columns
> 
>       This is mostly just a 'safeguard' restriction.
> 
>   (b) only data types with '<' operator
> 
>       I don't think this will change anytime soon, because all the
>       algorithms for building the stats rely on this. I don't see
>       this as a serious limitation though.
> 
>   (c) not handling DROP COLUMN or DROP TABLE and so on
> 
>       Currently this is not handled at all (so the regression tests
>       do an explicit DELETE from the pg_mv_statistic catalog).
> 
>       Handling the DROP TABLE won't be difficult, it's similar to the
>       current stats. Handling ALTER TABLE ... DROP COLUMN will be much
>       more tricky I guess - should we drop all the stats referencing
>       that column, or should we just remove it from the stats? Or
>       should we keep it and treat it as NULL? Not sure what's the best
>       solution.
> 
>   (d) limited list of compatible WHERE clauses
> 
>       The initial patch handled only simple operator clauses
> 
>           (Var op Constant)
> 
>       where operator is one of ('<', '<=', '=', '>=', '>'). Now it also
>       handles IS [NOT] NULL clauses. Adding more clause types should
>       not  be overly difficult - starting with more traditional
>       'BooleanTest' conditions, or even multi-column conditions
>           (Var op Var)
> 
>       which are difficult to estimate using simple-column stats.
> 
>   (e) optimizer uses single stats per table
> 
>       This is still true and I don't think this will change soon. i do
>       have some ideas on how to merge multiple stats etc. but it's
>       certainly complex stuff, unlikely to happen within this CF. The
>       patch makes a lot of sense even without this particular feature,
>       because you can create multiple stats, each suitable for different
>       queries.
> 
>   (f) no JOIN conditions
> 
>       Similarly to the previous point, it's on the TODO but it's not
>       going to happen in this CF.
> 
> 
> kind regards
> 
> -- 
> Tomas Vondra                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pgsql-hackers by date:

From: David Rowley
Date: 20 March 2015, 08:11:32
Subject: Re: Performance improvement for joins where outer side is unique

From: Andres Freund
Date: 20 March 2015, 09:32:17
Subject: Re: Using 128-bit integers for sum, avg and statistics aggregates

Re: WIP: multivariate statistics / proof of concept - Mailing list pgsql-hackers

Previous

Next