Re: WIP: multivariate statistics / proof of concept - Mailing list pgsql-hackers
From | Kyotaro HORIGUCHI |
---|---|
Subject | Re: WIP: multivariate statistics / proof of concept |
Date | |
Msg-id | 20150320.173307.249113775.horiguchi.kyotaro@lab.ntt.co.jp Whole thread Raw |
In response to | Re: WIP: multivariate statistics / proof of concept (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Responses |
Re: WIP: multivariate statistics / proof of concept
|
List | pgsql-hackers |
Hello, Patch 0001 needs changes for OIDs since my patch was committed. The attached is compatible with current master. And I tried this like this, and got the following error on analyze. But unfortunately I don't have enough time to investigate it now. postgres=# create table t1 (a int, b int, c int); insert into t1 (select a/ 10000, a / 10000, a / 10000 from generate_series(0, 99999) a); postgres=# analyze t1; ERROR: invalid memory alloc request size 1485176862 regards, At Sat, 24 Jan 2015 21:21:39 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <54C3FED3.1060600@2ndquadrant.com> > Hi, > > attached is an updated version of the multivariate stats patch. This is > going to be a bit longer mail, so I'll put here a small ToC ;-) > > 1) patch split into 4 parts > 2) where to start / documentation > 3) state of the code > 4) main changes/improvements > 5) remaining limitations > > The motivation and design ideas, explained in the first message of this > thread are still valid. It might be a good idea to read it first: > > http://www.postgresql.org/message-id/flat/543AFA15.4080608@fuzzy.cz > > BTW if you happen to go to FOSDEM [PGDay], I'll gladly give you an intro > into the patch in person, or discuss the patch in general. > > > 1) Patch split into 4 parts > --------------------------- > Firstly, the patch got broken into the following four pieces, to make > the reviews somewhat easier: > > 1) 0001-shared-infrastructure-and-functional-dependencies.patch > > - infrastructure, shared by all the kinds of stats added > in the following patches (catalog, ALTER TABLE, ANALYZE ...) > > - implementation of a simple statistics, tracking functional > dependencies between columns (previously called "associative > rules", but that's incorrect for several reasons) > > - this does not modify the optimizer in any way > 2) 0002-clause-reduction-using-functional-dependencies.patch > > - applies the functional dependencies to optimizer (i.e. considers > the rules in clauselist_selectivity()) > > 3) 0003-multivariate-MCV-lists.patch > > - multivariate MCV lists (both ANALYZE and optimizer parts) > > 4) 0004-multivariate-histograms.patch > > - multivariate histograms (both ANALYZE and optimizer parts) > > > You may look at the patches at github here: > > https://github.com/tvondra/postgres/tree/multivariate-stats-squashed > > The branch is not stable, i.e. I'll rebase / squash / force-push changes > in the future. (There's also multivariate-stats development branch with > unsquashed changes, but you don't want to look at that, trust me.) > > The patches are not exactly small (being in the 50-100 kB range), but > that's mostly because of the amount of comments explaining the goals and > implementation details. > > > 2) Where to start / documentation > --------------------------------- > I strived to document all the pieces properly, mostly in the form of > comments. There's no sgml documentation at this point, which should > obviously change in the future. > > Anyway, I'd suggest reading the first e-mail in this thread, explaining > the ideas, and then these comments: > > 1) functional dependencies (patch 0001) > - src/backend/utils/mvstats/dependencies.c > > 2) MCV lists (patch 0003) > - src/backend/utils/mvstats/mcv.c > > 3) histograms (patch 0004) > - src/backend/utils/mvstats/mcv.c > > - also see clauselist_mv_selectivity_mcvlist() in clausesel.c > - also see clauselist_mv_selectivity_histogram() in clausesel.c > > 4) selectivity estimation (patches 0002-0004) > - all in src/backend/optimizer/path/clausesel.c > - clauselist_selectivity() - overview of how the stats are applied > - clauselist_apply_dependencies() - functional dependencies reduction > - clauselist_mv_selectivity_mcvlist() - MCV list estimation > - clauselist_mv_selectivity_histogram() - histogram estimation > > > 3) State of the code > -------------------- > I've spent a fair amount of time testing the patches, and while I > believe there are no segfaults or so, I know parts of the code need a > bit more love. > > The part most in need of improvements / comments is probably the code in > clausesel.c - that seems a bit quirky. Reviews / comments regarding this > part of the code are very welcome - I'm sure there are many ways to > improve this part. > > There are a few FIXMEs elsewhere (e.g. about memory allocation in the > (de)serialization code), but those are mostly well-defined issues that I > know how to address (at least I believe so). > > > 4) Main changes/improvements > ---------------------------- > There are many significant improvements. The previous patch version was > in the 'proof of concept' category (missing pieces, knowingly broken in > some areas), the current patch should 'mostly work'. > > The patch fixes two most annoying limitations of the first version: > > (a) support for all data types (not just those passed by value) > (b) handles NULL values properly > (c) adds support for IS [NOT] NULL clauses > > Aside from that the code was significantly improved, there are proper > regression tests and plenty of comments explaining the details. > > > 5) Remaining limitations > ------------------------ > > (a) limited to stats on 8 columns > > This is mostly just a 'safeguard' restriction. > > (b) only data types with '<' operator > > I don't think this will change anytime soon, because all the > algorithms for building the stats rely on this. I don't see > this as a serious limitation though. > > (c) not handling DROP COLUMN or DROP TABLE and so on > > Currently this is not handled at all (so the regression tests > do an explicit DELETE from the pg_mv_statistic catalog). > > Handling the DROP TABLE won't be difficult, it's similar to the > current stats. Handling ALTER TABLE ... DROP COLUMN will be much > more tricky I guess - should we drop all the stats referencing > that column, or should we just remove it from the stats? Or > should we keep it and treat it as NULL? Not sure what's the best > solution. > > (d) limited list of compatible WHERE clauses > > The initial patch handled only simple operator clauses > > (Var op Constant) > > where operator is one of ('<', '<=', '=', '>=', '>'). Now it also > handles IS [NOT] NULL clauses. Adding more clause types should > not be overly difficult - starting with more traditional > 'BooleanTest' conditions, or even multi-column conditions > (Var op Var) > > which are difficult to estimate using simple-column stats. > > (e) optimizer uses single stats per table > > This is still true and I don't think this will change soon. i do > have some ideas on how to merge multiple stats etc. but it's > certainly complex stuff, unlikely to happen within this CF. The > patch makes a lot of sense even without this particular feature, > because you can create multiple stats, each suitable for different > queries. > > (f) no JOIN conditions > > Similarly to the previous point, it's on the TODO but it's not > going to happen in this CF. > > > kind regards > > -- > Tomas Vondra http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: