[HACKERS] extended statistics: n-distinct - Mailing list pgsql-hackers
From | Alvaro Herrera |
---|---|
Subject | [HACKERS] extended statistics: n-distinct |
Date | |
Msg-id | 20170320190220.ixlaueanxegqd5gr@alvherre.pgsql Whole thread Raw |
Responses |
Re: [HACKERS] extended statistics: n-distinct
Re: [HACKERS] extended statistics: n-distinct Re: [HACKERS] extended statistics: n-distinct |
List | pgsql-hackers |
Here is a closer to final version of the multivariate statistics series, last posted at https://www.postgresql.org/message-id/20170316222033.ncdi7nidah2gdzjx%40alvherre.pgsql If you've always wanted to review multivariate stats, but never found a good reason to, now is a terrific time to do so! (In other words: I plan to get this pushed in the not too distant future.) This is a new thread to present a version of the n-distinct patch that IMO is close enough to commit. There are some work items still. There's some discussion on the topic of cross-column statistics: https://wiki.postgresql.org/wiki/Cross_Columns_Stats This problem is important enough that Kyotaro Horiguchi submitted another patch that does the same thing: https://www.postgresql.org/message-id/flat/20150828.173334.114731693.horiguchi.kyotaro%40lab.ntt.co.jp This patch aims to provide the same functionality, keeping the design general enough that other kinds of statistics can be added later (such as functional dependencies, histograms and MCVs, all of which have been previously submitted as patches by Tomas). To recap, what this patch provides is a new command of the form CREATE STATISTICS statname [WITH (opts)] ON (columns) FROM table Note that we put the table name in a separate FROM clause instead of together with the column name, so that this is more readily extensible to things that are not just columns, for example expressions that might involve more than one table (per review from Dean Rasheed). Currently, only one table is supported. In this patch, the "opts" can only be "ndistinct", which creates a pg_statistic_ext row with the number of distinct groups found in all possible combination across that set of columns. This can be used when a GROUP BY or a DISTINCT clause need to estimate the number of distinct groups in an aggregation. Some things left to change: * Currently, we use the ndistinct value only if the grouping uses exactly the set of columns covered by a statistics. For example, if we have stats on (a,b,c) and the grouping is on (a,b,c,d), we fall back to the old method, which may result in worse results than if we used the number we know about (a,b,c) then applied a fixup to consider the distinctness of (d). * Also, estimate_num_groups() looks a bit patchy. With slightly more invasive changes we can make it look more natural. * I'm not terribly happy with the header organization. I think VacAttrStats should be in its own (new) src/include/statistics/analyze.h for example (which cleans up a bunch of existing stuff a bit), and the new files could do with some slight makeover. * The current code uses AttrNumber * and int2vector, in places where it would be more convenient to use Bitmapsets. * We currently try to keep a stats object even if a column in it is dropped -- for example, if we have stats on (a,b,c) and drop (b), then we still have stats on (a,c). While this is nice, it creates a bunch of weird corner cases, so I'm going to rip that out and just drop the statistics instead. If the user wants stats on (a,c) to remain, they can create it after (or before) dropping the column. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
pgsql-hackers by date: