Re: [HACKERS] multivariate statistics (v19) - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: [HACKERS] multivariate statistics (v19) |
Date | |
Msg-id | 1fa982dc-ffb4-eef7-eb55-d8cf7d84f1c4@2ndquadrant.com Whole thread Raw |
In response to | Re: [HACKERS] multivariate statistics (v19) (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
Responses |
Re: [HACKERS] multivariate statistics (v19)
Re: [HACKERS] multivariate statistics (v19) Re: [HACKERS] multivariate statistics (v19) |
List | pgsql-hackers |
Hi everyone, thanks for the reviews! Attached is v23 of the patch series, addressing most of the points raised in the reviews. A quick summary of the changes (I'll respond to the other threads for points that deserve a bit more detailed discussion): 0) Rebase to current master. The main culprit was the pesky logical replication patch committed a week ago, because SUBSCRIPTION and STATISTICS are right next to each other in gram.y, various switches etc. 1) Many typos, mentioned by all the reviewers. 2) I've added a short explanation (in alter_table.sgml) of how ALTER TABLE ... DROP COLUMN handles multivariate statistics, i.e. that those are only dropped if there would be a single remaining column. 3) I've reworded 'thoroughly' to 'in more detail' in planstats.sgml, to make Petr happy ;-) 4) Added missing comments to get_statistics_oid, RelationGetMVStatList, update_mv_stats, ndistinct_for_combination. Also update_mv_stats() was not used outside common.c, so I've made it static and removed the prototype from mvstats.h. 5) I've changed 'statistics does not exist' to 'statistics do not exist' on a number of places. 6) Removed XXX about checking for duplicates in CreateStatistics. I agree with Petr that we shouldn't do such checks, as we're not doing that for other objects (e.g. indexes). 7) I've moved moved the code loading statistics from get_relation_info into a new function get_relation_statistics, to get rid of the if (true) { ... } block, which was there due to mimicking how index details are loaded without having hasindex-like flag. I like this better than merging the block into get_relation_info directly. 8) I've changed 'a statistics' to 'multivariate statistics' on a few places in sgml docs, to make it clear it's not referring to the 'regular' statistics (e.g. at CREATE/DROP STATISTICS, mentioned by Ideriha Takeshi). 9) I've changed the link in README.dependencies to https://en.wikipedia.org/wiki/Functional_dependency as proposed by Ideriha Takeshi. I'm pretty sure the wiki page about database normalization, referenced by the original link, included a nice functional dependency example some time ago, but it seems to have changed and the new link is better. But perhaps it's not a good idea to link to wikipedia, as the pages clearly change quite significantly? 10) The CREATE STATISTICS now reports a nice 'already exists' message, instead of the 'duplicate key', pointed out by Dilip. 11) MVNDistinctItem/MVNDistinctData now use FLEXIBLE_ARRAY_MEMBER for the array, just like the other structs. On 01/26/2017 12:01 PM, Kyotaro HORIGUCHI wrote: > dependencies.c: > > dependency_dgree(): > > - The k is assumed larger than 1. I think assertion is required. > > - "/* end of the preceding group */" seems to be better if it > is just after the "if (multi_sort.." currently just after it. > > - The following comment seems mis-edited. > > * If there is a single are no contradicting rows, count the group > > * as supporting, otherwise contradicting. > > maybe this would be like the following? The varialbe counting > the first "contradiction" is named "n_violations". This seems > somewhat confusing. > > > * If there are no violating rows up to here, count the group > > * as supporting, otherwise contradicting. > > - "/* first columns match, but the last one does not" > else if (multi_sort_compare_dims((k - 1), (k - 1), ... > > The above comparison should use multi_sort_compare_dim, not > dims > > - This function counts "n_contradicting_rows" but it is not > referenced. Anyway n_contradicting_rows = numrows - > n_supporing_rows so it and n_contradicting seem > unncecessary. > Yes, absolutely. This was clearly unnecessary remainder of the original implementation, and I failed to clean it up after adopting Dean's idea of continuous dependency degree. I've also reworked the method a bit, moving handling of the last group into the main loop (instead of doing that separately right after the loop, which I think was a bit ugly anyway). Can you check if you're happy with the code & comments now? > > mvstats.h: > > - struct MVDependencyData/ MVDependenciesData > > The varialbe length member at the last of the structs should > be defined using FLEXIBLE_ARRAY_MEMBER, from the convention. > Yes, fixed. The other structures already used that macro, but I failed to notice MVDependencyData/ MVDependenciesData need that fix too. > > - I'm not sure how much it impacts performance, but some > struct members seems to have a bit too wide types. For > example, MVDepedenciesData.type is of int32 but it can have > only '1' for now and it won't be two-digits. Also ndeps > cannot be so large. > I doubt the impact on performance is measurable, particularly for the global fields (e.g. nbuckets is tiny compared to the space needed for the buckets themselves). But I think you're right we shouldn't use fields wider than actually needed (e.g. using uint32 for nbuckets is a bit insane, and uint16 would be just fine). It's not just a matter of performance, but also a way to document expected values etc. I'll go through the fields and use smaller data types where appropriate. > > general: > This patch uses int16 as the type of attrubute number but it > might be better to use AttrNumber for the purpose. > (Specifically it seems defined as the type for an attribute > index but also used as the varialbe for number of attributes) > Agreed. Will check with the struct members. > > Sorry for the random comment in advance. I'll learn this further. > Thanks for the review! regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
- 0001-teach-pull_-varno-varattno-_walker-about-Restric-v23.patch
- 0002-PATCH-shared-infrastructure-and-ndistinct-coeffi-v23.patch
- 0003-PATCH-functional-dependencies-only-the-ANALYZE-p-v23.patch
- 0004-PATCH-selectivity-estimation-using-functional-de-v23.patch
- 0005-PATCH-multivariate-MCV-lists-v23.patch
- 0006-PATCH-multivariate-histograms-v23.patch
- 0007-WIP-use-ndistinct-for-selectivity-estimation-in--v23.patch
- 0008-WIP-allow-using-multiple-statistics-in-clauselis-v23.patch
- 0009-WIP-psql-tab-completion-basics-v23.patch
pgsql-hackers by date: