Home > mailing lists

Re: [HACKERS] multivariate statistics (v19) - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: [HACKERS] multivariate statistics (v19)
Date	January 30, 2017 22:12:35
Msg-id	1fa982dc-ffb4-eef7-eb55-d8cf7d84f1c4@2ndquadrant.com Whole thread Raw
In response to	Re: [HACKERS] multivariate statistics (v19) (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses	Re: [HACKERS] multivariate statistics (v19) Re: [HACKERS] multivariate statistics (v19) Re: [HACKERS] multivariate statistics (v19)
List	pgsql-hackers

Tree view

Hi everyone,

thanks for the reviews! Attached is v23 of the patch series, addressing 
most of the points raised in the reviews.

A quick summary of the changes (I'll respond to the other threads for 
points that deserve a bit more detailed discussion):

0) Rebase to current master. The main culprit was the pesky logical 
replication patch committed a week ago, because SUBSCRIPTION and 
STATISTICS are right next to each other in gram.y, various switches etc.

1) Many typos, mentioned by all the reviewers.

2) I've added a short explanation (in alter_table.sgml) of how ALTER 
TABLE ... DROP COLUMN handles multivariate statistics, i.e. that those 
are only dropped if there would be a single remaining column.

3) I've reworded 'thoroughly' to 'in more detail' in planstats.sgml, to 
make Petr happy ;-)

4) Added missing comments to get_statistics_oid, RelationGetMVStatList, 
update_mv_stats, ndistinct_for_combination. Also update_mv_stats() was 
not used outside common.c, so I've made it static and removed the 
prototype from mvstats.h.

5) I've changed 'statistics does not exist' to 'statistics do not exist' 
on a number of places.

6) Removed XXX about checking for duplicates in CreateStatistics. I 
agree with Petr that we shouldn't do such checks, as we're not doing 
that for other objects (e.g. indexes).

7) I've moved moved the code loading statistics from get_relation_info 
into a new function get_relation_statistics, to get rid of the

   if (true)
   {
    ...
   }

block, which was there due to mimicking how index details are loaded 
without having hasindex-like flag. I like this better than merging the 
block into get_relation_info directly.

8) I've changed 'a statistics' to 'multivariate statistics' on a few 
places in sgml docs, to make it clear it's not referring to the 
'regular' statistics (e.g. at CREATE/DROP STATISTICS, mentioned by 
Ideriha Takeshi).

9) I've changed the link in README.dependencies to 
https://en.wikipedia.org/wiki/Functional_dependency as proposed by 
Ideriha Takeshi. I'm pretty sure the wiki page about database 
normalization, referenced by the original link, included a nice 
functional dependency example some time ago, but it seems to have 
changed and the new link is better.

But perhaps it's not a good idea to link to wikipedia, as the pages 
clearly change quite significantly?

10) The CREATE STATISTICS now reports a nice 'already exists' message, 
instead of the 'duplicate key', pointed out by Dilip.

11) MVNDistinctItem/MVNDistinctData now use FLEXIBLE_ARRAY_MEMBER for 
the array, just like the other structs.

On 01/26/2017 12:01 PM, Kyotaro HORIGUCHI wrote:
> dependencies.c:
>
>  dependency_dgree():
>
>   - The k is assumed larger than 1. I think assertion is required.
>
>   - "/* end of the preceding group */" seems to be better if it
>     is just after the "if (multi_sort.." currently just after it.
>
>   - The following comment seems mis-edited.
>     > * If there is a single are no contradicting rows, count the group
>     > * as supporting, otherwise contradicting.
>
>     maybe this would be like the following? The varialbe counting
>     the first "contradiction" is named "n_violations". This seems
>     somewhat confusing.
>
>     > * If there are no violating rows up to here, count the group
>     > * as supporting, otherwise contradicting.
>
>    - "/* first columns match, but the last one does not"
>      else if (multi_sort_compare_dims((k - 1), (k - 1), ...
>
>      The above comparison should use multi_sort_compare_dim, not
>      dims
>
>    - This function counts "n_contradicting_rows" but it is not
>      referenced. Anyway n_contradicting_rows = numrows -
>      n_supporing_rows so it and n_contradicting seem
>      unncecessary.
>

Yes, absolutely. This was clearly unnecessary remainder of the original 
implementation, and I failed to clean it up after adopting Dean's idea 
of continuous dependency degree.

I've also reworked the method a bit, moving handling of the last group 
into the main loop (instead of doing that separately right after the 
loop, which I think was a bit ugly anyway). Can you check if you're 
happy with the code & comments now?

>
>  mvstats.h:
>
>    - struct MVDependencyData/ MVDependenciesData
>
>      The varialbe length member at the last of the structs should
>      be defined using FLEXIBLE_ARRAY_MEMBER, from the convention.
>

Yes, fixed. The other structures already used that macro, but I failed 
to notice MVDependencyData/ MVDependenciesData need that fix too.

 >
>    - I'm not sure how much it impacts performance, but some
>      struct members seems to have a bit too wide types. For
>      example, MVDepedenciesData.type is of int32 but it can have
>      only '1' for now and it won't be two-digits. Also ndeps
>      cannot be so large.
>

I doubt the impact on performance is measurable, particularly for the 
global fields (e.g. nbuckets is tiny compared to the space needed for 
the buckets themselves).

But I think you're right we shouldn't use fields wider than actually 
needed (e.g. using uint32 for nbuckets is a bit insane, and uint16 would 
be just fine). It's not just a matter of performance, but also a way to 
document expected values etc.

I'll go through the fields and use smaller data types where appropriate.

>
> general:
>   This patch uses int16 as the type of attrubute number but it
>   might be better to use AttrNumber for the purpose.
>   (Specifically it seems defined as the type for an attribute
>    index but also used as the varialbe for number of attributes)
>

Agreed. Will check with the struct members.

>
> Sorry for the random comment in advance. I'll learn this further.
>

Thanks for the review!

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

pgsql-hackers by date:

From: Nikhil Sontakke
Date: 30 January 2017, 21:45:33
Subject: Re: [HACKERS] Speedup twophase transactions

From: Alvaro Herrera
Date: 30 January 2017, 22:16:40
Subject: Re: [HACKERS] VACUUM's ancillary tasks

Re: [HACKERS] multivariate statistics (v19) - Mailing list pgsql-hackers

Attachment

Previous

Next