Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers
From | Dean Rasheed |
---|---|
Subject | Re: [HACKERS] PATCH: multivariate histograms and MCV lists |
Date | |
Msg-id | CAEZATCV5Xq7FGFYvigJWPZQrV9Hx37WvHfTgKdf6Ak3smbip2w@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Responses |
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
|
List | pgsql-hackers |
On Wed, 26 Dec 2018 at 22:09, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > Attached is an updated version of the patch - rebased and fixing the > warnings reported by Thomas Munro. > Here are a few random review comments based on what I've read so far: On the CREATE STATISTICS doc page, the syntax in the new examples added to the bottom of the page is incorrect. E.g., instead of CREATE STATISTICS s2 WITH (mcv) ON (a, b) FROM t2; it should read CREATE STATISTICS s2 (mcv) ON a, b FROM t2; I think perhaps there should be also be a short explanatory sentence after each example (as in the previous one) just to explain what the example is intended to demonstrate. E.g., for the new MCV example, perhaps say These statistics give the planner more detailed information about the specific values that commonly appear in the table, as well as an upper bound on the selectivities of combinations of values that do not appear in the table, allowing it to generate better estimates in both cases. I don't think there's a need for too much detail there, since it's explained more fully elsewhere, but it feels like it needs a little more just to explain the purpose of the example. There is additional documentation in perform.sgml that needs updating -- about what kinds of stats the planner keeps. Those docs are actually quite similar to the ones on planstats.sgml. It seems the former focus more one what stats the planner stores, while the latter focus on how the planner uses those stats. In func.sgml, the docs for pg_mcv_list_items need extending to include the base frequency column. Similarly for the example query in planstats.sgml. Tab-completion for the CREATE STATISTICS statement should be extended for the new kinds. Looking at mcv_update_match_bitmap(), it's called 3 times (twice recursively from within itself), and I think the pattern for calling it is a bit messy. E.g., /* by default none of the MCV items matches the clauses */ bool_matches = palloc0(sizeof(char) * mcvlist->nitems); if (or_clause(clause)) { /* OR clauses assume nothing matches, initially */ memset(bool_matches, STATS_MATCH_NONE, sizeof(char) * mcvlist->nitems); } else { /* AND clauses assume everything matches, initially */ memset(bool_matches, STATS_MATCH_FULL, sizeof(char) * mcvlist->nitems); } /* build the match bitmap for the OR-clauses */ mcv_update_match_bitmap(root, bool_clauses, keys, mcvlist, bool_matches, or_clause(clause)); the comment for the AND case directly contradicts the initial comment, and the final comment is wrong because it could be and AND clause. For a NOT clause it does: /* by default none of the MCV items matches the clauses */ not_matches = palloc0(sizeof(char) * mcvlist->nitems); /* NOT clauses assume nothing matches, initially */ memset(not_matches, STATS_MATCH_FULL, sizeof(char) * mcvlist->nitems); /* build the match bitmap for the NOT-clause */ mcv_update_match_bitmap(root, not_args, keys, mcvlist, not_matches, false); so the second comment is wrong. I understand the evolution that lead to this function existing in this form, but I think that it can now be refactored into a "getter" function rather than an "update" function. I.e., something like mcv_get_match_bitmap() which first allocates the array to be returned and initialises it based on the passed-in value of is_or. That way, all the calling sites can be simplified to one-liners like /* get the match bitmap for the AND/OR clause */ bool_matches = mcv_get_match_bitmap(root, bool_clauses, keys, mcvlist, or_clause(clause)); In the previous discussion around UpdateStatisticsForTypeChange(), the consensus appeared to be that we should just unconditionally drop all extended statistics when ALTER TABLE changes the type of an included column (just as we do for per-column stats), since such a type change can rewrite the data in arbitrary ways, so there's no reason to assume that the old stats are still valid. I think it makes sense to extract that as a separate patch to be committed ahead of these ones, and I'd also argue for back-patching it. That's it for now. I'll try to keep reviewing if time permits. Regards, Dean
pgsql-hackers by date: