Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: [HACKERS] PATCH: multivariate histograms and MCV lists |
Date | |
Msg-id | 72a97865-9864-9cd7-eedb-e96dae4bde65@2ndquadrant.com Whole thread Raw |
In response to | Re: [HACKERS] PATCH: multivariate histograms and MCV lists (David Rowley <david.rowley@2ndquadrant.com>) |
Responses |
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
|
List | pgsql-hackers |
On 1/16/19 7:56 AM, David Rowley wrote:> On Tue, 15 Jan 2019 at 08:21, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> Turns out you were right - the attribute_referenced piece was quite >> unnecessary. So I've removed it. I've also extended the regression tests >> to verify changing type of another column does not reset the stats. > > (Trying to find my feet over here) > > I've read over the entire thread, and apart from missing the last two > emails and therefore the latest patch, I managed to read over most of > the MCV patch. I didn't quite get to reading mcv.c and don't quite > have the energy to take that on now. > Thanks for looking! > At this stage I'm trying to get to know the patch. I read a lot of > discussing between you and Dean ironing out how the stats should be > used to form selectivities. At the time I'd not read the patch yet, > so most of it went over my head. > > I did note down a few things on my read. I've included them below. > Hopefully, they're useful. > > MCV list review > > 1. In mvc.c there's Assert(ndistinct <= UINT16_MAX); This should be > PG_UINT16_MAX > Yep. Will fix. > 2. math.h should be included just after postgres.h > Yep. Will fix. > 3. Copyright is still -2017 in mcv.c. Hopefully, if you change it to > 2019, you'll never have to bump it ever again! :-) > Optimist ;-) > 4. Looking at pg_stats_ext_mcvlist_items() I see you've coded the > string building manually. The way it's coded I'm finding a little > strange. It means the copying becomes quadratic due to > > snprintf(buff, 1024, format, values[1], DatumGetPointer(valout)); > strncpy(values[1], buff, 1023); > > So basically, generally, here you're building a new string with > values[1] followed by a comma, then followed by valout. One the next > line you then copy that new buffer back into values[1]. I understand > this part is likely not performance critical, but I see no reason to > write the code this way. > > Are you limiting the strings to 1024 bytes on purpose? I don't see > any comment mentioning you want to truncate strings. > > Would it not be better to do this part using a > AppendStringInfoString()? and just manually add a '{', ',' or '}' as > and when required? >> DatumGetPointer(valout) should really be using DatumGetCString(valout). > > Likely you can also use heap_form_tuple. This will save you having to > convert ints into strings then only to have BuildTupleFromCStrings() > do the reverse. > I agree. I admit all of this is a residue of an initial hackish version of the function, and should be changed to StringInfo. Will fix. > 5. individiaul -> individual > lists. This allows very accurate estimates for individiaul columns, but > > litst -> lists > > litst on combinations of columns. Similarly to functional dependencies > Will fix. > 6. Worth mentioning planning cycles too? > > "It's advisable to create <literal>MCV</literal> statistics objects only > on combinations of columns that are actually used in conditions together, > and for which misestimation of the number of groups is resulting in bad > plans. Otherwise, the <command>ANALYZE</command> cycles are just wasted." > Makes sense. Although that's what we say about the existing stats, so perhaps we should tweak that too. > 7. straight-forward -> straightforward > > (most-common values) lists, a straight-forward extension of the per-column > > 8. adresses -> addresses > > statistics adresses the limitation by storing individual values, but it > Will fix. Thanks for proof-reading. > 9. Worth mentioning ANALYZE time? > > This section introduces multivariate variant of <acronym>MCV</acronym> > (most-common values) lists, a straight-forward extension of the per-column > statistics described in <xref linkend="row-estimation-examples"/>. This > statistics adresses the limitation by storing individual values, but it > is naturally more expensive, both in terms of storage and planning time. > Yeah. > 10. low -> a low > > with low number of distinct values. Before looking at the second query, > > 11. them -> then > > on items in the <acronym>MCV</acronym> list, and them sums the frequencies > Will fix. > 12. Should we be referencing the source from the docs? > > See <function>mcv_clauselist_selectivity</function> > in <filename>src/backend/statistics/mcv.c</filename> for details. > > hmm. I see it's not the first going by: git grep -E "\w+\.c\<" > gt Hmm, that does not return anything to me - do you actually see any references to .c files in the sgml docs? I agree that probably is not a good idea, so I'll remove that. > 13. Pretty minor, but the following loop in > UpdateStatisticsForTypeChange() could use a break; > > attribute_referenced = false; > for (i = 0; i < staForm->stxkeys.dim1; i++) > if (attnum == staForm->stxkeys.values[i]) > attribute_referenced = true; > > UPDATE: If I'd reviewed the correct patch I'd have seen that you'd > removed this already > ;-) > 14. Again in UpdateStatisticsForTypeChange(), would it not be better > to do the statext_is_kind_built(oldtup, STATS_EXT_MCV) check before > checking if the stats contain this column? This gets rid of your > reset_stats variable. > > I also don't quite understand why there's an additional check for > statext_is_kind_built(oldtup, STATS_EXT_MCV), which if that's false > then why do we do the dummy update on the tuple? > > Have you just coded this so that you can support other stats types > later without too much modification? If so, I'm finding it a bit > confusing to read, so maybe it's worth only coding it that way if > there's more than one stats type to reset for. > > UPDATE: If I'd reviewed the correct patch I'd have seen that you'd > removed this already ;-) > > 15. I see you broke out the remainder of the code from > clauselist_selectivity() into clauselist_selectivity_simple(). The > comment looks like just a copy and paste from the original. That > seems like quite a bit of duplication. Is it better to maybe trim down > the original one? > I'll see what I can do. > 16. I initially didn't see how this code transformed the bms into an array: > > /* > * Transform the bms into an array, to make accessing i-th member easier, > * and then construct a filtered version with only attnums referenced > * by the dependency we validate. > */ > attnums = build_attnums(attrs); > > attnums_dep = (int *)palloc(k * sizeof(int)); > for (i = 0; i < k; i++) > attnums_dep[i] = attnums[dependency[i]]; > > Would it be better to name build_attnums() build_attnums_array() ? > > I think it would also be better to, instead of saying "the bms", just > say "attrs". > Hmmm, maybe. > 17. dependencies_clauselist_selectivity(), in: > > if ((dependency_is_compatible_clause(clause, rel->relid, &attnum)) && > (!bms_is_member(listidx, *estimatedclauses))) > > would it be better to have the bms_is_member() first? > Yes, that might be a tad faster. > 18. In dependencies_clauselist_selectivity() there seem to be a new > bug introduced. We do: > > /* mark this one as done, so we don't touch it again. */ > *estimatedclauses = bms_add_member(*estimatedclauses, listidx); > > but the bms_is_member() check that skipped these has been removed. > > It might be easier to document if we just always do: > > if (bms_is_member(listidx, *estimatedclauses)) > continue; > > at the start of both loops. list_attnums can just be left unset for > the originally already estimatedclauses. > It's probably not as clear as it should be, but if the clause is already estimated (or incompatible), then the list_attnums[] entry will be InvalidAttrNumber. Which is what we check in the second loop. > 19. in extended_stats.c, should build_attnums() be documented that the > Bitmapset members are not offset by > FirstLowInvalidHeapAttributeNumber. I think mostly Bitmapsets of > Attnums are offset by this, so might be worth a mention. > Good point. > 20. I think bms_member_index() needs documentation. I imagine you'll > want to mention that the bitmapset must contain the given varattno, > else surely it'll do the wrong thing if it's not. Perhaps an > Assert(bms_is_member(keys, varattno)); should be added to it. > Agreed. Or maybe make it return -1 in that case? It might even have missing_ok flag or something like that. > 21. Comment does not really explain what the function does or what the > arguments mean: > > /* > * statext_is_compatible_clause_internal > * Does the heavy lifting of actually inspecting the clauses for > * statext_is_compatible_clause. > */ > Will improve. > 22. In statext_is_compatible_clause_internal(): > > /* Var = Const */ > > The above comment seems a bit misplaced. It looks like the code below > it is looking for an OpExpr in the form of "Var <op> Const", or "Const > <op> Var". > Yes, I agree. > 23. statext_is_compatible_clause_internal() you have: > > if ((get_oprrest(expr->opno) != F_EQSEL) && > (get_oprrest(expr->opno) != F_NEQSEL) && > (get_oprrest(expr->opno) != F_SCALARLTSEL) && > (get_oprrest(expr->opno) != F_SCALARLESEL) && > (get_oprrest(expr->opno) != F_SCALARGTSEL) && > (get_oprrest(expr->opno) != F_SCALARGESEL)) > return false; > > 6 calls to get_oprrest(). 1 is enough. > > How does the existing MCV and histogram stats handle these operators? > Does it insist on a btree opfamily, or is it as crude as this too? > It's this crude too, AFAICS. > 24. In statext_is_compatible_clause_internal, you have: > > /* NOT/AND/OR clause */ > if (or_clause(clause) || > and_clause(clause) || > not_clause(clause)) > { > /* > * AND/OR/NOT-clauses are supported if all sub-clauses are supported > > Looks like you were not sure which order to have these, so you just > tried a few variations :-D Maybe just make them all the same? > If you insist ;-) > 25. Does statext_is_compatible_clause_internal)_ need to skip over RelabelTypes? > I believe it does, based on what I've observed during development. Why do you think it's not necessary? > 26. In statext_is_compatible_clause_internal() you mention: /* We only > support plain Vars for now */, but I see nothing that ensures that > only Vars are allowed in the is_opclause() condition. > > /* see if it actually has the right */ > ok = (NumRelids((Node *) expr) == 1) && > (is_pseudo_constant_clause(lsecond(expr->args)) || > (varonleft = false, > is_pseudo_constant_clause(linitial(expr->args)))); > > the above would allow var+var == const through. > But then we call statext_is_compatible_clause_internal on it again, and that only allows Vars and "Var op Const" expressions. Maybe there's a way around that? > The NumRelids seems like it would never have anything > 1 as you have > a BMS_SINGLETON test on the RestrictInfo where you're calling this > function from. I think you likely want just a IsA(... , Var) checks > here, after skipping over RelabelTypes. > > Not sure what "/* see if it actually has the right */" means. > That should have been "right structure" I believe. > 27. Should the function be named something more related to MCV? The > name makes it appear fairly generic to extended stats. > > * statext_is_compatible_clause > * Determines if the clause is compatible with MCV lists. > No, because it's supposed to also handle histograms (and perhaps other stats types) in the future. > 28. This comment seems wrong: > > * Currently we only support Var = Const, or Const = Var. It may be possible > * to expand on this later. > > I see you're allowing IS NULL and IS NOT NULL too. = does not seem to > be required either. > OK, will fix. > 29. The following fragment makes me think we're only processing > clauses to use them with MCV lists, but the comment claims "dependency > selectivity estimations" > > /* we're interested in MCV lists */ > int types = STATS_EXT_MCV; > > /* check if there's any stats that might be useful for us. */ > if (!has_stats_of_kind(rel->statlist, types)) > return (Selectivity) 1.0; > > list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) * > list_length(clauses)); > > /* > * Pre-process the clauses list to extract the attnums seen in each item. > * We need to determine if there's any clauses which will be useful for > * dependency selectivity estimations. Along the way we'll record all of > Yeah, that's copy-pasto. > 30. Is it better to do the bms_is_member() first here? > > if ((statext_is_compatible_clause(clause, rel->relid, &attnums)) && > (!bms_is_member(listidx, *estimatedclauses))) > > Likely it'll be cheaper. > Yeah, same as before. > 31. I think this comment should be /* Ensure choose_best_statistics() > didn't mess up */ > > /* We only understand MCV lists for now. */ > Assert(stat->kind == STATS_EXT_MCV); > I'll expand the comment a bit. > 32. What're lags? > > bool *isnull; /* lags of NULL values (up to 32 columns) */ > Should be "flags" I think. > 33. "ndimentions"? There's no field in the struct by that name. I'd > assume it's the same size as the isnull array above it? > > Datum *values; /* variable-length (ndimensions) */ > Yes, that's the case. > 34. README.mcv > > * large -> a large > > For columns with large number of distinct values (e.g. those with continuous > > * Is the following up-to-date? I thought I saw code for NOT too? > > (a) equality clauses WHERE (a = 1) AND (b = 2) > (b) inequality clauses WHERE (a < 1) AND (b >= 2) > (c) NULL clauses WHERE (a IS NULL) AND (b IS NOT NULL) > (d) OR clauses WHERE (a < 1) OR (b >= 2) > > * multi-variate -> multivariate > > are large the list may be quite large. This is especially true for multi-variate > > * a -> an > > TODO Currently there's no logic to consider building only a MCV list (and not > > * I'd have said "an SRF", but git grep "a SRF" disagrees with me. I > guess those people must be pronouncing it, somehow!? surf... serf... ? > > easier, there's a SRF returning detailed information about the MCV lists. > > * Is it better to put a working SQL in here? > > SELECT * FROM pg_mcv_list_items(stxmcv); > > maybe like: > > SELECT s.* FROM pg_statistic_ext, LATERAL pg_mcv_list_items(stxmcv) s; > > Maybe with a WHERE clause? > > * This list seems outdated. > > - item index (0, ..., (nitems-1)) > - values (string array) > - nulls only (boolean array) > - frequency (double precision) > > base_frequency seems to exist now too. > Yeah, those are mostly typos. Will fix. thanks -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: