Re: Extended Statistics set/restore/clear functions. - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Extended Statistics set/restore/clear functions.
Msg-id: b4ba8ed9-f1b3-4b6b-8612-ff42d65c18c6@vondra.me
In response to: Re: Extended Statistics set/restore/clear functions. (Corey Huinker <corey.huinker@gmail.com>)
List: pgsql-hackers
On 1/23/25 15:51, Corey Huinker wrote:
> On Wed, Jan 22, 2025 at 5:50 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> > Hi,
> >
> > Thanks for continuing to work on this.
> >
> > On 1/22/25 19:17, Corey Huinker wrote:
> > > This is a separate thread for work started in [1] but focused purely
> > > on getting the following functions working:
> > >
> > > * pg_set_extended_stats
> > > * pg_clear_extended_stats
> > > * pg_restore_extended_stats
> > >
> > > These functions are analogous to their relation/attribute
> > > counterparts, use the same calling conventions, and build upon the
> > > same basic infrastructure.
> > >
> > > I think it is important that we get these implemented because they
> > > close the gap that was left in terms of the ability to modify
> > > existing statistics, and to round out the work being done to carry
> > > over statistics via dump/restore and pg_upgrade in [1].
> > >
> > > The purpose of each patch is as follows (adapted from the previous
> > > thread):
> > >
> > > 0001 - This makes the input function for pg_ndistinct functional.
> > >
> > > 0002 - This makes the input function for pg_dependencies functional.
> >
> > I only quickly skimmed the patches, but a couple of comments:
> >
> > 1) I think it makes perfect sense to use the JSON parsing for the
> > input functions, but maybe it'd be better to adjust the format a bit
> > to make that even easier?
> >
> > Right now the JSON "keys" have structure, which means we need some ad
> > hoc parsing. Maybe we should make it "proper JSON" by moving that into
> > separate key/value pairs, e.g. for ndistinct we might replace this:
> >
> >     {"1, 2": 2323, "1, 3": 3232, ...}
> >
> > with this:
> >
> >     [ {"keys": [1, 2], "ndistinct": 2323},
> >       {"keys": [1, 3], "ndistinct": 3232},
> >       ... ]
> >
> > so a regular JSON array of objects, with "keys" an array. And
> > similarly for dependencies.
>
> That is almost exactly what I did back when the stats import functions
> took a nested JSON argument.
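(For illustration only: a small Python sketch of the conversion between the two formats discussed above. The function name is hypothetical, not part of any patch, and the actual parsing in the patches is done in C.)

```python
import json

def ndistinct_to_array_form(compact: str) -> list:
    """Convert the current compact pg_ndistinct text format, e.g.
    {"1, 2": 2323, "1, 3": 3232}, into the proposed array-of-objects
    form [{"keys": [...], "ndistinct": ...}, ...]."""
    items = json.loads(compact)
    return [
        {"keys": [int(a) for a in key.split(",")], "ndistinct": n}
        for key, n in items.items()
    ]

print(ndistinct_to_array_form('{"1, 2": 2323, "1, 3": 3232}'))
```

This is roughly the "SQL gymnastics" pg_dump would have to do (in SQL rather than Python) to translate output from older servers into the friendlier format.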
> The biggest problem with changing that format is that the old format
> would still show up in the systems being exported, so we would have to
> process that format as well as the new one.
>
> > Yes, it's more verbose, but maybe better for "mechanical" processing?
>
> It absolutely would be better for processing, but we'd still have to
> read the old format from older systems. I suppose the pg_dump code
> could do some SQL gymnastics to convert the old json-but-sad format
> into the processing-friendly format of the future, and I could easily
> adapt what I wrote over a year ago to that task. I suppose it's just a
> matter of having the community behind changing the output format to
> enable a better input format.

D'oh! I always forget about the backwards-compatibility issue, i.e. that
we still need to ingest values from already-released versions. Yeah,
that makes the format change less beneficial.

> > 2) Do we need some sort of validation? Perhaps this was discussed in
> > the other thread and I missed it, but isn't it a problem that we
> > happily accept e.g. this?
> >
> >     {"6666, 6666": 1, "1, -222": 14, ...}
> >
> > That has duplicate keys with bogus attribute numbers, stats on
> > (bogus) system attributes, etc. I suspect this may easily cause
> > problems during planning (if it happens to visit those statistics).
>
> We used to have _lots_ of validation for data-quality issues, much of
> which was removed at the request of reviewers. However, much of that
> discussion was about the health of the statistics, but these are
> standalone data types, so maybe they're held to a higher standard. If
> so, what sort of checks do you think would be needed and/or wanted?
>
> So far, I can imagine the following:
>
> * no negative attnums in the key list
> * no duplicate attnums in the key list
>
> anything else?

Yeah, I recall there were a lot of checks initially and we dropped them
over time. I'm not asking to reinstate all of those thorough checks.
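(Again purely illustrative: a Python sketch of the two key-list checks listed above, with the error messages and function name being my own invention; the real checks would live in the C input functions.)

```python
def validate_key_list(attnums):
    """Apply the two checks under discussion to one pg_ndistinct /
    pg_dependencies key list: reject non-positive attnums (which rules
    out system attributes too) and reject duplicates."""
    errors = []
    if any(a <= 0 for a in attnums):
        errors.append("attnums must be positive user-column numbers")
    if len(set(attnums)) != len(attnums):
        errors.append("duplicate attnums in key list")
    return errors

# The bogus example from the review:
print(validate_key_list([6666, 6666]))  # duplicate attnums
print(validate_key_list([1, -222]))     # negative attnum
```

Checking that each attnum actually exists in the table / statistics object would additionally require the relation's descriptor, which is the part discussed below.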
At this point I was really thinking only about validating the attnums,
i.e. making sure each one is a valid attribute of the table / statistics
object. That is something pg_set_attribute_stats() enforces too, thanks
to having a separate argument for the attribute name.

That's where I'd stop. I don't want to do checks on the statistics
content, like verifying that the frequencies in the MCV sum up to 1.0 or
anything like that. I think we're not doing that for
pg_set_attribute_stats() either (and I'd bet one could cause a lot of
"fun" that way).

> > Maybe that's acceptable - ultimately the user could import something
> > broken in a much subtler way, of course. But pg_set_attribute_stats
> > seems somewhat more protected against this, because it gets the
> > attribute as a separate argument.
>
> The data type itself is in isolation, but once we've created a valid
> pg_ndistinct or pg_dependencies, there's nothing stopping us from
> validating the values in the data type against the statistics object
> and the relation it belongs to, though that might get the same
> resistance I got to, say, ensuring that frequency lists were
> monotonically decreasing.

Understood. IMHO it's fine to say we're not validating that the
statistics are "consistent", but I think we should check that they match
the definition.

> > I recall I wished to have the attnum in the output function, but that
> > was not quite possible because we don't know the relid (and thus the
> > descriptor) in that function.
> >
> > Is it a good idea to rely on the input/output format directly? How
> > will that deal with cross-version differences? Doesn't it mean the
> > in/out format is effectively fixed, or at least has to be backwards
> > compatible (i.e. a new version has to handle any value from older
> > versions)?
>
> Presently there are no cross-version differences, though earlier I
> addressed the pros and cons of changing it.
> No matter what, the burden of having a valid format is on the user in
> the case of pg_set_extended_stats(). pg_restore_extended_stats() has a
> server version number associated with it, so the option of handling a
> format change could be baked in, but then we're doing version tests and
> input typecasts like we do with ANYARRAY types. Not impossible, but
> definitely more work.

OK, makes sense.

> > Or what if I want to import the stats for a table with a slightly
> > different structure (e.g. because dump/restore skips dropped
> > columns)? Won't that be a problem with the format containing raw
> > attnums? Or is this a use case we don't expect to work?
>
> The family of pg_set_*_stats functions expects the input to be
> meaningful and correct for that relation on that server version. Any
> attnum translation would have to be done by the user to adapt to the
> new or changed relation.
>
> The family of pg_restore_*_stats functions is designed to be forward
> compatible and to work across versions, but for basically the same
> relation, or a relation of the same shape. Basically, they're for
> pg_restore and pg_upgrade, so no changes in attnums would be expected.

OK.

> > For the per-attribute stats it's probably fine, because that's mostly
> > just a collection of regular data types (scalar values or arrays of
> > values, ...) and we're not modifying them except for maybe adding new
> > fields. But extended stats seem more complex, so maybe it's
> > different?
>
> I had that working by attname matching way back in the early days, but
> it would involve creating an intermediate format for translating the
> attnums to attnames on export, and then re-translating them on the way
> back in.
> I suppose someone could write the following utility functions:
>
> pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct)
>     -> json
> pg_xlat_ndistinct_from_attnames(relation reloid, ndist json)
>     -> pg_ndistinct
>
> and that would bridge the gap for the special case where you want to
> adapt pg_ndistinct from one table structure to a slightly different
> one.

OK.

> > I remember a long discussion about the format at the very beginning
> > of this patch series, and the conclusion clearly was to have a
> > function that imports stats for one attribute at a time. And that
> > seems to be working fine, but extended stats values have more
> > internal structure, so perhaps they need to do something more
> > complicated.
>
> I believe this *is* doing something more complicated, especially with
> the MCVList, though I was very pleased to see how much of the existing
> infrastructure I was able to reuse.

OK.

> > > 0003 - Makes several static functions in attribute_stats.c public
> > > for use by extended stats. One of those is get_stat_attr_type(),
> > > which in the last patchset was modified to take an attribute name
> > > rather than an attnum, thus saving a syscache lookup. However,
> > > extended stats identifies attributes by attnum, not name, so that
> > > optimization had to be set aside, at least temporarily.
> > >
> > > 0004 - These implement the functions pg_set_extended_stats(),
> > > pg_clear_extended_stats(), and pg_restore_extended_stats(), which
> > > behave like their relation/attribute equivalents. If we can get
> > > these committed and used by pg_dump, then we don't have to debate
> > > how to handle post-upgrade steps for users who happen to have
> > > extended stats vs. the approximately 99.75% of users who do not.
> >
> > I see there's a couple of MCV-specific functions in extended_stats.c.
> > Shouldn't those go into mcv.c instead?
>
> I wanted to put it there, but there was a reason I didn't and I've now
> forgotten what it was.
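(As a rough illustration of what the hypothetical pg_xlat_* translation above would do, here is a Python sketch; the function names, the attnum/attname mappings, and the dropped-column scenario are all made up for the example. In PostgreSQL the mappings would come from the relation's tuple descriptor.)

```python
def ndistinct_attnums_to_attnames(entries, attnames_by_num):
    """Translate the attnum key lists of an array-form ndistinct
    structure into attribute names, so the stats can be re-mapped onto
    a table whose column numbering differs (e.g. dropped columns)."""
    return [{"keys": [attnames_by_num[a] for a in e["keys"]],
             "ndistinct": e["ndistinct"]}
            for e in entries]

def ndistinct_attnames_to_attnums(entries, attnums_by_name):
    """Inverse translation, using the *target* table's numbering."""
    return [{"keys": [attnums_by_name[n] for n in e["keys"]],
             "ndistinct": e["ndistinct"]}
            for e in entries]

# Hypothetical scenario: on the target table a column was dropped,
# so column "b" is attnum 3 instead of 2.
src = [{"keys": [1, 2], "ndistinct": 2323}]
named = ndistinct_attnums_to_attnames(src, {1: "a", 2: "b"})
print(ndistinct_attnames_to_attnums(named, {"a": 1, "b": 3}))
```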
> I'll make an effort to relocate it to mcv.c.
>
> For that matter, it might make sense to break out the expressions code
> into its own file, because every other statistics kind has its own.
> Thoughts on that?

+1 to that, if it reduces unnecessary code duplication.

> > FWIW there's a bunch of whitespace issues during git apply.
>
> +1
>
> > OK. Thanks for the patch!
>
> Thanks for the feedback, please keep it coming. I think it's really
> important that extended stats, though used rarely, be included in our
> dump/restore/upgrade changes so as to make for a more consistent user
> experience.

I agree, and I appreciate you working on it.

--
Tomas Vondra