Re: Extended Statistics set/restore/clear functions. - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Extended Statistics set/restore/clear functions.
Msg-id: b4ba8ed9-f1b3-4b6b-8612-ff42d65c18c6@vondra.me
In response to: Re: Extended Statistics set/restore/clear functions. (Corey Huinker <corey.huinker@gmail.com>)
List: pgsql-hackers
On 1/23/25 15:51, Corey Huinker wrote:
> On Wed, Jan 22, 2025 at 5:50 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> > Hi,
> >
> > Thanks for continuing to work on this.
> >
> > On 1/22/25 19:17, Corey Huinker wrote:
> > > This is a separate thread for work started in [1] but focused purely
> > > on getting the following functions working:
> > >
> > > * pg_set_extended_stats
> > > * pg_clear_extended_stats
> > > * pg_restore_extended_stats
> > >
> > > These functions are analogous to their relation/attribute
> > > counterparts, use the same calling conventions, and build upon the
> > > same basic infrastructure.
> > >
> > > I think it is important that we get these implemented because they
> > > close the gap that was left in terms of the ability to modify
> > > existing statistics, and to round out the work being done to carry
> > > over statistics via dump/restore and pg_upgrade in [1].
> > >
> > > The purpose of each patch is as follows (adapted from the previous
> > > thread):
> > >
> > > 0001 - This makes the input function for pg_ndistinct functional.
> > >
> > > 0002 - This makes the input function for pg_dependencies functional.
> >
> > I only quickly skimmed the patches, but a couple of comments:
> >
> > 1) I think it makes perfect sense to use the JSON parsing for the
> > input functions, but maybe it'd be better to adjust the format a bit
> > to make that even easier?
> >
> > Right now the JSON "keys" have structure, which means we need some ad
> > hoc parsing. Maybe we should make it "proper JSON" by moving that into
> > separate key/value pairs, e.g. for ndistinct we might replace this:
> >
> >     {"1, 2": 2323, "1, 3": 3232, ...}
> >
> > with this:
> >
> >     [ {"keys": [1, 2], "ndistinct": 2323},
> >       {"keys": [1, 3], "ndistinct": 3232},
> >       ... ]
> >
> > so a regular JSON array of objects, with "keys" an array. And
> > similarly for dependencies.
>
> That is almost exactly what I did back when the stats import functions
> took a nested JSON argument.
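(For illustration only: a small Python sketch of the conversion between the two formats discussed above. The function name is hypothetical, not part of any patch, and the actual parsing in the patches is done in C.)

```python
import json

def ndistinct_to_array_form(compact: str) -> list:
    """Convert the current compact pg_ndistinct text format, e.g.
    {"1, 2": 2323, "1, 3": 3232}, into the proposed array-of-objects
    form [{"keys": [...], "ndistinct": ...}, ...]."""
    items = json.loads(compact)
    return [
        {"keys": [int(a) for a in key.split(",")], "ndistinct": n}
        for key, n in items.items()
    ]

print(ndistinct_to_array_form('{"1, 2": 2323, "1, 3": 3232}'))
```

This is roughly the "SQL gymnastics" pg_dump would have to do (in SQL rather than Python) to translate output from older servers into the friendlier format.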
> The biggest problem with changing that format is that the old format
> would still show up in the systems being exported, so we would have to
> process that format as well as the new one.
>
> > Yes, it's more verbose, but maybe better for "mechanical" processing?
>
> It absolutely would be better for processing, but we'd still have to
> read the old format from older systems. I suppose the pg_dump code
> could do some SQL gymnastics to convert the old json-but-sad format
> into the processing-friendly format of the future, and I could easily
> adapt what I wrote over a year ago to that task. I suppose it's just a
> matter of having the community behind changing the output format to
> enable a better input format.

D'oh! I always forget about the backwards-compatibility issue, i.e. that
we still need to ingest values from already-released versions. Yeah,
that makes the format change less beneficial.

> > 2) Do we need some sort of validation? Perhaps this was discussed in
> > the other thread and I missed it, but isn't it a problem that we
> > happily accept e.g. this?
> >
> >     {"6666, 6666": 1, "1, -222": 14, ...}
> >
> > That has duplicate keys with bogus attribute numbers, stats on
> > (bogus) system attributes, etc. I suspect this may easily cause
> > problems during planning (if it happens to visit those statistics).
>
> We used to have _lots_ of validation for data-quality issues, much of
> which was removed at the request of reviewers. However, much of that
> discussion was about the health of the statistics, but these are
> standalone data types, so maybe they're held to a higher standard. If
> so, what sort of checks do you think would be needed and/or wanted?
>
> So far, I can imagine the following:
>
> * no negative attnums in the key list
> * no duplicate attnums in the key list
>
> anything else?

Yeah, I recall there were a lot of checks initially and we dropped them
over time. I'm not asking to reinstate all of those thorough checks.
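(Again purely illustrative: a Python sketch of the two key-list checks listed above, with the error messages and function name being my own invention; the real checks would live in the C input functions.)

```python
def validate_key_list(attnums):
    """Apply the two checks under discussion to one pg_ndistinct /
    pg_dependencies key list: reject non-positive attnums (which rules
    out system attributes too) and reject duplicates."""
    errors = []
    if any(a <= 0 for a in attnums):
        errors.append("attnums must be positive user-column numbers")
    if len(set(attnums)) != len(attnums):
        errors.append("duplicate attnums in key list")
    return errors

# The bogus example from the review:
print(validate_key_list([6666, 6666]))  # duplicate attnums
print(validate_key_list([1, -222]))     # negative attnum
```

Checking that each attnum actually exists in the table / statistics object would additionally require the relation's descriptor, which is the part discussed below.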
At this point I was really thinking only about validating the attnums,
i.e. making sure each one is a valid attribute of the table / statistics
object. That is something pg_set_attribute_stats() enforces too, thanks
to having a separate argument for the attribute name.

That's where I'd stop. I don't want to do checks on the statistics
content, like verifying that the frequencies in the MCV sum up to 1.0 or
anything like that. I think we're not doing that for
pg_set_attribute_stats() either (and I'd bet one could cause a lot of
"fun" that way).

> > Maybe that's acceptable - ultimately the user could import something
> > broken in a much subtler way, of course. But pg_set_attribute_stats
> > seems somewhat more protected against this, because it gets the
> > attribute as a separate argument.
>
> The data type itself is in isolation, but once we've created a valid
> pg_ndistinct or pg_dependencies, there's nothing stopping us from
> validating the values in the data type against the statistics object
> and the relation it belongs to, though that might get the same
> resistance I got to, say, ensuring that frequency lists were
> monotonically decreasing.

Understood. IMHO it's fine to say we're not validating that the
statistics are "consistent", but I think we should check that they match
the definition.

> > I recall I wished to have the attnum in the output function, but that
> > was not quite possible because we don't know the relid (and thus the
> > descriptor) in that function.
> >
> > Is it a good idea to rely on the input/output format directly? How
> > will that deal with cross-version differences? Doesn't it mean the
> > in/out format is effectively fixed, or at least has to be backwards
> > compatible (i.e. a new version has to handle any value from older
> > versions)?
>
> Presently there are no cross-version differences, though earlier I
> addressed the pros and cons of changing it.
> No matter what, the burden of having a valid format is on the user in
> the case of pg_set_extended_stats(). pg_restore_extended_stats() has a
> server version number associated with it, so the option of handling a
> format change could be baked in, but then we're doing version tests and
> input typecasts like we do with ANYARRAY types. Not impossible, but
> definitely more work.

OK, makes sense.

> > Or what if I want to import the stats for a table with a slightly
> > different structure (e.g. because dump/restore skips dropped
> > columns)? Won't that be a problem with the format containing raw
> > attnums? Or is this a use case we don't expect to work?
>
> The family of pg_set_*_stats functions expects the input to be
> meaningful and correct for that relation on that server version. Any
> attnum translation would have to be done by the user to adapt to the
> new or changed relation.
>
> The family of pg_restore_*_stats functions is designed to be forward
> compatible and to work across versions, but for basically the same
> relation, or a relation of the same shape. Basically, they're for
> pg_restore and pg_upgrade, so no changes in attnums would be expected.

OK.

> > For the per-attribute stats it's probably fine, because that's mostly
> > just a collection of regular data types (scalar values or arrays of
> > values, ...) and we're not modifying them except for maybe adding new
> > fields. But extended stats seem more complex, so maybe it's
> > different?
>
> I had that working by attname matching way back in the early days, but
> it would involve creating an intermediate format for translating the
> attnums to attnames on export, and then re-translating them on the way
> back in.
> I suppose someone could write the following utility functions:
>
> pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct)
>     -> json
> pg_xlat_ndistinct_from_attnames(relation reloid, ndist json)
>     -> pg_ndistinct
>
> and that would bridge the gap for the special case where you want to
> adapt pg_ndistinct from one table structure to a slightly different
> one.

OK.

> > I remember a long discussion about the format at the very beginning
> > of this patch series, and the conclusion clearly was to have a
> > function that imports stats for one attribute at a time. And that
> > seems to be working fine, but extended stats values have more
> > internal structure, so perhaps they need to do something more
> > complicated.
>
> I believe this *is* doing something more complicated, especially with
> the MCVList, though I was very pleased to see how much of the existing
> infrastructure I was able to reuse.

OK.

> > > 0003 - Makes several static functions in attribute_stats.c public
> > > for use by extended stats. One of those is get_stat_attr_type(),
> > > which in the last patchset was modified to take an attribute name
> > > rather than an attnum, thus saving a syscache lookup. However,
> > > extended stats identifies attributes by attnum, not name, so that
> > > optimization had to be set aside, at least temporarily.
> > >
> > > 0004 - These implement the functions pg_set_extended_stats(),
> > > pg_clear_extended_stats(), and pg_restore_extended_stats(), which
> > > behave like their relation/attribute equivalents. If we can get
> > > these committed and used by pg_dump, then we don't have to debate
> > > how to handle post-upgrade steps for users who happen to have
> > > extended stats vs. the approximately 99.75% of users who do not.
> >
> > I see there's a couple of MCV-specific functions in extended_stats.c.
> > Shouldn't those go into mcv.c instead?
>
> I wanted to put it there, but there was a reason I didn't and I've now
> forgotten what it was.
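(As a rough illustration of what the hypothetical pg_xlat_* translation above would do, here is a Python sketch; the function names, the attnum/attname mappings, and the dropped-column scenario are all made up for the example. In PostgreSQL the mappings would come from the relation's tuple descriptor.)

```python
def ndistinct_attnums_to_attnames(entries, attnames_by_num):
    """Translate the attnum key lists of an array-form ndistinct
    structure into attribute names, so the stats can be re-mapped onto
    a table whose column numbering differs (e.g. dropped columns)."""
    return [{"keys": [attnames_by_num[a] for a in e["keys"]],
             "ndistinct": e["ndistinct"]}
            for e in entries]

def ndistinct_attnames_to_attnums(entries, attnums_by_name):
    """Inverse translation, using the *target* table's numbering."""
    return [{"keys": [attnums_by_name[n] for n in e["keys"]],
             "ndistinct": e["ndistinct"]}
            for e in entries]

# Hypothetical scenario: on the target table a column was dropped,
# so column "b" is attnum 3 instead of 2.
src = [{"keys": [1, 2], "ndistinct": 2323}]
named = ndistinct_attnums_to_attnames(src, {1: "a", 2: "b"})
print(ndistinct_attnames_to_attnums(named, {"a": 1, "b": 3}))
```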
> I'll make an effort to relocate it to mcv.c.
>
> For that matter, it might make sense to break out the expressions code
> into its own file, because every other statistics kind has its own.
> Thoughts on that?

+1 to that, if it reduces unnecessary code duplication.

> > FWIW there's a bunch of whitespace issues during git apply.
>
> +1
>
> > OK. Thanks for the patch!
>
> Thanks for the feedback, please keep it coming. I think it's really
> important that extended stats, though used rarely, be included in our
> dump/restore/upgrade changes so as to make for a more consistent user
> experience.

I agree, and I appreciate you working on it.

--
Tomas Vondra