Home > mailing lists

Re: Statistics Import and Export - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: Statistics Import and Export
Date	March 25, 2024 23:16:48
Msg-id	bf724b21-914a-4497-84e3-49944f9776f6@enterprisedb.com Whole thread Raw
In response to	Re: Statistics Import and Export (Corey Huinker <corey.huinker@gmail.com>)
Responses	Re: Statistics Import and Export Re: Statistics Import and Export
List	pgsql-hackers

Tree view

On 3/25/24 09:27, Corey Huinker wrote:
> On Fri, Mar 22, 2024 at 9:51 PM Corey Huinker <corey.huinker@gmail.com>
> wrote:
> 
>> v12 attached.
>>
>>
> v13 attached. All the same features as v12, but with a lot more type
> checking, bounds checking, value inspection, etc. Perhaps the most notable
> feature is that we're now ensuring that histogram values are in ascending
> order. This could come in handy for detecting when we are applying stats to
> a column of the wrong type, or the right type but with a different
> collation. It's not a guarantee of validity, of course, but it would detect
> egregious changes in sort order.
> 

Hi,

I did take a closer look at v13 today. I have a bunch of comments and
some minor whitespace fixes in the attached review patches.

0001
----

1) The docs say this:

  <para>
   The purpose of this function is to apply statistics values in an
   upgrade situation that are "good enough" for system operation until
   they are replaced by the next <command>ANALYZE</command>, usually via
   <command>autovacuum</command> This function is used by
   <command>pg_upgrade</command> and <command>pg_restore</command> to
   convey the statistics from the old system version into the new one.
  </para>

I find this a bit confusing, considering the pg_dump/pg_restore changes
are only in 0002, not in this patch.

2) Also, I'm not sure about this:

  <parameter>relation</parameter>, the parameters in this are all
  derived from <structname>pg_stats</structname>, and the values
  given are most often extracted from there.

How do we know where do the values come from "most often"? I mean, where
else would it come from?

3) The function pg_set_attribute_stats() is veeeeery long - 1000 lines
or so, that's way too many for me to think about. I agree the flow is
pretty simple, but I still wonder if there's a way to maybe split it
into some smaller "meaningful" steps.

4) It took me *ages* to realize the enums at the beginning of some of
the functions are actually indexes of arguments in PG_FUNCTION_ARGS.
That'd surely deserve a comment explaining this.

5) The comment for param_names in pg_set_attribute_stats says this:

    /* names of columns that cannot be null */
    const char *param_names[] = { ... }

but isn't that actually incorrect? I think that applies only to a couple
initial arguments, but then other fields (MCV, mcelem stats, ...) can be
NULL, right?

6) There's a couple minor whitespace fixes or comments etc.

0002
----

1) I don't understand why we have exportExtStatsSupported(). Seems
pointless - nothing calls it, even if it did we don't know how to export
the stats.

2) I think this condition in dumpTableSchema() is actually incorrect:

  if ((tbinfo->dobj.dump & DUMP_COMPONENT_STATISTICS) &&
      (tbinfo->relkind == RELKIND_VIEW))
      dumpRelationStats(fout, &tbinfo->dobj, reltypename,

Aren't indexes pretty much exactly the thing for which we don't want to
dump statistics? In fact this skips dumping statistics for table - if
you dump a database with a single table (-Fc), pg_restore -l will tell
you this:

217; 1259 16385 TABLE public t user
3403; 0 16385 TABLE DATA public t user

Which is not surprising, because table is not a view. With an expression
index you get this:

217; 1259 16385 TABLE public t user
3404; 0 16385 TABLE DATA public t user
3258; 1259 16418 INDEX public t_expr_idx user
3411; 0 0 STATS IMPORT public INDEX t_expr_idx

Unfortunately, fixing the condition does not work:

  $ pg_dump -Fc test > test.dump
  pg_dump: warning: archive items not in correct section order

This happens for a very simple reason - the statistics are marked as
SECTION_POST_DATA, which for the index works, because indexes are in
post-data section. But the table stats are dumped right after data,
still in the "data" section.

IMO that's wrong, the statistics should be delayed to the post-data
section. Which probably means there needs to be a separate dumpable
object for statistics on table/index, with a dependency on the object.

3) I don't like the "STATS IMPORT" description. For extended statistics
we dump the definition as "STATISTICS" so why to shorten it to "STATS"
here? And "IMPORT" seems more like the process of loading data, not the
data itself. So I suggest "STATISTICS DATA".

regards
-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

pgsql-hackers by date:

From: Andrew Dunstan
Date: 25 March 2024, 23:14:18
Subject: Re: session username in default psql prompt?

From: Jacob Champion
Date: 25 March 2024, 23:21:40
Subject: Re: WIP Incremental JSON Parser

Re: Statistics Import and Export - Mailing list pgsql-hackers

Attachment

Previous

Next