Thread: Bogosities in pg_dump's extended statistics support

Bogosities in pg_dump's extended statistics support

From

Tom Lane

Date:

11 February 2018, 08:56:46

While poking around in pg_dump for another purpose, I happened to notice
these things about its handling of extended-statistics objects:

1. pg_dump supposes that a stats object must be in the same schema as
the table it's on.  This is flat wrong.

regression=# create schema s1;
CREATE SCHEMA
regression=# create table s1.mytab (f1 int, f2 int);
CREATE TABLE
regression=# create schema s2;
CREATE SCHEMA
regression=# create statistics s2.mystats on f1, f2 from s1.mytab;
CREATE STATISTICS

Even if this weren't flat wrong already, it's not going to scale to the
intended expansion of stats objects to describe multi-table stats.

2. pg_dump supposes that a stats object must have the same owner as
the table it's on.  This is also flat wrong, despite the backend
restriction that the stats creator must have ownership privilege
on the table:

regression=# create user tableowner;
CREATE ROLE
regression=# create user statsowner;
CREATE ROLE
regression=# grant tableowner to statsowner;
GRANT ROLE
regression=# \c - tableowner
You are now connected to database "regression" as user "tableowner".
regression=> create table sometable (f1 int, f2 int);
CREATE TABLE
regression=> \c - statsowner
You are now connected to database "regression" as user "statsowner".
regression=> create statistics somestats on f1, f2 from sometable;
CREATE STATISTICS

3. pg_dump decides whether to dump a stats object on the basis of whether
its underlying table is going to be dumped.  While that's not totally
silly today, it likewise isn't going to scale to multi-table stats.
I propose that we should just treat extended stats objects like other
generic database objects --- which, in practice, means "dump if the
containing schema is getting dumped".  We don't exclude views or matviews
on the basis of whether their underlying tables are going to be dumped,
so why should stats objects operate differently from those?

4. pg_dump issues a separate query against pg_statistic_ext for every
table it's contemplating dumping.  This is pretty inefficient, at least
for people with lots of tables, and it also is going to be flat broken
when multi-table stats arrive.  We should just do one query and iterate
through all the stats objects at once.  Changing the points above will
eliminate any need for pg_dump to associate stats objects with particular
underlying tables at all, so that there's no value in the per-table way.

Points 1 and 2 clearly constitute bugs whose fixes need to be back-patched
into v10, because dump/restore will misbehave today on stats objects
violating those presumptions.  One could argue for leaving point 3 alone
in v10, but I'm inclined to back-patch that change as well rather than
allowing people to get used to a non-future-proof behavior.

Comments?

            regards, tom lane

Re: Bogosities in pg_dump's extended statistics support

From

Alvaro Herrera

Date:

12 February 2018, 18:07:08

Thanks for patching.  Your points #1, #2 and #4 are OK with me.  I'm
thinking about point #3, concerning when-to-dump policy for stats
objects:

Tom Lane wrote:

> 3. pg_dump decides whether to dump a stats object on the basis of whether
> its underlying table is going to be dumped.  While that's not totally
> silly today, it likewise isn't going to scale to multi-table stats.
> I propose that we should just treat extended stats objects like other
> generic database objects --- which, in practice, means "dump if the
> containing schema is getting dumped".  We don't exclude views or matviews
> on the basis of whether their underlying tables are going to be dumped,
> so why should stats objects operate differently from those?

I claim that views and matviews are indeed separate objects rather than
sub-objects belonging to a table; but a stats object is, in some way,
just part of the table(s) definition(s), so the analogy doesn't hold
necessarily.  In particular the decision to dump the stats object should
be connected to the table(s) rather than the stat object as standing
alone.

In other words, I think the original policy of dumping a single-table
stats object whenever the table is dumped should be retained.  I think
this should be a simple patch on top of what you committed.

The followup question is what to do for multi-table stats objects.  I
think the options are that it should be dumped if:

1. any of its tables are dumped
2. *all* its tables are dumped
3. the schema containing the stats object is dumped

I think #2 is a loser: the object goes missing if you dump the tables
individually.

#3 has a similar problem: if you put the stats object in a schema that's
not that of any of the involved tables (admittedly a somewhat odd
choice), then again the stats object goes missing, and there is no
indication that the dump is incomplete except that the application
becomes slower -- a situation difficult to detect (unlike missing
regular objects, because then queries break in obvious ways).

My preference is for #1.  It has the small problem that if you have one
dump for each table, the stats object appears more than once (so
restoring both causes an error during restore of the second one), but I
don't think this is terrible.

I think you could go even further and say (combination of #1 and #3):

4. dump the stats object if any of the involved tables are dumped, *or*
   the schema containing the stats object is dumped.

with the rationale that if you explicitly dump a schema in which you've
specifically put a number of stat objects, then they should appear in
the dump, even if the tables are not dumped.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Bogosities in pg_dump's extended statistics support

From

Tom Lane

Date:

12 February 2018, 19:14:04

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Tom Lane wrote:
>> I propose that we should just treat extended stats objects like other
>> generic database objects --- which, in practice, means "dump if the
>> containing schema is getting dumped".  We don't exclude views or matviews
>> on the basis of whether their underlying tables are going to be dumped,
>> so why should stats objects operate differently from those?

> I claim that views and matviews are indeed separate objects rather than
> sub-objects belonging to a table; but a stats object is, in some way,
> just part of the table(s) definition(s), so the analogy doesn't hold
> necessarily.  In particular the decision to dump the stats object should
> be connected to the table(s) rather than the stat object as standing
> alone.

Meh.  The thing that is going to make this painful is when we allow
extensions to contain stats objects, something which doesn't seem to be
allowed today (at least, ALTER EXTENSION ADD STATISTICS doesn't work)
but which is surely a rather bad oversight.  Consider two cases for an
extension E containing a table T:

* There is a stats object S on T, and S is also a member of E.

* There is a stats object S on T, and S is NOT a member of E
(perhaps somebody made it after the fact).

I argue that if you try to tie S's dumpability strictly to T's, you are
going to break one or the other of these cases, either treating S as if
it were a member of E when you shouldn't or vice versa.  Getting the
binary-upgrade case right will make it even more complicated.

(And I haven't even mentioned, say, stats objects on two tables belonging
to different extensions.)

I think we've made our choice and it's that stats objects are independent
objects.  Trying to preserve some part of the they're-just-attributes-
of-a-table view is going to create more problems than it solves.

            regards, tom lane

Re: Bogosities in pg_dump's extended statistics support

From

Robert Haas

Date:

12 February 2018, 19:25:49

On Mon, Feb 12, 2018 at 10:07 AM, Alvaro Herrera
<alvherre@alvh.no-ip.org> wrote:
> The followup question is what to do for multi-table stats objects.  I
> think the options are that it should be dumped if:
>
> 1. any of its tables are dumped
> 2. *all* its tables are dumped
> 3. the schema containing the stats object is dumped
>
> I think #2 is a loser: the object goes missing if you dump the tables
> individually.
>
> #3 has a similar problem: if you put the stats object in a schema that's
> not that of any of the involved tables (admittedly a somewhat odd
> choice), then again the stats object goes missing, and there is no
> indication that the dump is incomplete except that the application
> becomes slower -- a situation difficult to detect (unlike missing
> regular objects, because then queries break in obvious ways).
>
> My preference is for #1.  It has the small problem that if you have one
> dump for each table, the stats object appears more than once (so
> restoring both causes an error during restore of the second one), but I
> don't think this is terrible.

It's not the worst thing in the world, but it's a long way from being
the best thing.  My personal experience has been that it sucks when
you get errors during a restore -- it's often not clear exactly what
object was affected and whether the error is something that's going to
break your application.  Of course, it's even worse if you are
restoring in single-transaction or on-error-stop mode.

I believe I previously argued that statistics objects should be
considered first-class objects in their own right, not just table
properties, and that's still my position.  In short, I agree with Tom.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company