Thread: Bogosities in pg_dump's extended statistics support
While poking around in pg_dump for another purpose, I happened to notice these things about its handling of extended-statistics objects: 1. pg_dump supposes that a stats object must be in the same schema as the table it's on. This is flat wrong. regression=# create schema s1; CREATE SCHEMA regression=# create table s1.mytab (f1 int, f2 int); CREATE TABLE regression=# create schema s2; CREATE SCHEMA regression=# create statistics s2.mystats on f1, f2 from s1.mytab; CREATE STATISTICS Even if this weren't flat wrong already, it's not going to scale to the intended expansion of stats objects to describe multi-table stats. 2. pg_dump supposes that a stats object must have the same owner as the table it's on. This is also flat wrong, despite the backend restriction that the stats creator must have ownership privilege on the table: regression=# create user tableowner; CREATE ROLE regression=# create user statsowner; CREATE ROLE regression=# grant tableowner to statsowner; GRANT ROLE regression=# \c - tableowner You are now connected to database "regression" as user "tableowner". regression=> create table sometable (f1 int, f2 int); CREATE TABLE regression=> \c - statsowner You are now connected to database "regression" as user "statsowner". regression=> create statistics somestats on f1, f2 from sometable; CREATE STATISTICS 3. pg_dump decides whether to dump a stats object on the basis of whether its underlying table is going to be dumped. While that's not totally silly today, it likewise isn't going to scale to multi-table stats. I propose that we should just treat extended stats objects like other generic database objects --- which, in practice, means "dump if the containing schema is getting dumped". We don't exclude views or matviews on the basis of whether their underlying tables are going to be dumped, so why should stats objects operate differently from those? 4. pg_dump issues a separate query against pg_statistic_ext for every table it's contemplating dumping. This is pretty inefficient, at least for people with lots of tables, and it also is going to be flat broken when multi-table stats arrive. We should just do one query and iterate through all the stats objects at once. Changing the points above will eliminate any need for pg_dump to associate stats objects with particular underlying tables at all, so that there's no value in the per-table way. Points 1 and 2 clearly constitute bugs whose fixes need to be back-patched into v10, because dump/restore will misbehave today on stats objects violating those presumptions. One could argue for leaving point 3 alone in v10, but I'm inclined to back-patch that change as well rather than allowing people to get used to a non-future-proof behavior. Comments? regards, tom lane
Thanks for patching. Your points #1, #2 and #4 are OK with me. I'm thinking about point #3, concerning when-to-dump policy for stats objects: Tom Lane wrote: > 3. pg_dump decides whether to dump a stats object on the basis of whether > its underlying table is going to be dumped. While that's not totally > silly today, it likewise isn't going to scale to multi-table stats. > I propose that we should just treat extended stats objects like other > generic database objects --- which, in practice, means "dump if the > containing schema is getting dumped". We don't exclude views or matviews > on the basis of whether their underlying tables are going to be dumped, > so why should stats objects operate differently from those? I claim that views and matviews are indeed separate objects rather than sub-objects belonging to a table; but a stats object is, in some way, just part of the table(s) definition(s), so the analogy doesn't hold necessarily. In particular the decision to dump the stats object should be connected to the table(s) rather than the stat object as standing alone. In other words, I think the original policy of dumping a single-table stats object whenever the table is dumped should be retained. I think this should be a simple patch on top of what you committed. The followup question is what to do for multi-table stats objects. I think the options are that it should be dumped if: 1. any of its tables are dumped 2. *all* its tables are dumped 3. the schema containing the stats object is dumped I think #2 is a loser: the object goes missing if you dump the tables individually. #3 has a similar problem: if you put the stats object in a schema that's not that of any of the involved tables (admittedly a somewhat odd choice), then again the stats object goes missing, and there is no indication that the dump is incomplete except that the application becomes slower -- a situation difficult to detect (unlike missing regular objects, because then queries break in obvious ways). My preference is for #1. It has the small problem that if you have one dump for each table, the stats object appears more than once (so restoring both causes an error during restore of the second one), but I don't think this is terrible. I think you could go even further and say (combination of #1 and #3): 4. dump the stats object if any of the involved tables are dumped, *or* the schema containing the stats object is dumped. with the rationale that if you explicitly dump a schema in which you've specifically put a number of stat objects, then they should appear in the dump, even if the tables are not dumped. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > Tom Lane wrote: >> I propose that we should just treat extended stats objects like other >> generic database objects --- which, in practice, means "dump if the >> containing schema is getting dumped". We don't exclude views or matviews >> on the basis of whether their underlying tables are going to be dumped, >> so why should stats objects operate differently from those? > I claim that views and matviews are indeed separate objects rather than > sub-objects belonging to a table; but a stats object is, in some way, > just part of the table(s) definition(s), so the analogy doesn't hold > necessarily. In particular the decision to dump the stats object should > be connected to the table(s) rather than the stat object as standing > alone. Meh. The thing that is going to make this painful is when we allow extensions to contain stats objects, something which doesn't seem to be allowed today (at least, ALTER EXTENSION ADD STATISTICS doesn't work) but which is surely a rather bad oversight. Consider two cases for an extension E containing a table T: * There is a stats object S on T, and S is also a member of E. * There is a stats object S on T, and S is NOT a member of E (perhaps somebody made it after the fact). I argue that if you try to tie S's dumpability strictly to T's, you are going to break one or the other of these cases, either treating S as if it were a member of E when you shouldn't or vice versa. Getting the binary-upgrade case right will make it even more complicated. (And I haven't even mentioned, say, stats objects on two tables belonging to different extensions.) I think we've made our choice and it's that stats objects are independent objects. Trying to preserve some part of the they're-just-attributes- of-a-table view is going to create more problems than it solves. regards, tom lane
On Mon, Feb 12, 2018 at 10:07 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > The followup question is what to do for multi-table stats objects. I > think the options are that it should be dumped if: > > 1. any of its tables are dumped > 2. *all* its tables are dumped > 3. the schema containing the stats object is dumped > > I think #2 is a loser: the object goes missing if you dump the tables > individually. > > #3 has a similar problem: if you put the stats object in a schema that's > not that of any of the involved tables (admittedly a somewhat odd > choice), then again the stats object goes missing, and there is no > indication that the dump is incomplete except that the application > becomes slower -- a situation difficult to detect (unlike missing > regular objects, because then queries break in obvious ways). > > My preference is for #1. It has the small problem that if you have one > dump for each table, the stats object appears more than once (so > restoring both causes an error during restore of the second one), but I > don't think this is terrible. It's not the worst thing in the world, but it's a long way from being the best thing. My personal experience has been that it sucks when you get errors during a restore -- it's often not clear exactly what object was affected and whether the error is something that's going to break your application. Of course, it's even worse if you are restoring in single-transaction or on-error-stop mode. I believe I previously argued that statistics objects should be considered first-class objects in their own right, not just table properties, and that's still my position. In short, I agree with Tom. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company