Re: Suggestion for --truncate-tables to pg_restore - Mailing list pgsql-hackers
From | Karl O. Pinc |
---|---|
Subject | Re: Suggestion for --truncate-tables to pg_restore |
Date | |
Msg-id | 1348242845.21480.0@mofo Whole thread Raw |
In response to | Suggestion for --truncate-tables to pg_restore ("Karl O. Pinc" <kop@meme.com>) |
Responses |
Re: Suggestion for --truncate-tables to pg_restore
Re: Suggestion for --truncate-tables to pg_restore |
List | pgsql-hackers |
On 09/20/2012 12:24:49 PM, Karl O. Pinc wrote: > I've had problems using pg_restore --data-only when > restoring individual schemas (which contain data which > has had bad things done to it). --clean does not work > well because of dependent objects in other schemas. Before doing any more work I want to report on the discussions that took place at the code sprint at Postgres Open in Chicago. Because I'm going to add in additional thoughts I've had and to avoid mis-representing anybody's opinion I'll not mention who said what. Feel free to step forward and claim Ingenious Ideas as your own. Likewise I apologize if lack of attribution makes it more difficult to discern (my) uninformed drivel from intelligent insight. ---- First, the problem: Begin with the following structure: CREATE TABLE schemaA.foo (id PRIMARY KEY, data INT); CREATE VIEW schemaB.bar AS SELECT * FROM schemaA.foo; Then, by accident, somebody does: UPDATE schemaA.foo SET data = data + (RANDOM() * 1000)::INT; So, you want to restore the data into schemaA.foo. But schemaA.foo has (bad) data in it that must first be removed. It would seem that using pg_restore --clean -n schemaA -t foo my_pg_dump_backup would solve the problem, it would drop schemaA.foo, recreate it, and then restore the data. But this does not work. schemaA.foo does not drop because it's got a dependent database object, schemaB.bar. Of course there are manual work-arounds. One of these is truncating schemaA.foo and then doing a pg_restore with --data-only. The manual work-arounds become increasingly burdensome as you need to restore more tables. The case that motivated me was an attempt to restore the data in an entire schema, one which contained a significant number of tables. So, the idea here is to be able to do a data-only restore, first truncating the data in the tables being restored to remove the existing corrupted data. The proposal is to add a --truncate-tables option to pg_restore. ---- There were some comments on syntax. I proposed to use -u as a short option. This was thought confusing, given it's use in other Unix command line programs (mysql). Since there's no obvious short option, forget it. Just have a long option. Another choice is to avoid introducing yet another option and instead overload --clean so that when doing a --data-only restore --clean truncates tables and otherwise --clean retains the existing behavior of dropping and re-creating the restored objects. (I tested pg_restore with 9.1 and when --data-only is used --clean is ignored, it does not even produce a warning. This is arguably a bug.) ---- More serious objections were raised regarding semantics. What if, instead, the initial structure looked like: CREATE TABLE schemaA.foo (id PRIMARY KEY, data INT); CREATE TABLE schemaB.bar (id INT CONSTRAINT "bar_on_foo" REFERENCES foo, moredata INT); With a case like this, in most real-world situations, you'd have to use pg_restore with --disable-triggers if you wanted to use --data-only and --truncate-tables. The possibility of foreign key referential integrity corruption is obvious. Aside: Unless you're restoring databases in their entirety the pg_restore --disable-triggers option makes it easy to introduce foreign key referential integrity corruption. In fact, since pg_restore does not wrap it's operations in one big transaction, it's easy to attempt restoration of a portion of a database, have part of the process succeed and part of it fail (due to either schema or data dependencies), and be left off worse than before you started. The pg_restore docs might benefit from a big fat warning regarding attempts to restore less than an entire database. So, the discussion went, pg_restore is just another application and introducing more options which could lead to corruption of referential integrity is a bad idea. But pg_restore should not be thought of as just another front-end. It should be thought of as a data recovery tool. Recovering some data and being left with referential integrity problems is better than having no data. This is true even if, due to different users owning different schemas and so forth, nobody knows exactly what might be broken. Yes, but we can do better. (The unstated sub-text being that we don't want to introduce an inferior feature which will then need to be supported forever.) How could we do better: Here I will record only the ideas related to restore, although there was some mention of dump as well. There has apparently been some discussion of writing a foreign data wrapper which would operate on a database dump. This might (in ways that are not immediately obvious to me) address this issue. The restore process could, based on what table data needs restoration, look at foreign key dependencies and produce a list of the tables which all must be restored into order to ensure foreign key referential integrity. In the case of restoration into a empty database the foreign key dependences must be calculated from the dump. (An "easy" way to do this would be to create all the database objects in some temporary place and query the system catalogs to produce the dependency graph.) In the case of restoration into an existing database the foreign key dependences should come from the database into which the data is to be restored. (This is necessary to capture dependences which may have been introduced after the dump was taken.) The above applies to data-only restoration. When restoring the database schema meta-information (object definition) a similar graph of database object dependences must be produced and used to determine what needs to be restored. But when doing a partial data-only restore there is more to data integrity than just foreign key referential integrity. Other constraints and triggers ensure other sorts of data integrity rules. It is not enough to leave triggers turned on when restoring data. Data not restored may validate against restored data in triggers fired only on manipulation of the un-restored table content. The only solution I can see is to also include in the computed set of tables which require restoration those tables having triggers which reference any of the restored data. Just how far should pg_restore go in attempting to preserve data integrity? ---- Two things are clear: The current table and schema oriented options for backing up and restoring portions of databases are flawed with respectto data integrity. Life is complicated. Where should I go from here? I am not now in a position to pursue anything more complicated than completing the code to add a --truncate-tables option to pg_restore. Should I finish this and send in a patch? Karl <kop@meme.com> Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein
pgsql-hackers by date: