Thread: BUG #3110: Online Backup introduces Duplicate OIDs
The following bug has been logged online: Bug reference: 3110 Logged by: Randy Isbell Email address: jisbell@cisco.com PostgreSQL version: 8.2.3 Operating system: FreeBSD 6.1 i386 Description: Online Backup introduces Duplicate OIDs Details: This issue is observed in version 8.0.7 on FreeBSD 6.0 and 8.0.7 on FreeBSD 6.1 as well. When restoring the output of an online backup, many tables now have duplicate OID values / primary keys, viz: --- (cut here) --- # psql -Upostgres sn Welcome to psql 8.2.3, the PostgreSQL interactive terminal. Type: \copyright for distribution terms \h for help with SQL commands \? for help with psql commands \g or terminate with semicolon to execute query \q to quit sn=# \d at_dns Table "public.at_dns" Column | Type | Modifiers ---------------------+--------------------------+----------- ac_b | text | not null ac_zone | text | not null ac_host | text | not null ac_ttl | bigint | not null ac_type | text | not null ac_data | text | not null ac_password | text | ac_desc | text | ac_lastupdate | timestamp with time zone | ac_lastddns | timestamp with time zone | ac_mx_priority | bigint | ac_soa_contact | text | ac_soa_serial | bigint | ac_soa_refresh | bigint | ac_soa_retry | bigint | ac_soa_expire | bigint | ac_soa_minimum | bigint | ac_srv_priority | integer | ac_srv_weight | integer | ac_srv_port | integer | ac_naptr_flags | text | ac_naptr_order | integer | ac_naptr_preference | integer | ac_naptr_service | text | ac_naptr_regexp | text | ac_naptr_ac_n | text | Indexes: "ai_dns_host" PRIMARY KEY, btree (ac_zone, ac_host, ac_type, ac_data) "ai_dns_b_host" UNIQUE, btree (ac_b, ac_zone, ac_host, ac_type, ac_data) "ai_oid_dns" UNIQUE, btree (oid) Triggers: at_dns_delete_serial AFTER DELETE ON at_dns FOR EACH ROW EXECUTE PROCEDURE at_delete_serial() at_dns_touch AFTER INSERT OR DELETE OR UPDATE ON at_dns FOR EACH STATEMENT EXECUTE PROCEDURE af_file_touch() at_dns_update_serial AFTER INSERT OR UPDATE ON at_dns FOR EACH ROW EXECUTE PROCEDURE at_update_serial() sn=# select ac_zone,ac_host,ac_type,ac_data, count(oid) from at_dns group by ac_zone,ac_host,ac_type,ac_data having count(oid) > 1; ac_zone | ac_host | ac_type | ac_data | count -------------------+---------+---------+------------------------+------- nqa5.l1.cisco.com | @ | soa | ns1.nqa5.l1.cisco.com. | 2 (1 row) sn=# reindex table at_dns; ERROR: could not create unique index DETAIL: Table contains duplicated values. --- (cut here) --- I verified the source database does NOT have duplicate values prior to the online dump. Note that this problem seems to occur when the database is under load. The above error results when approximately 30 transactions per second are active during the online backup. A quiescent database does NOT exhibit this problem. Thoughts? - r.
"Randy Isbell" <jisbell@cisco.com> writes: > When restoring the output of an online backup, many tables now have > duplicate OID values / primary keys, viz: > ... > sn=# reindex table at_dns; > ERROR: could not create unique index I'm confused. You're claiming that the reload succeeds, but after that a reindex fails? Are there actually duplicate OIDs in the dump file, as evidenced by looking at it? What pg_dump options are you using (I suppose -o at least)? If using pg_restore, what options there? regards, tom lane
Randy Isbell <jisbell@cisco.com> writes: > This is not a pg_dump. I am performing an online backup ala the > "pg_start_backup()" and "pg_stop_backup()" mechanism. Oh, OK. Could we see the system columns (ctid,xmin,xmax,cmin,cmax) for the rows with duplicated OIDs? Also, would you look at the user data in these rows and see if it's the same in each pair? I'm wondering if the pairs might be both old and new states of an UPDATE. regards, tom lane
Randy Isbell <jisbell@cisco.com> writes: > Here you go: > SELECT > ctid,xmin,xmax,cmin,cmax,oid,* Thanks. This is real interesting, because none of the rows have xmax/cmax set, so it doesn't appear that they were meant to have been updated out of existence. > For the at_dns table, it appears one column (ac_soa_serial) changes. Does that correspond to something your application does, ie UPDATE ac_soa_serial to a new value without changing anything else? I'm trying to guess if the duplicates arose by means of a misfiring UPDATE, or if they were independent insertions. Is it plausible that two rows that are the same except for ac_soa_serial would be inserted by your app? If the latter, a possible theory is that the OID counter is somehow being rolled back by the dump/reload process. regards, tom lane
Tom Lane wrote: > Randy Isbell <jisbell@cisco.com> writes: > > Here you go: > > SELECT > > ctid,xmin,xmax,cmin,cmax,oid,* > > Thanks. This is real interesting, because none of the rows have > xmax/cmax set, so it doesn't appear that they were meant to have been > updated out of existence. Question for Randy -- is this plain PostgreSQL, or are you using some, hmmm, proprietary derivate of it? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Thanks for the reply. This is not a pg_dump. I am performing an online backup ala the "pg_start_backup()" and "pg_stop_backup()" mechanism. Here is another look at the same table but with the OID values included (this is from a different dump, not the initial bug report): sn=# select oid, ac_zone,ac_host,ac_type,ac_data, count(oid) from at_dns group by oid,ac_zone,ac_host,ac_type,ac_data having count(oid) > 1; oid | ac_zone | ac_host | ac_type | ac_data | count ---------+-------------------------------+---------+--------- +------------------------+------- 7049453 | e164.lr0007.nqa5.l1.cisco.com | @ | soa | ns1.nqa5.l1.cisco.com. | 2 7049454 | e164.nqa5.l1.cisco.com | @ | soa | ns1.nqa5.l1.cisco.com. | 2 7049503 | e164.lr0008.nqa5.l1.cisco.com | @ | soa | ns1.nqa5.l1.cisco.com. | 2 7049512 | e164.lr0006.nqa5.l1.cisco.com | @ | soa | ns1.nqa5.l1.cisco.com. | 2 7049515 | e164.lr0005.nqa5.l1.cisco.com | @ | soa | ns1.nqa5.l1.cisco.com. | 2 7049531 | e164.lr0009.nqa5.l1.cisco.com | @ | soa | ns1.nqa5.l1.cisco.com. | 2 (6 rows) So the scenerio is: 1. Verify that the original database is void of duplicate primary key values. This is done be running queries similar to the above for each primary key on all user tables. 2. Create a load on the database, about 30 transactions / sec. 3. Issue pg_start_backup() 4. Save off the data cluster 5. Issue pg_stop_backup() 6. Collect the WAL files 7. Create a big hairy tar file with the stuff from items 4 and 6. 8. Take the big hairy tar file to another server running the same pg 8.2.3, untar and start postgres 9. On the server from item 8, run the same queries from item 1. Viola, behold the duplicates. I've run a number of these scenerios and I've found some common features which may be clues: a. The duplicate records result from a DELETE or UPDATE query b. They occur near the time of the pg_start_backup() or pg_stop_backup (). That is, the DELETE or UPDATE query is issued within 90 to 110 seconds of either the start or stop, and this backup takes 20 min to create. Is there a transaction log file dump utility? It would be nice to see if the corruption is actually in the WAL files. Regards, - r. On Mar 6, 2007, at 11:43 AM, Tom Lane wrote: > "Randy Isbell" <jisbell@cisco.com> writes: >> When restoring the output of an online backup, many tables now have >> duplicate OID values / primary keys, viz: >> ... >> sn=# reindex table at_dns; >> ERROR: could not create unique index > > I'm confused. You're claiming that the reload succeeds, but after > that > a reindex fails? Are there actually duplicate OIDs in the dump file, > as evidenced by looking at it? What pg_dump options are you using > (I suppose -o at least)? If using pg_restore, what options there? > > regards, tom lane
Answers inline. On Mar 7, 2007, at 3:54 PM, Tom Lane wrote: > Randy Isbell <jisbell@cisco.com> writes: >> Here you go: >> SELECT >> ctid,xmin,xmax,cmin,cmax,oid,* > > Thanks. This is real interesting, because none of the rows have > xmax/cmax set, so it doesn't appear that they were meant to have been > updated out of existence. > >> For the at_dns table, it appears one column (ac_soa_serial) changes. > > Does that correspond to something your application does, ie UPDATE > ac_soa_serial to a new value without changing anything else? Yes. In fact, the app may perform an update even when no columns change. > I'm trying > to guess if the duplicates arose by means of a misfiring UPDATE, or if > they were independent insertions. Is it plausible that two rows that > are the same except for ac_soa_serial would be inserted by your app? Very unlikely. I have the requirement to keep an audit trail of all changes made to the database. This is done by triggers on each table which update a corresponding mirror table in an audit schema. There is a trigger for each insert, update, and delete. I'm using this audit information to try and isolate the problem. What I found is that no duplication ever exists for INSERTs, only UPDATEs and DELETEs. Also, recall my previous note that the duplication happens near the time of the pg_start_backup() and pg_stop_backup(). Based on the audit schema information I collected, I see numerous updates and deletes happen with no duplication problems when they are in the middle of the backup time. In my environment the duplication happens within 2 minutes of the start or stop. This may be incidental, but I've seen it on 8 of 10 backup/restore runs. > If the latter, a possible theory is that the OID counter is somehow > being rolled back by the dump/reload process. The reload process is simply: start postgres and let it replay the necessary WAL files. Is there a way to determine if the WAL file data is bad? It would be helpful to know if the problem is caused by the backup, or if something is wrong in the replay of the WAL files. > > regards, tom lane - r.