Thread: BUG #3110: Online Backup introduces Duplicate OIDs

BUG #3110: Online Backup introduces Duplicate OIDs

From

"Randy Isbell"

Date:

06 March 2007, 12:15:20

The following bug has been logged online:

Bug reference:      3110
Logged by:          Randy Isbell
Email address:      jisbell@cisco.com
PostgreSQL version: 8.2.3
Operating system:   FreeBSD 6.1 i386
Description:        Online Backup introduces Duplicate OIDs
Details:

This issue is observed in version 8.0.7 on FreeBSD 6.0 and 8.0.7 on FreeBSD
6.1 as well.

When restoring the output of an online backup, many tables now have
duplicate OID values / primary keys, viz:

--- (cut here) ---
# psql -Upostgres sn
Welcome to psql 8.2.3, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

sn=# \d at_dns
                   Table "public.at_dns"
       Column        |           Type           | Modifiers
---------------------+--------------------------+-----------
 ac_b                | text                     | not null
 ac_zone             | text                     | not null
 ac_host             | text                     | not null
 ac_ttl              | bigint                   | not null
 ac_type             | text                     | not null
 ac_data             | text                     | not null
 ac_password         | text                     |
 ac_desc             | text                     |
 ac_lastupdate       | timestamp with time zone |
 ac_lastddns         | timestamp with time zone |
 ac_mx_priority      | bigint                   |
 ac_soa_contact      | text                     |
 ac_soa_serial       | bigint                   |
 ac_soa_refresh      | bigint                   |
 ac_soa_retry        | bigint                   |
 ac_soa_expire       | bigint                   |
 ac_soa_minimum      | bigint                   |
 ac_srv_priority     | integer                  |
 ac_srv_weight       | integer                  |
 ac_srv_port         | integer                  |
 ac_naptr_flags      | text                     |
 ac_naptr_order      | integer                  |
 ac_naptr_preference | integer                  |
 ac_naptr_service    | text                     |
 ac_naptr_regexp     | text                     |
 ac_naptr_ac_n       | text                     |
Indexes:
    "ai_dns_host" PRIMARY KEY, btree (ac_zone, ac_host, ac_type, ac_data)
    "ai_dns_b_host" UNIQUE, btree (ac_b, ac_zone, ac_host, ac_type,
ac_data)
    "ai_oid_dns" UNIQUE, btree (oid)
Triggers:
    at_dns_delete_serial AFTER DELETE ON at_dns FOR EACH ROW EXECUTE
PROCEDURE at_delete_serial()
    at_dns_touch AFTER INSERT OR DELETE OR UPDATE ON at_dns FOR EACH
STATEMENT EXECUTE PROCEDURE af_file_touch()
    at_dns_update_serial AFTER INSERT OR UPDATE ON at_dns FOR EACH ROW
EXECUTE PROCEDURE at_update_serial()


sn=# select ac_zone,ac_host,ac_type,ac_data, count(oid) from at_dns group by
ac_zone,ac_host,ac_type,ac_data having count(oid) > 1;
      ac_zone      | ac_host | ac_type |        ac_data         | count
-------------------+---------+---------+------------------------+-------
 nqa5.l1.cisco.com | @       | soa     | ns1.nqa5.l1.cisco.com. |     2
(1 row)

sn=# reindex table at_dns;
ERROR:  could not create unique index
DETAIL:  Table contains duplicated values.

--- (cut here) ---

I verified the source database does NOT have duplicate values prior to the
online dump.

Note that this problem seems to occur when the database is under load.  The
above error results when approximately 30 transactions per second are active
during the online backup.  A quiescent database does NOT exhibit this
problem.

Thoughts?

- r.

Re: BUG #3110: Online Backup introduces Duplicate OIDs

From

Tom Lane

Date:

06 March 2007, 13:44:06

"Randy Isbell" <jisbell@cisco.com> writes:
> When restoring the output of an online backup, many tables now have
> duplicate OID values / primary keys, viz:
> ...
> sn=# reindex table at_dns;
> ERROR:  could not create unique index

I'm confused.  You're claiming that the reload succeeds, but after that
a reindex fails?  Are there actually duplicate OIDs in the dump file,
as evidenced by looking at it?  What pg_dump options are you using
(I suppose -o at least)?  If using pg_restore, what options there?

            regards, tom lane

Re: BUG #3110: Online Backup introduces Duplicate OIDs

From

Tom Lane

Date:

07 March 2007, 12:34:22

Randy Isbell <jisbell@cisco.com> writes:
> This is not a pg_dump.  I am performing an online backup ala the
> "pg_start_backup()" and "pg_stop_backup()" mechanism.

Oh, OK.  Could we see the system columns (ctid,xmin,xmax,cmin,cmax) for
the rows with duplicated OIDs?  Also, would you look at the user data in
these rows and see if it's the same in each pair?  I'm wondering if the
pairs might be both old and new states of an UPDATE.

            regards, tom lane

Re: BUG #3110: Online Backup introduces Duplicate OIDs

From

Tom Lane

Date:

07 March 2007, 17:54:51

Randy Isbell <jisbell@cisco.com> writes:
> Here you go:
> SELECT
>          ctid,xmin,xmax,cmin,cmax,oid,*

Thanks.  This is real interesting, because none of the rows have
xmax/cmax set, so it doesn't appear that they were meant to have been
updated out of existence.

> For the at_dns table, it appears one column (ac_soa_serial) changes.

Does that correspond to something your application does, ie UPDATE
ac_soa_serial to a new value without changing anything else?  I'm trying
to guess if the duplicates arose by means of a misfiring UPDATE, or if
they were independent insertions.  Is it plausible that two rows that
are the same except for ac_soa_serial would be inserted by your app?
If the latter, a possible theory is that the OID counter is somehow
being rolled back by the dump/reload process.

            regards, tom lane

Re: BUG #3110: Online Backup introduces Duplicate OIDs

From

Alvaro Herrera

Date:

07 March 2007, 18:12:00

Tom Lane wrote:
> Randy Isbell <jisbell@cisco.com> writes:
> > Here you go:
> > SELECT
> >          ctid,xmin,xmax,cmin,cmax,oid,*
>
> Thanks.  This is real interesting, because none of the rows have
> xmax/cmax set, so it doesn't appear that they were meant to have been
> updated out of existence.

Question for Randy -- is this plain PostgreSQL, or are you using some,
hmmm, proprietary derivate of it?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: BUG #3110: Online Backup introduces Duplicate OIDs

From

Randy Isbell

Date:

08 March 2007, 11:44:33

Thanks for the reply.

This is not a pg_dump.  I am performing an online backup ala the
"pg_start_backup()" and "pg_stop_backup()" mechanism.

Here is another look at the same table but with the OID values
included (this is from a different dump, not the initial bug report):

sn=# select oid, ac_zone,ac_host,ac_type,ac_data, count(oid) from
at_dns group by
oid,ac_zone,ac_host,ac_type,ac_data having count(oid) > 1;
    oid   |            ac_zone            | ac_host | ac_type
|        ac_data         | count
---------+-------------------------------+---------+---------
+------------------------+-------
7049453 | e164.lr0007.nqa5.l1.cisco.com | @       | soa     |
ns1.nqa5.l1.cisco.com. |     2
7049454 | e164.nqa5.l1.cisco.com        | @       | soa     |
ns1.nqa5.l1.cisco.com. |     2
7049503 | e164.lr0008.nqa5.l1.cisco.com | @       | soa     |
ns1.nqa5.l1.cisco.com. |     2
7049512 | e164.lr0006.nqa5.l1.cisco.com | @       | soa     |
ns1.nqa5.l1.cisco.com. |     2
7049515 | e164.lr0005.nqa5.l1.cisco.com | @       | soa     |
ns1.nqa5.l1.cisco.com. |     2
7049531 | e164.lr0009.nqa5.l1.cisco.com | @       | soa     |
ns1.nqa5.l1.cisco.com. |     2
(6 rows)

So the scenerio is:

1. Verify that the original database is void of duplicate primary key
values.  This is done be running queries similar to the above for
each primary key on all user tables.
2. Create a load on the database, about 30 transactions / sec.
3. Issue pg_start_backup()
4. Save off the data cluster
5. Issue pg_stop_backup()
6. Collect the WAL files
7. Create a big hairy tar file with the stuff from items 4 and 6.
8. Take the big hairy tar file to another server running the same pg
8.2.3, untar and start postgres
9. On the server from item 8, run the same queries from item 1.
Viola, behold the duplicates.

I've run a number of these scenerios and I've found some common
features which may be clues:

a. The duplicate records result from a DELETE or UPDATE query
b. They occur near the time of the pg_start_backup() or pg_stop_backup
().  That is, the DELETE or UPDATE query is issued within 90 to 110
seconds of either the start or stop, and this backup takes 20 min to
create.

Is there a transaction log file dump utility?  It would be nice to
see if the corruption is actually in the WAL files.

Regards,
- r.

On Mar 6, 2007, at 11:43 AM, Tom Lane wrote:

> "Randy Isbell" <jisbell@cisco.com> writes:
>> When restoring the output of an online backup, many tables now have
>> duplicate OID values / primary keys, viz:
>> ...
>> sn=# reindex table at_dns;
>> ERROR:  could not create unique index
>
> I'm confused.  You're claiming that the reload succeeds, but after
> that
> a reindex fails?  Are there actually duplicate OIDs in the dump file,
> as evidenced by looking at it?  What pg_dump options are you using
> (I suppose -o at least)?  If using pg_restore, what options there?
>
>             regards, tom lane

Re: BUG #3110: Online Backup introduces Duplicate OIDs

From

Randy Isbell

Date:

08 March 2007, 16:55:11

Answers inline.

On Mar 7, 2007, at 3:54 PM, Tom Lane wrote:

> Randy Isbell <jisbell@cisco.com> writes:
>> Here you go:
>> SELECT
>>          ctid,xmin,xmax,cmin,cmax,oid,*
>
> Thanks.  This is real interesting, because none of the rows have
> xmax/cmax set, so it doesn't appear that they were meant to have been
> updated out of existence.
>
>> For the at_dns table, it appears one column (ac_soa_serial) changes.
>
> Does that correspond to something your application does, ie UPDATE
> ac_soa_serial to a new value without changing anything else?

Yes.  In fact, the app may perform an update even when no columns
change.

> I'm trying
> to guess if the duplicates arose by means of a misfiring UPDATE, or if
> they were independent insertions.  Is it plausible that two rows that
> are the same except for ac_soa_serial would be inserted by your app?

Very unlikely.   I have the requirement to keep an audit trail of all
changes made to the database.  This is done by triggers on each table
which update a corresponding mirror table in an audit schema.  There
is a trigger for each insert, update, and delete.  I'm using this
audit information to try and isolate the problem.  What I found is
that no duplication ever exists for INSERTs, only UPDATEs and DELETEs.

Also, recall my previous note that the duplication happens near the
time of the pg_start_backup() and pg_stop_backup().  Based on the
audit schema information I collected, I see numerous updates and
deletes happen with no duplication problems when they are in the
middle of the backup time.  In my environment the duplication happens
within 2 minutes of the start or stop.  This may be incidental, but
I've seen it on 8 of 10 backup/restore runs.

> If the latter, a possible theory is that the OID counter is somehow
> being rolled back by the dump/reload process.

The reload process is simply: start postgres and let it replay the
necessary WAL files.

  Is there a way to determine if the WAL file data is bad?  It would
be helpful to know if the problem is caused by the backup, or if
something is wrong in the replay of the WAL files.

>
>             regards, tom lane

- r.