Thread: How to idenity duplicate rows

How to idenity duplicate rows

From

"David Inglis"

Date:

19 March 2006, 17:20:33

Can anybody assist with this problem I have a table that has some duplicated rows of data, I want to place a unique constraint on the columns userid and procno to eliminate this problem in the future but how do I identify and get rid of the existing duplication.

Thanks in advance

Regards

David Inglis

Re: How to idenity duplicate rows

From

Peter Eisentraut

Date:

19 March 2006, 18:52:07

David Inglis wrote:
> Can anybody assist with this problem I have a table that has some
> duplicated rows of data,  I want to place a unique constraint on  the
> columns userid and procno to eliminate this problem in the future but
> how do I identify and get rid of the existing duplication.

To find them, something like

SELECT a, b, c FROM table GROUP BY a, b, c HAVING count(*) > 1;

comes to mind, where you have to list all columns of the table in place
of a, b, c.

As for deleting all but one row in a duplicated group, you're going to
have to get at them by the oid or ctid columns perhaps.

The other idea is to run CREATE TABLE newtable AS SELECT DISTINCT * FROM
oldtable;.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: How to idenity duplicate rows

From

"Jan Cruz"

Date:

19 March 2006, 22:08:02

As for deleting all but one row in a duplicated group, you're going to
have to get at them by the oid or ctid columns perhaps.

The other idea is to run CREATE TABLE newtable AS SELECT DISTINCT * FROM
oldtable;.

I believe getting oid and/or ctid is not possible since it would not display/get duplicate records
in a "HAVING CLAUSE" since their oid/ctid wouldn't be the same.

And creating a newtable and use SELECT DISTINCT isn't an acceptable idea to the audit team when you have millions of records in a table in production server when you only need to remove let say 70 records from that table.

I tried another approach where I queried and insert the duplicate record (35 records) into
a new/temporary table. Then I created a stored procedure something like this:
--START

DECLARE
foo table;

BEGIN
       FOR foo IN
         SELECT * FROM newtable
       LOOP
          DELETE FROM oldtable
          where oldtable.field1 = foo.field1
           and   oldtable.field2 = foo.field2
           and   oldtable.field3 = foo.field3
           and   oldtable.field4 = foo.field4
           ...';
       END LOOP;
END;

-- END

Problem with this approach I got 35 duplicate records (count = 2)
from the new table and delete only 20 records from the oldtable.
If I am not mistaken it should delete 70 records.

I wonder

Re: How to idenity duplicate rows

From

Berend Tober

Date:

19 March 2006, 22:16:45

Peter Eisentraut wrote:

>David Inglis wrote:
>
>
>>Can anybody assist with this problem I have a table that has some
>>duplicated rows of data,  I want to place a unique constraint on  the
>>columns userid and procno to eliminate this problem in the future but
>>how do I identify and get rid of the existing duplication.
>>
>>
>
>To find them, something like
>
>SELECT a, b, c FROM table GROUP BY a, b, c HAVING count(*) > 1;
>
>comes to mind, where you have to list all columns of the table in place
>of a, b, c.
>
>As for deleting all but one row in a duplicated group, you're going to
>have to get at them by the oid or ctid columns perhaps.
>
>The other idea is to run CREATE TABLE newtable AS SELECT DISTINCT * FROM
>oldtable;.
>
>
This doesn't bring over to the new table any foreign key relationships
or triggers.

Another approach (if you don't have OID's) is to create uniqueness by
appending a column to the table, populating it with sequential integers.
Then you proceed as otherwise suggested above by using aggregation to
identify the duplicated rows.