Thread: Deduplication and transaction isolation level
Hi all!

I import many, many rows of data into a table, from three or more computers, 4 times per hour. I have a primary key, and the query I use to import the data is supposed to dedup before inserting, but I still get primary key violations.

The import process is:

* Load CSV data into temp table
* INSERT INTO dest SELECT DISTINCT (pkey) FROM temp WHERE NOT EXISTS(temp.pkey = dest.pkey)

I assumed (erroneously) that this would guarantee no duplicate data could make it into the database. The primary key violations are proving me wrong.

My assumption is that two or more computers are inserting the same pkey, and since the transaction isolation level is unchanged from the default (read committed), the insert on machine A does not see that machine B has already inserted the row A is trying to insert. Is that likely? Do I have other solutions? Should I run all files serially through PostgreSQL?

Thanks!
François Beausoleil

Table "public.show_interaction_bindings_2013_09_23"
         Column          |            Type             | Modifiers
-------------------------+-----------------------------+-----------
 interaction_id          | uuid                        | not null
 show_id                 | uuid                        | not null
 service_name            | character varying(8)        | not null
 interaction_service_id  | text                        | not null
 interaction_created_at  | timestamp without time zone | not null
 -- ...
Indexes:
    "show_interaction_bindings_2013_09_23_pkey" PRIMARY KEY, btree (market_id, show_id, interaction_service_id, service_name) CLUSTER
Check constraints:
    "int_created_at_within_partition" CHECK ('2013-09-23 07:00:00'::timestamp without time zone <= interaction_created_at AND interaction_created_at < '2013-09-30 07:00:00'::timestamp without time zone)
    "persona_followers_count_greater_than_zero" CHECK (persona_followers_count > 0)
Inherits: show_interaction_bindings

-- Import script
CREATE TEMPORARY TABLE IF NOT EXISTS show_interaction_bindings_2013_09_23_import(
  LIKE show_interaction_bindings
);

COPY show_interaction_bindings_2013_09_23_import(market_id, show_id, interaction_service_id, service_name, interaction_created_at /* , ... */)
FROM stdin WITH (FORMAT csv, HEADER false, DELIMITER ',', QUOTE '"');
-- data, thousands of rows
\.

ANALYZE show_interaction_bindings_2013_09_23_import;

CREATE TABLE IF NOT EXISTS show_interaction_bindings_2013_09_23(
  CONSTRAINT int_created_at_within_partition
    CHECK(timestamp without time zone '2013-09-23 07:00' <= interaction_created_at
      AND interaction_created_at < timestamp without time zone '2013-09-30 07:00'),
  PRIMARY KEY(market_id, show_id, interaction_service_id, service_name)
) INHERITS ( show_interaction_bindings );

-- The failing query
EXPLAIN ANALYZE
INSERT INTO show_interaction_bindings_2013_09_23(market_id, show_id, interaction_service_id, service_name, ...)
SELECT DISTINCT ON (market_id, show_id, interaction_service_id, service_name)
       market_id, show_id, interaction_service_id, service_name, interaction_created_at -- , ...
FROM show_interaction_bindings_2013_09_23_import
WHERE NOT EXISTS(
  SELECT *
  FROM show_interaction_bindings_2013_09_23 dest
  WHERE dest.show_id = show_interaction_bindings_2013_09_23_import.show_id
    AND dest.market_id = show_interaction_bindings_2013_09_23_import.market_id
    AND dest.service_name = show_interaction_bindings_2013_09_23_import.service_name
    AND dest.interaction_service_id = show_interaction_bindings_2013_09_23_import.interaction_service_id);

-- The failure
psql:/var/tmp/csv-assembly-21534/2013-09-24-12.sql:32391: ERROR:  duplicate key value violates unique constraint "show_interaction_bindings_2013_09_23_pkey"
DETAIL:  Key (market_id, show_id, interaction_service_id, service_name)=(23813bc0-f08d-012f-70c3-4040b2a1b35b, 17104480-d6b6-0130-b1d1-7a163e02a1d3, 382389681297833984, Twitter) already exists.
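A minimal two-session sketch of the suspected race, with the table and key names simplified to dest/temp/pkey for illustration; under read committed, neither session's NOT EXISTS can see the other's uncommitted row:

-- Session A (machine A):
BEGIN;
INSERT INTO dest
SELECT DISTINCT ON (pkey) * FROM temp_a t
WHERE NOT EXISTS (SELECT 1 FROM dest d WHERE d.pkey = t.pkey);
-- row with pkey X inserted, not yet committed

-- Session B (machine B), concurrently:
BEGIN;
INSERT INTO dest
SELECT DISTINCT ON (pkey) * FROM temp_b t
WHERE NOT EXISTS (SELECT 1 FROM dest d WHERE d.pkey = t.pkey);
-- B's check cannot see A's uncommitted row, so B also tries to insert pkey X;
-- B blocks on the unique index and gets the duplicate key error once A commits

-- Session A:
COMMIT;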
On Tue, Sep 24, 2013 at 10:19 PM, François Beausoleil <francois@teksol.info> wrote:
> Hi all!
>
> I import many, many rows of data into a table, from three or more computers, 4 times per hour. I have a primary key, and the query I use to import the data is supposed to dedup before inserting, but I still get primary key violations.
>
> The import process is:
>
> * Load CSV data into temp table
> * INSERT INTO dest SELECT DISTINCT (pkey) FROM temp WHERE NOT EXISTS(temp.pkey = dest.pkey)
>
> I assumed (erroneously) that this would guarantee no duplicate data could make it into the database. The primary key violations are proving me wrong.

Right. Transactions A and B are interleaved: they both run the same check against the same id at the same time. Both checks pass because neither transaction is committed. This problem is not solvable by adjusting the isolation level.

Typical solutions might be to:
A. Lock the table while inserting
B. Retry the transaction following an error.
C. Import the records to a staging table, then do the deduplication check when moving from the staging table

merlin
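For option A, a sketch against the partition from the original post (column list abbreviated to the key plus the timestamp); an EXCLUSIVE lock still allows reads but makes the concurrent imports queue up behind one another, so each NOT EXISTS check runs against the latest committed data:

BEGIN;
LOCK TABLE show_interaction_bindings_2013_09_23 IN EXCLUSIVE MODE;
INSERT INTO show_interaction_bindings_2013_09_23
  (market_id, show_id, interaction_service_id, service_name, interaction_created_at)
SELECT DISTINCT ON (market_id, show_id, interaction_service_id, service_name)
       market_id, show_id, interaction_service_id, service_name, interaction_created_at
FROM show_interaction_bindings_2013_09_23_import i
WHERE NOT EXISTS (
  SELECT 1 FROM show_interaction_bindings_2013_09_23 d
  WHERE d.market_id = i.market_id
    AND d.show_id = i.show_id
    AND d.interaction_service_id = i.interaction_service_id
    AND d.service_name = i.service_name);
COMMIT;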
On 2013-09-25, at 09:04, Merlin Moncure wrote:
> On Tue, Sep 24, 2013 at 10:19 PM, François Beausoleil
> <francois@teksol.info> wrote:
>> Hi all!
>>
>> I import many, many rows of data into a table, from three or more computers, 4 times per hour. I have a primary key, and the query I use to import the data is supposed to dedup before inserting, but I still get primary key violations.
>>
>> The import process is:
>>
>> * Load CSV data into temp table
>> * INSERT INTO dest SELECT DISTINCT (pkey) FROM temp WHERE NOT EXISTS(temp.pkey = dest.pkey)
>>
>> I assumed (erroneously) that this would guarantee no duplicate data could make it into the database. The primary key violations are proving me wrong.
>
> Right. Transactions A and B are interleaved: they both run the same
> check against the same id at the same time. Both checks pass because
> neither transaction is committed. This problem is not solvable by
> adjusting the isolation level.
>
> Typical solutions might be to:
> A. Lock the table while inserting
> B. Retry the transaction following an error.
> C. Import the records to a staging table, then do the
> deduplication check when moving from the staging table

You mean:

COPY temp FROM stdin;
INSERT INTO staging SELECT DISTINCT FROM temp;
INSERT INTO production SELECT DISTINCT FROM staging;
DELETE FROM staging WHERE inserted into production;

Right? I assume I would not have the primary key constraint on the staging table. And steps 3 and 4 can be done in a single statement using an UPDATE/DELETE CTE.

Thanks!
François
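A sketch of steps 3 and 4 as one statement with a data-modifying CTE (available since 9.1), using illustrative staging/production/pkey names rather than the original schema; note that only one process should perform this move at a time, otherwise the same race simply shifts to the staging-to-production step:

WITH moved AS (
  INSERT INTO production
  SELECT DISTINCT ON (pkey) *
  FROM staging s
  WHERE NOT EXISTS (SELECT 1 FROM production p WHERE p.pkey = s.pkey)
  RETURNING pkey
)
DELETE FROM staging
USING moved
WHERE staging.pkey = moved.pkey;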
On Sep 25, 2013, at 6:04 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Sep 24, 2013 at 10:19 PM, François Beausoleil
> <francois@teksol.info> wrote:
>> Hi all!
>>
>> I import many, many rows of data into a table, from three or more computers, 4 times per hour. I have a primary key, and the query I use to import the data is supposed to dedup before inserting, but I still get primary key violations.
>>
>> The import process is:
>>
>> * Load CSV data into temp table
>> * INSERT INTO dest SELECT DISTINCT (pkey) FROM temp WHERE NOT EXISTS(temp.pkey = dest.pkey)
>>
>> I assumed (erroneously) that this would guarantee no duplicate data could make it into the database. The primary key violations are proving me wrong.
>
> Right. Transactions A and B are interleaved: they both run the same
> check against the same id at the same time. Both checks pass because
> neither transaction is committed. This problem is not solvable by
> adjusting the isolation level.

Are you sure that this is the case? It is my understanding that since 9.1 with SSI (https://wiki.postgresql.org/wiki/SSI), if you set the transaction isolation level to SERIALIZABLE, this problem is solved, as the insert will take a "predicate lock" and the other insert cannot succeed.

We use this to detect / resolve concurrent inserts that violate primary keys and it works great.

However in this case it probably doesn't help the OP because the cost of restarting the entire import is likely too high.

>
> Typical solutions might be to:
> A. Lock the table while inserting
> B. Retry the transaction following an error.
> C. Import the records to a staging table, then do the
> deduplication check when moving from the staging table
>
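A sketch of that approach, assuming the import script is prepared to catch a serialization failure (SQLSTATE 40001) and re-run the whole transaction; the dest/temp/pkey names are again simplified for illustration:

BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT INTO dest
SELECT DISTINCT ON (pkey) * FROM temp t
WHERE NOT EXISTS (SELECT 1 FROM dest d WHERE d.pkey = t.pkey);
COMMIT;
-- on "could not serialize access" (SQLSTATE 40001), roll back and retry the transaction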
On Wed, Sep 25, 2013 at 12:50 PM, Steven Schlansker <steven@likeness.com> wrote:
>
> On Sep 25, 2013, at 6:04 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> On Tue, Sep 24, 2013 at 10:19 PM, François Beausoleil
>> <francois@teksol.info> wrote:
>>> Hi all!
>>>
>>> I import many, many rows of data into a table, from three or more computers, 4 times per hour. I have a primary key, and the query I use to import the data is supposed to dedup before inserting, but I still get primary key violations.
>>>
>>> The import process is:
>>>
>>> * Load CSV data into temp table
>>> * INSERT INTO dest SELECT DISTINCT (pkey) FROM temp WHERE NOT EXISTS(temp.pkey = dest.pkey)
>>>
>>> I assumed (erroneously) that this would guarantee no duplicate data could make it into the database. The primary key violations are proving me wrong.
>>
>> Right. Transactions A and B are interleaved: they both run the same
>> check against the same id at the same time. Both checks pass because
>> neither transaction is committed. This problem is not solvable by
>> adjusting the isolation level.
>
> Are you sure that this is the case? It is my understanding that since 9.1 with SSI (https://wiki.postgresql.org/wiki/SSI), if you set the transaction isolation level to SERIALIZABLE, this problem is solved, as the insert will take a "predicate lock" and the other insert cannot succeed.
>
> We use this to detect / resolve concurrent inserts that violate primary keys and it works great.
>
> However in this case it probably doesn't help the OP because the cost of restarting the entire import is likely too high.

ah, you're right!

merlin
Does your table have only the pkey field, and is the SQL you posted the exact same SQL you run?

If you have more than one field to insert, DISTINCT ensures the whole row of values is distinct, not just the key field's value. That may be a reason for your situation.
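For what it's worth, the failing query in the original post already uses DISTINCT ON rather than plain DISTINCT, but the difference is easy to see side by side (column names taken from the original post):

-- Plain DISTINCT: two rows that share the key but differ in interaction_created_at both survive
SELECT DISTINCT market_id, show_id, interaction_service_id, service_name, interaction_created_at
FROM show_interaction_bindings_2013_09_23_import;

-- DISTINCT ON: one row per key; the ORDER BY makes the surviving row deterministic
SELECT DISTINCT ON (market_id, show_id, interaction_service_id, service_name)
       market_id, show_id, interaction_service_id, service_name, interaction_created_at
FROM show_interaction_bindings_2013_09_23_import
ORDER BY market_id, show_id, interaction_service_id, service_name, interaction_created_at;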
On Sep 25, 2013, at 6:04 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Sep 24, 2013 at 10:19 PM, François Beausoleil
> <francois@teksol.info> wrote:
>> Hi all!
>>
>> I import many, many rows of data into a table, from three or more computers, 4 times per hour. I have a primary key, and the query I use to import the data is supposed to dedup before inserting, but I still get primary key violations.
>>
>> The import process is:
>>
>> * Load CSV data into temp table
>> * INSERT INTO dest SELECT DISTINCT (pkey) FROM temp WHERE NOT EXISTS(temp.pkey = dest.pkey)
>>
>> I assumed (erroneously) that this would guarantee no duplicate data could make it into the database. The primary key violations are proving me wrong.
>
> Right. Transaction A and B are interleaved: they both run the same
> check against the same id at the same time. Both checks pass because
> neither transaction is committed. This problem is not solvable by
> adjusting the isolation level.
Are you sure that this is the case? It is my understanding that since 9.1 with SSI (https://wiki.postgresql.org/wiki/SSI) if you set the transaction isolation level to SERIALIZABLE, this problem is solved, as the insert will take a "predicate lock" and the other insert cannot succeed.
We use this to detect / resolve concurrent inserts that violate primary keys and it works great.
However in this case it probably doesn't help the OP because the cost of restarting the entire import is likely too high.
>
> Typical solutions might be to:
> A. Lock the table while inserting
> B. Retry the transaction following an error.
> C. Import the records to a staging table, then copy the do the
> deduplication check when moving from the staging table
>