Home > mailing lists

Re: How to insert a bulk of data with unique-violations very fast - Mailing list pgsql-performance

From	Torsten Zühlsdorff
Subject	Re: How to insert a bulk of data with unique-violations very fast
Date	June 6, 2010 10:15:31
Msg-id	hug2od$nu4$1@news.eternal-september.org Whole thread Raw
In response to	Re: How to insert a bulk of data with unique-violations very fast (Scott Marlowe <scott.marlowe@gmail.com>)
Responses	Re: How to insert a bulk of data with unique-violations very fast
List	pgsql-performance

Tree view

Scott Marlowe schrieb:

>>>> i have a set of unique data which about 150.000.000 rows. Regullary i get
>>>> a
>>>> list of data, which contains multiple times of rows than the already
>>>> stored
>>>> one. Often around 2.000.000.000 rows. Within this rows are many
>>>> duplicates
>>>> and often the set of already stored data.
>>>> I want to store just every entry, which is not within the already stored
>>>> one. Also i do not want to store duplicates. Example:
>>> The standard method in pgsql is to load the data into a temp table
>>> then insert where not exists in old table.
>> Sorry, i didn't get it. I've googled some examples, but no one match at my
>> case. Every example i found was a single insert which should be done or
>> ignored, if the row is already stored.
>>
>> But in my case i have a bulk of rows with duplicates. Either your tipp
>> doesn't match my case or i didn't unterstand it correctly. Can you provide a
>> simple example?
>
> create table main (id int primary key, info text);
> create table loader (id int, info text);
> insert into main values (1,'abc'),(2,'def'),(3,'ghi');
> insert into loader values (1,'abc'),(4,'xyz');
> select * from main;
>  id | info
> ----+------
>   1 | abc
>   2 | def
>   3 | ghi
> (3 rows)
>
> select * from loader;
>  id | info
> ----+------
>   1 | abc
>   4 | xyz
> (2 rows)
>
> insert into main select * from loader except select * from main;
> select * from main;
>  id | info
> ----+------
>   1 | abc
>   2 | def
>   3 | ghi
>   4 | xyz
> (4 rows)
>
> Note that for the where not exists to work the fields would need to be
> all the same, or you'd need a more complex query.  If the info field
> here was different you'd get an error an no insert / update.  For that
> case you might want to use "where not in":
>
> insert into main select * from loader where id not in (select id from main);

Thank you very much for your example. Now i've got it :)

I've test your example on a small set of my rows. While testing i've
stumpled over a difference in sql-formulation. Using except seems to be
a little slower than the more complex where not in (subquery) group by.
Here is my example:

CREATE TABLE tseq (value text);
INSERT INTO tseq VALUES ('a') , ('b'), ('c');
CREATE UNIQUE INDEX tseq_unique on tseq (value);
CREATE TEMP TABLE tmpseq(value text);
INSERT INTO tmpseq VALUES ('a') , ('b'), ('c');
INSERT INTO tmpseq VALUES ('a') , ('b'), ('c');
INSERT INTO tmpseq VALUES ('a') , ('b'), ('d');
INSERT INTO tmpseq VALUES ('d') , ('b'), ('d');
SELECT* from tseq;
  value
-------
  a
  b
  c
(3 rows)

SELECT* from tmpseq;
  value
-------
  a
  b
  c
  a
  b
  c
  a
  b
  d
  d
  b
  d
(12 rows)

VACUUM VERBOSE ANALYSE;

explain analyze SELECT value FROM tmpseq except SELECT value FROM tseq;
                                                       QUERY PLAN

----------------------------------------------------------------------------------------------------------------------
  HashSetOp Except  (cost=0.00..2.34 rows=4 width=2) (actual
time=0.157..0.158 rows=1 loops=1)
    ->  Append  (cost=0.00..2.30 rows=15 width=2) (actual
time=0.012..0.126 rows=15 loops=1)
          ->  Subquery Scan "*SELECT* 1"  (cost=0.00..1.24 rows=12
width=2) (actual time=0.009..0.060 rows=12 loops=1)
                ->  Seq Scan on tmpseq  (cost=0.00..1.12 rows=12
width=2) (actual time=0.004..0.022 rows=12 loops=1)
          ->  Subquery Scan "*SELECT* 2"  (cost=0.00..1.06 rows=3
width=2) (actual time=0.006..0.018 rows=3 loops=1)
                ->  Seq Scan on tseq  (cost=0.00..1.03 rows=3 width=2)
(actual time=0.003..0.009 rows=3 loops=1)
  Total runtime: 0.216 ms
(7 rows)

explain analyze SELECT value FROM tmpseq WHERE value NOT IN (SELECT
value FROM tseq) GROUP BY value;
                                                  QUERY PLAN

------------------------------------------------------------------------------------------------------------
  HashAggregate  (cost=2.20..2.22 rows=2 width=2) (actual
time=0.053..0.055 rows=1 loops=1)
    ->  Seq Scan on tmpseq  (cost=1.04..2.19 rows=6 width=2) (actual
time=0.038..0.043 rows=3 loops=1)
          Filter: (NOT (hashed SubPlan 1))
          SubPlan 1
            ->  Seq Scan on tseq  (cost=0.00..1.03 rows=3 width=2)
(actual time=0.004..0.009 rows=3 loops=1)
  Total runtime: 0.105 ms
(6 rows)

My question: is this an generall behavior or just an effect of the small
case?

Greetings form Germany,
Torsten

pgsql-performance by date:

From: Torsten Zühlsdorff
Date: 06 June 2010, 10:15:30
Subject: Re: How to insert a bulk of data with unique-violations very fast

From: Scott Marlowe
Date: 06 June 2010, 14:47:01
Subject: Re: How to insert a bulk of data with unique-violations very fast

Re: How to insert a bulk of data with unique-violations very fast - Mailing list pgsql-performance

Previous

Next