Re: How to insert a bulk of data with unique-violations very fast - Mailing list pgsql-performance
From | Torsten Zühlsdorff |
---|---|
Subject | Re: How to insert a bulk of data with unique-violations very fast |
Date | |
Msg-id | hug2od$nu4$1@news.eternal-september.org Whole thread Raw |
In response to | Re: How to insert a bulk of data with unique-violations very fast (Scott Marlowe <scott.marlowe@gmail.com>) |
Responses |
Re: How to insert a bulk of data with unique-violations
very fast
|
List | pgsql-performance |
Scott Marlowe schrieb: >>>> i have a set of unique data which about 150.000.000 rows. Regullary i get >>>> a >>>> list of data, which contains multiple times of rows than the already >>>> stored >>>> one. Often around 2.000.000.000 rows. Within this rows are many >>>> duplicates >>>> and often the set of already stored data. >>>> I want to store just every entry, which is not within the already stored >>>> one. Also i do not want to store duplicates. Example: >>> The standard method in pgsql is to load the data into a temp table >>> then insert where not exists in old table. >> Sorry, i didn't get it. I've googled some examples, but no one match at my >> case. Every example i found was a single insert which should be done or >> ignored, if the row is already stored. >> >> But in my case i have a bulk of rows with duplicates. Either your tipp >> doesn't match my case or i didn't unterstand it correctly. Can you provide a >> simple example? > > create table main (id int primary key, info text); > create table loader (id int, info text); > insert into main values (1,'abc'),(2,'def'),(3,'ghi'); > insert into loader values (1,'abc'),(4,'xyz'); > select * from main; > id | info > ----+------ > 1 | abc > 2 | def > 3 | ghi > (3 rows) > > select * from loader; > id | info > ----+------ > 1 | abc > 4 | xyz > (2 rows) > > insert into main select * from loader except select * from main; > select * from main; > id | info > ----+------ > 1 | abc > 2 | def > 3 | ghi > 4 | xyz > (4 rows) > > Note that for the where not exists to work the fields would need to be > all the same, or you'd need a more complex query. If the info field > here was different you'd get an error an no insert / update. For that > case you might want to use "where not in": > > insert into main select * from loader where id not in (select id from main); Thank you very much for your example. Now i've got it :) I've test your example on a small set of my rows. While testing i've stumpled over a difference in sql-formulation. Using except seems to be a little slower than the more complex where not in (subquery) group by. Here is my example: CREATE TABLE tseq (value text); INSERT INTO tseq VALUES ('a') , ('b'), ('c'); CREATE UNIQUE INDEX tseq_unique on tseq (value); CREATE TEMP TABLE tmpseq(value text); INSERT INTO tmpseq VALUES ('a') , ('b'), ('c'); INSERT INTO tmpseq VALUES ('a') , ('b'), ('c'); INSERT INTO tmpseq VALUES ('a') , ('b'), ('d'); INSERT INTO tmpseq VALUES ('d') , ('b'), ('d'); SELECT* from tseq; value ------- a b c (3 rows) SELECT* from tmpseq; value ------- a b c a b c a b d d b d (12 rows) VACUUM VERBOSE ANALYSE; explain analyze SELECT value FROM tmpseq except SELECT value FROM tseq; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- HashSetOp Except (cost=0.00..2.34 rows=4 width=2) (actual time=0.157..0.158 rows=1 loops=1) -> Append (cost=0.00..2.30 rows=15 width=2) (actual time=0.012..0.126 rows=15 loops=1) -> Subquery Scan "*SELECT* 1" (cost=0.00..1.24 rows=12 width=2) (actual time=0.009..0.060 rows=12 loops=1) -> Seq Scan on tmpseq (cost=0.00..1.12 rows=12 width=2) (actual time=0.004..0.022 rows=12 loops=1) -> Subquery Scan "*SELECT* 2" (cost=0.00..1.06 rows=3 width=2) (actual time=0.006..0.018 rows=3 loops=1) -> Seq Scan on tseq (cost=0.00..1.03 rows=3 width=2) (actual time=0.003..0.009 rows=3 loops=1) Total runtime: 0.216 ms (7 rows) explain analyze SELECT value FROM tmpseq WHERE value NOT IN (SELECT value FROM tseq) GROUP BY value; QUERY PLAN ------------------------------------------------------------------------------------------------------------ HashAggregate (cost=2.20..2.22 rows=2 width=2) (actual time=0.053..0.055 rows=1 loops=1) -> Seq Scan on tmpseq (cost=1.04..2.19 rows=6 width=2) (actual time=0.038..0.043 rows=3 loops=1) Filter: (NOT (hashed SubPlan 1)) SubPlan 1 -> Seq Scan on tseq (cost=0.00..1.03 rows=3 width=2) (actual time=0.004..0.009 rows=3 loops=1) Total runtime: 0.105 ms (6 rows) My question: is this an generall behavior or just an effect of the small case? Greetings form Germany, Torsten
pgsql-performance by date: