Re: Comparing two slices within one table efficiently - Mailing list pgsql-sql
From | Andrew Kroeger |
---|---|
Subject | Re: Comparing two slices within one table efficiently |
Date | |
Msg-id | 46C0B01C.3020307@sprocks.gotdns.com Whole thread Raw |
In response to | Comparing two slices within one table efficiently ("Ken Simpson" <ksimpson@mailchannels.com>) |
Responses |
Re: Comparing two slices within one table efficiently
|
List | pgsql-sql |
Ken Simpson wrote: > I have a table with the following simplified form: > > create table t ( > run_id integer, > domain_id integer, > mta_id integer, > attribute1 integer, > attribute2 integer, > unique(run_id, domain_id, mta_id) > ); > > The table has about 1 million rows with run_id=1, another 1 million rows with run_id=2, and so on. > > I need to efficiently query the differences between "runs" - i.e. For each (domain_id, mta_id) tuple in run 1, is therea coresponding tuple in run 2 where either attribute1 or attribute2 have changed? > > The only way I have been able to think of doing this so far is an o(n^2) search, which even with indexes takes a long time.e.g. > > select * from t t1 where exists (select 1 from t t2 where t2.mta_id=t1.mta_id and t2.domain_id=t1.domain_id and (t2.attribute1!= t1.attribute1 or t2.attribute2 != t1.attribute2) > > This query takes millenia... > > Any help would be greatly appreciated. I hope I am naively missing some obvious alternative strategy, since this sort ofoperation must be common in databases. A self-join will probably perform better in your case. Consider the following query: select t1.domain_id as domain_id, t1.mta_id as mta_id, t1.run_id as run_id_1, t1.attribute1 as attribute1_1, t1.attribute2as attribute2_1, t2.run_id as run_id_2, t2.attribute1 as attribute1_2, t2.attribute2 as attribute2_2 from t t1 join t t2 on t1.mta_id = t2.mta_id and t1.domain_id = t2.domain_id and (t1.attribute1 != t2.attribute1 or t1.attribute2!= t2.attribute2) You are basically joining together 2 copies of the same table (one aliased to "t1" and the other aliased to "t2"). The logic in the join conditions are the same you had in the subselect. You will probably want to constrain the query some more in the join conditions. As my example above is currently written, you will get 2 records for each true difference (e.g. for a given mta_id/domain_id, you'll get 1 record when run_id X comes from t1 and run_id Y comes from t2 & you get another record describing the same difference when run_id X comes from t2 and run_id Y comes from t1). Some example constraints that might help are: t1.run_id < t2.run_id Assuming run_id increases chronologically, this will return 1 record for each case there is a difference in attributes from any prior run. t1.run_id + 1 = t2.run_id Again assuming run_id increases chronologically, this will only return a record when there are attribute differences between 2 successive runs. Indexes should also be considered for the performance of this query. At a minimum, you will probably want a compound index on (domain_id, mta_id). [... thinks for a bit ...] Actually, you could re-order your unique constraint to take care of this, as a unique constraint is basically an index with a unique constraint on it. Your unique constraint is currently on (run_id, domain_id, mta_id). If you re-order that unique constraint to (domain_id, mta_id, run_id), you will probably see a couple of benefits: - Due to how PG can use the first parts of a compound index, it will be able to use the (domain_id, mta_id) portion as an index to help your joins. - I assume domain_id is much more selective than run_id, so the unique checks should speed up as well. The way you have the unique constraint written, the index is first scanned based on run_id which isn't very selective - once it finds the correct run_id in the index, there will still be a lot of index entries to scan to match on domain_id and mta_id. Re-ordering the way I propose should narrow down the number of index entries quite quickly, as it will first be narrowed on domain_id, then mta_id, and finally run_id. Hope this helps. Andrew