Thread: Approximate join on timestamps
Dear Experts,

I have two tables containing chronological data, and I want to join them
using the timestamps. The challenge is that the timestamps only match
approximately.

My first attempt was something like

  t1 join t2 on (abs(t1.t - t2.t) < '1 min'::interval)

Of course there is no "abs" for intervals, and I couldn't think of anything
better than this:

  t1 join t2 on (t1.t - t2.t < '1 min'::interval
             and t2.t - t1.t < '1 min'::interval)

What indexes could I add to make this moderately efficient?

But that query isn't really good enough. There is no single "epsilon" value
that works for this data set. I really want to find the closest match.

I feel that it ought to be possible to step through the two tables in
timestamp order, matching up elements. Is there any way to express this in
SQL?

(One detail is that the left table has fewer rows than the right table, and
I want one output row for each row in the left table.)

Many thanks for any suggestions.


Phil.

(You are welcome to CC: me in any replies.)
---------- Forwarded message ----------
From: Rhys Stewart <rhys.stewart@gmail.com>
Date: Mar 20, 2007 6:50 PM
Subject: Re: [GENERAL] Approximate join on timestamps
To: Phil Endecott <spam_from_postgresql_general@chezphil.org>

I had a similar problem a while back, so I made an abs_time function:

CREATE OR REPLACE FUNCTION abs_time(interval)
  RETURNS interval AS
$BODY$
BEGIN
  IF $1 < '00:00:00'::interval THEN
    RETURN ($1 * -1)::interval;
  ELSE
    RETURN $1;
  END IF;
END;
$BODY$
  LANGUAGE 'plpgsql' VOLATILE;
ALTER FUNCTION abs_time(interval) OWNER TO postgres;

Hope this gets you somewhere.
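With a function like that, the join condition from the original post could be
written along these lines (just a sketch, reusing the table and column names
and the one-minute tolerance from Phil's question):

SELECT *
FROM t1
JOIN t2 ON abs_time(t1.t - t2.t) < '1 min'::interval;

Note that a plain btree index on t1.t or t2.t cannot be used for a predicate
wrapped in a function call like this, so it solves the "abs" problem rather
than the efficiency one.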
"Phil Endecott" <spam_from_postgresql_general@chezphil.org> writes: > I have two tables containing chronological data, and I want to join them using > the timestamps. The challenge is that the timestamps only match approximately. > > My first attempt was something like > > t1 join t2 on (abs(t1.t-t2.t)<'1 min'::interval) > > Of course there is no "abs" for intervals, and I couldn't think of anything > better than this > > t1 join t2 on (t1.t-t2.t<'1 min'::interval and t2.t-t1.t<'1 min'::interval) > > What indexes could I add to make this moderately efficient? > > But that query isn't really good enough. There is no single "epsillon" value > that works for this data set. I really want to find the closest match. > > I feel that it ought to be possible to step through the two tables in > timestamp order matching up elements. Is there any way to express this is SQL? > > (One detail is that the left table has fewer rows than the right table, and I > want one output row for each row in the left table.) > > Many thanks for any suggestions. Untested, but what about something like a function that does (pseudocode below): select (min(t1.t) > ref_time) as above_t1 select (max(t1.t) < ref_time) as below_t1 if ((above_t1 - below_t1) =< '0 seconds'::interval then return above_t1 else return below_t1 to find out the nearest time with regards to t1 when compared to a reference time that should be the time you're looking for. Do the same for t2... I haven't checked the docs if there's something that already makes your life easier :-) -- Jorge Godoy <jgodoy@gmail.com>
"Rhys Stewart" <rhys.stewart@gmail.com> writes: > had a similar problem a while back. so i made and abs_time function: > > CREATE OR REPLACE FUNCTION abs_time(interval) > RETURNS interval AS > $BODY$ > BEGIN > if > $1 < '00:00:00'::interval > then > return ($1 * -1)::interval; > else > return $1; > END IF; > END; > $BODY$ > LANGUAGE 'plpgsql' VOLATILE; I believe that you can declare this IMMUTABLE. For a given interval it will always return the same value, so you can benefit from some optimization. http://www.postgresql.org/docs/8.2/interactive/xfunc-volatility.html An IMMUTABLE function cannot modify the database and is guaranteed to return the same results given the same arguments forever. This category allows the optimizer to pre-evaluate the function when a query calls it with constant arguments. For example, a query like SELECT ... WHERE x = 2 + 2 can be simplified on sight to SELECT ... WHERE x = 4, because the function underlying the integer addition operator is marked IMMUTABLE. -- Jorge Godoy <jgodoy@gmail.com>
Phil Endecott wrote:
> Dear Experts,
>
> I have two tables containing chronological data, and I want to join
> them using the timestamps. The challenge is that the timestamps only
> match approximately.

Hi Phil,

This is how we dealt with a similar situation; it may be suitable for you.

We have about 200,000,000 timestamped records, and we facilitate this with a
more or less data-warehousing approach. We generate timestamps at one-minute
intervals, then assign the appropriate values (readings) to each timestamp by
using the last recorded reading before that time (a separate record for each
instrument), within an appropriate interval so that missing values are not
populated with historic values. Note that this discards all records for each
reading except the last one per interval.

We also add a column ("timer") which has values of 1, 2, 5, 10, 20, 30, 60,
720 or 1440 depending on the hour and minutes of the timestamp, so
"where timer >= 60" returns hourly readings, "where timer = 720" gives midday
readings, "where timer >= 10" gives every 10-minute reading, etc.

This table then gets a clustered index on the timestamp (and is partitioned
by year). A 24-way self-join (i.e. 24 instrument readings joined by
timestamp) returns 3 months of 10-minute values in 20-odd seconds on a
fastish desktop box.

HTH,

Brent Wood
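The post doesn't spell out the exact rule for assigning the "timer" values,
but one reading that matches the examples ("timer >= 60" for hourly,
"timer = 720" for midday) is that timer is the largest of those granularities,
in minutes, that evenly divides the minutes elapsed since midnight. A sketch
under that assumption, with hypothetical names readings and ts:

CREATE OR REPLACE FUNCTION timer_value(ts timestamp)
  RETURNS integer AS
$BODY$
DECLARE
  m integer := extract(hour from ts)::integer * 60
             + extract(minute from ts)::integer;
BEGIN
  IF    m % 1440 = 0 THEN RETURN 1440;  -- midnight
  ELSIF m % 720  = 0 THEN RETURN 720;   -- midday
  ELSIF m % 60   = 0 THEN RETURN 60;    -- on the hour
  ELSIF m % 30   = 0 THEN RETURN 30;
  ELSIF m % 20   = 0 THEN RETURN 20;
  ELSIF m % 10   = 0 THEN RETURN 10;
  ELSIF m % 5    = 0 THEN RETURN 5;
  ELSIF m % 2    = 0 THEN RETURN 2;
  ELSE  RETURN 1;
  END IF;
END;
$BODY$
  LANGUAGE 'plpgsql' IMMUTABLE;

-- e.g. populate the column once the timestamps have been generated:
-- UPDATE readings SET timer = timer_value(ts);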
On Tue, 20 Mar 2007 23:30:46 +0000, "Phil Endecott"
<spam_from_postgresql_general@chezphil.org> wrote:
> I have two tables containing chronological data, and I want to join
> them using the timestamps. The challenge is that the timestamps only
> match approximately.
>
> My first attempt was something like
>
> t1 join t2 on (abs(t1.t-t2.t)<'1 min'::interval)
>
> Of course there is no "abs" for intervals, and I couldn't think of
> anything better than this
>
> t1 join t2 on (t1.t-t2.t<'1 min'::interval and t2.t-t1.t<'1 min'::interval)

How about using extract(epoch from t) to turn it into a numeric value?

select distinct on (t1.primary_key) *
from t1
join t2 on extract(epoch from t2.t) < extract(epoch from t1.t) + 30
       and extract(epoch from t2.t) > extract(epoch from t1.t) - 30
order by t1.something, abs(extract(epoch from t2.t) - extract(epoch from t1.t));

> What indexes could I add to make this moderately efficient?

If t is timestamp without time zone then you might be able to use an
expression index on it:

create index t1_epoch_idx on t1 ((extract(epoch from t)));
create index t2_epoch_idx on t2 ((extract(epoch from t)));

> But that query isn't really good enough. There is no single "epsilon"
> value that works for this data set. I really want to find the closest
> match.

See the "order by"; the +/- 30 in the above query can be used as the
tolerance on the join.

> (One detail is that the left table has fewer rows than the right table,
> and I want one output row for each row in the left table.)

See "distinct on".

klint.

Klint Gore
EMail: kg@kgb.une.edu.au
Snail Mail: A.B.R.I., University of New England,
            Armidale NSW 2351, Australia
Fax: +61 2 6772 5376
"Non rhyming slang - the possibilities are useless" - L.J.J.
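Putting those pieces together for the "one output row per row in the left
table" requirement might look like this (a sketch; "id" stands in for t1's
primary key, and the LEFT JOIN keeps t1 rows that have no t2 row within the
tolerance):

select distinct on (t1.id) t1.*, t2.*
from t1
left join t2
       on extract(epoch from t2.t) < extract(epoch from t1.t) + 30
      and extract(epoch from t2.t) > extract(epoch from t1.t) - 30
order by t1.id,
         abs(extract(epoch from t2.t) - extract(epoch from t1.t));

For each t1 row, the order by puts the closest t2 row first and distinct on
keeps only that one.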
Phil Endecott wrote:
> Dear Experts,
>
> I have two tables containing chronological data, and I want to join them
> using the timestamps. The challenge is that the timestamps only match
> approximately.
>
> My first attempt was something like
>
> t1 join t2 on (abs(t1.t-t2.t)<'1 min'::interval)
>
> Of course there is no "abs" for intervals, and I couldn't think of
> anything better than this
>
> t1 join t2 on (t1.t-t2.t<'1 min'::interval and t2.t-t1.t<'1 min'::interval)

What about:

t1 join t2 on ((t1.t - interval '30s', t1.t + interval '30s')
      overlaps (t2.t - interval '30s', t2.t + interval '30s'))

No need for abs(interval) or repeating conditions that way.

My first attempt was using 'between' instead of 'overlaps', but I don't
think that'll work correctly.

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //