Thread: Approximate join on timestamps
Dear Experts,

I have two tables containing chronological data, and I want to join them
using the timestamps. The challenge is that the timestamps only match
approximately.

My first attempt was something like

  t1 join t2 on (abs(t1.t - t2.t) < '1 min'::interval)

Of course there is no "abs" for intervals, and I couldn't think of anything
better than this:

  t1 join t2 on (t1.t - t2.t < '1 min'::interval
             and t2.t - t1.t < '1 min'::interval)

What indexes could I add to make this moderately efficient?

But that query isn't really good enough. There is no single "epsilon" value
that works for this data set. I really want to find the closest match.

I feel that it ought to be possible to step through the two tables in
timestamp order, matching up elements. Is there any way to express this in
SQL?

(One detail is that the left table has fewer rows than the right table, and
I want one output row for each row in the left table.)

Many thanks for any suggestions.


Phil.

(You are welcome to CC: me in any replies.)
---------- Forwarded message ----------
From: Rhys Stewart <rhys.stewart@gmail.com>
Date: Mar 20, 2007 6:50 PM
Subject: Re: [GENERAL] Approximate join on timestamps
To: Phil Endecott <spam_from_postgresql_general@chezphil.org>

I had a similar problem a while back, so I made an abs_time function:

CREATE OR REPLACE FUNCTION abs_time(interval)
  RETURNS interval AS
$BODY$
BEGIN
  IF $1 < '00:00:00'::interval THEN
    RETURN ($1 * -1)::interval;
  ELSE
    RETURN $1;
  END IF;
END;
$BODY$
  LANGUAGE 'plpgsql' VOLATILE;
ALTER FUNCTION abs_time(interval) OWNER TO postgres;

Hope this gets you somewhere.
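With a function like that, the join condition from the original post could be
written along these lines (just a sketch, reusing the table and column names
and the one-minute tolerance from Phil's question):

SELECT *
FROM t1
JOIN t2 ON abs_time(t1.t - t2.t) < '1 min'::interval;

Note that a plain btree index on t1.t or t2.t cannot be used for a predicate
wrapped in a function call like this, so it solves the "abs" problem rather
than the efficiency one.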
"Phil Endecott" <spam_from_postgresql_general@chezphil.org> writes: > I have two tables containing chronological data, and I want to join them using > the timestamps. The challenge is that the timestamps only match approximately. > > My first attempt was something like > > t1 join t2 on (abs(t1.t-t2.t)<'1 min'::interval) > > Of course there is no "abs" for intervals, and I couldn't think of anything > better than this > > t1 join t2 on (t1.t-t2.t<'1 min'::interval and t2.t-t1.t<'1 min'::interval) > > What indexes could I add to make this moderately efficient? > > But that query isn't really good enough. There is no single "epsillon" value > that works for this data set. I really want to find the closest match. > > I feel that it ought to be possible to step through the two tables in > timestamp order matching up elements. Is there any way to express this is SQL? > > (One detail is that the left table has fewer rows than the right table, and I > want one output row for each row in the left table.) > > Many thanks for any suggestions. Untested, but what about something like a function that does (pseudocode below): select (min(t1.t) > ref_time) as above_t1 select (max(t1.t) < ref_time) as below_t1 if ((above_t1 - below_t1) =< '0 seconds'::interval then return above_t1 else return below_t1 to find out the nearest time with regards to t1 when compared to a reference time that should be the time you're looking for. Do the same for t2... I haven't checked the docs if there's something that already makes your life easier :-) -- Jorge Godoy <jgodoy@gmail.com>
"Rhys Stewart" <rhys.stewart@gmail.com> writes: > had a similar problem a while back. so i made and abs_time function: > > CREATE OR REPLACE FUNCTION abs_time(interval) > RETURNS interval AS > $BODY$ > BEGIN > if > $1 < '00:00:00'::interval > then > return ($1 * -1)::interval; > else > return $1; > END IF; > END; > $BODY$ > LANGUAGE 'plpgsql' VOLATILE; I believe that you can declare this IMMUTABLE. For a given interval it will always return the same value, so you can benefit from some optimization. http://www.postgresql.org/docs/8.2/interactive/xfunc-volatility.html An IMMUTABLE function cannot modify the database and is guaranteed to return the same results given the same arguments forever. This category allows the optimizer to pre-evaluate the function when a query calls it with constant arguments. For example, a query like SELECT ... WHERE x = 2 + 2 can be simplified on sight to SELECT ... WHERE x = 4, because the function underlying the integer addition operator is marked IMMUTABLE. -- Jorge Godoy <jgodoy@gmail.com>
Phil Endecott wrote:
> Dear Experts,
>
> I have two tables containing chronological data, and I want to join
> them using the timestamps. The challenge is that the timestamps only
> match approximately.

Hi Phil,

This is how we dealt with a similar situation; it may be suitable for you.

We have about 200,000,000 timestamped records, and we facilitate this with a
more or less data-warehousing approach. We generate timestamps at one-minute
intervals, then assign the appropriate values (readings) to each timestamp by
using the last recorded reading before that time (a separate record for each
instrument), within an appropriate interval so that missing values are not
populated with historic values. Note that this discards all records for each
reading except the last one per interval.

We also add a column ("timer") which has values of 1, 2, 5, 10, 20, 30, 60,
720 or 1440 depending on the hour and minutes of the timestamp, so
"where timer >= 60" returns hourly readings, "where timer = 720" gives midday
readings, "where timer >= 10" gives every 10-minute reading, etc.

This table then gets a clustered index on the timestamp (and is partitioned
by year). A 24-way self-join (i.e. 24 instrument readings joined by
timestamp) returns 3 months of 10-minute values in 20-odd seconds on a
fastish desktop box.

HTH,

Brent Wood
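The post doesn't spell out the exact rule for assigning the "timer" values,
but one reading that matches the examples ("timer >= 60" for hourly,
"timer = 720" for midday) is that timer is the largest of those granularities,
in minutes, that evenly divides the minutes elapsed since midnight. A sketch
under that assumption, with hypothetical names readings and ts:

CREATE OR REPLACE FUNCTION timer_value(ts timestamp)
  RETURNS integer AS
$BODY$
DECLARE
  m integer := extract(hour from ts)::integer * 60
             + extract(minute from ts)::integer;
BEGIN
  IF    m % 1440 = 0 THEN RETURN 1440;  -- midnight
  ELSIF m % 720  = 0 THEN RETURN 720;   -- midday
  ELSIF m % 60   = 0 THEN RETURN 60;    -- on the hour
  ELSIF m % 30   = 0 THEN RETURN 30;
  ELSIF m % 20   = 0 THEN RETURN 20;
  ELSIF m % 10   = 0 THEN RETURN 10;
  ELSIF m % 5    = 0 THEN RETURN 5;
  ELSIF m % 2    = 0 THEN RETURN 2;
  ELSE  RETURN 1;
  END IF;
END;
$BODY$
  LANGUAGE 'plpgsql' IMMUTABLE;

-- e.g. populate the column once the timestamps have been generated:
-- UPDATE readings SET timer = timer_value(ts);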
On Tue, 20 Mar 2007 23:30:46 +0000, "Phil Endecott"
<spam_from_postgresql_general@chezphil.org> wrote:
> I have two tables containing chronological data, and I want to join
> them using the timestamps. The challenge is that the timestamps only
> match approximately.
>
> My first attempt was something like
>
> t1 join t2 on (abs(t1.t-t2.t)<'1 min'::interval)
>
> Of course there is no "abs" for intervals, and I couldn't think of
> anything better than this
>
> t1 join t2 on (t1.t-t2.t<'1 min'::interval and t2.t-t1.t<'1 min'::interval)

How about using extract(epoch from t) to turn it into a numeric value?

select distinct on (t1.primary_key) *
from t1
join t2 on extract(epoch from t2.t) < extract(epoch from t1.t) + 30
       and extract(epoch from t2.t) > extract(epoch from t1.t) - 30
order by t1.something, abs(extract(epoch from t2.t) - extract(epoch from t1.t));

> What indexes could I add to make this moderately efficient?

If t is timestamp without time zone then you might be able to use an
expression index on it:

create index t1_epoch_idx on t1 ((extract(epoch from t)));
create index t2_epoch_idx on t2 ((extract(epoch from t)));

> But that query isn't really good enough. There is no single "epsilon"
> value that works for this data set. I really want to find the closest
> match.

See the "order by"; the +/- 30 in the above query can be used as the
tolerance on the join.

> (One detail is that the left table has fewer rows than the right table,
> and I want one output row for each row in the left table.)

See "distinct on".

klint.

Klint Gore
EMail: kg@kgb.une.edu.au
Snail Mail: A.B.R.I., University of New England,
            Armidale NSW 2351, Australia
Fax: +61 2 6772 5376
"Non rhyming slang - the possibilities are useless" - L.J.J.
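Putting those pieces together for the "one output row per row in the left
table" requirement might look like this (a sketch; "id" stands in for t1's
primary key, and the LEFT JOIN keeps t1 rows that have no t2 row within the
tolerance):

select distinct on (t1.id) t1.*, t2.*
from t1
left join t2
       on extract(epoch from t2.t) < extract(epoch from t1.t) + 30
      and extract(epoch from t2.t) > extract(epoch from t1.t) - 30
order by t1.id,
         abs(extract(epoch from t2.t) - extract(epoch from t1.t));

For each t1 row, the order by puts the closest t2 row first and distinct on
keeps only that one.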
Phil Endecott wrote:
> Dear Experts,
>
> I have two tables containing chronological data, and I want to join them
> using the timestamps. The challenge is that the timestamps only match
> approximately.
>
> My first attempt was something like
>
> t1 join t2 on (abs(t1.t-t2.t)<'1 min'::interval)
>
> Of course there is no "abs" for intervals, and I couldn't think of
> anything better than this
>
> t1 join t2 on (t1.t-t2.t<'1 min'::interval and t2.t-t1.t<'1 min'::interval)

What about:

t1 join t2 on ((t1.t - interval '30s', t1.t + interval '30s')
      overlaps (t2.t - interval '30s', t2.t + interval '30s'))

No need for abs(interval) or repeating conditions that way.

My first attempt was using 'between' instead of 'overlaps', but I don't
think that'll work correctly.

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //