[HACKERS] ASOF join - Mailing list pgsql-hackers
From | Konstantin Knizhnik |
---|---|
Subject | [HACKERS] ASOF join |
Date | |
Msg-id | bc494762-26bd-b100-e1f9-a97901ddad57@postgrespro.ru Whole thread Raw |
Responses |
Re: [HACKERS] ASOF join
|
List | pgsql-hackers |
I wonder if there were some discussion/attempts to add ASOF join to Postgres (sorry, may be there is better term for it, I am refereeing KDB definition: http://code.kx.com/wiki/Reference/aj ).
Such kind of join can be useful when we need to associate two timeseries. It is quite popular in trading:
//join the eq price and size for the op trade time
a::aj[`underlyingSym`time;select time, underlyingSym, sym, putorcall, ex, cond, price, seqnum, contracts, contractsTraded from t;eqtrades];
...and not only. Below is one example of how people now manually coding ASOF join:
select
count(*),
count(*)
filter (where timedelta_prev < -30),
count(*)
filter (where ride_prev = ride_next),
... -- many different aggregates
from
(
select
p.provider_id,
p.ts,
(
select extract(epoch from t.ts - p.ts)
from providers_positions t
where p.provider_id = t.provider_id and t.ts < p.ts and t.source = 'gps'
order by t.ts desc
limit 1
) as timedelta_prev,
(
select extract(epoch from t.ts - p.ts)
from providers_positions t
where p.provider_id = t.provider_id and t.ts > p.ts and t.source = 'gps'
order by t.ts
limit 1
) as timedelta,
(
select ride_id
from providers_positions t
where p.provider_id = t.provider_id and t.ts < p.ts and t.source = 'gps'
order by t.ts desc
limit 1
) as ride_prev,
(
select ride_id
from providers_positions t
where p.provider_id = t.provider_id and t.ts > p.ts and t.source = 'gps'
order by t.ts
limit 1
) as ride_next
from (
select
provider_id,
ts,
event_name
from
lvl2_681_parsed p
) p
where
p.event_name = 'GPS signal restored'
-- offset 0
) z;
Without OFFSET 0 this query generates awful execution plans with hundreds (!) of subplans corresponding to the subqueries.
Number of subplans (most of them are the same) is equal number of occurrences of timedelta, timedelta_prev, ... columns in target aggregates.
OFFSET 0 reduce number of subplans to 4. And I expect that using LATERAL join can reduce it to two and even without "OFFSET 0" trick.
But in any case - it is very complex and unnatural way of expressing this not so complex query.
With ASOF join is can be written much simpler.
Also Postgres is implementing this query using nested loop with index scan, which is definitely not the most efficient strategy.
The best way to implement ASOF join is to use something like mergejoin. Usually there are indexes for both timeseries, so what we need is to merge two ordered sets using ASOF join rules.
It will require minimal changes in SQL syntax, just adding ASOF keyword seems to be enough:
select * from Trades ASOF JOIN EqTrades USING (underlyingSym,time);
It seems to me that adding ASOF joins should not require huge amount of work and can be done with minimal number of changes in executor and optimizer.
But may be there are some problems/challenges which I do not realize now?
Such kind of join can be useful when we need to associate two timeseries. It is quite popular in trading:
//join the eq price and size for the op trade time
a::aj[`underlyingSym`time;select time, underlyingSym, sym, putorcall, ex, cond, price, seqnum, contracts, contractsTraded from t;eqtrades];
...and not only. Below is one example of how people now manually coding ASOF join:
select
count(*),
count(*)
filter (where timedelta_prev < -30),
count(*)
filter (where ride_prev = ride_next),
... -- many different aggregates
from
(
select
p.provider_id,
p.ts,
(
select extract(epoch from t.ts - p.ts)
from providers_positions t
where p.provider_id = t.provider_id and t.ts < p.ts and t.source = 'gps'
order by t.ts desc
limit 1
) as timedelta_prev,
(
select extract(epoch from t.ts - p.ts)
from providers_positions t
where p.provider_id = t.provider_id and t.ts > p.ts and t.source = 'gps'
order by t.ts
limit 1
) as timedelta,
(
select ride_id
from providers_positions t
where p.provider_id = t.provider_id and t.ts < p.ts and t.source = 'gps'
order by t.ts desc
limit 1
) as ride_prev,
(
select ride_id
from providers_positions t
where p.provider_id = t.provider_id and t.ts > p.ts and t.source = 'gps'
order by t.ts
limit 1
) as ride_next
from (
select
provider_id,
ts,
event_name
from
lvl2_681_parsed p
) p
where
p.event_name = 'GPS signal restored'
-- offset 0
) z;
Without OFFSET 0 this query generates awful execution plans with hundreds (!) of subplans corresponding to the subqueries.
Number of subplans (most of them are the same) is equal number of occurrences of timedelta, timedelta_prev, ... columns in target aggregates.
OFFSET 0 reduce number of subplans to 4. And I expect that using LATERAL join can reduce it to two and even without "OFFSET 0" trick.
But in any case - it is very complex and unnatural way of expressing this not so complex query.
With ASOF join is can be written much simpler.
Also Postgres is implementing this query using nested loop with index scan, which is definitely not the most efficient strategy.
The best way to implement ASOF join is to use something like mergejoin. Usually there are indexes for both timeseries, so what we need is to merge two ordered sets using ASOF join rules.
It will require minimal changes in SQL syntax, just adding ASOF keyword seems to be enough:
select * from Trades ASOF JOIN EqTrades USING (underlyingSym,time);
It seems to me that adding ASOF joins should not require huge amount of work and can be done with minimal number of changes in executor and optimizer.
But may be there are some problems/challenges which I do not realize now?
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
pgsql-hackers by date: