Thread: Index usage on OR queries
Hi,
I'm trying to optimize a query where I have two tables that both have a timestamp column. I want the result where either of the timestamps is after a specified time. In a reduced form, like this:
CREATE TABLE a
(
id serial NOT NULL PRIMARY KEY,
time timestamp without time zone NOT NULL DEFAULT now()
);
CREATE INDEX a_time_idx ON a USING btree (time DESC NULLS LAST);
CREATE TABLE b
(
id serial NOT NULL PRIMARY KEY,
time timestamp without time zone NOT NULL DEFAULT now()
);
CREATE INDEX b_time_idx ON b USING btree (time DESC NULLS LAST);
--- generate some data
insert into a(time)
select now() - '10 year'::interval * random() from generate_series(1, 1000000, 1);
insert into b(time)
select now() - '10 year'::interval * random() from generate_series(1, 1000000, 1);
-- Using constraint works as expected, and uses the time index.
select * from a join b using(id)
where a.time >= '2011-08-15';
-- ... both ways...
select * from a join b using(id)
where b.time >= '2011-08-15';
-- However, if I'm trying to do this for both times at once, the time index is not used at all
select * from a join b using(id)
where a.time >= '2011-08-15' OR b.time >= '2011-08-01'
-- This can be optimized by using CTEs
with am as (
select * from a where time >= '2011-08-15'
)
, bm as (
select * from b where time >= '2011-08-15'
)
select * from am join bm using(id)
-- end
I'm just wondering why the optimizer does things the way it does - and if the CTE version is the best way to go...
The actual case is slightly more complex and uses more tables - this is mostly a way to find updated data.
--
Eld på åren og sol på eng gjer mannen fegen og fjåg. [Jøtul]
<demo> 2011 Tore Halvorsen || +052 0553034554
I'm trying to optimize a query where I have two tables that both have a timestamp column. I want the result where either of the timestamps is after a specified time. In a reduced form, like this:
CREATE TABLE a
(
id serial NOT NULL PRIMARY KEY,
time timestamp without time zone NOT NULL DEFAULT now()
);
CREATE INDEX a_time_idx ON a USING btree (time DESC NULLS LAST);
CREATE TABLE b
(
id serial NOT NULL PRIMARY KEY,
time timestamp without time zone NOT NULL DEFAULT now()
);
CREATE INDEX b_time_idx ON b USING btree (time DESC NULLS LAST);
--- generate some data
insert into a(time)
select now() - '10 year'::interval * random() from generate_series(1, 1000000, 1);
insert into b(time)
select now() - '10 year'::interval * random() from generate_series(1, 1000000, 1);
-- Using constraint works as expected, and uses the time index.
select * from a join b using(id)
where a.time >= '2011-08-15';
-- ... both ways...
select * from a join b using(id)
where b.time >= '2011-08-15';
-- However, if I'm trying to do this for both times at once, the time index is not used at all
select * from a join b using(id)
where a.time >= '2011-08-15' OR b.time >= '2011-08-01'
-- This can be optimized by using CTEs
with am as (
select * from a where time >= '2011-08-15'
)
, bm as (
select * from b where time >= '2011-08-15'
)
select * from am join bm using(id)
-- end
I'm just wondering why the optimizer does things the way it does - and if the CTE version is the best way to go...
The actual case is slightly more complex and uses more tables - this is mostly a way to find updated data.
--
Eld på åren og sol på eng gjer mannen fegen og fjåg. [Jøtul]
<demo> 2011 Tore Halvorsen || +052 0553034554
-- This can be optimized by using CTEs
with am as (
select * from a where time >= '2011-08-15'
)
, bm as (
select * from b where time >= '2011-08-15'
)
select * from am join bm using(id)
Disregard this, it doesn't to the same at all.
Now I'm more confused as to how I can optimize the query.
--
Eld på åren og sol på eng gjer mannen fegen og fjåg. [Jøtul]
<demo> 2011 Tore Halvorsen || +052 0553034554
On 8/31/2011 9:35 AM, Tore Halvorsen wrote: > Hi, > > I'm trying to optimize a query where I have two tables that both have a > timestamp column. I want the result where either of the timestamps is > after a specified time. In a reduced form, like this: > > > CREATE TABLE a > ( > id serial NOT NULL PRIMARY KEY, > time timestamp without time zone NOT NULL DEFAULT now() > ); > > CREATE INDEX a_time_idx ON a USING btree (time DESC NULLS LAST); > > CREATE TABLE b > ( > id serial NOT NULL PRIMARY KEY, > time timestamp without time zone NOT NULL DEFAULT now() > ); > > CREATE INDEX b_time_idx ON b USING btree (time DESC NULLS LAST); > > --- generate some data > insert into a(time) > select now() - '10 year'::interval * random() from generate_series(1, > 1000000, 1); > > insert into b(time) > select now() - '10 year'::interval * random() from generate_series(1, > 1000000, 1); > > -- Using constraint works as expected, and uses the time index. > select * from a join b using(id) > where a.time >= '2011-08-15'; > > -- ... both ways... > select * from a join b using(id) > where b.time >= '2011-08-15'; > > -- However, if I'm trying to do this for both times at once, the time > index is not used at all > select * from a join b using(id) > where a.time >= '2011-08-15' OR b.time >= '2011-08-01' > > -- This can be optimized by using CTEs > with am as ( > select * from a where time >= '2011-08-15' > ) > , bm as ( > select * from b where time >= '2011-08-15' > ) > select * from am join bm using(id) > > -- end > > I'm just wondering why the optimizer does things the way it does - and > if the CTE version is the best way to go... > > The actual case is slightly more complex and uses more tables - this is > mostly a way to find updated data. > > -- > Eld på åren og sol på eng gjer mannen fegen og fjåg. [Jøtul] > <demo> 2011 Tore Halvorsen || +052 0553034554 On PG 9, after I ANALYZED the tables, it used indexes: QUERY PLAN --------------------------------------------------------------------------------------------------------------- Merge Join (cost=1.59..82778.35 rows=13171 width=20) (actual time=0.066..1076.616 rows=12966 loops=1) Merge Cond: (a.id = b.id) Join Filter: ((a."time" >= '2011-08-15 00:00:00'::timestamp without time zone) OR (b."time" >= '2011-08-01 0 -> Index Scan using a_pkey on a (cost=0.00..31389.36 rows=1000000 width=12) (actual time=0.007..204.856 ro -> Index Scan using b_pkey on b (cost=0.00..31389.36 rows=1000000 width=12) (actual time=0.006..224.189 ro ANALYZE is the magic. -Andy
On Wed, Aug 31, 2011 at 4:49 PM, Andy Colson <andy@squeakycode.net> wrote:
On PG 9, after I ANALYZED the tables, it used indexes:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Merge Join (cost=1.59..82778.35 rows=13171 width=20) (actual time=0.066..1076.616 rows=12966 loops=1)
Merge Cond: (a.id = b.id)
Join Filter: ((a."time" >= '2011-08-15 00:00:00'::timestamp without time zone) OR (b."time" >= '2011-08-01 0
-> Index Scan using a_pkey on a (cost=0.00..31389.36 rows=1000000 width=12) (actual time=0.007..204.856 ro
-> Index Scan using b_pkey on b (cost=0.00..31389.36 rows=1000000 width=12) (actual time=0.006..224.189 ro
ANALYZE is the magic.
You are, of course, right, but it doesn't use the TIME index.
Hmmm, may be my example isn't large enough to produce the issue.
--
Eld på åren og sol på eng gjer mannen fegen og fjåg. [Jøtul]
<demo> 2011 Tore Halvorsen || +052 0553034554
On 8/31/2011 9:53 AM, Tore Halvorsen wrote: > On Wed, Aug 31, 2011 at 4:49 PM, Andy Colson <andy@squeakycode.net > <mailto:andy@squeakycode.net>> wrote: > > On PG 9, after I ANALYZED the tables, it used indexes: > > > QUERY PLAN > ------------------------------__------------------------------__------------------------------__--------------------- > Merge Join (cost=1.59..82778.35 rows=13171 width=20) (actual > time=0.066..1076.616 rows=12966 loops=1) > Merge Cond: (a.id <http://a.id> = b.id <http://b.id>) > Join Filter: ((a."time" >= '2011-08-15 00:00:00'::timestamp > without time zone) OR (b."time" >= '2011-08-01 0 > -> Index Scan using a_pkey on a (cost=0.00..31389.36 > rows=1000000 width=12) (actual time=0.007..204.856 ro > -> Index Scan using b_pkey on b (cost=0.00..31389.36 > rows=1000000 width=12) (actual time=0.006..224.189 ro > > > ANALYZE is the magic. > > > You are, of course, right, but it doesn't use the TIME index. > Hmmm, may be my example isn't large enough to produce the issue. > > -- > Eld på åren og sol på eng gjer mannen fegen og fjåg. [Jøtul] > <demo> 2011 Tore Halvorsen || +052 0553034554 wow, yea.. I saw index and just assumed. Didn't even notice, sorry about that. -Andy
On 31 Srpen 2011, 16:49, Andy Colson wrote: > On 8/31/2011 9:35 AM, Tore Halvorsen wrote: >> Hi, >> >> I'm trying to optimize a query where I have two tables that both have a >> timestamp column. I want the result where either of the timestamps is >> after a specified time. In a reduced form, like this: >> >> >> CREATE TABLE a >> ( >> id serial NOT NULL PRIMARY KEY, >> time timestamp without time zone NOT NULL DEFAULT now() >> ); >> >> CREATE INDEX a_time_idx ON a USING btree (time DESC NULLS LAST); >> >> CREATE TABLE b >> ( >> id serial NOT NULL PRIMARY KEY, >> time timestamp without time zone NOT NULL DEFAULT now() >> ); >> >> CREATE INDEX b_time_idx ON b USING btree (time DESC NULLS LAST); >> >> --- generate some data >> insert into a(time) >> select now() - '10 year'::interval * random() from generate_series(1, >> 1000000, 1); >> >> insert into b(time) >> select now() - '10 year'::interval * random() from generate_series(1, >> 1000000, 1); >> >> -- Using constraint works as expected, and uses the time index. >> select * from a join b using(id) >> where a.time >= '2011-08-15'; >> >> -- ... both ways... >> select * from a join b using(id) >> where b.time >= '2011-08-15'; >> >> -- However, if I'm trying to do this for both times at once, the time >> index is not used at all >> select * from a join b using(id) >> where a.time >= '2011-08-15' OR b.time >= '2011-08-01' >> >> -- This can be optimized by using CTEs >> with am as ( >> select * from a where time >= '2011-08-15' >> ) >> , bm as ( >> select * from b where time >= '2011-08-15' >> ) >> select * from am join bm using(id) >> >> -- end >> >> I'm just wondering why the optimizer does things the way it does - and >> if the CTE version is the best way to go... >> >> The actual case is slightly more complex and uses more tables - this is >> mostly a way to find updated data. >> >> -- >> Eld på åren og sol på eng gjer mannen fegen og fjåg. [Jøtul] >> <demo> 2011 Tore Halvorsen || +052 0553034554 > > > > On PG 9, after I ANALYZED the tables, it used indexes: > > > QUERY PLAN > --------------------------------------------------------------------------------------------------------------- > Merge Join (cost=1.59..82778.35 rows=13171 width=20) (actual > time=0.066..1076.616 rows=12966 loops=1) > Merge Cond: (a.id = b.id) > Join Filter: ((a."time" >= '2011-08-15 00:00:00'::timestamp without > time zone) OR (b."time" >= '2011-08-01 0 > -> Index Scan using a_pkey on a (cost=0.00..31389.36 rows=1000000 > width=12) (actual time=0.007..204.856 ro > -> Index Scan using b_pkey on b (cost=0.00..31389.36 rows=1000000 > width=12) (actual time=0.006..224.189 ro > > > ANALYZE is the magic. Yes ;-) Who says we don't have a magical fairy dust? Anyway you could try to postpone the join a bit - determine the IDs first and then join. Something like this WITH t AS ( SELECT id FROM a WHERE time >= '2011-08-15' UNION SELECT id FROM b WHERE time >= '2011-08-15' ) SELECT * FROM a JOIN b ON (a.id = b.id) WHERE id IN (SELECT id FROM t); or something like that. It's not as clean as your query, but in some cases it's faster. Tomas