Thread: Index usage on OR queries

Index usage on OR queries

From

Tore Halvorsen

Date:

31 August 2011, 11:35:38

Hi,

I'm trying to optimize a query where I have two tables that both have a timestamp column. I want the result where either of the timestamps is after a specified time. In a reduced form, like this:

CREATE TABLE a
(
id serial NOT NULL PRIMARY KEY,
time timestamp without time zone NOT NULL DEFAULT now()
);

CREATE INDEX a_time_idx ON a USING btree (time DESC NULLS LAST);

CREATE TABLE b
(
id serial NOT NULL PRIMARY KEY,
time timestamp without time zone NOT NULL DEFAULT now()
);

CREATE INDEX b_time_idx ON b USING btree (time DESC NULLS LAST);

--- generate some data
insert into a(time)
select now() - '10 year'::interval * random() from generate_series(1, 1000000, 1);

insert into b(time)
select now() - '10 year'::interval * random() from generate_series(1, 1000000, 1);

-- Using constraint works as expected, and uses the time index.
select * from a join b using(id)
where a.time >= '2011-08-15';

-- ... both ways...
select * from a join b using(id)
where b.time >= '2011-08-15';

-- However, if I'm trying to do this for both times at once, the time index is not used at all
select * from a join b using(id)
where a.time >= '2011-08-15' OR b.time >= '2011-08-01'

-- This can be optimized by using CTEs
with am as (
select * from a where time >= '2011-08-15'
)
, bm as (
select * from b where time >= '2011-08-15'
)
select * from am join bm using(id)

-- end

I'm just wondering why the optimizer does things the way it does - and if the CTE version is the best way to go...

The actual case is slightly more complex and uses more tables - this is mostly a way to find updated data.

--
Eld på åren og sol på eng gjer mannen fegen og fjåg. [Jøtul]
<demo> 2011 Tore Halvorsen || +052 0553034554

Re: Index usage on OR queries

From

Tore Halvorsen

Date:

31 August 2011, 11:41:25

-- This can be optimized by using CTEs
with am as (
select * from a where time >= '2011-08-15'
)
, bm as (
select * from b where time >= '2011-08-15'
)
select * from am join bm using(id)

Disregard this, it doesn't to the same at all.

Now I'm more confused as to how I can optimize the query.

--
Eld på åren og sol på eng gjer mannen fegen og fjåg. [Jøtul]
<demo> 2011 Tore Halvorsen || +052 0553034554

Re: Index usage on OR queries

From

Andy Colson

Date:

31 August 2011, 11:49:06

On 8/31/2011 9:35 AM, Tore Halvorsen wrote:
> Hi,
>
> I'm trying to optimize a query where I have two tables that both have a
> timestamp column. I want the result where either of the timestamps is
> after a specified time. In a reduced form, like this:
>
>
> CREATE TABLE a
> (
>    id serial NOT NULL PRIMARY KEY,
>    time timestamp without time zone NOT NULL DEFAULT now()
> );
>
> CREATE INDEX a_time_idx ON a USING btree (time DESC NULLS LAST);
>
> CREATE TABLE b
> (
>    id serial NOT NULL PRIMARY KEY,
>    time timestamp without time zone NOT NULL DEFAULT now()
> );
>
> CREATE INDEX b_time_idx ON b USING btree (time DESC NULLS LAST);
>
> --- generate some data
> insert into a(time)
> select now() - '10 year'::interval * random() from generate_series(1,
> 1000000, 1);
>
> insert into b(time)
> select now() - '10 year'::interval * random() from generate_series(1,
> 1000000, 1);
>
> -- Using constraint works as expected, and uses the time index.
> select * from a join b using(id)
> where a.time >= '2011-08-15';
>
> -- ... both ways...
> select * from a join b using(id)
> where b.time >= '2011-08-15';
>
> -- However, if I'm trying to do this for both times at once, the time
> index is not used at all
> select * from a join b using(id)
> where a.time >= '2011-08-15' OR b.time >= '2011-08-01'
>
> -- This can be optimized by using CTEs
> with am as (
>    select * from a where time >= '2011-08-15'
> )
> , bm as (
>    select * from b where time >= '2011-08-15'
> )
> select * from am join bm using(id)
>
> -- end
>
> I'm just wondering why the optimizer does things the way it does - and
> if the CTE version is the best way to go...
>
> The actual case is slightly more complex and uses more tables - this is
> mostly a way to find updated data.
>
> --
> Eld på åren og sol på eng gjer mannen fegen og fjåg. [Jøtul]
> <demo> 2011 Tore Halvorsen || +052 0553034554



On PG 9, after I ANALYZED the tables, it used indexes:


QUERY PLAN
---------------------------------------------------------------------------------------------------------------
  Merge Join  (cost=1.59..82778.35 rows=13171 width=20) (actual
time=0.066..1076.616 rows=12966 loops=1)
    Merge Cond: (a.id = b.id)
    Join Filter: ((a."time" >= '2011-08-15 00:00:00'::timestamp without
time zone) OR (b."time" >= '2011-08-01 0
    ->  Index Scan using a_pkey on a  (cost=0.00..31389.36 rows=1000000
width=12) (actual time=0.007..204.856 ro
    ->  Index Scan using b_pkey on b  (cost=0.00..31389.36 rows=1000000
width=12) (actual time=0.006..224.189 ro


ANALYZE is the magic.

-Andy

Re: Index usage on OR queries

From

Tore Halvorsen

Date:

31 August 2011, 11:54:02

On Wed, Aug 31, 2011 at 4:49 PM, Andy Colson <andy@squeakycode.net> wrote:

On PG 9, after I ANALYZED the tables, it used indexes:

QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Merge Join (cost=1.59..82778.35 rows=13171 width=20) (actual time=0.066..1076.616 rows=12966 loops=1)
Merge Cond: (a.id = b.id)
Join Filter: ((a."time" >= '2011-08-15 00:00:00'::timestamp without time zone) OR (b."time" >= '2011-08-01 0
-> Index Scan using a_pkey on a (cost=0.00..31389.36 rows=1000000 width=12) (actual time=0.007..204.856 ro
-> Index Scan using b_pkey on b (cost=0.00..31389.36 rows=1000000 width=12) (actual time=0.006..224.189 ro

ANALYZE is the magic.

You are, of course, right, but it doesn't use the TIME index.
Hmmm, may be my example isn't large enough to produce the issue.

--
Eld på åren og sol på eng gjer mannen fegen og fjåg. [Jøtul]
<demo> 2011 Tore Halvorsen || +052 0553034554

Re: Index usage on OR queries

From

Andy Colson

Date:

31 August 2011, 11:56:24

On 8/31/2011 9:53 AM, Tore Halvorsen wrote:
> On Wed, Aug 31, 2011 at 4:49 PM, Andy Colson <andy@squeakycode.net
> <mailto:andy@squeakycode.net>> wrote:
>
>     On PG 9, after I ANALYZED the tables, it used indexes:
>
>
>     QUERY PLAN
>
------------------------------__------------------------------__------------------------------__---------------------
>       Merge Join  (cost=1.59..82778.35 rows=13171 width=20) (actual
>     time=0.066..1076.616 rows=12966 loops=1)
>        Merge Cond: (a.id <http://a.id> = b.id <http://b.id>)
>        Join Filter: ((a."time" >= '2011-08-15 00:00:00'::timestamp
>     without time zone) OR (b."time" >= '2011-08-01 0
>        ->  Index Scan using a_pkey on a  (cost=0.00..31389.36
>     rows=1000000 width=12) (actual time=0.007..204.856 ro
>        ->  Index Scan using b_pkey on b  (cost=0.00..31389.36
>     rows=1000000 width=12) (actual time=0.006..224.189 ro
>
>
>     ANALYZE is the magic.
>
>
> You are, of course, right, but it doesn't use the TIME index.
> Hmmm, may be my example isn't large enough to produce the issue.
>
> --
> Eld på åren og sol på eng gjer mannen fegen og fjåg. [Jøtul]
> <demo> 2011 Tore Halvorsen || +052 0553034554

wow, yea.. I saw index and just assumed.  Didn't even notice, sorry
about that.

-Andy

Re: Index usage on OR queries

From

"Tomas Vondra"

Date:

31 August 2011, 12:01:41

On 31 Srpen 2011, 16:49, Andy Colson wrote:
> On 8/31/2011 9:35 AM, Tore Halvorsen wrote:
>> Hi,
>>
>> I'm trying to optimize a query where I have two tables that both have a
>> timestamp column. I want the result where either of the timestamps is
>> after a specified time. In a reduced form, like this:
>>
>>
>> CREATE TABLE a
>> (
>>    id serial NOT NULL PRIMARY KEY,
>>    time timestamp without time zone NOT NULL DEFAULT now()
>> );
>>
>> CREATE INDEX a_time_idx ON a USING btree (time DESC NULLS LAST);
>>
>> CREATE TABLE b
>> (
>>    id serial NOT NULL PRIMARY KEY,
>>    time timestamp without time zone NOT NULL DEFAULT now()
>> );
>>
>> CREATE INDEX b_time_idx ON b USING btree (time DESC NULLS LAST);
>>
>> --- generate some data
>> insert into a(time)
>> select now() - '10 year'::interval * random() from generate_series(1,
>> 1000000, 1);
>>
>> insert into b(time)
>> select now() - '10 year'::interval * random() from generate_series(1,
>> 1000000, 1);
>>
>> -- Using constraint works as expected, and uses the time index.
>> select * from a join b using(id)
>> where a.time >= '2011-08-15';
>>
>> -- ... both ways...
>> select * from a join b using(id)
>> where b.time >= '2011-08-15';
>>
>> -- However, if I'm trying to do this for both times at once, the time
>> index is not used at all
>> select * from a join b using(id)
>> where a.time >= '2011-08-15' OR b.time >= '2011-08-01'
>>
>> -- This can be optimized by using CTEs
>> with am as (
>>    select * from a where time >= '2011-08-15'
>> )
>> , bm as (
>>    select * from b where time >= '2011-08-15'
>> )
>> select * from am join bm using(id)
>>
>> -- end
>>
>> I'm just wondering why the optimizer does things the way it does - and
>> if the CTE version is the best way to go...
>>
>> The actual case is slightly more complex and uses more tables - this is
>> mostly a way to find updated data.
>>
>> --
>> Eld på åren og sol på eng gjer mannen fegen og fjåg. [Jøtul]
>> <demo> 2011 Tore Halvorsen || +052 0553034554
>
>
>
> On PG 9, after I ANALYZED the tables, it used indexes:
>
>
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------
>   Merge Join  (cost=1.59..82778.35 rows=13171 width=20) (actual
> time=0.066..1076.616 rows=12966 loops=1)
>     Merge Cond: (a.id = b.id)
>     Join Filter: ((a."time" >= '2011-08-15 00:00:00'::timestamp without
> time zone) OR (b."time" >= '2011-08-01 0
>     ->  Index Scan using a_pkey on a  (cost=0.00..31389.36 rows=1000000
> width=12) (actual time=0.007..204.856 ro
>     ->  Index Scan using b_pkey on b  (cost=0.00..31389.36 rows=1000000
> width=12) (actual time=0.006..224.189 ro
>
>
> ANALYZE is the magic.

Yes ;-) Who says we don't have a magical fairy dust?

Anyway you could try to postpone the join a bit - determine the IDs first
and then join. Something like this

WITH t AS (
  SELECT id FROM a WHERE time >= '2011-08-15'
  UNION
  SELECT id FROM b WHERE time >= '2011-08-15'
)
SELECT * FROM a JOIN b ON (a.id = b.id) WHERE id IN (SELECT id FROM t);

or something like that. It's not as clean as your query, but in some cases
it's faster.

Tomas