Thread: Non-unique index performance

Non-unique index performance

From

Sezai YILMAZ

Date:

24 June 2005, 05:46:03

Hello!

I have a table eith name person as described below. It has an unique
index for id column (it is also primary key) and has an index for parent
column.

If I run a query with where clause on id column it uses the index (look
at the first explain analyze result; it says "Index Scan using...") and
the query for 582856 rows table results in 225,893 ms.

But, if I run another query with where clause on parent column it does
not use the index (look at the second explain analyze result; it says
"Seq Scan using...") and the query for 582856 rows table results in
11192.913 ms.

Why the difference of both queries is so dramatical for unique and
non-unique indexed columns? Why PostgreSQL does not use the non-unique
indexes (it says that it does sequential scan)?

I have to use an index on non-unique column. What is the solution for
that? Is there a way to speed up non-unique indexes?

*****************************************************************
test=> \d person
           Table "public.person"
 Column  |         Type          | Modifiers
---------+-----------------------+-----------
 name    | character varying(30) |
 surname | character varying(30) |
 id      | integer               | not null
 parent  | integer               |
Indexes:
    "person_pkey" primary key, btree (id)
    "parent_ndx" btree (parent)

test=> explain analyze select id,name from person where id in ('17201',
'338191', '244319', '515209', '20415');
    QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using person_pkey, person_pkey, person_pkey, person_pkey,
person_pkey on person  (cost=0.00..30.12 rows=5 width=18) (actual
time=56.817..225.760 rows=5 loops=1)
   Index Cond: ((id = 17201) OR (id = 338191) OR (id = 244319) OR (id =
515209)OR (id = 20415))
 Total runtime: 225.893 ms
(3 rows)




test=> explain analyze select * from person where parent in ('17201',
'338191', '244319', '515209', '20415');
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on person  (cost=0.00..35871.26 rows=14427 width=32) (actual
time=0.063..11192.809 rows=5 loops=1)
   Filter: ((parent = 17201) OR (parent = 338191) OR (parent = 244319)
OR (parent = 515209) OR (parent = 20415))
 Total runtime: 11192.913 ms
(3 rows)

*****************************************************************

Thanks!

--
sy

Re: Non-unique index performance

From

Richard Huxton

Date:

24 June 2005, 06:21:02

Sezai YILMAZ wrote:
> Hello!
>
> I have a table eith name person as described below. It has an unique
> index for id column (it is also primary key) and has an index for parent
> column.
>
> If I run a query with where clause on id column it uses the index (look
> at the first explain analyze result; it says "Index Scan using...") and
> the query for 582856 rows table results in 225,893 ms.
>
> But, if I run another query with where clause on parent column it does
> not use the index (look at the second explain analyze result; it says
> "Seq Scan using...") and the query for 582856 rows table results in
> 11192.913 ms.
>
> Why the difference of both queries is so dramatical for unique and
> non-unique indexed columns? Why PostgreSQL does not use the non-unique
> indexes (it says that it does sequential scan)?

Because it thinks it will be faster/cheaper.

> I have to use an index on non-unique column. What is the solution for
> that? Is there a way to speed up non-unique indexes?

You don't want to force it to use an index, you want it to make better
estimates. Let's have a look...

> *****************************************************************
> test=> \d person
>           Table "public.person"
> Column  |         Type          | Modifiers
> ---------+-----------------------+-----------
> name    | character varying(30) |
> surname | character varying(30) |
> id      | integer               | not null
> parent  | integer               |
> Indexes:
>    "person_pkey" primary key, btree (id)
>    "parent_ndx" btree (parent)

OK - all very simple. And you've said there are about 580,000 rows.

> test=> explain analyze select id,name from person where id in ('17201',
> '338191', '244319', '515209', '20415');

Why are you quoting integers?

> test=> explain analyze select * from person where parent in ('17201',
> '338191', '244319', '515209', '20415');
>                                                   QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------
>
> Seq Scan on person  (cost=0.00..35871.26 rows=14427 width=32) (actual
> time=0.063..11192.809 rows=5 loops=1)
>   Filter: ((parent = 17201) OR (parent = 338191) OR (parent = 244319) OR
> (parent = 515209) OR (parent = 20415))
> Total runtime: 11192.913 ms

Hmm - for some reason it's expecting 14427 rows to be returned. If there
were that many matches, then it might well be a better choice than going
back and fore between the index and the table all the time.

So - we need to find out why it thinks there will be so many rows returned.

1. VACUUM FULL ANALYSE person;
2. re-run the explain

That will make sure that the table's statistics are up-to-date. If that
hasn't helped, then perhaps we need to educate PG about the distribution
of values in "parent".

2. ALTER TABLE person ALTER COLUMN parent SET STATISTICS 100;
3. ANALYSE person;
4. re-run the explain

If that still doesn't work, keep increasing the statistics (max. value
1000).

Let us know how that works for you.
--
   Richard Huxton
   Archonet Ltd

Re: Non-unique index performance

From

Martijn van Oosterhout

Date:

24 June 2005, 06:29:23

On Fri, Jun 24, 2005 at 11:44:50AM +0300, Sezai YILMAZ wrote:
> Hello!
>
> I have a table eith name person as described below. It has an unique
> index for id column (it is also primary key) and has an index for parent
> column.

<snip>

> Why the difference of both queries is so dramatical for unique and
> non-unique indexed columns? Why PostgreSQL does not use the non-unique
> indexes (it says that it does sequential scan)?

<snip>

It has nothing to do with the index and everything to do with how many
rows it expected to return. If you look at the explain output you'll
see that the first only expected 5 rows to be returned, whereas the
second expected 14427 rows. Looking up 14000 rows in a index is rather
expensive so PostgreSQL decided that a seq scan would be faster.

If these numbers arn't accurate, you need to show EXPLAIN ANALYZE
output as well as checking how often you run ANALYZE generally.

Hope this helps,

> test=> explain analyze select id,name from person where id in ('17201',
> '338191', '244319', '515209', '20415');
>    QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Index Scan using person_pkey, person_pkey, person_pkey, person_pkey,
> person_pkey on person  (cost=0.00..30.12 rows=5 width=18) (actual
> time=56.817..225.760 rows=5 loops=1)
>   Index Cond: ((id = 17201) OR (id = 338191) OR (id = 244319) OR (id =
> 515209)OR (id = 20415))
> Total runtime: 225.893 ms
> (3 rows)
>
>
>
>
> test=> explain analyze select * from person where parent in ('17201',
> '338191', '244319', '515209', '20415');
>                                                   QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------
> Seq Scan on person  (cost=0.00..35871.26 rows=14427 width=32) (actual
> time=0.063..11192.809 rows=5 loops=1)
>   Filter: ((parent = 17201) OR (parent = 338191) OR (parent = 244319)
> OR (parent = 515209) OR (parent = 20415))
> Total runtime: 11192.913 ms
> (3 rows)
>

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

msg-32091-99762.dat

Re: Non-unique index performance

From

Sezai YILMAZ

Date:

24 June 2005, 07:13:04

Richard Huxton wrote:

> Sezai YILMAZ wrote:
>
>> Hello!
>>
>> I have a table eith name person as described below. It has an unique
>> index for id column (it is also primary key) and has an index for
>> parent column.
>>
>> If I run a query with where clause on id column it uses the index
>> (look at the first explain analyze result; it says "Index Scan
>> using...") and the query for 582856 rows table results in 225,893 ms.
>>
>> But, if I run another query with where clause on parent column it
>> does not use the index (look at the second explain analyze result; it
>> says "Seq Scan using...") and the query for 582856 rows table results
>> in 11192.913 ms.
>>
>> Why the difference of both queries is so dramatical for unique and
>> non-unique indexed columns? Why PostgreSQL does not use the
>> non-unique indexes (it says that it does sequential scan)?
>
>
> Because it thinks it will be faster/cheaper.
>
>> I have to use an index on non-unique column. What is the solution for
>> that? Is there a way to speed up non-unique indexes?
>
>
> You don't want to force it to use an index, you want it to make better
> estimates. Let's have a look...
>
>> *****************************************************************
>> test=> \d person
>>           Table "public.person"
>> Column  |         Type          | Modifiers
>> ---------+-----------------------+-----------
>> name    | character varying(30) |
>> surname | character varying(30) |
>> id      | integer               | not null
>> parent  | integer               |
>> Indexes:
>>    "person_pkey" primary key, btree (id)
>>    "parent_ndx" btree (parent)
>
>
> OK - all very simple. And you've said there are about 580,000 rows.
>
>> test=> explain analyze select id,name from person where id in
>> ('17201', '338191', '244319', '515209', '20415');
>
>
> Why are you quoting integers?

I qouted them to use indexes. The other method is type casting the
values to indexed column type. I prefer the quoting method.

>> test=> explain analyze select * from person where parent in ('17201',
>> '338191', '244319', '515209', '20415');
>>                                                   QUERY PLAN
>> -----------------------------------------------------------------------------------------------------------------
>>
>> Seq Scan on person  (cost=0.00..35871.26 rows=14427 width=32) (actual
>> time=0.063..11192.809 rows=5 loops=1)
>>   Filter: ((parent = 17201) OR (parent = 338191) OR (parent = 244319)
>> OR (parent = 515209) OR (parent = 20415))
>> Total runtime: 11192.913 ms
>
>
> Hmm - for some reason it's expecting 14427 rows to be returned. If
> there were that many matches, then it might well be a better choice
> than going back and fore between the index and the table all the time.
>
> So - we need to find out why it thinks there will be so many rows
> returned.
>
> 1. VACUUM FULL ANALYSE person;
> 2. re-run the explain

This solved the problem. Now it takes about 213 ms.

Thanks in advance.

--
sy

Re: Non-unique index performance

From

Richard Huxton

Date:

24 June 2005, 07:25:50

Sezai YILMAZ wrote:
> Richard Huxton wrote:
>>
>> OK - all very simple. And you've said there are about 580,000 rows.
>>
>>> test=> explain analyze select id,name from person where id in
>>> ('17201', '338191', '244319', '515209', '20415');
>>
>> Why are you quoting integers?
>
> I qouted them to use indexes. The other method is type casting the
> values to indexed column type. I prefer the quoting method.

Sorry - this is just plain wrong.

If you had an int8 column and a value such as 17, then PG looked at 17
and said Ha! an int4. Then it would not use your index. In such cases
you could either cast the value to int8, or quote it (so that the
planner decided its type later in the process).

This was never required if you had a value that was large enough to be
int8 but not int4, nor when the column was int4. It is not an issue at
all as of version 8.0.

So - if the column is a plain old int4 - just do things normally.

>> Hmm - for some reason it's expecting 14427 rows to be returned. If
>> there were that many matches, then it might well be a better choice
>> than going back and fore between the index and the table all the time.
>>
>> So - we need to find out why it thinks there will be so many rows
>> returned.
>>
>> 1. VACUUM FULL ANALYSE person;
>> 2. re-run the explain
>
>
> This solved the problem. Now it takes about 213 ms.

If you'd analysed frequently anyway, perhaps repeat the steps a few
times and make sure the problem doesn't re-occur. If it does re-occur,
you'll want to increase the column statistics like I'd described.

--
   Richard Huxton
   Archonet Ltd

Re: Non-unique index performance

From

Martijn van Oosterhout

Date:

24 June 2005, 11:39:46

On Fri, Jun 24, 2005 at 11:23:54AM +0100, Richard Huxton wrote:
> >I qouted them to use indexes. The other method is type casting the
> >values to indexed column type. I prefer the quoting method.
>
> Sorry - this is just plain wrong.
>
> If you had an int8 column and a value such as 17, then PG looked at 17
> and said Ha! an int4. Then it would not use your index. In such cases
> you could either cast the value to int8, or quote it (so that the
> planner decided its type later in the process).
>
> This was never required if you had a value that was large enough to be
> int8 but not int4, nor when the column was int4. It is not an issue at
> all as of version 8.0.
>
> So - if the column is a plain old int4 - just do things normally.

Well, I'd have to disagree. If you're handcrafting every query and you
know the types of every column and you're confident they're not going
to change, then yes, you could leave off the quotes.

OTOH, if you want to be consistant, or you can't be sure your caller
passed the right type or you just don't know anything about the table
or function you're using, quoting is a simple effective method to make
sure you get it right. It costs nothing and is a hell of a lot easier.

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

msg-32091-99777.dat