Thread: Non-unique index performance
Hello! I have a table eith name person as described below. It has an unique index for id column (it is also primary key) and has an index for parent column. If I run a query with where clause on id column it uses the index (look at the first explain analyze result; it says "Index Scan using...") and the query for 582856 rows table results in 225,893 ms. But, if I run another query with where clause on parent column it does not use the index (look at the second explain analyze result; it says "Seq Scan using...") and the query for 582856 rows table results in 11192.913 ms. Why the difference of both queries is so dramatical for unique and non-unique indexed columns? Why PostgreSQL does not use the non-unique indexes (it says that it does sequential scan)? I have to use an index on non-unique column. What is the solution for that? Is there a way to speed up non-unique indexes? ***************************************************************** test=> \d person Table "public.person" Column | Type | Modifiers ---------+-----------------------+----------- name | character varying(30) | surname | character varying(30) | id | integer | not null parent | integer | Indexes: "person_pkey" primary key, btree (id) "parent_ndx" btree (parent) test=> explain analyze select id,name from person where id in ('17201', '338191', '244319', '515209', '20415'); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using person_pkey, person_pkey, person_pkey, person_pkey, person_pkey on person (cost=0.00..30.12 rows=5 width=18) (actual time=56.817..225.760 rows=5 loops=1) Index Cond: ((id = 17201) OR (id = 338191) OR (id = 244319) OR (id = 515209)OR (id = 20415)) Total runtime: 225.893 ms (3 rows) test=> explain analyze select * from person where parent in ('17201', '338191', '244319', '515209', '20415'); QUERY PLAN ----------------------------------------------------------------------------------------------------------------- Seq Scan on person (cost=0.00..35871.26 rows=14427 width=32) (actual time=0.063..11192.809 rows=5 loops=1) Filter: ((parent = 17201) OR (parent = 338191) OR (parent = 244319) OR (parent = 515209) OR (parent = 20415)) Total runtime: 11192.913 ms (3 rows) ***************************************************************** Thanks! -- sy
Sezai YILMAZ wrote: > Hello! > > I have a table eith name person as described below. It has an unique > index for id column (it is also primary key) and has an index for parent > column. > > If I run a query with where clause on id column it uses the index (look > at the first explain analyze result; it says "Index Scan using...") and > the query for 582856 rows table results in 225,893 ms. > > But, if I run another query with where clause on parent column it does > not use the index (look at the second explain analyze result; it says > "Seq Scan using...") and the query for 582856 rows table results in > 11192.913 ms. > > Why the difference of both queries is so dramatical for unique and > non-unique indexed columns? Why PostgreSQL does not use the non-unique > indexes (it says that it does sequential scan)? Because it thinks it will be faster/cheaper. > I have to use an index on non-unique column. What is the solution for > that? Is there a way to speed up non-unique indexes? You don't want to force it to use an index, you want it to make better estimates. Let's have a look... > ***************************************************************** > test=> \d person > Table "public.person" > Column | Type | Modifiers > ---------+-----------------------+----------- > name | character varying(30) | > surname | character varying(30) | > id | integer | not null > parent | integer | > Indexes: > "person_pkey" primary key, btree (id) > "parent_ndx" btree (parent) OK - all very simple. And you've said there are about 580,000 rows. > test=> explain analyze select id,name from person where id in ('17201', > '338191', '244319', '515209', '20415'); Why are you quoting integers? > test=> explain analyze select * from person where parent in ('17201', > '338191', '244319', '515209', '20415'); > QUERY PLAN > ----------------------------------------------------------------------------------------------------------------- > > Seq Scan on person (cost=0.00..35871.26 rows=14427 width=32) (actual > time=0.063..11192.809 rows=5 loops=1) > Filter: ((parent = 17201) OR (parent = 338191) OR (parent = 244319) OR > (parent = 515209) OR (parent = 20415)) > Total runtime: 11192.913 ms Hmm - for some reason it's expecting 14427 rows to be returned. If there were that many matches, then it might well be a better choice than going back and fore between the index and the table all the time. So - we need to find out why it thinks there will be so many rows returned. 1. VACUUM FULL ANALYSE person; 2. re-run the explain That will make sure that the table's statistics are up-to-date. If that hasn't helped, then perhaps we need to educate PG about the distribution of values in "parent". 2. ALTER TABLE person ALTER COLUMN parent SET STATISTICS 100; 3. ANALYSE person; 4. re-run the explain If that still doesn't work, keep increasing the statistics (max. value 1000). Let us know how that works for you. -- Richard Huxton Archonet Ltd
On Fri, Jun 24, 2005 at 11:44:50AM +0300, Sezai YILMAZ wrote: > Hello! > > I have a table eith name person as described below. It has an unique > index for id column (it is also primary key) and has an index for parent > column. <snip> > Why the difference of both queries is so dramatical for unique and > non-unique indexed columns? Why PostgreSQL does not use the non-unique > indexes (it says that it does sequential scan)? <snip> It has nothing to do with the index and everything to do with how many rows it expected to return. If you look at the explain output you'll see that the first only expected 5 rows to be returned, whereas the second expected 14427 rows. Looking up 14000 rows in a index is rather expensive so PostgreSQL decided that a seq scan would be faster. If these numbers arn't accurate, you need to show EXPLAIN ANALYZE output as well as checking how often you run ANALYZE generally. Hope this helps, > test=> explain analyze select id,name from person where id in ('17201', > '338191', '244319', '515209', '20415'); > QUERY PLAN > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > Index Scan using person_pkey, person_pkey, person_pkey, person_pkey, > person_pkey on person (cost=0.00..30.12 rows=5 width=18) (actual > time=56.817..225.760 rows=5 loops=1) > Index Cond: ((id = 17201) OR (id = 338191) OR (id = 244319) OR (id = > 515209)OR (id = 20415)) > Total runtime: 225.893 ms > (3 rows) > > > > > test=> explain analyze select * from person where parent in ('17201', > '338191', '244319', '515209', '20415'); > QUERY PLAN > ----------------------------------------------------------------------------------------------------------------- > Seq Scan on person (cost=0.00..35871.26 rows=14427 width=32) (actual > time=0.063..11192.809 rows=5 loops=1) > Filter: ((parent = 17201) OR (parent = 338191) OR (parent = 244319) > OR (parent = 515209) OR (parent = 20415)) > Total runtime: 11192.913 ms > (3 rows) > -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
Richard Huxton wrote: > Sezai YILMAZ wrote: > >> Hello! >> >> I have a table eith name person as described below. It has an unique >> index for id column (it is also primary key) and has an index for >> parent column. >> >> If I run a query with where clause on id column it uses the index >> (look at the first explain analyze result; it says "Index Scan >> using...") and the query for 582856 rows table results in 225,893 ms. >> >> But, if I run another query with where clause on parent column it >> does not use the index (look at the second explain analyze result; it >> says "Seq Scan using...") and the query for 582856 rows table results >> in 11192.913 ms. >> >> Why the difference of both queries is so dramatical for unique and >> non-unique indexed columns? Why PostgreSQL does not use the >> non-unique indexes (it says that it does sequential scan)? > > > Because it thinks it will be faster/cheaper. > >> I have to use an index on non-unique column. What is the solution for >> that? Is there a way to speed up non-unique indexes? > > > You don't want to force it to use an index, you want it to make better > estimates. Let's have a look... > >> ***************************************************************** >> test=> \d person >> Table "public.person" >> Column | Type | Modifiers >> ---------+-----------------------+----------- >> name | character varying(30) | >> surname | character varying(30) | >> id | integer | not null >> parent | integer | >> Indexes: >> "person_pkey" primary key, btree (id) >> "parent_ndx" btree (parent) > > > OK - all very simple. And you've said there are about 580,000 rows. > >> test=> explain analyze select id,name from person where id in >> ('17201', '338191', '244319', '515209', '20415'); > > > Why are you quoting integers? I qouted them to use indexes. The other method is type casting the values to indexed column type. I prefer the quoting method. >> test=> explain analyze select * from person where parent in ('17201', >> '338191', '244319', '515209', '20415'); >> QUERY PLAN >> ----------------------------------------------------------------------------------------------------------------- >> >> Seq Scan on person (cost=0.00..35871.26 rows=14427 width=32) (actual >> time=0.063..11192.809 rows=5 loops=1) >> Filter: ((parent = 17201) OR (parent = 338191) OR (parent = 244319) >> OR (parent = 515209) OR (parent = 20415)) >> Total runtime: 11192.913 ms > > > Hmm - for some reason it's expecting 14427 rows to be returned. If > there were that many matches, then it might well be a better choice > than going back and fore between the index and the table all the time. > > So - we need to find out why it thinks there will be so many rows > returned. > > 1. VACUUM FULL ANALYSE person; > 2. re-run the explain This solved the problem. Now it takes about 213 ms. Thanks in advance. -- sy
Sezai YILMAZ wrote: > Richard Huxton wrote: >> >> OK - all very simple. And you've said there are about 580,000 rows. >> >>> test=> explain analyze select id,name from person where id in >>> ('17201', '338191', '244319', '515209', '20415'); >> >> Why are you quoting integers? > > I qouted them to use indexes. The other method is type casting the > values to indexed column type. I prefer the quoting method. Sorry - this is just plain wrong. If you had an int8 column and a value such as 17, then PG looked at 17 and said Ha! an int4. Then it would not use your index. In such cases you could either cast the value to int8, or quote it (so that the planner decided its type later in the process). This was never required if you had a value that was large enough to be int8 but not int4, nor when the column was int4. It is not an issue at all as of version 8.0. So - if the column is a plain old int4 - just do things normally. >> Hmm - for some reason it's expecting 14427 rows to be returned. If >> there were that many matches, then it might well be a better choice >> than going back and fore between the index and the table all the time. >> >> So - we need to find out why it thinks there will be so many rows >> returned. >> >> 1. VACUUM FULL ANALYSE person; >> 2. re-run the explain > > > This solved the problem. Now it takes about 213 ms. If you'd analysed frequently anyway, perhaps repeat the steps a few times and make sure the problem doesn't re-occur. If it does re-occur, you'll want to increase the column statistics like I'd described. -- Richard Huxton Archonet Ltd
On Fri, Jun 24, 2005 at 11:23:54AM +0100, Richard Huxton wrote: > >I qouted them to use indexes. The other method is type casting the > >values to indexed column type. I prefer the quoting method. > > Sorry - this is just plain wrong. > > If you had an int8 column and a value such as 17, then PG looked at 17 > and said Ha! an int4. Then it would not use your index. In such cases > you could either cast the value to int8, or quote it (so that the > planner decided its type later in the process). > > This was never required if you had a value that was large enough to be > int8 but not int4, nor when the column was int4. It is not an issue at > all as of version 8.0. > > So - if the column is a plain old int4 - just do things normally. Well, I'd have to disagree. If you're handcrafting every query and you know the types of every column and you're confident they're not going to change, then yes, you could leave off the quotes. OTOH, if you want to be consistant, or you can't be sure your caller passed the right type or you just don't know anything about the table or function you're using, quoting is a simple effective method to make sure you get it right. It costs nothing and is a hell of a lot easier. -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.