Re: Use of index in 7.0 vs 6.5 - Mailing list pgsql-sql
From | Tom Lane |
---|---|
Subject | Re: Use of index in 7.0 vs 6.5 |
Date | |
Msg-id | 21530.959235278@sss.pgh.pa.us Whole thread Raw |
In response to | Use of index in 7.0 vs 6.5 (Ryan Bradetich <ryan_bradetich@hp.com>) |
Responses |
RE: Use of index in 7.0 vs 6.5
|
List | pgsql-sql |
Ryan Bradetich <ryan_bradetich@hp.com> writes: > procman=# explain select count(catagory) from medusa where host_id = 404 > and catagory like 'A%'; > Here is my analysis of the stastics (based on the examples in the > archive). > The most common value host_id in the table is 446 with row fraction of > ~ 2.8%. The estimated number of rows in the index is 6704. This > table has 4,630,229 entries in the table. > I do not understand why the planner would choose a seqscan over the > index scan because 6704/4,630,229 is ~ 0.15%. I see at least part of the answer. The 'rows=' number in the EXPLAIN output is the planner's estimate of the net number of rows out after applying all available WHERE restrictions. In this case we've got a clause "host_id = 404" which can be used with the index on host_id, and then we have another clause "catagory like 'A%'" which will be applied on-the-fly to the tuples returned by the indexscan. The rows number tells you about the estimated number of rows out after that second filter step. However, the cost of the indexscan depends mainly on the number of tuples that have to be fetched, and that is determined by the selectivity of just the "host_id = 404" clause. I made a dummy table with the schema you showed and then inserted the statistics you reported into the system tables (who's afraid of "update pg_class ..." ;-)). If I didn't mess up, you should be able to reproduce these EXPLAIN results: set enable_seqscan = off; explain select count(catagory) from medusa where host_id = 404; NOTICE: QUERY PLAN: Aggregate (cost=206943.69..206943.69 rows=1 width=12) -> Index Scan using medusa_host_id_key on medusa (cost=0.00..206781.97rows=64690 width=12) set enable_seqscan = on; explain select count(catagory) from medusa where host_id = 404; NOTICE: QUERY PLAN: Aggregate (cost=178115.59..178115.59 rows=1 width=12) -> Seq Scan on medusa (cost=0.00..177953.86 rows=64690 width=12) This shows that the planner is actually estimating that the indexscan will fetch about 64690 rows (of which it guesses only 6704 will remain after applying the catagory clause, but that's not really relevant to the cost estimate). Since there are 120076 pages in the table, that would mean pretty nearly one separate page fetch for each retrieved tuple, if the matching tuples are randomly distributed --- and that very probably *would* take more time than reading the whole table sequentially. So the planner's chain of logic holds up if all these assumptions are correct. Since you find that in reality the indexscan method is very quick, I'm guessing that there are actually fairly few tuples matching host_id = 404. Could you run a quick "select count(*)" to check? This seems to point up (once again) the deficiency of assuming that the most-common value in the table is a good guide to the frequency of typical values. You showed that host_id = 446 occurs in 2.8% of the rows in this table; a search for 446 very probably would be faster as a seqscan than as an indexscan (you might care to try it and see). But that's probably a statistical outlier that's not got much to do with the frequency of typical values in the table. The only really good answer to this problem is to collect more-detailed statistics in VACUUM ANALYZE, which I hope to see us doing in a release or so. In the meantime I am wondering about deliberately skewing the cost model in favor of indexscans, because I sure haven't heard many complaints about erroneous selection of indexscans... One way to put a thumb on the scales is to reduce the value of the SET variable random_page_cost. The default value is 4.0, which seems to correspond more or less to reality, but reducing it to 3 or so would shift the planner pretty nicely in the direction of indexscans. regards, tom lane