Thread: Seq Scan
Hi, I'm having some problems in performance in a simple select count(id) from.... I have 700 000 records in one table, and when I do: # explain select (id) from table_name; -[ RECORD 1 ]---------------------------------------------------------------- QUERY PLAN | Seq Scan on table_name (cost=0.00..8601.30 rows=266730 width=4) I had created an index for id(btree), but still shows "Seq Scan". What I'm doing wrong? Thanks, Tyler
Tyler Durden wrote: > I'm having some problems in performance in a simple select count(id) > from.... I have 700 000 records in one table, and when I do: > > # explain select (id) from table_name; > -[ RECORD > 1 ]---------------------------------------------------------------- > QUERY PLAN | Seq Scan on table_name (cost=0.00..8601.30 > rows=266730 width=4) > > I had created an index for id(btree), but still shows "Seq Scan". > What I'm doing wrong? You mention SELECT COUNT(ID), but your example shows SELECT ID. In either case, the planner is choosing the correct plan. Indexes exist to save the engine from visiting every row in the table, but both of these queries require every row to be visited anyway. Perhaps you think that these queries can be satisfied without visiting the actual table rows at all, using only the index. This is incorrect - PG doesn't work that way. - John Burger MITRE
Tyler Durden wrote: > Hi, > I'm having some problems in performance in a simple select count(id) > from.... I have 700 000 records in one table, and when I do: > > # explain select (id) from table_name; > -[ RECORD 1 > ]---------------------------------------------------------------- > QUERY PLAN | Seq Scan on table_name (cost=0.00..8601.30 rows=266730 > width=4) > > I had created an index for id(btree), but still shows "Seq Scan". > What I'm doing wrong? Nothing. You have to scan the table because you aren't giving postgresql anything to use the index by. Joshua D. Drake > > Thanks, > Tyler > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
Yes, either case happens the same. I'm come recently from MySQL and it works in a different way. I find strange that a simple SELECT COUNT(...) is so slow with only 700 000 records. Has been a nightmare optimizing this tables/queries. Sorry about this silly question, but I'm new to Posgresql. Thanks, Tyler On 6/1/07, John D. Burger <john@mitre.org> wrote: > You mention SELECT COUNT(ID), but your example shows SELECT ID. In > either case, the planner is choosing the correct plan. Indexes exist > to save the engine from visiting every row in the table, but both of > these queries require every row to be visited anyway. > > Perhaps you think that these queries can be satisfied without > visiting the actual table rows at all, using only the index. This is > incorrect - PG doesn't work that way. > > - John Burger > MITRE > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings >
Ok, my bad. But why this happens: # explain ANALYZE select id from table_name where id>200000; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using table_name_pkey on table_name (cost=0.00..2618.96 rows=68960 width=4) (actual time=220.543..1479.495 rows=66730 loops=1) Index Cond: (id > 200000) Total runtime: 1504.839 ms (3 rows) dun=# explain ANALYZE select id from table_name where id>10; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Seq Scan on table_name (cost=0.00..9268.12 rows=266703 width=4) (actual time=107.935..2733.592 rows=266720 loops=1) Filter: (id > 10) Total runtime: 2833.744 ms (3 rows) It uses Index Scan for id>200000 and Seq Scan for id>10?! On 6/1/07, Joshua D. Drake <jd@commandprompt.com> wrote: > Nothing. You have to scan the table because you aren't giving postgresql > anything to use the index by. > > Joshua D. Drake >
On Jun 1, 2007, at 12:24 , Tyler Durden wrote: > On 6/1/07, Joshua D. Drake <jd@commandprompt.com> wrote: > >> Nothing. You have to scan the table because you aren't giving >> postgresql >> anything to use the index by. > # explain ANALYZE select id from table_name where id>200000; > > QUERY PLAN > ---------------------------------------------------------------------- > ---------------------------------------------------------------------- > Index Scan using table_name_pkey on table_name (cost=0.00..2618.96 > rows=68960 width=4) (actual time=220.543..1479.495 rows=66730 loops=1) > Index Cond: (id > 200000) > Total runtime: 1504.839 ms > (3 rows) > > dun=# explain ANALYZE select id from table_name where id>10; > QUERY PLAN > ---------------------------------------------------------------------- > ------------------------------------------------ > Seq Scan on table_name (cost=0.00..9268.12 rows=266703 width=4) > (actual time=107.935..2733.592 rows=266720 loops=1) > Filter: (id > 10) > Total runtime: 2833.744 ms > (3 rows) > > > It uses Index Scan for id>200000 and Seq Scan for id>10?! [Please don't top-post. It makes discussions harder to follow] Because the planner estimates that it will be faster for it to scan the entire table than to use the index in the latter case. Note that only about 70,000 rows need to be visited for id > 200000, while nearly 270,000 rows need to be visited when id > 10. Michael Glaesemann grzm seespotcode net
On Fri, 2007-06-01 at 18:24 +0100, Tyler Durden wrote:
Based on the statistics pg has for your table, and on the cost of using the index, the cost based optimizer decided that it's more efficient to seq scan all of the rows than to incur the index overhead. These decisions are not always correct, but they're usually quite good.
If you don't believe it, try:
# set enable_seqscan=off;
# explain analyze ...
# set enable_seqscan=on;
# explain analyze ...
-Reece
It uses Index Scan for id>200000 and Seq Scan for id>10?!
Based on the statistics pg has for your table, and on the cost of using the index, the cost based optimizer decided that it's more efficient to seq scan all of the rows than to incur the index overhead. These decisions are not always correct, but they're usually quite good.
If you don't believe it, try:
# set enable_seqscan=off;
# explain analyze ...
# set enable_seqscan=on;
# explain analyze ...
-Reece
-- Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0 |
of the conversation. breaks the flow because it toppost Please don't At 1:17p -0400 on 01 Jun 2007, Tyler Durden wrote: > Yes, either case happens the same. > I'm come recently from MySQL and it works in a different way. > I find strange that a simple SELECT COUNT(...) is so slow with only > 700 000 records. > Has been a nightmare optimizing this tables/queries. > Sorry about this silly question, but I'm new to Posgresql. The much more knowledgable will correct me, but the abbr. version is that it is for data integrity and correctness reasons and the inherent way in which the MVCC model works. If it's any condolence, I believe that Oracle also suffers with this particular query, but I haven't seen any benchmarks to prove that (I think because Oracle expressly forbids benchmarks and comparisons in their license, but don't quote me on that). For a complete discussion, go this page and look for the thread with the subject 'Performance of count(*)' http://archives.postgresql.org/pgsql-performance/2007-03/threads.php I believe that may help. Kevin
Kevin Hunter <hunteke@earlham.edu> writes: > At 1:17p -0400 on 01 Jun 2007, Tyler Durden wrote: >> I find strange that a simple SELECT COUNT(...) is so slow with only >> 700 000 records. > The much more knowledgable will correct me, but the abbr. version is > that it is for data integrity and correctness reasons and the > inherent way in which the MVCC model works. The bottom line is that a "correct" implementation (ie, one that fully respects MVCC behavior) would create enormous overhead, as well as bad contention bottlenecks for concurrent updates. It doesn't seem worth it. If you want a cheap approximate answer, there are a couple of ways to get one, but SELECT COUNT(*) is not that. regards, tom lane
Ok, Thank all for the clarification.