Thread: index vs seqscan question
I was looking and one of my SQL statements today and decided to try a slight variation to my query - and what a difference it made!! The query went from over 10 minutes down to under 30 seconds. I curious though; if I read the explain output correctly... -> Index Scan using timesheet_index_emp on timesheet (cost=0.00..19056.43 rows=122207 width=40) -> Seq Scan on timesheet (cost=0.00..7244.02 rows=122207 width=28) These statements imply the planner knows a seqscan is quicker than an index scan (only 3 times faster by its estimate, actually much more), why does it still use an index scan? Additional information: - pgsql version 7.1 - vacuum analyse is run nightly. - the "earncode in..." clause by itself will select 85% of rows. - earncode = ' ' selects 80% of rows in table. - typ ' ' or 'A' selects 99% of rows in table. - final result is just approx 100 rows. Frank The original (slow) version: gwacl=> explain SELECT * FROM (SELECT emp, SUM (CASE WHEN earncode IN (' ','A','O','P','Q','X','Z') THEN date_part('epoch',age(hi_shift,lo_shift))/(60*60) ELSE 0 END) AS hours FROM timesheet WHERE typ=' ' OR typ='A' GROUP BY emp) AS ts INNER JOIN (SELECT emp, first, last FROM employee WHERE status='A') AS emp ON (ts.emp = emp.emp) WHERE hours BETWEEN 0 AND 1250 ORDER BY hours DESC; NOTICE: QUERY PLAN: Sort (cost=21441.77..21441.77 rows=592 width=56) -> Merge Join (cost=21261.28..21414.52 rows=592 width=56) -> Sort (cost=21300.99..21300.99 rows=12221 width=40) -> Subquery Scan ts (cost=0.00..20278.50 rows=12221 width=40) -> Aggregate (cost=0.00..20278.50 rows=12221 width=40) -> Group (cost=0.00..19361.94 rows=122207 width=40) -> Index Scan using timesheet_index_emp on timesheet (cost=0.00..19056.43 rows=122207 width=40) -> Sort (cost=20.89..20.89 rows=38 width=36) -> Seq Scan on employee (cost=0.00..19.89 rows=38 width=36) EXPLAIN The revised (much improved) version: gwacl=> explain SELECT * FROM (SELECT emp, SUM(date_part('epoch',age(hi_shift,lo_shift))/(60*60)) as hours FROM timesheet WHERE typ=' ' OR typ='A' AND earncode IN (' ','A','O','P','Q','X','Z') GROUP BY emp) AS ts INNER JOIN (SELECT emp, first, last FROM employee WHERE status='A') AS emp ON (ts.emp = emp.emp) WHERE hours BETWEEN 0 AND 1250 ORDER BY hours DESC; NOTICE: QUERY PLAN: Sort (cost=23993.79..23993.79 rows=592 width=56) -> Merge Join (cost=23813.31..23966.55 rows=592 width=56) -> Sort (cost=23816.60..23816.60 rows=12221 width=28) -> Subquery Scan ts (cost=21608.46..22830.53 rows=12221 width=28) -> Aggregate (cost=21608.46..22830.53 rows=12221 width=28) -> Group (cost=21608.46..21913.97 rows=122207 width=28) -> Sort (cost=21608.46..21608.46 rows=122207 width=28) -> Seq Scan on timesheet (cost=0.00..7244.02 rows=122207 width=28) -> Sort (cost=20.89..20.89 rows=38 width=36) -> Seq Scan on employee (cost=0.00..19.89 rows=38 width=36) EXPLAIN
OK, I should have finished testing my changes before posting - the new/faster query should have brackets around (typ=' ' OR typ='A'), but its still fast as lightning! I don't think it made a big difference to explain results, but it appears seqscan is cheaper than it was before? Sort (cost=17171.83..17171.83 rows=488 width=56) -> Merge Join (cost=17023.63..17150.05 rows=488 width=56) -> Sort (cost=17002.73..17002.73 rows=10075 width=28) -> Subquery Scan ts (cost=15325.25..16332.79 rows=10075 width=28) -> Aggregate (cost=15325.25..16332.79 rows=10075 width=28) -> Group (cost=15325.25..15577.14 rows=100754 width=28) -> Sort (cost=15325.25..15325.25 rows=100754 width=28) -> Seq Scan on timesheet (cost=0.00..5410.22 rows=100754 width=28) -> Sort (cost=20.89..20.89 rows=38 width=36) -> Seq Scan on employee (cost=0.00..19.89 rows=38 width=36) At 04:15 PM 11/22/02, Frank Bax wrote: >I was looking and one of my SQL statements today and decided to try a >slight variation to my query - and what a difference it made!! The query >went from over 10 minutes down to under 30 seconds. I curious though; if >I read the explain output correctly... >-> Index Scan using timesheet_index_emp on >timesheet (cost=0.00..19056.43 rows=122207 width=40) >-> Seq Scan on timesheet (cost=0.00..7244.02 rows=122207 width=28) > >These statements imply the planner knows a seqscan is quicker than an >index scan (only 3 times faster by its estimate, actually much more), why >does it still use an index scan? > >Additional information: >- pgsql version 7.1 >- vacuum analyse is run nightly. >- the "earncode in..." clause by itself will select 85% of rows. >- earncode = ' ' selects 80% of rows in table. >- typ ' ' or 'A' selects 99% of rows in table. >- final result is just approx 100 rows. > >Frank > > >The original (slow) version: >gwacl=> explain SELECT * FROM (SELECT emp, SUM (CASE WHEN earncode IN (' >','A','O','P','Q','X','Z') THEN >date_part('epoch',age(hi_shift,lo_shift))/(60*60) ELSE 0 END) AS hours >FROM timesheet WHERE typ=' ' OR typ='A' GROUP BY emp) AS ts INNER JOIN >(SELECT emp, first, last FROM employee WHERE status='A') AS emp ON (ts.emp >= emp.emp) WHERE hours BETWEEN 0 AND 1250 ORDER BY hours DESC; >NOTICE: QUERY PLAN: > >Sort (cost=21441.77..21441.77 rows=592 width=56) > -> Merge Join (cost=21261.28..21414.52 rows=592 width=56) > -> Sort (cost=21300.99..21300.99 rows=12221 width=40) > -> Subquery Scan ts (cost=0.00..20278.50 rows=12221 width=40) > -> Aggregate (cost=0.00..20278.50 rows=12221 width=40) > -> Group (cost=0.00..19361.94 rows=122207 > width=40) > -> Index Scan using timesheet_index_emp > on timesheet (cost=0.00..19056.43 rows=122207 width=40) > -> Sort (cost=20.89..20.89 rows=38 width=36) > -> Seq Scan on employee (cost=0.00..19.89 rows=38 width=36) >EXPLAIN > >The revised (much improved) version: >gwacl=> explain SELECT * FROM (SELECT emp, >SUM(date_part('epoch',age(hi_shift,lo_shift))/(60*60)) as hours FROM >timesheet WHERE typ=' ' OR typ='A' AND earncode IN (' >','A','O','P','Q','X','Z') GROUP BY emp) AS ts INNER JOIN (SELECT emp, >first, last FROM employee WHERE status='A') AS emp ON (ts.emp = emp.emp) >WHERE hours BETWEEN 0 AND 1250 ORDER BY hours DESC; >NOTICE: QUERY PLAN: > >Sort (cost=23993.79..23993.79 rows=592 width=56) > -> Merge Join (cost=23813.31..23966.55 rows=592 width=56) > -> Sort (cost=23816.60..23816.60 rows=12221 width=28) > -> Subquery Scan ts (cost=21608.46..22830.53 rows=12221 > width=28) > -> Aggregate (cost=21608.46..22830.53 rows=12221 > width=28) > -> Group (cost=21608.46..21913.97 rows=122207 > width=28) > -> Sort (cost=21608.46..21608.46 > rows=122207 width=28) > -> Seq Scan on > timesheet (cost=0.00..7244.02 rows=122207 width=28) > -> Sort (cost=20.89..20.89 rows=38 width=36) > -> Seq Scan on employee (cost=0.00..19.89 rows=38 width=36) >EXPLAIN > >---------------------------(end of broadcast)--------------------------- >TIP 6: Have you searched our list archives? > >http://archives.postgresql.org
Frank Bax <fbax@sympatico.ca> writes: > OK, I should have finished testing my changes before posting - the > new/faster query should have brackets around (typ=' ' OR typ='A'), but its > still fast as lightning! I don't think it made a big difference to explain > results, but it appears seqscan is cheaper than it was before? The version with the typo couldn't use an indexscan, I think (planner's not real smart about asymmetrical AND/OR structures). The fixed version is probably going for seqscan because with the additional AND condition, it's estimating fewer rows need to be sorted. The seqscan isn't getting cheaper, but the sort is. Looking only at the planner's estimates is not very reliable though. What does EXPLAIN ANALYZE have to say? regards, tom lane
At 05:30 PM 11/22/02, Tom Lane wrote: >Frank Bax <fbax@sympatico.ca> writes: > > OK, I should have finished testing my changes before posting - the > > new/faster query should have brackets around (typ=' ' OR typ='A'), but its > > still fast as lightning! I don't think it made a big difference to > explain > > results, but it appears seqscan is cheaper than it was before? > >The version with the typo couldn't use an indexscan, I think (planner's >not real smart about asymmetrical AND/OR structures). The fixed >version is probably going for seqscan because with the additional AND >condition, it's estimating fewer rows need to be sorted. The seqscan >isn't getting cheaper, but the sort is. > >Looking only at the planner's estimates is not very reliable though. >What does EXPLAIN ANALYZE have to say? Did I say 7.1? But I do have access to 7.2.2 on a faster system and there seqscan was used for both queries. The new query is still faster; probably because more rows were removed sooner. The old query was written the way it was because at one time hours were reported in two columns (therefore the case stmt). Sorry to be a bother. Results below if you're still interested... NOTICE: QUERY PLAN: Sort (cost=20886.51..20886.51 rows=5045 width=48) (actual time=9647.67..9647.80 rows=116 loops=1) -> Merge Join (cost=20481.62..20576.24 rows=5045 width=48) (actual time=9642.71..9646.58 rows=116 loops=1) -> Sort (cost=20484.38..20484.38 rows=12302 width=31) (actual time=9638.32..9638.57 rows=230 loops=1) -> Subquery Scan ts (cost=18247.82..19478.04 rows=12302 width=31) (actual time=5976.92..9636.87 rows=230 loops=1) -> Aggregate (cost=18247.82..19478.04 rows=12302 width=31) (actual time=5976.91..9636.01 rows=230 loops=1) -> Group (cost=18247.82..18555.38 rows=123022 width=31) (actual time=5966.50..7083.89 rows=122970 loops=1) -> Sort (cost=18247.82..18247.82 rows=123022 width=31) (actual time=5966.48..6228.12 rows=122970 loops =1) -> Seq Scan on timesheet (cost=0.00..3293.56 rows=123022 width=31) (actual time=0.04..624.57 rows= 122970 loops=1) -> Sort (cost=33.55..33.55 rows=326 width=30) (actual time=4.26..4.66 rows=326 loops=1) -> Seq Scan on employee (cost=0.00..19.94 rows=326 width=30) (actual time=0.49..2.07 rows=326 loops=1) Total runtime: 9984.54 msec NOTICE: QUERY PLAN: Sort (cost=17738.45..17738.45 rows=4237 width=48) (actual time=8188.82..8188.95 rows=116 loops=1) -> Merge Join (cost=17403.58..17483.19 rows=4237 width=48) (actual time=8183.96..8187.71 rows=116 loops=1) -> Sort (cost=17502.32..17502.32 rows=10333 width=26) (actual time=8179.38..8179.64 rows=230 loops=1) -> Subquery Scan ts (cost=15647.81..16681.09 rows=10333 width=26) (actual time=5528.26..8177.94 rows=230 loops=1 ) -> Aggregate (cost=15647.81..16681.09 rows=10333 width=26) (actual time=5528.26..8177.07 rows=230 loops=1) -> Group (cost=15647.81..15906.13 rows=103328 width=26) (actual time=5519.65..6468.57 rows=107328 loops= 1) -> Sort (cost=15647.81..15647.81 rows=103328 width=26) (actual time=5519.64..5737.93 rows=107328 loo ps=1) -> Seq Scan on timesheet (cost=0.00..5447.88 rows=103328 width=26) (actual time=0.06..848.90 row s=107328 loops=1) -> Sort (cost=33.55..33.55 rows=326 width=30) (actual time=4.45..4.81 rows=326 loops=1) -> Seq Scan on employee (cost=0.00..19.94 rows=326 width=30) (actual time=0.54..2.18 rows=326 loops=1) Total runtime: 8412.90 msec