Thread: index vs seqscan question

index vs seqscan question

From
Frank Bax
Date:
I was looking and one of my SQL statements today and decided to try a
slight variation to my query - and what a difference it made!!  The query
went from over 10 minutes down to under 30 seconds.  I curious though; if I
read the explain output correctly...
->  Index Scan using timesheet_index_emp on timesheet  (cost=0.00..19056.43
rows=122207 width=40)
->  Seq Scan on timesheet  (cost=0.00..7244.02 rows=122207 width=28)

These statements imply the planner knows a seqscan is quicker than an index
scan (only 3 times faster by its estimate, actually much more), why does it
still use an index scan?

Additional information:
- pgsql version 7.1
- vacuum analyse is run nightly.
- the "earncode in..." clause by itself will select 85% of rows.
- earncode = ' ' selects 80% of rows in table.
- typ ' ' or 'A' selects 99% of rows in table.
- final result is just approx 100 rows.

Frank


The original (slow) version:
gwacl=> explain SELECT * FROM (SELECT emp, SUM (CASE WHEN earncode IN ('
','A','O','P','Q','X','Z') THEN
date_part('epoch',age(hi_shift,lo_shift))/(60*60) ELSE 0 END) AS hours FROM
timesheet WHERE typ=' ' OR typ='A' GROUP BY emp) AS ts INNER JOIN (SELECT
emp, first, last FROM employee WHERE status='A') AS emp ON (ts.emp =
emp.emp) WHERE hours BETWEEN 0 AND 1250 ORDER BY hours DESC;
NOTICE:  QUERY PLAN:

Sort  (cost=21441.77..21441.77 rows=592 width=56)
   ->  Merge Join  (cost=21261.28..21414.52 rows=592 width=56)
         ->  Sort  (cost=21300.99..21300.99 rows=12221 width=40)
               ->  Subquery Scan ts  (cost=0.00..20278.50 rows=12221 width=40)
                     ->  Aggregate  (cost=0.00..20278.50 rows=12221 width=40)
                           ->  Group  (cost=0.00..19361.94 rows=122207
width=40)
                                 ->  Index Scan using timesheet_index_emp
on timesheet  (cost=0.00..19056.43 rows=122207 width=40)
         ->  Sort  (cost=20.89..20.89 rows=38 width=36)
               ->  Seq Scan on employee  (cost=0.00..19.89 rows=38 width=36)
EXPLAIN

The revised (much improved) version:
gwacl=> explain SELECT * FROM (SELECT emp,
SUM(date_part('epoch',age(hi_shift,lo_shift))/(60*60)) as hours FROM
timesheet WHERE typ=' ' OR typ='A' AND earncode IN ('
','A','O','P','Q','X','Z') GROUP BY emp) AS ts INNER JOIN (SELECT emp,
first, last FROM employee WHERE status='A') AS emp ON (ts.emp = emp.emp)
WHERE hours BETWEEN 0 AND 1250 ORDER BY hours DESC;
NOTICE:  QUERY PLAN:

Sort  (cost=23993.79..23993.79 rows=592 width=56)
   ->  Merge Join  (cost=23813.31..23966.55 rows=592 width=56)
         ->  Sort  (cost=23816.60..23816.60 rows=12221 width=28)
               ->  Subquery Scan ts  (cost=21608.46..22830.53 rows=12221
width=28)
                     ->  Aggregate  (cost=21608.46..22830.53 rows=12221
width=28)
                           ->  Group  (cost=21608.46..21913.97 rows=122207
width=28)
                                 ->  Sort  (cost=21608.46..21608.46
rows=122207 width=28)
                                      ->  Seq Scan on
timesheet  (cost=0.00..7244.02 rows=122207 width=28)
         ->  Sort  (cost=20.89..20.89 rows=38 width=36)
               ->  Seq Scan on employee  (cost=0.00..19.89 rows=38 width=36)
EXPLAIN

Re: index vs seqscan question

From
Frank Bax
Date:
OK, I should have finished testing my changes before posting - the
new/faster query should have brackets around (typ=' ' OR typ='A'), but its
still fast as lightning!  I don't think it made a big difference to explain
results, but it appears seqscan is cheaper than it was before?

Sort  (cost=17171.83..17171.83 rows=488 width=56)
   ->  Merge Join  (cost=17023.63..17150.05 rows=488 width=56)
         ->  Sort  (cost=17002.73..17002.73 rows=10075 width=28)
               ->  Subquery Scan ts  (cost=15325.25..16332.79 rows=10075
width=28)
                     ->  Aggregate  (cost=15325.25..16332.79 rows=10075
width=28)
                           ->  Group  (cost=15325.25..15577.14 rows=100754
width=28)
                                 ->  Sort  (cost=15325.25..15325.25
rows=100754 width=28)
                                       ->  Seq Scan on
timesheet  (cost=0.00..5410.22 rows=100754 width=28)
         ->  Sort  (cost=20.89..20.89 rows=38 width=36)
               ->  Seq Scan on employee  (cost=0.00..19.89 rows=38 width=36)



At 04:15 PM 11/22/02, Frank Bax wrote:
>I was looking and one of my SQL statements today and decided to try a
>slight variation to my query - and what a difference it made!!  The query
>went from over 10 minutes down to under 30 seconds.  I curious though; if
>I read the explain output correctly...
>->  Index Scan using timesheet_index_emp on
>timesheet  (cost=0.00..19056.43 rows=122207 width=40)
>->  Seq Scan on timesheet  (cost=0.00..7244.02 rows=122207 width=28)
>
>These statements imply the planner knows a seqscan is quicker than an
>index scan (only 3 times faster by its estimate, actually much more), why
>does it still use an index scan?
>
>Additional information:
>- pgsql version 7.1
>- vacuum analyse is run nightly.
>- the "earncode in..." clause by itself will select 85% of rows.
>- earncode = ' ' selects 80% of rows in table.
>- typ ' ' or 'A' selects 99% of rows in table.
>- final result is just approx 100 rows.
>
>Frank
>
>
>The original (slow) version:
>gwacl=> explain SELECT * FROM (SELECT emp, SUM (CASE WHEN earncode IN ('
>','A','O','P','Q','X','Z') THEN
>date_part('epoch',age(hi_shift,lo_shift))/(60*60) ELSE 0 END) AS hours
>FROM timesheet WHERE typ=' ' OR typ='A' GROUP BY emp) AS ts INNER JOIN
>(SELECT emp, first, last FROM employee WHERE status='A') AS emp ON (ts.emp
>= emp.emp) WHERE hours BETWEEN 0 AND 1250 ORDER BY hours DESC;
>NOTICE:  QUERY PLAN:
>
>Sort  (cost=21441.77..21441.77 rows=592 width=56)
>   ->  Merge Join  (cost=21261.28..21414.52 rows=592 width=56)
>         ->  Sort  (cost=21300.99..21300.99 rows=12221 width=40)
>               ->  Subquery Scan ts  (cost=0.00..20278.50 rows=12221 width=40)
>                     ->  Aggregate  (cost=0.00..20278.50 rows=12221 width=40)
>                           ->  Group  (cost=0.00..19361.94 rows=122207
> width=40)
>                                 ->  Index Scan using timesheet_index_emp
> on timesheet  (cost=0.00..19056.43 rows=122207 width=40)
>         ->  Sort  (cost=20.89..20.89 rows=38 width=36)
>               ->  Seq Scan on employee  (cost=0.00..19.89 rows=38 width=36)
>EXPLAIN
>
>The revised (much improved) version:
>gwacl=> explain SELECT * FROM (SELECT emp,
>SUM(date_part('epoch',age(hi_shift,lo_shift))/(60*60)) as hours FROM
>timesheet WHERE typ=' ' OR typ='A' AND earncode IN ('
>','A','O','P','Q','X','Z') GROUP BY emp) AS ts INNER JOIN (SELECT emp,
>first, last FROM employee WHERE status='A') AS emp ON (ts.emp = emp.emp)
>WHERE hours BETWEEN 0 AND 1250 ORDER BY hours DESC;
>NOTICE:  QUERY PLAN:
>
>Sort  (cost=23993.79..23993.79 rows=592 width=56)
>   ->  Merge Join  (cost=23813.31..23966.55 rows=592 width=56)
>         ->  Sort  (cost=23816.60..23816.60 rows=12221 width=28)
>               ->  Subquery Scan ts  (cost=21608.46..22830.53 rows=12221
> width=28)
>                     ->  Aggregate  (cost=21608.46..22830.53 rows=12221
> width=28)
>                           ->  Group  (cost=21608.46..21913.97 rows=122207
> width=28)
>                                 ->  Sort  (cost=21608.46..21608.46
> rows=122207 width=28)
>                                      ->  Seq Scan on
> timesheet  (cost=0.00..7244.02 rows=122207 width=28)
>         ->  Sort  (cost=20.89..20.89 rows=38 width=36)
>               ->  Seq Scan on employee  (cost=0.00..19.89 rows=38 width=36)
>EXPLAIN
>
>---------------------------(end of broadcast)---------------------------
>TIP 6: Have you searched our list archives?
>
>http://archives.postgresql.org

Re: index vs seqscan question

From
Tom Lane
Date:
Frank Bax <fbax@sympatico.ca> writes:
> OK, I should have finished testing my changes before posting - the
> new/faster query should have brackets around (typ=' ' OR typ='A'), but its
> still fast as lightning!  I don't think it made a big difference to explain
> results, but it appears seqscan is cheaper than it was before?

The version with the typo couldn't use an indexscan, I think (planner's
not real smart about asymmetrical AND/OR structures).  The fixed
version is probably going for seqscan because with the additional AND
condition, it's estimating fewer rows need to be sorted.  The seqscan
isn't getting cheaper, but the sort is.

Looking only at the planner's estimates is not very reliable though.
What does EXPLAIN ANALYZE have to say?

            regards, tom lane

Re: index vs seqscan question

From
Frank Bax
Date:
At 05:30 PM 11/22/02, Tom Lane wrote:
>Frank Bax <fbax@sympatico.ca> writes:
> > OK, I should have finished testing my changes before posting - the
> > new/faster query should have brackets around (typ=' ' OR typ='A'), but its
> > still fast as lightning!  I don't think it made a big difference to
> explain
> > results, but it appears seqscan is cheaper than it was before?
>
>The version with the typo couldn't use an indexscan, I think (planner's
>not real smart about asymmetrical AND/OR structures).  The fixed
>version is probably going for seqscan because with the additional AND
>condition, it's estimating fewer rows need to be sorted.  The seqscan
>isn't getting cheaper, but the sort is.
>
>Looking only at the planner's estimates is not very reliable though.
>What does EXPLAIN ANALYZE have to say?


Did I say 7.1?  But I do have access to 7.2.2 on a faster system and there
seqscan was used for both queries.  The new query is still faster; probably
because more rows were removed sooner.  The old query was written the way
it was because at one time hours were reported in two columns (therefore
the case stmt).  Sorry to be a bother.  Results below if you're still
interested...

NOTICE:  QUERY PLAN:

Sort  (cost=20886.51..20886.51 rows=5045 width=48) (actual
time=9647.67..9647.80 rows=116 loops=1)
   ->  Merge Join  (cost=20481.62..20576.24 rows=5045 width=48) (actual
time=9642.71..9646.58 rows=116 loops=1)
       ->  Sort  (cost=20484.38..20484.38 rows=12302 width=31) (actual
time=9638.32..9638.57 rows=230 loops=1)
           ->  Subquery Scan ts  (cost=18247.82..19478.04 rows=12302
width=31) (actual time=5976.92..9636.87 rows=230 loops=1)
               ->  Aggregate  (cost=18247.82..19478.04 rows=12302 width=31)
(actual time=5976.91..9636.01 rows=230 loops=1)
                   ->  Group  (cost=18247.82..18555.38 rows=123022
width=31) (actual time=5966.50..7083.89 rows=122970 loops=1)
                       ->  Sort  (cost=18247.82..18247.82 rows=123022
width=31) (actual time=5966.48..6228.12 rows=122970 loops
=1)
                           ->  Seq Scan on timesheet  (cost=0.00..3293.56
rows=123022 width=31) (actual time=0.04..624.57 rows=
122970 loops=1)
       ->  Sort  (cost=33.55..33.55 rows=326 width=30) (actual
time=4.26..4.66 rows=326 loops=1)
           ->  Seq Scan on employee  (cost=0.00..19.94 rows=326 width=30)
(actual time=0.49..2.07 rows=326 loops=1)
Total runtime: 9984.54 msec

NOTICE:  QUERY PLAN:

Sort  (cost=17738.45..17738.45 rows=4237 width=48) (actual
time=8188.82..8188.95 rows=116 loops=1)
   ->  Merge Join  (cost=17403.58..17483.19 rows=4237 width=48) (actual
time=8183.96..8187.71 rows=116 loops=1)
         ->  Sort  (cost=17502.32..17502.32 rows=10333 width=26) (actual
time=8179.38..8179.64 rows=230 loops=1)
             ->  Subquery Scan ts  (cost=15647.81..16681.09 rows=10333
width=26) (actual time=5528.26..8177.94 rows=230 loops=1
)
                 ->  Aggregate  (cost=15647.81..16681.09 rows=10333
width=26) (actual time=5528.26..8177.07 rows=230 loops=1)
                     ->  Group  (cost=15647.81..15906.13 rows=103328
width=26) (actual time=5519.65..6468.57 rows=107328 loops=
1)
                         ->  Sort  (cost=15647.81..15647.81 rows=103328
width=26) (actual time=5519.64..5737.93 rows=107328 loo
ps=1)
                             ->  Seq Scan on timesheet  (cost=0.00..5447.88
rows=103328 width=26) (actual time=0.06..848.90 row
s=107328 loops=1)
         ->  Sort  (cost=33.55..33.55 rows=326 width=30) (actual
time=4.45..4.81 rows=326 loops=1)
             ->  Seq Scan on employee  (cost=0.00..19.94 rows=326 width=30)
(actual time=0.54..2.18 rows=326 loops=1)
Total runtime: 8412.90 msec