Home > mailing lists

Statistics use with functions - Mailing list pgsql-performance

From	Matthew Wakeling
Subject	Statistics use with functions
Date	May 8, 2009 10:46:27
Msg-id	alpine.DEB.2.00.0905081428430.2341@aragorn.flymine.org Whole thread Raw
Responses	Re: Statistics use with functions
List	pgsql-performance

Tree view

I'm running a rather complex query and noticed a peculiarity in the usage
of statistics that seriously affects the plan generated. I can extract the
relevant bit:

modmine-r9=# select * from pg_stats where tablename = 'geneflankingregion' AND attname IN ('distance', 'direction');
  schemaname |     tablename      |  attname  | null_frac | avg_width | n_distinct |         most_common_vals         |
             most_common_freqs                | histogram_bounds | correlation 

------------+--------------------+-----------+-----------+-----------+------------+----------------------------------+------------------------------------------------+------------------+-------------
  public     | geneflankingregion | distance  |         0 |         6 |          5 | {5.0kb,0.5kb,1.0kb,2.0kb,10.0kb} |
{0.201051,0.200798,0.200479,0.199088,0.198583}|                  |    0.197736 
  public     | geneflankingregion | direction |         0 |        10 |          2 | {downstream,upstream}            |
{0.500719,0.499281}                           |                  |    0.495437 
(2 rows)

modmine-r9=# SELECT COUNT(*) FROM geneflankingregion;
  count
--------
  455020
(1 row)

modmine-r9=# explain analyse SELECT * FROM geneflankingregion WHERE distance = '10.0kb' AND direction = 'upstream';
                                                          QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------
  Seq Scan on geneflankingregion  (cost=0.00..15507.30 rows=45115 width=213) (actual time=0.053..181.764 rows=45502
loops=1)
    Filter: ((distance = '10.0kb'::text) AND (direction = 'upstream'::text))
  Total runtime: 227.245 ms
(3 rows)

modmine-r9=# explain analyse SELECT * FROM geneflankingregion WHERE LOWER(distance) = '10.0kb' AND LOWER(direction) =
'upstream';
                                                                            QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------
  Bitmap Heap Scan on geneflankingregion
    (cost=66.95..88.77 rows=11 width=213)
    (actual time=207.555..357.359 rows=45502 loops=1)
    Recheck Cond: ((lower(distance) = '10.0kb'::text) AND (lower(direction) = 'upstream'::text))
    ->  BitmapAnd
          (cost=66.95..66.95 rows=11 width=0)
          (actual time=205.978..205.978 rows=0 loops=1)
          ->  Bitmap Index Scan on geneflankingregion__distance_equals
                (cost=0.00..31.34 rows=2275 width=0)
                (actual time=79.380..79.380 rows=91004 loops=1)
                Index Cond: (lower(distance) = '10.0kb'::text)
          ->  Bitmap Index Scan on geneflankingregion__direction_equals
                (cost=0.00..35.35 rows=2275 width=0)
                (actual time=124.639..124.639 rows=227510 loops=1)
                Index Cond: (lower(direction) = 'upstream'::text)
  Total runtime: 401.740 ms
(8 rows)

When I wrap the fields in the constraints in a LOWER() function, the
planner stops looking at the statistics and makes a wild guess, even
though it is very obvious from just looking what the result should be.
Embedded in a much larger query, the inaccuracy in the number of rows (11
instead of 45502) causes major planning problems. Also, why does the
BitmapAnd say zero actual rows?

I understand this probably isn't Priority No. 1, and there are some
interesting corner cases when n_distinct is higher than the histogram
width, but would it be possible to fix this one up?

Matthew

--
 I would like to think that in this day and age people would know better than
 to open executables in an e-mail. I'd also like to be able to flap my arms
 and fly to the moon.                                    -- Tim Mullen

pgsql-performance by date:

From: Viktor Rosenfeld
Date: 08 May 2009, 07:18:07
Subject: Re: Indexes not used in DELETE

From: Tom Lane
Date: 08 May 2009, 12:48:14
Subject: Re: Statistics use with functions

Statistics use with functions - Mailing list pgsql-performance

Previous

Next