Statistics use with functions - Mailing list pgsql-performance
From | Matthew Wakeling |
---|---|
Subject | Statistics use with functions |
Date | |
Msg-id | alpine.DEB.2.00.0905081428430.2341@aragorn.flymine.org Whole thread Raw |
Responses |
Re: Statistics use with functions
|
List | pgsql-performance |
I'm running a rather complex query and noticed a peculiarity in the usage of statistics that seriously affects the plan generated. I can extract the relevant bit: modmine-r9=# select * from pg_stats where tablename = 'geneflankingregion' AND attname IN ('distance', 'direction'); schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation ------------+--------------------+-----------+-----------+-----------+------------+----------------------------------+------------------------------------------------+------------------+------------- public | geneflankingregion | distance | 0 | 6 | 5 | {5.0kb,0.5kb,1.0kb,2.0kb,10.0kb} | {0.201051,0.200798,0.200479,0.199088,0.198583}| | 0.197736 public | geneflankingregion | direction | 0 | 10 | 2 | {downstream,upstream} | {0.500719,0.499281} | | 0.495437 (2 rows) modmine-r9=# SELECT COUNT(*) FROM geneflankingregion; count -------- 455020 (1 row) modmine-r9=# explain analyse SELECT * FROM geneflankingregion WHERE distance = '10.0kb' AND direction = 'upstream'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Seq Scan on geneflankingregion (cost=0.00..15507.30 rows=45115 width=213) (actual time=0.053..181.764 rows=45502 loops=1) Filter: ((distance = '10.0kb'::text) AND (direction = 'upstream'::text)) Total runtime: 227.245 ms (3 rows) modmine-r9=# explain analyse SELECT * FROM geneflankingregion WHERE LOWER(distance) = '10.0kb' AND LOWER(direction) = 'upstream'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on geneflankingregion (cost=66.95..88.77 rows=11 width=213) (actual time=207.555..357.359 rows=45502 loops=1) Recheck Cond: ((lower(distance) = '10.0kb'::text) AND (lower(direction) = 'upstream'::text)) -> BitmapAnd (cost=66.95..66.95 rows=11 width=0) (actual time=205.978..205.978 rows=0 loops=1) -> Bitmap Index Scan on geneflankingregion__distance_equals (cost=0.00..31.34 rows=2275 width=0) (actual time=79.380..79.380 rows=91004 loops=1) Index Cond: (lower(distance) = '10.0kb'::text) -> Bitmap Index Scan on geneflankingregion__direction_equals (cost=0.00..35.35 rows=2275 width=0) (actual time=124.639..124.639 rows=227510 loops=1) Index Cond: (lower(direction) = 'upstream'::text) Total runtime: 401.740 ms (8 rows) When I wrap the fields in the constraints in a LOWER() function, the planner stops looking at the statistics and makes a wild guess, even though it is very obvious from just looking what the result should be. Embedded in a much larger query, the inaccuracy in the number of rows (11 instead of 45502) causes major planning problems. Also, why does the BitmapAnd say zero actual rows? I understand this probably isn't Priority No. 1, and there are some interesting corner cases when n_distinct is higher than the histogram width, but would it be possible to fix this one up? Matthew -- I would like to think that in this day and age people would know better than to open executables in an e-mail. I'd also like to be able to flap my arms and fly to the moon. -- Tim Mullen
pgsql-performance by date: