Re: SQL command speed - Mailing list pgsql-sql
| From | Kate Collins |
|---|---|
| Subject | Re: SQL command speed |
| Date | |
| Msg-id | 3925975A.B8880CCD@wsicorp.com |
| In response to | SQL command speed (Kate Collins <klcollins@wsicorp.com>) |
| Responses | Re: SQL command speed |
| List | pgsql-sql |
Tom,

Thank you for the explanation. You are correct in your assessment of the
distribution of the values in the item_a column. Some values return no rows
or only a few, while other values return many (100 or more). It is not an
even distribution. I had never really considered what effect this would have
on searching before.

Kate

Tom Lane wrote:

> Kate Collins <klcollins@wsicorp.com> writes:
> > The table I am using has 114600 total rows. The full query returns 1129
> > rows. Right now the table is static, i.e. I am not modifying it while
> > I am running these tests.
>
> > Here are the results of the EXPLAIN with the different numbers of OR's.
>
> > --- QUERY 1, returns 1129 rows ---
> > pbi=> [ 32-or variant ]
> > Seq Scan on notam_details (cost=0.00..13420.40 rows=26230 width=12)
>
> > --- QUERY 2, returns 11 rows ---
> > pbi=> [ just one WHERE clause ]
> > Index Scan using notam_details_item_a on notam_details
> > (cost=0.00..2739.57 rows=927 width=12)
>
> > --- QUERY 3, returns 11 rows ---
> > pbi=> [ 2 ORed clauses ]
> > Seq Scan on notam_details (cost=0.00..4820.90 rows=1847 width=12)
>
> OK, the problem here is the disconnect between the planner's estimate
> of the number of returned rows (the "rows" number in EXPLAIN) and the
> actual result row count. If the query actually did need to pull nearly
> a thousand rows for each OR'd key, then using a seqscan for more than a
> couple of ORs would make sense. But in reality you don't have nearly
> that many rows per OR key, so an indexscan is needed.
>
> That estimate is largely driven by a single statistic collected by
> VACUUM ANALYZE, which is the frequency of the most common value in
> the item_a column. If the MCV is pretty common, then the planner assumes
> that the column doesn't have very many distinct values, so you get a
> high estimate of the number of returned rows. Evidently you have
> a very common MCV for item_a, but the key values you are looking for
> aren't nearly that common.
>
> I have seen a number of cases where someone was using a dummy value
> (like 'N/A', or an empty string, etc.) to indicate unknown data, and
> there were so many of these entries as to not merely be the MCV,
> but to drive the MCV's frequency statistic far above the frequency of
> occurrence of any "real" value. This fools the planner into thinking
> that *all* the values are like that, and so it generates plans
> accordingly. The long-run solution is to gather more-detailed
> statistics, and that's on the TODO list for a future version. In the
> meantime, there is a workaround that you may be able to use: instead
> of a dummy value, store NULL for unknown entries. The statistics
> gatherer already accounts for NULLs separately, so a large fraction
> of NULLs won't fool the planner about the frequency of non-null values.
>
> Note: if you try this, be sure to re-run VACUUM ANALYZE after you
> replace the dummies with NULLs. The plans won't change until there's
> a more accurate statistical entry for your table...
>
> regards, tom lane

--
=================================================
Katherine (Kate) L. Collins
Senior Software Engineer/Meteorologist
Weather Services International (WSI Corporation)
4 Federal Street
Billerica, MA 01821
EMAIL: klcollins@wsicorp.com
PHONE: (978) 670-5110
FAX: (978) 670-5100
http://www.intellicast.com
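
As an aside on spotting the estimate-vs-actual disconnect Tom describes:
later PostgreSQL releases (this thread predates them) offer EXPLAIN ANALYZE,
which runs the query and reports the planner's estimated row count next to
the actual row count for each plan node. A minimal sketch, using hypothetical
key values rather than anything from the original data:

```sql
-- EXPLAIN ANALYZE executes the query and prints "rows=" (estimate) beside
-- "actual ... rows=" for each plan node, exposing misestimates like the
-- one in this thread. 'AAA' and 'BBB' are hypothetical keys.
EXPLAIN ANALYZE
SELECT *
  FROM notam_details
 WHERE item_a = 'AAA'
    OR item_a = 'BBB';
```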
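
And a minimal sketch of the workaround itself, assuming the dummy "unknown"
marker in item_a is the string 'N/A' (an assumed placeholder; substitute
whatever value the table actually uses). The pg_stats view in the first
query also postdates this thread but is available in current releases:

```sql
-- Inspect the gathered statistics for item_a. A most_common_freqs entry
-- far above any real value's frequency is the symptom Tom describes.
SELECT most_common_vals, most_common_freqs, null_frac
  FROM pg_stats
 WHERE tablename = 'notam_details'
   AND attname = 'item_a';

-- Replace the dummy marker with NULL so the statistics gatherer counts
-- those rows separately from real values. 'N/A' is an assumed placeholder.
UPDATE notam_details
   SET item_a = NULL
 WHERE item_a = 'N/A';

-- Refresh the statistics; plans won't change until the table's
-- statistical entry reflects the new NULL fraction.
VACUUM ANALYZE notam_details;
```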