Re: Planner estimates cost of 'like' a lot lower than '='?? - Mailing list pgsql-general
From: Mats Lofkvist
Subject: Re: Planner estimates cost of 'like' a lot lower than '='??
Msg-id: 20010722183643.20949.qmail@kairos.algonet.se
In response to: Re: Planner estimates cost of 'like' a lot lower than '='?? (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Planner estimates cost of 'like' a lot lower than '='??
           Re: Planner estimates cost of 'like' a lot lower than '='??
List: pgsql-general
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Mats Lofkvist <mal@algonet.se> writes:
> > Fiddling around revealed that the problem is that the cost of
> > 'like' is severely underestimated, making the database use
> > the wrong index.
>
> Not cost --- selectivity.  How many rows actually match the criterion
>     WHERE find0.key2 = 'llll'
> ?  How about
>     WHERE find0.value like 't10k__'

There are 11004 rows matching key2 = 'llll' and 90 rows matching
value like 't10k__' (all 90 have key2 = 'llll').

> It would appear from your timings that the latter is not very
> selective at all, whereas the former is quite selective.

Did you mean the other way around?

> However, given the limitations of the planner's statistical routines,
> I wouldn't be too surprised if it makes the opposite guess in 7.1 and
> before.  Notice the difference between the estimated row counts and
> reality in your simplified test :-(.
>
> The speed differential in your join almost certainly has nothing to do
> with the execution time of a single '=' or 'like' operator, and
> everything to do with the number of rows coming out of the first-stage
> index scan.  So if the planner guesses wrong about which index is more
> selective for the query, it will choose a bad plan.
>
> How large is your dataset?  Would you be willing to build a trial
> installation of current sources, and see if the 7.2-to-be planner does
> any better?  We've done some major overhauling of the statistical code
> since 7.1, and I'm curious to see its results in the field.  See our
> CVS server, or the nightly snapshot tarball at
> http://www.ca.postgresql.org/ftpsite/dev/

The 'data' table contains 162135 rows, the 'test2' table contains
118113 rows (the latter is a subset of the data.value column).

(I'm downloading the CVS tree right now. Do I need to do dump/restore
or can I just start it on the current data?)

> Also: the overly large rows estimate for "where value = 't10k9999'" is
> most likely caused by having some one extremely common value in the
> column.  (In 7.1 and before, the most common value is the *only*
> statistic the planner has, and so a common MCV drives it to assume
> that there are only a few distinct values in the column.)  Often the
> most common value is actually a dummy value, like an empty string.
> If you have a lot of dummies, consider whether you can't replace them
> with NULL.  7.1's VACUUM ANALYZE does distinguish NULLs from real
> values, so this hack can help it derive somewhat less bogus stats.

Yes, there are very common values, but none can be considered dummies
(i.e. they can't be replaced by null in a production database).
data.key0 is an object id, data.key1 and data.key2 are a two-part
object member name, and data.longValue is the member value.  data.value
is data.longValue truncated, to make it possible to index a prefix of
data.longValue with databases not supporting this explicitly.  When
running the tests, data contained ~11k objects, each having about a
dozen members, some with unique values but many with common values.

Maybe I'm mistaken in assuming that the simplified test points at the
problem with the 'real' test, but aren't cost estimates comparable
between two different explains?  If they should be, I still don't
understand how "where value = 'xxx'" can be estimated to return 600
times more rows than "where value like 'xxx%'" (this is what happens
in my simplified test).

> 			regards, tom lane

thanks for the reply,

      _
Mats Lofkvist
mal@algonet.se
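A minimal sketch of the estimate-versus-reality check discussed above, assuming find0 in Tom's question refers to the data table described in the reply (the actual join in the thread is more involved):

    -- Actual selectivity of the two predicates (counts reported in the reply)
    SELECT count(*) FROM data WHERE key2 = 'llll';        -- 11004 rows
    SELECT count(*) FROM data WHERE value LIKE 't10k__';  --    90 rows

    -- The planner's guesses: compare the rows= figure in the EXPLAIN
    -- output against the counts above.
    EXPLAIN SELECT * FROM data WHERE key2 = 'llll';
    EXPLAIN SELECT * FROM data WHERE value LIKE 't10k__';

    -- The statistics behind those guesses come from VACUUM ANALYZE;
    -- rerun it before comparing so the estimates reflect current data.
    VACUUM ANALYZE data;

If the rows= estimate for the '=' predicate comes out far above its actual count while the LIKE estimate does not, that reproduces the misestimate described in the messages above.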