Thread: Use "average field correlation per hard disk page" instead of global one?
Hello. I have a table of 2'500'000 tuples and 100'000 pages, and an index on non-unique field, to each key value corresponds approximately 50'000 tuples. Due to the updating algorithm the physical order of tuples in the table happens to be such that all equal keys are placed together, but not ordered globally. Correlation computed by "VACUUM ANALYZE" is 0.15. When computing indexscan cost for query with clause "key = ?" the planner makes it closer to "Mackert and Lohman formula" value than to "selectivity * pages". As a result it chooses seqscan rather than indexscan while in fact indexscan is 20 times faster. The question is, which is the best way to correct this behavior? Maybe "VACUUM ANALYZE" could calculate some average of "field correlation per page" and even use this value somewhere inside (not outside) "Mackert and Lohman formula"? Are there any better ideas?
Alexey Nalbat <alexey@price.ru> writes: > Due to the updating algorithm the physical order of tuples in the > table happens to be such that all equal keys are placed together, > but not ordered globally. Hmm... this is of course a weak spot of the correlation-based estimation method. If you were doing a range query then the computed correlation might have some bearing on the cost, but when probing for a single key value, your table will behave much differently than the correlation model can guess. > Are there any better ideas? None at the moment, but I'm open to suggestions. It seems like we might need different stats for equality probes than range probes. regards, tom lane
At 04:08 PM 10/03/2004, Tom Lane wrote: >None at the moment, but I'm open to suggestions. It seems like we might >need different stats for equality probes than range probes. What about my suggestion from August 2000: "There might be a way to side-step the issue here. I assume that the index nodes contain a pointer to a record ina file, which has some kind of file position. By comparing the file positions on one leaf node, and then averagingthe node cluster values, you might be able to get a pretty good idea of the *real* clustering." I don't use the CLUSTER command, but I have clustered data and would like to be able to take advantage of the fact if possible. *If* the record pointers can be used to indicate closeness, then the same approach of randomly sampling index nodes would seem to work. Then again, maybe I don't know enough about the storage techniques... ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 03 5330 3172 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp.mit.edu:11371 |/