Thread: Use "average field correlation per hard disk page" instead of global one?

Use "average field correlation per hard disk page" instead of global one?

From

Alexey Nalbat

Date:

09 March 2004, 20:02:14

Hello.

I have a table of 2'500'000 tuples and 100'000 pages, and an index
on non-unique field, to each key value corresponds approximately
50'000 tuples.

Due to the updating algorithm the physical order of tuples in the
table happens to be such that all equal keys are placed together,
but not ordered globally. Correlation computed by "VACUUM ANALYZE"
is 0.15.

When computing indexscan cost for query with clause "key = ?"
the planner makes it closer to "Mackert and Lohman formula" value
than to "selectivity * pages". As a result it chooses seqscan
rather than indexscan while in fact indexscan is 20 times faster.

The question is, which is the best way to correct this behavior?

Maybe "VACUUM ANALYZE" could calculate some average of "field
correlation per page" and even use this value somewhere inside
(not outside) "Mackert and Lohman formula"?

Are there any better ideas?

Re: Use "average field correlation per hard disk page" instead of global one?

From

Tom Lane

Date:

10 March 2004, 01:08:28

Alexey Nalbat <alexey@price.ru> writes:
> Due to the updating algorithm the physical order of tuples in the
> table happens to be such that all equal keys are placed together,
> but not ordered globally.

Hmm... this is of course a weak spot of the correlation-based estimation
method.  If you were doing a range query then the computed correlation
might have some bearing on the cost, but when probing for a single key
value, your table will behave much differently than the correlation model
can guess.

> Are there any better ideas?

None at the moment, but I'm open to suggestions.  It seems like we might
need different stats for equality probes than range probes.
        regards, tom lane

Re: Use "average field correlation per hard disk

From

Philip Warner

Date:

10 March 2004, 02:02:59

At 04:08 PM 10/03/2004, Tom Lane wrote:
>None at the moment, but I'm open to suggestions.  It seems like we might
>need different stats for equality probes than range probes.

What about my suggestion from August 2000:
    "There might be a way to side-step the issue here. I assume that    the index nodes contain a pointer to a record
ina file, which    has some kind of file position. By comparing the file positions    on one leaf node, and then
averagingthe node cluster values,    you might be able to get a pretty good idea of the *real* clustering."
 

I don't use the CLUSTER command, but I have clustered data and would like 
to be able to take advantage of the fact if possible. *If* the record 
pointers can be used to indicate closeness, then the same approach of 
randomly sampling index nodes would seem to work. Then again, maybe I don't 
know enough about the storage techniques...




----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 03 5330 3172          |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                 |    --________--
PGP key available upon request,  |  /
and from pgp.mit.edu:11371       |/