Re: question about index cost estimates - Mailing list pgsql-hackers
From: Jeff Hoffmann
Subject: Re: question about index cost estimates
Msg-id: 39237277.F22C4C12@propertykey.com
In response to: question about index cost estimates (Jeff Hoffmann <jeff@propertykey.com>)
Responses: Re: question about index cost estimates
List: pgsql-hackers
Tom Lane wrote:
>
> Jeff Hoffmann <jeff@propertykey.com> writes:
> > i understand that, but still, if you have a tuple that's, say, 144 bytes,
> > you're going to have a higher chance of revisiting the same page than if
> > the tuple is 4k.
>
> Are you?  If you are pulling ten tuples from a thousand-page relation,
> seems to me the odds of revisiting any one page are somewhere around
> 0.01 (too lazy to do the exact combinatorics right now).

well, for something like 10 tuples out of, say, 50,000 - 100,000 records, that
probably wouldn't come into play, and you'd almost certainly be talking 1 page
access per tuple.  i was thinking it would come into play at higher
selectivities.  actually, i don't know if i'm thinking through the
probabilities and formulas required as much as just brainstorming what things
_might_ be coming into play.

> Doesn't really matter what the average tuple size is --- or if you prefer,
> we already accounted for that because we are looking at target tuples
> divided by total pages rather than target tuples divided by total tuples.

i thought about that after i sent my message.  i find that a lot of times,
i'll answer my own question roughly 15 seconds after sending the message.  i'm
still trying to figure out how that happens.

> > i'm still taking everything with a grain of salt until i can
> > explain it, though.
>
> Good man.  I don't think anyone but me has looked at this stuff in a
> long while, and it could use review.

i can tell you that the more i look at things and start understanding the
logic, the more it seems that the numbers are fairly reasonable if you don't
or can't know a lot of the factors like caching or clustering.  unfortunately,
those factors can make a lot of difference in actual results.  i was looking
at the cost estimate of an rtree index scan, but it turns out that when you
add in the page accesses for the actual tuples, it's rare to see an index scan
cost that matters to the total cost for reasonably large tables.  because of
that, i'm not sure what the value of fixing the rtree index scan cost
estimator would be, other than making it semantically correct.  it seems that
in the big picture, we're subject to the whims of the disk cache more than
anything.

i think i'm starting to see what you said about fixing selectivity being more
important, although i'm still not convinced that making the selectivity more
accurate is going to come close to solving the problem.  the selectivity
problem is better defined, though, and therefore easier to fix than the other
problems in cost estimation.  i'm assuming you're considering adding some sort
of histogram to the stats that vacuum collects.  is that something that you're
seriously considering, or do you have another plan to introduce a useful
concept of the distribution of an attribute?  the concept of making a
histogram on a b-tree is pretty simple, but if you're going to do it, you
should probably be taking into account r-trees, where you'd need a 2-d
histogram, making the job just a bit tougher.

jeff
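
p.s. for concreteness, here's the page-revisit combinatorics as a standalone
C sketch.  it's just the standard occupancy approximation, expected distinct
pages = P * (1 - (1 - 1/P)^k), not anything lifted from the planner, and the
names are made up:

#include <math.h>
#include <stdio.h>

/*
 * Expected number of distinct heap pages touched when k tuples are
 * fetched at uniformly random locations in a relation of P pages:
 * E[pages] = P * (1 - (1 - 1/P)^k).  The shortfall k - E[pages] is
 * the expected number of repeat visits.  Standard occupancy
 * approximation, not the planner's actual code.
 */
static double
expected_pages_fetched(double k, double P)
{
    return P * (1.0 - pow(1.0 - 1.0 / P, k));
}

int
main(void)
{
    double      k = 10.0;
    double      P = 1000.0;
    double      pages = expected_pages_fetched(k, P);

    /* prints ~9.955 distinct pages, i.e. only ~0.045 expected revisits */
    printf("distinct pages = %.3f, expected revisits = %.3f\n",
           pages, k - pages);
    return 0;
}

for 10 tuples from 1000 pages that works out to roughly a 4% chance of any
revisit at all, so small enough to ignore, which squares with your ballpark.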
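
p.p.s. on the histogram idea, this is roughly what i have in mind for the
b-tree (1-d) case: an equi-depth histogram where vacuum would store nbins + 1
boundary values, and selectivity for "attr < c" is interpolated inside the bin
containing c.  all the names here are hypothetical, it's a sketch of the
concept, not a patch:

#include <stdio.h>

/*
 * Sketch of equi-depth histogram selectivity for "attr < c".
 * bounds[] holds nbins + 1 boundary values dividing the column into
 * nbins bins of (roughly) equal tuple counts; locate the bin
 * containing c and interpolate linearly within it.  Hypothetical
 * names, not anything in the current tree.
 */
static double
histogram_selectivity(const double *bounds, int nbins, double c)
{
    int         i;

    if (c <= bounds[0])
        return 0.0;
    if (c >= bounds[nbins])
        return 1.0;
    for (i = 0; i < nbins; i++)
    {
        if (c < bounds[i + 1])
        {
            double      frac = (c - bounds[i]) / (bounds[i + 1] - bounds[i]);

            return ((double) i + frac) / (double) nbins;
        }
    }
    return 1.0;                 /* not reached */
}

int
main(void)
{
    /* 4 equal-population bins over a skewed distribution */
    double      bounds[] = {0.0, 1.0, 2.0, 5.0, 100.0};

    /* first three bins plus half of bin 4: (3 + 0.5) / 4 = 0.875 */
    printf("sel(attr < 52.5) = %.3f\n",
           histogram_selectivity(bounds, 4, 52.5));
    return 0;
}

the 2-d r-tree version would want a grid of cell counts instead of a single
sorted boundary list, which is where it gets tougher.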