Re: More thoughts about planner's cost estimates - Mailing list pgsql-hackers
From: Mark Kirkwood
Subject: Re: More thoughts about planner's cost estimates
Msg-id: 447F9737.10702@paradise.net.nz
In response to: More thoughts about planner's cost estimates (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: More thoughts about planner's cost estimates
List: pgsql-hackers
Tom Lane wrote:

> Another thing that's bothering me is that the index access cost
> computation (in genericcostestimate) is looking sillier and sillier:
>
>     /*
>      * Estimate the number of index pages that will be retrieved.
>      *
>      * For all currently-supported index types, the first page of the
>      * index is a metadata page, and we should figure on fetching that
>      * plus a pro-rated fraction of the remaining pages.
>      */
>     if (index->pages > 1 && index->tuples > 0)
>     {
>         numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
>         numIndexPages += 1;        /* count the metapage too */
>         numIndexPages = ceil(numIndexPages);
>     }
>     else
>         numIndexPages = 1.0;
>
>     /*
>      * Compute the index access cost.
>      *
>      * Disk cost: our generic assumption is that the index pages will be
>      * read sequentially, so they have cost 1.0 each, not random_page_cost.
>      */
>     *indexTotalCost = numIndexPages;
>
> There are several things wrong with this, at least in its application
> to btrees.  It's not counting descent of the btree (ie, touches of the
> root page and intermediate-level pages).  On the other hand it's surely
> overcharging for metapage touches.  As of CVS tip we cache the metapage
> in the relcache, so it's probably fair to disregard that cost
> altogether.  And on the third hand, when we need to retrieve multiple
> leaf pages it's over-optimistic to assume that those'll be purely
> sequential fetches.  (That would be approximately true in a
> freshly-built btree, but not in one that's suffered any amount of page
> splitting or recycling.)
>
> I am inclined to think that a reasonable model is to continue to
> estimate the number of index leaf pages touched as above (pro-rating
> against the total number of index entries), to charge random_page_cost
> per leaf page touched, and not to count anything for metapage or tree
> descent.  I justify the latter on the grounds that upper tree levels
> will tend to stay in cache because they're touched so often.
> random_page_cost per leaf page may be an overestimate, since sometimes
> logically adjacent leaf pages will be physically adjacent too, but not
> charging for tree descent should help to cancel that out.  With this
> model, the disk cost to fetch a single index entry will be estimated as
> random_page_cost (default 4.0) rather than the current fixed 2.0.  This
> shouldn't hurt things too much for simple indexscans --- especially
> since anyone running with a reduced random_page_cost won't see as much
> change.  And it will increase the costs for bitmap scans that cross
> many index pages, which is what we need in light of Philippe's numbers.

This sounds good to me, although the 2.0 -> 4.0 cost jump may cause
(more) cases where plans that used an index scan in pre-8.2 versions
switch to some other access type in 8.2. I guess a note about testing
existing applications could be placed in the 8.2 release notes?

> Now we have seen a lot of cases in which indexscans were being
> drastically overestimated, so increasing the cost estimate even more
> may seem like a horrid idea.  But I believe that most or all of these
> cases were ones where the same index was being visited many times, and
> so the real estimation failure is not accounting for cache effects
> across multiple indexscans.  Rather than being afraid to raise the cost
> estimate for an indexscan in isolation, I think it's time to bite the
> bullet and do something about that.
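To check my understanding, under the proposed model the disk-cost part
of genericcostestimate would end up looking something like the sketch
below: leaf pages pro-rated as now, the metapage and tree-descent terms
dropped, and random_page_cost charged per leaf page. (Just a sketch,
not a patch; variable names follow the snippet quoted above, and
random_page_cost stands in for however the GUC value is read there.)

    /*
     * Sketch of the proposed disk-cost model: estimate leaf pages
     * touched by pro-rating against total index entries, ignore the
     * metapage (cached in the relcache) and the upper-level descent
     * (assumed to stay in cache), and charge random_page_cost per
     * leaf page instead of 1.0.
     */
    if (index->pages > 1 && index->tuples > 0)
    {
        /* index->pages - 1 excludes the metapage from the pro-rating */
        numIndexPages = ceil((numIndexTuples / index->tuples) *
                             (index->pages - 1));
    }
    else
        numIndexPages = 1.0;

    /* a single-entry fetch now costs random_page_cost, not a fixed 2.0 */
    *indexTotalCost = numIndexPages * random_page_cost;

With numIndexPages = 1 that reproduces the random_page_cost (default
4.0) figure you mention for fetching one index entry.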
> The big difficulty in modeling cache effects from multiple scans is
> that we usually don't know how many times the index will be scanned.
> If we did have that number, I think it'd be reasonable to use the
> Mackert and Lohman approximation already used in cost_index to
> estimate the total number of index leaf pages fetched over N scans,
> and then divide by N to get our per-scan estimated cost.  But because
> the planner builds cost estimates bottom-up, in most scenarios we
> don't have any idea whether an indexscan plan node will be iterated
> once or many times in the finished plan.  However there is one case
> where we do have some information: when computing a
> best_inner_indexscan() for a nestloop join, it'd be reasonable to use
> the estimated size of the outer relation as the loop count.  And this
> is exactly the case where we are falling down in practice: the
> complaints we've seen are mostly about overestimating the cost of a
> nestloop-with-inner-index-scan.  So I think handling just this case
> would be a big step forward, even if it's not a theoretically pure
> general solution.

If I understand correctly, this is the case where we normally see folks
needing to add several 'set enable_xxx=off' statements to get the
nested loop plan that *actually* works best :-). So this also looks
good!

Cheers

Mark
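P.S. For anyone who hasn't met it, the Mackert and Lohman approximation
mentioned above estimates how many distinct pages actually get fetched
when x tuple fetches are spread over a T-page relation with b buffer
pages of cache; that is what lets the cost of repeated inner indexscans
be amortized by dividing by the loop count. A rough self-contained
sketch of the published formula (my variable names, not the costsize.c
ones, and assuming T > 0):

    /*
     * Mackert & Lohman approximation: expected number of distinct pages
     * fetched when x tuple fetches hit a T-page relation through a
     * cache of b buffer pages.  The three cases follow the published
     * formula.  (Hypothetical helper, not the actual costsize.c code.)
     */
    static double
    pages_fetched_ml(double x, double T, double b)
    {
        if (T <= b)
        {
            /* relation fits in cache; can't fetch more than T pages */
            double pf = (2.0 * T * x) / (2.0 * T + x);

            return (pf <= T) ? pf : T;
        }
        else
        {
            /* cache smaller than relation: growth changes at x = lim */
            double lim = (2.0 * T * b) / (2.0 * T - b);

            if (x <= lim)
                return (2.0 * T * x) / (2.0 * T + x);
            else
                return b + (x - lim) * (T - b) / T;
        }
    }

Following the suggestion above, a nestloop inner indexscan could then
be costed by calling this with x set to the index tuples fetched across
all N outer rows, and dividing the resulting page count by N to get the
per-scan figure.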