Re: Bad query optimizer misestimation because of TOAST - Mailing list pgsql-performance
| From | Tom Lane |
|---|---|
| Subject | Re: Bad query optimizer misestimation because of TOAST |
| Date | |
| Msg-id | 9501.1107388153@sss.pgh.pa.us |
| In response to | Re: Bad query optimizer misestimation because of TOAST (Markus Schaber <schabios@logi-track.com>) |
| Responses | Re: Bad query optimizer misestimation because of TOAST; Re: Bad query optimizer misestimation because of TOAST |
| List | pgsql-performance |
Markus Schaber <schabios@logi-track.com> writes:
> Tom Lane schrieb:
>> I don't buy this analysis at all. The toasted columns are not those in
>> the index (because we don't support out-of-line-toasted index entries),
>> so a WHERE clause that only touches indexed columns isn't going to need
>> to fetch anything from the toast table.

> The column is a PostGIS column, and the index was created using GIST.
> Those are lossy indices that do not store the whole geometry, but only
> the bounding box corners of the Geometry (2 Points).

> Without using the index, the && Operator (which tests for bbox
> overlapping) has to load the whole geometry from disk, and extract the
> bbox therein (as it cannot make use of partial fetch).

Ah, I see; I forgot to consider the GIST "storage" option, which allows
the index contents to be something different from the represented column.
Hmm ...

What I would be inclined to do is to extend ANALYZE to make an estimate
of the extent of toasting of every toastable column, and then modify
cost_qual_eval to charge a nonzero cost for evaluation of Vars that are
potentially toasted. This implies an initdb-forcing change in
pg_statistic, which might or might not be allowed for 8.1 ... we are
still a bit up in the air on what our release policy will be for 8.1.

My first thought is that the stat ANALYZE ought to collect is the
"average number of out-of-line TOAST chunks per value". Armed with that
number and size information about the TOAST table, it'd be relatively
simple for costsize.c to estimate the average cost of fetching such
values.

I'm not sure it's worth trying to model the cost of decompressing
compressed values. Surely that's a lot cheaper than fetching out-of-line
values, so maybe we can just ignore it. If we did want to model it, then
we'd also need to make ANALYZE note the fraction of values that require
decompression, and maybe something about their sizes.

This approach would overcharge for operations that are able to work with
partially fetched values, but it's probably not reasonable to expect the
planner to account for that with any accuracy.

Given this, we'd have a pretty accurate computation of the true cost of
the seqscan alternative, but what of indexscans? The current
implementation charges one evaluation of the index qual(s) per
indexscan, which is not really right, because the index component is
actually never evaluated at all. This didn't matter when the index
component was a Var with zero eval cost, but if we're charging some eval
cost, it might. Still, since it's charging only one eval per scan, the
error is probably down in the noise in practice, and it may not be worth
trying to get it exactly right.

A bigger concern is: what about lossy indexes? We currently ignore the
cost of rechecking qual expressions for fetched rows, but this might be
too inaccurate for situations like yours. I'm hesitant to mess with it,
though. For one thing, to get it right we'd need to know how many rows
will be returned by the raw index search (which is the number of times
we'd need to recheck). At the moment the only info we have is the number
that will pass the recheck, which could be a lot less ... and of course,
even that is probably a really crude estimate when we are dealing with
this sort of operator. Seems like a bit of a can of worms ...

			regards, tom lane
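To make Markus's point about the lossy GiST index concrete: the bbox
overlap test that && performs is trivially cheap once the two boxes are
in hand; the expense lies in detoasting the full geometry to derive a
bbox when the index is not used. A minimal sketch in C (the `BBox`
struct and `bbox_overlaps` are illustrative names, not PostGIS's actual
types):

```c
#include <stdio.h>
#include <stdbool.h>

/*
 * Illustrative only -- not PostGIS source.  A GiST index built with a
 * "storage" type keeps just four doubles per geometry, so the && (bbox
 * overlap) test never needs the full, possibly-toasted value.
 */
typedef struct BBox
{
    double xmin, ymin;
    double xmax, ymax;
} BBox;

/* The bbox-overlap test that && performs: cheap once both boxes exist. */
static bool
bbox_overlaps(const BBox *a, const BBox *b)
{
    return a->xmin <= b->xmax && b->xmin <= a->xmax &&
           a->ymin <= b->ymax && b->ymin <= a->ymax;
}

int
main(void)
{
    BBox a = {0.0, 0.0, 10.0, 10.0};
    BBox b = {5.0, 5.0, 20.0, 20.0};

    /* In a seqscan, building these boxes means detoasting each geometry;
     * in an indexscan they come straight from the GiST index tuples. */
    printf("overlap: %s\n", bbox_overlaps(&a, &b) ? "yes" : "no");
    return 0;
}
```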
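The "average number of out-of-line TOAST chunks per value" proposal can
also be made concrete. Below is a rough, self-contained sketch of the
arithmetic costsize.c could apply; the function and constant names are
hypothetical. It leans on two facts about stock PostgreSQL: TOAST chunks
are roughly 2KB (four to an 8KB page), and the planner's default
random_page_cost is 4.0.

```c
#include <stdio.h>

/*
 * Hypothetical sketch, not PostgreSQL source.  Given a per-column
 * statistic "average number of out-of-line TOAST chunks per value"
 * (which ANALYZE would have to start collecting), estimate a per-tuple
 * cost for evaluating a Var of that column, in the planner's usual
 * units (one sequential page fetch = 1.0).
 */
#define RANDOM_PAGE_COST   4.0   /* planner default random_page_cost */
#define CHUNKS_PER_PAGE    4.0   /* ~2KB TOAST chunks in 8KB pages */

static double
toasted_var_cost(double avg_chunks_per_value)
{
    double pages;

    if (avg_chunks_per_value <= 0.0)
        return 0.0;             /* value is rarely stored out of line */

    /*
     * Chunks are looked up through the toast table's index, so treat
     * each page visited as a random fetch, with a minimum of one page
     * per toasted value.
     */
    pages = avg_chunks_per_value / CHUNKS_PER_PAGE;
    if (pages < 1.0)
        pages = 1.0;
    return pages * RANDOM_PAGE_COST;
}

int
main(void)
{
    /* A large PostGIS geometry averaging 40 out-of-line chunks (~80KB)
     * would add roughly this much per tuple to the seqscan qual cost. */
    printf("per-tuple cost: %.1f\n", toasted_var_cost(40.0));
    return 0;
}
```

Multiplied by the table's row count, a charge like this is what would
let the planner see that a seqscan evaluating && over fully toasted
geometries costs far more than the GiST indexscan.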
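The lossy-index "can of worms" can likewise be stated numerically. A
hedged sketch, assuming we somehow had an estimate of the raw
index-return count, which, as noted above, the planner does not have
today (all figures below are invented for illustration):

```c
#include <stdio.h>

/*
 * Illustrative arithmetic only.  For a lossy index, every row the raw
 * index search returns must be rechecked against the real operator,
 * which for a toasted PostGIS column means detoasting the geometry.
 * The planner currently knows only the post-recheck row count; costing
 * the recheck correctly needs the raw index-return count, which can be
 * much larger.
 */
static double
recheck_cost(double rows_to_recheck, double per_row_eval_cost)
{
    return rows_to_recheck * per_row_eval_cost;
}

int
main(void)
{
    double rows_after_recheck  = 1000.0;    /* all the planner has now */
    double rows_from_raw_index = 25000.0;   /* what it would need */
    double per_row_eval_cost   = 40.0;      /* e.g. detoast + && test */

    printf("charged today: %.0f  actually incurred: %.0f\n",
           recheck_cost(rows_after_recheck, per_row_eval_cost),
           recheck_cost(rows_from_raw_index, per_row_eval_cost));
    return 0;
}
```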