Thread: Cost estimates for parameterized paths
Awhile back I ranted about replacing the planner's concept of inner indexscans with a more generalized notion of "parameterized paths":
http://archives.postgresql.org/pgsql-hackers/2009-10/msg00994.php

The executor fixes for that are done, and now I'm grappling with getting the planner to do something useful with it. The biggest problem I've run into is that a parameterized path can't really be assigned a fixed cost in the same way that a normal path can. The current implementation of cost_index() depends on knowing the size of the outer relation --- that is, the expected number of execution loops for the indexscan --- in order to account for cache effects sanely while estimating the average cost of any one inner indexscan. We know that that is an important thing to do because the cost estimates seem to be a lot closer to reality now that we do that than what we were getting before; so dropping the consideration is entirely out of the question.

The planner is already cheating on this to a considerable extent, because it estimates the cost of an inner indexscan only once, using the first outer rel we try to join to. That cost is cached and reused with other potential outer-rel join partners, even though very different numbers of outer rows might be involved. This problem will get a lot worse with the types of plans that I hope the planner will be able to come up with after this fix goes in, because the indexscan might be at the bottom of a join nest. So we need a real fix, not another hack.

The best idea I can come up with at the moment is to compute "best case" and "worst case" costs for a parameterized path, corresponding to the largest and smallest numbers of loops we might expect it to be repeated for. The largest number of loops could be estimated via the cartesian product of the sizes of the other tables in the query, for example. The "worst case" cost is its cost if only repeated once.
Then, during path pruning in add_path, we only discard a parameterized path if its best-case cost is worse than the worst-case cost of some otherwise comparable path. Whenever we join the parameterized path with the required outer relation, we redo the cost calculation using that rel's actual rowcount estimate in order to form a final cost estimate for the no-longer-parameterized join path.

While this looks like it would work in principle, I'm concerned that it would be unable to prune very many parameterized paths, and thus that planning time might run unacceptably long. The repeated cost calculations aren't much fun either, although we could probably cache most of that work if we're willing to throw memory at the problem.

I wonder if anyone's got a better idea ...

regards, tom lane
On Thu, Sep 2, 2010 at 5:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The best idea I can come up with at the moment is to compute "best case"
> and "worst case" costs for a parameterized path, corresponding to the
> largest and smallest numbers of loops we might expect it to be repeated
> for. [...] Then, during path pruning in add_path, we only discard a
> parameterized path if its best-case cost is worse than the worst-case
> cost of some otherwise comparable path. Whenever we join the
> parameterized path with the required outer relation, we redo the cost
> calculation using that rel's actual rowcount estimate in order to form
> a final cost estimate for the no-longer-parameterized join path.

Interestingly, I previously proposed almost exactly this approach to handle a couple of other problems:

http://archives.postgresql.org/pgsql-hackers/2009-09/msg01379.php (index only scans)
http://archives.postgresql.org/pgsql-hackers/2009-10/msg00125.php (per-query work mem)

I'm not entirely sure whether we can use this approach for more than one kind of problem at a time; if we can't, it's probably not a good idea to do it at all. I also fear that any venture into this area will involve slowing down add_path(), which is a hotspot when planning large join nests. That might not be a win, if this is the only case it allows us to improve.

*thinks*

It strikes me that the outer tuple count is essentially being used to derate the cost of the index probes. So another way to look at this is that the best-case cost of an index probe is just the CPU cost, and the worst-case cost of an index probe is the CPU cost plus the disk cost. So your fear that not many parameterized paths will be discarded is probably valid.

But then, maybe that's OK. It seems to me that the critical point is to make sure that we don't form them in the first place in cases where we could get the same benefit by commuting the joins. Once we've gotten to the point of considering a plan of this type, the chances that it's actually the best plan seem pretty high.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Sep 2, 2010 at 5:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The best idea I can come up with at the moment is to compute "best case"
>> and "worst case" costs for a parameterized path,

> Interestingly, I previously proposed almost exactly this approach to
> handle a couple of other problems:

I thought it seemed familiar ;-)

> I'm not entirely sure whether we can use this approach for more than
> one kind of problem at a time; if we can't, it's probably not a good
> idea to do it at all.

I don't see why we couldn't do it in principle. The issue of course is how many paths survive, and can we tolerate that from a planner performance standpoint?

On reflection I think that for parameterized paths the problem won't be too bad, because (a) we'll ignore parameterized paths except when considering a join to the right outer rel, so their presence in the rel's pathlist won't cost much otherwise, and (b) we will only generate them when there's a suitable joinclause, so the number of potential paths will be limited. I am not sure those statements can be made for the other applications you suggested, though.

There's probably not much we can do at this point except code up the idea and see how well it works.

regards, tom lane
On Fri, Sep 3, 2010 at 2:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> On reflection I think that for parameterized paths the problem won't be
> too bad, because (a) we'll ignore parameterized paths except when
> considering a join to the right outer rel, so their presence in the
> rel's pathlist won't cost much otherwise,

Hmm. Maybe they should go into a separate path list, and perhaps we could do the min/max calculations only with that pathlist (at least for now), thus avoiding taking a generalized penalty to handle this specific case.

IIUC, a parameterized path should never cause an unparameterized path to be thrown out, even if the unparameterized path costs more than the worst-case cost for the parameterized path, because the parameterized path constrains the possible join strategies higher up, so what looked like a great idea at first blush might turn out to suck when the chips are down. Then, too, I'm not sure we can even guarantee it will always be possible to form a valid plan around a given parameterized path, which is another good reason not to discard any unparameterized alternatives that may exist.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Sep 3, 2010 at 2:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> On reflection I think that for parameterized paths the problem won't be
>> too bad, because (a) we'll ignore parameterized paths except when
>> considering a join to the right outer rel, so their presence in the
>> rel's pathlist won't cost much otherwise,

> Hmm. Maybe they should go into a separate path list, and perhaps we
> could do the min/max calculations only with that pathlist (at least
> for now), thus avoiding taking a generalized penalty to handle this
> specific case. IIUC, a parameterized path should never cause an
> unparameterized path to be thrown out,

Yeah, but the converse isn't true. I had considered the idea of keeping parameterized paths in a separate list, but you'd still have to examine the main list to look for unparameterized paths that might dominate them.

regards, tom lane
On Fri, Sep 3, 2010 at 5:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Hmm. Maybe they should go into a separate path list, and perhaps we
>> could do the min/max calculations only with that pathlist (at least
>> for now), thus avoiding taking a generalized penalty to handle this
>> specific case. IIUC, a parameterized path should never cause an
>> unparameterized path to be thrown out,
>
> Yeah, but the converse isn't true. I had considered the idea of keeping
> parameterized paths in a separate list, but you'd still have to examine
> the main list to look for unparameterized paths that might dominate them.

Definitely true, but if it avoids slowing down add_path() in the common case, it's worth it.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
More than a year ago, I wrote in
http://archives.postgresql.org/message-id/14624.1283463072@sss.pgh.pa.us

> Awhile back I ranted about replacing the planner's concept of inner
> indexscans with a more generalized notion of "parameterized paths":
> http://archives.postgresql.org/pgsql-hackers/2009-10/msg00994.php

> The executor fixes for that are done, and now I'm grappling with getting
> the planner to do something useful with it. The biggest problem I've run
> into is that a parameterized path can't really be assigned a fixed cost
> in the same way that a normal path can. The current implementation of
> cost_index() depends on knowing the size of the outer relation --- that
> is, the expected number of execution loops for the indexscan --- in order
> to account for cache effects sanely while estimating the average cost of
> any one inner indexscan.

Since this project has been stalled for so long, I am thinking that what I need to do to make some progress is to punt on the repeated-execution cache effects problem, at least for the first cut. I propose costing parameterized inner paths on the "worst case" basis that they're only executed once, and don't get any benefit from caching across repeated executions. This seems like a reasonably sane first-order approximation on two grounds:

1. In most of the cases where such a plan is of interest, the outer relation for the nestloop actually does provide only one or a few rows. If it generates a lot of rows, you probably don't want a nestloop anyhow.

2. In the cases where we really care, the alternatives are so much worse that the parameterized nestloop will win even if it's estimated very conservatively.

Another thing in the back of my mind is that the whole issue of cache effects is something we know we don't model very well, so putting large amounts of time into enlarging the present approach to handle more complicated plan structures may be misplaced effort anyway.

So unless somebody's got a better idea, I'm going to push forward with getting the parameterized-path infrastructure in place, and just settle for a crude cost model for the moment. Perhaps the implementation experience will shake loose some new ideas about how to do the cost modeling.

regards, tom lane
On Wed, Nov 9, 2011 at 5:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Since this project has been stalled for so long, I am thinking that what
> I need to do to make some progress is to punt on the repeated-execution
> cache effects problem, at least for the first cut. I propose costing
> parameterized inner paths on the "worst case" basis that they're only
> executed once, and don't get any benefit from caching across repeated
> executions. [...]

I agree, on all counts. Errors that make new planner possibilities look unduly expensive aren't as serious as those that go the other way, because the alternative is that you can never generate the new plan at all. That's why I'm sweating a bit about the costing of index-only scans.

> Another thing in the back of my mind is that the whole issue of cache
> effects is something we know we don't model very well, so putting large
> amounts of time into enlarging the present approach to handle more
> complicated plan structures may be misplaced effort anyway.

True. And I think that might not even be the highest-priority project to tackle anyway. The things that are hurting people most routinely, and that are hardest to fix with existing tools, seem to be things like cross-column correlation and other selectivity estimation errors.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company