Re: Parameterized-path cost comparisons need some work - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Parameterized-path cost comparisons need some work |
Date | |
Msg-id | CA+Tgmob8OGcdA1vYzFJP-SWoiyN16rE-i6wUVSALTSF=XQQw5g@mail.gmail.com Whole thread Raw |
In response to | Parameterized-path cost comparisons need some work (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Parameterized-path cost comparisons need some work
|
List | pgsql-hackers |
On Tue, Feb 28, 2012 at 5:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The most obvious thing to do about this is to consider that one path can > dominate another on cost only if it is both cheaper *and* produces no > more rows. Hmm. Are you sure that's the right rule? I am having trouble constructing an example, but I feel like there might be cases where it's possible to have path A, not parameterized, path B, parameterized by R, and path C, parameterized by S, and maybe also path D, parameterized by both R and S. In that case, I feel like paths B and C are incomparable. Even if one is cheaper than the other and also produces fewer rows, they're not the same rows; so the expense of a subsequent join might be different with one vs. the other. Even in the simple case where there's only one possible parameterization, it seems very difficult to compare the cost of a parameterized path A with the cost of an unparameterized path B. No matter how much cheaper A is than B, the eventual nested loop join might be so inefficient as to completely wipe A out. On the flip side, even if A is more expensive than B, it's nearly always going to produce at least somewhat fewer rows. There are degenerate cases where that might not be true, like a parameterized join where the table we're index-scanning has only one value in the paramaterized column, but presumably that's rare. So it may be worth keeping A around just because the subsequent join could turn out to be much cheaper with the smaller row count. Thus it seems unlikely that we'll be able to conclude that either A or B is categorically superior. If you accept that reasoning, then it seems like there's little if any point in ever comparing a parameterized path to a non-parameterized path; or of comparing two differently-parameterized paths against each other. You basically need a separate bucket for each group of paths, where they compete against each other but not against differently-parameterized paths; and then after you form the nested loop, all the paths that are now parameterized the same way (but weren't before) can finally go head-to-head against each other. I almost wonder if the bottom-up approach is the wrong way to tackle this entirely. Suppose we're trying to build paths for {A B}. We first build unparameterized paths for {A} and {B} and compute join paths for {A B}. Now we know the cheapest way to build {A B} without using parameterization, so we can make some deductions about a plan of the form: Nested Loop -> A -> B (parameterized by A) In particular, if we take the best cost either overall or for a path with the pathkeys that will be produced by this join, and divide by the number of rows for A, a parameterized path for B only makes sense if the total cost is less than that value. Now we have a meaningful bound that we can use to limit consideration of paths for B: anything that's more expensive than that bound should be chucked. If B is just a single rel that's not that interesting, but if B is a joinrel, then the bound applies individually to any subset of its members. So we start replanning B as a separate join problem, but any path for any baserel or joinrel that exceeds our cutoff cost gets chucked; and if any rel ends up with no workable paths, then we just forget the whole thing. Of course, the problem with this approach (aside from complexity) is that you might end up planning the same subproblem multiple times with different cost bounds. You could cache the results of previous computations so that you only replan it when the cost bound goes up, but that's still pretty awful. So maybe this is unworkable. But the current approach seems problematic, too, because you don't really know enough to throw very much away at the time that you need to make that decision. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: