Re: Improving non-joinable EXISTS subqueries - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Improving non-joinable EXISTS subqueries |
Date | |
Msg-id | 2412.1219254200@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Improving non-joinable EXISTS subqueries ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>) |
Responses |
Re: Improving non-joinable EXISTS subqueries
Re: Improving non-joinable EXISTS subqueries Re: Improving non-joinable EXISTS subqueries |
List | pgsql-hackers |
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> [ complicated scheme for improving planning of EXISTS ] > So I'd be very happy to see this work done, not because I can't find a > workaround, but because trying to teach all the programmers tricky > hand-optimizations is a losing battle, and if I lose that battle the > queries degenerate into spaghetti-land. I spent some time looking at this, and soon grew rather discouraged: even the very first step of what I'd had in mind, which was to delay replacement of uplevel Vars with Params until late in the planning process, looks like it will destabilize large amounts of code that aren't particularly related to the problem at hand. (Most of the planner blithely assumes that it will never see an uplevel Var, and tends to just treat any Var as being of the current query level.) So I backed off and thought some more, and eventually came to this conclusion: when we have an EXISTS that could be done both ways, why not just generate plans for both ways, and leave the decision which to use until later? Like maybe even execution time? We have speculated in the past about having alternative plans that could be conditionally executed based on information not available at planning time. This could be seen as a first experiment in that direction. I am not thinking of a general-purpose AlternativePlan kind of execution node, because SubPlans aren't actually part of the main plan-node tree, but an AlternativeSubPlans expression node type might work. The two issues that would obviously have to be faced to make this work are: 1. While the planner is estimating evaluation costs of the qual conditions for the upper query, which EXISTS implementation do we assume will be used? It might be that we could still use my original idea of providing cost_qual_eval() with some context about the likely number of calls, but what I'm thinking at the moment is that it's not worth the trouble, because it isn't going to matter that much. Either possibility is expensive enough compared to ordinary qual conditions that the planner will be driven in the direction of plans that minimize the number of EXISTS evaluations, and that's all that we really care about. So I'd be inclined to just use the numbers for the base (non hashed) implementation and be done with it. 2. How will the executor make the decision which to use? Well, it's got access to the overall rowcount estimates that the planner made. What I'm thinking of doing is having the AlternativeSubPlans node look at the rowcount estimate of its immediate parent Plan node. This is actually exactly the right number for a subplan in the targetlist of the Plan node. For a subplan in the qual list, it's an underestimate, but probably not an enormous underestimate. (Assuming that the subplan is at the end of the qual list, which is where it'd normally be, the expected number of calls of the subplan would be the output rowcount estimate divided by the estimated selectivity of the subplan qual --- but at present the latter is always 0.5 ...) Another technique that we could play with is to have the AlternativeSubPlans node track the actual number of calls it gets, and switch from the "retail" implementation to the "hashed" implementation if that exceeds a threshold. This'd provide some robustness in the face of bad estimates, although of course it's not optimal compared to having made the right choice to start with. Thoughts? regards, tom lane
pgsql-hackers by date: