Re: Improving non-joinable EXISTS subqueries - Mailing list pgsql-hackers
| From | Tom Lane | 
|---|---|
| Subject | Re: Improving non-joinable EXISTS subqueries | 
| Date | |
| Msg-id | 2412.1219254200@sss.pgh.pa.us Whole thread Raw | 
| In response to | Re: Improving non-joinable EXISTS subqueries ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>) | 
| Responses | Re: Improving non-joinable EXISTS subqueries Re: Improving non-joinable EXISTS subqueries Re: Improving non-joinable EXISTS subqueries | 
| List | pgsql-hackers | 
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
>> [ complicated scheme for improving planning of EXISTS ]
> So I'd be very happy to see this work done, not because I can't find a
> workaround, but because trying to teach all the programmers tricky
> hand-optimizations is a losing battle, and if I lose that battle the
> queries degenerate into spaghetti-land.
I spent some time looking at this, and soon grew rather discouraged:
even the very first step of what I'd had in mind, which was to delay
replacement of uplevel Vars with Params until late in the planning
process, looks like it will destabilize large amounts of code that
aren't particularly related to the problem at hand.  (Most of the
planner blithely assumes that it will never see an uplevel Var, and
tends to just treat any Var as being of the current query level.)
So I backed off and thought some more, and eventually came to this
conclusion: when we have an EXISTS that could be done both ways,
why not just generate plans for both ways, and leave the decision
which to use until later?  Like maybe even execution time?
We have speculated in the past about having alternative plans that
could be conditionally executed based on information not available
at planning time.  This could be seen as a first experiment in that
direction.  I am not thinking of a general-purpose AlternativePlan
kind of execution node, because SubPlans aren't actually part of the
main plan-node tree, but an AlternativeSubPlans expression node
type might work.
The two issues that would obviously have to be faced to make this
work are:
1. While the planner is estimating evaluation costs of the qual
conditions for the upper query, which EXISTS implementation do we assume
will be used?  It might be that we could still use my original idea of
providing cost_qual_eval() with some context about the likely number of
calls, but what I'm thinking at the moment is that it's not worth the
trouble, because it isn't going to matter that much.  Either possibility
is expensive enough compared to ordinary qual conditions that the
planner will be driven in the direction of plans that minimize the
number of EXISTS evaluations, and that's all that we really care about.
So I'd be inclined to just use the numbers for the base (non hashed)
implementation and be done with it.
2. How will the executor make the decision which to use?  Well, it's
got access to the overall rowcount estimates that the planner made.
What I'm thinking of doing is having the AlternativeSubPlans node
look at the rowcount estimate of its immediate parent Plan node.
This is actually exactly the right number for a subplan in the
targetlist of the Plan node.  For a subplan in the qual list, it's
an underestimate, but probably not an enormous underestimate.
(Assuming that the subplan is at the end of the qual list, which is
where it'd normally be, the expected number of calls of the subplan
would be the output rowcount estimate divided by the estimated
selectivity of the subplan qual --- but at present the latter is always
0.5 ...)
Another technique that we could play with is to have the
AlternativeSubPlans node track the actual number of calls it gets,
and switch from the "retail" implementation to the "hashed"
implementation if that exceeds a threshold.  This'd provide some
robustness in the face of bad estimates, although of course it's
not optimal compared to having made the right choice to start with.
Thoughts?
        regards, tom lane
		
	pgsql-hackers by date: