Re: BUG #5059: Planner ignores estimates when planning an IN () subquery - Mailing list pgsql-bugs
From | Robert Haas |
---|---|
Subject | Re: BUG #5059: Planner ignores estimates when planning an IN () subquery |
Date | |
Msg-id | 603c8f070909161432k3fe8c46i7b36ec3655767c5c@mail.gmail.com Whole thread Raw |
In response to | BUG #5059: Planner ignores estimates when planning an IN () subquery ("Kenaniah Cerny" <kenaniah@gmail.com>) |
Responses |
Re: BUG #5059: Planner ignores estimates when planning an IN
() subquery
|
List | pgsql-bugs |
On Tue, Sep 15, 2009 at 11:35 PM, Kenaniah Cerny <kenaniah@gmail.com> wrote: > > The following bug has been logged online: > > Bug reference: =A0 =A0 =A05059 > Logged by: =A0 =A0 =A0 =A0 =A0Kenaniah Cerny > Email address: =A0 =A0 =A0kenaniah@gmail.com > PostgreSQL version: 8.4.1 > Operating system: =A0 Centos5.2 > Description: =A0 =A0 =A0 =A0Planner ignores estimates when planning an IN= () > subquery > Details: > > Consider the following query: > > http://pgsql.privatepaste.com/aa5DAtiwws > > When planning the subquery of the IN () statement, the planner chose to s= can > the indexes of the outer and inner columns in parallel using a nested loop > semi join. > > http://pgsql.privatepaste.com/4eXj3zRcy7 > > By not enabling the planner to sort via the index of the outer column in = the > WHERE clause (query above), the a nested loop version of the plan executes > in a fraction of the time. > > http://pgsql.privatepaste.com/5c0bOcL3t6 > > As you can see from the above query, forcing the materialization of the > subquery produces a much superior plan. > > http://pgsql.privatepaste.com/371nl6KFrI > > For comparison, this query replaces the subquery with hard-coded values. > > The planner appears to not be weighing the benefits of materializing the > subquery of the IN () statement properly when ordering is involved, and > still produces an inferior plan when ordering is not a factor. > > Please feel free to contact me for additional test cases if needed. I've seen this kind of plan before. The planner knows that a backwards index scan over user_activity_log is slow, but it thinks that it won't have to scan very much of it because it will accumulate enough rows to satisfy the LIMIT before it goes very far. When it turns out that there are not as many matching rows in user_friends as anticipated, things get very slow. Unfortunately, there's not an easy solution to this problem. If it were really true that most rows in user_activity_log had matches in user_friends, then planner would be correct to choose the plan it did - and wrong to choose the plan you want it to use. In this case, your table is small enough that a bitmap heap scan on user_activity_log is reasonable, but for a larger table in a situation where the join selectivity is higher, it might really be necessary to use the index. The problem is just that the planner is doing a poor job figuring out where the breakpoint is between those two strategies. I'm not 100% sure exactly which selectivity estimate is wrong. It's not clear to me whether things are already wrong here: INDEX Scan USING user_friends_user_account_id_friend_id_key ON user_friends (cost=3D0.00..0.27 rows=3D1 width=3D4) (actual time=3D0.004..0.004 rows=3D0 loops=3D350713) But they're definitely wrong here: Nested Loop Semi JOIN (cost=3D0.00..144564.33 rows=3D62298 width=3D52) (actual time=3D5138.961..5405.075 rows=3D10 loops=3D1) I *think* the root cause of the problem is that we don't have multi-column statistics. When you look at user_friends, we can tell you the selectivity of user_friends.user_account_id =3D 74650 pretty accurately, and we should also be able to spit out MCVs to compare against the MCVs of user_activity_log, but if the selectivity of those two expressions together doesn't resemble the product of their individual selectivities, which is probably the case here, the planner has no way to detect that. ...Robert
pgsql-bugs by date: