Re: Two slightly different queries produce same results, - Mailing list pgsql-general
From | Richard Huxton |
---|---|
Subject | Re: Two slightly different queries produce same results, |
Date | |
Msg-id | 43745EE1.1010208@archonet.com Whole thread Raw |
In response to | Two slightly different queries produce same results, one takes 1/4 the time.. bug in planner? (Andrew Schmidt <andrew@lifescale.com>) |
Responses |
Re: Two slightly different queries produce same results,
|
List | pgsql-general |
Andrew Schmidt wrote: > However, I've run into a problem where one query took about twice as > long as innodb. Some investigation and playing around with the query, I > ended up with a slightly different query but was about 3 times as fast > as innodb (and 5 times faster than the original query). I didn't add > any indices and the EXPLAIN ANALYZE doesn't show any new indices being > used with the new query. The order of which tables are read first was > changed in the query planner and also it uses HashAggregate vs > GroupAggregate.. not entirely sure what that means. It's a different method of aggregating data (calculating your sum()). The HashAggregate puts values into buckets and works best when you have a fairly small number of different values (so each value can get its own bucket). I think - bear in mind I'm just another user, so I'm not familiar with the internals. The different plans are because PG thinks it will get different numbers of rows... > SELECT tp.team_id, pl.position = 'G' AS is_goalie, SUM(goals) AS > total_goals > FROM > team t JOIN team_players tp ON tp.team_id = t.team_id > JOIN player_updates pu ON pu.player_id = tp.player_id > JOIN player pl ON pl.player_id = pu.player_id > WHERE t.pool_id = 21699 > GROUP BY tp.team_id,is_goalie; > Explain analyze of this query: > -> Index Scan using team_pool_id on team t > (cost=0.00..397.90 rows=391 width=4) (actual time=0.061..2.129 rows=313 > loops=1) > Index Cond: (pool_id = 21699) With this index scan you have one condition and PG expects 391 matching rows (it gets 313, so that's a pretty good estimate). > SELECT tp.team_id,pl.position = 'G' AS is_goalie,SUM(goals) AS total_goals > FROM > team t JOIN team_players tp ON tp.team_id = t.team_id AND t.pool_id = > 21699 <--- ** added pool_id > JOIN player_updates pu ON pu.player_id = tp.player_id > JOIN player pl ON pl.player_id = pu.player_id > WHERE t.pool_id = 21699 > GROUP BY tp.team_id,is_goalie; > > The explain analyze: > -> Index Scan using team_pool_id on > team t (cost=0.00..10.44 rows=5 width=4) (actual time=0.049..1.556 > rows=313 loops=1) > Index Cond: ((pool_id = 21699) AND > (pool_id = 21699)) Here it sees two conditions, and PG is expecting that will restrict things further. It's clearly not expecting the same condition twice, but is presumably treating it something like "a>-4 AND a<4". This actually makes the row estimate wrong. So - if PG is getting this bit wrong why is query 2 faster? Looking at the row estimates for case 1 we can see that the final cases are badly wrong. > GroupAggregate (cost=8742.52..9853.85 rows=44453 width=13) (actual > time=1186.973..1432.548 rows=626 loops=1) Here it's got the number of rows badly wrong - if you actually ended up with 44,000 rows then maybe this would be the best plan. This is the last step though, so what happens before? > -> Sort (cost=8742.52..8853.66 rows=44453 width=13) (actual > time=1186.237..1309.562 rows=40928 loops=1) > Sort Key: tp.team_id, (pl."position" = 'G'::bpchar) We're about right here. The problem must be in the last step, and looking at the row estimates it seems to think that no summarising is actually going to take place. I think the "position = 'G'" bit is confusing it. Well, we can test that: 1. Just group by "pl.position" - remove the test. Do the estimated and actual rows match now? 2. Rewrite the position='G' as a CASE structure 3. Do the position='G' part in a sub-query and join to that. The results of those should show whether this is actually the problem. -- Richard Huxton Archonet Ltd
pgsql-general by date: