Thread: Bad plan chosen for union all
I'm on PostgreSQL 9.6.5 and getting an awkwardly bad plan chosen for my
query.
I want to do:
select investments.id, cim.yield
FROM contributions
JOIN investments ON contributions.investment_id = investments.id
JOIN contribution_investment_metrics_view cim ON cim.investment_id =
investments.id
WHERE contributions.id IN ('\x58c9c0d3ee944c48b32f814d', '\x11')
Where contribution_investment_metrics_view is morally
select investment_id, first(val) from (select * from contribution_metrics
UNION ALL select * from investment_metrics) group by id
Typically, querying this view is very fast since I have indexes in both
component queries, leading to a very tight plan:
Sort Key: "*SELECT* 1".metric
-> Subquery Scan on "*SELECT* 1" (cost=14.68..14.68 rows=1 width=26)
(actual time=0.043..0.044 rows=2 loops=1) -> Sort (cost=14.68..14.68 rows=1 width=42) (actual
time=0.042..0.043 rows=2 loops=1) Sort Key: cm.metric, cm.last_update_on DESC Sort Method: quicksort Memory: 25kB -> Nested Loop (cost=0.14..14.68 rows=1 width=42) (actual
time=0.032..0.034 rows=2 loops=1) -> Index Scan using contributions_investment_id_idx on
contributions (cost=0.08..4.77 rows=2 width=26) (actual time=0.026..0.027
rows=1 loops=1) Index Cond: (investment_id = $1) -> Index Only Scan using
contribution_metrics_contribution_id_metric_last_update_on_idx on
contribution_metrics cm (cost=0.06..4.95 rows=2 width=34) (actual
time=0.005..0.006 r Index Cond: (contribution_id = contributions.id) Heap Fetches: 2
-> Subquery Scan on "*SELECT* 2" (cost=0.08..5.86 rows=3 width=26)
(actual time=0.008..0.008 rows=3 loops=1) -> Index Only Scan using
investment_metrics_investment_id_metric_last_updated_on_idx on
investment_metrics im (cost=0.08..5.85 rows=3 width=42) (actual
time=0.008..0.008 rows=3 loops=1) Index Cond: (investment_id = $1) Heap Fetches: 3
Unfortunately, when I try to query this view in the larger query above, I
get a *much* worse plan for this view, leading to >1000x degradation in
performance:
-> Append (cost=10329.18..26290.92 rows=482027 width=26) (actual
time=90.157..324.544 rows=482027 loops=1) -> Subquery Scan on "*SELECT* 1" (cost=10329.18..10349.44 rows=5788
width=26) (actual time=90.157..91.207 rows=5788 loops=1) -> Sort (cost=10329.18..10332.08 rows=5788 width=42) (actual
time=90.156..90.567 rows=5788 loops=1) Sort Key: contributions_1.investment_id, cm.metric,
cm.last_update_on DESC Sort Method: quicksort Memory: 645kB -> Hash Join (cost=105.62..10256.84 rows=5788 width=42)
(actual time=1.924..85.913 rows=5788 loops=1) Hash Cond: (contributions_1.id = cm.contribution_id) -> Seq Scan on contributions contributions_1
(cost=0.00..9694.49 rows=351495 width=26) (actual time=0.003..38.794
rows=351495 loops=1) -> Hash (cost=85.36..85.36 rows=5788 width=34)
(actual time=1.907..1.907 rows=5788 loops=1) Buckets: 8192 Batches: 1 Memory Usage: 453kB -> Seq Scan on contribution_metrics cm
(cost=0.00..85.36 rows=5788 width=34) (actual time=0.003..0.936 rows=5788
loops=1) -> Subquery Scan on "*SELECT* 2" (cost=0.08..15941.48 rows=476239
width=26) (actual time=0.017..203.006 rows=476239 loops=1) -> Index Only Scan using
investment_metrics_investment_id_metric_last_updated_on_idx1 on
investment_metrics im (cost=0.08..14512.76 rows=476239 width=42) (actual
time=0.016..160.410 rows=476239 l Heap Fetches: 476239
I've played around with a number of solutions (including lateral joins) and
the closest I can come is:
select investment_id
from contribution_investment_metrics
where investment_id = ( select investments.id from investments join contributions on investments.id = contributions.investment_id where contributions.id = '\x58c9c0d3ee944c48b32f814d'
)
This doesn't really work for my purposes, since I want to project columns
from contributions and investments and I want to run this query on "up to a
handful" contributions at once (maybe more than one, never more than 100).
I'm on PostgreSQL 9.6.5.
Schema and full explain analyzes:
https://gist.github.com/awreece/28c359c6d834717ab299665022b19fd6
I don't think it's relevant, but since
https://wiki.postgresql.org/wiki/SlowQueryQuestions asks -- I'm running in
Heroku.
What are my options here? Currently, I'm planning to avoid these bad plans
by using a less straightforward query for the view:
SELECT coalesce(contrib.id, cm.contribution_id) AS contribution_id, coalesce(cm.yield, im.yield) AS yield, coalesce(cm.term, im.term) AS term
FROM contributions contrib
JOIN investment_metrics_view im ON im.investment_id = contrib.investment_id
FULL OUTER JOIN contribution_metrics_view cm ON cm.contribution_id =
contrib.id
Best,
~Alex Reece
I managed to reduce my test case: the following query does not take
advantage of the index on contribution metrics.
explain select cim.yield
from earnings
JOIN contributions on contributions.id = earnings.note_id
JOIN
(
SELECT contribution_id, max(CASE metrics.name WHEN 'Yield'::text THEN projected ELSE
NULL::double precision END) AS yield
from contribution_metrics
JOIN metrics ON metrics.id = metric
group by contribution_id
) cim ON cim.contribution_id = contributions.id
WHERE earnings.id = '\x595400456c1f1400116b3843';
I got this:
Hash Join (cost=125.02..147.03 rows=1 width=8) (actual time=4.781..4.906
rows=1 loops=1) Hash Cond: (contribution_metrics.contribution_id = contributions.id) -> HashAggregate (cost=116.86..126.64 rows=3261 width=21) (actual
time=4.157..4.600 rows=3261 loops=1) Group Key: contribution_metrics.contribution_id -> Hash Join (cost=1.11..108.18 rows=5788 width=33) (actual
time=0.021..2.425 rows=5788 loops=1) Hash Cond: (contribution_metrics.metric = metrics.id) -> Seq Scan on contribution_metrics (cost=0.00..85.36
rows=5788 width=34) (actual time=0.006..0.695 rows=5788 loops=1) -> Hash (cost=1.05..1.05 rows=17 width=25) (actual
time=0.009..0.009 rows=17 loops=1) -> Seq Scan on metrics (cost=0.00..1.05 rows=17
width=25) (actual time=0.002..0.005 rows=17 loops=1) -> Hash (cost=8.15..8.15 rows=1 width=26) (actual time=0.022..0.022
rows=1 loops=1) -> Nested Loop (cost=0.14..8.15 rows=1 width=26) (actual
time=0.019..0.020 rows=1 loops=1) -> Index Scan using earnings_pkey on earnings
(cost=0.06..4.06 rows=1 width=13) (actual time=0.009..0.009 rows=1 loops=1) Index Cond: (id = '\x595400456c1f1400116b3843'::bytea) -> Index Only Scan using contributions_pkey on
contributions (cost=0.08..4.09 rows=1 width=13) (actual time=0.008..0.009
rows=1 loops=1) Index Cond: (id = earnings.note_id) Planning time: 0.487 ms Execution time: 4.975 ms
But I expected it to be equivalent to the plan from this query:
select cim.yield from (
select contribution_id,
max(CASE metrics.name WHEN 'Yield'::text THEN projected ELSE NULL::double
precision END) AS yield
from contribution_metrics JOIN metrics ON metrics.id = metric group by
contribution_id
) cim where cim.contribution_id = (
select contributions.id from contributions
join earnings on earnings.note_id = contributions.id
where earnings.id = '\x595400456c1f1400116b3843')
Which gives me _this_ plan, that correctly uses the index on
contribution_metrics.
Subquery Scan on cim (cost=9.32..14.23 rows=2 width=8) (actual
time=0.108..0.108 rows=1 loops=1) InitPlan 1 (returns $1) -> Nested Loop (cost=0.14..8.15 rows=1 width=13) (actual
time=0.054..0.055 rows=1 loops=1) -> Index Scan using earnings_pkey on earnings
(cost=0.06..4.06 rows=1 width=13) (actual time=0.025..0.026 rows=1 loops=1) Index Cond: (id = '\x595400456c1f1400116b3843'::bytea) -> Index Only Scan using contributions_pkey on contributions
(cost=0.08..4.09 rows=1 width=13) (actual time=0.026..0.026 rows=1 loops=1) Index Cond: (id = earnings.note_id) -> GroupAggregate (cost=1.17..6.07 rows=2 width=21) (actual
time=0.108..0.108 rows=1 loops=1) Group Key: contribution_metrics.contribution_id -> Hash Join (cost=1.17..6.07 rows=2 width=33) (actual
time=0.100..0.101 rows=2 loops=1) Hash Cond: (contribution_metrics.metric = metrics.id) -> Index Scan using
contribution_metrics_contribution_id_metric_last_update_on_idx1 on
contribution_metrics ( cost=0.06..4.95 rows=2 width=34) (actual time Index Cond: (contribution_id = $1) -> Hash (cost=1.05..1.05 rows=17 width=25) (actual
time=0.012..0.012 rows=17 loops=1) -> Seq Scan on metrics (cost=0.00..1.05 rows=17
width=25) (actual time=0.004..0.006 rows=17 loops=1) Planning time: 0.396 ms Execution time: 0.165 ms
schema here:
https://gist.github.com/awreece/aeacbc818277c7c6d99477645e7fcd03
Best,
~Alex
On Tue, Nov 28, 2017 at 2:13 AM Alex Reece wrote:
> I'm on PostgreSQL 9.6.5 and getting an awkwardly bad plan chosen for my
> query.
>
> I want to do:
>
> select investments.id, cim.yield
> FROM contributions
> JOIN investments ON contributions.investment_id = investments.id
> JOIN contribution_investment_metrics_view cim ON cim.investment_id =
> investments.id
> WHERE contributions.id IN ('\x58c9c0d3ee944c48b32f814d', '\x11')
> Where contribution_investment_metrics_view is morally
>
> select investment_id, first(val) from (select * from contribution_metrics
> UNION ALL select * from investment_metrics) group by id
>
> Typically, querying this view is very fast since I have indexes in both
> component queries, leading to a very tight plan:
>
> Sort Key: "*SELECT* 1".metric
> -> Subquery Scan on "*SELECT* 1" (cost=14.68..14.68 rows=1 width=26)
> (actual time=0.043..0.044 rows=2 loops=1)
> -> Sort (cost=14.68..14.68 rows=1 width=42) (actual
> time=0.042..0.043 rows=2 loops=1)
> Sort Key: cm.metric, cm.last_update_on DESC
> Sort Method: quicksort Memory: 25kB
> -> Nested Loop (cost=0.14..14.68 rows=1 width=42) (actual
> time=0.032..0.034 rows=2 loops=1)
> -> Index Scan using contributions_investment_id_idx on
> contributions (cost=0.08..4.77 rows=2 width=26) (actual time=0.026..0.027
> rows=1 loops=1)
> Index Cond: (investment_id = $1)
> -> Index Only Scan using
> contribution_metrics_contribution_id_metric_last_update_on_idx on
> contribution_metrics cm (cost=0.06..4.95 rows=2 width=34) (actual
> time=0.005..0.006 r
> Index Cond: (contribution_id = contributions.id)
> Heap Fetches: 2
> -> Subquery Scan on "*SELECT* 2" (cost=0.08..5.86 rows=3 width=26)
> (actual time=0.008..0.008 rows=3 loops=1)
> -> Index Only Scan using
> investment_metrics_investment_id_metric_last_updated_on_idx on
> investment_metrics im (cost=0.08..5.85 rows=3 width=42) (actual
> time=0.008..0.008 rows=3 loops=1)
> Index Cond: (investment_id = $1)
> Heap Fetches: 3
>
> Unfortunately, when I try to query this view in the larger query above, I
> get a *much* worse plan for this view, leading to >1000x degradation in
> performance:
>
> -> Append (cost=10329.18..26290.92 rows=482027 width=26) (actual
> time=90.157..324.544 rows=482027 loops=1)
> -> Subquery Scan on "*SELECT* 1" (cost=10329.18..10349.44
> rows=5788 width=26) (actual time=90.157..91.207 rows=5788 loops=1)
> -> Sort (cost=10329.18..10332.08 rows=5788 width=42) (actual
> time=90.156..90.567 rows=5788 loops=1)
> Sort Key: contributions_1.investment_id, cm.metric,
> cm.last_update_on DESC
> Sort Method: quicksort Memory: 645kB
> -> Hash Join (cost=105.62..10256.84 rows=5788
> width=42) (actual time=1.924..85.913 rows=5788 loops=1)
> Hash Cond: (contributions_1.id =
> cm.contribution_id)
> -> Seq Scan on contributions contributions_1
> (cost=0.00..9694.49 rows=351495 width=26) (actual time=0.003..38.794
> rows=351495 loops=1)
> -> Hash (cost=85.36..85.36 rows=5788 width=34)
> (actual time=1.907..1.907 rows=5788 loops=1)
> Buckets: 8192 Batches: 1 Memory Usage:
> 453kB
> -> Seq Scan on contribution_metrics cm
> (cost=0.00..85.36 rows=5788 width=34) (actual time=0.003..0.936 rows=5788
> loops=1)
> -> Subquery Scan on "*SELECT* 2" (cost=0.08..15941.48 rows=476239
> width=26) (actual time=0.017..203.006 rows=476239 loops=1)
> -> Index Only Scan using
> investment_metrics_investment_id_metric_last_updated_on_idx1 on
> investment_metrics im (cost=0.08..14512.76 rows=476239 width=42) (actual
> time=0.016..160.410 rows=476239 l
> Heap Fetches: 476239
>
> I've played around with a number of solutions (including lateral joins)
> and the closest I can come is:
>
> select investment_id
> from contribution_investment_metrics
> where investment_id = (
> select investments.id
> from investments
> join contributions on investments.id = contributions.investment_id
> where contributions.id = '\x58c9c0d3ee944c48b32f814d'
> )
>
> This doesn't really work for my purposes, since I want to project columns
> from contributions and investments and I want to run this query on "up to a
> handful" contributions at once (maybe more than one, never more than 100).
>
> I'm on PostgreSQL 9.6.5.
> Schema and full explain analyzes:
> https://gist.github.com/awreece/28c359c6d834717ab299665022b19fd6
> I don't think it's relevant, but since
> https://wiki.postgresql.org/wiki/SlowQueryQuestions asks -- I'm running
> in Heroku.
>
> What are my options here? Currently, I'm planning to avoid these bad plans
> by using a less straightforward query for the view:
>
> SELECT
> coalesce(contrib.id, cm.contribution_id) AS contribution_id,
> coalesce(cm.yield, im.yield) AS yield,
> coalesce(cm.term, im.term) AS term
> FROM contributions contrib
> JOIN investment_metrics_view im ON im.investment_id =
> contrib.investment_id
> FULL OUTER JOIN contribution_metrics_view cm ON cm.contribution_id =
> contrib.id
>
> Best,
> ~Alex Reece
>
Alex Reece <awreece@gmail.com> writes: > I managed to reduce my test case: the following query does not take > advantage of the index on contribution metrics. Yeah. What you're wishing is that the planner would push a join condition down into a subquery, but it won't do that at present. Doing so would require generating "parameterized paths" for subqueries. While I do not think there's any fundamental technical reason anymore that we couldn't do so, there's considerable risk of wasting a lot of planner cycles chasing unprofitable plan alternatives. Anyway it was totally impractical before 9.6's upper-planner-pathification changes, and not all of the dust has settled from that rewrite. > But I expected it to be equivalent to the plan from this query: The difference here is that, from the perspective of the outer query, the WHERE condition is a restriction clause on the "cim" relation, not a join clause. So it will get pushed down into the subquery without creating any join order constraints on the outer query. regards, tom lane
One more thing. Given this:
> The difference here is that, from the perspective of the outer query,
> the WHERE condition is a restriction clause on the "cim" relation,
> not a join clause. So it will get pushed down into the subquery
> without creating any join order constraints on the outer query.
I expected the lateral form of the query to properly use the indexes. Sure
enough, this correctly uses the index:
explain select cim.yield
from earnings
JOIN contributions on contributions.id = earnings.note_id
JOIN LATERAL
( SELECT contribution_id, max(CASE metrics.name WHEN 'Yield'::text THEN projected ELSE
NULL::double precision END) AS yield from contribution_metrics JOIN metrics ON metrics.id = metric WHERE contributions.id =
contribution_id group by contribution_id
) cim ON true
WHERE earnings.id = '\x595400456c1f1400116b3843'
However, when I try to wrap that subquery query again (e.g. as I would need
to if it were a view), it doesn't restrict:
select cim.yield
from earnings
JOIN contributions on contributions.id = earnings.note_id
JOIN LATERAL
( select * from ( SELECT contribution_id, max(CASE metrics.name WHEN 'Yield'::text THEN projected ELSE
NULL::double precision END) AS yield from contribution_metrics JOIN metrics ON metrics.id = metric
group by contribution_id ) my_view WHERE contribution_id = contributions.id
) cim ON true
WHERE earnings.id = '\x595400456c1f1400116b3843'
Is there a way I can get the restriction to be pushed down into my subquery
in this lateral form?
Best,
~Alex