d25ea01275 and partitionwise join - Mailing list pgsql-hackers
From | Amit Langote |
---|---|
Subject | d25ea01275 and partitionwise join |
Date | |
Msg-id | CA+HiwqG2WVUGmLJqtR0tPFhniO=H=9qQ+Z3L_ZC+Y3-EVQHFGg@mail.gmail.com Whole thread Raw |
Responses |
Re: d25ea01275 and partitionwise join
|
List | pgsql-hackers |
Hi Tom, I think an assumption of d25ea01275 breaks partitionwise join. Sorry it took me a while to report it. In https://postgr.es/m/8168.1560446056@sss.pgh.pa.us, Tom wrote: > I poked into this and found the cause. For the sample query, we have > an EquivalenceClass containing the expression > COALESCE(COALESCE(Var_1_1, Var_2_1), Var_3_1) > where each of the Vars belongs to an appendrel parent. > add_child_rel_equivalences() needs to add expressions representing the > transform of that to each child relation. That is, if the children > of table 1 are A1 and A2, of table 2 are B1 and B2, and of table 3 > are C1 and C2, what we'd like to add are the expressions > COALESCE(COALESCE(Var_A1_1, Var_2_1), Var_3_1) > COALESCE(COALESCE(Var_A2_1, Var_2_1), Var_3_1) > COALESCE(COALESCE(Var_1_1, Var_B1_1), Var_3_1) > COALESCE(COALESCE(Var_1_1, Var_B2_1), Var_3_1) > COALESCE(COALESCE(Var_1_1, Var_2_1), Var_C1_1) > COALESCE(COALESCE(Var_1_1, Var_2_1), Var_C2_1) > However, what it's actually producing is additional combinations for > each appendrel after the first, because each call also mutates the > previously-added child expressions. So in this example we also get > COALESCE(COALESCE(Var_A1_1, Var_B1_1), Var_3_1) > COALESCE(COALESCE(Var_A2_1, Var_B2_1), Var_3_1) > COALESCE(COALESCE(Var_A1_1, Var_2_1), Var_C1_1) > COALESCE(COALESCE(Var_A2_1, Var_2_1), Var_C2_1) > COALESCE(COALESCE(Var_A1_1, Var_B1_1), Var_C1_1) > COALESCE(COALESCE(Var_A2_1, Var_B2_1), Var_C2_1) > With two appendrels involved, that's O(N^2) expressions; with > three appendrels, more like O(N^3). > > This is by no means specific to FULL JOINs; you could get the same > behavior with join clauses like "WHERE t1.a + t2.b + t3.c = t4.d". > > These extra expressions don't have any use, since we're not > going to join the children directly to each other. ...unless partition wise join thinks they can be joined. Partition wise join can't handle 3-way full joins today, but only because it's broken itself when trying to match a full join clause to the partition key due to one side being a COALESCE expression. Consider this example query: -- p is defined as: -- create table p (a int) partition by list (a); -- create table p1 partition of p for values in (1); -- create table p2 partition of p for values in (2); explain select * from p t1 full outer join p t2 using (a) full outer join p t3 using (a) full outer join p t4 using (a) order by 1; QUERY PLAN ───────────────────────────────────────────────────────────────────────────────────────────────────────────────── Sort (cost=16416733.32..16628145.85 rows=84565012 width=4) Sort Key: (COALESCE(COALESCE(COALESCE(t1.a, t2.a), t3.a), t4.a)) -> Merge Full Join (cost=536957.40..1813748.77 rows=84565012 width=4) Merge Cond: (t4.a = (COALESCE(COALESCE(t1.a, t2.a), t3.a))) -> Sort (cost=410.57..423.32 rows=5100 width=4) Sort Key: t4.a -> Append (cost=0.00..96.50 rows=5100 width=4) -> Seq Scan on p1 t4 (cost=0.00..35.50 rows=2550 width=4) -> Seq Scan on p2 t4_1 (cost=0.00..35.50 rows=2550 width=4) -> Materialize (cost=536546.83..553128.21 rows=3316275 width=12) -> Sort (cost=536546.83..544837.52 rows=3316275 width=12) Sort Key: (COALESCE(COALESCE(t1.a, t2.a), t3.a)) -> Merge Full Join (cost=14254.85..64024.48 rows=3316275 width=12) Merge Cond: (t3.a = (COALESCE(t1.a, t2.a))) -> Sort (cost=410.57..423.32 rows=5100 width=4) Sort Key: t3.a -> Append (cost=0.00..96.50 rows=5100 width=4) -> Seq Scan on p1 t3 (cost=0.00..35.50 rows=2550 width=4) -> Seq Scan on p2 t3_1 (cost=0.00..35.50 rows=2550 width=4) -> Sort (cost=13844.29..14169.41 rows=130050 width=8) Sort Key: (COALESCE(t1.a, t2.a)) -> Merge Full Join (cost=821.13..2797.38 rows=130050 width=8) Merge Cond: (t1.a = t2.a) -> Sort (cost=410.57..423.32 rows=5100 width=4) Sort Key: t1.a -> Append (cost=0.00..96.50 rows=5100 width=4) -> Seq Scan on p1 t1 (cost=0.00..35.50 rows=2550 width=4) -> Seq Scan on p2 t1_1 (cost=0.00..35.50 rows=2550 width=4) -> Sort (cost=410.57..423.32 rows=5100 width=4) Sort Key: t2.a -> Append (cost=0.00..96.50 rows=5100 width=4) -> Seq Scan on p1 t2 (cost=0.00..35.50 rows=2550 width=4) -> Seq Scan on p2 t2_1 (cost=0.00..35.50 rows=2550 width=4) -- turn on enable_partitionwise_join set enable_partitionwise_join to on; explain select * from p t1 full outer join p t2 using (a) full outer join p t3 using (a) full outer join p t4 using (a) order by 1; QUERY PLAN ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── Sort (cost=16385259.94..16596672.47 rows=84565012 width=4) Sort Key: (COALESCE(COALESCE(COALESCE(t1.a, t2.a), t3.a), t4.a)) -> Merge Full Join (cost=505484.02..1782275.39 rows=84565012 width=4) Merge Cond: (t4.a = (COALESCE(COALESCE(t1.a, t2.a), t3.a))) -> Sort (cost=410.57..423.32 rows=5100 width=4) Sort Key: t4.a -> Append (cost=0.00..96.50 rows=5100 width=4) -> Seq Scan on p1 t4 (cost=0.00..35.50 rows=2550 width=4) -> Seq Scan on p2 t4_1 (cost=0.00..35.50 rows=2550 width=4) -> Materialize (cost=505073.45..521654.83 rows=3316275 width=12) -> Sort (cost=505073.45..513364.14 rows=3316275 width=12) Sort Key: (COALESCE(COALESCE(t1.a, t2.a), t3.a)) -> Merge Full Join (cost=7653.92..32551.10 rows=3316275 width=12) Merge Cond: (t3.a = (COALESCE(t1.a, t2.a))) -> Sort (cost=410.57..423.32 rows=5100 width=4) Sort Key: t3.a -> Append (cost=0.00..96.50 rows=5100 width=4) -> Seq Scan on p1 t3 (cost=0.00..35.50 rows=2550 width=4) -> Seq Scan on p2 t3_1 (cost=0.00..35.50 rows=2550 width=4) -> Sort (cost=7243.35..7405.91 rows=65024 width=8) Sort Key: (COALESCE(t1.a, t2.a)) -> Result (cost=359.57..2045.11 rows=65024 width=8) -> Append (cost=359.57..2045.11 rows=65024 width=8) -> Merge Full Join (cost=359.57..860.00 rows=32512 width=8) Merge Cond: (t1.a = t2.a) -> Sort (cost=179.78..186.16 rows=2550 width=4) Sort Key: t1.a -> Seq Scan on p1 t1 (cost=0.00..35.50 rows=2550 width=4) -> Sort (cost=179.78..186.16 rows=2550 width=4) Sort Key: t2.a -> Seq Scan on p1 t2 (cost=0.00..35.50 rows=2550 width=4) -> Merge Full Join (cost=359.57..860.00 rows=32512 width=8) Merge Cond: (t1_1.a = t2_1.a) -> Sort (cost=179.78..186.16 rows=2550 width=4) Sort Key: t1_1.a -> Seq Scan on p2 t1_1 (cost=0.00..35.50 rows=2550 width=4) -> Sort (cost=179.78..186.16 rows=2550 width=4) Sort Key: t2_1.a -> Seq Scan on p2 t2_1 (cost=0.00..35.50 rows=2550 width=4) See how it only managed to use partition wise join up to 2-way join, but gives up at 3-way join and higher, because the join condition looks like this: t3.a = (COALESCE(t1.a, t2.a). When building the join relation (t1, t2, t3) between (t3) and (t1, t2), it fails to see that COALESCE(t1.a, t2.a) actually matches the partition key of (t1, t2). When I fix the code that does the matching and run with merge joins disabled, I can get a plan where the whole 4-way join is partitioned: explain select * from p t1 full outer join p t2 using (a) full outer join p t3 using (a) full outer join p t4 using (a) order by 1; QUERY PLAN ───────────────────────────────────────────────────────────────────────────────────────────────────── Gather Merge (cost=831480.11..1859235.87 rows=8808720 width=4) Workers Planned: 2 -> Sort (cost=830480.09..841490.99 rows=4404360 width=4) Sort Key: (COALESCE(COALESCE(COALESCE(t1.a, t2.a), t3.a), t4.a)) -> Parallel Append (cost=202.12..224012.93 rows=4404360 width=4) -> Hash Full Join (cost=202.12..201991.13 rows=5285232 width=4) Hash Cond: (COALESCE(COALESCE(t1.a, t2.a), t3.a) = t4.a) -> Hash Full Join (cost=134.75..15904.32 rows=414528 width=12) Hash Cond: (COALESCE(t1.a, t2.a) = t3.a) -> Hash Full Join (cost=67.38..1247.18 rows=32512 width=8) Hash Cond: (t1.a = t2.a) -> Seq Scan on p1 t1 (cost=0.00..35.50 rows=2550 width=4) -> Hash (cost=35.50..35.50 rows=2550 width=4) -> Seq Scan on p1 t2 (cost=0.00..35.50 rows=2550 width=4) -> Hash (cost=35.50..35.50 rows=2550 width=4) -> Seq Scan on p1 t3 (cost=0.00..35.50 rows=2550 width=4) -> Hash (cost=35.50..35.50 rows=2550 width=4) -> Seq Scan on p1 t4 (cost=0.00..35.50 rows=2550 width=4) -> Hash Full Join (cost=202.12..201991.13 rows=5285232 width=4) Hash Cond: (COALESCE(COALESCE(t1_1.a, t2_1.a), t3_1.a) = t4_1.a) -> Hash Full Join (cost=134.75..15904.32 rows=414528 width=12) Hash Cond: (COALESCE(t1_1.a, t2_1.a) = t3_1.a) -> Hash Full Join (cost=67.38..1247.18 rows=32512 width=8) Hash Cond: (t1_1.a = t2_1.a) -> Seq Scan on p2 t1_1 (cost=0.00..35.50 rows=2550 width=4) -> Hash (cost=35.50..35.50 rows=2550 width=4) -> Seq Scan on p2 t2_1 (cost=0.00..35.50 rows=2550 width=4) -> Hash (cost=35.50..35.50 rows=2550 width=4) -> Seq Scan on p2 t3_1 (cost=0.00..35.50 rows=2550 width=4) -> Hash (cost=35.50..35.50 rows=2550 width=4) -> Seq Scan on p2 t4_1 (cost=0.00..35.50 rows=2550 width=4) (31 rows) But with merge joins enabled: explain select * from p t1 full outer join p t2 using (a) full outer join p t3 using (a) full outer join p t4 using (a) order by 1; ERROR: could not find pathkey item to sort That's because, there's no child COALESCE(t1_1.a, t2_1.a) expression in the EC that contains COALESCE(t1.a, t2.a), where t1_1 and t2_1 represent the 1st partition of t1 and t2, resp. The problem is that add_child_rel_equivalences(), as of d25ea01275, only adds the following child expressions of COALESCE(t1.a, t2.a): -- when translating t1 COALESCE(t1_1.a, t2.a) COALESCE(t1_2.a, t2.a) -- when translating t2 COALESCE(t1.a, t2_1.a) COALESCE(t1.a, t2_2.a) whereas previously, the following would be added too when translating t2: COALESCE(t1_1.a, t2_1.a) COALESCE(t1_1.a, t2_2.a) COALESCE(t1_2.a, t2_1.a) COALESCE(t1_2.a, t2_2.a) Note that of those, only COALESCE(t1_1.a, t2_1.a) and COALESCE(t1_2.a, t2_2.a) are interesting, because partition wise join will only ever consider pairs (t1_1, t2_1) and (t1_2, t2_2) to be joined. We can get the needed child expressions and still avoid the combinatorial explosion in the size of resulting EC members list if we taught add_child_rel_equivalences() to only translate ECs that the input parent relation is capable of producing. So, COALESCE(t1.a, t2.a) will not be translated if the input relation is only (t1) or (t2), that is, when called from set_append_rel_size(). Instead it would be translated if it's passed the joinrel (t1, t2). IOW, teach build_child_join_rel() to call add_child_rel_equivalences(), which I've tried to implement in the attached. I have attached two patches. 0001 - fix partitionwise join to work correctly with n-way joins of which some are full joins (+ cosmetic improvements around the code that was touched) 0002 - fix to translate multi-relation EC members correctly Thanks, Amit
Attachment
pgsql-hackers by date: