Thread: Merge David and Goliath tables efficiently
In my use case I have a 2 billion row / 1 TB table. I have around 2 million rows of daily data to upsert, with say 50% inserts, based on the primary key, into a freshly analyzed table.

I have tested multiple strategies to merge the data, all based on a first stage that copies the 2M-row dataset into a staging unlogged, indexed table:

1. Join insert then join update
2.1. Usage of the new MERGE statement
2.2. Usage of MERGE on two hash-partitioned tables with partitionwise join enabled
3. Usage of MERGE by batches of 1000 rows

A first remark is that the MERGE statement is almost 30% faster than two statements in my benchmarks. Thanks to the pg community for this.

While strategies 1 and 2.x are incredibly slow (canceled after 10 hours), the third one finishes within 30 minutes.

My interpretation, reading the query plans, is that well-sized small batches of upserts leverage the indexes, while the regular join chooses a sequential scan, including sorting and hashing, which takes forever in time and resources, including disk. Sadly my incoming dataset is too small to benefit from a seq scan and too large to benefit from an index scan join. However, when split manually into N portions, the problem can be tackled with N * a small cost, which is cheap anyway.

Questions:
1. Is there another strategy?
2. Could postgres support a "batched indexed join" itself, leveraging indexes itself with dynamically sized batches?

It is error prone to write code to split and iterate. I suspect postgres has everything internally (index catalog, planner) to split the job itself, making David vs Goliath something trivial.
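For illustration only, a minimal sketch of the batched strategy (3) described above, assuming a staging table david, a target goliath, a key column list_id and a single payload column val (all hypothetical; the real tables, column lists and batch size differ):

    -- Sketch: upsert the staging table "david" into "goliath" in chunks,
    -- paginating the (unchanging) staging table by ORDER BY + LIMIT/OFFSET.
    DO $$
    DECLARE
        batch_size   integer := 1000;   -- strategy 3 batch size
        total        bigint;
        batch_offset bigint  := 0;
    BEGIN
        SELECT count(*) INTO total FROM david;
        WHILE batch_offset < total LOOP
            MERGE INTO goliath g
            USING (SELECT * FROM david
                   ORDER BY list_id
                   LIMIT batch_size OFFSET batch_offset) AS t
            ON t.list_id = g.list_id
            WHEN MATCHED THEN UPDATE SET val = t.val
            WHEN NOT MATCHED THEN INSERT (list_id, val) VALUES (t.list_id, t.val);
            batch_offset := batch_offset + batch_size;
        END LOOP;
    END $$;

Note that the whole DO block runs in a single transaction and snapshot; since only the staging table is paginated and it is not modified, the LIMIT/OFFSET chunks stay stable.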
On 6/17/23 15:48, Nicolas Paris wrote:
> In my use case I have a 2 billion row / 1 TB table. I have around 2 million rows of daily data to upsert, with say 50% inserts, based on the primary key, into a freshly analyzed table.
>
> I have tested multiple strategies to merge the data, all based on a first stage that copies the 2M-row dataset into a staging unlogged, indexed table:
>
> 1. Join insert then join update
> 2.1. Usage of the new MERGE statement
> 2.2. Usage of MERGE on two hash-partitioned tables with partitionwise join enabled
> 3. Usage of MERGE by batches of 1000 rows
>
> A first remark is that the MERGE statement is almost 30% faster than two statements in my benchmarks. Thanks to the pg community for this.
>
> While strategies 1 and 2.x are incredibly slow (canceled after 10 hours), the third one finishes within 30 minutes.
>

Seems pretty terrible, provided the data is on reasonable storage (with acceptable random I/O behavior).

> My interpretation, reading the query plans, is that well-sized small batches of upserts leverage the indexes, while the regular join chooses a sequential scan, including sorting and hashing, which takes forever in time and resources, including disk.

You may be right, but it's hard to tell without seeing the query plan.

> Sadly my incoming dataset is too small to benefit from a seq scan and too large to benefit from an index scan join. However, when split manually into N portions, the problem can be tackled with N * a small cost, which is cheap anyway.
>

Sounds very much like you'd benefit from tuning some cost parameters to make the index scan look cheaper.

> Questions:
> 1. Is there another strategy?
> 2. Could postgres support a "batched indexed join" itself, leveraging indexes itself with dynamically sized batches?
>

Not sure what a 'batched indexed join' would be, but it very much sounds like a nested loop with an index scan.

> It is error prone to write code to split and iterate. I suspect postgres has everything internally (index catalog, planner) to split the job itself, making David vs Goliath something trivial.
>

What PostgreSQL version are you using, and on what hardware? Did you tune it in any way, or is everything just default?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
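As a hedged illustration of the cost-parameter tuning alluded to above (settings and values are examples, not a recommendation from the thread; column names reuse the hypothetical ones from the earlier sketch):

    -- Session-level experiment only; exact values depend on the hardware.
    SET random_page_cost = 1.1;        -- default 4.0; lower values favor index scans on SSDs
    SET effective_cache_size = '6GB';  -- planner hint about cache size, not an allocation

    EXPLAIN                            -- plan only; the MERGE is not executed
    MERGE INTO goliath g
    USING david t ON t.list_id = g.list_id
    WHEN MATCHED THEN UPDATE SET val = t.val
    WHEN NOT MATCHED THEN INSERT (list_id, val) VALUES (t.list_id, t.val);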
> > My interpretation, reading the query plans, is that well-sized small
> > batches of upserts leverage the indexes, while the regular join
> > chooses a sequential scan, including sorting and hashing, which
> > takes forever in time and resources, including disk.
>
> You may be right, but it's hard to tell without seeing the query
> plan.

Here are parts of both plans:

Bad case (strategy 2.1):

    ->  Merge Left Join  (cost=530202629.03..255884257913.32 rows=17023331531230 width=579)
          Merge Cond: (david.list_id = ca.list_id)
          ->  Sort  (cost=2019172.91..2024398.82 rows=2090361 width=569)
                Sort Key: david.list_id
                ->  Append  (cost=0.00..192152.41 rows=2090361 width=569)
                      ->  Seq Scan on david_0 david_1  (cost=0.00..1812.52 rows=20852 width=569)
                      ->  Seq Scan on david_1 david_2  (cost=0.00..1800.09 rows=20709 width=569)
                      ->  Seq Scan on david_2 david_3  (cost=0.00..1794.44 rows=20644 width=569)

Good case (strategy 3):

    Merge on goliath_23 ca  (cost=2139.75..11077.17 rows=0 width=0)
      ->  Nested Loop Left Join  (cost=2139.75..11077.17 rows=1000 width=575)
            ->  Limit  (cost=2139.19..2495.67 rows=1000 width=569)
                  ->  Index Scan using david_23_list_id_account_id_idx on david_23  (cost=0.29..6794.16 rows=19058 width=569)
            ->  Index Scan using goliath_23_list_id_account_id_idx on goliath_23 ca  (cost=0.56..8.56 rows=1 width=14)
                  Index Cond: (list_id = david_23.list_id)

> Sounds very much like you'd benefit from tuning some cost parameters
> to make the index scan look cheaper.
> Not sure what a 'batched indexed join' would be, but it very much
> sounds like a nested loop with an index scan.

Agreed, a 2M-row nested loop over an index scan would likely work as well. Could tuning the cost parameters lead to such a large nested loop?

> What PostgreSQL version are you using, and on what hardware? Did you tune it
> in any way, or is everything just default?

It is pg 15.3, on 2 cores / 8GO / 2TO ssds, with the cloud provider's default parameters (RDS).
On 6/17/23 23:42, nicolas paris wrote:
>>> My interpretation, reading the query plans, is that well-sized small
>>> batches of upserts leverage the indexes, while the regular join
>>> chooses a sequential scan, including sorting and hashing, which
>>> takes forever in time and resources, including disk.
>>
>> You may be right, but it's hard to tell without seeing the query
>> plan.
>
> Here are parts of both plans:
>

I don't understand why you're sharing just a part of the plan and not the whole thing, ideally including the actual update ... Giving us the info in small pieces just means we need to guess and speculate.

> Bad case (strategy 2.1):
>
>     ->  Merge Left Join  (cost=530202629.03..255884257913.32 rows=17023331531230 width=579)
>           Merge Cond: (david.list_id = ca.list_id)
>           ->  Sort  (cost=2019172.91..2024398.82 rows=2090361 width=569)
>                 Sort Key: david.list_id
>                 ->  Append  (cost=0.00..192152.41 rows=2090361 width=569)
>                       ->  Seq Scan on david_0 david_1  (cost=0.00..1812.52 rows=20852 width=569)
>                       ->  Seq Scan on david_1 david_2  (cost=0.00..1800.09 rows=20709 width=569)
>                       ->  Seq Scan on david_2 david_3  (cost=0.00..1794.44 rows=20644 width=569)
>

Well, I kinda doubt you have 17023331531230 rows (not even physically possible with a 2TB disk), so that's immediately suspicious. I'd bet the UPDATE ... FROM ... is missing a condition or something like that, which results in a cartesian product.

> Good case (strategy 3):
>
>     Merge on goliath_23 ca  (cost=2139.75..11077.17 rows=0 width=0)
>       ->  Nested Loop Left Join  (cost=2139.75..11077.17 rows=1000 width=575)
>             ->  Limit  (cost=2139.19..2495.67 rows=1000 width=569)
>                   ->  Index Scan using david_23_list_id_account_id_idx on david_23  (cost=0.29..6794.16 rows=19058 width=569)
>             ->  Index Scan using goliath_23_list_id_account_id_idx on goliath_23 ca  (cost=0.56..8.56 rows=1 width=14)
>                   Index Cond: (list_id = david_23.list_id)
>
>> Sounds very much like you'd benefit from tuning some cost parameters
>> to make the index scan look cheaper.
>> Not sure what a 'batched indexed join' would be, but it very much
>> sounds like a nested loop with an index scan.
>
> Agreed, a 2M-row nested loop over an index scan would likely work as well.
> Could tuning the cost parameters lead to such a large nested loop?
>

It should be possible, but maybe let's first see if there are other problems in the query itself. If it's generating a cartesian product, it's pointless to tune parameters.

>> What PostgreSQL version are you using, and on what hardware? Did you tune it
>> in any way, or is everything just default?
>
> It is pg 15.3, on 2 cores / 8GO / 2TO ssds, with the cloud provider's default parameters (RDS).
>

I assume 2TO is 2TB?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> I assume 2TO is 2TB?

Yes, 2TB.

> I don't understand why you're sharing just a part of the plan

As for the nested loop plan, what I shared is the full plan. It is actually repeated many times, since the 2M rows are processed in batches of 500. I add it again:

    Merge on goliath_23 ca  (cost=2139.75..11077.17 rows=0 width=0)
      ->  Nested Loop Left Join  (cost=2139.75..11077.17 rows=1000 width=575)
            ->  Limit  (cost=2139.19..2495.67 rows=1000 width=569)
                  ->  Index Scan using david_23_list_id_account_id_idx on david_23  (cost=0.29..6794.16 rows=19058 width=569)
            ->  Index Scan using goliath_23_list_id_account_id_idx on goliath_23 ca  (cost=0.56..8.56 rows=1 width=14)
                  Index Cond: (list_id = david_23.list_id)

> Well, I kinda doubt you have 17023331531230 rows (not even physically
> possible with a 2TB disk), so that's immediately suspicious.

Below is the full plan for strategy 2.1 (indeed the plan in the previous email was truncated and wrong, sorry for that).

Note that both plans come from the same hash-partitioned table with 100 partitions, with a unique index on list_id + hash_key. For strategy 2.1, I turned on enable_partitionwise_join, since the david table has the same partitioning scheme as goliath, including the unique index. In both cases the query is:

    MERGE INTO "goliath" ca
    USING (SELECT * FROM "david" ORDER BY "list_id") AS t
    ON t."list_id" = ca."list_id"
    WHEN MATCHED THEN UPDATE SET ...
    WHEN NOT MATCHED THEN INSERT (...) VALUES (...)

except that in strategy 3 david is split by a limit/offset of 500 on each partition table, such as:

    MERGE INTO "goliath_23" ca
    USING (SELECT * FROM "david_23" ORDER BY "list_id" LIMIT 500 OFFSET 0) AS t
    ON t."list_id" = ca."list_id"
    WHEN MATCHED THEN UPDATE SET ...
    WHEN NOT MATCHED THEN INSERT (...) VALUES (...)

Merge on goliath ca (cost=178016528.81..192778842.44 rows=0 width=0) Merge on goliath_0 ca_1 Merge on goliath_1 ca_2 Merge on goliath_2 ca_3 Merge on goliath_3 ca_4 Merge on goliath_4 ca_5 Merge on goliath_5 ca_6 Merge on goliath_6 ca_7 Merge on goliath_7 ca_8 Merge on goliath_8 ca_9 Merge on goliath_9 ca_10 Merge on goliath_10 ca_11 Merge on goliath_11 ca_12 Merge on goliath_12 ca_13 Merge on goliath_13 ca_14 Merge on goliath_14 ca_15 Merge on goliath_15 ca_16 Merge on goliath_16 ca_17 Merge on goliath_17 ca_18 Merge on goliath_18 ca_19 Merge on goliath_19 ca_20 Merge on goliath_20 ca_21 Merge on goliath_21 ca_22 Merge on goliath_22 ca_23 Merge on goliath_23 ca_24 Merge on goliath_24 ca_25 Merge on goliath_25 ca_26 Merge on goliath_26 ca_27 Merge on goliath_27 ca_28 Merge on goliath_28 ca_29 Merge on goliath_29 ca_30 Merge on goliath_30 ca_31 Merge on goliath_31 ca_32 Merge on goliath_32 ca_33 Merge on goliath_33 ca_34 Merge on goliath_34 ca_35 Merge on goliath_35 ca_36 Merge on goliath_36 ca_37 Merge on goliath_37 ca_38 Merge on goliath_38 ca_39 Merge on goliath_39 ca_40 Merge on goliath_40 ca_41 Merge on goliath_41 ca_42 Merge on goliath_42 ca_43 Merge on goliath_43 ca_44 Merge on goliath_44 ca_45 Merge on goliath_45 ca_46 Merge on goliath_46 ca_47 Merge on goliath_47 ca_48 Merge on goliath_48 ca_49 Merge on goliath_49 ca_50 Merge on goliath_50 ca_51 Merge on goliath_51 ca_52 Merge on goliath_52 ca_53 Merge on goliath_53 ca_54 Merge on goliath_54 ca_55 Merge on goliath_55 ca_56 Merge on goliath_56 ca_57 Merge on goliath_57 ca_58 Merge on goliath_58 ca_59 Merge on goliath_59 ca_60 Merge on goliath_60 ca_61 Merge on goliath_61 ca_62 Merge on goliath_62 ca_63 Merge on goliath_63 ca_64 Merge on goliath_64 ca_65 Merge on goliath_65 ca_66 Merge on goliath_66 ca_67 Merge on goliath_67 ca_68 Merge on goliath_68
ca_69 Merge on goliath_69 ca_70 Merge on goliath_70 ca_71 Merge on goliath_71 ca_72 Merge on goliath_72 ca_73 Merge on goliath_73 ca_74 Merge on goliath_74 ca_75 Merge on goliath_75 ca_76 Merge on goliath_76 ca_77 Merge on goliath_77 ca_78 Merge on goliath_78 ca_79 Merge on goliath_79 ca_80 Merge on goliath_80 ca_81 Merge on goliath_81 ca_82 Merge on goliath_82 ca_83 Merge on goliath_83 ca_84 Merge on goliath_84 ca_85 Merge on goliath_85 ca_86 Merge on goliath_86 ca_87 Merge on goliath_87 ca_88 Merge on goliath_88 ca_89 Merge on goliath_89 ca_90 Merge on goliath_90 ca_91 Merge on goliath_91 ca_92 Merge on goliath_92 ca_93 Merge on goliath_93 ca_94 Merge on goliath_94 ca_95 Merge on goliath_95 ca_96 Merge on goliath_96 ca_97 Merge on goliath_97 ca_98 Merge on goliath_98 ca_99 Merge on goliath_99 ca_100 -> Hash Left Join (cost=178016528.81..192778842.44 rows=2187354 width=579) Hash Cond: (david.list_id = ca.list_id) -> Append (cost=0.00..201068.31 rows=2187354 width=569) -> Seq Scan on david_0 david_1 (cost=0.00..1926.65 rows=22165 width=569) -> Seq Scan on david_1 david_2 (cost=0.00..1809.13 rows=20813 width=569) -> Seq Scan on david_2 david_3 (cost=0.00..1812.52 rows=20852 width=569) -> Seq Scan on david_3 david_4 (cost=0.00..1648.67 rows=18967 width=569) -> Seq Scan on david_4 david_5 (cost=0.00..1853.20 rows=21320 width=569) -> Seq Scan on david_5 david_6 (cost=0.00..1735.68 rows=19968 width=569) -> Seq Scan on david_6 david_7 (cost=0.00..1693.87 rows=19487 width=569) -> Seq Scan on david_7 david_8 (cost=0.00..1872.41 rows=21541 width=569) -> Seq Scan on david_8 david_9 (cost=0.00..1827.21 rows=21021 width=569) -> Seq Scan on david_9 david_10 (cost=0.00..1815.91 rows=20891 width=569) -> Seq Scan on david_10 david_11 (cost=0.00..1757.15 rows=20215 width=569) -> Seq Scan on david_11 david_12 (cost=0.00..1624.94 rows=18694 width=569) -> Seq Scan on david_12 david_13 (cost=0.00..1867.89 rows=21489 width=569) -> Seq Scan on david_13 david_14 (cost=0.00..1979.76 rows=22776 width=569) -> Seq Scan on david_14 david_15 (cost=0.00..1706.30 rows=19630 width=569) -> Seq Scan on david_15 david_16 (cost=0.00..1828.34 rows=21034 width=569) -> Seq Scan on david_16 david_17 (cost=0.00..1702.91 rows=19591 width=569) -> Seq Scan on david_17 david_18 (cost=0.00..1805.74 rows=20774 width=569) -> Seq Scan on david_18 david_19 (cost=0.00..3531.25 rows=40625 width=569) -> Seq Scan on david_19 david_20 (cost=0.00..1522.11 rows=17511 width=569) -> Seq Scan on david_20 david_21 (cost=0.00..1950.38 rows=22438 width=569) -> Seq Scan on david_21 david_22 (cost=0.00..1957.16 rows=22516 width=569) -> Seq Scan on david_22 david_23 (cost=0.00..1745.85 rows=20085 width=569) -> Seq Scan on david_23 david_24 (cost=0.00..1730.03 rows=19903 width=569) -> Seq Scan on david_24 david_25 (cost=0.00..1784.27 rows=20527 width=569) -> Seq Scan on david_25 david_26 (cost=0.00..1698.39 rows=19539 width=569) -> Seq Scan on david_26 david_27 (cost=0.00..1900.66 rows=21866 width=569) -> Seq Scan on david_27 david_28 (cost=0.00..1813.65 rows=20865 width=569) -> Seq Scan on david_28 david_29 (cost=0.00..2009.14 rows=23114 width=569) -> Seq Scan on david_29 david_30 (cost=0.00..1778.62 rows=20462 width=569) -> Seq Scan on david_30 david_31 (cost=0.00..1779.75 rows=20475 width=569) -> Seq Scan on david_31 david_32 (cost=0.00..1892.75 rows=21775 width=569) -> Seq Scan on david_32 david_33 (cost=0.00..1988.80 rows=22880 width=569) -> Seq Scan on david_33 david_34 (cost=0.00..1804.61 rows=20761 width=569) -> Seq Scan on david_34 david_35 
(cost=0.00..1857.72 rows=21372 width=569) -> Seq Scan on david_35 david_36 (cost=0.00..1782.01 rows=20501 width=569) -> Seq Scan on david_36 david_37 (cost=0.00..2352.66 rows=27066 width=569) -> Seq Scan on david_37 david_38 (cost=0.00..1962.81 rows=22581 width=569) -> Seq Scan on david_38 david_39 (cost=0.00..2002.36 rows=23036 width=569) -> Seq Scan on david_39 david_40 (cost=0.00..1852.07 rows=21307 width=569) -> Seq Scan on david_40 david_41 (cost=0.00..2116.49 rows=24349 width=569) -> Seq Scan on david_41 david_42 (cost=0.00..1785.40 rows=20540 width=569) -> Seq Scan on david_42 david_43 (cost=0.00..1838.51 rows=21151 width=569) -> Seq Scan on david_43 david_44 (cost=0.00..1931.17 rows=22217 width=569) -> Seq Scan on david_44 david_45 (cost=0.00..1878.06 rows=21606 width=569) -> Seq Scan on david_45 david_46 (cost=0.00..1859.98 rows=21398 width=569) -> Seq Scan on david_46 david_47 (cost=0.00..1882.58 rows=21658 width=569) -> Seq Scan on david_47 david_48 (cost=0.00..1791.05 rows=20605 width=569) -> Seq Scan on david_48 david_49 (cost=0.00..1925.52 rows=22152 width=569) -> Seq Scan on david_49 david_50 (cost=0.00..1953.77 rows=22477 width=569) -> Seq Scan on david_50 david_51 (cost=0.00..1797.83 rows=20683 width=569) -> Seq Scan on david_51 david_52 (cost=0.00..1680.31 rows=19331 width=569) -> Seq Scan on david_52 david_53 (cost=0.00..1626.07 rows=18707 width=569) -> Seq Scan on david_53 david_54 (cost=0.00..2003.49 rows=23049 width=569) -> Seq Scan on david_54 david_55 (cost=0.00..1771.84 rows=20384 width=569) -> Seq Scan on david_55 david_56 (cost=0.00..1700.65 rows=19565 width=569) -> Seq Scan on david_56 david_57 (cost=0.00..1931.17 rows=22217 width=569) -> Seq Scan on david_57 david_58 (cost=0.00..1833.99 rows=21099 width=569) -> Seq Scan on david_58 david_59 (cost=0.00..1918.74 rows=22074 width=569) -> Seq Scan on david_59 david_60 (cost=0.00..1885.97 rows=21697 width=569) -> Seq Scan on david_60 david_61 (cost=0.00..4095.12 rows=47112 width=569) -> Seq Scan on david_61 david_62 (cost=0.00..2076.94 rows=23894 width=569) -> Seq Scan on david_62 david_63 (cost=0.00..2876.98 rows=33098 width=569) -> Seq Scan on david_63 david_64 (cost=0.00..1647.54 rows=18954 width=569) -> Seq Scan on david_64 david_65 (cost=0.00..1653.19 rows=19019 width=569) -> Seq Scan on david_65 david_66 (cost=0.00..1684.83 rows=19383 width=569) -> Seq Scan on david_66 david_67 (cost=0.00..1863.37 rows=21437 width=569) -> Seq Scan on david_67 david_68 (cost=0.00..1717.60 rows=19760 width=569) -> Seq Scan on david_68 david_69 (cost=0.00..1847.55 rows=21255 width=569) -> Seq Scan on david_69 david_70 (cost=0.00..2235.14 rows=25714 width=569) -> Seq Scan on david_70 david_71 (cost=0.00..2273.56 rows=26156 width=569) -> Seq Scan on david_71 david_72 (cost=0.00..1745.85 rows=20085 width=569) -> Seq Scan on david_72 david_73 (cost=0.00..1861.11 rows=21411 width=569) -> Seq Scan on david_73 david_74 (cost=0.00..1856.59 rows=21359 width=569) -> Seq Scan on david_74 david_75 (cost=0.00..1885.97 rows=21697 width=569) -> Seq Scan on david_75 david_76 (cost=0.00..1665.62 rows=19162 width=569) -> Seq Scan on david_76 david_77 (cost=0.00..1870.15 rows=21515 width=569) -> Seq Scan on david_77 david_78 (cost=0.00..1776.36 rows=20436 width=569) -> Seq Scan on david_78 david_79 (cost=0.00..1766.19 rows=20319 width=569) -> Seq Scan on david_79 david_80 (cost=0.00..1812.52 rows=20852 width=569) -> Seq Scan on david_80 david_81 (cost=0.00..1995.58 rows=22958 width=569) -> Seq Scan on david_81 david_82 (cost=0.00..1701.78 
rows=19578 width=569) -> Seq Scan on david_82 david_83 (cost=0.00..1658.84 rows=19084 width=569) -> Seq Scan on david_83 david_84 (cost=0.00..1840.77 rows=21177 width=569) -> Seq Scan on david_84 david_85 (cost=0.00..1688.22 rows=19422 width=569) -> Seq Scan on david_85 david_86 (cost=0.00..1918.74 rows=22074 width=569) -> Seq Scan on david_86 david_87 (cost=0.00..2963.99 rows=34099 width=569) -> Seq Scan on david_87 david_88 (cost=0.00..2075.81 rows=23881 width=569) -> Seq Scan on david_88 david_89 (cost=0.00..1783.14 rows=20514 width=569) -> Seq Scan on david_89 david_90 (cost=0.00..1765.06 rows=20306 width=569) -> Seq Scan on david_90 david_91 (cost=0.00..1950.38 rows=22438 width=569) -> Seq Scan on david_91 david_92 (cost=0.00..1840.77 rows=21177 width=569) -> Seq Scan on david_92 david_93 (cost=0.00..1783.14 rows=20514 width=569) -> Seq Scan on david_93 david_94 (cost=0.00..1705.17 rows=19617 width=569) -> Seq Scan on david_94 david_95 (cost=0.00..1817.04 rows=20904 width=569) -> Seq Scan on david_95 david_96 (cost=0.00..1977.50 rows=22750 width=569) -> Seq Scan on david_96 david_97 (cost=0.00..1946.99 rows=22399 width=569) -> Seq Scan on david_97 david_98 (cost=0.00..1814.78 rows=20878 width=569) -> Seq Scan on david_98 david_99 (cost=0.00..1844.16 rows=21216 width=569) -> Seq Scan on david_99 david_100 (cost=0.00..1769.58 rows=20358 width=569) -> Hash (cost=147650973.90..147650973.90 rows=1653953593 width=18) -> Append (cost=0.00..147650973.90 rows=1653953593 width=18) -> Seq Scan on goliath_0 ca_1 (cost=0.00..11177255.56 rows=150300456 width=18) -> Seq Scan on goliath_1 ca_2 (cost=0.00..1234238.22 rows=14780522 width=18) -> Seq Scan on goliath_2 ca_3 (cost=0.00..1323160.42 rows=15336142 width=18) -> Seq Scan on goliath_3 ca_4 (cost=0.00..1256666.46 rows=15029146 width=18) -> Seq Scan on goliath_4 ca_5 (cost=0.00..1272324.75 rows=15157175 width=18) -> Seq Scan on goliath_5 ca_6 (cost=0.00..1270349.37 rows=15044037 width=18) -> Seq Scan on goliath_6 ca_7 (cost=0.00..1284810.74 rows=15261474 width=18) -> Seq Scan on goliath_7 ca_8 (cost=0.00..1263715.41 rows=15020741 width=18) -> Seq Scan on goliath_8 ca_9 (cost=0.00..1265121.73 rows=14953673 width=18) -> Seq Scan on goliath_9 ca_10 (cost=0.00..1309331.70 rows=15314570 width=18) -> Seq Scan on goliath_10 ca_11 (cost=0.00..1269041.02 rows=15086702 width=18) -> Seq Scan on goliath_11 ca_12 (cost=0.00..1268299.98 rows=15042498 width=18) -> Seq Scan on goliath_12 ca_13 (cost=0.00..1294069.08 rows=15206708 width=18) -> Seq Scan on goliath_13 ca_14 (cost=0.00..1344155.97 rows=15480897 width=18) -> Seq Scan on goliath_14 ca_15 (cost=0.00..1258529.41 rows=15007641 width=18) -> Seq Scan on goliath_15 ca_16 (cost=0.00..1247612.99 rows=14801699 width=18) -> Seq Scan on goliath_16 ca_17 (cost=0.00..1398973.15 rows=15833115 width=18) -> Seq Scan on goliath_17 ca_18 (cost=0.00..1234430.89 rows=14907189 width=18) -> Seq Scan on goliath_18 ca_19 (cost=0.00..1491068.89 rows=16395989 width=18) -> Seq Scan on goliath_19 ca_20 (cost=0.00..1241254.74 rows=14743874 width=18) -> Seq Scan on goliath_20 ca_21 (cost=0.00..1308537.31 rows=15139031 width=18) -> Seq Scan on goliath_21 ca_22 (cost=0.00..1274257.03 rows=15133903 width=18) -> Seq Scan on goliath_22 ca_23 (cost=0.00..1348582.00 rows=15415900 width=18) -> Seq Scan on goliath_23 ca_24 (cost=0.00..1245613.99 rows=14770499 width=18) -> Seq Scan on goliath_24 ca_25 (cost=0.00..1232552.90 rows=14592890 width=18) -> Seq Scan on goliath_25 ca_26 (cost=0.00..1237272.86 rows=14785186 width=18) -> Seq Scan on 
goliath_26 ca_27 (cost=0.00..1395925.80 rows=15794380 width=18) -> Seq Scan on goliath_27 ca_28 (cost=0.00..1243112.59 rows=14888659 width=18) -> Seq Scan on goliath_28 ca_29 (cost=0.00..1261176.05 rows=15014705 width=18) -> Seq Scan on goliath_29 ca_30 (cost=0.00..1375287.81 rows=15912981 width=18) -> Seq Scan on goliath_30 ca_31 (cost=0.00..1236320.19 rows=14789519 width=18) -> Seq Scan on goliath_31 ca_32 (cost=0.00..1278375.97 rows=15203897 width=18) -> Seq Scan on goliath_32 ca_33 (cost=0.00..1263550.43 rows=14860643 width=18) -> Seq Scan on goliath_33 ca_34 (cost=0.00..1299830.39 rows=15186239 width=18) -> Seq Scan on goliath_34 ca_35 (cost=0.00..1352761.61 rows=15664361 width=18) -> Seq Scan on goliath_35 ca_36 (cost=0.00..1323724.61 rows=15543061 width=18) -> Seq Scan on goliath_36 ca_37 (cost=0.00..1508080.05 rows=16098705 width=18) -> Seq Scan on goliath_37 ca_38 (cost=0.00..1247180.20 rows=15092220 width=18) -> Seq Scan on goliath_38 ca_39 (cost=0.00..1265121.63 rows=14913063 width=18) -> Seq Scan on goliath_39 ca_40 (cost=0.00..1263161.74 rows=15008274 width=18) -> Seq Scan on goliath_40 ca_41 (cost=0.00..1422028.56 rows=15874056 width=18) -> Seq Scan on goliath_41 ca_42 (cost=0.00..1276259.24 rows=15052824 width=18) -> Seq Scan on goliath_42 ca_43 (cost=0.00..1331700.23 rows=15499623 width=18) -> Seq Scan on goliath_43 ca_44 (cost=0.00..1246053.10 rows=14665010 width=18) -> Seq Scan on goliath_44 ca_45 (cost=0.00..1275255.85 rows=15143785 width=18) -> Seq Scan on goliath_45 ca_46 (cost=0.00..1305362.83 rows=15361783 width=18) -> Seq Scan on goliath_46 ca_47 (cost=0.00..1280577.37 rows=15247837 width=18) -> Seq Scan on goliath_47 ca_48 (cost=0.00..1251285.15 rows=14806015 width=18) -> Seq Scan on goliath_48 ca_49 (cost=0.00..1351232.48 rows=15433548 width=18) -> Seq Scan on goliath_49 ca_50 (cost=0.00..1347924.50 rows=15296550 width=18) -> Seq Scan on goliath_50 ca_51 (cost=0.00..1357086.54 rows=15541854 width=18) -> Seq Scan on goliath_51 ca_52 (cost=0.00..1216370.11 rows=14562711 width=18) -> Seq Scan on goliath_52 ca_53 (cost=0.00..1358864.91 rows=15543891 width=18) -> Seq Scan on goliath_53 ca_54 (cost=0.00..1303103.21 rows=15233521 width=18) -> Seq Scan on goliath_54 ca_55 (cost=0.00..1247450.04 rows=14984504 width=18) -> Seq Scan on goliath_55 ca_56 (cost=0.00..1265316.35 rows=14879835 width=18) -> Seq Scan on goliath_56 ca_57 (cost=0.00..1256864.72 rows=14942272 width=18) -> Seq Scan on goliath_57 ca_58 (cost=0.00..1234443.50 rows=14857950 width=18) -> Seq Scan on goliath_58 ca_59 (cost=0.00..1293245.96 rows=15297596 width=18) -> Seq Scan on goliath_59 ca_60 (cost=0.00..1234137.56 rows=14820356 width=18) -> Seq Scan on goliath_60 ca_61 (cost=0.00..1561333.77 rows=16903277 width=18) -> Seq Scan on goliath_61 ca_62 (cost=0.00..1289386.58 rows=15273158 width=18) -> Seq Scan on goliath_62 ca_63 (cost=0.00..1375996.18 rows=15783618 width=18) -> Seq Scan on goliath_63 ca_64 (cost=0.00..1318835.42 rows=15393842 width=18) -> Seq Scan on goliath_64 ca_65 (cost=0.00..1279811.42 rows=15025442 width=18) -> Seq Scan on goliath_65 ca_66 (cost=0.00..1238623.43 rows=14795243 width=18) -> Seq Scan on goliath_66 ca_67 (cost=0.00..1305470.50 rows=15230950 width=18) -> Seq Scan on goliath_67 ca_68 (cost=0.00..1261104.19 rows=14894319 width=18) -> Seq Scan on goliath_68 ca_69 (cost=0.00..1283365.18 rows=15239718 width=18) -> Seq Scan on goliath_69 ca_70 (cost=0.00..1314101.63 rows=15292263 width=18) -> Seq Scan on goliath_70 ca_71 (cost=0.00..1253906.46 rows=14978446 width=18) -> Seq Scan on 
goliath_71 ca_72 (cost=0.00..1294493.71 rows=15331171 width=18) -> Seq Scan on goliath_72 ca_73 (cost=0.00..1286198.09 rows=15418009 width=18) -> Seq Scan on goliath_73 ca_74 (cost=0.00..1289391.43 rows=15229243 width=18) -> Seq Scan on goliath_74 ca_75 (cost=0.00..1242385.13 rows=14670113 width=18) -> Seq Scan on goliath_75 ca_76 (cost=0.00..1263916.51 rows=14982751 width=18) -> Seq Scan on goliath_76 ca_77 (cost=0.00..1294899.69 rows=15200869 width=18) -> Seq Scan on goliath_77 ca_78 (cost=0.00..1269406.36 rows=14930436 width=18) -> Seq Scan on goliath_78 ca_79 (cost=0.00..1235383.65 rows=14800365 width=18) -> Seq Scan on goliath_79 ca_80 (cost=0.00..1236916.73 rows=14710273 width=18) -> Seq Scan on goliath_80 ca_81 (cost=0.00..1421834.11 rows=16753911 width=18) -> Seq Scan on goliath_81 ca_82 (cost=0.00..1335250.54 rows=15351354 width=18) -> Seq Scan on goliath_82 ca_83 (cost=0.00..1243153.05 rows=14688205 width=18) -> Seq Scan on goliath_83 ca_84 (cost=0.00..1250345.33 rows=14741833 width=18) -> Seq Scan on goliath_84 ca_85 (cost=0.00..1301004.00 rows=14956400 width=18) -> Seq Scan on goliath_85 ca_86 (cost=0.00..1289135.50 rows=15009950 width=18) -> Seq Scan on goliath_86 ca_87 (cost=0.00..1488060.65 rows=16380865 width=18) -> Seq Scan on goliath_87 ca_88 (cost=0.00..1265255.27 rows=15004127 width=18) -> Seq Scan on goliath_88 ca_89 (cost=0.00..1291141.46 rows=15112446 width=18) -> Seq Scan on goliath_89 ca_90 (cost=0.00..1236365.37 rows=14733237 width=18) -> Seq Scan on goliath_90 ca_91 (cost=0.00..1276548.94 rows=14938294 width=18) -> Seq Scan on goliath_91 ca_92 (cost=0.00..1375334.48 rows=16060648 width=18) -> Seq Scan on goliath_92 ca_93 (cost=0.00..1257210.14 rows=14934014 width=18) -> Seq Scan on goliath_93 ca_94 (cost=0.00..1268048.19 rows=14998419 width=18) -> Seq Scan on goliath_94 ca_95 (cost=0.00..1277075.92 rows=15219292 width=18) -> Seq Scan on goliath_95 ca_96 (cost=0.00..1351108.41 rows=15454141 width=18) -> Seq Scan on goliath_96 ca_97 (cost=0.00..1283608.49 rows=15091049 width=18) -> Seq Scan on goliath_97 ca_98 (cost=0.00..1277019.32 rows=15265132 width=18) -> Seq Scan on goliath_98 ca_99 (cost=0.00..1258056.96 rows=15006496 width=18) -> Seq Scan on goliath_99 ca_100 (cost=0.00..1221325.89 rows=14612389 width=18)
I came here to talk about partitionwise join, but then noticed you have already thought of that:

On 2023-Jun-18, nicolas paris wrote:
> Note that both plans come from the same hash-partitioned table with
> 100 partitions, with a unique index on list_id + hash_key. For strategy
> 2.1, I turned on enable_partitionwise_join, since the david table has the
> same partitioning scheme as goliath, including the unique index. In both
> cases the query is:

Hmm, I suppose the reason partitionwise join isn't having any effect is that the presence of WHEN NOT MATCHED clauses forces an outer join, which probably disarms partitionwise joining, since each join pair would have to match for nulls, so there would be two matching partitions at the other end. A quick test for this hypothesis might be to try the MERGE without the WHEN NOT MATCHED clauses and see if partitionwise join works better.

Maybe Tom L's new outer-join infrastructure in 16 allows improving on this, not sure.

--
Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"The gods do not protect fools. Fools are protected by more capable fools" (Luis Wu, Ringworld)
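A minimal sketch of the quick test suggested above, assuming the hypothetical column names used earlier; the idea is to compare the plan of an update-only MERGE against the full upsert and check whether per-partition joins appear:

    -- Sketch: does partitionwise join appear once the NOT MATCHED arm is removed?
    SET enable_partitionwise_join = true;

    EXPLAIN
    MERGE INTO goliath ca
    USING david t ON t.list_id = ca.list_id
    WHEN MATCHED THEN UPDATE SET val = t.val;   -- no WHEN NOT MATCHED clause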
On 6/18/23 22:57, nicolas paris wrote:
>> ...
>> Well, I kinda doubt you have 17023331531230 rows (not even physically
>> possible with a 2TB disk), so that's immediately suspicious.
>
> Below is the full plan for strategy 2.1 (indeed the plan in the previous
> email was truncated and wrong, sorry for that).
>

None of the plans has estimates anywhere close to 17023331531230, so where did that come from?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6/19/23 09:46, Alvaro Herrera wrote:
> I came here to talk about partitionwise join, but then noticed you have
> already thought of that:
>
> On 2023-Jun-18, nicolas paris wrote:
>
>> Note that both plans come from the same hash-partitioned table with
>> 100 partitions, with a unique index on list_id + hash_key. For strategy
>> 2.1, I turned on enable_partitionwise_join, since the david table has the
>> same partitioning scheme as goliath, including the unique index. In both
>> cases the query is:
>
> Hmm, I suppose the reason partitionwise join isn't having any effect is
> that the presence of WHEN NOT MATCHED clauses forces an outer join, which
> probably disarms partitionwise joining, since each join pair would
> have to match for nulls, so there would be two matching partitions at
> the other end. A quick test for this hypothesis might be to try the
> MERGE without the WHEN NOT MATCHED clauses and see if partitionwise join
> works better.
>
> Maybe Tom L's new outer-join infrastructure in 16 allows improving on
> this, not sure.
>

Not sure why that would disarm partitionwise join - attached is a simple reproducer, generating two tables, loading 10000000 and 10000 rows into them, and then doing explain on a simple merge.

IMHO the thing that breaks it is the ORDER BY in the merge, which likely acts as an optimization fence and prevents all sorts of smart things including the partitionwise join. I'd bet that if Nicolas replaces

    MERGE INTO "goliath" ca
    USING (SELECT * FROM "david" ORDER BY "list_id") AS t
    ...

with

    MERGE INTO "goliath" ca
    USING "david" AS t
    ...

it'll start working much better.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
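The attached script is not shown in the archive; the following is only a rough sketch of the kind of reproducer described above (two hash-partitioned tables, a large and a small load, then EXPLAIN of a simple merge), with hypothetical names and layout:

    -- Sketch only, not the actual attachment.
    CREATE TABLE goliath (id bigint PRIMARY KEY, val text) PARTITION BY HASH (id);
    CREATE TABLE david   (id bigint PRIMARY KEY, val text) PARTITION BY HASH (id);

    DO $$
    BEGIN
      FOR i IN 0..99 LOOP
        EXECUTE format('CREATE TABLE goliath_%s PARTITION OF goliath FOR VALUES WITH (MODULUS 100, REMAINDER %s)', i, i);
        EXECUTE format('CREATE TABLE david_%s PARTITION OF david FOR VALUES WITH (MODULUS 100, REMAINDER %s)', i, i);
      END LOOP;
    END $$;

    INSERT INTO goliath SELECT i, md5(i::text) FROM generate_series(1, 10000000) i;
    INSERT INTO david   SELECT i, md5(i::text) FROM generate_series(1, 10000) i;
    ANALYZE goliath, david;

    SET enable_partitionwise_join = true;
    EXPLAIN
    MERGE INTO goliath g USING david d ON d.id = g.id
    WHEN MATCHED THEN UPDATE SET val = d.val
    WHEN NOT MATCHED THEN INSERT (id, val) VALUES (d.id, d.val);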
On Mon, 2023-06-19 at 13:34 +0200, Tomas Vondra wrote:
>
> On 6/18/23 22:57, nicolas paris wrote:
>>> ...
>>> Well, I kinda doubt you have 17023331531230 rows (not even physically
>>> possible with a 2TB disk), so that's immediately suspicious.
>>
>> Below is the full plan for strategy 2.1 (indeed the plan in the previous
>> email was truncated and wrong, sorry for that).
>>
>
> None of the plans has estimates anywhere close to 17023331531230, so
> where did that come from?
>

Well, that was an old plan from a run where there was an issue: the david table did not have the same partitioning scheme as goliath. It was partitioned by another column.
> IMHO the thing that breaks it is the ORDER BY in the merge, which likely
> acts as an optimization fence and prevents all sorts of smart things
> including the partitionwise join. I'd bet that if Nicolas replaces
>
> MERGE INTO "goliath" ca
> USING (SELECT * FROM "david" ORDER BY "list_id") AS t
> ...

Sorry if it was not clear, but there is no ORDER BY in the 2.1 strategy, so this cannot be the reason the optimization is not triggered.

For information, I do enable the partitionwise join feature with a JDBC call just before the merge:

    set enable_partitionwise_join=true
On 6/19/23 14:20, nicolas paris wrote:
>> IMHO the thing that breaks it is the ORDER BY in the merge, which likely
>> acts as an optimization fence and prevents all sorts of smart things
>> including the partitionwise join. I'd bet that if Nicolas replaces
>>
>> MERGE INTO "goliath" ca
>> USING (SELECT * FROM "david" ORDER BY "list_id") AS t
>> ...
>
> Sorry if it was not clear, but there is no ORDER BY in the 2.1 strategy,
> so this cannot be the reason the optimization is not triggered.
>
> For information, I do enable the partitionwise join feature with a JDBC call
> just before the merge:
>     set enable_partitionwise_join=true
>

But you wrote that in both cases the query is:

    MERGE INTO "goliath" ca
    USING (SELECT * FROM "david" ORDER BY "list_id") AS t
    ON t."list_id" = ca."list_id"
    WHEN MATCHED THEN UPDATE SET ...
    WHEN NOT MATCHED THEN INSERT (...) VALUES (...)

With all due respect, I'm getting a bit tired of having to speculate about what exactly you're doing etc. based on bits of information. I'm willing to continue to investigate, but only if you prepare a reproducer, i.e. a SQL script that demonstrates the issue - I don't think preparing that should be difficult; something like the SQL script I shared earlier today should do the trick.

I suggest you do that directly, not through JDBC. Perhaps the JDBC connection pool does something funny (like connection pooling and resetting settings).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
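If the JDBC path is kept anyway, one simple sanity check (an illustration, not from the original thread) is to confirm, on the very same connection and right before issuing the MERGE, that the setting survived any pooler reset:

    -- Run on the same connection/session that will execute the MERGE.
    SHOW enable_partitionwise_join;                       -- expect: on
    SELECT current_setting('enable_partitionwise_join');  -- expect: on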
> But you wrote that in both cases the query is:

That was indeed yet another typo; I hope to do better in the future.

> I'm willing to continue to investigate, but only if you prepare a
> reproducer,

Thanks for your starter script. Please find attached 2 scripts which now illustrate two troubles.

repro1.sql is a slight evolution of yours. When I play with the david size (as described in the comments) you will see the plan going from a nested loop to a sequential scan. Also note that the partitionwise join is likely working. This illustrates my initial problem: the sequential scan is not going to work fine on my workload (david too large). How can I suggest to postgres to use a nested loop here? (I suspect playing with costs should help.)

repro2.sql: now I changed the table layout (similar to my setup) to reproduce the partitionwise join which does not trigger. I added a partition column, and a unique index to be able to mimic a primary key. Now partitionwise join (in my local docker vanilla postgres 15.3) does not work. Eventually, if I do small batches, then the merge is working fast. That's another, unrelated problem.

> I suggest you do that directly, not through JDBC. Perhaps the JDBC
> connection pool does something funny (like connection pooling and
> resetting settings).

I can tell JDBC was working, and likely the reason might be in my current table setup.
Attachment
On 6/19/23 17:45, nicolas paris wrote:
>> But you wrote that in both cases the query is:
>
> That was indeed yet another typo; I hope to do better in the future.
>
>> I'm willing to continue to investigate, but only if you prepare a
>> reproducer,
>
> Thanks for your starter script. Please find attached 2 scripts which
> now illustrate two troubles.
>
> repro1.sql is a slight evolution of yours. When I play with the david size
> (as described in the comments) you will see the plan going from a nested loop
> to a sequential scan. Also note that the partitionwise join is likely
> working. This illustrates my initial problem: the sequential scan is not
> going to work fine on my workload (david too large). How can I suggest to
> postgres to use a nested loop here? (I suspect playing with costs should
> help.)
>

In general, this behavior is expected. The overall idea is that nested loops are efficient for small row counts, but the cost rises quickly exactly because they do a lot of random I/O (due to the index scan). At some point it's cheaper to switch to a plan that does sequential I/O. Which is exactly what's happening here - we switch from a plan doing a lot of random I/O on goliath

                              QUERY PLAN
    ----------------------------------------------------------------------
     Merge on goliath  (cost=0.29..7888.69 rows=0 width=0)
       Merge on goliath_0 goliath_1
       ...
       Merge on goliath_99 goliath_100
       ->  Append  (cost=0.29..7888.69 rows=9998 width=47)
             ->  Nested Loop Left Join  (cost=0.29..93.89 rows=120 ...
                   ->  Seq Scan on david_0 david_1  (cost=0.00..2.20 ...
                   ->  Index Scan using goliath_0_pkey on goliath_0 ...
                         Index Cond: (id = david_1.id)
             ->  Nested Loop Left Join  (cost=0.29..77.10 rows=98 ...
                   ->  Seq Scan on david_1 david_2  (cost=0.00..1.98 ...
                   ->  Index Scan using goliath_1_pkey on goliath_1 ...
                         Index Cond: (id = david_2.id)
             ...
             ->  Nested Loop Left Join  (cost=0.29..74.58 rows=95 ...
                   ->  Seq Scan on david_99 david_100  (cost=0.00..1.95
                   ->  Index Scan using goliath_99_pkey on goliath_99 ...
                         Index Cond: (id = david_100.id)
    (502 rows)

to a plan that does a lot of sequential I/O

                              QUERY PLAN
    ----------------------------------------------------------------------
     Merge on goliath  (cost=293.44..264556.47 rows=0 width=0)
       Merge on goliath_0 goliath_1
       ...
       Merge on goliath_99 goliath_100
       ->  Append  (cost=293.44..264556.47 rows=951746 width=47)
             ->  Hash Right Join  (cost=293.44..2597.05 rows=9486 ...
                   Hash Cond: (goliath_1.id = david_1.id)
                   ->  Seq Scan on goliath_0 goliath_1  (cost=0.00..
                   ->  Hash  (cost=174.86..174.86 rows=9486 width=37)
                         ->  Seq Scan on david_0 david_1  (cost=0.00..
             ->  Hash Right Join  (cost=295.62..2613.90 rows=9583 width=
                   Hash Cond: (goliath_2.id = david_2.id)
                   ->  Seq Scan on goliath_1 goliath_2  (cost=0.00..1845
                   ->  Hash  (cost=175.83..175.83 rows=9583 width=37)
                         ->  Seq Scan on david_1 david_2  (cost=0.00..
             ...
             ->  Hash Right Join  (cost=288.33..2593.16 rows=9348 width
                   Hash Cond: (goliath_100.id = david_100.id)
                   ->  Seq Scan on goliath_99 goliath_100  (cost=0.00..
                   ->  Hash  (cost=171.48..171.48 rows=9348 width=37)
                         ->  Seq Scan on david_99 david_100  (cost=0.00..
    (602 rows)

That's expected, because the cost if I force the Nested Loop with the higher row count in "david" looks like this:

                              QUERY PLAN
    ----------------------------------------------------------------------
     Merge on goliath  (cost=0.29..331253.00 rows=0 width=0)
     ...

Of course, the question is at which point the switch should happen. You can try setting enable_hashjoin=off, which should push the optimizer to use the first plan. If it does, you'll know the cost difference between the two plans.
If you run it and it's actually faster than the "default" plan with the hash join/seqscans, you can try lowering random_page_cost, which is likely the main input. The default is 4; in general it should be higher than seq_page_cost=1.0 (because random I/O is more expensive).

In any case, there's a point when the nested loops get terrible. I mean, imagine "david" has 100000 rows, and "goliath" has 100000 pages (800MB). It's just cheaper to seqscan the 100k pages than to randomly scan the same 100k pages. You can tune where the plan flips, ofc.

>
> repro2.sql: now I changed the table layout (similar to my setup) to
> reproduce the partitionwise join which does not trigger. I added a
> partition column, and a unique index to be able to mimic a primary key.
> Now partitionwise join (in my local docker vanilla postgres 15.3) does not
> work. Eventually, if I do small batches, then the merge is working fast.
> That's another, unrelated problem.
>

This is absolutely expected. If you partition by hash (id, part_col), you can't join on (id) alone and expect partitionwise join to work. To quote the enable_partitionwise_join documentation [1]:

    Enables or disables the query planner's use of partitionwise join,
    which allows a join between partitioned tables to be performed by
    joining the matching partitions. Partitionwise join currently
    applies only when the join conditions include all the partition
    keys, which must be of the same data type and have one-to-one
    matching sets of child partitions.

So the fact that

    merge into goliath using david on david.id = goliath.id
    when matched then update set val = david.val
    when not matched then insert (id, val) values (david.id, david.val);

does not work is absolutely expected. You need to join on part_col too. This is because the partition is determined by a hash of both columns, a bit like md5(id+part_col), so knowing just the "id" is not enough to determine the partition.

Even if you fix that, the last query won't use partitionwise join because of the ORDER BY subquery, which serves as an optimization fence (which means the merge does not actually see that the underlying table is partitioned).

If you get rid of that and add the part_col to the join, it translates to the first issue, i.e. setting costs so the plan flips to the sequential scan at the right point.

[1] https://www.postgresql.org/docs/15/runtime-config-query.html#GUC-ENABLE-PARTITIONWISE-JOIN

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
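A hedged sketch of the corrected join condition (column names follow the repro2 description above; val is assumed): with all partition key columns in the ON clause, partitionwise join becomes possible, and the plan flip can then be explored with the cost knobs mentioned earlier.

    -- Sketch: include every partition key column in the join condition.
    SET enable_partitionwise_join = true;

    EXPLAIN
    MERGE INTO goliath g
    USING david d
      ON d.id = g.id AND d.part_col = g.part_col     -- all partition keys joined
    WHEN MATCHED THEN UPDATE SET val = d.val
    WHEN NOT MATCHED THEN INSERT (id, part_col, val) VALUES (d.id, d.part_col, d.val);

    -- Optional experiments to see where the nested-loop vs. seqscan flip happens:
    -- SET enable_hashjoin = off;    -- force the nested-loop plan for comparison
    -- SET random_page_cost = 1.1;   -- make index scans look cheaper on SSDs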
> This is absolutely expected. If you partition by hash (id, part_col),
> you can't join on (id) alone and expect partitionwise join to work. To quote
> the enable_partitionwise_join documentation [1]:
>
>     Enables or disables the query planner's use of partitionwise join,
>     which allows a join between partitioned tables to be performed by
>     joining the matching partitions. Partitionwise join currently
>     applies only when the join conditions include all the partition
>     keys, which must be of the same data type and have one-to-one
>     matching sets of child partitions.
>
> So the fact that
>
>     merge into goliath using david on david.id = goliath.id
>     when matched then update set val = david.val
>     when not matched then insert (id, val) values (david.id, david.val);
>
> does not work is absolutely expected. You need to join on part_col too.

Definitely, it makes sense to add the part_col to the join columns. It also helps the planner choose a better plan, since now it goes with a per-partition nested loop without having to trick the costs (either enable_hashjoin or random_page_cost), with my current workload so far.

Thank you

goliath -- david
On 6/20/23 12:02, nicolas paris wrote:
> ...
>
> Definitely, it makes sense to add the part_col to the join columns.
> It also helps the planner choose a better plan, since now it goes
> with a per-partition nested loop without having to trick the costs
> (either enable_hashjoin or random_page_cost), with my current workload
> so far.
>

Right. With a non-partitionwise join, the nestloop inner lookup has to do an index scan on every partition (it can't decide which of the partitions will have a match, and for costing we assume there's at least 1 row in each lookup). Which essentially amplifies the amount of random I/O by a factor of 100x (or whatever the number of partitions is).

That is, instead of doing 100 nested loops like this:

    ->  Nested Loop Left Join  (cost=0.29..33.42 rows=8 width=47)
          ->  Seq Scan on david_98 david_99  (cost=0.00..1.08
          ->  Index Scan using goliath_98_id_part_col_idx on
                Index Cond: ((id = david_99.id) AND ...)

we end up doing one nested loop with an inner lookup like this:

    ->  Append  (cost=0.29..557.63 rows=100 width=14)
          ->  Index Scan using ... goliath_1  (cost=0.29..5.57 ...
                Index Cond: (id = david.id)
          ...

And this is per loop, of which there will be 500 (because the small david table has 500 rows).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company