Thread: Merge David and Goliath tables efficiently
In my use case I have a 2 billion row / 1 TB table. I have around 2 million rows of daily data to upsert, with say 50% inserts, based on the primary key, into a freshly analyzed table.

I have tested multiple strategies to merge the data, all based on a first stage that copies the 2M-row dataset into a staging unlogged, indexed table:

1. Join insert then join update
2.1. Usage of the new MERGE statement
2.2. Usage of MERGE on two hash-partitioned tables with partitionwise join enabled
3. Usage of MERGE by batches of 1000 rows

A first remark is that the MERGE statement is almost 30% faster than two statements in my benchmarks. Thanks to the pg community for this.

While strategies 1 and 2.x are incredibly slow (canceled after 10 hours), the third one finishes within 30 minutes.

My interpretation, reading the query plans, is that well-sized small batches of upserts leverage the indexes, while the regular join chooses a sequential scan, including sorting and hashing, which takes forever in time and resources, including disk. Sadly my incoming dataset is too small to benefit from a seq scan and too large to benefit from an index scan join. However, when split manually into N portions, the problem can be tackled with N * a small cost, which is cheap anyway.

Questions:
1. Is there another strategy?
2. Could postgres support a "batched indexed join" itself, leveraging indexes itself with dynamically sized batches?

It is error prone to write code to split and iterate. I suspect postgres has everything internally (index catalog, planner) to split the job itself, making David vs Goliath something trivial.
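For illustration only, a minimal sketch of the batched strategy (3) described above, assuming a staging table david, a target goliath, a key column list_id and a single payload column val (all hypothetical; the real tables, column lists and batch size differ):

    -- Sketch: upsert the staging table "david" into "goliath" in chunks,
    -- paginating the (unchanging) staging table by ORDER BY + LIMIT/OFFSET.
    DO $$
    DECLARE
        batch_size   integer := 1000;   -- strategy 3 batch size
        total        bigint;
        batch_offset bigint  := 0;
    BEGIN
        SELECT count(*) INTO total FROM david;
        WHILE batch_offset < total LOOP
            MERGE INTO goliath g
            USING (SELECT * FROM david
                   ORDER BY list_id
                   LIMIT batch_size OFFSET batch_offset) AS t
            ON t.list_id = g.list_id
            WHEN MATCHED THEN UPDATE SET val = t.val
            WHEN NOT MATCHED THEN INSERT (list_id, val) VALUES (t.list_id, t.val);
            batch_offset := batch_offset + batch_size;
        END LOOP;
    END $$;

Note that the whole DO block runs in a single transaction and snapshot; since only the staging table is paginated and it is not modified, the LIMIT/OFFSET chunks stay stable.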
On 6/17/23 15:48, Nicolas Paris wrote:
> In my use case I have a 2 billion row / 1 TB table. I have around 2 million rows of daily data to upsert, with say 50% inserts, based on the primary key, into a freshly analyzed table.
>
> I have tested multiple strategies to merge the data, all based on a first stage that copies the 2M-row dataset into a staging unlogged, indexed table:
>
> 1. Join insert then join update
> 2.1. Usage of the new MERGE statement
> 2.2. Usage of MERGE on two hash-partitioned tables with partitionwise join enabled
> 3. Usage of MERGE by batches of 1000 rows
>
> A first remark is that the MERGE statement is almost 30% faster than two statements in my benchmarks. Thanks to the pg community for this.
>
> While strategies 1 and 2.x are incredibly slow (canceled after 10 hours), the third one finishes within 30 minutes.
>

Seems pretty terrible, provided the data is on reasonable storage (with acceptable random I/O behavior).

> My interpretation, reading the query plans, is that well-sized small batches of upserts leverage the indexes, while the regular join chooses a sequential scan, including sorting and hashing, which takes forever in time and resources, including disk.

You may be right, but it's hard to tell without seeing the query plan.

> Sadly my incoming dataset is too small to benefit from a seq scan and too large to benefit from an index scan join. However, when split manually into N portions, the problem can be tackled with N * a small cost, which is cheap anyway.
>

Sounds very much like you'd benefit from tuning some cost parameters to make the index scan look cheaper.

> Questions:
> 1. Is there another strategy?
> 2. Could postgres support a "batched indexed join" itself, leveraging indexes itself with dynamically sized batches?
>

Not sure what a 'batched indexed join' would be, but it very much sounds like a nested loop with an index scan.

> It is error prone to write code to split and iterate. I suspect postgres has everything internally (index catalog, planner) to split the job itself, making David vs Goliath something trivial.
>

What PostgreSQL version are you using, and on what hardware? Did you tune it in any way, or is everything just default?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
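As a hedged illustration of the cost-parameter tuning alluded to above (settings and values are examples, not a recommendation from the thread; column names reuse the hypothetical ones from the earlier sketch):

    -- Session-level experiment only; exact values depend on the hardware.
    SET random_page_cost = 1.1;        -- default 4.0; lower values favor index scans on SSDs
    SET effective_cache_size = '6GB';  -- planner hint about cache size, not an allocation

    EXPLAIN                            -- plan only; the MERGE is not executed
    MERGE INTO goliath g
    USING david t ON t.list_id = g.list_id
    WHEN MATCHED THEN UPDATE SET val = t.val
    WHEN NOT MATCHED THEN INSERT (list_id, val) VALUES (t.list_id, t.val);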
> > My interpretation, reading the query plans, is that well-sized small
> > batches of upserts leverage the indexes, while the regular join
> > chooses a sequential scan, including sorting and hashing, which
> > takes forever in time and resources, including disk.
>
> You may be right, but it's hard to tell without seeing the query
> plan.

Here are parts of both plans:

Bad case (strategy 2.1):

    ->  Merge Left Join  (cost=530202629.03..255884257913.32 rows=17023331531230 width=579)
          Merge Cond: (david.list_id = ca.list_id)
          ->  Sort  (cost=2019172.91..2024398.82 rows=2090361 width=569)
                Sort Key: david.list_id
                ->  Append  (cost=0.00..192152.41 rows=2090361 width=569)
                      ->  Seq Scan on david_0 david_1  (cost=0.00..1812.52 rows=20852 width=569)
                      ->  Seq Scan on david_1 david_2  (cost=0.00..1800.09 rows=20709 width=569)
                      ->  Seq Scan on david_2 david_3  (cost=0.00..1794.44 rows=20644 width=569)

Good case (strategy 3):

    Merge on goliath_23 ca  (cost=2139.75..11077.17 rows=0 width=0)
      ->  Nested Loop Left Join  (cost=2139.75..11077.17 rows=1000 width=575)
            ->  Limit  (cost=2139.19..2495.67 rows=1000 width=569)
                  ->  Index Scan using david_23_list_id_account_id_idx on david_23  (cost=0.29..6794.16 rows=19058 width=569)
            ->  Index Scan using goliath_23_list_id_account_id_idx on goliath_23 ca  (cost=0.56..8.56 rows=1 width=14)
                  Index Cond: (list_id = david_23.list_id)

> Sounds very much like you'd benefit from tuning some cost parameters
> to make the index scan look cheaper.
> Not sure what a 'batched indexed join' would be, but it very much
> sounds like a nested loop with an index scan.

Agreed, a 2M-row nested loop over an index scan would likely work as well. Could tuning the cost parameters lead to such a large nested loop?

> What PostgreSQL version are you using, and on what hardware? Did you tune it
> in any way, or is everything just default?

It is pg 15.3, on 2 cores / 8GO / 2TO ssds, with the cloud provider's default parameters (RDS).
On 6/17/23 23:42, nicolas paris wrote:
>>> My interpretation, reading the query plans, is that well-sized small
>>> batches of upserts leverage the indexes, while the regular join
>>> chooses a sequential scan, including sorting and hashing, which
>>> takes forever in time and resources, including disk.
>>
>> You may be right, but it's hard to tell without seeing the query
>> plan.
>
> Here are parts of both plans:
>

I don't understand why you're sharing just a part of the plan and not the whole thing, ideally including the actual update ... Giving us the info in small pieces just means we need to guess and speculate.

> Bad case (strategy 2.1):
>
>     ->  Merge Left Join  (cost=530202629.03..255884257913.32 rows=17023331531230 width=579)
>           Merge Cond: (david.list_id = ca.list_id)
>           ->  Sort  (cost=2019172.91..2024398.82 rows=2090361 width=569)
>                 Sort Key: david.list_id
>                 ->  Append  (cost=0.00..192152.41 rows=2090361 width=569)
>                       ->  Seq Scan on david_0 david_1  (cost=0.00..1812.52 rows=20852 width=569)
>                       ->  Seq Scan on david_1 david_2  (cost=0.00..1800.09 rows=20709 width=569)
>                       ->  Seq Scan on david_2 david_3  (cost=0.00..1794.44 rows=20644 width=569)
>

Well, I kinda doubt you have 17023331531230 rows (not even physically possible with a 2TB disk), so that's immediately suspicious. I'd bet the UPDATE ... FROM ... is missing a condition or something like that, which results in a cartesian product.

> Good case (strategy 3):
>
>     Merge on goliath_23 ca  (cost=2139.75..11077.17 rows=0 width=0)
>       ->  Nested Loop Left Join  (cost=2139.75..11077.17 rows=1000 width=575)
>             ->  Limit  (cost=2139.19..2495.67 rows=1000 width=569)
>                   ->  Index Scan using david_23_list_id_account_id_idx on david_23  (cost=0.29..6794.16 rows=19058 width=569)
>             ->  Index Scan using goliath_23_list_id_account_id_idx on goliath_23 ca  (cost=0.56..8.56 rows=1 width=14)
>                   Index Cond: (list_id = david_23.list_id)
>
>> Sounds very much like you'd benefit from tuning some cost parameters
>> to make the index scan look cheaper.
>> Not sure what a 'batched indexed join' would be, but it very much
>> sounds like a nested loop with an index scan.
>
> Agreed, a 2M-row nested loop over an index scan would likely work as well.
> Could tuning the cost parameters lead to such a large nested loop?
>

It should be possible, but maybe let's first see if there are other problems in the query itself. If it's generating a cartesian product, it's pointless to tune parameters.

>> What PostgreSQL version are you using, and on what hardware? Did you tune it
>> in any way, or is everything just default?
>
> It is pg 15.3, on 2 cores / 8GO / 2TO ssds, with the cloud provider's default parameters (RDS).
>

I assume 2TO is 2TB?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> I assume 2TO is 2TB?

Yes, 2TB.

> I don't understand why you're sharing just a part of the plan

As for the nested loop plan, what I shared is the full plan. It is actually repeated many times, since the 2M rows are processed in batches of 500. I add it again:

    Merge on goliath_23 ca  (cost=2139.75..11077.17 rows=0 width=0)
      ->  Nested Loop Left Join  (cost=2139.75..11077.17 rows=1000 width=575)
            ->  Limit  (cost=2139.19..2495.67 rows=1000 width=569)
                  ->  Index Scan using david_23_list_id_account_id_idx on david_23  (cost=0.29..6794.16 rows=19058 width=569)
            ->  Index Scan using goliath_23_list_id_account_id_idx on goliath_23 ca  (cost=0.56..8.56 rows=1 width=14)
                  Index Cond: (list_id = david_23.list_id)

> Well, I kinda doubt you have 17023331531230 rows (not even physically
> possible with a 2TB disk), so that's immediately suspicious.

Below is the full plan for strategy 2.1 (indeed the plan in the previous email was truncated and wrong, sorry for that).

Note that both plans come from the same hash-partitioned table with 100 partitions, with a unique index on list_id + hash_key. For strategy 2.1, I turned on enable_partitionwise_join, since the david table has the same partitioning scheme as goliath, including the unique index. In both cases the query is:

    MERGE INTO "goliath" ca
    USING (SELECT * FROM "david" ORDER BY "list_id") AS t
    ON t."list_id" = ca."list_id"
    WHEN MATCHED THEN UPDATE SET ...
    WHEN NOT MATCHED THEN INSERT (...) VALUES (...)

except that in strategy 3 david is split by a limit/offset of 500 on each partition table, such as:

    MERGE INTO "goliath_23" ca
    USING (SELECT * FROM "david_23" ORDER BY "list_id" LIMIT 500 OFFSET 0) AS t
    ON t."list_id" = ca."list_id"
    WHEN MATCHED THEN UPDATE SET ...
    WHEN NOT MATCHED THEN INSERT (...) VALUES (...)

Merge on goliath ca (cost=178016528.81..192778842.44 rows=0 width=0) Merge on goliath_0 ca_1 Merge on goliath_1 ca_2 Merge on goliath_2 ca_3 Merge on goliath_3 ca_4 Merge on goliath_4 ca_5 Merge on goliath_5 ca_6 Merge on goliath_6 ca_7 Merge on goliath_7 ca_8 Merge on goliath_8 ca_9 Merge on goliath_9 ca_10 Merge on goliath_10 ca_11 Merge on goliath_11 ca_12 Merge on goliath_12 ca_13 Merge on goliath_13 ca_14 Merge on goliath_14 ca_15 Merge on goliath_15 ca_16 Merge on goliath_16 ca_17 Merge on goliath_17 ca_18 Merge on goliath_18 ca_19 Merge on goliath_19 ca_20 Merge on goliath_20 ca_21 Merge on goliath_21 ca_22 Merge on goliath_22 ca_23 Merge on goliath_23 ca_24 Merge on goliath_24 ca_25 Merge on goliath_25 ca_26 Merge on goliath_26 ca_27 Merge on goliath_27 ca_28 Merge on goliath_28 ca_29 Merge on goliath_29 ca_30 Merge on goliath_30 ca_31 Merge on goliath_31 ca_32 Merge on goliath_32 ca_33 Merge on goliath_33 ca_34 Merge on goliath_34 ca_35 Merge on goliath_35 ca_36 Merge on goliath_36 ca_37 Merge on goliath_37 ca_38 Merge on goliath_38 ca_39 Merge on goliath_39 ca_40 Merge on goliath_40 ca_41 Merge on goliath_41 ca_42 Merge on goliath_42 ca_43 Merge on goliath_43 ca_44 Merge on goliath_44 ca_45 Merge on goliath_45 ca_46 Merge on goliath_46 ca_47 Merge on goliath_47 ca_48 Merge on goliath_48 ca_49 Merge on goliath_49 ca_50 Merge on goliath_50 ca_51 Merge on goliath_51 ca_52 Merge on goliath_52 ca_53 Merge on goliath_53 ca_54 Merge on goliath_54 ca_55 Merge on goliath_55 ca_56 Merge on goliath_56 ca_57 Merge on goliath_57 ca_58 Merge on goliath_58 ca_59 Merge on goliath_59 ca_60 Merge on goliath_60 ca_61 Merge on goliath_61 ca_62 Merge on goliath_62 ca_63 Merge on goliath_63 ca_64 Merge on goliath_64 ca_65 Merge on goliath_65 ca_66 Merge on goliath_66 ca_67 Merge on goliath_67 ca_68 Merge on goliath_68
ca_69 Merge on goliath_69 ca_70 Merge on goliath_70 ca_71 Merge on goliath_71 ca_72 Merge on goliath_72 ca_73 Merge on goliath_73 ca_74 Merge on goliath_74 ca_75 Merge on goliath_75 ca_76 Merge on goliath_76 ca_77 Merge on goliath_77 ca_78 Merge on goliath_78 ca_79 Merge on goliath_79 ca_80 Merge on goliath_80 ca_81 Merge on goliath_81 ca_82 Merge on goliath_82 ca_83 Merge on goliath_83 ca_84 Merge on goliath_84 ca_85 Merge on goliath_85 ca_86 Merge on goliath_86 ca_87 Merge on goliath_87 ca_88 Merge on goliath_88 ca_89 Merge on goliath_89 ca_90 Merge on goliath_90 ca_91 Merge on goliath_91 ca_92 Merge on goliath_92 ca_93 Merge on goliath_93 ca_94 Merge on goliath_94 ca_95 Merge on goliath_95 ca_96 Merge on goliath_96 ca_97 Merge on goliath_97 ca_98 Merge on goliath_98 ca_99 Merge on goliath_99 ca_100 -> Hash Left Join (cost=178016528.81..192778842.44 rows=2187354 width=579) Hash Cond: (david.list_id = ca.list_id) -> Append (cost=0.00..201068.31 rows=2187354 width=569) -> Seq Scan on david_0 david_1 (cost=0.00..1926.65 rows=22165 width=569) -> Seq Scan on david_1 david_2 (cost=0.00..1809.13 rows=20813 width=569) -> Seq Scan on david_2 david_3 (cost=0.00..1812.52 rows=20852 width=569) -> Seq Scan on david_3 david_4 (cost=0.00..1648.67 rows=18967 width=569) -> Seq Scan on david_4 david_5 (cost=0.00..1853.20 rows=21320 width=569) -> Seq Scan on david_5 david_6 (cost=0.00..1735.68 rows=19968 width=569) -> Seq Scan on david_6 david_7 (cost=0.00..1693.87 rows=19487 width=569) -> Seq Scan on david_7 david_8 (cost=0.00..1872.41 rows=21541 width=569) -> Seq Scan on david_8 david_9 (cost=0.00..1827.21 rows=21021 width=569) -> Seq Scan on david_9 david_10 (cost=0.00..1815.91 rows=20891 width=569) -> Seq Scan on david_10 david_11 (cost=0.00..1757.15 rows=20215 width=569) -> Seq Scan on david_11 david_12 (cost=0.00..1624.94 rows=18694 width=569) -> Seq Scan on david_12 david_13 (cost=0.00..1867.89 rows=21489 width=569) -> Seq Scan on david_13 david_14 (cost=0.00..1979.76 rows=22776 width=569) -> Seq Scan on david_14 david_15 (cost=0.00..1706.30 rows=19630 width=569) -> Seq Scan on david_15 david_16 (cost=0.00..1828.34 rows=21034 width=569) -> Seq Scan on david_16 david_17 (cost=0.00..1702.91 rows=19591 width=569) -> Seq Scan on david_17 david_18 (cost=0.00..1805.74 rows=20774 width=569) -> Seq Scan on david_18 david_19 (cost=0.00..3531.25 rows=40625 width=569) -> Seq Scan on david_19 david_20 (cost=0.00..1522.11 rows=17511 width=569) -> Seq Scan on david_20 david_21 (cost=0.00..1950.38 rows=22438 width=569) -> Seq Scan on david_21 david_22 (cost=0.00..1957.16 rows=22516 width=569) -> Seq Scan on david_22 david_23 (cost=0.00..1745.85 rows=20085 width=569) -> Seq Scan on david_23 david_24 (cost=0.00..1730.03 rows=19903 width=569) -> Seq Scan on david_24 david_25 (cost=0.00..1784.27 rows=20527 width=569) -> Seq Scan on david_25 david_26 (cost=0.00..1698.39 rows=19539 width=569) -> Seq Scan on david_26 david_27 (cost=0.00..1900.66 rows=21866 width=569) -> Seq Scan on david_27 david_28 (cost=0.00..1813.65 rows=20865 width=569) -> Seq Scan on david_28 david_29 (cost=0.00..2009.14 rows=23114 width=569) -> Seq Scan on david_29 david_30 (cost=0.00..1778.62 rows=20462 width=569) -> Seq Scan on david_30 david_31 (cost=0.00..1779.75 rows=20475 width=569) -> Seq Scan on david_31 david_32 (cost=0.00..1892.75 rows=21775 width=569) -> Seq Scan on david_32 david_33 (cost=0.00..1988.80 rows=22880 width=569) -> Seq Scan on david_33 david_34 (cost=0.00..1804.61 rows=20761 width=569) -> Seq Scan on david_34 david_35 
(cost=0.00..1857.72 rows=21372 width=569) -> Seq Scan on david_35 david_36 (cost=0.00..1782.01 rows=20501 width=569) -> Seq Scan on david_36 david_37 (cost=0.00..2352.66 rows=27066 width=569) -> Seq Scan on david_37 david_38 (cost=0.00..1962.81 rows=22581 width=569) -> Seq Scan on david_38 david_39 (cost=0.00..2002.36 rows=23036 width=569) -> Seq Scan on david_39 david_40 (cost=0.00..1852.07 rows=21307 width=569) -> Seq Scan on david_40 david_41 (cost=0.00..2116.49 rows=24349 width=569) -> Seq Scan on david_41 david_42 (cost=0.00..1785.40 rows=20540 width=569) -> Seq Scan on david_42 david_43 (cost=0.00..1838.51 rows=21151 width=569) -> Seq Scan on david_43 david_44 (cost=0.00..1931.17 rows=22217 width=569) -> Seq Scan on david_44 david_45 (cost=0.00..1878.06 rows=21606 width=569) -> Seq Scan on david_45 david_46 (cost=0.00..1859.98 rows=21398 width=569) -> Seq Scan on david_46 david_47 (cost=0.00..1882.58 rows=21658 width=569) -> Seq Scan on david_47 david_48 (cost=0.00..1791.05 rows=20605 width=569) -> Seq Scan on david_48 david_49 (cost=0.00..1925.52 rows=22152 width=569) -> Seq Scan on david_49 david_50 (cost=0.00..1953.77 rows=22477 width=569) -> Seq Scan on david_50 david_51 (cost=0.00..1797.83 rows=20683 width=569) -> Seq Scan on david_51 david_52 (cost=0.00..1680.31 rows=19331 width=569) -> Seq Scan on david_52 david_53 (cost=0.00..1626.07 rows=18707 width=569) -> Seq Scan on david_53 david_54 (cost=0.00..2003.49 rows=23049 width=569) -> Seq Scan on david_54 david_55 (cost=0.00..1771.84 rows=20384 width=569) -> Seq Scan on david_55 david_56 (cost=0.00..1700.65 rows=19565 width=569) -> Seq Scan on david_56 david_57 (cost=0.00..1931.17 rows=22217 width=569) -> Seq Scan on david_57 david_58 (cost=0.00..1833.99 rows=21099 width=569) -> Seq Scan on david_58 david_59 (cost=0.00..1918.74 rows=22074 width=569) -> Seq Scan on david_59 david_60 (cost=0.00..1885.97 rows=21697 width=569) -> Seq Scan on david_60 david_61 (cost=0.00..4095.12 rows=47112 width=569) -> Seq Scan on david_61 david_62 (cost=0.00..2076.94 rows=23894 width=569) -> Seq Scan on david_62 david_63 (cost=0.00..2876.98 rows=33098 width=569) -> Seq Scan on david_63 david_64 (cost=0.00..1647.54 rows=18954 width=569) -> Seq Scan on david_64 david_65 (cost=0.00..1653.19 rows=19019 width=569) -> Seq Scan on david_65 david_66 (cost=0.00..1684.83 rows=19383 width=569) -> Seq Scan on david_66 david_67 (cost=0.00..1863.37 rows=21437 width=569) -> Seq Scan on david_67 david_68 (cost=0.00..1717.60 rows=19760 width=569) -> Seq Scan on david_68 david_69 (cost=0.00..1847.55 rows=21255 width=569) -> Seq Scan on david_69 david_70 (cost=0.00..2235.14 rows=25714 width=569) -> Seq Scan on david_70 david_71 (cost=0.00..2273.56 rows=26156 width=569) -> Seq Scan on david_71 david_72 (cost=0.00..1745.85 rows=20085 width=569) -> Seq Scan on david_72 david_73 (cost=0.00..1861.11 rows=21411 width=569) -> Seq Scan on david_73 david_74 (cost=0.00..1856.59 rows=21359 width=569) -> Seq Scan on david_74 david_75 (cost=0.00..1885.97 rows=21697 width=569) -> Seq Scan on david_75 david_76 (cost=0.00..1665.62 rows=19162 width=569) -> Seq Scan on david_76 david_77 (cost=0.00..1870.15 rows=21515 width=569) -> Seq Scan on david_77 david_78 (cost=0.00..1776.36 rows=20436 width=569) -> Seq Scan on david_78 david_79 (cost=0.00..1766.19 rows=20319 width=569) -> Seq Scan on david_79 david_80 (cost=0.00..1812.52 rows=20852 width=569) -> Seq Scan on david_80 david_81 (cost=0.00..1995.58 rows=22958 width=569) -> Seq Scan on david_81 david_82 (cost=0.00..1701.78 
rows=19578 width=569) -> Seq Scan on david_82 david_83 (cost=0.00..1658.84 rows=19084 width=569) -> Seq Scan on david_83 david_84 (cost=0.00..1840.77 rows=21177 width=569) -> Seq Scan on david_84 david_85 (cost=0.00..1688.22 rows=19422 width=569) -> Seq Scan on david_85 david_86 (cost=0.00..1918.74 rows=22074 width=569) -> Seq Scan on david_86 david_87 (cost=0.00..2963.99 rows=34099 width=569) -> Seq Scan on david_87 david_88 (cost=0.00..2075.81 rows=23881 width=569) -> Seq Scan on david_88 david_89 (cost=0.00..1783.14 rows=20514 width=569) -> Seq Scan on david_89 david_90 (cost=0.00..1765.06 rows=20306 width=569) -> Seq Scan on david_90 david_91 (cost=0.00..1950.38 rows=22438 width=569) -> Seq Scan on david_91 david_92 (cost=0.00..1840.77 rows=21177 width=569) -> Seq Scan on david_92 david_93 (cost=0.00..1783.14 rows=20514 width=569) -> Seq Scan on david_93 david_94 (cost=0.00..1705.17 rows=19617 width=569) -> Seq Scan on david_94 david_95 (cost=0.00..1817.04 rows=20904 width=569) -> Seq Scan on david_95 david_96 (cost=0.00..1977.50 rows=22750 width=569) -> Seq Scan on david_96 david_97 (cost=0.00..1946.99 rows=22399 width=569) -> Seq Scan on david_97 david_98 (cost=0.00..1814.78 rows=20878 width=569) -> Seq Scan on david_98 david_99 (cost=0.00..1844.16 rows=21216 width=569) -> Seq Scan on david_99 david_100 (cost=0.00..1769.58 rows=20358 width=569) -> Hash (cost=147650973.90..147650973.90 rows=1653953593 width=18) -> Append (cost=0.00..147650973.90 rows=1653953593 width=18) -> Seq Scan on goliath_0 ca_1 (cost=0.00..11177255.56 rows=150300456 width=18) -> Seq Scan on goliath_1 ca_2 (cost=0.00..1234238.22 rows=14780522 width=18) -> Seq Scan on goliath_2 ca_3 (cost=0.00..1323160.42 rows=15336142 width=18) -> Seq Scan on goliath_3 ca_4 (cost=0.00..1256666.46 rows=15029146 width=18) -> Seq Scan on goliath_4 ca_5 (cost=0.00..1272324.75 rows=15157175 width=18) -> Seq Scan on goliath_5 ca_6 (cost=0.00..1270349.37 rows=15044037 width=18) -> Seq Scan on goliath_6 ca_7 (cost=0.00..1284810.74 rows=15261474 width=18) -> Seq Scan on goliath_7 ca_8 (cost=0.00..1263715.41 rows=15020741 width=18) -> Seq Scan on goliath_8 ca_9 (cost=0.00..1265121.73 rows=14953673 width=18) -> Seq Scan on goliath_9 ca_10 (cost=0.00..1309331.70 rows=15314570 width=18) -> Seq Scan on goliath_10 ca_11 (cost=0.00..1269041.02 rows=15086702 width=18) -> Seq Scan on goliath_11 ca_12 (cost=0.00..1268299.98 rows=15042498 width=18) -> Seq Scan on goliath_12 ca_13 (cost=0.00..1294069.08 rows=15206708 width=18) -> Seq Scan on goliath_13 ca_14 (cost=0.00..1344155.97 rows=15480897 width=18) -> Seq Scan on goliath_14 ca_15 (cost=0.00..1258529.41 rows=15007641 width=18) -> Seq Scan on goliath_15 ca_16 (cost=0.00..1247612.99 rows=14801699 width=18) -> Seq Scan on goliath_16 ca_17 (cost=0.00..1398973.15 rows=15833115 width=18) -> Seq Scan on goliath_17 ca_18 (cost=0.00..1234430.89 rows=14907189 width=18) -> Seq Scan on goliath_18 ca_19 (cost=0.00..1491068.89 rows=16395989 width=18) -> Seq Scan on goliath_19 ca_20 (cost=0.00..1241254.74 rows=14743874 width=18) -> Seq Scan on goliath_20 ca_21 (cost=0.00..1308537.31 rows=15139031 width=18) -> Seq Scan on goliath_21 ca_22 (cost=0.00..1274257.03 rows=15133903 width=18) -> Seq Scan on goliath_22 ca_23 (cost=0.00..1348582.00 rows=15415900 width=18) -> Seq Scan on goliath_23 ca_24 (cost=0.00..1245613.99 rows=14770499 width=18) -> Seq Scan on goliath_24 ca_25 (cost=0.00..1232552.90 rows=14592890 width=18) -> Seq Scan on goliath_25 ca_26 (cost=0.00..1237272.86 rows=14785186 width=18) -> Seq Scan on 
goliath_26 ca_27 (cost=0.00..1395925.80 rows=15794380 width=18) -> Seq Scan on goliath_27 ca_28 (cost=0.00..1243112.59 rows=14888659 width=18) -> Seq Scan on goliath_28 ca_29 (cost=0.00..1261176.05 rows=15014705 width=18) -> Seq Scan on goliath_29 ca_30 (cost=0.00..1375287.81 rows=15912981 width=18) -> Seq Scan on goliath_30 ca_31 (cost=0.00..1236320.19 rows=14789519 width=18) -> Seq Scan on goliath_31 ca_32 (cost=0.00..1278375.97 rows=15203897 width=18) -> Seq Scan on goliath_32 ca_33 (cost=0.00..1263550.43 rows=14860643 width=18) -> Seq Scan on goliath_33 ca_34 (cost=0.00..1299830.39 rows=15186239 width=18) -> Seq Scan on goliath_34 ca_35 (cost=0.00..1352761.61 rows=15664361 width=18) -> Seq Scan on goliath_35 ca_36 (cost=0.00..1323724.61 rows=15543061 width=18) -> Seq Scan on goliath_36 ca_37 (cost=0.00..1508080.05 rows=16098705 width=18) -> Seq Scan on goliath_37 ca_38 (cost=0.00..1247180.20 rows=15092220 width=18) -> Seq Scan on goliath_38 ca_39 (cost=0.00..1265121.63 rows=14913063 width=18) -> Seq Scan on goliath_39 ca_40 (cost=0.00..1263161.74 rows=15008274 width=18) -> Seq Scan on goliath_40 ca_41 (cost=0.00..1422028.56 rows=15874056 width=18) -> Seq Scan on goliath_41 ca_42 (cost=0.00..1276259.24 rows=15052824 width=18) -> Seq Scan on goliath_42 ca_43 (cost=0.00..1331700.23 rows=15499623 width=18) -> Seq Scan on goliath_43 ca_44 (cost=0.00..1246053.10 rows=14665010 width=18) -> Seq Scan on goliath_44 ca_45 (cost=0.00..1275255.85 rows=15143785 width=18) -> Seq Scan on goliath_45 ca_46 (cost=0.00..1305362.83 rows=15361783 width=18) -> Seq Scan on goliath_46 ca_47 (cost=0.00..1280577.37 rows=15247837 width=18) -> Seq Scan on goliath_47 ca_48 (cost=0.00..1251285.15 rows=14806015 width=18) -> Seq Scan on goliath_48 ca_49 (cost=0.00..1351232.48 rows=15433548 width=18) -> Seq Scan on goliath_49 ca_50 (cost=0.00..1347924.50 rows=15296550 width=18) -> Seq Scan on goliath_50 ca_51 (cost=0.00..1357086.54 rows=15541854 width=18) -> Seq Scan on goliath_51 ca_52 (cost=0.00..1216370.11 rows=14562711 width=18) -> Seq Scan on goliath_52 ca_53 (cost=0.00..1358864.91 rows=15543891 width=18) -> Seq Scan on goliath_53 ca_54 (cost=0.00..1303103.21 rows=15233521 width=18) -> Seq Scan on goliath_54 ca_55 (cost=0.00..1247450.04 rows=14984504 width=18) -> Seq Scan on goliath_55 ca_56 (cost=0.00..1265316.35 rows=14879835 width=18) -> Seq Scan on goliath_56 ca_57 (cost=0.00..1256864.72 rows=14942272 width=18) -> Seq Scan on goliath_57 ca_58 (cost=0.00..1234443.50 rows=14857950 width=18) -> Seq Scan on goliath_58 ca_59 (cost=0.00..1293245.96 rows=15297596 width=18) -> Seq Scan on goliath_59 ca_60 (cost=0.00..1234137.56 rows=14820356 width=18) -> Seq Scan on goliath_60 ca_61 (cost=0.00..1561333.77 rows=16903277 width=18) -> Seq Scan on goliath_61 ca_62 (cost=0.00..1289386.58 rows=15273158 width=18) -> Seq Scan on goliath_62 ca_63 (cost=0.00..1375996.18 rows=15783618 width=18) -> Seq Scan on goliath_63 ca_64 (cost=0.00..1318835.42 rows=15393842 width=18) -> Seq Scan on goliath_64 ca_65 (cost=0.00..1279811.42 rows=15025442 width=18) -> Seq Scan on goliath_65 ca_66 (cost=0.00..1238623.43 rows=14795243 width=18) -> Seq Scan on goliath_66 ca_67 (cost=0.00..1305470.50 rows=15230950 width=18) -> Seq Scan on goliath_67 ca_68 (cost=0.00..1261104.19 rows=14894319 width=18) -> Seq Scan on goliath_68 ca_69 (cost=0.00..1283365.18 rows=15239718 width=18) -> Seq Scan on goliath_69 ca_70 (cost=0.00..1314101.63 rows=15292263 width=18) -> Seq Scan on goliath_70 ca_71 (cost=0.00..1253906.46 rows=14978446 width=18) -> Seq Scan on 
goliath_71 ca_72 (cost=0.00..1294493.71 rows=15331171 width=18) -> Seq Scan on goliath_72 ca_73 (cost=0.00..1286198.09 rows=15418009 width=18) -> Seq Scan on goliath_73 ca_74 (cost=0.00..1289391.43 rows=15229243 width=18) -> Seq Scan on goliath_74 ca_75 (cost=0.00..1242385.13 rows=14670113 width=18) -> Seq Scan on goliath_75 ca_76 (cost=0.00..1263916.51 rows=14982751 width=18) -> Seq Scan on goliath_76 ca_77 (cost=0.00..1294899.69 rows=15200869 width=18) -> Seq Scan on goliath_77 ca_78 (cost=0.00..1269406.36 rows=14930436 width=18) -> Seq Scan on goliath_78 ca_79 (cost=0.00..1235383.65 rows=14800365 width=18) -> Seq Scan on goliath_79 ca_80 (cost=0.00..1236916.73 rows=14710273 width=18) -> Seq Scan on goliath_80 ca_81 (cost=0.00..1421834.11 rows=16753911 width=18) -> Seq Scan on goliath_81 ca_82 (cost=0.00..1335250.54 rows=15351354 width=18) -> Seq Scan on goliath_82 ca_83 (cost=0.00..1243153.05 rows=14688205 width=18) -> Seq Scan on goliath_83 ca_84 (cost=0.00..1250345.33 rows=14741833 width=18) -> Seq Scan on goliath_84 ca_85 (cost=0.00..1301004.00 rows=14956400 width=18) -> Seq Scan on goliath_85 ca_86 (cost=0.00..1289135.50 rows=15009950 width=18) -> Seq Scan on goliath_86 ca_87 (cost=0.00..1488060.65 rows=16380865 width=18) -> Seq Scan on goliath_87 ca_88 (cost=0.00..1265255.27 rows=15004127 width=18) -> Seq Scan on goliath_88 ca_89 (cost=0.00..1291141.46 rows=15112446 width=18) -> Seq Scan on goliath_89 ca_90 (cost=0.00..1236365.37 rows=14733237 width=18) -> Seq Scan on goliath_90 ca_91 (cost=0.00..1276548.94 rows=14938294 width=18) -> Seq Scan on goliath_91 ca_92 (cost=0.00..1375334.48 rows=16060648 width=18) -> Seq Scan on goliath_92 ca_93 (cost=0.00..1257210.14 rows=14934014 width=18) -> Seq Scan on goliath_93 ca_94 (cost=0.00..1268048.19 rows=14998419 width=18) -> Seq Scan on goliath_94 ca_95 (cost=0.00..1277075.92 rows=15219292 width=18) -> Seq Scan on goliath_95 ca_96 (cost=0.00..1351108.41 rows=15454141 width=18) -> Seq Scan on goliath_96 ca_97 (cost=0.00..1283608.49 rows=15091049 width=18) -> Seq Scan on goliath_97 ca_98 (cost=0.00..1277019.32 rows=15265132 width=18) -> Seq Scan on goliath_98 ca_99 (cost=0.00..1258056.96 rows=15006496 width=18) -> Seq Scan on goliath_99 ca_100 (cost=0.00..1221325.89 rows=14612389 width=18)
I came here to talk about partitionwise join, but then noticed you have already thought of that:

On 2023-Jun-18, nicolas paris wrote:
> Note that both plans come from the same hash-partitioned table with
> 100 partitions, with a unique index on list_id + hash_key. For strategy
> 2.1, I turned on enable_partitionwise_join, since the david table has the
> same partitioning scheme as goliath, including the unique index. In both
> cases the query is:

Hmm, I suppose the reason partitionwise join isn't having any effect is that the presence of WHEN NOT MATCHED clauses forces an outer join, which probably disarms partitionwise joining, since each join pair would have to match for nulls, so there would be two matching partitions at the other end. A quick test for this hypothesis might be to try the MERGE without the WHEN NOT MATCHED clauses and see if partitionwise join works better.

Maybe Tom L's new outer-join infrastructure in 16 allows improving on this, not sure.

--
Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"The gods do not protect fools. Fools are protected by more capable fools" (Luis Wu, Ringworld)
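A minimal sketch of the quick test suggested above, assuming the hypothetical column names used earlier; the idea is to compare the plan of an update-only MERGE against the full upsert and check whether per-partition joins appear:

    -- Sketch: does partitionwise join appear once the NOT MATCHED arm is removed?
    SET enable_partitionwise_join = true;

    EXPLAIN
    MERGE INTO goliath ca
    USING david t ON t.list_id = ca.list_id
    WHEN MATCHED THEN UPDATE SET val = t.val;   -- no WHEN NOT MATCHED clause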
On 6/18/23 22:57, nicolas paris wrote:
>> ...
>> Well, I kinda doubt you have 17023331531230 rows (not even physically
>> possible with a 2TB disk), so that's immediately suspicious.
>
> Below is the full plan for strategy 2.1 (indeed the plan in the previous
> email was truncated and wrong, sorry for that).
>

None of the plans has estimates anywhere close to 17023331531230, so where did that come from?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6/19/23 09:46, Alvaro Herrera wrote:
> I came here to talk about partitionwise join, but then noticed you have
> already thought of that:
>
> On 2023-Jun-18, nicolas paris wrote:
>
>> Note that both plans come from the same hash-partitioned table with
>> 100 partitions, with a unique index on list_id + hash_key. For strategy
>> 2.1, I turned on enable_partitionwise_join, since the david table has the
>> same partitioning scheme as goliath, including the unique index. In both
>> cases the query is:
>
> Hmm, I suppose the reason partitionwise join isn't having any effect is
> that the presence of WHEN NOT MATCHED clauses forces an outer join, which
> probably disarms partitionwise joining, since each join pair would
> have to match for nulls, so there would be two matching partitions at
> the other end. A quick test for this hypothesis might be to try the
> MERGE without the WHEN NOT MATCHED clauses and see if partitionwise join
> works better.
>
> Maybe Tom L's new outer-join infrastructure in 16 allows improving on
> this, not sure.
>

Not sure why that would disarm partitionwise join - attached is a simple reproducer, generating two tables, loading 10000000 and 10000 rows into them, and then doing explain on a simple merge.

IMHO the thing that breaks it is the ORDER BY in the merge, which likely acts as an optimization fence and prevents all sorts of smart things including the partitionwise join. I'd bet that if Nicolas replaces

    MERGE INTO "goliath" ca
    USING (SELECT * FROM "david" ORDER BY "list_id") AS t
    ...

with

    MERGE INTO "goliath" ca
    USING "david" AS t
    ...

it'll start working much better.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
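The attached script is not shown in the archive; the following is only a rough sketch of the kind of reproducer described above (two hash-partitioned tables, a large and a small load, then EXPLAIN of a simple merge), with hypothetical names and layout:

    -- Sketch only, not the actual attachment.
    CREATE TABLE goliath (id bigint PRIMARY KEY, val text) PARTITION BY HASH (id);
    CREATE TABLE david   (id bigint PRIMARY KEY, val text) PARTITION BY HASH (id);

    DO $$
    BEGIN
      FOR i IN 0..99 LOOP
        EXECUTE format('CREATE TABLE goliath_%s PARTITION OF goliath FOR VALUES WITH (MODULUS 100, REMAINDER %s)', i, i);
        EXECUTE format('CREATE TABLE david_%s PARTITION OF david FOR VALUES WITH (MODULUS 100, REMAINDER %s)', i, i);
      END LOOP;
    END $$;

    INSERT INTO goliath SELECT i, md5(i::text) FROM generate_series(1, 10000000) i;
    INSERT INTO david   SELECT i, md5(i::text) FROM generate_series(1, 10000) i;
    ANALYZE goliath, david;

    SET enable_partitionwise_join = true;
    EXPLAIN
    MERGE INTO goliath g USING david d ON d.id = g.id
    WHEN MATCHED THEN UPDATE SET val = d.val
    WHEN NOT MATCHED THEN INSERT (id, val) VALUES (d.id, d.val);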
On Mon, 2023-06-19 at 13:34 +0200, Tomas Vondra wrote:
>
> On 6/18/23 22:57, nicolas paris wrote:
>>> ...
>>> Well, I kinda doubt you have 17023331531230 rows (not even physically
>>> possible with a 2TB disk), so that's immediately suspicious.
>>
>> Below is the full plan for strategy 2.1 (indeed the plan in the previous
>> email was truncated and wrong, sorry for that).
>>
>
> None of the plans has estimates anywhere close to 17023331531230, so
> where did that come from?
>

Well, that was an old plan from a run where there was an issue: the david table did not have the same partitioning scheme as goliath. It was partitioned by another column.
> IMHO the thing that breaks it is the ORDER BY in the merge, which likely
> acts as an optimization fence and prevents all sorts of smart things
> including the partitionwise join. I'd bet that if Nicolas replaces
>
> MERGE INTO "goliath" ca
> USING (SELECT * FROM "david" ORDER BY "list_id") AS t
> ...

Sorry if it was not clear, but there is no ORDER BY in the 2.1 strategy, so this cannot be the reason the optimization is not triggered.

For information, I do enable the partitionwise join feature with a JDBC call just before the merge:

    set enable_partitionwise_join=true
On 6/19/23 14:20, nicolas paris wrote:
>> IMHO the thing that breaks it is the ORDER BY in the merge, which likely
>> acts as an optimization fence and prevents all sorts of smart things
>> including the partitionwise join. I'd bet that if Nicolas replaces
>>
>> MERGE INTO "goliath" ca
>> USING (SELECT * FROM "david" ORDER BY "list_id") AS t
>> ...
>
> Sorry if it was not clear, but there is no ORDER BY in the 2.1 strategy,
> so this cannot be the reason the optimization is not triggered.
>
> For information, I do enable the partitionwise join feature with a JDBC call
> just before the merge:
>     set enable_partitionwise_join=true
>

But you wrote that in both cases the query is:

    MERGE INTO "goliath" ca
    USING (SELECT * FROM "david" ORDER BY "list_id") AS t
    ON t."list_id" = ca."list_id"
    WHEN MATCHED THEN UPDATE SET ...
    WHEN NOT MATCHED THEN INSERT (...) VALUES (...)

With all due respect, I'm getting a bit tired of having to speculate about what exactly you're doing etc. based on bits of information. I'm willing to continue to investigate, but only if you prepare a reproducer, i.e. a SQL script that demonstrates the issue - I don't think preparing that should be difficult; something like the SQL script I shared earlier today should do the trick.

I suggest you do that directly, not through JDBC. Perhaps the JDBC connection pool does something funny (like connection pooling and resetting settings).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
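If the JDBC path is kept anyway, one simple sanity check (an illustration, not from the original thread) is to confirm, on the very same connection and right before issuing the MERGE, that the setting survived any pooler reset:

    -- Run on the same connection/session that will execute the MERGE.
    SHOW enable_partitionwise_join;                       -- expect: on
    SELECT current_setting('enable_partitionwise_join');  -- expect: on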
> But you wrote that in both cases the query is:

That was indeed yet another typo; I hope to do better in the future.

> I'm willing to continue to investigate, but only if you prepare a
> reproducer,

Thanks for your starter script. Please find attached 2 scripts which now illustrate two troubles.

repro1.sql is a slight evolution of yours. When I play with the david size (as described in the comments) you will see the plan going from a nested loop to a sequential scan. Also note that the partitionwise join is likely working. This illustrates my initial problem: the sequential scan is not going to work fine on my workload (david too large). How can I suggest to postgres to use a nested loop here? (I suspect playing with costs should help.)

repro2.sql: now I changed the table layout (similar to my setup) to reproduce the partitionwise join which does not trigger. I added a partition column, and a unique index to be able to mimic a primary key. Now partitionwise join (in my local docker vanilla postgres 15.3) does not work. Eventually, if I do small batches, then the merge is working fast. That's another, unrelated problem.

> I suggest you do that directly, not through JDBC. Perhaps the JDBC
> connection pool does something funny (like connection pooling and
> resetting settings).

I can tell JDBC was working, and likely the reason might be in my current table setup.
Attachment
On 6/19/23 17:45, nicolas paris wrote:
>> But you wrote that in both cases the query is:
>
> That was indeed yet another typo; I hope to do better in the future.
>
>> I'm willing to continue to investigate, but only if you prepare a
>> reproducer,
>
> Thanks for your starter script. Please find attached 2 scripts which
> now illustrate two troubles.
>
> repro1.sql is a slight evolution of yours. When I play with the david size
> (as described in the comments) you will see the plan going from a nested loop
> to a sequential scan. Also note that the partitionwise join is likely
> working. This illustrates my initial problem: the sequential scan is not
> going to work fine on my workload (david too large). How can I suggest to
> postgres to use a nested loop here? (I suspect playing with costs should
> help.)
>

In general, this behavior is expected. The overall idea is that nested loops are efficient for small row counts, but the cost rises quickly exactly because they do a lot of random I/O (due to the index scan). At some point it's cheaper to switch to a plan that does sequential I/O. Which is exactly what's happening here - we switch from a plan doing a lot of random I/O on goliath

                              QUERY PLAN
    ----------------------------------------------------------------------
     Merge on goliath  (cost=0.29..7888.69 rows=0 width=0)
       Merge on goliath_0 goliath_1
       ...
       Merge on goliath_99 goliath_100
       ->  Append  (cost=0.29..7888.69 rows=9998 width=47)
             ->  Nested Loop Left Join  (cost=0.29..93.89 rows=120 ...
                   ->  Seq Scan on david_0 david_1  (cost=0.00..2.20 ...
                   ->  Index Scan using goliath_0_pkey on goliath_0 ...
                         Index Cond: (id = david_1.id)
             ->  Nested Loop Left Join  (cost=0.29..77.10 rows=98 ...
                   ->  Seq Scan on david_1 david_2  (cost=0.00..1.98 ...
                   ->  Index Scan using goliath_1_pkey on goliath_1 ...
                         Index Cond: (id = david_2.id)
             ...
             ->  Nested Loop Left Join  (cost=0.29..74.58 rows=95 ...
                   ->  Seq Scan on david_99 david_100  (cost=0.00..1.95
                   ->  Index Scan using goliath_99_pkey on goliath_99 ...
                         Index Cond: (id = david_100.id)
    (502 rows)

to a plan that does a lot of sequential I/O

                              QUERY PLAN
    ----------------------------------------------------------------------
     Merge on goliath  (cost=293.44..264556.47 rows=0 width=0)
       Merge on goliath_0 goliath_1
       ...
       Merge on goliath_99 goliath_100
       ->  Append  (cost=293.44..264556.47 rows=951746 width=47)
             ->  Hash Right Join  (cost=293.44..2597.05 rows=9486 ...
                   Hash Cond: (goliath_1.id = david_1.id)
                   ->  Seq Scan on goliath_0 goliath_1  (cost=0.00..
                   ->  Hash  (cost=174.86..174.86 rows=9486 width=37)
                         ->  Seq Scan on david_0 david_1  (cost=0.00..
             ->  Hash Right Join  (cost=295.62..2613.90 rows=9583 width=
                   Hash Cond: (goliath_2.id = david_2.id)
                   ->  Seq Scan on goliath_1 goliath_2  (cost=0.00..1845
                   ->  Hash  (cost=175.83..175.83 rows=9583 width=37)
                         ->  Seq Scan on david_1 david_2  (cost=0.00..
             ...
             ->  Hash Right Join  (cost=288.33..2593.16 rows=9348 width
                   Hash Cond: (goliath_100.id = david_100.id)
                   ->  Seq Scan on goliath_99 goliath_100  (cost=0.00..
                   ->  Hash  (cost=171.48..171.48 rows=9348 width=37)
                         ->  Seq Scan on david_99 david_100  (cost=0.00..
    (602 rows)

That's expected, because the cost if I force the Nested Loop with the higher row count in "david" looks like this:

                              QUERY PLAN
    ----------------------------------------------------------------------
     Merge on goliath  (cost=0.29..331253.00 rows=0 width=0)
     ...

Of course, the question is at which point the switch should happen. You can try setting enable_hashjoin=off, which should push the optimizer to use the first plan. If it does, you'll know the cost difference between the two plans.
If you run it and it's actually faster than the "default" plan with the hash join/seqscans, you can try lowering random_page_cost, which is likely the main input. The default is 4; in general it should be higher than seq_page_cost=1.0 (because random I/O is more expensive).

In any case, there's a point when the nested loops get terrible. I mean, imagine "david" has 100000 rows, and "goliath" has 100000 pages (800MB). It's just cheaper to seqscan the 100k pages than to randomly scan the same 100k pages. You can tune where the plan flips, ofc.

>
> repro2.sql: now I changed the table layout (similar to my setup) to
> reproduce the partitionwise join which does not trigger. I added a
> partition column, and a unique index to be able to mimic a primary key.
> Now partitionwise join (in my local docker vanilla postgres 15.3) does not
> work. Eventually, if I do small batches, then the merge is working fast.
> That's another, unrelated problem.
>

This is absolutely expected. If you partition by hash (id, part_col), you can't join on (id) alone and expect partitionwise join to work. To quote the enable_partitionwise_join documentation [1]:

    Enables or disables the query planner's use of partitionwise join,
    which allows a join between partitioned tables to be performed by
    joining the matching partitions. Partitionwise join currently
    applies only when the join conditions include all the partition
    keys, which must be of the same data type and have one-to-one
    matching sets of child partitions.

So the fact that

    merge into goliath using david on david.id = goliath.id
    when matched then update set val = david.val
    when not matched then insert (id, val) values (david.id, david.val);

does not work is absolutely expected. You need to join on part_col too. This is because the partition is determined by a hash of both columns, a bit like md5(id+part_col), so knowing just the "id" is not enough to determine the partition.

Even if you fix that, the last query won't use partitionwise join because of the ORDER BY subquery, which serves as an optimization fence (which means the merge does not actually see that the underlying table is partitioned).

If you get rid of that and add the part_col to the join, it translates to the first issue, i.e. setting costs so the plan flips to the sequential scan at the right point.

[1] https://www.postgresql.org/docs/15/runtime-config-query.html#GUC-ENABLE-PARTITIONWISE-JOIN

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
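A hedged sketch of the corrected join condition (column names follow the repro2 description above; val is assumed): with all partition key columns in the ON clause, partitionwise join becomes possible, and the plan flip can then be explored with the cost knobs mentioned earlier.

    -- Sketch: include every partition key column in the join condition.
    SET enable_partitionwise_join = true;

    EXPLAIN
    MERGE INTO goliath g
    USING david d
      ON d.id = g.id AND d.part_col = g.part_col     -- all partition keys joined
    WHEN MATCHED THEN UPDATE SET val = d.val
    WHEN NOT MATCHED THEN INSERT (id, part_col, val) VALUES (d.id, d.part_col, d.val);

    -- Optional experiments to see where the nested-loop vs. seqscan flip happens:
    -- SET enable_hashjoin = off;    -- force the nested-loop plan for comparison
    -- SET random_page_cost = 1.1;   -- make index scans look cheaper on SSDs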
> This is absolutely expected. If you partition by hash (id, part_col),
> you can't join on (id) alone and expect partitionwise join to work. To quote
> the enable_partitionwise_join documentation [1]:
>
>     Enables or disables the query planner's use of partitionwise join,
>     which allows a join between partitioned tables to be performed by
>     joining the matching partitions. Partitionwise join currently
>     applies only when the join conditions include all the partition
>     keys, which must be of the same data type and have one-to-one
>     matching sets of child partitions.
>
> So the fact that
>
>     merge into goliath using david on david.id = goliath.id
>     when matched then update set val = david.val
>     when not matched then insert (id, val) values (david.id, david.val);
>
> does not work is absolutely expected. You need to join on part_col too.

Definitely, it makes sense to add the part_col to the join columns. It also helps the planner choose a better plan, since now it goes with a per-partition nested loop without having to trick the costs (either enable_hashjoin or random_page_cost), with my current workload so far.

Thank you

goliath -- david
On 6/20/23 12:02, nicolas paris wrote:
> ...
>
> Definitely, it makes sense to add the part_col to the join columns.
> It also helps the planner choose a better plan, since now it goes
> with a per-partition nested loop without having to trick the costs
> (either enable_hashjoin or random_page_cost), with my current workload
> so far.
>

Right. With a non-partitionwise join, the nestloop inner lookup has to do an index scan on every partition (it can't decide which of the partitions will have a match, and for costing we assume there's at least 1 row in each lookup). Which essentially amplifies the amount of random I/O by a factor of 100x (or whatever the number of partitions is).

That is, instead of doing 100 nested loops like this:

    ->  Nested Loop Left Join  (cost=0.29..33.42 rows=8 width=47)
          ->  Seq Scan on david_98 david_99  (cost=0.00..1.08
          ->  Index Scan using goliath_98_id_part_col_idx on
                Index Cond: ((id = david_99.id) AND ...)

we end up doing one nested loop with an inner lookup like this:

    ->  Append  (cost=0.29..557.63 rows=100 width=14)
          ->  Index Scan using ... goliath_1  (cost=0.29..5.57 ...
                Index Cond: (id = david.id)
          ...

And this is per loop, of which there will be 500 (because the small david table has 500 rows).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company