Re: Conflict detection for update_deleted in logical replication - Mailing list pgsql-hackers

From: Nisha Moond
Subject: Re: Conflict detection for update_deleted in logical replication
Msg-id: CABdArM5kvA7mPLLwy6XEDkHi0MNs1RidvAcYmm2uVd95U=yzwQ@mail.gmail.com
In response to: Re: Conflict detection for update_deleted in logical replication (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On Wed, Jul 9, 2025 at 5:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jul 8, 2025 at 12:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jul 7, 2025 at 12:03 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> >
> > I think these performance regressions occur because at some point the
> > subscriber can no longer keep up with the changes occurring on the
> > publisher. This is because the publisher runs multiple transactions
> > simultaneously, while the subscriber applies them with one apply
> > worker. When retain_conflict_info = on, the performance of the apply
> > worker deteriorates because it retains dead tuples, and as a result it
> > gradually cannot keep up with the publisher, the table bloats, and the
> > TPS of pgbench executed on the subscriber is also affected. This
> > happened when only 40 clients (or 15 clients according to the results
> > of test 4?) were running simultaneously.
> >
>
> I think here the primary reason is the speed of one apply worker vs.
> 15 or 40 clients working on the publisher, with all the data being
> replicated. We don't see regression at 3 clients, which suggests the
> apply worker is able to keep up with that much workload. Now, we have
> checked that if the workload is slightly different, such that fewer
> clients (say 1-3) work on the same set of tables, and we then create a
> different pub-sub pair for each such set of clients (for example,
> 3 clients working on tables t1 and t2, another 3 clients working on
> tables t3 and t4; then we can have 2 pub-sub pairs, one for tables
> t1 and t2, and the other for t3 and t4), then there is almost negligible
> regression after enabling retain_conflict_info. Additionally, for very
> large transactions that can be parallelized, we shouldn't see any
> regression because those can be applied in parallel.
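The pub-sub layout suggested above (one publication/subscription pair per disjoint table set, so each set gets its own apply worker) can be sketched as below. This is only an illustration: the object names (pub_set_N, sub_set_N) and the connection string are assumptions, not taken from the attached scripts.

```shell
#!/bin/sh
# Sketch: one publication/subscription pair per disjoint table set,
# as suggested above (set 0 = {t1, t2}, set 1 = {t3, t4}).
# Object names and the connection string are illustrative assumptions.
NSETS=2

for i in $(seq 0 $((NSETS - 1))); do
    a=$((2 * i + 1)); b=$((2 * i + 2))
    # Publisher side: publish only this set's tables.
    echo "CREATE PUBLICATION pub_set_$i FOR TABLE t$a, t$b;"
    # Subscriber side: one subscription (and hence one apply worker)
    # per publication.
    echo "CREATE SUBSCRIPTION sub_set_$i CONNECTION 'host=publisher dbname=postgres' PUBLICATION pub_set_$i;"
done > setup.sql

cat setup.sql
```

Each statement would be run with psql against the respective node; the point is only that disjoint table sets map 1:1 to apply workers.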
Yes, in test case-03 [1], the performance drop (~50%) observed on the
subscriber side was primarily due to a single apply worker handling changes
from 40 concurrent clients on the publisher, which led to the accumulation
of dead tuples. To validate this and simulate a more realistic workload, we
designed a test as suggested above, where multiple clients update different
tables and multiple subscriptions exist on the subscriber (one per table
set). A custom pgbench script was created to run pgbench on the publisher,
with each client updating a unique set of tables. On the subscriber side,
one subscription was created per set of tables, so each
publication-subscription pair handles a distinct table set.

Highlights
==========
- Two tests were done with two different workloads - 15 and 45 concurrent
  clients, respectively.
- No regression was observed when publisher changes were processed by
  multiple apply workers on the subscriber.

Used source
===========
pgHead commit 62a17a92833 + v47 patch set

Machine details
===============
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, CPU(s): 88 cores, RAM: 503 GiB

01. pgbench on both sides (with 15 clients)
===========================================
Setup:
- Publisher and Subscriber nodes are created with configurations:
    autovacuum = false
    shared_buffers = '30GB'
-- Also, worker and logical replication related parameters were increased
as per requirement (see attached scripts for details).

Workload:
- The publisher has 15 sets of pgbench tables. Each set includes four
tables - pgbench_accounts, pgbench_tellers, pgbench_branches, and
pgbench_history - named pgbench_accounts_0, pgbench_tellers_0, ...,
pgbench_accounts_14, pgbench_tellers_14, etc.
- Ran pgbench with 15 clients on *both sides*.
-- On the publisher, each client updates *only one* set of pgbench tables:
e.g., client '0' updates the pgbench_xx_0 tables, client '1' updates the
pgbench_xx_1 tables, and so on.
-- On the subscriber, there is one subscription per set of publisher
tables, i.e., one apply worker consuming the changes of each publisher
client. So, #subscriptions on subscriber (15) = #clients on publisher (15).
- On the subscriber, the default pgbench workload is also run with 15 clients.
- The duration was 5 minutes, and the measurement was repeated 3 times.

Test Scenarios & Results:
Publisher:
- pgHead          : Median TPS = 10386.93507
- pgHead + patch  : Median TPS = 10187.0887 (TPS reduced ~2%)
Subscriber:
- pgHead          : Median TPS = 10006.3903
- pgHead + patch  : Median TPS = 9986.269682 (TPS reduced ~0.2%)

Observation:
- No performance regression was observed on either the publisher or the
subscriber with the patch applied.
- The TPS drop was under 2% on both sides, within the expected run-to-run
variation range.

Detailed Results Table:
On publisher:
 #run    pgHead         pgHead+patch(ON)
 1       10477.26438    10029.36155
 2       10261.63429    10187.0887
 3       10386.93507    10750.86231
 median  10386.93507    10187.0887

On subscriber:
 #run    pgHead         pgHead+patch(ON)
 1       10261.63429    9813.114002
 2       9962.914457    9986.269682
 3       10006.3903     10580.13015
 median  10006.3903     9986.269682

~~~~
02. pgbench on both sides (with 45 clients)
===========================================
Setup:
- Same as case 01.

Workload:
- The publisher has the same 15 sets of pgbench tables as in case 01, and
3 clients update each set of tables.
- Ran pgbench with 45 clients on *both sides*.
-- On the publisher, each set of pgbench tables is updated by *three*
clients: e.g., clients '0', '15', and '30' update the pgbench_xx_0 tables,
clients '1', '16', and '31' update the pgbench_xx_1 tables, and so on.
-- On the subscriber, there is one subscription per set of publisher
tables, i.e., one apply worker consuming the changes of *three* publisher
clients.
- On the subscriber, the default pgbench workload is also run with 45 clients.
- The duration was 5 minutes, and the measurement was repeated 3 times.
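Both per-client mappings above can be expressed in a single custom pgbench script via the built-in :client_id variable: with 15 table sets, :client_id % 15 sends clients 0/15/30 to set 0, clients 1/16/31 to set 1, and so on (for the 15-client run the modulo is a no-op). The sketch below is hedged - the file name, scale range, and statement mix are assumptions; the attached scripts carry the real workload.

```shell
#!/bin/sh
# Sketch of a custom pgbench script in which client N updates only the
# pgbench_*_(N % 15) tables.  pgbench evaluates \set expressions and then
# substitutes :set_id textually into the SQL, including inside the table
# name.  Scale range and statement mix are assumptions.
cat > update_own_set.sql <<'EOF'
\set set_id :client_id % 15
\set aid random(1, 100000)
\set delta random(-5000, 5000)
UPDATE pgbench_accounts_:set_id SET abalance = abalance + :delta WHERE aid = :aid;
EOF

# Case 01: 15 clients (one per set); case 02: 45 clients (three per set):
# pgbench -n -c 15 -j 15 -T 300 -f update_own_set.sql postgres
# pgbench -n -c 45 -j 45 -T 300 -f update_own_set.sql postgres
```

With this shape, scaling from 15 to 45 clients changes only the -c/-j values, not the script, which keeps the two cases directly comparable.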
Test Scenarios & Results:
Publisher:
- pgHead          : Median TPS = 13845.7381
- pgHead + patch  : Median TPS = 13553.682 (TPS reduced ~2%)
Subscriber:
- pgHead          : Median TPS = 10080.54686
- pgHead + patch  : Median TPS = 9908.304381 (TPS reduced ~1.7%)

Observation:
- No significant performance regression was observed on either the
publisher or the subscriber with the patch applied.
- The TPS drop was under 2% on both sides, within the expected run-to-run
variation range.

Detailed Results Table:
On publisher:
 #run    pgHead         pgHead+patch(ON)
 1       14446.62404    13616.81375
 2       12988.70504    13425.22938
 3       13845.7381     13553.682
 median  13845.7381     13553.682

On subscriber:
 #run    pgHead         pgHead+patch(ON)
 1       10505.47481    9908.304381
 2       9963.119531    9843.280308
 3       10080.54686    9987.983147
 median  10080.54686    9908.304381

~~~~
The scripts used to perform the above tests are attached.

[1] https://www.postgresql.org/message-id/OSCPR01MB1496663AED8EEC566074DFBC9F54CA%40OSCPR01MB14966.jpnprd01.prod.outlook.com
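As a sanity check, the quoted ~2% publisher-side drop for case 02 follows directly from the per-run numbers in the table above; a small shell/awk sketch of the arithmetic:

```shell
#!/bin/sh
# Recompute the case-02 publisher medians and the relative TPS drop
# from the per-run numbers in the table above.
median() { printf '%s\n' "$@" | sort -n | sed -n '2p'; }

head_med=$(median 14446.62404 12988.70504 13845.7381)
patch_med=$(median 13616.81375 13425.22938 13553.682)

awk -v h="$head_med" -v p="$patch_med" \
    'BEGIN { printf "drop = %.1f%%\n", 100 * (h - p) / h }' > drop.txt
cat drop.txt
```

The same computation on the subscriber columns reproduces the ~1.7% figure.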