Re: Logical replication prefetch - Mailing list pgsql-hackers
From: Konstantin Knizhnik
Subject: Re: Logical replication prefetch
Msg-id: 26dcc7a3-c3c1-44a4-87e0-bfc68fe7901d@garret.ru
In response to: Logical replication prefetch (Konstantin Knizhnik <knizhnik@garret.ru>)
List: pgsql-hackers
On 08/07/2025 2:51 pm, Amit Kapila wrote:
> On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
>> There is a well-known Postgres problem: a logical replication subscriber
>> cannot catch up with the publisher, because LR changes are applied by a
>> single worker, while at the publisher the changes are made by multiple
>> concurrent backends. The problem is not specific to logical replication:
>> the physical replication stream is also handled by a single walreceiver.
>> But for physical replication Postgres now implements prefetch: looking at
>> WAL record blocks, it is quite easy to predict which pages will be
>> required for redo and prefetch them. With logical replication the
>> situation is much more complicated.
>>
>> My first idea was to implement parallel apply of transactions. But to do
>> it we need to track dependencies between transactions. Right now
>> Postgres can apply transactions in parallel, but only if they are
>> streamed (which is done only for large transactions), and it serializes
>> them by commits. It is possible to enforce parallel apply of short
>> transactions using `debug_logical_replication_streaming`, but then
>> performance is ~2x slower than in case of sequential apply by a
>> single worker.
>>
> What is the reason for such a large slowdown? Is it because the amount
> of network transfer has increased without giving any significant
> advantage because of the serialization of commits?

It is not directly related to the subject, but I do not understand this code:

```
/*
 * Stop the worker if there are enough workers in the pool.
 *
 * XXX Additionally, we also stop the worker if the leader apply worker
 * serialize part of the transaction data due to a send timeout. This is
 * because the message could be partially written to the queue and there
 * is no way to clean the queue other than resending the message until it
 * succeeds. Instead of trying to send the data which anyway would have
 * been serialized and then letting the parallel apply worker deal with
 * the spurious message, we stop the worker.
 */
if (winfo->serialize_changes ||
    list_length(ParallelApplyWorkerPool) >
    (max_parallel_apply_workers_per_subscription / 2))
{
    logicalrep_pa_worker_stop(winfo);
    pa_free_worker_info(winfo);
    return;
}
```

It stops the worker if the number of workers in the pool is more than half of `max_parallel_apply_workers_per_subscription`. What I see is that `pa_launch_parallel_worker` spawns new workers, and right after completion of a transaction the worker is immediately terminated. This actually leads to an awful slowdown of the apply process. If I just disable this check, so that all `max_parallel_apply_workers_per_subscription` workers are actually used for applying transactions, then the time of parallel apply with 4 workers is 6 minutes, compared with 10 minutes for applying all transactions by the main apply worker. It is still not that large an improvement, but at least it is an improvement and not a degradation.