Re: Perform streaming logical transactions by background workers and parallel apply - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Perform streaming logical transactions by background workers and parallel apply |
Date | |
Msg-id | CAD21AoDm3224e=se7=ZYt=R+v0_ZJ4E9dd5y2816_rTTCV+G+Q@mail.gmail.com |
In response to | Re: Perform streaming logical transactions by background workers and parallel apply (Amit Kapila <amit.kapila16@gmail.com>) |
Responses | Re: Perform streaming logical transactions by background workers and parallel apply |
List | pgsql-hackers |
On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Oct 7, 2022 at 8:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Oct 6, 2022 at 9:04 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > I think the root reason for this kind of deadlock problems is the table
> > > structure difference between publisher and subscriber(similar to the unique
> > > difference reported earlier[1]). So, I think we'd better disallow this case. For
> > > example to avoid the reported problem, we could only support parallel apply if
> > > pubviaroot is false on publisher and replicated tables' types(relkind) are the
> > > same between publisher and subscriber.
> > >
> > > Although it might restrict some use cases, but I think it only restrict the
> > > cases when the partitioned table's structure is different between publisher and
> > > subscriber. User can still use parallel apply for cases when the table
> > > structure is the same between publisher and subscriber which seems acceptable
> > > to me. And we can also document that the feature is expected to be used for the
> > > case when tables' structure are the same. Thoughts ?
> >
> > I'm concerned that it could be a big restriction for users. Having
> > different partitioned table's structures on the publisher and the
> > subscriber is quite common use cases.
> >
> > From the feature perspective, the root cause seems to be the fact that
> > the apply worker does both receiving and applying changes. Since it
> > cannot receive the subsequent messages while waiting for a lock on a
> > table, the parallel apply worker also cannot move forward. If we have
> > a dedicated receiver process, it can off-load the messages to the
> > worker while another process waiting for a lock. So I think that
> > separating receiver and apply worker could be a building block for
> > parallel-apply.
> >
>
> I think the disadvantage that comes to mind is the overhead of passing
> messages between receiver and applier processes even for non-parallel
> cases. Now, I don't think it is advisable to have separate handling
> for non-parallel cases. The other thing is that we need to someway
> deal with feedback messages which helps to move synchronous replicas
> and update subscriber's progress which in turn helps to keep the
> restart point updated. These messages also act as heartbeat messages
> between walsender and walapply process.
>
> To deal with this, one idea is that we can have two connections to
> walsender process, one with walreceiver and the other with walapply
> process which according to me could lead to a big increase in resource
> consumption and it will bring another set of complexities in the
> system. Now, in this, I think we have two possibilities, (a) The first
> one is that we pass all messages to the leader apply worker and then
> it decides whether to execute serially or pass it to the parallel
> apply worker. However, that can again deadlock in the truncate
> scenario we discussed because the main apply worker won't be able to
> receive new messages once it is blocked at the truncate command. (b)
> The second one is walreceiver process itself takes care of passing
> streaming transactions to parallel apply workers but if we do that
> then walreceiver needs to wait at the transaction end to maintain
> commit order which means it can also lead to deadlock in case the
> truncate happens in a streaming xact.

I imagined (b) but I had missed the point of preserving the commit
order. Separating the receiver and apply worker cannot resolve this
problem.

>
> The other alternative is that we allow walreceiver process to wait for
> apply process to finish transaction and send the feedback but that
> seems to be again an overhead if we have to do it even for small
> transactions, especially it can delay sync replication cases. Even, if
> we don't consider overhead, it can still lead to a deadlock because
> walreceiver won't be able to move in the scenario we are discussing.
>
> About your point that having different partition structures for
> publisher and subscriber, I don't know how common it will be once we
> have DDL replication. Also, the default value of
> publish_via_partition_root is false which doesn't seem to indicate
> that this is a quite common case.

So how can we consider these concurrent issues that could happen only
when streaming = 'parallel'? Can we restrict some use cases to avoid
the problem or can we have a safeguard against these conflicts? We
could find a new problematic scenario in the future and if it happens,
logical replication gets stuck, it cannot be resolved only by apply
workers themselves.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
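Editorial illustration (not part of the original message): the circular wait discussed in the thread can be pictured as a cycle in a wait-for graph. The sketch below is a toy model in Python, not PostgreSQL code. The process names (`leader_apply`, `parallel_apply`, `receiver`) and the `find_cycle` helper are hypothetical, and the assumption that the contested table lock is held by the parallel apply worker's transaction is an interpretation of the thread. It deliberately does not model the commit-order wait that, per the reply above, keeps a separate receiver from resolving the problem.

```python
from typing import Dict, List, Optional

# A wait-for graph: each process maps to the processes it is waiting on.
WaitGraph = Dict[str, List[str]]


def find_cycle(graph: WaitGraph) -> Optional[List[str]]:
    """Return one cycle in the wait-for graph (first node repeated at the end), or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}
    for targets in graph.values():
        for target in targets:
            color.setdefault(target, WHITE)
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in graph.get(node, []):
            if color[nxt] == GREY:
                # Back edge to a node on the current path: a circular wait.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(color):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Case 1 (assumed): one process both receives and applies.
combined: WaitGraph = {
    # Leader is blocked on a table lock held by the parallel worker's transaction.
    "leader_apply": ["parallel_apply"],
    # Parallel worker needs the rest of its streamed transaction, which only
    # the (blocked) leader can receive and forward.
    "parallel_apply": ["leader_apply"],
}

# Case 2 (assumed): a dedicated receiver forwards messages, commit order ignored.
separated: WaitGraph = {
    "leader_apply": ["parallel_apply"],
    "parallel_apply": ["receiver"],
    "receiver": [],  # waits on nobody in this simplified model
}

if __name__ == "__main__":
    print("combined:", find_cycle(combined))
    print("separated:", find_cycle(separated))
```

In this simplified model only the combined case contains a cycle. Adding an edge from `receiver` back to a worker, to represent waiting at transaction end for commit order, would reintroduce one, which is the point conceded in the reply.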