Re: Perform streaming logical transactions by background workers and parallel apply - Mailing list pgsql-hackers
From: Masahiko Sawada
Subject: Re: Perform streaming logical transactions by background workers and parallel apply
Date:
Msg-id: CAD21AoBLPDPCE4jj16ZRE9J8AoxLtvo3yasToaP5kRp6++=yyg@mail.gmail.com
In response to: RE: Perform streaming logical transactions by background workers and parallel apply ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>)
Responses: RE: Perform streaming logical transactions by background workers and parallel apply
           Re: Perform streaming logical transactions by background workers and parallel apply
List: pgsql-hackers
On Tue, Sep 27, 2022 at 9:26 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Saturday, September 24, 2022 7:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Sep 22, 2022 at 3:41 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > >
> > > On Thu, Sep 22, 2022 at 8:59 AM wangw.fnst@fujitsu.com
> > > <wangw.fnst@fujitsu.com> wrote:
> > > >
> > >
> > > Few comments on v33-0001
> > > =======================
> > >
> >
> > Some more comments on v33-0001
> > =============================
> > 1.
> > +	/* Information from the corresponding LogicalRepWorker slot. */
> > +	uint16		logicalrep_worker_generation;
> > +
> > +	int			logicalrep_worker_slot_no;
> > +} ParallelApplyWorkerShared;
> >
> > Both these variables are read/changed by leader/parallel workers without
> > using any lock (mutex). It seems there is currently no problem because of
> > the way the patch uses in_parallel_apply_xact, but I think it won't be a
> > good idea to rely on that. I suggest using a mutex to operate on these
> > variables, and also checking that slot_no is in a valid range after
> > reading it in parallel_apply_free_worker, otherwise erroring out using
> > elog.
>
> Changed.
>
> > 2.
> > static void
> > apply_handle_stream_stop(StringInfo s)
> > {
> > -	if (!in_streamed_transaction)
> > +	ParallelApplyWorkerInfo *winfo = NULL;
> > +	TransApplyAction apply_action;
> > +
> > +	if (!am_parallel_apply_worker() &&
> > +		(!in_streamed_transaction && !stream_apply_worker))
> >  		ereport(ERROR,
> >  				(errcode(ERRCODE_PROTOCOL_VIOLATION),
> >  				 errmsg_internal("STREAM STOP message without STREAM START")));
> >
> > This check won't be able to detect missing stream start messages for
> > parallel apply workers apart from the first pair of start/stop. I thought
> > of adding an in_remote_transaction check along with
> > am_parallel_apply_worker() to detect the same, but that also won't work
> > because the parallel worker doesn't reset it at the stop message.
> > Another possibility is to introduce yet another variable for this, but
> > that doesn't seem worth it. I would like to keep this check simple.
> > Can you think of any better way?
>
> I feel we can reuse in_streamed_transaction in the parallel apply worker to
> simplify the check there. I tried to set this flag in the parallel apply
> worker when the stream starts and reset it at stream stop, so that we can
> directly check this flag for duplicate stream start messages and other
> related things.
>
> > 3. I think we can skip sending start/stop messages from the leader to the
> > parallel worker because, unlike the apply worker, it will process only one
> > transaction at a time. However, it is not clear whether that is worth the
> > effort because it is sent only after logical_decoding_work_mem is
> > exceeded. For now, I have added a comment for this in the attached patch,
> > but let me know if I am missing something or if I am wrong.
>
> I think the suggested comments look good.
>
> > 4.
> > postgres=# select pid, leader_pid, application_name, backend_type from
> > pg_stat_activity;
> >   pid  | leader_pid | application_name |         backend_type
> > -------+------------+------------------+------------------------------
> >  27624 |            |                  | logical replication launcher
> >  17336 |            | psql             | client backend
> >  26312 |            |                  | logical replication worker
> >  26376 |            | psql             | client backend
> >  14004 |            |                  | logical replication worker
> >
> > Here, the second worker entry is for the parallel worker. Isn't it better
> > if we distinguish this by keeping the type as a logical replication
> > parallel worker? I think for this you need to change bgw_type in
> > logicalrep_worker_launch().
>
> Changed.
>
> > 5. Can we name parallel_apply_subxact_info_add() as
> > parallel_apply_start_subtrans()?
> >
> > Apart from the above, I have added/edited a few comments and made a few
> > other cosmetic changes in the attached.

While looking at the v35 patch, I realized that there are some cases where
logical replication gets stuck depending on the partitioned table
structure. For instance, consider the following tables, publication, and
subscription:

* On publisher
create table p (c int) partition by list (c);
create table c1 partition of p for values in (1);
create table c2 (c int);
create publication test_pub for table p, c1, c2 with
(publish_via_partition_root = 'true');

* On subscriber
create table p (c int) partition by list (c);
create table c1 partition of p for values in (2);
create table c2 partition of p for values in (1);
create subscription test_sub connection 'port=5551 dbname=postgres'
publication test_pub with (streaming = 'parallel', copy_data = 'false');

Note that while both the publisher and the subscriber have tables with the
same names, the partition structure is different, so rows go to a different
table on the subscriber (e.g., a row with c = 1 goes to table c2 on the
subscriber).

If two concurrent transactions are executed as follows, the apply worker
(i.e., the leader apply worker) waits for a lock on c2 held by its parallel
apply worker:

* TX-1
BEGIN;
INSERT INTO p SELECT 1 FROM generate_series(1, 10000); --- changes are streamed

* TX-2
BEGIN;
TRUNCATE c2; --- wait for a lock on c2

* TX-1
INSERT INTO p SELECT 1 FROM generate_series(1, 10000);
COMMIT;

This might not be a common case in practice, but it could mean that there
is a restriction on how partitioned tables should be structured on the
publisher and the subscriber when using streaming = 'parallel'. When this
happens, since logical replication cannot move forward, users need to
disable the parallel-apply mode or increase logical_decoding_work_mem. We
could describe this limitation in the docs, but it would be hard for users
to detect such a problematic table structure in advance.
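At least once replication is stuck, the wait itself is visible from the
monitoring views. Something like the following (just a sketch, not part of
the patch), run on the subscriber, should show the leader apply worker
blocked on the lock, with pg_blocking_pids() pointing at the parallel apply
worker holding it:

SELECT pid, leader_pid, backend_type, wait_event_type, wait_event,
       pg_blocking_pids(pid) AS blocked_by
FROM pg_stat_activity
WHERE backend_type LIKE 'logical replication%';

Still, that only tells users that they have already hit the problem; it
doesn't help them find the problematic table structure beforehand.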
BTW, when the leader apply worker waits for a lock on c2 in the above
example, the parallel apply worker is in a busy-loop, which should be
fixed.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
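P.S. For completeness, the workarounds mentioned above would look like the
following (test_sub is the subscription from the example, and the memory
value is arbitrary). To fall back to non-parallel streaming on the
subscriber:

ALTER SUBSCRIPTION test_sub SET (streaming = 'on');

Or, on the publisher, raise the limit so that such transactions are no
longer streamed mid-flight (it has to exceed the decoded transaction size):

ALTER SYSTEM SET logical_decoding_work_mem = '256MB';
SELECT pg_reload_conf();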