Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions - Mailing list pgsql-hackers
From | Dilip Kumar |
---|---|
Subject | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Date | |
Msg-id | CAFiTN-udJ-hMy5mBTOx3n7ejAs2cVsnfjboy1QmRgoZRf=KbNQ@mail.gmail.com Whole thread Raw |
In response to | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions (Dilip Kumar <dilipbalaut@gmail.com>) |
Responses |
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
|
List | pgsql-hackers |
On Mon, Jul 13, 2020 at 10:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sun, Jul 12, 2020 at 9:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > > > 9. > > > > > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn, > > > > > > { > > > > > > .. > > > > > > + ReorderBufferToastReset(rb, txn); > > > > > > + if (specinsert != NULL) > > > > > > + ReorderBufferReturnChange(rb, specinsert); > > > > > > .. > > > > > > } > > > > > > > > > > > > Why do we need to do these here when we wouldn't have been done for > > > > > > any exception other than ERRCODE_TRANSACTION_ROLLBACK? > > > > > > > > > > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK" > > > > > gracefully and we are continuing with further decoding so we need to > > > > > return this change back. > > > > > > > > > > > > > Okay, then I suggest we should do these before calling stream_stop and > > > > also move ReorderBufferResetTXN after calling stream_stop to follow a > > > > pattern similar to try block unless there is a reason for not doing > > > > so. Also, it would be good if we can initialize specinsert with NULL > > > > after returning the change as we are doing at other places. > > > > > > Okay > > > > > > > > > 10. I have got the below failure once. I have not investigated this > > > > > > in detail as the patch is still under progress. See, if you have any > > > > > > idea? > > > > > > # Failed test 'check extra columns contain local defaults' > > > > > > # at t/013_stream_subxact_ddl_abort.pl line 81. > > > > > > # got: '2|0' > > > > > > # expected: '1000|500' > > > > > > # Looks like you failed 1 test of 2. > > > > > > make[2]: *** [check] Error 1 > > > > > > make[1]: *** [check-subscription-recurse] Error 2 > > > > > > make[1]: *** Waiting for unfinished jobs.... > > > > > > make: *** [check-world-src/test-recurse] Error 2 > > > > > > > > > > Even I got the failure once and after that, it did not reproduce. I > > > > > have executed it multiple time but it did not reproduce again. Are > > > > > you able to reproduce it consistently? > > > > > > > > > > > > > No, I am also not able to reproduce it consistently but I think this > > > > can fail if a subscriber sends the replay_location before actually > > > > replaying the changes. First, I thought that extra send_feedback we > > > > have in apply_handle_stream_commit might have caused this but I guess > > > > that can't happen because we need the commit time location for that > > > > and we are storing the same at the end of apply_handle_stream_commit > > > > after applying all messages. I am not sure what is going on here. I > > > > think we somehow need to reproduce this or some variant of this test > > > > consistently to find the root cause. > > > > > > And I think it appeared first time for me, so maybe either induced > > > from past few versions so some changes in the last few versions might > > > have exposed it. I have noticed that almost 50% of the time I am able > > > to reproduce after the clean build so I can trace back from which > > > version it started appearing that way it will be easy to narrow down. > > > > I think the reason for the failure is that we are not setting > > remote_final_lsn, in the streaming mode. I have put multiple logs and > > executed in log and from logs it appeared that some of the logical wal > > did not get replayed due to below check in > > should_apply_changes_for_rel. > > return (rel->state == SUBREL_STATE_READY || (rel->state == > > SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn)); > > > > I still need to do the detailed analysis that why does this fail in > > some cases, basically, most of the time the rel->state is > > SUBREL_STATE_READY so this check passes but whenever the state is > > SUBREL_STATE_SYNCDONE it failed because we never update > > remote_final_lsn. I will try to set this value in > > apply_handle_stream_commit and see whether it ever fails or not. > > I have verified that after setting the remote_final_lsn in the > apply_handle_stream_commit, I don't see that regression failure in > over 70 runs whereas without that change it failed 6 times in 50 runs. > Apart from this, I have noticed one more thing related to the same > point. Basically, in the apply_handle_commit, we are calling > process_syncing_tables whereas we are not calling the same in > apply_handle_stream_commit. I have set the remote_final_lsn as well as called process_syncing_tables, in apply_handle_stream_commit. Please see the latest patch set v33. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
pgsql-hackers by date: