Re: Single transaction in the tablesync worker? - Mailing list pgsql-hackers
From | Peter Smith |
---|---|
Subject | Re: Single transaction in the tablesync worker? |
Date | |
Msg-id | CAHut+PsvDaSktxfXhkMNuh5j5K=Ze81kdBrmmcBXrKcRhPi0hA@mail.gmail.com Whole thread Raw |
In response to | Re: Single transaction in the tablesync worker? (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: Single transaction in the tablesync worker?
|
List | pgsql-hackers |
On Mon, Jan 11, 2021 at 3:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jan 8, 2021 at 8:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Jan 8, 2021 at 7:14 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > FYI, I was able to reproduce this case in debugger. PSA logs showing details. > > > > > > > Thanks for reproducing as I was worried about exactly this case. I > > have one question related to logs: > > > > ## > > ## ALTER SUBSCRIPTION to REFRESH the publication > > > > ## This blocks on some latch until the tablesync worker dies, then it continues > > ## > > > > Did you check which exact latch or lock blocks this? > > > > I have checked this myself and the command is waiting on the drop of > origin till the tablesync worker is finished because replorigin_drop() > requires state->acquired_by to be 0 which will only be true once the > tablesync worker exits. I think this is the reason you might have > noticed that the command can't be finished until the tablesync worker > died. So this can't be an interlock between ALTER SUBSCRIPTION .. > REFRESH command and tablesync worker and to that end it seems you have > below Fixme's in the patch: I have also seen this same blocking reason before in the replorigin_drop(). However, back when I first tested/reproduced the refresh issue [test-refresh] that AlterSubscription_refresh was still *original* unchanged code, so at that time it did not have any replorigin_drop() in at all. In any case in the latest code [v14] the AlterSubscription is immediately stopping the workers so this question may be moot. > > + * FIXME - Usually this cleanup would be OK, but will not > + * always be OK because the logicalrep_worker_stop_at_commit > + * only "flags" the worker to be stopped in the near future > + * but meanwhile it may still be running. In this case there > + * could be a race between the tablesync worker and this code > + * to see who will succeed with the tablesync drop (and the > + * loser will ERROR). > + * > + * FIXME - Also, checking the state is also not guaranteed > + * correct because state might be TCOPYDONE when we checked > + * but has since progressed to SYNDONE > + */ > + > + if (state == SUBREL_STATE_TCOPYDONE) > + { > > I feel this was okay for an earlier code but now we need to stop the > tablesync workers before trying to drop the slot as we do in > DropSubscription. Now, if we do that then that would fix the race > conditions mentioned in Fixme but still, there are few more things I > am worried about: (a) What if the launcher again starts the tablesync > worker? One idea could be to acquire AccessExclusiveLock on > SubscriptionRelationId as we do in DropSubscription which is not a > very good idea but I can't think of any other good way. (b) the patch > is just checking SUBREL_STATE_TCOPYDONE before dropping the > replication slot but the slot could be created even before that (in > SUBREL_STATE_DATASYNC state). One idea could be we can try to drop the > slot and if we are not able to drop then we can simply continue > assuming it didn't exist. The code was modified in the latest patch [v14] something like as suggested. The workers for removed tables are now immediately stopped (like DropSubscription does). Although I did include the AccessExclusiveLock as (a) suggested, AFAIK this was actually ineffective at preventing the workers relaunching. Instead, I am using logicalrep_worker_stop_at_commit to do this - testing shows it as working ok. Please see the code and latest test logs [v14] for details. Also, now the Drop/AlterSubscription will only give WARNING if unable to drop slots, a per suggestion (b). This is also tested [v14]. > > One minor comment: > 1. > + SpinLockAcquire(&MyLogicalRepWorker->relmutex); > MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCDONE; > MyLogicalRepWorker->relstate_lsn = current_lsn; > - > > Spurious line removal. Fixed in latest patch [v14] ---- [v14] = https://www.postgresql.org/message-id/CAHut%2BPsPO2vOp%2BP7U2szROMy15PKKGanhUrCYQ0ffpy9zG1V1A%40mail.gmail.com [test-refresh] https://www.postgresql.org/message-id/CAHut%2BPv7YW7AyO_Q_nf9kzogcJcDFQNe8FBP6yXdzowMz3dY_Q%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
pgsql-hackers by date: