Home > mailing lists

Re: Single transaction in the tablesync worker? - Mailing list pgsql-hackers

From	Peter Smith
Subject	Re: Single transaction in the tablesync worker?
Date	January 12, 2021 12:47:09
Msg-id	CAHut+PsvDaSktxfXhkMNuh5j5K=Ze81kdBrmmcBXrKcRhPi0hA@mail.gmail.com Whole thread Raw
In response to	Re: Single transaction in the tablesync worker? (Amit Kapila <amit.kapila16@gmail.com>)
Responses	Re: Single transaction in the tablesync worker?
List	pgsql-hackers

Tree view

On Mon, Jan 11, 2021 at 3:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 8, 2021 at 8:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jan 8, 2021 at 7:14 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > FYI, I was able to reproduce this case in debugger. PSA logs showing details.
> > >
> >
> > Thanks for reproducing as I was worried about exactly this case. I
> > have one question related to logs:
> >
> > ##
> > ## ALTER SUBSCRIPTION to REFRESH the publication
> >
> > ## This blocks on some latch until the tablesync worker dies, then it continues
> > ##
> >
> > Did you check which exact latch or lock blocks this?
> >
>
> I have checked this myself and the command is waiting on the drop of
> origin till the tablesync worker is finished because replorigin_drop()
> requires state->acquired_by to be 0 which will only be true once the
> tablesync worker exits. I think this is the reason you might have
> noticed that the command can't be finished until the tablesync worker
> died. So this can't be an interlock between ALTER SUBSCRIPTION ..
> REFRESH command and tablesync worker and to that end it seems you have
> below Fixme's in the patch:

I have also seen this same blocking reason before in the replorigin_drop().
However, back when I first tested/reproduced the refresh issue
[test-refresh] that
AlterSubscription_refresh was still *original* unchanged code, so at
that time it did not
have any replorigin_drop() in at all. In any case in the latest code
[v14] the AlterSubscription is
immediately stopping the workers so this question may be moot.

>
> + * FIXME - Usually this cleanup would be OK, but will not
> + * always be OK because the logicalrep_worker_stop_at_commit
> + * only "flags" the worker to be stopped in the near future
> + * but meanwhile it may still be running. In this case there
> + * could be a race between the tablesync worker and this code
> + * to see who will succeed with the tablesync drop (and the
> + * loser will ERROR).
> + *
> + * FIXME - Also, checking the state is also not guaranteed
> + * correct because state might be TCOPYDONE when we checked
> + * but has since progressed to SYNDONE
> + */
> +
> + if (state == SUBREL_STATE_TCOPYDONE)
> + {
>
> I feel this was okay for an earlier code but now we need to stop the
> tablesync workers before trying to drop the slot as we do in
> DropSubscription. Now, if we do that then that would fix the race
> conditions mentioned in Fixme but still, there are few more things I
> am worried about: (a) What if the launcher again starts the tablesync
> worker? One idea could be to acquire AccessExclusiveLock on
> SubscriptionRelationId as we do in DropSubscription which is not a
> very good idea but I can't think of any other good way. (b) the patch
> is just checking SUBREL_STATE_TCOPYDONE before dropping the
> replication slot but the slot could be created even before that (in
> SUBREL_STATE_DATASYNC state). One idea could be we can try to drop the
> slot and if we are not able to drop then we can simply continue
> assuming it didn't exist.

The code was modified in the latest patch [v14] something like as suggested.

The workers for removed tables are now immediately stopped (like
DropSubscription does). Although I did include the AccessExclusiveLock
as (a) suggested, AFAIK this was actually ineffective at preventing
the workers relaunching. Instead, I am using
logicalrep_worker_stop_at_commit to do this - testing shows it as
working ok. Please see the code and latest test logs [v14] for
details.

Also, now the Drop/AlterSubscription will only give WARNING if unable
to drop slots, a per suggestion (b). This is also tested [v14].

>
> One minor comment:
> 1.
> + SpinLockAcquire(&MyLogicalRepWorker->relmutex);
>   MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCDONE;
>   MyLogicalRepWorker->relstate_lsn = current_lsn;
> -
>
> Spurious line removal.

Fixed in latest patch [v14]

----
[v14] = https://www.postgresql.org/message-id/CAHut%2BPsPO2vOp%2BP7U2szROMy15PKKGanhUrCYQ0ffpy9zG1V1A%40mail.gmail.com
[test-refresh]
https://www.postgresql.org/message-id/CAHut%2BPv7YW7AyO_Q_nf9kzogcJcDFQNe8FBP6yXdzowMz3dY_Q%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

pgsql-hackers by date:

From: Masahiko Sawada
Date: 12 January 2021, 12:40:53
Subject: Re: Key management with tests

From: Amit Kapila
Date: 12 January 2021, 13:13:46
Subject: Re: Track replica origin progress for Rollback Prepared

Re: Single transaction in the tablesync worker? - Mailing list pgsql-hackers

Previous

Next