Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION - Mailing list pgsql-hackers
From | Fujii Masao |
---|---|
Subject | Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION |
Date | |
Msg-id | CAHGQGwGyXNY1LWXY21W1GWsSafLLJNzxNk1EJ8DqW9_-6oH6tw@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION (Petr Jelinek <petr.jelinek@2ndquadrant.com>) |
Responses |
Re: [HACKERS] Cannot shutdown subscriber after DROP SUBSCRIPTION
|
List | pgsql-hackers |
On Tue, Feb 7, 2017 at 2:13 AM, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote: > On 06/02/17 17:33, Fujii Masao wrote: >> On Sun, Feb 5, 2017 at 5:11 AM, Petr Jelinek >> <petr.jelinek@2ndquadrant.com> wrote: >>> On 03/02/17 19:38, Fujii Masao wrote: >>>> On Sat, Feb 4, 2017 at 12:49 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>>>> On Fri, Feb 3, 2017 at 10:56 AM, Kyotaro HORIGUCHI >>>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>>>>> At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHqQVHmQ7wM=eLNnp1_oy-GVSSAcaJXWjE4nc2twSqXRg@mail.gmail.com> >>>>>>> On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier >>>>>>> <michael.paquier@gmail.com> wrote: >>>>>>>> On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>>>>>>> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes: >>>>>>>>>> Then, the reason for the TRY-CATCH cluase is that I found that >>>>>>>>>> some functions called from there can throw exceptions. >>>>>>>>> >>>>>>>>> Yes, but all LWLocks should be released by normal error recovery. >>>>>>>>> It should not be necessary for this code to clean that up by hand. >>>>>>>>> If it were necessary, there would be TRY-CATCH around every single >>>>>>>>> LWLockAcquire in the backend, and we'd have an unreadable and >>>>>>>>> unmaintainable system. Please don't add a TRY-CATCH unless it's >>>>>>>>> *necessary* -- and you haven't explained why this one is. >>>>>>> >>>>>>> Yes. >>>>>> >>>>>> Thank you for the suggestion. I minunderstood that. >>>>>> >>>>>>>> Putting hands into the code and at the problem, I can see that >>>>>>>> dropping a subscription on a node makes it unresponsive in case of a >>>>>>>> stop. And that's just because calls to LWLockRelease are missing as in >>>>>>>> the patch attached. A try/catch problem should not be necessary. >>>>>>> >>>>>>> Thanks for the patch! >>>>>>> >>>>>>> With the patch, LogicalRepLauncherLock is released at the end of >>>>>>> DropSubscription(). But ISTM that the lock should be released just after >>>>>>> logicalrep_worker_stop() and there is no need to protect the removal of >>>>>>> replication slot with the lock. >>>>>> >>>>>> That's true. logicalrep_worker_stop returns after confirmig that >>>>>> worker->proc is cleard, so no false relaunch cannot be caused. >>>>>> After all, logicalrep_worker_stop is surrounded by >>>>>> LWLockAcquire/Relase pair. So it can be moved into the funciton >>>>>> and make the lock secrion to be more narrower. >>>> >>>> If we do this, Assert(LWLockHeldByMe(LogicalRepLauncherLock)) should be >>>> removed and the comment for logicalrep_worker_stop() should be updated. >>>> >>>> Your approach may cause the deadlock. The launcher takes LogicalRepWorkerLock >>>> while holding LogicalRepLauncherLock. OTOH, with your approach, >>>> logicalrep_worker_stop() takes LogicalRepLauncherLock while holding >>>> LogicalRepWorkerLock. >>>> >>>> Therefore I pushed the simple patch which adds LWLockRelease() just after >>>> logicalrep_worker_stop(). >>>> >>>> Another problem that I found while reading the code is that the launcher can >>>> start up the worker with the subscription that DROP SUBSCRIPTION just removed. >>>> That is, DROP SUBSCRIPTION removes the target entry from pg_subscription, >>>> but the launcher can see it and start new worker until the transaction for >>>> DROP has been committed. >>>> >>> >>> That was the reason why DropSubscription didn't release the lock in the >>> first place. It was supposed to be released at the end of the >>> transaction though. >> >> OK, I understood why you used the lock in that way. But using LWLock >> for that purpose is not valid. >> > > Yeah, I just tried to avoid what we are doing now really hard :) > >>>> To fix this issue, I think that DROP SUBSCRIPTION should take >>>> AccessExclusiveLock on pg_subscription, instead of RowExclusiveLock, >>>> so that the launcher cannot see the entry to be being removed. >>>> >>> >>> The whole point of having LogicalRepLauncherLock is to avoid having to >>> do this, so if we do this we could probably get rid of it. >> >> Yes, let's remove LogicalRepLauncherLock and lock pg_subscription >> with AccessExclusive mode at the beginning of DROP SUBSCRIPTION. >> Attached patch does this. >> > > Okay, looks reasonable to me. Thanks for the review! But ISMT that I should suspend committing the patch until we fix the issue that Sawada reported in other thread. That bugfix may change the related code and design very much. https://www.postgresql.org/message-id/CAD21AoD+VO93zZ4ZQtZQb-jZ_wMko3OgGdx1MXO4T+8q_zHDDA@mail.gmail.com Regards, -- Fujii Masao
pgsql-hackers by date: