Thread: Possible duplicate release of buffer lock.

Possible duplicate release of buffer lock.

From

Kyotaro HORIGUCHI

Date:

03 August 2016, 08:31:37

Hello, 

I had an inquiry about the following log messages.

2016-07-20 10:16:58.294 JST,,,3240,,578ed102.ca8,1,,2016-07-20 10:16:50 JST,30/75,0,LOG,00000,"no left sibling
(concurrentdeletion?) in ""some_index_rel""",,,,,,,,"_bt_unlink_halfdead_page, nbtpage.c:1643",""
 
2016-07-20 10:16:58.294 JST,,,3240,,578ed102.ca8,2,,2016-07-20 10:16:50 JST,30/75,0,ERROR,XX000,"lock main 13879 is not
held",,,,,"automaticvacuum of table ""db.nsp.tbl""",,,"LWLockRelease, lwlock.c:1137","" 
 

These are gotten after pg_upgrade from 9.1.13 to 9.4.

The first line is emitted for simultaneous deletion of a index
page, which is impossible by design in a consistent state so the
complained situation should be the result of an index corruption
before upgading, specifically, inconsistent sibling pointers
around a deleted page.

I noticed the following part in nbtpage.c related to this. It is
the same still in the master.

nbtpage.c:1635@9.4.8:

>  while (P_ISDELETED(opaque) || opaque->btpo_next != target)
>  {
>          /* step right one page */
>          leftsib = opaque->btpo_next;
>          _bt_relbuf(rel, lbuf);
>          if (leftsib == P_NONE)
>          {
>                  elog(LOG, "no left sibling (concurrent deletion?) in \"%s\"",
>                           RelationGetRelationName(rel));
>                  return false;

With the condition for the while loop, if the just left sibling
of target is (mistakenly, of course) in deleted state (and the
target is somehow pointing to the deleted page as left sibling),
lbuf finally goes beyond to right side of the target. This seems
to result in unintentional releasing of the lock on target and
the second log message.


My point here is that if concurrent deletion can't be perfomed by
the current implement, this while loop could be removed and
immediately error out or log a message,


> if (P_ISDELETED(opaque) || opaque->btpo_next != target)
> {
>    elog(ERROR, "no left sibling of page %d (concurrent deletion?) in \"%s\"",..


or, the while loop at least should stop before overshooting the
target.

> while (P_ISDELETED(opaque) || opaque->btpo_next != target)
> {
>    /* step right one page */
>    leftsib = opaque->btpo_next;
>    _bt_relbuf(rel, lbuf);
>    if (leftsib == target || leftsib == P_NONE)
>    {
>      elog(ERROR, "no left sibling of page %d (concurrent deletion?) in \"%s\"",..


I'd like to propose to do the former since the latter still is
not perfect for such situations, anyway.


Any thoughts or opinions?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Possible duplicate release of buffer lock.

From

Tom Lane

Date:

03 August 2016, 16:00:45

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> My point here is that if concurrent deletion can't be perfomed by
> the current implement, this while loop could be removed and
> immediately error out or log a message,

>> if (P_ISDELETED(opaque) || opaque->btpo_next != target)
>> {
>> elog(ERROR, "no left sibling of page %d (concurrent deletion?) in \"%s\"",..

That would certainly break things: there are valid cases for the
loop to need to iterate, specifically where the left sibling got
split before we could acquire lock on it.

> or, the while loop at least should stop before overshooting the
> target.

>> while (P_ISDELETED(opaque) || opaque->btpo_next != target)
>> {
>> /* step right one page */
>> leftsib = opaque->btpo_next;
>> _bt_relbuf(rel, lbuf);
>> if (leftsib == target || leftsib == P_NONE)
>> {
>> elog(ERROR, "no left sibling of page %d (concurrent deletion?) in \"%s\"",..

Huh?  Surely that added test condition could never be true because
of the second part of the while() test.
        regards, tom lane

Re: Possible duplicate release of buffer lock.

From

Kyotaro HORIGUCHI

Date:

04 August 2016, 00:08:07

Sorry. That's partially wrong.

At Wed, 03 Aug 2016 17:31:16 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20160803.173116.111915228.horiguchi.kyotaro@lab.ntt.co.jp>
> I had an inquiry about the following log messages.
> 
> 2016-07-20 10:16:58.294 JST,,,3240,,578ed102.ca8,1,,2016-07-20 10:16:50 JST,30/75,0,LOG,00000,"no left sibling
(concurrentdeletion?) in ""some_index_rel""",,,,,,,,"_bt_unlink_halfdead_page, nbtpage.c:1643",""
 
> 2016-07-20 10:16:58.294 JST,,,3240,,578ed102.ca8,2,,2016-07-20 10:16:50 JST,30/75,0,ERROR,XX000,"lock main 13879 is
notheld",,,,,"automatic vacuum of table ""db.nsp.tbl""",,,"LWLockRelease, lwlock.c:1137","" 
 
> 
> These are gotten after pg_upgrade from 9.1.13 to 9.4.
> 
> The first line is emitted for simultaneous deletion of a index
> page, which is impossible by design in a consistent state so the
> complained situation should be the result of an index corruption
> before upgading, specifically, inconsistent sibling pointers
> around a deleted page.
> 
> I noticed the following part in nbtpage.c related to this. It is
> the same still in the master.
> 
> nbtpage.c:1635@9.4.8:
> 
> >  while (P_ISDELETED(opaque) || opaque->btpo_next != target)
> >  {
> >          /* step right one page */
> >          leftsib = opaque->btpo_next;
> >          _bt_relbuf(rel, lbuf);
> >          if (leftsib == P_NONE)
> >          {
> >                  elog(LOG, "no left sibling (concurrent deletion?) in \"%s\"",
> >                           RelationGetRelationName(rel));
> >                  return false;
> 
> With the condition for the while loop, if the just left sibling
> of target is (mistakenly, of course) in deleted state (and the
> target is somehow pointing to the deleted page as left sibling),
> lbuf finally goes beyond to right side of the target. This seems
> to result in unintentional releasing of the lock on target and
> the second log message.
> 
> 
> My point here is that if concurrent deletion can't be perfomed by
> the current implement, this while loop could be removed and
> immediately error out or log a message,
> 
> 
> > if (P_ISDELETED(opaque) || opaque->btpo_next != target)
> > {
> >    elog(ERROR, "no left sibling of page %d (concurrent deletion?) in \"%s\"",..

The above is the result of forgetting the main object of this
loop. Please forget it.

Still it seems right to stop before target.

> or, the while loop at least should stop before overshooting the
> target.
> 
> > while (P_ISDELETED(opaque) || opaque->btpo_next != target)
> > {
> >    /* step right one page */
> >    leftsib = opaque->btpo_next;
> >    _bt_relbuf(rel, lbuf);
> >    if (leftsib == target || leftsib == P_NONE)
> >    {
> >      elog(ERROR, "no left sibling of page %d (concurrent deletion?) in \"%s\"",..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Possible duplicate release of buffer lock.

From

Peter Geoghegan

Date:

04 August 2016, 23:09:24

On Wed, Aug 3, 2016 at 1:31 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> The first line is emitted for simultaneous deletion of a index
> page, which is impossible by design in a consistent state so the
> complained situation should be the result of an index corruption
> before upgading, specifically, inconsistent sibling pointers
> around a deleted page.

> My point here is that if concurrent deletion can't be perfomed by
> the current implement, this while loop could be removed and
> immediately error out or log a message,

Perhaps I have misunderstood, but I must ask: Are you aware that 9.4
has commit efada2b8, which fixes a bug with page deletion, that never
made it into earlier branches?

It is worth noting that anything that can make VACUUM in particular
raise an error is considered problematic. For example, we do not allow
VACUUM to finish incomplete B-Tree page splits, even though we could,
because there might be no disk space for even one extra page at the
worst possible time (we VACUUM in part to free disk space, of course).
We really want to let VACUUM finish, if at all possible, even if it
sees something that indicates data corruption.

I remember that during the review of the similar B-Tree page split
patch (not efada2b8, but rather commit 40dae7ec, which was also in
9.4), I found bugs due to functions not being very clear about whether
a buffer lock and/or pin were held upon returning from a function in
all cases, including very rare cases or "can't happen" cases. So, this
seems kind of familiar.

Code comments in the relevant function say this:
* Returns 'false' if the page could not be unlinked (shouldn't happen).* If the (new) right sibling of the page is
empty,*rightsib_empty is set* to true.*/

static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
{

What we see here is that this "shouldn't happen" condition does, in
fact, happen very rarely, including in cases where the target is not a
leaf page.

I think there might well be a real bug here, because we are not
careful enough in ensuring that the _bt_unlink_halfdead_page()
"contract" holds when something that "can't happen" does, in fact,
happen:
   /*    * Release the target, if it was not the leaf block.  The leaf is always    * kept locked.    */   if (target
!=leafblkno)       _bt_relbuf(rel, buf);

   return true;
}

(Assuming you can call this code at the end of
_bt_unlink_halfdead_page() its "contract".)

It's not good at all if the code is stuck in a way that prevents
VACUUM from finishing, as I said. That seems to be the case here,
judging from Horiguchi-san's log output, which shows an ERROR from
"lock main 13879 is not held". The problem is not so much that a
"can't happen" thing happens (data corruption will sometimes make "the
impossible" possible, after all -- we know this, and write code
defensively because of this). The bug is that there is any ERROR for
VACUUM that isn't absolutely unavoidable (the damage has been done
anyway -- the index is already corrupt).

I'll need to think about this some more, when I have more time.
Perhaps tomorrow.

-- 
Peter Geoghegan

Re: Possible duplicate release of buffer lock.

From

Kyotaro HORIGUCHI

Date:

05 August 2016, 05:55:56

At Wed, 03 Aug 2016 12:00:33 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <26373.1470240033@sss.pgh.pa.us>
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> > My point here is that if concurrent deletion can't be perfomed by
> > the current implement, this while loop could be removed and
> > immediately error out or log a message,
> 
> >> if (P_ISDELETED(opaque) || opaque->btpo_next != target)
> >> {
> >> elog(ERROR, "no left sibling of page %d (concurrent deletion?) in \"%s\"",..
> 
> That would certainly break things: there are valid cases for the
> loop to need to iterate, specifically where the left sibling got
> split before we could acquire lock on it.

Sorry for the garbage. It indeed is just a breakage.

> > or, the while loop at least should stop before overshooting the
> > target.
> 
> >> while (P_ISDELETED(opaque) || opaque->btpo_next != target)
> >> {
> >> /* step right one page */
> >> leftsib = opaque->btpo_next;
> >> _bt_relbuf(rel, lbuf);
> >> if (leftsib == target || leftsib == P_NONE)
> >> {
> >> elog(ERROR, "no left sibling of page %d (concurrent deletion?) in \"%s\"",..
> 
> Huh?  Surely that added test condition could never be true because
> of the second part of the while() test.

It can be masked by P_ISDELETED(opaque), this story is for the
case that the "deleted" page has the right link to the
target. More precisely, the case where a deleted page is seen
during traversing sibling-link even though starting from a live
page (or target, specifically).

As Peter's reply, this is in the case of curruption, which could
be saved just in order to continue performing VACUUM. Or a
violation on a "contract" about page locks. If it's useless, the
P_ISDELETED is also useless unless any corruption, or concurrent
deletion.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Possible duplicate release of buffer lock.

From

Kyotaro HORIGUCHI

Date:

05 August 2016, 05:57:57

Hello,

At Thu, 4 Aug 2016 16:09:17 -0700, Peter Geoghegan <pg@heroku.com> wrote in
<CAM3SWZS2+ALNdDjWyeRLpjvss_MYU+n_9DQAL9fg4i-KPc9Q0A@mail.gmail.com>
> On Wed, Aug 3, 2016 at 1:31 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > The first line is emitted for simultaneous deletion of a index
> > page, which is impossible by design in a consistent state so the
> > complained situation should be the result of an index corruption
> > before upgading, specifically, inconsistent sibling pointers
> > around a deleted page.
> 
> > My point here is that if concurrent deletion can't be perfomed by
> > the current implement, this while loop could be removed and
> > immediately error out or log a message,
> 
> Perhaps I have misunderstood, but I must ask: Are you aware that 9.4
> has commit efada2b8, which fixes a bug with page deletion, that never
> made it into earlier branches?

_bt_unlink_halfdead_page in nbtpage.c of 9.4 that I visited this
time is after the commit. And it doesn't change the condition for
the "no left sibling" message. If I misunderstood you.

> It is worth noting that anything that can make VACUUM in particular
> raise an error is considered problematic. For example, we do not allow
> VACUUM to finish incomplete B-Tree page splits, even though we could,
> because there might be no disk space for even one extra page at the
> worst possible time (we VACUUM in part to free disk space, of course).
> We really want to let VACUUM finish, if at all possible, even if it
> sees something that indicates data corruption.

Yes, It seems to me that the "no left sibling" is ignorable since
skipping for the case doesn't more harm.

For this inquiry case, VACUUM stopped by a seemingly
double-releasing of a lock just after "no left sibling". I
suspect that it is caused by skipping deleted page just left of
the target page and it can happen for certain type of
corruption. (writing out pages in certain order then crash and
incomplete crash recovery after that would cause this..)

If concurrent deletion (marking half-dead, specifically) can't
happen, P_ISDELETED() won't be seen in the loop but we may
consider it as a skippable condition to continue VACUUM. But
still I think we shouldn't release the lock on the target page
there. But it doesn't harm except that VACUUM stops there so I
don't mind it will be as it is. So I'd like to have opinions of
others about this.

> I remember that during the review of the similar B-Tree page split
> patch (not efada2b8, but rather commit 40dae7ec, which was also in
> 9.4), I found bugs due to functions not being very clear about whether
> a buffer lock and/or pin were held upon returning from a function in
> all cases, including very rare cases or "can't happen" cases. So, this
> seems kind of familiar.

I think the point is that is a danger leads to wrong result or
crash of server, or not. On that criteria, this doesn't harm,
maybe.

> Code comments in the relevant function say this:
> 
>  * Returns 'false' if the page could not be unlinked (shouldn't happen).
>  * If the (new) right sibling of the page is empty, *rightsib_empty is set
>  * to true.
>  */
> static bool
> _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
> {
> 
> What we see here is that this "shouldn't happen" condition does, in
> fact, happen very rarely, including in cases where the target is not a
> leaf page.
> 
> I think there might well be a real bug here, because we are not
> careful enough in ensuring that the _bt_unlink_halfdead_page()
> "contract" holds when something that "can't happen" does, in fact,
> happen:
> 
>     /*
>      * Release the target, if it was not the leaf block.  The leaf is always
>      * kept locked.
>      */
>     if (target != leafblkno)
>         _bt_relbuf(rel, buf);
> 
>     return true;
> }
> 
> (Assuming you can call this code at the end of
> _bt_unlink_halfdead_page() its "contract".)
> 
> It's not good at all if the code is stuck in a way that prevents
> VACUUM from finishing, as I said. That seems to be the case here,
> judging from Horiguchi-san's log output, which shows an ERROR from
> "lock main 13879 is not held". The problem is not so much that a
> "can't happen" thing happens (data corruption will sometimes make "the
> impossible" possible, after all -- we know this, and write code
> defensively because of this). The bug is that there is any ERROR for
> VACUUM that isn't absolutely unavoidable (the damage has been done
> anyway -- the index is already corrupt).

Agreed. My thought is almost the as this, it seems to be better
to save such 'not absolutely ununvoidable' case unless it's too
complicated.

> I'll need to think about this some more, when I have more time.
> Perhaps tomorrow.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Possible duplicate release of buffer lock.

From

Amit Kapila

Date:

06 August 2016, 06:19:09

On Fri, Aug 5, 2016 at 11:27 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> If concurrent deletion (marking half-dead, specifically) can't
> happen, P_ISDELETED() won't be seen in the loop but we may
> consider it as a skippable condition to continue VACUUM. But
> still I think we shouldn't release the lock on the target page
> there. But it doesn't harm except that VACUUM stops there so I
> don't mind it will be as it is. So I'd like to have opinions of
> others about this.
>

Isn't the problem here is that due to some reason (like concurrent
split), the code in question (loop --
while (P_ISDELETED(opaque) || opaque->btpo_next != target)) releases
the lock on target buffer and the caller again tries to release the
lock on target buffer when false is returned.  So, rather than
changing the loop in a way suggested by you, why can't we ensure that
we don't release the lock on target buffer in that loop or we can
check in the caller such that if target buffer is valid (for a failure
case), then release it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Possible duplicate release of buffer lock.

From

Tom Lane

Date:

06 August 2016, 16:45:49

Amit Kapila <amit.kapila16@gmail.com> writes:
> Isn't the problem here is that due to some reason (like concurrent
> split), the code in question (loop --
> while (P_ISDELETED(opaque) || opaque->btpo_next != target)) releases
> the lock on target buffer and the caller again tries to release the
> lock on target buffer when false is returned.

Yeah.  I do not think there is anything wrong with the loop-to-find-
current-leftsib logic.  The problem is lack of care about what
resources are held on failure return.  The sole caller thinks it
should do "_bt_relbuf(rel, buf)" --- that is, drop lock and pin on
what _bt_unlink_halfdead_page calls the leafbuf.  But that function
already dropped the lock (line 1582 in 9.4), and won't have reacquired
it unless target != leafblkno.  Another problem is that if the target
is different from leafblkno, we'll hold a pin on the target page, which
is leaked entirely in this code path.

The caller knows nothing of the "target" block so it can't reasonably
deal with all these cases.  I'm inclined to change the function's API
spec to say that on failure return, it's responsible for dropping
lock and pin on the passed buffer.  We could alternatively reacquire
lock before returning, but that seems pointless.

Another thing that looks fishy is at line 1611:
    leftsib = opaque->btpo_prev;

At this point we already released lock on leafbuf so it seems pretty
unsafe to fetch its left-link.  And it's unnecessary cause we already
did; the line should be just
    leftsib = leafleftsib;
        regards, tom lane

Re: Possible duplicate release of buffer lock.

From

Kyotaro HORIGUCHI

Date:

08 August 2016, 07:38:55

Hello,

At Sat, 06 Aug 2016 12:45:32 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <16748.1470501932@sss.pgh.pa.us>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > Isn't the problem here is that due to some reason (like concurrent
> > split), the code in question (loop --
> > while (P_ISDELETED(opaque) || opaque->btpo_next != target)) releases
> > the lock on target buffer and the caller again tries to release the
> > lock on target buffer when false is returned.
> 
> Yeah.  I do not think there is anything wrong with the loop-to-find-
> current-leftsib logic.  The problem is lack of care about what
> resources are held on failure return.  The sole caller thinks it
> should do "_bt_relbuf(rel, buf)" --- that is, drop lock and pin on
> what _bt_unlink_halfdead_page calls the leafbuf.  But that function
> already dropped the lock (line 1582 in 9.4), and won't have reacquired
> it unless target != leafblkno.  Another problem is that if the target
> is different from leafblkno, we'll hold a pin on the target page, which
> is leaked entirely in this code path.
> 
> The caller knows nothing of the "target" block so it can't reasonably
> deal with all these cases.  I'm inclined to change the function's API
> spec to say that on failure return, it's responsible for dropping
> lock and pin on the passed buffer. We could alternatively reacquire
> lock before returning, but that seems pointless.

Agreed.

> Another thing that looks fishy is at line 1611:
> 
>         leftsib = opaque->btpo_prev;
> 
> At this point we already released lock on leafbuf so it seems pretty
> unsafe to fetch its left-link.  And it's unnecessary cause we already
> did; the line should be just
> 
>         leftsib = leafleftsib;

Right. And I found that it is already committed. Thanks.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center