Thread: Possible duplicate release of buffer lock.
Hello, I had an inquiry about the following log messages. 2016-07-20 10:16:58.294 JST,,,3240,,578ed102.ca8,1,,2016-07-20 10:16:50 JST,30/75,0,LOG,00000,"no left sibling (concurrentdeletion?) in ""some_index_rel""",,,,,,,,"_bt_unlink_halfdead_page, nbtpage.c:1643","" 2016-07-20 10:16:58.294 JST,,,3240,,578ed102.ca8,2,,2016-07-20 10:16:50 JST,30/75,0,ERROR,XX000,"lock main 13879 is not held",,,,,"automaticvacuum of table ""db.nsp.tbl""",,,"LWLockRelease, lwlock.c:1137","" These are gotten after pg_upgrade from 9.1.13 to 9.4. The first line is emitted for simultaneous deletion of a index page, which is impossible by design in a consistent state so the complained situation should be the result of an index corruption before upgading, specifically, inconsistent sibling pointers around a deleted page. I noticed the following part in nbtpage.c related to this. It is the same still in the master. nbtpage.c:1635@9.4.8: > while (P_ISDELETED(opaque) || opaque->btpo_next != target) > { > /* step right one page */ > leftsib = opaque->btpo_next; > _bt_relbuf(rel, lbuf); > if (leftsib == P_NONE) > { > elog(LOG, "no left sibling (concurrent deletion?) in \"%s\"", > RelationGetRelationName(rel)); > return false; With the condition for the while loop, if the just left sibling of target is (mistakenly, of course) in deleted state (and the target is somehow pointing to the deleted page as left sibling), lbuf finally goes beyond to right side of the target. This seems to result in unintentional releasing of the lock on target and the second log message. My point here is that if concurrent deletion can't be perfomed by the current implement, this while loop could be removed and immediately error out or log a message, > if (P_ISDELETED(opaque) || opaque->btpo_next != target) > { > elog(ERROR, "no left sibling of page %d (concurrent deletion?) in \"%s\"",.. or, the while loop at least should stop before overshooting the target. > while (P_ISDELETED(opaque) || opaque->btpo_next != target) > { > /* step right one page */ > leftsib = opaque->btpo_next; > _bt_relbuf(rel, lbuf); > if (leftsib == target || leftsib == P_NONE) > { > elog(ERROR, "no left sibling of page %d (concurrent deletion?) in \"%s\"",.. I'd like to propose to do the former since the latter still is not perfect for such situations, anyway. Any thoughts or opinions? regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes: > My point here is that if concurrent deletion can't be perfomed by > the current implement, this while loop could be removed and > immediately error out or log a message, >> if (P_ISDELETED(opaque) || opaque->btpo_next != target) >> { >> elog(ERROR, "no left sibling of page %d (concurrent deletion?) in \"%s\"",.. That would certainly break things: there are valid cases for the loop to need to iterate, specifically where the left sibling got split before we could acquire lock on it. > or, the while loop at least should stop before overshooting the > target. >> while (P_ISDELETED(opaque) || opaque->btpo_next != target) >> { >> /* step right one page */ >> leftsib = opaque->btpo_next; >> _bt_relbuf(rel, lbuf); >> if (leftsib == target || leftsib == P_NONE) >> { >> elog(ERROR, "no left sibling of page %d (concurrent deletion?) in \"%s\"",.. Huh? Surely that added test condition could never be true because of the second part of the while() test. regards, tom lane
Sorry. That's partially wrong. At Wed, 03 Aug 2016 17:31:16 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160803.173116.111915228.horiguchi.kyotaro@lab.ntt.co.jp> > I had an inquiry about the following log messages. > > 2016-07-20 10:16:58.294 JST,,,3240,,578ed102.ca8,1,,2016-07-20 10:16:50 JST,30/75,0,LOG,00000,"no left sibling (concurrentdeletion?) in ""some_index_rel""",,,,,,,,"_bt_unlink_halfdead_page, nbtpage.c:1643","" > 2016-07-20 10:16:58.294 JST,,,3240,,578ed102.ca8,2,,2016-07-20 10:16:50 JST,30/75,0,ERROR,XX000,"lock main 13879 is notheld",,,,,"automatic vacuum of table ""db.nsp.tbl""",,,"LWLockRelease, lwlock.c:1137","" > > These are gotten after pg_upgrade from 9.1.13 to 9.4. > > The first line is emitted for simultaneous deletion of a index > page, which is impossible by design in a consistent state so the > complained situation should be the result of an index corruption > before upgading, specifically, inconsistent sibling pointers > around a deleted page. > > I noticed the following part in nbtpage.c related to this. It is > the same still in the master. > > nbtpage.c:1635@9.4.8: > > > while (P_ISDELETED(opaque) || opaque->btpo_next != target) > > { > > /* step right one page */ > > leftsib = opaque->btpo_next; > > _bt_relbuf(rel, lbuf); > > if (leftsib == P_NONE) > > { > > elog(LOG, "no left sibling (concurrent deletion?) in \"%s\"", > > RelationGetRelationName(rel)); > > return false; > > With the condition for the while loop, if the just left sibling > of target is (mistakenly, of course) in deleted state (and the > target is somehow pointing to the deleted page as left sibling), > lbuf finally goes beyond to right side of the target. This seems > to result in unintentional releasing of the lock on target and > the second log message. > > > My point here is that if concurrent deletion can't be perfomed by > the current implement, this while loop could be removed and > immediately error out or log a message, > > > > if (P_ISDELETED(opaque) || opaque->btpo_next != target) > > { > > elog(ERROR, "no left sibling of page %d (concurrent deletion?) in \"%s\"",.. The above is the result of forgetting the main object of this loop. Please forget it. Still it seems right to stop before target. > or, the while loop at least should stop before overshooting the > target. > > > while (P_ISDELETED(opaque) || opaque->btpo_next != target) > > { > > /* step right one page */ > > leftsib = opaque->btpo_next; > > _bt_relbuf(rel, lbuf); > > if (leftsib == target || leftsib == P_NONE) > > { > > elog(ERROR, "no left sibling of page %d (concurrent deletion?) in \"%s\"",.. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, Aug 3, 2016 at 1:31 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > The first line is emitted for simultaneous deletion of a index > page, which is impossible by design in a consistent state so the > complained situation should be the result of an index corruption > before upgading, specifically, inconsistent sibling pointers > around a deleted page. > My point here is that if concurrent deletion can't be perfomed by > the current implement, this while loop could be removed and > immediately error out or log a message, Perhaps I have misunderstood, but I must ask: Are you aware that 9.4 has commit efada2b8, which fixes a bug with page deletion, that never made it into earlier branches? It is worth noting that anything that can make VACUUM in particular raise an error is considered problematic. For example, we do not allow VACUUM to finish incomplete B-Tree page splits, even though we could, because there might be no disk space for even one extra page at the worst possible time (we VACUUM in part to free disk space, of course). We really want to let VACUUM finish, if at all possible, even if it sees something that indicates data corruption. I remember that during the review of the similar B-Tree page split patch (not efada2b8, but rather commit 40dae7ec, which was also in 9.4), I found bugs due to functions not being very clear about whether a buffer lock and/or pin were held upon returning from a function in all cases, including very rare cases or "can't happen" cases. So, this seems kind of familiar. Code comments in the relevant function say this: * Returns 'false' if the page could not be unlinked (shouldn't happen).* If the (new) right sibling of the page is empty,*rightsib_empty is set* to true.*/ static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty) { What we see here is that this "shouldn't happen" condition does, in fact, happen very rarely, including in cases where the target is not a leaf page. I think there might well be a real bug here, because we are not careful enough in ensuring that the _bt_unlink_halfdead_page() "contract" holds when something that "can't happen" does, in fact, happen: /* * Release the target, if it was not the leaf block. The leaf is always * kept locked. */ if (target !=leafblkno) _bt_relbuf(rel, buf); return true; } (Assuming you can call this code at the end of _bt_unlink_halfdead_page() its "contract".) It's not good at all if the code is stuck in a way that prevents VACUUM from finishing, as I said. That seems to be the case here, judging from Horiguchi-san's log output, which shows an ERROR from "lock main 13879 is not held". The problem is not so much that a "can't happen" thing happens (data corruption will sometimes make "the impossible" possible, after all -- we know this, and write code defensively because of this). The bug is that there is any ERROR for VACUUM that isn't absolutely unavoidable (the damage has been done anyway -- the index is already corrupt). I'll need to think about this some more, when I have more time. Perhaps tomorrow. -- Peter Geoghegan
At Wed, 03 Aug 2016 12:00:33 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <26373.1470240033@sss.pgh.pa.us> > Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes: > > My point here is that if concurrent deletion can't be perfomed by > > the current implement, this while loop could be removed and > > immediately error out or log a message, > > >> if (P_ISDELETED(opaque) || opaque->btpo_next != target) > >> { > >> elog(ERROR, "no left sibling of page %d (concurrent deletion?) in \"%s\"",.. > > That would certainly break things: there are valid cases for the > loop to need to iterate, specifically where the left sibling got > split before we could acquire lock on it. Sorry for the garbage. It indeed is just a breakage. > > or, the while loop at least should stop before overshooting the > > target. > > >> while (P_ISDELETED(opaque) || opaque->btpo_next != target) > >> { > >> /* step right one page */ > >> leftsib = opaque->btpo_next; > >> _bt_relbuf(rel, lbuf); > >> if (leftsib == target || leftsib == P_NONE) > >> { > >> elog(ERROR, "no left sibling of page %d (concurrent deletion?) in \"%s\"",.. > > Huh? Surely that added test condition could never be true because > of the second part of the while() test. It can be masked by P_ISDELETED(opaque), this story is for the case that the "deleted" page has the right link to the target. More precisely, the case where a deleted page is seen during traversing sibling-link even though starting from a live page (or target, specifically). As Peter's reply, this is in the case of curruption, which could be saved just in order to continue performing VACUUM. Or a violation on a "contract" about page locks. If it's useless, the P_ISDELETED is also useless unless any corruption, or concurrent deletion. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hello, At Thu, 4 Aug 2016 16:09:17 -0700, Peter Geoghegan <pg@heroku.com> wrote in <CAM3SWZS2+ALNdDjWyeRLpjvss_MYU+n_9DQAL9fg4i-KPc9Q0A@mail.gmail.com> > On Wed, Aug 3, 2016 at 1:31 AM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > The first line is emitted for simultaneous deletion of a index > > page, which is impossible by design in a consistent state so the > > complained situation should be the result of an index corruption > > before upgading, specifically, inconsistent sibling pointers > > around a deleted page. > > > My point here is that if concurrent deletion can't be perfomed by > > the current implement, this while loop could be removed and > > immediately error out or log a message, > > Perhaps I have misunderstood, but I must ask: Are you aware that 9.4 > has commit efada2b8, which fixes a bug with page deletion, that never > made it into earlier branches? _bt_unlink_halfdead_page in nbtpage.c of 9.4 that I visited this time is after the commit. And it doesn't change the condition for the "no left sibling" message. If I misunderstood you. > It is worth noting that anything that can make VACUUM in particular > raise an error is considered problematic. For example, we do not allow > VACUUM to finish incomplete B-Tree page splits, even though we could, > because there might be no disk space for even one extra page at the > worst possible time (we VACUUM in part to free disk space, of course). > We really want to let VACUUM finish, if at all possible, even if it > sees something that indicates data corruption. Yes, It seems to me that the "no left sibling" is ignorable since skipping for the case doesn't more harm. For this inquiry case, VACUUM stopped by a seemingly double-releasing of a lock just after "no left sibling". I suspect that it is caused by skipping deleted page just left of the target page and it can happen for certain type of corruption. (writing out pages in certain order then crash and incomplete crash recovery after that would cause this..) If concurrent deletion (marking half-dead, specifically) can't happen, P_ISDELETED() won't be seen in the loop but we may consider it as a skippable condition to continue VACUUM. But still I think we shouldn't release the lock on the target page there. But it doesn't harm except that VACUUM stops there so I don't mind it will be as it is. So I'd like to have opinions of others about this. > I remember that during the review of the similar B-Tree page split > patch (not efada2b8, but rather commit 40dae7ec, which was also in > 9.4), I found bugs due to functions not being very clear about whether > a buffer lock and/or pin were held upon returning from a function in > all cases, including very rare cases or "can't happen" cases. So, this > seems kind of familiar. I think the point is that is a danger leads to wrong result or crash of server, or not. On that criteria, this doesn't harm, maybe. > Code comments in the relevant function say this: > > * Returns 'false' if the page could not be unlinked (shouldn't happen). > * If the (new) right sibling of the page is empty, *rightsib_empty is set > * to true. > */ > static bool > _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty) > { > > What we see here is that this "shouldn't happen" condition does, in > fact, happen very rarely, including in cases where the target is not a > leaf page. > > I think there might well be a real bug here, because we are not > careful enough in ensuring that the _bt_unlink_halfdead_page() > "contract" holds when something that "can't happen" does, in fact, > happen: > > /* > * Release the target, if it was not the leaf block. The leaf is always > * kept locked. > */ > if (target != leafblkno) > _bt_relbuf(rel, buf); > > return true; > } > > (Assuming you can call this code at the end of > _bt_unlink_halfdead_page() its "contract".) > > It's not good at all if the code is stuck in a way that prevents > VACUUM from finishing, as I said. That seems to be the case here, > judging from Horiguchi-san's log output, which shows an ERROR from > "lock main 13879 is not held". The problem is not so much that a > "can't happen" thing happens (data corruption will sometimes make "the > impossible" possible, after all -- we know this, and write code > defensively because of this). The bug is that there is any ERROR for > VACUUM that isn't absolutely unavoidable (the damage has been done > anyway -- the index is already corrupt). Agreed. My thought is almost the as this, it seems to be better to save such 'not absolutely ununvoidable' case unless it's too complicated. > I'll need to think about this some more, when I have more time. > Perhaps tomorrow. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, Aug 5, 2016 at 11:27 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, > > If concurrent deletion (marking half-dead, specifically) can't > happen, P_ISDELETED() won't be seen in the loop but we may > consider it as a skippable condition to continue VACUUM. But > still I think we shouldn't release the lock on the target page > there. But it doesn't harm except that VACUUM stops there so I > don't mind it will be as it is. So I'd like to have opinions of > others about this. > Isn't the problem here is that due to some reason (like concurrent split), the code in question (loop -- while (P_ISDELETED(opaque) || opaque->btpo_next != target)) releases the lock on target buffer and the caller again tries to release the lock on target buffer when false is returned. So, rather than changing the loop in a way suggested by you, why can't we ensure that we don't release the lock on target buffer in that loop or we can check in the caller such that if target buffer is valid (for a failure case), then release it. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes: > Isn't the problem here is that due to some reason (like concurrent > split), the code in question (loop -- > while (P_ISDELETED(opaque) || opaque->btpo_next != target)) releases > the lock on target buffer and the caller again tries to release the > lock on target buffer when false is returned. Yeah. I do not think there is anything wrong with the loop-to-find- current-leftsib logic. The problem is lack of care about what resources are held on failure return. The sole caller thinks it should do "_bt_relbuf(rel, buf)" --- that is, drop lock and pin on what _bt_unlink_halfdead_page calls the leafbuf. But that function already dropped the lock (line 1582 in 9.4), and won't have reacquired it unless target != leafblkno. Another problem is that if the target is different from leafblkno, we'll hold a pin on the target page, which is leaked entirely in this code path. The caller knows nothing of the "target" block so it can't reasonably deal with all these cases. I'm inclined to change the function's API spec to say that on failure return, it's responsible for dropping lock and pin on the passed buffer. We could alternatively reacquire lock before returning, but that seems pointless. Another thing that looks fishy is at line 1611: leftsib = opaque->btpo_prev; At this point we already released lock on leafbuf so it seems pretty unsafe to fetch its left-link. And it's unnecessary cause we already did; the line should be just leftsib = leafleftsib; regards, tom lane
Hello, At Sat, 06 Aug 2016 12:45:32 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <16748.1470501932@sss.pgh.pa.us> > Amit Kapila <amit.kapila16@gmail.com> writes: > > Isn't the problem here is that due to some reason (like concurrent > > split), the code in question (loop -- > > while (P_ISDELETED(opaque) || opaque->btpo_next != target)) releases > > the lock on target buffer and the caller again tries to release the > > lock on target buffer when false is returned. > > Yeah. I do not think there is anything wrong with the loop-to-find- > current-leftsib logic. The problem is lack of care about what > resources are held on failure return. The sole caller thinks it > should do "_bt_relbuf(rel, buf)" --- that is, drop lock and pin on > what _bt_unlink_halfdead_page calls the leafbuf. But that function > already dropped the lock (line 1582 in 9.4), and won't have reacquired > it unless target != leafblkno. Another problem is that if the target > is different from leafblkno, we'll hold a pin on the target page, which > is leaked entirely in this code path. > > The caller knows nothing of the "target" block so it can't reasonably > deal with all these cases. I'm inclined to change the function's API > spec to say that on failure return, it's responsible for dropping > lock and pin on the passed buffer. We could alternatively reacquire > lock before returning, but that seems pointless. Agreed. > Another thing that looks fishy is at line 1611: > > leftsib = opaque->btpo_prev; > > At this point we already released lock on leafbuf so it seems pretty > unsafe to fetch its left-link. And it's unnecessary cause we already > did; the line should be just > > leftsib = leafleftsib; Right. And I found that it is already committed. Thanks. regards, -- Kyotaro Horiguchi NTT Open Source Software Center