Re: Race condition in b-tree page deletion - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Race condition in b-tree page deletion |
Date | |
Msg-id | CA+TgmoYShtMdL-qikpYmHzC+94sb_oPRWGpCYfU-fBxnBmM=uw@mail.gmail.com Whole thread Raw |
In response to | Re: Race condition in b-tree page deletion (Heikki Linnakangas <hlinnakangas@vmware.com>) |
Responses |
Re: Race condition in b-tree page deletion
|
List | pgsql-hackers |
On Mon, Nov 11, 2013 at 3:03 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 10.11.2013 01:47, Robert Haas wrote: >> I think we've tried pretty hard to avoid algorithms where the maximum >> number of lwlocks that must be held at one time is not a constant, and >> I think we're in for a bad time of it if we start to deviate from that >> principal. I'm not sure what to do about this problem, but I think >> locking N levels of the tree at once, where N can be as large as the >> tree is deep, is probably a bad plan, whether the number of locks >> required is N or 3N. > > I think I found a solution that accomplishes that. It's actually not that > complicated: > > Like we currently do, first climb up the tree to check that it's safe to > delete, ie. the downlink in the first non-empty parent is not the rightmost > entry. But when we reach the level where the parent is non-empty - I'll call > that the "topmost" parent - we keep that page locked. The leaf page is kept > locked while we climb. > > This is enough to plug the race condition. As long as we hold a lock on the > topmost parent containing the downlink to the branch of pages we're about to > delete, it cannot become the rightmost entry. And as long as we hold a lock > on the leaf page, no new insertions can happen on any of the internal pages > in the branch, as insertions to internal pages only happen when a child is > split. However, the rest of the algorithm needs to be slightly modified, as > we cannot re-lock the lower-level pages until we release the lock on the > topmost page, to avoid deadlock. > > So at this point, we hold two locks: the leaf page, and the topmost parent > containing the downlink to the branch we're deleting. Next, we remove the > downlink from the topmost parent, and mark the leaf page as half-dead in one > atomic operation. Also, a pointer to the highest internal page in the branch > we're deleting - the one the removed downlink pointed to - is put on the > leaf page. We can now release the locks. > > At this point, searches and inserts work fine. The leaf page has been marked > as half-dead, so any insertions to the deleted page's keyspace will go to > its right sibling. The downlink is to the top of the branch is gone, so even > if the right sibling is split many times, any keys in the transferred > keyspace that propagate to the higher levels won't be out-of-order. > > All that is left is to unlink the all the lingering pages in the branch > we're deleting from their left and right siblings. This can be done at any > later time, and if we error out or crash for some reason, next vacuum that > comes along can finish the job. This is done one level at a time. Lock the > leaf page, and the internal page the leaf page points to, and the internal > page's left and right siblings (in the right order, not this order). Change > the left and right sibling's right- and left-links, mark the internal page > as deleted, and update the pointer in the leaf page to point to the child of > the deleted internal page. Then recurse to the child, until we reach the > leaf level. > > This has the nice extra property that we don't need the incomplete-action > tracking in WAL recovery. I'd like to get rid of that anyway. I am not very knowledgeable of this code, but at least with my limited knowledge I can't spot any problems with that approach. The only thing that seems a little unfortunate is that the next vacuum may need to finish cleaning things up; I thought at one point you were trying to get rid of the amvacuumcleanup stuff. But maybe I'm misremembering, and anyway it still seems like a step forward. > I'm not sure what to do about stable branches. This could be back-patched, > with the caveat that this introduces new WAL record types so standbys need > to be upgraded before the master. But given the lack of field reports in the > ten years this race condition has existed, I'm not sure it's worth the > hassle. In the absence of at least one (and maybe several) reports from the field, -1 for back-patching this. At this point, it seems like far more people will be confused by the need to upgrade things in order than will benefit from the fix. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: