Re: Failure while inserting parent tuple to B-tree is not fun - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Failure while inserting parent tuple to B-tree is not fun |
Date | |
Msg-id | 20131022183442.GG7435@awork2.anarazel.de |
In response to | Re: Failure while inserting parent tuple to B-tree is not fun (Heikki Linnakangas <hlinnakangas@vmware.com>) |
Responses |
Re: Failure while inserting parent tuple to B-tree is not fun
|
List | pgsql-hackers |
On 2013-10-22 21:29:13 +0300, Heikki Linnakangas wrote:
> On 22.10.2013 21:25, Andres Freund wrote:
> >On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote:
> >>Splitting a B-tree page is a two-stage process: First, the page is split,
> >>and then a downlink for the new right page is inserted into the parent
> >>(which might recurse to split the parent page, too). What happens if
> >>inserting the downlink fails for some reason? I tried that out, and it turns
> >>out that it's not nice.
> >>
> >>I used this to cause a failure:
> >>
> >>>--- a/src/backend/access/nbtree/nbtinsert.c
> >>>+++ b/src/backend/access/nbtree/nbtinsert.c
> >>>@@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel,
> >>>             _bt_relbuf(rel, pbuf);
> >>>         }
> >>>
> >>>+        elog(ERROR, "fail!");
> >>>+
> >>>         /* get high key from left page == lowest key on new right page */
> >>>         ritem = (IndexTuple) PageGetItem(page,
> >>>                                          PageGetItemId(page, P_HIKEY));
> >>
> >>postgres=# create table foo (i int4 primary key);
> >>CREATE TABLE
> >>postgres=# insert into foo select generate_series(1, 10000);
> >>ERROR:  fail!
> >>
> >>That's not surprising. But when I removed that elog again and restarted the
> >>server, I still can't insert. The index is permanently broken:
> >>
> >>postgres=# insert into foo select generate_series(1, 10000);
> >>ERROR:  failed to re-find parent key in index "foo_pkey" for split pages 4/5
> >>
> >>In real life, you would get a failure like this e.g. if you run out of memory
> >>or disk space while inserting the downlink to the parent. Although rare in
> >>practice, it's no fun if it happens.
> >
> >Why doesn't the incomplete split mechanism prevent this? Because we do
> >not delay checkpoints on the primary and a checkpoint happened just
> >before your elog(ERROR) above?
>
> Because there's no recovery involved. The failure I injected (or an
> out-of-memory or out-of-disk-space in the real world) doesn't cause a PANIC,
> just an ERROR that rolls back the current transaction, nothing more.
>
> We could put a critical section around the whole recursion that inserts the
> downlinks, so that you would get a PANIC and the incomplete split mechanism
> would fix it at recovery. But that would hardly be an improvement.

You were talking about restarting the server, that's why I assumed
recovery had been involved... But you just were talking about removing
the elog() again.

For me this clearly *has* to be in a critical section with the current
code. I had always assumed all multi-part actions would be.

Do you foresee the fix with ignoring missing downlinks to be
back-patchable? FWIW, I think I might have seen real-world cases of this.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
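
Editorial illustration (not part of the original message): the failure discussed above can be modelled with a toy C sketch. It is not PostgreSQL code. Every name in it (ToyPage, ToyParent, toy_split, toy_insert_downlink, toy_repair_missing_downlink) is hypothetical, and it assumes a single parent page with a fixed-size downlink array. It shows only that an error between the two stages leaves a right page the parent does not know about, and one possible way a later inserter could tolerate that state by adding the missing downlink.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not the nbtree implementation. */
#define TOY_MAX_DOWNLINKS 8

typedef struct ToyPage {
    int blkno;          /* block number of this page */
    int right_sibling;  /* block number of right sibling, -1 if none */
} ToyPage;

typedef struct ToyParent {
    int downlinks[TOY_MAX_DOWNLINKS];  /* block numbers of child pages */
    int ndownlinks;
} ToyParent;

/* Does the parent hold a downlink to the given block? */
bool toy_has_downlink(const ToyParent *parent, int blkno)
{
    for (int i = 0; i < parent->ndownlinks; i++)
        if (parent->downlinks[i] == blkno)
            return true;
    return false;
}

/* Stage 1: split the page. The new right page is linked from the left
 * page, but the parent does not know about it yet. */
void toy_split(ToyPage *left, ToyPage *right, int new_blkno)
{
    right->blkno = new_blkno;
    right->right_sibling = left->right_sibling;
    left->right_sibling = new_blkno;
}

/* Stage 2: insert the downlink into the parent. Returning false stands in
 * for an ERROR (out of memory, out of disk space, or the injected elog)
 * that aborts the transaction but does not undo stage 1. */
bool toy_insert_downlink(ToyParent *parent, int child_blkno, bool inject_failure)
{
    if (inject_failure || parent->ndownlinks >= TOY_MAX_DOWNLINKS)
        return false;
    parent->downlinks[parent->ndownlinks++] = child_blkno;
    return true;
}

/* Assumed repair strategy: a later inserter that finds a right sibling
 * without a downlink adds the downlink instead of raising an error. */
bool toy_repair_missing_downlink(ToyParent *parent, const ToyPage *left)
{
    if (left->right_sibling < 0)
        return true;  /* never split, nothing to repair */
    if (toy_has_downlink(parent, left->right_sibling))
        return true;  /* split was completed normally */
    return toy_insert_downlink(parent, left->right_sibling, false);
}
```

Splitting block 4 into 4/5 with toy_split, then calling toy_insert_downlink with inject_failure set, leaves the parent with a downlink to block 4 only. That is the toy analogue of the "split pages 4/5" state in the quoted error message. Calling toy_repair_missing_downlink afterwards adds the downlink to block 5.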