btree split logic is fragile in the presence of large index items - Mailing list pgsql-hackers
| From | Tom Lane |
|---|---|
| Subject | btree split logic is fragile in the presence of large index items |
| Date | |
| Msg-id | 18788.963953289@sss.pgh.pa.us |
| Responses | Re: btree split logic is fragile in the presence of large index items |
| List | pgsql-hackers |
Jan sent me a test case that produces a "btree: failed to add item to the page" failure message. The cause is that a btree index page needs to be split to make room for a new index item, but the split point is chosen in such a way that there still isn't room for the new item on the new page it has to go into. This appears to be a long-standing bug, but it's a problem that's not very likely to occur unless you are dealing with large index items (approaching the pagesize/3 limit).

In the test case, the initial contents of the index page look like

    itemTID    length
    0,7,1      1792    ("high key" from next page)
    0,14,1     40
    0,85,1     1848
    0,41,1     1656
    0,86,1     1464

(There are actually only four "real" keys on this page; the first entry is a copy of the lowest key from the next index page. BTW, I believe these "high keys" are the source of the extra index TOAST references that Jan was wondering about.)

After bt_split executes, I find the left page has

    itemTID    length
    0,85,1     1848    ("high key" from newly-inserted page)
    0,14,1     40

and the right page has

    itemTID    length
    0,7,1      1792    (still the high key from the next page)
    0,85,1     1848
    0,41,1     1656
    0,86,1     1464

Unfortunately, the incoming item of size 1416 has to go onto the right page, and it still doesn't fit.

In this particular situation the choice of the bad split point seems to be due to sloppy logic in _bt_findsplitloc --- it decides to split the page "in the middle", but its idea of the middle is

    firstindex + (lastindex - firstindex) / 2

which is off by one in this case, and is fundamentally wrong anyhow because the middle by item count is not necessarily the middle by size.

This could be repaired with some work in findsplitloc, but I'm afraid there is a more serious problem. The comments before findsplitloc say

    * In order to guarantee the proper handling of searches for duplicate
    * keys, the first duplicate in the chain must either be the first
    * item on the page after the split, or the entire chain must be on
    * one of the two pages.  That is,
    *     [1 2 2 2 3 4 5]
    * must become
    *     [1] [2 2 2 3 4 5]
    * or
    *     [1 2 2 2] [3 4 5]
    * but not
    *     [1 2 2] [2 3 4 5].
    * However,
    *     [2 2 2 2 2 3 4]
    * may be split as
    *     [2 2 2 2] [2 3 4].

If this is accurate, then there will be cases where it is impossible to split a page in a way that allows insertion of the new item where it needs to go. For instance, suppose in the above example that the three large items had all been equal keys and that the incoming item also has that same key value. Under the rules given in this comment the only legal split would be like

    [A] [B B B B]

where A represents the 40-byte key and the B's indicate the four equal keys. That will not fit.

However, I'm not sure I believe the comment anymore: it has not changed since Postgres95, and I can see that quite a bit of work has been done on the duplicate-key logic since then. Furthermore, findsplitloc itself sometimes ignores the claimed requirement: when it does the split-in-the-middle case quoted above, it does not pay attention to whether it is splitting in the middle of a group of duplicates. (But that path is taken infrequently enough that it's possible it's just plain broken and we haven't noticed.)

Does anyone know whether this comment still describes the btree equal-key logic accurately?

If so, one possible solution is to make _bt_insertonpg smart enough to detect that there's not room to insert after making a legal split, and then recursively split again. That looks fairly ticklish, however, especially if we want to preserve the existing guarantees of no deadlock in concurrent insertions.
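For concreteness, here is a minimal standalone sketch of what choosing the split point by accumulated item size rather than by item count could look like. This is not the actual _bt_findsplitloc code; the function name and layout are hypothetical, and it ignores the high key and the duplicate-key rules discussed above.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Illustration only (not the real _bt_findsplitloc): pick the split point
 * by accumulated item size instead of by item count.  Returns the index of
 * the first item that should go to the right-hand page.
 */
static int
choose_split_by_size(const size_t *itemsz, int nitems)
{
    size_t  total = 0;
    size_t  leftsz = 0;
    int     i;

    for (i = 0; i < nitems; i++)
        total += itemsz[i];

    /* accumulate items on the left until we pass half of the total size */
    for (i = 0; i < nitems; i++)
    {
        if (leftsz + itemsz[i] > total / 2)
            break;
        leftsz += itemsz[i];
    }
    return i;                   /* items [0, i) go left, [i, nitems) go right */
}

int
main(void)
{
    /* the four "real" item sizes from the test case above */
    size_t  itemsz[] = {40, 1848, 1656, 1464};
    int     split = choose_split_by_size(itemsz, 4);

    /* a count-based midpoint, firstindex + (lastindex - firstindex) / 2,
     * lands at index 1 regardless of how large the items are */
    printf("size-based split before item %d\n", split);
    return 0;
}
```

With the item sizes from the test case this would put the 40- and 1848-byte items on the left instead of stranding nearly all of the data on the right, leaving room for the incoming 1416-byte item in this example.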
A more radical way out is to do what Vadim's been saying we should do eventually: redo the btree logic so that there are never "equal" keys (ie, use the item TID as a tiebreaker when ordering items). That would fix our performance problems with many equal keys as well as simplify the code. But it'd be a good deal of work, I fear.

Comments, opinions, better ideas?

			regards, tom lane
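A rough sketch of what the TID-as-tiebreaker comparison could look like (hypothetical, simplified standalone C; not PostgreSQL's actual data structures or code): the key is compared first and the heap item TID breaks ties, so no two index entries ever compare as equal and no split point can fall inside a chain of duplicates.

```c
#include <stdio.h>

/* Hypothetical, simplified structures -- not PostgreSQL's definitions. */
typedef struct
{
    unsigned int    block;      /* heap block number */
    unsigned short  offset;     /* line pointer offset within the block */
} ItemPointer;

typedef struct
{
    int             key;        /* stand-in for the real index key */
    ItemPointer     tid;        /* heap tuple this index entry points to */
} IndexEntry;

/*
 * Compare on the key first, then on the heap TID (block, then offset),
 * so that every pair of index entries has a well-defined order.
 */
static int
index_entry_cmp(const IndexEntry *a, const IndexEntry *b)
{
    if (a->key != b->key)
        return (a->key < b->key) ? -1 : 1;
    if (a->tid.block != b->tid.block)
        return (a->tid.block < b->tid.block) ? -1 : 1;
    if (a->tid.offset != b->tid.offset)
        return (a->tid.offset < b->tid.offset) ? -1 : 1;
    return 0;
}

int
main(void)
{
    IndexEntry  a = {42, {0, 14}};
    IndexEntry  b = {42, {0, 85}};

    /* equal keys are ordered by TID: prints -1 */
    printf("%d\n", index_entry_cmp(&a, &b));
    return 0;
}
```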