Re: btree split logic is fragile in the presence of large index items - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: btree split logic is fragile in the presence of large index items |
Date | |
Msg-id | 19541.963964069@sss.pgh.pa.us Whole thread Raw |
In response to | RE: btree split logic is fragile in the presence of lar ge index items ("Mikheev, Vadim" <vmikheev@SECTORBASE.COM>) |
Responses |
RE: btree split logic is fragile in the presence of large index items
|
List | pgsql-hackers |
I've been chewing some more on the duplicate-key btree issue, and digging through CVS logs and Postgres v4.2 to try to understand the history of the code (and boy, this code does have a lot of history doesn't it?) What I think I now understand is this: 1. The PG 4.2 code used an OID appended to the key to ensure that btree items are unique. This fixes the Lehman and Yao algorithm's assumption of unique keys *for insertions*, since the item being inserted must have an OID different from existing items even if the key values are equal. However there is still an issue for *lookups*, since we are necessarily doing a lookup with just a key value, and we want to find all index items with that same key value regardless of OID. 2. The solution used in the PG 4.2 code, and still present now, handles equal-keys for lookup by mandating that a sequence of equal-keyed items not cross page boundaries unless the first key in that sequence is at the start of an index page. This ensures that that first key is present in the parent node and so a lookup for that key will descend to the page in which the sequence starts. Without this restriction the lookup would descend to the first continuation page of the sequence and so miss the first few matching items. However we now see that this restriction can cause insertion failures by preventing a page from being split where it needs to be split to make room for the incoming item. 3. Awhile back Vadim removed the added-OID code and added a bunch of logic for explicit management of chains of duplicate keys. In retrospect this change was probably a mistake. For btree items that point to heap tuples we can use the tuple TID as tie-breaker (since an index should never contain two items with the same key and TID). Btree internal pages need an added field since they have to consider the TID as part of the key (and the field that is the TID in a leaf page is used as the down-link pointer in non-leaf pages). I believe that the PG 4.2 solution for equal-key lookups is also a mistake, and that the correct answer is simple: split pages wherever you want, but when descending through a non-leaf page, consider the target key to be *less than* index items that actually have equal keys. In this way, if we are looking at a downlink pointer that exactly matches the target key, we go to the page one left of the page the pointer references, and thereby find any equal-keyed items that may be lurking at the end of that page. (We could possibly implement this by treating the key being searched for as having an attached TID of minus infinity. Note we already have more or less this same idea in place for scanning a multi-column index using a partial key.) This might sound klugy since it means a different comparison rule for descending the tree than for actually deciding whether we should return a particular item. But *we have to make such a distinction anyway* in order to handle NULL items. (The equality check must always say FALSE when comparing nulls, but we have to consider NULLs as sortable data values when entering them into the tree.) I'm starting to feel that the right fix is to bite the bullet and redo the index logic this way, rather than add another band-aid. One thing I'm still not too clear on is how we handle backwards indexscans. After looking at Lehman and Yao's paper it seems like only forward scans are guaranteed to work when other processes are busy splitting pages. Anybody know how that's handled? regards, tom lane
pgsql-hackers by date: