On-disk compatibility for nbtree-unique-key enhancement - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: On-disk compatibility for nbtree-unique-key enhancement
Msg-id: CAH2-WzmjgBz-RL2-nyPc+NRZnU73YSLGEwZRyB2DhUQEdkEujg@mail.gmail.com
Responses: Re: On-disk compatibility for nbtree-unique-key enhancement
List: pgsql-hackers
I would like to get some feedback on dealing with on-disk compatibility for my nbtree patch, which is in the current CF [1]. I thought I'd start a new thread on these questions, since it's probably the most sensitive aspect of the project, and it can be discussed in a fairly contained way.

My patch makes nbtree index tuple keys uniquely identifiable by using heap TID as a differentiating last attribute/part of the key space, which complicates space management within routines like _bt_findsplitloc() (though, ideally, not very much). In general, we may need to make new "pivot" tuples [2] (all of which originate as the new high key added on the left side of a leaf page split) carry an additional, explicit representation of heap TID. I'm MAXALIGN()'ing ItemPointerData to fit the "extra" heap TID attribute once an actual page split is underway, so that's an extra 8 bytes that I might have to append to a non-pivot leaf tuple when a new high key/pivot is generated during a leaf page split. Those 8 bytes need to be accounted for.

I want to make the general pessimistic assumption that it's always necessary to have an extra 8 bytes to fit a heap TID, stolen from the "1/3 of a buffer" limit that we already enforce as the largest possible nbtree tuple size. Prior to the current _bt_findsplitloc() coding (circa 1999), there were bugs involving a failure to find a split point, and I'm keen to avoid more problems along those lines. Actually, I think it's essential to keep the space management logic as simple as possible, with everything ideally confined to a single page (and not reliant on things that are true across pages). So I feel reasonably confident that the best thing to do is impose certain theoretical costs on users in the short term, rather than doing something fancy and brittle that might be much harder to stabilize in the long term.

More concretely:

* My patch changes the definition of BTMaxItemSize() for indexes on the proposed new BTREE_VERSION, v4. We insist that new tuples must be less than or equal to 2704 bytes in size, including all IndexTuple overhead, rather than the current 2712 bytes (see the arithmetic sketch after this list).

* Under this scheme, the "internal" limit will remain 2712 bytes, so that pivot tuples that are already at the proposed new user-visible maximum of 2704 bytes can safely have another 8 bytes added.

* This means that there is a compatibility issue for anyone who is already right at the threshold -- we *really* don't want to see a REINDEX fail, but that seems like a possibility that we need to talk about now. We've documented that there is this 1/3 of a page restriction, but we specify that it's "approximate", so I don't think our documentation as written would need to be changed (though a release note item would certainly be in order). Also, since the vagaries of how well TOAST can compress are very hard to understand, I have to imagine that almost nobody relies on having the last 8 bytes -- and if they do, then they're playing with fire.
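To make those numbers concrete, here's a minimal standalone sketch of the arithmetic (an illustration, not the patch itself). The constants are the stock values for 8KB pages with 8-byte MAXALIGN, restated locally so the program compiles on its own, and the proposed v4 limit is taken to be the existing "three tuples per page" limit minus MAXALIGN(sizeof(ItemPointerData)):

/*
 * Illustrative arithmetic only, not the patch: stock values for 8KB
 * pages with 8-byte MAXALIGN, restated as local constants so that
 * this compiles standalone.
 */
#include <stdio.h>

#define BLCKSZ                  8192    /* default page size */
#define MAXIMUM_ALIGNOF         8       /* typical 64-bit platform */
#define MAXALIGN(LEN) \
    (((size_t) (LEN) + MAXIMUM_ALIGNOF - 1) & ~((size_t) (MAXIMUM_ALIGNOF - 1)))
#define MAXALIGN_DOWN(LEN) \
    ((size_t) (LEN) & ~((size_t) (MAXIMUM_ALIGNOF - 1)))

#define SIZE_OF_PAGE_HEADER     24      /* SizeOfPageHeaderData */
#define SIZE_OF_ITEM_ID         4       /* sizeof(ItemIdData) */
#define SIZE_OF_BT_OPAQUE       16      /* sizeof(BTPageOpaqueData) */
#define SIZE_OF_ITEM_POINTER    6       /* sizeof(ItemPointerData) */

int
main(void)
{
    /* Existing BTMaxItemSize() logic: a page must always fit 3 tuples */
    size_t      internal_limit =
        MAXALIGN_DOWN((BLCKSZ -
                       MAXALIGN(SIZE_OF_PAGE_HEADER + 3 * SIZE_OF_ITEM_ID) -
                       MAXALIGN(SIZE_OF_BT_OPAQUE)) / 3);

    /* Heap TID attribute appended to a new pivot is MAXALIGN()'d */
    size_t      tid_overhead = MAXALIGN(SIZE_OF_ITEM_POINTER);

    printf("internal limit:       %zu\n", internal_limit);      /* 2712 */
    printf("heap TID overhead:    %zu\n", tid_overhead);        /* 8 */
    printf("proposed v4 maximum:  %zu\n",
           internal_limit - tid_overhead);                      /* 2704 */
    /* A pivot made from a tuple at the old 2712-byte limit */
    printf("old max + heap TID:   %zu\n",
           internal_limit + tid_overhead);                      /* 2720 */
    return 0;
}

On these assumptions the program prints 2712, 8, 2704, and 2720, matching the limits discussed above.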
Not being able to REINDEX is still really awful, however unrealistic the scenario may be. I can imagine ways of avoiding it, but I'd really rather not go there, since the cure is worse than the disease. I'll still throw out some ideas for a cure, though:

* Maybe we could rely on the fact that internal pages are the only pages that can contain more than one maybe-extra-sized 2720 byte pivot tuple, since in practice they also always have one "negative infinity" item. Those are always 16 bytes, so it works out. We've done something a bit like this before [3], though I can see big downsides to going further with it. It seems pretty grotty to me. I actually would like to *not* truncate outside of the leftmost page on the level, so that I can use the not-actually-negative-infinity first item in internal pages as a "low key". That would make low-overhead prefix compression within internal pages work. (This is not to be confused with leaf-level prefix compression, which the DBA would have to enable.)

* Alternatively, we could lift the general restriction on out-of-line storage within indexes, but I think that would probably break certain things in a way that would go undetected for a long time. It would break amcheck's heapallindexed check, for one thing, and it might have implications for buffer locking protocols. In general, out-of-line storage within indexes seems to work against what an index is supposed to do. And it's a lot of effort for a small to nonexistent reward.

Does it seem like I have the right general idea here?

[1] https://commitfest.postgresql.org/19/1787/
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/README;h=3680e69b89a8458d58f6c3361eb612788fa33786;hb=df8b5f3eb8a7c477156d0ad9d83e7297912cfe79#l612
[3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/include/access/itup.h;h=bd3a702380954bc69c9536876094249430cb7c72;hb=df8b5f3eb8a7c477156d0ad9d83e7297912cfe79#l139

--
Peter Geoghegan