Re: [PROPOSAL] Effective storage of duplicates in B-tree index. - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: [PROPOSAL] Effective storage of duplicates in B-tree index.
Msg-id: CAM3SWZTQrSdzB_D_hEpuZyObNmr1VXjjYmQz_Jz0bLrgniRevA@mail.gmail.com
In response to: [PROPOSAL] Effective storage of duplicates in B-tree index. (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>)
Responses: Re: [PROPOSAL] Effective storage of duplicates in B-tree index.
List: pgsql-hackers
On Mon, Aug 31, 2015 at 12:41 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:

> Now new B-tree index tuple must be inserted for each table row that we
> index.
> It can possibly cause page split. Because of MVCC even unique index could
> contain duplicates.
> Storing duplicates in posting list/tree helps to avoid superfluous splits.

I'm glad someone is thinking about this, because it is certainly needed. I thought about working on it myself, but there is always something else to do. I should be able to assist with review, though.

> So it seems to be very useful improvement. Of course it requires a lot of
> changes in B-tree implementation, so I need approval from community.
>
> 1. Compatibility.
> It's important to save compatibility with older index versions.
> I'm going to change BTREE_VERSION to 3.
> And use new (posting) features for v3, saving old implementation for v2.
> Any objections?

It might be better to just have a flag bit for pages that are compressed -- there are IIRC 8 free bits in the B-Tree page special area flags variable. But no real opinion on this from me, yet. You have plenty of bitspace to work with to mark B-Tree pages, in any case.

> 2. There are several tricks to handle non-unique keys in B-tree.
> More info in btree readme (chapter - Differences to the Lehman & Yao
> algorithm).
> In the new version they'll become useless. Am I right?

I think that the L&Y algorithm makes assumptions for the sake of simplicity, rather than because they really believed that there were real problems. For example, they say that deletion can occur offline or something along those lines, even though that's clearly impractical. They say that because they didn't want to write a paper about deletion within B-Trees, I suppose. See also my opinion of how they claim to not need read locks [1]. Also, note that despite the fact that the GIN README mentions "Lehman & Yao style right links", it doesn't actually do the L&Y trick of avoiding lock coupling -- the whole point of L&Y -- so that remark is misleading. This must be why B-Tree has much better concurrency than GIN in practice.

Anyway, the way that I always imagined this would work is a layer "below" the current implementation. In other words, you could easily have prefix compression with a prefix that could end at a point within a reference IndexTuple. It could be any arbitrary point in the second or subsequent attribute, and would not "care" about the structure of the IndexTuple when it comes to where attributes begin and end, etc (although, in reality, it probably would end up caring, because of the complexity -- not caring is the ideal only, at least to me). As Alexander pointed out, GIN does not care about composite keys. That seems quite different from a GIN posting list (something that I know way less about, FYI). So I'm really talking about a slightly different thing -- prefix compression, rather than handling duplicates. Whether or not you should do prefix compression instead of deduplication is certainly not clear to me, but it should be considered.

Also, I always imagined that prefix compression would use the highkey as the thing that is offset for each "real" IndexTuple, because it's there anyway, and that's simple. However, I suppose that that means that duplicate handling can't really work in a way that makes duplicates have a fixed cost, which may be a particularly important property to you.
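To make the highkey-relative idea a bit more concrete, here is a minimal sketch of one possible on-page layout. All of the names are made up for illustration -- nothing like them exists in the nbtree code today -- and it glosses over alignment, varlena headers, and the composite-key questions above:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /*
     * Hypothetical layout for a leaf item compressed against the page
     * highkey: record how many leading key bytes are shared with the
     * highkey, and store only the remaining suffix bytes on the page.
     */
    typedef struct PrefixCompressedItem
    {
        uint16_t    sharedlen;  /* leading bytes shared with the highkey */
        uint16_t    suffixlen;  /* suffix bytes stored after this header */
        /* suffix bytes follow immediately */
    } PrefixCompressedItem;

    /*
     * Rebuild the full key image by copying the shared prefix out of the
     * highkey and appending the stored suffix.  Caller frees the result.
     */
    static char *
    rebuild_key(const char *highkey, const PrefixCompressedItem *item)
    {
        const char *suffix = (const char *) (item + 1);
        char       *key = malloc(item->sharedlen + item->suffixlen);

        memcpy(key, highkey, item->sharedlen);
        memcpy(key + item->sharedlen, suffix, item->suffixlen);
        return key;
    }

Note that in a scheme like this the saving per item depends on how much of each key happens to match the highkey, which is exactly why duplicates would not get the fixed, predictable cost that a dedicated posting list gives them.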
> 3. Microvacuum.
> Killed items are marked LP_DEAD and could be deleted from separate page at
> time of insertion.
> Now it's fine, because each item corresponds with separate TID. But posting
> list implementation requires another way. I've got two ideas:
> First is to mark LP_DEAD only those tuples where all TIDs are not visible.
> Second is to add LP_DEAD flag to each TID in posting list(tree). This way
> requires a bit more space, but allows to do microvacuum of posting
> list/tree.

No real opinion on this point, except that I agree that doing something is necessary.

Couple of further thoughts on this general topic:

* Currently, B-Tree must be able to store at least 3 items on each page, for the benefit of the L&Y algorithm. You need room for 1 "highkey", plus 2 downlink IndexTuples. Obviously an internal B-Tree page is redundant if you cannot get to any child page based on the scanKey value differing one way or the other (so 2 downlinks are a sensible minimum), plus a highkey is usually needed (just not on the rightmost page). As you probably know, we enforce this by making sure every IndexTuple is no more than 1/3 of the space that will fit on a page. You should start thinking about how to deal with this in a world where the physical size could actually be quite variable. The solution is probably to simply pretend that every IndexTuple is its original size. This applies to both prefix compression and duplicate suppression, I suppose.

* Since everything is aligned within B-Tree, it's probably worth considering the alignment boundaries when doing prefix compression, if you want to go that way. We can probably imagine a world where alignment is not required for B-Tree, which would work on x86 machines, but I can't see it happening soon. It isn't worth compressing unless it compresses enough to cross an "alignment boundary", where we're not actually obliged to store as much data on disk. (For example, with typical 8-byte alignment, shrinking an item from 20 bytes to 17 saves nothing, since both round up to 24 bytes; the on-disk footprint only drops once the item fits in 16 bytes.) This point may be obvious, not sure.

[1] http://www.postgresql.org/message-id/flat/CAM3SWZT-T9o_dchK8E4_YbKQ+LPJTpd89E6dtPwhXnBV_5NE3Q@mail.gmail.com#CAM3SWZT-T9o_dchK8E4_YbKQ+LPJTpd89E6dtPwhXnBV_5NE3Q@mail.gmail.com

--
Peter Geoghegan