Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. |
Date | |
Msg-id | CAH2-WznOg8C1XXnYO_pJVWm1r-WHuxLg96cJ5t7-gt6rPddGgA@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>) |
Responses |
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
|
List | pgsql-hackers |
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Now the algorithm is the following: > - In case page split is needed, pass both tuples to _bt_split(). > _bt_findsplitloc() is now aware of upcoming replacement of origtup with > neworigtup, so it uses correct item size where needed. > > It seems that now all replace operations are crash-safe. The new patch passes > all regression tests, so I think it's ready for review again. I think that the way this works within nbtsplitloc.c is too complicated. In v5, the only thing that nbtsplitloc.c knew about deduplication was that it could be sure that suffix truncation would at least make a posting list into a single heap TID in the worst case. This consideration was mostly about suffix truncation, not deduplication, which seemed like a good thing to me. _bt_split() and _bt_findsplitloc() should know as little as possible about posting lists. Obviously it will sometimes be necessary to deal with the case where a posting list is about to become too big (i.e. it's about to go over BTMaxItemSize()), and so must be split. Less often, a page split will be needed because of one of these posting list splits. These are two complicated areas (posting list splits and page splits), and it would be a good idea to find a way to separate them as much as possible. Remember, nbtsplitloc.c works by pretending that the new item that cannot fit on the page is already on its own imaginary version of the page that *can* fit the new item, along with everything else from the original/actual page. That gets *way* too complicated when it has to deal with the fact that the new item is being merged with an existing item. Perhaps nbtsplitloc.c could also "pretend" that the new item is always a plain tuple, without knowing anything about posting lists. Almost like how it worked in v5. We always want posting lists to be as close to the BTMaxItemSize() size as possible, because that helps with space utilization. In v5 of the patch, this was what happened, because, in effect, we didn't try to do anything complicated with the new item. This worked well, apart from the crash safety issue. Maybe we can simulate the v5 approach, giving us the best of all worlds (good space utilization, simplicity, and crash safety). Something like this: * Posting list splits should always result in one posting list that is at or just under BTMaxItemSize() in size, plus one plain tuple to its immediate right on the page. This is similar to the more common case where we cannot add additional tuples to a posting list due to the BTMaxItemSize() restriction, and so end up with a single tuple (or a smaller posting list with the same value) to the right of a BTMaxItemSize()-sized posting list tuple. I don't see a reason to split a posting list in the middle -- we should always split to the right, leaving the posting list as large as possible. * When there is a simple posting list split, with no page split, the logic required is fairly straightforward: We rewrite the posting list in-place so that our new item goes wherever it belongs in the existing posting list on the page (we memmove() the posting list to make space for the new TID, basically). The old last/rightmost TID in the original posting list becomes a new, plain tuple. We may need a new WAL record for this, but it's not that different to a regular leaf page insert. * When this happens to result in a page split, we then have a "fake" new item -- the right half of the posting list that we split, which is always a plain item. Obviously we need to be a bit careful with the WAL logging, but the space accounting within _bt_split() and _bt_findsplitloc() can work just the same as now. nbtsplitloc.c can work like it did in v5, when the only thing it knew about posting lists was that _bt_truncate() always removes them, maybe leaving a single TID behind in the new high key. (Note also that it's not okay to remove the conservative assumption about at least having space for one heap TID within _bt_recsplitloc() -- that needs to be restored to its v5 state in the next version of the patch.) Because deduplication is lazy, there is little value in doing deduplication of the new item (which may or may not be the fake new item). The nbtsplitloc.c logic will "trap" duplicates on the same page today, so we can just let deduplication of the new item happen at a later time. _bt_split() can almost pretend that posting lists don't exist, and nbtsplitloc.c needs to know nothing about posting lists (apart from the way that _bt_truncate() behaves with posting lists). We "lie" to _bt_findsplitloc(), and tell it that the new item is our fake new item -- it doesn't do anything that will be broken by that lie, because it doesn't care about the actual content of posting lists. And, we can fix the "fake new item is not actually real new item" issue at one point within _bt_split(), just as we're about to WAL log. What do you think of that approach? -- Peter Geoghegan
pgsql-hackers by date: