Re: Making all nbtree entries unique by having heap TIDs participatein comparisons - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Making all nbtree entries unique by having heap TIDs participatein comparisons |
Date | |
Msg-id | CAH2-WzmmoLNQOj9mAD78iQHfWLJDszHEDrAzGTUMG3mVh5xWPw@mail.gmail.com Whole thread Raw |
In response to | Re: Making all nbtree entries unique by having heap TIDs participatein comparisons (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons |
List | pgsql-hackers |
Attached is v4. I have two goals in mind for this revision, goals that are of great significance to the project as a whole: * Making better choices around leaf page split points, in order to maximize suffix truncation and thereby maximize fan-out. This is important when there are mostly-distinct index tuples on each leaf page (i.e. most of the time). Maximizing the effectiveness of suffix truncation needs to be weighed against the existing/main consideration: evenly distributing space among each half of a page split. This is tricky. * Not regressing the logic that lets us pack leaf pages full when there are a great many logical duplicates. That is, I still want to get the behavior I described on the '"Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes' thread [1]. This is not something that happens as a consequence of thinking about suffix truncation specifically, and seems like a fairly distinct thing to me. It's actually a bit similar to the rightmost 90/10 page split case. v4 adds significant new logic to make us do better on the first goal, without hurting the second goal. It's easy to regress one while focussing on the other, so I've leaned on a custom test suite throughout development. Previous versions mostly got the first goal wrong, but got the second goal right. For the time being, I'm focussing on index size, on the assumption that I'll be able to demonstrate a nice improvement in throughput or latency later. I can get the main TPC-C order_line pkey about 7% smaller after an initial bulk load with the new logic added to get the first goal (note that the benefits with a fresh CREATE INDEX are close to zero). The index is significantly smaller, even though the internal page index tuples can themselves never be any smaller due to alignment -- this is all about not restricting what can go on each leaf page by too much. 7% is not as dramatic as the "get tired" case, which saw something like a 50% decrease in bloat for one pathological case, but it's still clearly well worth having. The order_line primary key is the largest TPC-C index, and I'm merely doing a standard bulk load to get this 7% shrinkage. The TPC-C order_line primary key happens to be kind of adversarial or pathological to B-Tree space management in general, but it's still fairly realistic. For the first goal, page splits now weigh what I've called the "distance" between tuples, with a view to getting the most discriminating split point -- the leaf split point that maximizes the effectiveness of suffix truncation, within a range of acceptable split points (acceptable from the point of view of not implying a lopsided page split). This is based on probing IndexTuple contents naively when deciding on a split point, without regard for the underlying opclass/types. We mostly just use char integer comparisons to probe, on the assumption that that's a good enough proxy for using real insertion scankey comparisons (only actual truncation goes to those lengths, since that's a strict matter of correctness). This distance business might be considered a bit iffy by some, so I want to get early feedback. This new "distance" code clearly needs more work, but I felt that I'd gone too long without posting a new version. For the second goal, I've added a new macro that can be enabled for debugging purposes. This has the implementation sort heap TIDs in ASC order, rather than DESC order. This nicely demonstrates how my two goals for v4 are fairly independent; uncommenting "#define BTREE_ASC_HEAP_TID" will cause a huge regression with cases where many duplicates are inserted, but won't regress things like the TPC-C indexes. (Note that BTREE_ASC_HEAP_TID will break the regression tests, though in a benign way can safely be ignored.) Open items: * Do more traditional benchmarking. * Add pg_upgrade support. * Simplify _bt_findsplitloc() logic. [1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com -- Peter Geoghegan
Attachment
pgsql-hackers by date: