Re: [WIP] Effective storage of duplicates in B-tree index. - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: [WIP] Effective storage of duplicates in B-tree index. |
Date | |
Msg-id | CAM3SWZS1TYq_AD=9JPMjyhUwLzPHrAkhsWC6r3BOQE++iTFd=Q@mail.gmail.com Whole thread Raw |
In response to | Re: [WIP] Effective storage of duplicates in B-tree index. (Heikki Linnakangas <hlinnaka@iki.fi>) |
List | pgsql-hackers |
On Mon, Jul 4, 2016 at 2:30 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I think we should pack the TIDs more tightly, like GIN does with the varbyte > encoding. It's tempting to commit this without it for now, and add the > compression later, but I'd like to avoid having to deal with multiple > binary-format upgrades, so let's figure out the final on-disk format that we > want, right from the beginning. While the idea of duplicate storage is pretty obviously compelling, there could be other, non-obvious benefits. I think that it could bring further benefits if we could use duplicate storage to change this property of nbtree (this is from the README): """ Lehman and Yao assume that the key range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent page. This does not work for nonunique keys (for example, if we have enough equal keys to spread across several leaf pages, there *must* be some equal bounding keys in the first level up). Therefore we assume Ki <= v <= Ki+1 instead. A search that finds exact equality to a bounding key in an upper tree level must descend to the left of that key to ensure it finds any equal keys in the preceding page. An insertion that sees the high key of its target page is equal to the key to be inserted has a choice whether or not to move right, since the new key could go on either page. (Currently, we try to find a page where there is room for the new key without a split.) """ If we could *guarantee* that all keys in the index are unique, then we could maintain the keyspace as L&Y originally described. The practical benefits to this would be: * We wouldn't need to take the extra step described above -- finding a bounding key/separator key that's fully equal to our scankey would no longer necessitate a probably-useless descent to the left of that key. (BTW, I wonder if we could get away with not inserting a downlink into parent when a leaf page split finds an identical IndexTuple in parent, *without* changing the keyspace invariant I mention -- if we're always going to go to the left of an equal-to-scankey key in an internal page, why even have more than one?) * This would make suffix truncation of internal index tuples easier, and that's important. The traditional reason why suffix truncation is important is that it can keep the tree a lot shorter than it would otherwise be. These days, that might not seem that important, because even if you have twice the number of internal pages than strictly necessary, that still isn't that many relative to typical main memory size (and even CPU cache sizes, perhaps). The reason I think it's important these days is that not having suffix truncation makes our "separator keys" overly prescriptive about what part of the keyspace is owned by each internal page. With a pristine index (following REINDEX), this doesn't matter much. But, I think that we get much bigger problems with index bloat due to the poor fan-out that we sometimes see due to not having suffix truncation, *combined* with the page deletion algorithms restriction on deleting internal pages (it can only be done for internal pages with *no* children). Adding another level or two to the B-Tree makes it so that your workload's "sparse deletion patterns" really don't need to be that sparse in order to bloat the B-Tree badly, necessitating a REINDEX to get back to acceptable performance (VACUUM won't do it). To avoid this, we should make the internal pages represent the key space in the least restrictive way possible, by applying suffix truncation so that it's much more likely that things will *stay* balanced as churn occurs. This is probably a really bad problem with things like composite indexes over text columns, or indexes with many NULL values. -- Peter Geoghegan
pgsql-hackers by date: