Thread: [WIP] [B-Tree] Keep indexes sorted by heap physical location
The attached patch tries to maintain the initial status of B-Tree indexes, which are created with equal-key runs in physical order, during the whole life of the B-Tree, and make key-tid pairs efficiently searchable in the process.

The patch doesn't add an interface to perform key-tid searches, nor does it address all potential issues that could arise from the implementation, or backward compatibility with earlier-format indexes, but it passes all tests, and seems to work well with the workloads I have tried until now, which include:

- dump/restore/compare (with lots of indexes, both unique and non)
- sanity check of index lookups
- running a complex application on top of it, no crashes or corruption (detected)

Recovery hasn't been tested at all, and I'd expect it to have issues - though I have tried to make the necessary changes, I may have missed more than one. Nor have I tested high-concurrency workloads (although the application mentioned does produce some of it), so this is an early WIP in search of feedback on the overall approach. Still, it should be enough to measure the merits of the idea, namely index size, performance and code impact.

A couple of points make me uneasy about this patch, yet I can think of no better alternative, so I seek feedback:

- introducing a different format for inner index tuples makes for an invasive patch and quite difficult-to-reason code (it's easy to forget whether a page is leaf or inner and that now causes assertion failures or segfaults)
- the ascent-descent to check for uniqueness when there are large dead tuple runs could have locking subtleties that escape me. Perhaps there's better ways to solve this.
- the patch changes the output of regression tests in nondeterministic ways, making index scan results always follow physical order. While that's not only not incorrect but desirable, lots of tests depended on the almost deterministic order that resulted from the way a suitable insertion point was selected before. The first patch addresses that, but at the expense of some "coverage" (ignoring some lines of reference output, the minimum I could ignore).
- while I don't see why it would cause more bloat under normal usage than the previous method, there are probably new/different pathological cases to account for, and maybe I'm missing one that is a common pattern
- it's btree, it doesn't get any more critical than that, it really needs a heapload of testing before going in, and some of it (crash recovery) may be tough to get right

On the performance side, I don't see a significant impact. If anything, it has the potential to speed things up, since scans over long runs of equal keys will now be done in physical order, and insertions will be evenly spread around the index leaf pages (due to the spread of heap tuples), but I haven't done any benchmarks beyond pgbench to back that up.

It is also an important enabling feature for making vacuum more efficient, and for implementing WARM tuples in a more general way, so that has to be accounted for when evaluating the potential benefit of this patch.

So, the numbers.
I measured the size impact on a dump of a test database, and it seems under the noise, which is what I expected since it only affects non-leaf pages:

patched db: 23G
  hourly_full: 562M + 790M (118M + 118M + 554M)
  bid_logs:    591M + 268M (134M + 134M)
  uas:         285M + 496M (20M + 138M + 159M + 20M + 159M)

unpatched db: 23G
  hourly_full: 562M + 789M (118M + 118M + 553M)
  bid_logs:    591M + 268M (134M + 134M)
  uas:         285M + 496M (20M + 138M + 159M + 20M + 159M)

Here, numbers are HEAP + INDEXES (i1 + i2 + i3 ...), for tables that are a mix of wide and narrow, with wide indexed columns (varchar) and narrow ones (int). Only a few indexes seem to have grown, but this is before any bloat accumulates (I haven't devised a test yet that produces the same amount of bloat on both databases to make them comparable).

pgbench, in-memory, patched:

transaction type: <builtin: select only>
scaling factor: 100
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 13271118
latency average: 0.181 ms
tps = 44237.026822 (including connections establishing)
tps = 44237.227678 (excluding connections establishing)

pgbench, in-memory, unpatched:

transaction type: <builtin: select only>
scaling factor: 100
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 13336128
latency average: 0.180 ms
tps = 44453.718954 (including connections establishing)
tps = 44453.895436 (excluding connections establishing)

pgbench, bigger-than-RAM, patched:

transaction type: <builtin: select only>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 71194
latency average: 33.711 ms
tps = 237.269144 (including connections establishing)
tps = 237.270122 (excluding connections establishing)

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 26748
latency average: 89.726 ms
tps = 89.043800 (including connections establishing)
tps = 89.044218 (excluding connections establishing)

pgbench, bigger-than-RAM, unpatched:

transaction type: <builtin: select only>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 86495
latency average: 27.747 ms
tps = 288.256959 (including connections establishing)
tps = 288.258202 (excluding connections establishing)

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 27145
latency average: 88.414 ms
tps = 90.461600 (including connections establishing)
tps = 90.462105 (excluding connections establishing)

Which is expected, since pgbench exercises none of the potentially improved paths. I/O here (just a laptop) is horrible, so the difference is probably just noise. The drop in select-only seems strangely high; I think that's also noise, but I haven't run the tests again yet.
On Wed, Aug 17, 2016 at 10:54 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> The attached patch tries to maintain the initial status of B-Tree
> indexes, which are created with equal-key runs in physical order,
> during the whole life of the B-Tree, and make key-tid pairs
> efficiently searchable in the process.

I have two pretty important questions that don't seem to be addressed in your email.

1. How does it do that?

2. What's the benefit?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Aug 18, 2016 at 5:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 17, 2016 at 10:54 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> The attached patch tries to maintain the initial status of B-Tree
>> indexes, which are created with equal-key runs in physical order,
>> during the whole life of the B-Tree, and make key-tid pairs
>> efficiently searchable in the process.
>
> I have two pretty important questions that don't seem to be
> addressed in your email.

I believe I addressed that in the README, but if I wasn't clear, let me know and I'll expand there too. Summarizing here:

> 1. How does it do that?

It adds the block number and the offset number of the index tuple's t_tid (that is, the pointer into the heap) as a final (implicit) sort key, as is done already in tuplesort. It just keeps doing that while comparing keys in the B-Tree when searching the leaf page for insertion, so the insertion point now points to the exact point where the insertion has to happen to maintain that condition, and not just to the first valid position within the same-key run.

In fact, that's why non-leaf index tuples need a different format, because while leaf index tuples contain the heap pointer already, non-leaf ones contain only the downlink, not the pointer into the heap. To be able to do comparisons and pick the right downlink, when splitting pages the original heap pointer in the leaf index tuple is copied into the downlink index tuple, in an additional IndexTupleData header that is prepended only to non-leaf index tuples. This representation is chosen to minimize code changes, such that (itup+1) is usable as before, and also because it's reasonably compact. But it *is* necessary to check whether it's a leaf or non-leaf tuple to choose whether to use (itup+1) or just itup, which is why the patch requires so many changes in so many places.

> 2. What's the benefit?

The idea came up in the discussion about WARM updates. Basically, in various situations it is convenient to be able to find the index tuples that point to a specific heap tuple. Without this, doing so isn't efficient - the only way to find such index tuples is to find the leftmost index tuple that has the same key, and walk the index right links until the right tid is found. For non-unique indexes, but also for unique indexes with heavy bloat, this could involve a large amount of I/O, so it isn't efficient in several contexts. Think vacuum and WARM updates mostly (like HOT updates, but where only some indexes don't need updating, and some do).

Having the ability to find an index tuple by key-ctid pair is thus beneficial because it allows efficient implementations of those operations. It may benefit other parts of the code, but those are the ones that come to mind.

It also makes index scans of the form

SELECT * FROM t WHERE some_col = some_const;

scan the heap in sequential order, even if some_col has low cardinality (ie: the query returns lots of tuples), which is a nice performance side effect.
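[To illustrate the comparison rule being described, here is a minimal sketch (an editorial illustration, not code from the patch) of a key comparison with the heap-TID tie-break; keycmp is assumed to already hold the result of the regular scankey comparison:]

    #include "postgres.h"
    #include "access/itup.h"
    #include "storage/itemptr.h"

    /*
     * Sketch only: break key ties with the heap pointer -- block number
     * first, then offset number -- the same order tuplesort already uses
     * for equal keys during CREATE INDEX.  Assumes itup points at the
     * start of the on-page item, so that for non-leaf tuples t_tid is the
     * heap pointer stored in the prepended IndexTupleData header.
     */
    static int32
    bt_compare_with_tid(int32 keycmp, ItemPointer instid, IndexTuple itup)
    {
        if (keycmp != 0)
            return keycmp;          /* keys differ: the TID never matters */

        /* equal keys: the heap TID acts as the final, implicit sort key */
        return ItemPointerCompare(instid, &itup->t_tid);
    }

[A nice property of the prepended-header layout, if it works as described, is that the heap pointer then sits at the same offset for leaf and non-leaf tuples, so a tie-break like this doesn't need to know which kind it was handed.]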
On Thu, Aug 18, 2016 at 3:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> It also makes index scans of the form
>
> SELECT * FROM t WHERE some_col = some_const;
>
> scan the heap in sequential order, even if some_col has low
> cardinality (ie: the query returns lots of tuples), which is a nice
> performance side effect.

Speaking of performance side effects, does this avoid O(N^2) performance on index tuple insertion with duplicate values, for all insertion orderings? For example, does it descend directly to the right leaf page for the insert, rather than starting at the front of the block of duplicate values and scanning to the right for a block with space, with a random chance to split a full block on each page it moves through?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Aug 18, 2016 at 6:04 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
> On Thu, Aug 18, 2016 at 3:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>
>> It also makes index scans of the form
>>
>> SELECT * FROM t WHERE some_col = some_const;
>>
>> scan the heap in sequential order, even if some_col has low
>> cardinality (ie: the query returns lots of tuples), which is a nice
>> performance side effect.
>
> Speaking of performance side effects, does this avoid O(N^2)
> performance on index tuple insertion with duplicate values, for all
> insertion orderings? For example, does it descend directly to the
> right leaf page for the insert, rather than starting at the front of
> the block of duplicate values and scanning to the right for a
> block with space, with a random chance to split a full block on
> each page it moves through?

Yes, but only on non-unique indexes. Unique indexes still need to scan all duplicates to check visibility and will become O(N^2) there.

The code with the random chance to split is still there, but it should never have a chance to run, because the comparison against the heap tuple pointer would stop it very quickly (before walking a single offset, I believe). I thought about cleaning that up, but to make review easier I thought I'd leave it there for now. Removing it would add a lot of diff noise.
On Thu, Aug 18, 2016 at 1:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> In fact, that's why non-leaf index tuples need a different format,
> because while leaf index tuples contain the heap pointer already,
> non-leaf ones contain only the downlink, not the pointer into the
> heap. To be able to do comparisons and pick the right downlink, when
> splitting pages the original heap pointer in the leaf index tuple is
> copied into the downlink index tuple, in an additional IndexTupleData
> header that is prepended only to non-leaf index tuples.

I think that this is a bad idea. We need to implement suffix truncation of internal page index tuples at some point, to make them contain less information from the original leaf page index tuple. That's an important optimization, because it increases fan-in. This seems like a move in the opposite direction.

ISTM that the way to address this problem is with a duplicate list and/or prefix compression in leaf pages.

--
Peter Geoghegan
Claudio Freire wrote:
> Unique indexes still need to scan all duplicates to check visibility
> and will become O(N^2) there.

That scenario doesn't matter, because on unique indexes there aren't many duplicate values anyway -- only one can be a live tuple.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 18, 2016 at 6:15 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Aug 18, 2016 at 1:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> In fact, that's why non-leaf index tuples need a different format,
>> because while leaf index tuples contain the heap pointer already,
>> non-leaf ones contain only the downlink, not the pointer into the
>> heap. To be able to do comparisons and pick the right downlink, when
>> splitting pages the original heap pointer in the leaf index tuple is
>> copied into the downlink index tuple, in an additional IndexTupleData
>> header that is prepended only to non-leaf index tuples.
>
> I think that this is a bad idea. We need to implement suffix
> truncation of internal page index tuples at some point, to make them
> contain less information from the original leaf page index tuple.
> That's an important optimization, because it increases fan-in. This
> seems like a move in the opposite direction.

I see that. I could try to measure average depth to measure the impact this had on fan-in.

While it should cut it in half for narrow indexes, half of very high is still high. Wide indexes, which are the ones that would suffer from poor fan-in, would feel this far less, since the overhead is constant.

Even if it does have an impact, I don't see an alternative without also implementing suffix truncation. Perhaps I could try to avoid adding the leaf tid header if it isn't necessary, though I would have to use up the last available flag bit in t_info for that.

> ISTM that the way to address this problem is with a duplicate list
> and/or prefix compression in leaf pages.

Prefix compression is another one I will be looking into eventually, but last time I tried it was far more invasive, so I abandoned it until I could find a less invasive way to do it.
Claudio Freire <klaussfreire@gmail.com> writes:
> On Thu, Aug 18, 2016 at 6:04 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
>> Speaking of performance side effects, does this avoid O(N^2)
>> performance on index tuple insertion with duplicate values, for all
>> insertion orderings? For example, does it descend directly to the
>> right leaf page for the insert, rather than starting at the front of
>> the block of duplicate values and scanning to the right for a
>> block with space, with a random chance to split a full block on
>> each page it moves through?

> Yes, but only on non-unique indexes.

How's that work if the existing entries aren't in TID order (which they will not be, in a pre-existing index)? Or are you assuming you can blow off on-disk compatibility of indexes?

			regards, tom lane
On Thu, Aug 18, 2016 at 6:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Claudio Freire <klaussfreire@gmail.com> writes:
>> Yes, but only on non-unique indexes.
>
> How's that work if the existing entries aren't in TID order (which they
> will not be, in a pre-existing index)? Or are you assuming you can blow
> off on-disk compatibility of indexes?
>
> regards, tom lane

This patch does blow off on-disk compatibility, but the plan is to re-add it later on. A flag in the meta page would indicate whether it's a sorted index or not, and the only way to get a sorted index would be with a reindex. The code would have to support both for a while, until whatever point we'd figure we can drop support for old-format indexes.

Since this is a sort order change, I see no alternative: either the whole index is sorted by TID or not.
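[For concreteness, that kind of compatibility gating could look something like the sketch below -- entirely an editorial illustration: BTMetaPageData has no btm_flags field today, so both the field and the flag bit are assumptions:]

    /*
     * Hypothetical metapage flag distinguishing TID-sorted indexes from
     * old-format ones; only REINDEX would ever set it.
     */
    #define BTM_SORTED_BY_TID   (1 << 0)    /* assumed new flag bit */

    static bool
    bt_index_is_tid_sorted(Relation rel)
    {
        Buffer      metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
        Page        metapg = BufferGetPage(metabuf);
        BTMetaPageData *metad = BTPageGetMeta(metapg);
        /* btm_flags is an assumed addition to BTMetaPageData */
        bool        sorted = (metad->btm_flags & BTM_SORTED_BY_TID) != 0;

        _bt_relbuf(rel, metabuf);
        return sorted;
    }

[Insertion and search paths would then branch on this once per descent, falling back to the old comparison rules for old-format indexes.]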
On Thu, Aug 18, 2016 at 2:26 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> I see that. I could try to measure average depth to measure the impact
> this had on fan-in.
>
> While it should cut it in half for narrow indexes, half of very high
> is still high. Wide indexes, which are the ones that would suffer
> from poor fan-in, would feel this far less, since the overhead is
> constant.

You'll also have to consider the impact on the choice of split point, FWIW. There is currently leeway about what page an index tuple can land on, if and when it happens to be equal to the high-key. I'm not sure how important that is, but it's a consideration.

> Even if it does have an impact, I don't see an alternative without
> also implementing suffix truncation. Perhaps I could try to avoid
> adding the leaf tid header if it isn't necessary, though I would have
> to use up the last available flag bit in t_info for that.

To be clear, I'm particularly worried about painting ourselves into a corner, suffix-truncation-wise. It's a very common optimization that we should really have.

>> ISTM that the way to address this problem is with a duplicate list
>> and/or prefix compression in leaf pages.
>
> Prefix compression is another one I will be looking into eventually,
> but last time I tried it was far more invasive, so I abandoned it until
> I could find a less invasive way to do it.

Unfortunately, sometimes it is necessary to be very ambitious in order to solve a problem. The understandable and usually well-reasoned approach of making progress as incremental as possible occasionally works against contributors. It's worth considering if this is such a case.

--
Peter Geoghegan
On Thu, Aug 18, 2016 at 6:38 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Aug 18, 2016 at 2:26 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> I see that. I could try to measure average depth to measure the impact
>> this had on fan-in.
>>
>> While it should cut it in half for narrow indexes, half of very high
>> is still high. Wide indexes, which are the ones that would suffer
>> from poor fan-in, would feel this far less, since the overhead is
>> constant.
>
> You'll also have to consider the impact on the choice of split point,
> FWIW. There is currently leeway about what page an index tuple can
> land on, if and when it happens to be equal to the high-key. I'm not
> sure how important that is, but it's a consideration.

When I read the code, it seemed generic enough, using ItemIdGetLength (which works on non-leaf index tuples correctly, reporting the right size), so it should still work. But the choice of split point may make a difference for future insertions, so I'll look into that.

>> Even if it does have an impact, I don't see an alternative without
>> also implementing suffix truncation. Perhaps I could try to avoid
>> adding the leaf tid header if it isn't necessary, though I would have
>> to use up the last available flag bit in t_info for that.
>
> To be clear, I'm particularly worried about painting ourselves into a
> corner, suffix-truncation-wise. It's a very common optimization that
> we should really have.

Well, page version numbers could be used to switch between semantically similar page formats later on. So if another format is necessary (say, prefix compression, suffix truncation, etc), it can be changed on-the-fly on existing indexes by checking page version numbers and doing the proper switch. So we won't be painting ourselves into a corner, but picking the wrong format may indeed be a headache.

I can go the way of the flag in t_info if that's preferable. Both would work. It's just that it's the last flag available, and that would be painting ourselves into a corner regarding flags. To avoid that, the flag itself could be "has extra data" (instead of "has leaf tid"), with a whole set of flags in the extra data. That could also help for prefix compression and suffix truncation (a sketch of the idea follows below).

>>> ISTM that the way to address this problem is with a duplicate list
>>> and/or prefix compression in leaf pages.
>>
>> Prefix compression is another one I will be looking into eventually,
>> but last time I tried it was far more invasive, so I abandoned it until
>> I could find a less invasive way to do it.
>
> Unfortunately, sometimes it is necessary to be very ambitious in order
> to solve a problem. The understandable and usually well-reasoned
> approach of making progress as incremental as possible occasionally
> works against contributors. It's worth considering if this is such a
> case.

I'd agree if this regressed performance considerably without the other improvements, so I guess this will depend on the fan-in measurements.
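[To make the "has extra data" idea concrete, a rough sketch -- an editorial illustration; the mask value, struct, and flag names are all assumptions, not the patch's:]

    /*
     * Sketch: spend the one remaining t_info bit on "has extra data", and
     * keep an extensible flag word in the extra data itself, so future
     * features don't need more t_info bits.
     */
    #define INDEX_EXTRA_MASK    0x2000      /* the last free bit in t_info */

    typedef struct BTExtraData
    {
        uint16      btx_flags;              /* BTX_* flags below */
        /* optional payloads (heap TID, etc.) follow, per btx_flags */
    } BTExtraData;

    #define BTX_HAS_HEAP_TID    0x0001      /* a heap TID follows */
    #define BTX_SUFFIX_TRUNC    0x0002      /* key is suffix-truncated */
    #define BTX_PREFIX_COMP     0x0004      /* key is prefix-compressed */

    #define IndexTupleHasExtra(itup) \
        (((itup)->t_info & INDEX_EXTRA_MASK) != 0)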
On Thu, Aug 18, 2016 at 5:04 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> But the choice of split point may make a difference for future
> insertions, so I'll look into that.

A database product I worked on a long time ago had an interesting optimization for indexes that had multiple insertion points, each of which was increasing. (Picture, for example, 100 case types, with case numbers inserted in sequence within each case type -- so the first three felony cases would be CF000001, CF000002, and CF000003; while the first three small claims cases would be SC000001, SC000002, and SC000003.) That's not a terribly rare usage pattern, in my experience. An index on such data is most efficient if you split at the point of the index tuple being inserted -- making it the last tuple on the left-hand page. I don't know whether we might want to consider making that an option somehow -- it just came to mind because of this discussion.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
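[A minimal sketch of the heuristic being described -- an editorial illustration only; the real _bt_findsplitloc() interface is more involved, and no such option exists:]

    /*
     * Sketch: when the incoming tuple extends a strictly ascending run,
     * split exactly at the insertion point, leaving the new tuple as the
     * last item on the left page.  (Rightmost-page splits already lean
     * right in a similar spirit via fillfactor.)
     */
    static OffsetNumber
    choose_split_offset(Page page, OffsetNumber newitemoff, bool ascending_run)
    {
        if (ascending_run)
            return newitemoff;      /* new tuple ends up last on the left */

        /* otherwise fall back to a middle split */
        return PageGetMaxOffsetNumber(page) / 2;
    }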
On Thu, Aug 18, 2016 at 6:26 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, Aug 18, 2016 at 6:15 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Thu, Aug 18, 2016 at 1:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> In fact, that's why non-leaf index tuples need a different format,
>>> because while leaf index tuples contain the heap pointer already,
>>> non-leaf ones contain only the downlink, not the pointer into the
>>> heap. To be able to do comparisons and pick the right downlink, when
>>> splitting pages the original heap pointer in the leaf index tuple is
>>> copied into the downlink index tuple, in an additional IndexTupleData
>>> header that is prepended only to non-leaf index tuples.
>>
>> I think that this is a bad idea. We need to implement suffix
>> truncation of internal page index tuples at some point, to make them
>> contain less information from the original leaf page index tuple.
>> That's an important optimization, because it increases fan-in. This
>> seems like a move in the opposite direction.
>
> I see that. I could try to measure average depth to measure the impact
> this had on fan-in.
>
> While it should cut it in half for narrow indexes, half of very high
> is still high. Wide indexes, which are the ones that would suffer
> from poor fan-in, would feel this far less, since the overhead is
> constant.
>
> Even if it does have an impact, I don't see an alternative without
> also implementing suffix truncation. Perhaps I could try to avoid
> adding the leaf tid header if it isn't necessary, though I would have
> to use up the last available flag bit in t_info for that.

So, I did some tests. TL;DR version is, as expected, there is a big impact on narrow indexes, but I don't believe it's something to be worried about.

**Begin LONG version** (skip to end long version if not interested)

Regardless of the fanout (or fanin, if you look at it that way), what matters is tree depth, so I used pageinspect to check how the depth of indexes changed after patching. This is all on pristine, recently built indexes. I will try to measure the same on bloated indexes later on, but the tests are painfully slow.

I used the following query to inspect all user indexes, since pageinspect failed to work for toast indexes for some reason (it probably needed qualified relnames or something like that). Anyway, user indexes should be enough.

select level, count(*), pg_size_pretty(max(relpages) * 8192.0) as biggest
from pg_class, bt_metap(relname)
where relkind = 'i' and relnamespace = 2200 and relam = 403
group by level order by level desc;

I tested it on both patched and unpatched versions of my test database (roughly 23GB), pgbench scale 1000 (15GB) and pgbench scale 10000 (146GB). I believe pgbench is a worst-case example, because it only has very narrow PK indexes, which are the ones that will suffer the most from this patch.
The numbers:

mat, patched
level;count;biggest
3;18;"1369 MB"
2;60;"322 MB"
1;26;"808 kB"
0;50;"16 kB"

mat, unpatched
level;count;biggest
3;15;"1367 MB"
2;63;"334 MB"
1;26;"800 kB"
0;49;"16 kB"

pgbench, scale 1000, patched
level;count;biggest
3;1;"2145 MB"
1;2;"456 kB"

pgbench, scale 1000, unpatched
level;count;biggest
3;1;"2142 MB"
1;2;"456 kB"

pgbench, scale 10000, patched
level;count;biggest
3;1;"21 GB"
2;1;"2224 kB"
1;1;"240 kB"

pgbench, scale 10000, unpatched
level;count;biggest
3;1;"21 GB"
1;2;"2208 kB"

So, clearly some indexes are pushed to the next level, but it isn't a widespread effect - it looks more like the threshold index size for needing a root split is slightly pushed downwards. It totally agrees with the math. The index tuple header is 8 bytes, and the datums need to be maxaligned so they will also be 8 bytes at least. That means the B in B-tree gets cut to 2/3 at the worst (inner tuples grow by 50%), but only for inner tuples, and so you get that if

    a * log_(B)(N/B) = log_(B/(3/2))(N/B)

then for big enough N

    a < (3/2) * log_(B)(N) - 2

Which means the depth of huge B-trees is increased by 50%, but the effect only begins to be seen at level 2, which is what we have here, and level 2 for such a narrow index seems to be about 2GB. So we're talking about very big indexes already. If level 2 can hold 2GB of index, level 3 can hold around 2G x 8k / 16 = 1TB of index, and I have yet to see (even in very big databases) a 1TB index on a PK. If anyone has such an index, yes, this patch will hurt performance by 25% (4 page reads instead of 3). Bad, but since it only affects very extreme cases, it doesn't seem so bad.

**End LONG version**

So, that said, I started mocking up a version of the patch that uses a "has aux data" flag to signal another set of flags from which variable-size attributes can be read, which would enable implementation of suffix truncation and prefix compression in the future with minimal data format changes. I was surprised to find that it seems not only more compact, but cleaner. So I will probably go that way.
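[To make the fan-in arithmetic concrete, a rough worked example under the same assumptions -- 8kB pages, 8-byte maxaligned key, ignoring per-page and line-pointer overhead:]

    inner tuple, unpatched: 8B header + 8B key          = 16 B
    inner tuple, patched:   8B extra TID header + 16 B  = 24 B

    fan-in, unpatched: ~8192 / 16 = ~512 downlinks per inner page
    fan-in, patched:   ~8192 / 24 = ~341 downlinks per inner page (2/3)

    depth for N = 10^9 leaf tuples (order of magnitude):
      unpatched: log(10^9) / log(512) = ~3.3 levels
      patched:   log(10^9) / log(341) = ~3.6 levels

[Which is consistent with the observation above: the extra header mostly just lowers the index size at which one more level becomes necessary.]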
On Thu, Aug 18, 2016 at 8:24 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>
> A couple of points make me uneasy about this patch, yet I can think of
> no better alternative, so I seek feedback:
>
> - introducing a different format for inner index tuples makes for an
> invasive patch and quite difficult-to-reason code (it's easy to forget
> whether a page is leaf or inner and that now causes assertion failures
> or segfaults)
> - the ascent-descent to check for uniqueness when there are large
> dead tuple runs could have locking subtleties that escape me. Perhaps
> there's better ways to solve this.

I have tried to study this part of your patch and it seems a somewhat non-trivial and risky part of the proposal.

+	} else {
+		/*
+		 * We found the actual first item on another block, so we have to perform
+		 * a two-step search - first half, until our write-locked buffer, then another
+		 * starting from our write-locked buffer.
+		 */
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+		LockBuffer(buf, BT_WRITE);
+
+		buf = _bt_moveright(rel, buf, natts, itup_scankey, &(itup->t_tid), false,
+							true, stack, BT_WRITE, NULL);
+
+		first_offset = _bt_binsrch(rel, buf, natts, itup_scankey, NULL, false, NULL);
+
+		xwait = _bt_check_unique(rel, itup, heapRel, nbuf, buf, first_offset, itup_scankey,
+								 checkUnique, &is_unique, &speculativeToken);
+
+		_bt_relbuf(rel, nbuf);
+	}

The idea for uniqueness check is that we hold the write lock on buffer/page on which we are trying to operate (we do take read locks on the consecutive pages during this check). Here, in your changed proposal, you have two buffers (one to which the search led you and one buffer previous in the chain) and before checking uniqueness, on one of the buffers you seem to have write lock and on other you have read lock. This seems to change the way uniqueness check works till now; can you explain how this works (can we safely assume that all searches for uniqueness check will lead to the same buffer first)?

With this new mechanism, do we have two types of search interfaces, where one would work for keys (as now) and the other for key-ctid, or will it be just a single interface which works both ways? I think either way there are pros and cons.

Also, in the thread below you are talking about using the last bit in t_info; I want to bring it to your notice that there is a patch of mine [1] in which I have used it to avoid changing on-disk compatibility of hash indexes. I am not saying that we should not plan it for some other use, but just to keep in mind that there are other proposals to use it. We can decide what is the best way to proceed if we are aware of all the proposals that want to use it.

[1] - https://www.postgresql.org/message-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Aug 20, 2016 at 1:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Aug 18, 2016 at 8:24 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>
>> A couple of points make me uneasy about this patch, yet I can think of
>> no better alternative, so I seek feedback:
>>
>> - introducing a different format for inner index tuples makes for an
>> invasive patch and quite difficult-to-reason code (it's easy to forget
>> whether a page is leaf or inner and that now causes assertion failures
>> or segfaults)
>> - the ascent-descent to check for uniqueness when there are large
>> dead tuple runs could have locking subtleties that escape me. Perhaps
>> there's better ways to solve this.
>
> I have tried to study this part of your patch and it seems a somewhat
> non-trivial and risky part of the proposal.
>
> +	} else {
> +		/*
> +		 * We found the actual first item on another block, so we have to perform
> +		 * a two-step search - first half, until our write-locked buffer, then another
> +		 * starting from our write-locked buffer.
> +		 */
> +		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
> +		LockBuffer(buf, BT_WRITE);
> +
> +		buf = _bt_moveright(rel, buf, natts, itup_scankey, &(itup->t_tid), false,
> +							true, stack, BT_WRITE, NULL);
> +
> +		first_offset = _bt_binsrch(rel, buf, natts, itup_scankey, NULL, false, NULL);
> +
> +		xwait = _bt_check_unique(rel, itup, heapRel, nbuf, buf, first_offset, itup_scankey,
> +								 checkUnique, &is_unique, &speculativeToken);
> +
> +		_bt_relbuf(rel, nbuf);
> +	}
>
> The idea for uniqueness check is that we hold the write lock on
> buffer/page on which we are trying to operate (we do take read locks
> on the consecutive pages during this check). Here, in your changed
> proposal, you have two buffers (one to which the search led you and
> one buffer previous in the chain) and before checking uniqueness, on
> one of the buffers you seem to have write lock and on other you have
> read lock. This seems to change the way uniqueness check works till
> now; can you explain how this works (can we safely assume that all
> searches for uniqueness check will lead to the same buffer first)?

All uniqueness checks will find the same "nbuf" (read-locked), which is the first one that matches the key.

They could have different "buf" buffers (the write-locked one). But since read locks cannot be held while a write-lock is held, I believe it doesn't matter. All uniqueness checks will try to read-lock nbuf unless the uniqueness check for that insertion is done. When they do, they will block until the other insertion is finished, and when that happens, either the insert happened, or it returned an xid to wait for and retry. If it succeeded, the check attempting the read-lock will see the inserted tuple and return the xid of the earlier transaction and wait on it. If it failed, no insert happened because there's a visible tuple there, and the read-lock will succeed, find the visible tuple, and the insert will fail too.

In essence, it is still the case, I believe, that only one insert can succeed at a time, all the others will retry. It only happens that with the patch, the contention will be between a read lock and a write lock, and not two write locks, and there's a little less contention (some more work can be done in parallel).

> With this new mechanism, do we have two types of search interfaces,
> where one would work for keys (as now) and the other for key-ctid,
> or will it be just a single interface which works both ways? I think
> either way there are pros and cons.
The patch doesn't add another interface, but the idea is to add it. The implementation will probably fall on a common function that can do both, but some index AMs can't implement the key-ctid operation (hash indexes), so they will have to be separate interfaces, to be able to properly represent the separate capabilities (besides, other implementations may well use completely different paths for the two operations).

> Also, in the thread below you are talking about using the last bit in
> t_info; I want to bring it to your notice that there is a patch of
> mine [1] in which I have used it to avoid changing on-disk
> compatibility of hash indexes. I am not saying that we should not
> plan it for some other use, but just to keep in mind that there are
> other proposals to use it. We can decide what is the best way to
> proceed if we are aware of all the proposals that want to use it.
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com

I wasn't aware of that particular thread, but there's another (the WARM one) where there's another proposed use of that bit.

IMHO, any use of the bit that doesn't leave room for further extensions of the bit space is a bad idea, because it closes the door on (easy) further flags being added. And the fact that there are already several proposals to use that bit for different purposes only reinforces that idea.

The idea I'm working on allows further extension, and, although it uses more space, I believe it's a better way. But we'll discuss that when I have the implementation, I guess.
On Sat, Aug 20, 2016 at 1:53 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> All uniqueness checks will try to read-lock nbuf
> unless the uniqueness check for that insertion is done.

That should read "all uniqueness checks will try to read-lock the buf unless the uniqueness check for that insertion is done." (nbuf -> buf)
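[Putting the last few messages together, the protocol reads roughly like this -- an editorial paraphrase as a comment outline, not code from the patch:]

    /*
     * Uniqueness check with TID-ordered keys, roughly:
     *
     * 1. Descend with (key, TID) and write-lock the insertion page.
     * 2. If the first tuple matching the key lives on an earlier page,
     *    re-descend by key alone and read-lock that first page ("buf").
     * 3. Run _bt_check_unique() scanning right from the first page;
     *    concurrent inserters of the same key serialize on that page's
     *    lock, so only one insert can succeed at a time -- the others
     *    either see the winner's tuple or get an xid to wait on, then
     *    retry.
     * 4. Release the first page and complete the insertion on the
     *    write-locked page.
     */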
On Sat, Aug 20, 2016 at 10:23 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Sat, Aug 20, 2016 at 1:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, Aug 18, 2016 at 8:24 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>>
>>> A couple of points make me uneasy about this patch, yet I can think of
>>> no better alternative, so I seek feedback:
>>>
>>> - introducing a different format for inner index tuples makes for an
>>> invasive patch and quite difficult-to-reason code (it's easy to forget
>>> whether a page is leaf or inner and that now causes assertion failures
>>> or segfaults)
>>> - the ascent-descent to check for uniqueness when there are large
>>> dead tuple runs could have locking subtleties that escape me. Perhaps
>>> there's better ways to solve this.
>>
>> I have tried to study this part of your patch and it seems a somewhat
>> non-trivial and risky part of the proposal.
>>
>> +	} else {
>> +		/*
>> +		 * We found the actual first item on another block, so we have to perform
>> +		 * a two-step search - first half, until our write-locked buffer, then another
>> +		 * starting from our write-locked buffer.
>> +		 */
>> +		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
>> +		LockBuffer(buf, BT_WRITE);
>> +
>> +		buf = _bt_moveright(rel, buf, natts, itup_scankey, &(itup->t_tid), false,
>> +							true, stack, BT_WRITE, NULL);
>> +
>> +		first_offset = _bt_binsrch(rel, buf, natts, itup_scankey, NULL, false, NULL);
>> +
>> +		xwait = _bt_check_unique(rel, itup, heapRel, nbuf, buf, first_offset, itup_scankey,
>> +								 checkUnique, &is_unique, &speculativeToken);
>> +
>> +		_bt_relbuf(rel, nbuf);
>> +	}
>>
>> The idea for uniqueness check is that we hold the write lock on
>> buffer/page on which we are trying to operate (we do take read locks
>> on the consecutive pages during this check). Here, in your changed
>> proposal, you have two buffers (one to which the search led you and
>> one buffer previous in the chain) and before checking uniqueness, on
>> one of the buffers you seem to have write lock and on other you have
>> read lock. This seems to change the way uniqueness check works till
>> now; can you explain how this works (can we safely assume that all
>> searches for uniqueness check will lead to the same buffer first)?
>
> All uniqueness checks will find the same "nbuf" (read-locked), which
> is the first one that matches the key.
>
> They could have different "buf" buffers (the write-locked one). But
> since read locks cannot be held while a write-lock is held, I believe
> it doesn't matter. All uniqueness checks will try to read-lock nbuf
> unless the uniqueness check for that insertion is done. When they do,
> they will block until the other insertion is finished, and when that
> happens, either the insert happened, or it returned an xid to wait for
> and retry. If it succeeded, the check attempting the read-lock will
> see the inserted tuple and return the xid of the earlier transaction
> and wait on it. If it failed, no insert happened because there's a
> visible tuple there, and the read-lock will succeed, find the visible
> tuple, and the insert will fail too.
>
> In essence, it is still the case, I believe, that only one insert can
> succeed at a time, all the others will retry. It only happens that
> with the patch, the contention will be between a read lock and a write
> lock, and not two write locks, and there's a little less contention
> (some more work can be done in parallel).
>
>> With this new mechanism, do we have two types of search interfaces,
>> where one would work for keys (as now) and the other for key-ctid,
>> or will it be just a single interface which works both ways? I think
>> either way there are pros and cons.
>
> The patch doesn't add another interface, but the idea is to add it.
> The implementation will probably fall on a common function that can do
> both

That makes sense, but this means there is a chance that the searches could lead to different buffers in case of uniqueness checks (the search with key-ctid could lead to a different buffer than the search with just the key). I am not clear whether we have a requirement for doing this uniqueness check for the key-ctid search API, because as I understand it, you want to do it mainly for vacuum and WARM tuples. Vacuum won't add new tuples, so is this required for WARM tuples?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Aug 20, 2016 at 4:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> All uniqueness checks will find the same "nbuf" (read-locked), which
>> is the first one that matches the key.
>>
>> They could have different "buf" buffers (the write-locked one). But
>> since read locks cannot be held while a write-lock is held, I believe
>> it doesn't matter. All uniqueness checks will try to read-lock nbuf
>> unless the uniqueness check for that insertion is done. When they do,
>> they will block until the other insertion is finished, and when that
>> happens, either the insert happened, or it returned an xid to wait for
>> and retry. If it succeeded, the check attempting the read-lock will
>> see the inserted tuple and return the xid of the earlier transaction
>> and wait on it. If it failed, no insert happened because there's a
>> visible tuple there, and the read-lock will succeed, find the visible
>> tuple, and the insert will fail too.
>>
>> In essence, it is still the case, I believe, that only one insert can
>> succeed at a time, all the others will retry. It only happens that
>> with the patch, the contention will be between a read lock and a write
>> lock, and not two write locks, and there's a little less contention
>> (some more work can be done in parallel).
>>
>>> With this new mechanism, do we have two types of search interfaces,
>>> where one would work for keys (as now) and the other for key-ctid,
>>> or will it be just a single interface which works both ways? I think
>>> either way there are pros and cons.
>>
>> The patch doesn't add another interface, but the idea is to add it.
>> The implementation will probably fall on a common function that can do
>> both
>
> That makes sense, but this means there is a chance that the searches
> could lead to different buffers in case of uniqueness checks (the
> search with key-ctid could lead to a different buffer than the search
> with just the key). I am not clear whether we have a requirement for
> doing this uniqueness check for the key-ctid search API, because as I
> understand it, you want to do it mainly for vacuum and WARM tuples.
> Vacuum won't add new tuples, so is this required for WARM tuples?

Well, I'm not really sure what exactly would need to be done when doing the WARM conditional insert in the case of unique indexes, and that's a strong motivation to not add the interface for inserts just yet.

Vacuum will only need to delete, and in the case of deletes the operation would be quite straightforward.
On Sat, Aug 20, 2016 at 9:58 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Sat, Aug 20, 2016 at 4:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> That makes sense, but this means there is a chance that the searches
>> could lead to different buffers in case of uniqueness checks (the
>> search with key-ctid could lead to a different buffer than the search
>> with just the key). I am not clear whether we have a requirement for
>> doing this uniqueness check for the key-ctid search API, because as I
>> understand it, you want to do it mainly for vacuum and WARM tuples.
>> Vacuum won't add new tuples, so is this required for WARM tuples?
>
> Well, I'm not really sure what exactly would need to be done when doing
> the WARM conditional insert in the case of unique indexes, and that's
> a strong motivation to not add the interface for inserts just yet.
>
> Vacuum will only need to delete, and in the case of deletes the
> operation would be quite straightforward.

Right. I think if you initially modify the interface only for deletes, that will reduce the complexity of the patch as well. We can later enhance it to handle the WARM tuple case.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
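[For illustration, a delete-only key-ctid entry point might look roughly like this -- entirely hypothetical: bt_search_ktid() and bt_binsrch_ktid() don't exist, they stand in for TID-aware versions of _bt_search()/_bt_binsrch(); WAL-logging and page-level cleanup are omitted:]

    /*
     * Hypothetical: delete the index tuple matching (key, heaptid)
     * exactly, using the TID-sorted ordering to descend straight to the
     * one leaf page that can contain it.
     */
    static bool
    bt_delete_ktid(Relation rel, ScanKey key, int natts, ItemPointer heaptid)
    {
        Buffer      buf;
        Page        page;
        OffsetNumber off;
        bool        found = false;

        (void) bt_search_ktid(rel, natts, key, heaptid, &buf, BT_WRITE);
        page = BufferGetPage(buf);
        off = bt_binsrch_ktid(rel, buf, natts, key, heaptid);

        if (off <= PageGetMaxOffsetNumber(page))
        {
            IndexTuple  itup = (IndexTuple)
                PageGetItem(page, PageGetItemId(page, off));

            /* exact TID match required: no walking right over duplicates */
            if (ItemPointerEquals(&itup->t_tid, heaptid))
            {
                PageIndexTupleDelete(page, off);
                found = true;
            }
        }

        _bt_relbuf(rel, buf);
        return found;
    }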
On Thu, Aug 18, 2016 at 2:15 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I think that this is a bad idea. We need to implement suffix
> truncation of internal page index tuples at some point, to make them
> contain less information from the original leaf page index tuple.
> That's an important optimization, because it increases fan-in. This
> seems like a move in the opposite direction.

Maybe I was too hasty, here. Can you rebase this patch, Claudio?

It's possible that this idea could have further non-obvious benefits. I see some potential for this to help with nbtree page deletion, item recycling, etc. For example, maybe VACUUM can manage to do something much closer to a regular index scan when that seems to make sense -- Andres (CC'd) has talked about this before.

And, maybe we get a benefit from localizing where the LP_DEAD bit can be set for the kill_prior_tuple stuff to work in UPDATE-heavy workloads. Naturally, there'd often be some kind of locality in terms of older row versions belonging earlier in the heap, versions which can be concentrated onto one leaf page with this patch. As newer updates make it look like the leaf page needs to be split, there is a concentration of LP_DEAD-set line items to reclaim space from, letting some UPDATEs (index tuple inserters) defer the split. There is a lot less value in the LP_DEAD stuff if reclaimable line items are sprayed all over the place, which seems quite possible to me on current versions.

I also think that this might help with bitmap index scans.

--
Peter Geoghegan
On Fri, Nov 18, 2016 at 10:05 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Aug 18, 2016 at 2:15 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I think that this is a bad idea. We need to implement suffix
>> truncation of internal page index tuples at some point, to make them
>> contain less information from the original leaf page index tuple.
>> That's an important optimization, because it increases fan-in. This
>> seems like a move in the opposite direction.
>
> Maybe I was too hasty, here. Can you rebase this patch, Claudio?

I've been changing the on-disk format considerably, to one that makes more sense.

I haven't posted it because it still doesn't have suffix (and prefix) truncation, but it should be compatible with it (ie: it could be implemented as an optimization that doesn't change the on-disk format).

However, I haven't tested the current state of the patch thoroughly.

If a WIP is fine, I can post what I have, rebased.

> It's possible that this idea could have further non-obvious benefits.
> I see some potential for this to help with nbtree page deletion, item
> recycling, etc. For example, maybe VACUUM can manage to do something
> much closer to a regular index scan when that seems to make sense --

That was on my mind too.

> I also think that this might help with bitmap index scans.

How so? AFAIK it helps regular scans on low-cardinality indexes more than bitmap index scans.

With low-cardinality indexes, the resulting heap access pattern will be an unordered series of sequential range scans, which is better than the fully random scan the current layout causes. Bitmap index scans, however, deny that benefit.
On Fri, Nov 18, 2016 at 5:27 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> I've been changing the on-disk format considerably, to one that makes
> more sense.

I can see how that would be necessary. I'm going to suggest some more things that need a new btree version number now, too.

I realized today, quite suddenly, why I disliked your original patch: it didn't go far enough with embracing the idea of heap-item-pointer-as-key. In other words, I didn't like that you didn't change anything in the internal pages. Maybe you should put heap TIDs someplace in new internal pages, so that even there we treat them as part of the key. This may significantly benefit UPDATE-heavy workloads where HOT doesn't work out so well. Particularly for non-unique indexes, we currently have to wade through a lot of bloat from the leaf level, rather than jumping straight to the correct leaf page when updating or inserting.

You might not want to do the same with unique indexes, where we must initially buffer lock "the first leaf page that a duplicate could be on" while we potentially look in one or more additional sibling pages (but bloat won't be so bad there, perhaps). Or, you might want to make sure that B-Tree tuple insertions still find that "first page" to check, despite still generally treating heap item pointer as part of the key proper (in unique indexes, it can still behave like NULL, and not participate in uniqueness checking, while still essentially being part of the key/sort order).

Of course, it isn't great to add stuff to the internal pages, but if you're also implementing suffix truncation, there is probably no harm at all. You're only going to affect cases that will also greatly benefit from having that extra information, to guide insertions of new values to the right place more efficiently (by inserters, I mostly mean insertions from UPDATEs). It also occurs to me that our performance for updates in the event of many NULL values in indexes is probably really bad, and could be helped by this. You'll want to test all of this with a workload that isn't very sympathetic to HOT, of course -- most of these benefits are seen when HOT doesn't help.

> However, I haven't tested the current state of the patch thoroughly.
>
> If a WIP is fine, I can post what I have, rebased.

I'm mostly curious about the effects on bloat. I now feel like you haven't sufficiently examined the potential benefits there, since you never made heap item pointer a first class part of the key space. Maybe you'd like to look into that yourself first?

>> I also think that this might help with bitmap index scans.
>
> How so?

I was thinking about teaching nbtree to start the scan at the level above the leaf level. If the heap TID was part of the key proper, one can imagine it being much cheaper to build a bitmap for a moderate cardinality value (e.g., NULL values needed for "WHERE foo IS NULL" index scans). The exact details would need to be worked out, but one can imagine building a bitmap that is much larger than necessary one level up from the leaf, then lazily unsetting bits as we descend the B-Tree to the leaf level and find pages whose bitmap bits can be unset. We might balance the avoidance of I/O during the scan against how much we might expect to save in a subsequent bitmap heap scan, and so on. This might be based on a selectivity estimate. That's all fairly hand-wavy, certainly, but I see significant potential along these lines.

--
Peter Geoghegan
On Fri, Nov 18, 2016 at 11:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Nov 18, 2016 at 5:27 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> I've been changing the on-disk format considerably, to one that makes
>> more sense.
>
> I can see how that would be necessary. I'm going to suggest some more
> things that need a new btree version number now, too.
>
> I realized today, quite suddenly, why I disliked your original patch:
> it didn't go far enough with embracing the idea of
> heap-item-pointer-as-key. In other words, I didn't like that you
> didn't change anything in the internal pages.

But it did. In fact it only changed internal pages.

> Maybe you should put
> heap TIDs someplace in new internal pages, so that even there we treat
> them as part of the key.

That is indeed what's happening (both in the old and new version).

The new version also opens up to the possibility of omitting the heap TID in inner pages, if they're redundant (ie: if two consecutive keys are different already, the heap TID part of the key is redundant, since it's not necessary to make tree descent decisions).

> This may significantly benefit UPDATE-heavy
> workloads where HOT doesn't work out so well. Particularly for
> non-unique indexes, we currently have to wade through a lot of bloat
> from the leaf level, rather than jumping straight to the correct leaf
> page when updating or inserting.

That is improved in the patch as well. The lookup for an insertion point for a heavily bloated (unique or not) index is done efficiently, instead of the O(N^2) way it was done before.

> You might not want to do the same with unique indexes, where we must
> initially buffer lock "the first leaf page that a duplicate could be
> on" while we potentially look in one or more additional sibling pages
> (but bloat won't be so bad there, perhaps).

It's done even for unique indexes. Locking in that case becomes complex, true, but I believe it's not an insurmountable problem.

> Or, you might want to make
> sure that B-Tree tuple insertions still find that "first page" to
> check, despite still generally treating heap item pointer as part of
> the key proper (in unique indexes, it can still behave like NULL, and
> not participate in uniqueness checking, while still essentially being
> part of the key/sort order).

It behaves like that (even though it's not coded like that).

> It also occurs to me that our performance for updates in the event of
> many NULL values in indexes is probably really bad, and could be
> helped by this. You'll want to test all of this with a workload that
> isn't very sympathetic to HOT, of course -- most of these benefits are
> seen when HOT doesn't help.

I haven't really tested with an overabundance of NULLs; I'll add that to the tests. I have tested low-cardinality indexes, though, but only for correctness.

>> However, I haven't tested the current state of the patch thoroughly.
>>
>> If a WIP is fine, I can post what I have, rebased.
>
> I'm mostly curious about the effects on bloat. I now feel like you
> haven't sufficiently examined the potential benefits there, since you
> never made heap item pointer a first class part of the key space.
> Maybe you'd like to look into that yourself first?

What do you mean with "first class part"?

It's not included in the scankey for a variety of reasons, not the least of which is not breaking the interface for current uses, and because the tid type is quite limited in its capabilities ATM.
Also, the heap TID is usually there on most AM calls that care about it (ie: insertions have it, of course), so adding it to the scankey also seemed redundant.

If not adding to the scankey, what do you mean by "first class" then?
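[To illustrate the redundancy rule mentioned earlier in this message's thread (the heap TID can be omitted from an inner-page key whenever the keys around the split point already differ), a sketch; keys_equal() is an assumed helper comparing all indexed attributes:]

    /*
     * Sketch: the downlink key only needs to carry the heap TID when the
     * last tuple on the left page and the first on the right compare as
     * equal -- otherwise the key alone decides every descent.
     */
    static bool
    pivot_needs_heap_tid(Relation rel, IndexTuple lastleft, IndexTuple firstright)
    {
        return keys_equal(rel, lastleft, firstright);   /* assumed helper */
    }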
On Mon, Nov 21, 2016 at 9:42 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> I realized today, quite suddenly, why I disliked your original patch:
>> it didn't go far enough with embracing the idea of
>> heap-item-pointer-as-key. In other words, I didn't like that you
>> didn't change anything in the internal pages.
>
> But it did. In fact it only changed internal pages.

Oh, okay.

>> Maybe you should put
>> heap TIDs someplace in new internal pages, so that even there we treat
>> them as part of the key.
>
> That is indeed what's happening (both in the old and new version).

Good.

> The new version also opens up to the possibility of omitting the heap
> TID in inner pages, if they're redundant (ie: if two consecutive keys
> are different already, the heap TID part of the key is redundant,
> since it's not necessary to make tree descent decisions).

Makes sense; this is just another aspect of suffix truncation.

>> This may significantly benefit UPDATE-heavy
>> workloads where HOT doesn't work out so well. Particularly for
>> non-unique indexes, we currently have to wade through a lot of bloat
>> from the leaf level, rather than jumping straight to the correct leaf
>> page when updating or inserting.
>
> That is improved in the patch as well. The lookup for an insertion
> point for a heavily bloated (unique or not) index is done efficiently,
> instead of the O(N^2) way it was done before.

Cool.

> It's done even for unique indexes. Locking in that case becomes
> complex, true, but I believe it's not an insurmountable problem.

I actually don't think that that would be all that complicated. It's just one case where you have to mostly match the existing behavior.

>> Or, you might want to make
>> sure that B-Tree tuple insertions still find that "first page" to
>> check, despite still generally treating heap item pointer as part of
>> the key proper (in unique indexes, it can still behave like NULL, and
>> not participate in uniqueness checking, while still essentially being
>> part of the key/sort order).
>
> It behaves like that (even though it's not coded like that).

Why not? What do you mean?

What I'm interested in evaluating is an implementation where an index on (foo) has a sort order/key space that is precisely the same as an index on (foo, ctid), with zero exceptions.

>> It also occurs to me that our performance for updates in the event of
>> many NULL values in indexes is probably really bad, and could be
>> helped by this. You'll want to test all of this with a workload that
>> isn't very sympathetic to HOT, of course -- most of these benefits are
>> seen when HOT doesn't help.
>
> I haven't really tested with an overabundance of NULLs; I'll add that
> to the tests. I have tested low-cardinality indexes, though, but only
> for correctness.

I think we'll have to model data distribution to evaluate the idea well. After all, if there is a uniform distribution of values anyway, you're going to end up with many dirty leaf pages. Whereas, in the more realistic case where updates are localized, we can hope to better amortize the cost of inserting index tuples, and recycling index tuples.

> What do you mean with "first class part"?
>
> It's not included in the scankey for a variety of reasons, not the
> least of which is not breaking the interface for current uses, and
> because the tid type is quite limited in its capabilities ATM. Also,
> the heap TID is usually there on most AM calls that care about it (ie:
> insertions have it, of course), so adding it to the scankey also
> seemed redundant.
I mean that insertions into indexes that are low cardinality should be largely guided by TID comparisons.

> If not adding to the scankey, what do you mean by "first class" then?

Just what I said about the sort order of an index on "(foo)" now precisely matching what we'd get for "(foo, ctid)".

There are a couple of tricky issues there that you'd have to look out for, like making sure that the high key continues to hold a real TID, which at a glance looks like something that just happens already. I'm not sure that we preserve the item pointer TID in all cases, since the current assumption is that a high key's TID is just filler -- maybe we can lose that at some point.

You should use amcheck to specifically verify that that happens reliably in all cases. Presumably, its use of an insertion scankey would automatically pick up the TID tie-breaker under patched Postgres, and so amcheck will work for this purpose unmodified.
--
Peter Geoghegan
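To make the "(foo, ctid)" ordering concrete, here is a minimal sketch of the tie-breaker comparison under discussion. This is not code from the patch: compare_foo() stands in for whatever opclass comparator applies, and only ItemPointerCompare() is existing PostgreSQL API.

#include "postgres.h"
#include "storage/itemptr.h"

extern int compare_foo(Datum a, Datum b);   /* assumed opclass comparator */

static int
compare_with_tid_tiebreak(Datum foo1, ItemPointer tid1,
                          Datum foo2, ItemPointer tid2)
{
    int     cmp = compare_foo(foo1, foo2);

    if (cmp != 0)
        return cmp;

    /*
     * Keys are equal: fall back to the heap TID.  This makes the ordering
     * total, so every entry has exactly one correct position in the tree.
     */
    return ItemPointerCompare(tid1, tid2);
}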
On Mon, Nov 21, 2016 at 5:42 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> Or, you might want to make sure that B-Tree tuple insertions still find that "first page" to check, despite still generally treating the heap item pointer as part of the key proper (in unique indexes, it can still behave like NULL, and not participate in uniqueness checking, while still essentially being part of the key/sort order).
>>
>> It behaves like that (even though it's not coded like that)
>
> Why not? What do you mean?
>
> What I'm interested in evaluating is an implementation where an index on (foo) has a sort order/key space that is precisely the same as an index on (foo, ctid), with zero exceptions.

Well, the patch does the same sorting, but without explicitly adding the ctid to the scankey.

When inserting into a unique index, the lookup for the insertion point proceeds as it would if the key were (foo, ctid), finding a page in the middle of the range that contains item pointers for foo. When that happens on a regular index, the insertion is done at that point and nothing else needs to be done. But when it happens on a unique index, the code first checks whether the first item pointer for foo is on the same page (presumably a frequent case). If so, the uniqueness check is done in a very straightforward and efficient manner. If not, the code retraces its steps up the tree to the first point where it followed a key other than foo (that's the only way it could land at a middle page), and then follows the downlinks using a truncated key (just foo, ignoring ctid).

Kind of what you describe, but without the explicit NULL.

>>> It also occurs to me that our performance for updates in the event of many NULL values in indexes is probably really bad, and could be helped by this. You'll want to test all of this with a workload that isn't very sympathetic to HOT, of course -- most of these benefits are seen when HOT doesn't help.
>>
>> I haven't really tested with an overabundance of NULLs; I'll add that to the tests. I have tested low-cardinality indexes, though only for correctness.
>
> I think we'll have to model data distribution to evaluate the idea well. After all, if there is a uniform distribution of values anyway, you're going to end up with many dirty leaf pages. Whereas, in the more realistic case where updates are localized, we can hope to better amortize the cost of inserting and recycling index tuples.

Quite possibly.

>> What do you mean by "first class part"?
>>
>> It's not included in the scankey for a variety of reasons, not the least of which is not breaking the interface for current uses, and because the tid type is quite limited in its capabilities ATM. Also, the heap TID is usually there on most AM calls that care about it (insertions have it, of course), so adding it to the scankey also seemed redundant.
>
> I mean that insertions into indexes that are low cardinality should be largely guided by TID comparisons.

...

>> If not adding to the scankey, what do you mean by "first class" then?
>
> Just what I said about the sort order of an index on "(foo)" now precisely matching what we'd get for "(foo, ctid)".

It is the case in both versions of the patch.

> There are a couple of tricky issues there that you'd have to look out for, like making sure that the high key continues to hold a real TID, which at a glance looks like something that just happens already.
> I'm not sure that we preserve the item pointer TID in all cases, since the current assumption is that a high key's TID is just filler -- maybe we can lose that at some point.

I thought long about that. Inner pages don't need a real TID in their keys, as those keys only specify keyspace cutoff points, not real tids; and high keys in leaf pages always carry real TIDs. I haven't had any issue with that, aside from the headaches that came from thinking about it for protracted periods of time.

> You should use amcheck to specifically verify that that happens reliably in all cases. Presumably, its use of an insertion scankey would automatically pick up the TID tie-breaker under patched Postgres, and so amcheck will work for this purpose unmodified.

It's totally on my roadmap. I'll have to adapt it for the new index format, though; it doesn't work without modification.
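The insertion-point walk Claudio describes above can be sketched roughly as follows. All three helper names are placeholders invented to mirror the prose; none of them exist in the patch, and only the surrounding types are real PostgreSQL API.

#include "postgres.h"
#include "storage/buf.h"
#include "utils/rel.h"

/* Hypothetical helpers, named only to mirror the description above. */
extern Buffer descend_with_tid(Relation index, Datum foo, ItemPointer tid);
extern bool first_equal_key_on_page(Buffer buf, Datum foo);
extern Buffer redescend_with_truncated_key(Relation index, Datum foo);

static Buffer
locate_insertion_point(Relation index, Datum foo, ItemPointer newtid,
                       bool is_unique)
{
    /* Descend as though the key were (foo, ctid). */
    Buffer      buf = descend_with_tid(index, foo, newtid);

    /* Non-unique index: insert right here, nothing else to check. */
    if (!is_unique)
        return buf;

    /* Presumed-frequent case: the run of equal keys starts on this page. */
    if (first_equal_key_on_page(buf, foo))
        return buf;             /* cheap, page-local uniqueness check */

    /*
     * Otherwise retrace the descent to the first level where a key other
     * than foo was followed (the only way to land on a middle page), and
     * re-descend comparing only the truncated key (foo, ignoring ctid)
     * to find where the run of equal keys begins.
     */
    return redescend_with_truncated_key(index, foo);
}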
On Mon, Nov 21, 2016 at 5:15 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> There are a couple of tricky issues there that you'd have to look out for, like making sure that the high key continues to hold a real TID, which at a glance looks like something that just happens already. I'm not sure that we preserve the item pointer TID in all cases, since the current assumption is that a high key's TID is just filler -- maybe we can lose that at some point.
>
> I thought long about that. Inner pages don't need a real TID in their keys, as those keys only specify keyspace cutoff points, not real tids; and high keys in leaf pages always carry real TIDs.

Not if there are duplicates in the inner pages. Then you have to add in the TID, and separate the key space, to guide insertions to the right leaf page directly (particularly for non-HOT UPDATEs). That's what I'm mostly interested in investigating here.
--
Peter Geoghegan
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
From: Peter Geoghegan
On Wed, Aug 17, 2016 at 7:54 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> The attached patch tries to maintain the initial status of B-Tree indexes, which are created with equal-key runs in physical order, during the whole life of the B-Tree, and make key-tid pairs efficiently searchable in the process.

I don't see an entry for this in the next CF. Do you have a plan for it?

BTW, I did post some remarks on this patch on another thread recently [1]. Not sure if any of what I said there is news to you at this point.

[1] postgr.es/m/CAH2-Wzn=6Lc3OVA88x=E6SKG72ojNUE6ut6RZAqNnQx-YLcw=Q@mail.gmail.com
--
Peter Geoghegan
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
From: Claudio Freire
On Fri, Jul 21, 2017 at 10:31 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Aug 17, 2016 at 7:54 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> The attached patch tries to maintain the initial status of B-Tree indexes, which are created with equal-key runs in physical order, during the whole life of the B-Tree, and make key-tid pairs efficiently searchable in the process.
>
> I don't see an entry for this in the next CF. Do you have a plan for it?
>
> BTW, I did post some remarks on this patch on another thread recently [1]. Not sure if any of what I said there is news to you at this point.
>
> [1] postgr.es/m/CAH2-Wzn=6Lc3OVA88x=E6SKG72ojNUE6ut6RZAqNnQx-YLcw=Q@mail.gmail.com

I plan to restart work on it soonish, but ATM most of my time is dedicated to the vacuum patch, which is almost done.
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
From: Claudio Freire
On Wed, Nov 23, 2016 at 12:27 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Nov 21, 2016 at 5:15 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> There are a couple of tricky issues there that you'd have to look out for, like making sure that the high key continues to hold a real TID, which at a glance looks like something that just happens already. I'm not sure that we preserve the item pointer TID in all cases, since the current assumption is that a high key's TID is just filler -- maybe we can lose that at some point.
>>
>> I thought long about that. Inner pages don't need a real TID in their keys, as those keys only specify keyspace cutoff points, not real tids; and high keys in leaf pages always carry real TIDs.
>
> Not if there are duplicates in the inner pages. Then you have to add in the TID, and separate the key space, to guide insertions to the right leaf page directly (particularly for non-HOT UPDATEs). That's what I'm mostly interested in investigating here.

My point was that the TID doesn't have to point to an actual tuple. It's more of a keyspace thing, so it doesn't need to match real tuples; it can just divide the keyspace with an arbitrary cutoff, and it should be cheaper to maintain without that requirement.
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
From: Peter Geoghegan
On Mon, Jul 24, 2017 at 9:51 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> My point was that the TID doesn't have to point to an actual tuple. It's more of a keyspace thing, so it doesn't need to match real tuples; it can just divide the keyspace with an arbitrary cutoff, and it should be cheaper to maintain without that requirement.

I agree, but unless you're using normalized keys, I don't see that you get much useful leeway from using fake or truncated TID values. Presumably the comparison logic will be based on comparing an ItemPointerData field, which is impractical to truncate.
--
Peter Geoghegan
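For reference, the struct in question, from src/include/storage/itemptr.h (abridged, with comments added here): a fixed six-byte value that is compared as a unit, which is why there is no cheap way to keep "part of" a TID the way suffix truncation drops whole attributes.

/* From src/include/storage/itemptr.h (abridged; comments added). */
typedef struct ItemPointerData
{
    BlockIdData     ip_blkid;   /* physical block number of the tuple */
    OffsetNumber    ip_posid;   /* line-pointer index within that block */
} ItemPointerData;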
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
From: Claudio Freire
On Mon, Jul 24, 2017 at 2:02 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jul 24, 2017 at 9:51 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> My point was that the TID doesn't have to point to an actual tuple. It's more of a keyspace thing, so it doesn't need to match real tuples; it can just divide the keyspace with an arbitrary cutoff, and it should be cheaper to maintain without that requirement.
>
> I agree, but unless you're using normalized keys, I don't see that you get much useful leeway from using fake or truncated TID values. Presumably the comparison logic will be based on comparing an ItemPointerData field, which is impractical to truncate.

In most cases, the TID itself can be omitted when the key itself already differs.

When a split happens, a TID referring to a real tid from a child page will be added iff necessary.

This gives a lot of leeway in choosing the cutoff point, since a cutoff point is only added when necessary.

Vacuum *might* be able to redistribute tuples too, while holding locks on all three pages (the two children and the parent page), since it can move the TID to the midpoint of two actual child TIDs, mindlessly, without checking whether such a TID actually exists; it just needs to choose a TID cutoff point that will distribute item pointers evenly. I haven't tried this, but it is theoretically possible.
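An illustrative sketch of such a synthetic cutoff, assuming, purely for illustration, that a block-number midpoint is acceptable. make_separator_tid() is a made-up name; only the ItemPointer accessors are real PostgreSQL API, and a real implementation would have to guarantee left < separator <= right in all cases.

#include "postgres.h"
#include "storage/itemptr.h"
#include "storage/off.h"

static void
make_separator_tid(ItemPointer left, ItemPointer right, ItemPointer sep)
{
    BlockNumber lo = ItemPointerGetBlockNumber(left);
    BlockNumber hi = ItemPointerGetBlockNumber(right);

    if (hi > lo + 1)
    {
        /*
         * Plenty of room: any block strictly between the two will do, so
         * no tuple lookup is needed.  With FirstOffsetNumber, the result
         * sorts after everything in block lo and before everything in hi.
         */
        ItemPointerSet(sep, lo + (hi - lo) / 2, FirstOffsetNumber);
    }
    else
    {
        /* Adjacent (or equal) blocks: fall back to the real right TID. */
        ItemPointerCopy(right, sep);
    }
}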
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
From: Peter Geoghegan
On Mon, Jul 24, 2017 at 10:13 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> In most cases, the TID itself can be omitted when the key itself already differs.
>
> When a split happens, a TID referring to a real tid from a child page will be added iff necessary.
>
> This gives a lot of leeway in choosing the cutoff point, since a cutoff point is only added when necessary.

I agree with all that. That just sounds like a basic implementation of suffix truncation, to support making heap TID a part of the keyspace without paying a big cost in fan-in.

> Vacuum *might* be able to redistribute tuples too, while holding locks on all three pages (the two children and the parent page), since it can move the TID to the midpoint of two actual child TIDs, mindlessly, without checking whether such a TID actually exists; it just needs to choose a TID cutoff point that will distribute item pointers evenly. I haven't tried this, but it is theoretically possible.

I don't understand what you mean here. As long as the TIDs are a first class part of the keyspace, what VACUUM does or doesn't do should not matter. Page deletion stashes a TID in the highkey position of a leaf page within _bt_mark_page_halfdead(), but that shouldn't matter to you.

I guess you're talking about contriving a TID value that is the mean of two real TID values -- the two that are on each side of the split point during a leaf page split. While that seems possible, I don't see much value in it. Unless you have normalized keys, you can only really truncate whole attributes. And I think it's a bad idea to truncate anything other than the new high key for leaf pages, with or without normalized keys. Changing the keys once they're in internal pages is never going to be worth it.
--
Peter Geoghegan
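A hedged sketch of the whole-attribute truncation being described: at a leaf split, keep only as many leading attributes of the would-be high key as are needed to tell the last left-hand tuple apart from the first right-hand tuple. compare_indexed_attribute() is a placeholder for a per-attribute opclass comparison, not a real PostgreSQL function.

#include "postgres.h"
#include "access/itup.h"

/* Hypothetical per-attribute comparator; not part of PostgreSQL. */
extern int compare_indexed_attribute(IndexTuple a, IndexTuple b, int attnum);

static int
attrs_needed_in_high_key(IndexTuple lastleft, IndexTuple firstright,
                         int natts)
{
    int     keep;

    for (keep = 1; keep <= natts; keep++)
    {
        if (compare_indexed_attribute(lastleft, firstright, keep) != 0)
            return keep;    /* attributes 1..keep already separate them */
    }

    /*
     * All attributes are equal, so only the heap TID distinguishes the two
     * halves: the new high key must then keep a TID as part of the key.
     */
    return natts + 1;
}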
Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location
From: Claudio Freire
On Mon, Jul 24, 2017 at 2:43 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jul 24, 2017 at 10:13 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Vacuum *might* be able to redistribute tuples too, while holding locks on all three pages (the two children and the parent page), since it can move the TID to the midpoint of two actual child TIDs, mindlessly, without checking whether such a TID actually exists; it just needs to choose a TID cutoff point that will distribute item pointers evenly. I haven't tried this, but it is theoretically possible.
>
> I don't understand what you mean here. As long as the TIDs are a first class part of the keyspace, what VACUUM does or doesn't do should not matter. Page deletion stashes a TID in the highkey position of a leaf page within _bt_mark_page_halfdead(), but that shouldn't matter to you.
>
> I guess you're talking about contriving a TID value that is the mean of two real TID values -- the two that are on each side of the split point during a leaf page split. While that seems possible, I don't see much value in it. Unless you have normalized keys, you can only really truncate whole attributes. And I think it's a bad idea to truncate anything other than the new high key for leaf pages, with or without normalized keys. Changing the keys once they're in internal pages is never going to be worth it.

That's what I'm saying: it might not be worth it. I haven't tested it yet.