Thread: Lossy Index Tuple Enhancement (LITE)
This proposal introduces "lossy index tuples" (LITE), which provide two important optimizations for btrees and potentially other index types, prompted by recent blog complaints. The two optimizations are Clustered Indexes (mostly for INSERTs) and Update Duplicate Removal (for non-HOT UPDATEs), discussed in more detail below. LITE can be used for either or both of these optimizations independently, even though they are implemented in fairly similar ways; so if one of them gets shot down the other is still potentially valid.

If this has wings, I'll start tracking this as a wiki page. If it doesn't have wings, at least we tried to respond with technical solutions to external input.

=Lossy Index Tuple Enhancement=

The LITE concept is that if the 13th index tuple bit is set, the index tuple represents a collection of tuples on one heap page rather than a pointer to a single root heap tuple. The 13th bit is currently unused, so this optimization is fully backwards compatible with all existing indexes; we just start using them more efficiently over time. Importantly, this means that the user doesn't need to do anything at all to begin benefiting from LITE. It also means that much of the code required is non-invasive to the index code. These benefits are much better than inventing a new index type and leaving the user to figure out how and when to use it.

We call the 13th bit the "lossy bit", using the same terminology as bitmap index access, indicating that we no longer hold exact details of the heap tuples the pointer refers to, nor do we do reference counting. When the lossy bit is set we no longer interpret the TID as (blocknumber, linepointer); we interpret the data as (blocknumber, lite_flags). The lite_flags have one bit reserved to decide whether the LITE tuple represents multiple duplicates or whether it represents a range of values. The remainder of the lite_flags allow us to store details of which ranges of linepointers are covered by the LITE tuple, splitting the block into 15 ranges of linepointers. We just XOR the new linepointer position with the bitmap when we update the bitmask. (The ranges are consecutive, not striped, to ensure that heavily repetitive updates are optimized.) If anyone is worried about running out of index tuple flags, then perhaps we can reserve a few of the 16 bits for additional flags rather than use them all for lp ranges.

By splitting the block into 15 ranges we avoid the need for scanning the whole block in most cases, so the performance degradation from single pointer to complete-block pointer should be much smoother.

This optimization would still allow UNIQUE indexes. In fact this optimization would almost never overlap with a unique index anyway, since we seldom update the PK of a table. Lossy index tuples would not be able to skip accessing the heap during index-only scans, but the presence of covered index columns would increase the uniqueness and very likely stop the index tuple being marked lossy.

The remainder of the description is split between the two use cases/optimizations, which are similar but would have different code for each.
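To make the bit layout concrete, here is a minimal sketch, not taken from any patch, of how the 16-bit offset part of a lossy TID could be interpreted: one bit distinguishes duplicate from range LITE tuples, and the remaining 15 bits each cover one consecutive range of line pointers. The helper names, the 16-line-pointers-per-range width, and the use of a plain bit-set (OR) rather than XOR are assumptions made for illustration only.

#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical layout of the 16-bit lite_flags field that replaces the
 * line pointer part of the TID when the lossy bit is set:
 *
 *   bit 15     : 1 = range (clustered) LITE tuple, 0 = duplicate LITE tuple
 *   bits 0..14 : each bit covers one consecutive range of line pointers
 */
#define LITE_RANGE_TUPLE    ((uint16_t) (1u << 15))
#define LITE_NRANGES        15
#define LITE_LPS_PER_RANGE  16   /* assumed width: lps 0-15, 16-31, 32-47, ... */

/* Which of the 15 range bits covers this line pointer? */
static uint16_t
lite_range_bit(uint16_t lp)
{
    int r = lp / LITE_LPS_PER_RANGE;

    if (r >= LITE_NRANGES)
        r = LITE_NRANGES - 1;    /* clamp: last range covers the page tail */
    return (uint16_t) (1u << r);
}

/* Mark the range containing lp as covered. */
static uint16_t
lite_cover_lp(uint16_t lite_flags, uint16_t lp)
{
    return lite_flags | lite_range_bit(lp);
}

/* Does the LITE tuple already cover this line pointer? */
static int
lite_covers_lp(uint16_t lite_flags, uint16_t lp)
{
    return (lite_flags & lite_range_bit(lp)) != 0;
}

int
main(void)
{
    uint16_t flags = 0;

    flags = lite_cover_lp(flags, 3);     /* first range:  lps 0-15  */
    flags = lite_cover_lp(flags, 40);    /* third range:  lps 32-47 */
    printf("covers lp 7?  %d\n", lite_covers_lp(flags, 7));    /* 1 */
    printf("covers lp 20? %d\n", lite_covers_lp(flags, 20));   /* 0 */
    return 0;
}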
== Update Duplicate Removal ==

We want an optimization that reduces the effects of multiple UPDATEs on the same block that have duplicate values, caused because another index column has been updated and a non-HOT index insert has been generated. This optimizes the case where someone has 12 indexes yet updates just one of them, so we avoid the 11 unnecessary index updates, if possible.

As UPDATEs occur we request inserts into the index. If a lossy index pointer already exists that covers the new linepointer then we skip the index insert, optimizing writes to the index. If a lossy index pointer already exists for that value but doesn't yet cover our linepointer then we update the bitmask. If a non-lossy index pointer already exists we set the lossy bit and update the bitmask to reflect which linepointers the index entry covers. If no index pointer exists we insert one. The first update on a row will not be optimized, but the 2nd-16th could be. These actions optimize away most of the writes and WAL activity to the index data block, since only 1 in every 16 updates would cause changes to the block (actually about 1 in 10 on average). Most importantly, updates will cause far fewer index block splits, and reindexing will be needed less often.

The same situation can occur if we have many INSERTs with the same values on the same block. This could lead to a reduction in size of the btree for indexes with many duplicate values.

== Clustered Index ==

We want an optimization that allows us to index a consecutive range of values on one block. Currently if we have values 1-100 on one heap block, we would have 100 index pointers all pointing to the one heap block. LITE would allow that to be indexed as a single index tuple, in the common case.

As an INSERT starts a new block we begin with a base value of N. Frequently we have the case where the next value in the table is N+1 and is inserted into the same block (or could be arranged to do so). After, say, 100 INSERTs we now have values 1-100 on the heap block. As the first INSERT occurs we insert a normal index tuple pointing to it. As we INSERT a new value we can mark the index tuple as being an RLITE (range LITE) tuple. Further inserts with a higher value on this data block will be optimized away, as long as there is no later index tuple, e.g.

  val1   -> block0
  val100 -> block1

If we insert the value 85 then it would go to block0; if we insert the value 105 it would go to block1, without creating an additional index tuple. The value -5 would create a new index tuple.

Index entries that split block ranges would be more complex and costly, though they are seen only infrequently in real-world apps. E.g. we might have values 1, 90, 91 on block0 and values 100, 105 on block1. If we insert the value 25 into block2 we would then have an index that looks like this:

  val1   -> block0
  val25  -> block2
  val90  -> block0
  val100 -> block1

First, we would clearly benefit here if we route the heap insert for value 25 to occur on block0. That routing may not be acceptable, or the block itself may be full, so we must handle the block split. If a value is inserted into a range, then we must re-read the heap block to see how to handle it. I.e. the index tuple val1 represents values from val1 up to val100 and points to block0. If we insert val25 then we must re-read block0 and edit the index entries around it by creating two new index tuples: val1 (representing 1-24) and val90 (representing 90-99). This looks doable, but more complex, so I'm posting this now to allow for discussion and for people to spot the holes.
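Here is a rough sketch of the insert-time decision for the duplicate-removal case described in the Update Duplicate Removal section above (the range/clustered case would additionally need the block-split handling just discussed). The entry structure, the helper, and the function names are invented for this illustration and are not an actual btinsert() patch.

#include <stdbool.h>
#include <stdint.h>

#define LITE_NRANGES        15
#define LITE_LPS_PER_RANGE  16   /* assumed width, as in the earlier sketch */

/* Which of the 15 range bits covers this line pointer? */
static uint16_t
lite_range_bit(uint16_t lp)
{
    int r = lp / LITE_LPS_PER_RANGE;

    if (r >= LITE_NRANGES)
        r = LITE_NRANGES - 1;
    return (uint16_t) (1u << r);
}

/* Hypothetical view of one index entry for a given key value. */
typedef struct LiteIndexEntry
{
    uint32_t heap_block;    /* heap block the entry points into */
    bool     lossy;         /* "13th bit": entry is a LITE tuple */
    uint16_t lite_flags;    /* range bitmap, valid when lossy is set */
    uint16_t lp;            /* single line pointer, valid when not lossy */
} LiteIndexEntry;

/*
 * A non-HOT UPDATE asks us to index (key, block, lp); "existing" is the
 * matching entry already in the btree for that key and heap block, or NULL.
 * Returns true only when a brand-new index tuple must be inserted.
 */
static bool
lite_handle_insert(LiteIndexEntry *existing, uint32_t block, uint16_t lp)
{
    if (existing == NULL || existing->heap_block != block)
        return true;                            /* no usable entry: insert */

    if (existing->lossy)
    {
        if (existing->lite_flags & lite_range_bit(lp))
            return false;                       /* already covered: skip */
        existing->lite_flags |= lite_range_bit(lp);
        return false;                           /* widen the bitmap in place */
    }

    /* A non-lossy entry for the same value and block: convert it into a
     * LITE tuple covering both the old and the new line pointer. */
    existing->lossy = true;
    existing->lite_flags = lite_range_bit(existing->lp) | lite_range_bit(lp);
    return false;
}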
== IndexScan ==

Note that the executor code for IndexScan appears identical between the two optimizations. The difference between duplicate and range LITE tuples is needed only at INSERT time (or UPDATE of an indexed column to a new value).

When we do an IndexScan, if we see a LITE tuple, we do a scan of the linepointer ranges covered by this index tuple (which might eventually go to a full block scan). For example with bit1 set we would scan 16 linepointers (on an 8192 byte block) and with 2 bits set we would scan 32 linepointers etc., not necessarily consecutive ranges. So the IndexScan can return potentially many heap rows per index entry, which need to be re-checked and may also need to be sorted prior to returning them. If no rows are returned we can kill the index pointer, just as we do now for btrees, so they will be removed eventually from the index without the need for VACUUM. A BitmapIndexScan that sees a lossy pointer can also use the lossy TID concept, so we can have partially lossy bitmaps.

== VACUUM ==

VACUUM would not cause a problem since line pointer removal for non-lossy index tuples would occur normally. Lossy index tuples would be ignored by index vacuum, but they would not prevent removal of line pointers from the heap.

== Code Footprint ==

All of this code is concentrated in the Scan nodes, and in btinsert(). There are no changes to WAL records or replay. The details are all in the way we insert and interpret index tuples.

== Other designs ==

This is related in a few ways to Grouped Item Tuple indexes, a patch Heikki worked on in 2007, but is not that close.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
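For concreteness, a small sketch of how an IndexScan could expand a LITE tuple's range bits into the line pointer ranges it has to visit and recheck; the 16-line-pointers-per-range figure and helper names are assumptions carried over from the earlier sketches, not part of the proposal.

#include <stdint.h>
#include <stdio.h>

#define LITE_NRANGES        15
#define LITE_LPS_PER_RANGE  16   /* assumed width, as in the earlier sketches */

/*
 * Print the line pointer ranges an IndexScan would have to visit for one
 * lossy index tuple.  Every heap tuple in these ranges is fetched and
 * rechecked against the scan keys, and the index pointer can be killed
 * when nothing qualifies, as the IndexScan section above describes.
 */
static void
lite_ranges_to_scan(uint16_t lite_flags)
{
    for (int r = 0; r < LITE_NRANGES; r++)
    {
        if (lite_flags & (1u << r))
            printf("scan lps %d..%d\n",
                   r * LITE_LPS_PER_RANGE,
                   (r + 1) * LITE_LPS_PER_RANGE - 1);
    }
}

int
main(void)
{
    lite_ranges_to_scan(0x0005);   /* bits 0 and 2 set: two 16-lp ranges */
    return 0;
}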
On Wed, Aug 3, 2016 at 4:20 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> == IndexScan ==
>
> Note that the executor code for IndexScan appears identical between the two
> optimizations. The difference between duplicate and range LITE tuples is
> needed only at INSERT time (or UPDATE of an indexed column to a new value).
>
> When we do an IndexScan, if we see a LITE tuple, we do a scan of the
> linepointer ranges covered by this index tuple (which might eventually go to
> a full block scan). For example with bit1 set we would scan 16 linepointers
> (on an 8192 byte block) and with 2 bits set we would scan 32 linepointers
> etc., not necessarily consecutive ranges. So the IndexScan can return
> potentially many heap rows per index entry, which need to be re-checked and
> may also need to be sorted prior to returning them. If no rows are returned
> we can kill the index pointer, just as we do now for btrees, so they will be
> removed eventually from the index without the need for VACUUM. A
> BitmapIndexScan that sees a lossy pointer can also use the lossy TID
> concept, so we can have partially lossy bitmaps.

Wouldn't this risk scanning rows more than once unless it's part of a bitmap scan?
On Wed, Aug 3, 2016 at 08:20:49AM +0100, Simon Riggs wrote:
> == Update Duplicate Removal ==
>
> We want an optimization that reduces the effects of multiple UPDATEs on the
> same block that have duplicate values, caused because another index column
> has been updated and a non-HOT index insert has been generated. This
> optimizes the case where someone has 12 indexes yet updates just one of
> them, so we avoid the 11 unnecessary index updates, if possible.
>
> As UPDATEs occur we request inserts into the index. If a lossy index pointer
> already exists that covers the new linepointer then we skip the index
> insert, optimizing writes to the index. If a lossy index pointer already
> exists for that value but doesn't yet cover our linepointer then we update
> the bitmask. If a non-lossy index pointer already exists we set the lossy
> bit and update the bitmask to reflect which linepointers the index entry
> covers. If no index pointer exists we insert one. The first update on a row
> will not be optimized, but the 2nd-16th could be. These actions optimize
> away most of the writes and WAL activity to the index data block, since only
> 1 in every 16 updates would cause changes to the block (actually about 1 in
> 10 on average). Most importantly, updates will cause far fewer index block
> splits, and reindexing will be needed less often.
>
> The same situation can occur if we have many INSERTs with the same values on
> the same block. This could lead to a reduction in size of the btree for
> indexes with many duplicate values.

Let me see if I understand this. Right now, if no indexed values are changed and the old and new update rows are in the same page, we don't create a new index entry for the new row, but rather use redirect tid pointers. We can recycle tuple data and tid pointers for frequent updates because there is only one index entry for the entire update chain. This doesn't work if we changed an indexed value.

Your solution is not to use redirect tid/line pointers at all. Instead, you create normal index pointers for the indexes with changed column values. For the indexes without changed column values, you create this new LITE index entry, which can point to all the update-chain tuples in the same page.

In Postgres currently, we can remove tuple data when it is expired, and tids if they are part of redirect chains (no index changes), and we need vacuum to remove non-redirect tids and old index entries. With LITE, you can avoid the creation of duplicate-value index entries for indexes without changed column values by using a bitmap in place of the tid item number (16 bits). It can't remove dead tids.

I had some ideas of how to handle the bitmap, like doing "mod 16" on the tid item value, meaning we would have 16 equal-sized buckets. Assuming tuples are 100 bytes, that makes each bucket hold five rows on average.

However, I am now wondering why bother with the bitmap at all. Can't the LITE item pointer just point to the head of the update chain, and we can walk the chain to find the tuple that matches our snapshot? (The tuple has a pointer to the next entry in the chain.)

Basically, right now with an update that changes values in some indexes and not others, where the tuples are all on the same page, we are creating multiple index pointers to point to different tuples on the same page. I am suggesting we don't do that, and instead just leave the index tuple pointing to the head of the chain.
I think the big problem with this is that removing the tuple data will also remove the row's update chain pointer. Maybe we can modify tid redirect pointers to handle this.

> == Clustered Index ==

I read about clustered indexes and I am not sure how it would help in most cases. Are there enough index entries with identical or sequentially-grouped values to be useful? It kind of feels like a single-page BRIN index.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +
On Wed, Aug 3, 2016 at 07:28:52PM -0400, Bruce Momjian wrote:
> With LITE, you can avoid the creation of duplicate-value index entries for
> indexes without changed column values by using a bitmap in place of the tid
> item number (16 bits). It can't remove dead tids.

How would you handle the case where there are two LITE index entries pointing to two different update chains on the same page? When you search the page for the first heap chain, could the second index entry find the same chain? How would you know which index entry is which chain?

Would you only add a LITE index entry when there isn't an existing index entry for the same values and heap page? That seems quite complicated.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +
On 4 August 2016 at 00:56, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Aug 3, 2016 at 07:28:52PM -0400, Bruce Momjian wrote:
>> With LITE, you can avoid the creation of duplicate-value index entries for
>> indexes without changed column values by using a bitmap in place of the tid
>> item number (16 bits). It can't remove dead tids.
>
> How would you handle the case where there are two LITE index entries
> pointing to two different update chains on the same page? When you search
> the page for the first heap chain, could the second index entry find the
> same chain? How would you know which index entry is which chain?

It's easiest to understand this by imagining each LITE pointer pointing to a whole page. The chains aren't followed during the scan; individual heap tuple versions are treated as they would be by a seq scan.

That is more expensive than we might like, so the bitmap/linepointer thing is just an extra tweak to avoid scanning the whole block. The bitmap refers to ranges of linepointers, not chains, i.e. 0-15, 16-31, 32-47 etc.

> Would you only add a LITE index entry when there isn't an existing index
> entry for the same values and heap page? That seems quite complicated.

The insertion algorithm is described. Doesn't seem complicated to me.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 4, 2016 at 01:16:20AM +0100, Simon Riggs wrote:
> On 4 August 2016 at 00:56, Bruce Momjian <bruce@momjian.us> wrote:
>> On Wed, Aug 3, 2016 at 07:28:52PM -0400, Bruce Momjian wrote:
>>> With LITE, you can avoid the creation of duplicate-value index entries for
>>> indexes without changed column values by using a bitmap in place of the
>>> tid item number (16 bits). It can't remove dead tids.
>>
>> How would you handle the case where there are two LITE index entries
>> pointing to two different update chains on the same page? When you search
>> the page for the first heap chain, could the second index entry find the
>> same chain? How would you know which index entry is which chain?
>
> It's easiest to understand this by imagining each LITE pointer pointing to a
> whole page. The chains aren't followed during the scan; individual heap
> tuple versions are treated as they would be by a seq scan.
>
> That is more expensive than we might like, so the bitmap/linepointer thing
> is just an extra tweak to avoid scanning the whole block. The bitmap refers
> to ranges of linepointers, not chains, i.e. 0-15, 16-31, 32-47 etc.

Well, there is no way to know how many linepointers there are on a page, so doing "mod 16" automatically hashes the line pointers into the 16-bit field.

>> Would you only add a LITE index entry when there isn't an existing index
>> entry for the same values and heap page? That seems quite complicated.
>
> The insertion algorithm is described. Doesn't seem complicated to me.

OK.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +
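A tiny illustration of the two ways of mapping a line pointer to a bitmap bit being discussed here; neither is from an actual patch, and the names and widths are made up purely for the comparison.

#include <stdint.h>
#include <stdio.h>

/*
 * Two hypothetical line-pointer-to-bit mappings:
 *
 *   - consecutive ranges (the proposal): nearby line pointers share a bit,
 *     so repetitive updates clustered on one part of the page stay in one
 *     range (15 ranges, one lite_flags bit reserved);
 *   - "mod 16" hashing (suggested above): works for any number of line
 *     pointers on the page, but scatters neighbouring line pointers across
 *     all 16 buckets.
 */
static uint16_t
bit_by_range(uint16_t lp)        /* consecutive: lps 0-15 -> bit 0, etc. */
{
    int r = lp / 16;
    return (uint16_t) (1u << (r > 14 ? 14 : r));
}

static uint16_t
bit_by_mod(uint16_t lp)          /* hashed: lps 0,16,32,... -> bit 0, etc. */
{
    return (uint16_t) (1u << (lp % 16));
}

int
main(void)
{
    for (uint16_t lp = 30; lp <= 33; lp++)
        printf("lp %u: range bit %#x, mod bit %#x\n",
               lp, bit_by_range(lp), bit_by_mod(lp));
    return 0;
}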
On Thu, Aug 4, 2016 at 01:16:20AM +0100, Simon Riggs wrote:
>> Would you only add a LITE index entry when there isn't an existing index
>> entry for the same values and heap page? That seems quite complicated.
>
> The insertion algorithm is described. Doesn't seem complicated to me.

Ah, I see it now:

    As UPDATEs occur we request inserts into the index. If a lossy index
    pointer already exists that covers the new linepointer then we skip
    the index insert, optimizing writes to the index.

I thought you meant "already exists" just means it matches the existing range. I was unclear that it meant the entire page.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +
On Wed, Aug 3, 2016 at 08:34:02PM -0400, Bruce Momjian wrote:
> On Thu, Aug 4, 2016 at 01:16:20AM +0100, Simon Riggs wrote:
>>> Would you only add a LITE index entry when there isn't an existing index
>>> entry for the same values and heap page? That seems quite complicated.
>>
>> The insertion algorithm is described. Doesn't seem complicated to me.
>
> Ah, I see it now:
>
>     As UPDATEs occur we request inserts into the index. If a lossy index
>     pointer already exists that covers the new linepointer then we skip
>     the index insert, optimizing writes to the index.
>
> I thought you meant "already exists" just means it matches the existing
> range. I was unclear that it meant the entire page.

How do you plan to clear the bitmask so it, over time, doesn't end up being all-set? I can see how a non-LITE index tuple would become a LITE tuple when an update chain happens on the same page, but then new matching index values would also set the bitmap, update-chain or not. Then, when the update chain is gone and only non-update-chain tuples exist, you will need to remove the bitmap and create new index entries by scanning the heap page.

Also, what happens when you have two non-LITE index tuples in a heap page, and then you create an update chain in the heap page --- do you remove one of the two index entries and create a bitmap out of the remaining one? Do you put them back when the heap chain is gone?

Also, why not use this bitmap for all indexes, not just update chains? Because it would break index-only scans?

I think there are two use-cases for LITE --- first, it would shrink indexes for tables with many duplicate indexed values on the same page; second, in the case Simon mentioned, it would help with update chain churn, but the problem is it seems to cause a lot of overhead to create and remove the bitmap in the index. This is why I was thinking of per-update-chain LITE entries.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +
On Wed, Aug 3, 2016 at 4:37 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Wed, Aug 3, 2016 at 4:20 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> == IndexScan ==
>>
>> Note that the executor code for IndexScan appears identical between the two
>> optimizations. The difference between duplicate and range LITE tuples is
>> needed only at INSERT time (or UPDATE of an indexed column to a new value).
>>
>> When we do an IndexScan, if we see a LITE tuple, we do a scan of the
>> linepointer ranges covered by this index tuple (which might eventually go
>> to a full block scan). For example with bit1 set we would scan 16
>> linepointers (on an 8192 byte block) and with 2 bits set we would scan 32
>> linepointers etc., not necessarily consecutive ranges. So the IndexScan can
>> return potentially many heap rows per index entry, which need to be
>> re-checked and may also need to be sorted prior to returning them. If no
>> rows are returned we can kill the index pointer, just as we do now for
>> btrees, so they will be removed eventually from the index without the need
>> for VACUUM. A BitmapIndexScan that sees a lossy pointer can also use the
>> lossy TID concept, so we can have partially lossy bitmaps.
>
> Wouldn't this risk scanning rows more than once unless it's part of a
> bitmap scan?

I think an alternative that would not have such a problem, and would have a smaller impact both in performance and code, would be to just not add new index tuples when the update happens on the same page, easily checked by comparing the old tuple's t_ctid against the new's. Index scans would have to follow same-page update chains, and perhaps vacuum would need some fixing too, but that would probably be less of a performance hit than the proposed lossy item pointers.
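A minimal sketch of the same-page check described above, using made-up structures rather than the real heap tuple header; in PostgreSQL itself the block numbers would come from the tuples' item pointers via ItemPointerGetBlockNumber().

#include <stdbool.h>
#include <stdint.h>

/* Made-up stand-in for a tuple identifier: (heap block, line pointer). */
typedef struct TupleId
{
    uint32_t block;
    uint16_t lp;
} TupleId;

/*
 * The suggested alternative: when an UPDATE keeps the new tuple version on
 * the same heap page as the old one and the index's columns did not change,
 * skip creating a new index entry and rely on index scans following the
 * same-page update chain instead.
 */
static bool
can_skip_index_insert(TupleId old_tid, TupleId new_tid,
                      bool indexed_cols_changed)
{
    return !indexed_cols_changed && old_tid.block == new_tid.block;
}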
On 3 August 2016 at 20:37, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Wed, Aug 3, 2016 at 4:20 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> == IndexScan ==
>>
>> Note that the executor code for IndexScan appears identical between the two
>> optimizations. The difference between duplicate and range LITE tuples is
>> needed only at INSERT time (or UPDATE of an indexed column to a new value).
>>
>> When we do an IndexScan, if we see a LITE tuple, we do a scan of the
>> linepointer ranges covered by this index tuple (which might eventually go
>> to a full block scan). For example with bit1 set we would scan 16
>> linepointers (on an 8192 byte block) and with 2 bits set we would scan 32
>> linepointers etc., not necessarily consecutive ranges. So the IndexScan can
>> return potentially many heap rows per index entry, which need to be
>> re-checked and may also need to be sorted prior to returning them. If no
>> rows are returned we can kill the index pointer, just as we do now for
>> btrees, so they will be removed eventually from the index without the need
>> for VACUUM. A BitmapIndexScan that sees a lossy pointer can also use the
>> lossy TID concept, so we can have partially lossy bitmaps.
>
> Wouldn't this risk scanning rows more than once unless it's part of a
> bitmap scan?

It's always the job of the index insert logic to ensure a valid index exists.

I'm not sure what you see that changes that here. Say more?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 4 August 2016 at 02:13, Bruce Momjian <bruce@momjian.us> wrote:
> How do you plan to clear the bitmask so it, over time, doesn't end up being
> all-set?

I don't have a plan, though thoughts welcome.

It's a similar situation to the way our current indexes don't recover from bloat, a situation made worse by non-HOT updates.

So, we have a choice of which disadvantage to accept.

> Also, why not use this bitmap for all indexes, not just update chains?

I don't understand where you get this update chains thing from.

The bitmap can apply to multiple tuples on one page, which is described.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 4, 2016 at 06:03:34PM +0100, Simon Riggs wrote:
> On 4 August 2016 at 02:13, Bruce Momjian <bruce@momjian.us> wrote:
>> How do you plan to clear the bitmask so it, over time, doesn't end up being
>> all-set?
>
> I don't have a plan, though thoughts welcome.
>
> It's a similar situation to the way our current indexes don't recover from
> bloat, a situation made worse by non-HOT updates.
>
> So, we have a choice of which disadvantage to accept.
>
>> Also, why not use this bitmap for all indexes, not just update chains?
>
> I don't understand where you get this update chains thing from.
>
> The bitmap can apply to multiple tuples on one page, which is described.

I am asking if we could use this idea for all rows on a page, not just HOT-updated ones, e.g. we have five rows that have col1=4 in an index --- why not just use one index pointer for all of them, with a bitmap to point to the possible matches?

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +
On Thu, Aug 4, 2016 at 1:58 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 3 August 2016 at 20:37, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Wed, Aug 3, 2016 at 4:20 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> == IndexScan ==
>>>
>>> Note that the executor code for IndexScan appears identical between the
>>> two optimizations. The difference between duplicate and range LITE tuples
>>> is needed only at INSERT time (or UPDATE of an indexed column to a new
>>> value).
>>>
>>> When we do an IndexScan, if we see a LITE tuple, we do a scan of the
>>> linepointer ranges covered by this index tuple (which might eventually go
>>> to a full block scan). For example with bit1 set we would scan 16
>>> linepointers (on an 8192 byte block) and with 2 bits set we would scan 32
>>> linepointers etc., not necessarily consecutive ranges. So the IndexScan
>>> can return potentially many heap rows per index entry, which need to be
>>> re-checked and may also need to be sorted prior to returning them. If no
>>> rows are returned we can kill the index pointer, just as we do now for
>>> btrees, so they will be removed eventually from the index without the
>>> need for VACUUM. A BitmapIndexScan that sees a lossy pointer can also use
>>> the lossy TID concept, so we can have partially lossy bitmaps.
>>
>> Wouldn't this risk scanning rows more than once unless it's part of a
>> bitmap scan?
>
> It's always the job of the index insert logic to ensure a valid index
> exists.
>
> I'm not sure what you see that changes that here. Say more?

I just explained it further in the WARM thread, which is actually the idea I was arriving at.
On 4 August 2016 at 18:27, Bruce Momjian <bruce@momjian.us> wrote:
>>> Also, why not use this bitmap for all indexes, not just update chains?
>>
>> I don't understand where you get this update chains thing from.
>>
>> The bitmap can apply to multiple tuples on one page, which is described.
>
> I am asking if we could use this idea for all rows on a page, not just
> HOT-updated ones, e.g. we have five rows that have col1=4 in an index ---
> why not just use one index pointer for all of them, with a bitmap to point
> to the possible matches?

Yes. I described that...

"The same situation can occur if we have many INSERTs with the same values on the same block. This could lead to a reduction in size of the btree for indexes with many duplicate values."

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services