Frequent Update Project: Design Overview of HOT Updates - Mailing list pgsql-hackers
From | Simon Riggs |
---|---|
Subject | Frequent Update Project: Design Overview of HOT Updates |
Date | |
Msg-id | 1163092396.3634.461.camel@silverbirch.site Whole thread Raw |
Responses |
Re: Frequent Update Project: Design Overview of HOT Updates
Re: Frequent Update Project: Design Overview of HOT Updates Re: Frequent Update Project: Design Overview of HOT Updates |
List | pgsql-hackers |
Design Overview of HOT Updates ------------------------------ The objective is to increase the speed of the UPDATE case, while minimizing the overall negative effects of the UPDATE. We refer to the general requirement as *Frequent Update Optimization*, though this design proposal is for Heap Overflow Tuple (HOT) Updates. It is similar in some ways to the design for SITC already proposed, though has a number of additional features drawn from other designs to make it a practical and effective implementation. EnterpriseDB have a working, performant prototype of this design. There are still a number of issues to resolve and the intention is to follow an open community process to find the best way forward. All required detail will be provided for the work conducted so far. Current PGSQL behaviour is for UPDATEs to create a new tuple version within the heap, so acts from many perspectives as if it were an INSERT. All of the tuple versions are chained together, so that whichever of the tuples is visible to your Snapshot, you can walk the chain to find the most recent tuple version to update. The HOT proposal splits the heap into two areas: the main heap and an overflow relation, both of which are regular, transactional relations. INSERT and DELETE effect only the main heap exactly as they do now. UPDATE places new tuple versions into the overflow relation in most cases, maintaining all of the current capabilities of the MVCC model: all tuple versions remain accessible, plus the chain of updated tuple versions is maintained. UPDATEs do not insert into indexes at all - no index pointers exist at all into the overflow relation. So during a stream of UPDATEs the main heap and indexes stay the same size, while the indexes are used read-only, avoiding additional database writes and the need for eventual index rebuilding. The current tuple version within the main heap is referred to as the "root tuple" from now on. When reading the main heap, if the root tuple is not visible we "walk the chain" into the overflow relation until we find a tuple version that is visible - we follow the t_ctid pointer from tuple to tuple, testing for MVCC visibility as we go. As more UPDATEs take place these tuple chains would grow, making locating the latest tuple take progressively longer. Intuitively this sounds like a bad design, though an additional feature turns that from bad to good performance: - when anybody walks the chain into the overflow relation, if we find that the root tuple is vacuumable we copy back the first visible tuple version over the now-dead root tuple. This allows the length of a typical tuple chain to be extremely short in practice. For a single connection issuing a stream of UPDATEs the chain length will no more than 1 at any time. Even under heavy concurrent UPDATEs the modal chain length remains 1, with the chain length varying in approximately Poisson distribution. The overflow relation is similar in concept to a TOAST table, so we might describe this approach as TOAST-for-updated-versions. The code implementation is similar also, with reasonably similar modularity. HOT does sound radical, but no more so than TOAST was when first discussed. We can locate any tuple version and thereby allow MVCC to work correctly, while at the same time preserving the crucial property of the Postgres non-overwriting storage manager. This works very well for indexed tuple access since the index and main heap never grow, thus maintaining access speeds. HOT supports both Read Committed and Serializable transaction isolation levels, with transactions of any length, just as with current MVCC. Performance results against the previously described test cases are approximately: 1. pgbench (TPC-B) - Around 200-300% better 2. DBT-2 (TPC-C) - Signficicant improvement - possibly 10% improvement, somewhat difficult to measure exactly because of the way the test itself works. 3. truckin - Around 500% improvement The performance gain remains better than REL8_2 base even in the presence of longer running transactions. [There is also considerable synergy with changes to b-tree indexes that are both useful in their own right, and even better when HOT is added, more on that on a separate thread.] We expect that people will want to measure this for themselves, so we don't publish all of the detailed internal tests here since YMMV. Using HOT --------- HOT can only work in cases where a tuple does not modify one of the columns defined in an index on the table, and when we do not alter the row length of the tuple. [We'll be able to do that more efficiently when we have plan invalidation] If we perform an update that meets the HOT criteria then we put the new version into the overflow relation; we describe this as a HOT UPDATE. If we perform an update that does not meet the criteria, then we carry on with the existing/old MVCC behaviour; we describe this as a non-HOT UPDATE. So with HOT we must support both HOT and non-HOT UPDATEs. Each tuple whether it is in the main heap or the overflow relation can point to another tuple with the t_ctid, which is extended by using a previously unused bit to flag whether the destination is a block from the main heap or the overflow relation. Just to reiterate, since a HOT update doesn't touch index values that means that all members of a tuple chain have identical primary keys and index values. So we only ever need to grovel through the overflow relation *if* we have already met the index search criteria - so we don't have to re-check the index criteria for each tuple version we touch. All normal updates of a table will be read-only on the index, plus writes to heap/overflow. Simple Example of Tuple Chaining -------------------------------- Lets look at a simple case: 1. INSERT INTO table (1,....) Tuple is inserted into main heap (Version1 or V1) Heap Page P1 Overflow Relation ---------------- ----------------- | | | | | ______ | | | | | V1 | | | | | |______| | | | | | | | |--------------| |---------------| 2. UPDATE table SET nonindexcol = 6 WHERE pkcol = 1; Same length tuple, so V2 written using overflow Heap Page P1 Overflow Relation ---------------- ----------------- | | | ______ | | ______ | |------>| V2 | | | | V1 |------| | |______| | | |______| | | | | | | | |--------------| |---------------| 3. UPDATE table SET nonindexcol = 7 WHERE pkcol = 1; Same length tuple, so V3 written using overflow We update v2.t_ctid to maintain the forward tuple chain Heap Page P1 Overflow Relation ---------------- ----------------- | | | ______ | | ______ | |------>| V2 | | | | V1 |------| | |______|--| | | |______| | | ______ | | | | | | V3 |<-| | | | | |______| | |--------------| |---------------| Simple Example of Copying-back ------------------------------ 4. UPDATE table SET nonindexcol = 8 WHERE pkcol = 1; Same length tuple, so V4 will be written using overflow. While selecting for UPDATE, we look at V1, we see it is now vacuumable, so we first copy back the first visible tuple over V1 before proceeding. [Assume V1 is dead, V2 is still visible] Heap Page P1 Overflow Relation ---------------- ----------------- | | | ______ | | ______ | | | V2* | | | | V2 |------| | |______| | | |______| | | | ______ | | | |------>| V3 | | | | | |______| | |--------------| |---------------| The copy back operation leaves V2* as the now-unwritable old copy of V2. We update it also to break the chain to V3. Now orphaned, it can be removed later. 5. Now we continue with the UPDATE as planned Same length tuple, so V4 will be written using overflow. We update v3.t_ctid to maintain the forward tuple chain Heap Page P1 Overflow Relation ---------------- ----------------- | | | ______ | | ______ | | | V2* | | | | V2 |------| | |______| | | |______| | | | ______ | | | |------>| V3 | | | | | |______|--| | | | | ______ | | | | | | V4 |<-| | | | | |______| | |--------------| |---------------| Copying Back ------------ The copy back operation is required to keep the tuple chains short and efficient. When we copy back we leave a tuple version and possibly a long part of the tuple chain orphaned. These are then marked for removal by the next VACUUM. To support copying back, we add an additional header fields onto every tuple in the overflow relation (only). We generate WAL appropriately for this action. VACUUMing becomes significantly easier with HOT than with non-HOT UPDATEs. Specifically, since there are no indexes on the overflow relation we should be able to VACUUM it more easily and possibly piece-by-piece (i.e. "retail vacuum"). Behavioural Characteristics --------------------------- With HOT, it is easily possible that the chain of prior versions spans many blocks. The chain always starts with the block of the root tuple but possibly includes many overflow relation blocks. A SeqScan of a HOT table will turn into a large number of random accesses to the overflow relation, which could be considerably more expensive than sequential I/O. Clearly very large tables would not be able to be HOT enabled without additional cost, so we make HOT a table-level option: WITH (freqUPDATE = on) [discuss...] This means that a SeqScan of a heap with heavy concurrent update activity *could* be fairly costly. Most times it won't be, because the appropriate overflow relation blocks will most likely be in-memory. That is the price we pay for having this optimization - a suitable warning would apply, though this would not be a surprise to anybody since Oracle people know that reading heavily-updated data takes more CPU and DB2/SQLServer know that it takes ages to get around each of the row level locks. So this isn't bad *at all* in comparison. We hope/expect/recommend HOT to be used in conjunction with UPDATE RETURNING, so that fewer SELECTs and non-index operations will be used. [There's an extra prize if you can think of a *correct* way of SeqScanning with HOT more efficiently; we've not focused on that yet.] In the presence of a long running transaction, the overflow relation could become very large since VACUUM becomes less effective. (This same situation occurs with current MVCC). The overflow relation will then start to page out to disk and could grow large. Oracle has this problem also and this is why Oracle 10g has moved away from fixed-size Rollback Segments to the use of an overflow Tablespace. Some size controls are available to limit the growth there. That would be a useful optimization for HOT also and one much easier than any such enhancement of existing MVCC. Next Steps ---------- Deeper technical details will be published shortly; the design goes much deeper than the discussion here, but we wanted to present the overall summary first. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
pgsql-hackers by date: