Nested transactions: low level stuff - Mailing list pgsql-hackers
From | Manfred Koizar |
---|---|
Subject | Nested transactions: low level stuff |
Date | |
Msg-id | 3u5g7v4qe59jj31vc78fbm120ak0flnm2l@4ax.com Whole thread Raw |
Responses |
Re: Nested transactions: low level stuff
|
List | pgsql-hackers |
Here is, what I've put together from various messages posted November/December last year. . Subtrans trees. Transaction states. Tuple visibility. HeapTupleSatifiesUpdate. Shortcuts. Still missing. Objections andsuggestions Subtrans trees -------------- A subtrans tree holds information about sub-transaction to parent transaction relationships. The nodes are transaction ids, edges denote the fact that the child node represents a sub-transaction of the transaction represented by the parent node. The root of a subtrans tree represents a main-transaction. A subtrans tree is navigable from child to parent. I.e. if you have a sub-transaction xid you can find its parent efficiently, but there is no fast and easy way to enumerate all children of a given parent xid. The subtrans dependencies have to be visible to all backends. At least two possible implementations have been discussed. Together with clog: Store a 30 bit parent transaction offset with the two state bits. Offset 0 means the transaction is a main-transaction. In a separate data structure: pg_subtrans is a huge (possibly sparsely populated) array of parent xids, indexed by child xid. This array is divided into manageable chunks, represented by files, "pg_subtrans_NNNN". These files are only created when necessary. At any time only a tiny part of the whole array is kept in shared buffers. This concept is similar or almost equal to pg_clog, which is an array of doublebits. Most of its implementation can be stolen from the clog code. We believe that parent xid lookups are far less frequent than transaction state lookups and therefor favour the second implementation. Transaction states ------------------ Let the currently unused fourth state in pg_clog indicate a committed subtransaction. In pg_clog there are two bits per transaction, commit and abort, with the following meaning: a c0 0 transaction in progress, the owning backend knows whether it is a main- or a sub-transaction, other backendsdon't care1 0 aborted, nobody cares whether main- or sub-transaction0 1 committed main-transaction or - withshortcut 2 - a sub- transaction that's known committed to all active transactions1 1 committed sub-transaction,have to look for parent in pg_subtrans Updating the state of a single (main or sub) transaction is atomic, just like it is now. This solves the atomicity problem without requiring long locks. Here is what is to be done for some operations: BEGIN main transaction: . Get a new xid (no change to current behaviour). . pg_clog[xid] is still 00, meaning active. . pg_subtrans[xid]is still 0, meaning no parent. BEGIN subtransaction: . Push current transaction info onto local stack. . Get a new xid. . Record parent xid in pg_subtrans[xid].. pg_clog[xid] is still 00. ROLLBACK subtransaction: . Set pg_clog[xid] to 10 (aborted). . Optionally set clog bits for subsubtransactions to 10 (*).. Pop transaction info from stack. COMMIT subtransaction: . Set pg_clog[xid] to 11 (committed subtrans). . Don't touch clog bits for subsubtransactions! . Poptransaction info from stack. ROLLBACK main transaction: . Set pg_clog[xid] to 10 (aborted). . Optionally set clog bits for subtransactions to 10 (*). COMMIT main transaction: . Set pg_clog[xid] to 01 (committed). . Don't touch clog bits for subtransactions! (*) This can only be done if there is a transaction local data structure holding information about all subtransactions or we have a subtrans tree implementation that is navigable from parent to child. Anyway, whether this is done or not does not influence the consistency of this proposal. Tuple visibility ---------------- Visibility check by other transactions: If a tuple is visited and its XMIN/XMAX_IS_COMMITTED/ABORTED flags are not yet set, pg_clog has to be consulted to find out the status of the inserting/deleting transaction xid. If pg_clog[xid] is ... 00: transaction still active 10: aborted 01: committed 11: committed subtransaction, have to check parent Only in this last case do we have to get parentxid from pg_subtrans. Now we look at pg_clog[parentxid]. If we find ... 00: parent still active, so xid is considered active, too 10: parent aborted, so xid is considered aborted 01: parent committed, so xid is considered committed 11: recursively check grandparent(s) ... If a visibility check sees a transaction as committed, it checks whether the main-transaction xid is visible in the current snapshot. Fortunately we get the main-transaction xid as a by-product of looking up the parentxid states. If we set XMIN/MAX_IS_COMMITTED in a tuple header, we have to replace a sub-transaction xid in xmin or xmax respectively with the main-transaction xid at the same time. Otherwise we'd have to look for the main xid, whenever a tuple is touched. XXX "at the same time" may need a bit more thought regarding locking. HeapTupleSatifiesUpdate ----------------------- If an UPDATE is about to change a tuple touched by another active transaction, it waits for the other transaction to commit or abort. We must always wait for the main transaction, not the subtrans. Shortcuts --------- Under certain conditions the 1/1 state can be replaced with 0/1 or 1/0. This could save a lot of parent lookups. Shortcut 1: As soon as a transaction is rolled back all its subtransactions are considered aborted, too. So their pg_clog bits can be set to 10 and there is no need to look up their parent any more. This can be done . immediately on ROLLBACK (if the implementation has a list of committed subtransactions handy),. as a side effect of a later visibility check, . by VACUUM. Shortcut 2: If a committed subtransaction has only committed (grand)parents up to the committed main transaction and the main transaction is visible to all running transactions, i.e. the main transaction's xid is before RecentGlobalXmin (*), then the subtransaction's pg_clog bits can be set to 01 and further parent xid lookups are not needed any more. This can be done . as a side effect of a later visibility check, . by VACUUM. (*) AFAICS the statements "main transaction's xid is before RecentGlobalXmin" and "subtransaction's xid is before RecentGlobalXmin" are equivalent. The point here is: pg_subtrans is only queried, if we find 11 in pg_clog; we try to replace 11 a soon as possible, but only if it is safe to do so. Status 11 is short-lived, it is replaced with the final status by one or more of - ROLLBACK of the parent transaction- a later visibility check (as a side effect)- VACUUM pg_subtrans cleanup: A pg_subtrans_NNNN file covers a known range of transaction ids. As soon as none of these transactions has a pg_clog status of 11, the pg_subtrans_NNNN file can be removed. VACUUM can do this, and it won't even have to check the heap. Still missing ------------- . Visibility checks for tuples inserted/deleted by a (sub)transaction belonging to the current transaction tree (have to check local transaction stack whenever we look at a xid or switch to a parent xid) . WAL Objections and suggestions -------------------------- Tom: >Also, using a 11 state doubles the amount of pg_clog I/O needed to >commit a collection of subtransactions. You have to write 11 as the >state of each commitable subtransaction, then commit the parent (write >01 as its state), then go back and change the state of each >subtransaction to 01. (Whether this last bit is done as part of parent >transaction commit, or during later inspections of the state of the >subtransaction, doesn't change the argument.) Yes, there may be more pg_clog I/O. But I don't think that it is doubled. I'd expect some locality; after all one page holds the status bits for 32K transactions. Tom: >I think it would be preferable to use only three states: active, >aborted, committed. The parent commit protocol is (1) write 10 as state >of each aborted subtransaction (this should be done as soon as the >subtransaction is known aborted, rather than delaying to parent commit); >(2) write 01 as state of parent (this is the atomic commit); (3) write >01 as state of each committed subtransaction. Readers who see 00 must >check the parent state; if the parent is committed then they have to go >back and recheck the child state (to see if it became "aborted" after >they looked). This halves the write traffic during a commit, at the >cost of additional read traffic when subtransaction state is checked in >a narrow window after the time of parent transaction commit. I believe >it nets out to be faster. Pro: Saves the fourth state for something different. Con: Does a pg_subtrans lookup even for main transactions (at least until xid < RecentGlobalXmin) Con: Needs subtrans tree navigation from parent to child. It's noticeably past midnight as I write this ... will reconsider that approach after I had same sleep. Tom: >But wait a moment. The child xid is by definition always greater than >(newer than) its parent. So if we consult pg_clog and find the >transaction marked committed, *and* the xid is before the window of XIDs >in our snapshot, then even if it's not a top-level xid, the parent must >be before our window too. Therefore we can conclude the transaction is >visible in our snapshot. So indeed there is a good-size range of xids >for which we'll never need to chase the parent link: everything before >the RecentGlobalXmin computed by GetSnapshotData. I think this has been integrated into this proposal (cf. shortcut 2). > (We do have to set >subtransactions to committed during parent commit to make this true; >we can't update them lazily. But I think that's okay.) This is not clear to me ... I'll try again later ... ------------------------------ That's all for now, folks. Fell free to comment. Sorry for the long post. Would you prefer such kind of stuff on a web page and just a short note with the URL to the list? ServusManfred
pgsql-hackers by date: