Proposal for CSN based snapshots - Mailing list pgsql-hackers
From: Ants Aasma
Subject: Proposal for CSN based snapshots
Date:
Msg-id: CA+CSw_tEpJ=md1zgxPkjH6CWDnTDft4gBi=+P9SnoC+Wy3pKdA@mail.gmail.com
Responses: Re: Proposal for CSN based snapshots
           Re: Proposal for CSN based snapshots
           Re: Proposal for CSN based snapshots
List: pgsql-hackers
Given the recent ideas being thrown around about changing how freezing and the clog are handled and about MVCC catalog access, I thought I would write out the ideas that I have had about speeding up snapshots, in case there is an interesting tie-in with the current discussions.

To refresh your memory, the basic idea is to change visibility determination to be based on a commit sequence number (CSN for short) - an 8-byte number incremented on every commit, representing the total ordering of commits. To take a snapshot in this scheme you only need to know the value of the last assigned CSN; all transactions with a CSN less than or equal to that value were committed at the time of the snapshot, everything above was not. Besides speeding up snapshot taking, this scheme can also be a building block for consistent snapshots in a multi-master environment with minimal communication. Google's Spanner database uses snapshots based on a similar scheme.

The main tricky part about this scheme is finding the CSN that was assigned to each XID in the face of arbitrarily long transactions and snapshots, using only a bounded amount of shared memory. The secondary tricky part is doing this in a way that doesn't need locks for visibility determination, as that would kill any hope of a performance gain. We need to keep around CSN slots for all currently running transactions and for transactions that are concurrent with any active CSN based snapshot (xid.csn > min(snapshot.csn)). To do this I propose the following data structures for the XID-to-CSN mapping.

For the most recently assigned XIDs there is a ring buffer of slots that contain the CSN values of the XIDs, or special CSN values for transactions that haven't completed yet, aborted transactions, or subtransactions. I call this the dense CSN map. Looking up the CSN of an XID in the ring buffer is trivial direct indexing.

For long running transactions the ring buffer may do a full circle before the transaction commits. Such CSN slots, along with slots that are still needed for older snapshots, are evicted from the dense buffer into a sorted array of XID-CSN pairs, the sparse mapping. For locking purposes there are two sparse buffers, one active and one inactive; more on that later. Looking up the CSN of an XID that has been evicted into the sparse mapping is a matter of performing a binary search to find the slot and reading the CSN value.

Because snapshots can remain active for an unbounded amount of time and there can be an unbounded number of active snapshots, even the sparse mapping can fill up. To handle that case, each backend advertises its lowest snapshot number, csn_min. When values need to be evicted from the sparse mapping, they are evicted in CSN order and written into the CSN log - a series of CSN-XID pairs. Backends that may still be concerned about those values are then notified that values they might need have been evicted. Backends with invalidated snapshots can then convert their snapshots to regular list-of-concurrent-XIDs snapshots at their leisure. To convert a CSN based snapshot to an XID based one, a backend first scans the shared memory structures for XIDs up to snapshot.xmax whose CSNs are concurrent with the snapshot and inserts those XIDs into the snapshot, then reads the CSN log starting from the snapshot's CSN, looking for XIDs less than the snapshot's xmax. After this the snapshot can be handled like current snapshots are handled.
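To make the shape of these structures a bit more concrete, here is a rough sketch of how the mapping could be laid out in shared memory. All of the type names and special values below are just illustrative placeholders, nothing here is meant as final code:

    /* Sketch only: hypothetical layout of the XID-to-CSN maps. */
    #include <stdint.h>

    typedef uint32_t TransactionId;
    typedef uint64_t CommitSeqNo;

    /* Special slot values; real commit CSNs are assumed to start above these. */
    enum
    {
        CSN_ABORTED = 1,            /* transaction rolled back */
        CSN_SUBCOMMITTED = 2,       /* subtransaction of a committing parent */
        CSN_OVERFLOWED_SUBXID = 3   /* subxid beyond the per-backend cache */
    };
    /* Uncommitted slots carry their own XID so commit can stamp them with CAS. */
    #define UNCOMMITTED_CSN(xid) ((CommitSeqNo) (xid))

    /* Dense map: ring buffer indexed directly by XID modulo its size. */
    typedef struct DenseCSNMap
    {
        uint32_t      nslots;           /* e.g. 16 slots per backend */
        TransactionId eviction_horizon; /* oldest XID still covered by the ring */
        CommitSeqNo  *slots;            /* slots[xid % nslots] */
    } DenseCSNMap;

    /* Sparse map: sorted XID/CSN pairs evicted from the dense map but still
     * needed by long transactions or by not-yet-converted snapshots.  Two
     * buffers exist; a version counter picks the active one so readers can
     * use optimistic locking. */
    typedef struct SparseCSNEntry
    {
        TransactionId xid;
        CommitSeqNo   csn;
    } SparseCSNEntry;

    typedef struct SparseCSNMap
    {
        uint64_t        version;          /* bumped on swap; low bit = active buffer */
        CommitSeqNo     eviction_horizon; /* snapshots older than this must convert */
        uint32_t        nentries[2];
        SparseCSNEntry *entries[2];
    } SparseCSNMap;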
A more detailed view of the synchronization primitives required for common operations follows.

Taking a new snapshot
---------------------

Taking a CSN based snapshot under this scheme consists of reading xmin, csn and xmax from global variables, unlocked and in that order, with read barriers between the loads. If this is our oldest snapshot we write our csn_min into pgproc, do a full memory barrier and check from a global variable that the CSN we used is still guaranteed not to have been evicted (failing that check is exceedingly unlikely but cannot be ruled out). The read barrier between reading xmin and csn is needed so that the guarantee holds that no transaction with tx.xid < ss.xmin could have committed with tx.csn >= ss.csn, so xmin can be used to safely skip CSN lookups. The read barrier between reading csn and xmax is needed to guarantee that if tx.xid >= ss.xmax, then tx.csn >= ss.csn. On the write side, there needs to be at least one full memory barrier between GetNewTransactionId updating nextXid and CommitTransaction updating nextCsn, which is quite easily satisfied. Updating the global xmin without races is slightly trickier but doable.

Checking visibility
-------------------

XidInMVCCSnapshot will need to switch on the type of snapshot used after checking xmin and xmax. For list-of-XIDs snapshots the current logic applies. CSN based snapshots first do an unlocked read of a backend-specific global variable to check whether we have been instructed to convert our snapshots to XID based ones; if so, the snapshot is converted (more on that separately). Otherwise we look up the CSN of the XID.

To map an XID to a CSN, the value of the CSN slot corresponding to the XID is read from the dense map (just a plain denseCSNMap[xid % denseMapSize]); then, after a read barrier, the dense map eviction horizon is checked to verify that the value read was in fact still valid. If it was, it can be compared to the snapshot's CSN to get the result; if not, we continue to the sparse map. To check the sparse map, we read the sparse map version counter and use it to determine which of the two maps is currently active (an optimistic locking scheme), then use linear or binary search to look up the slot for the XID. If the XID is not found, we know that it wasn't committed after the snapshot was taken, and we have to check the clog to see whether it committed at all. Otherwise we compare the stored value against our snapshot's CSN to determine visibility. Finally, after another read barrier, we check the sparse map version counter to see whether the result is still valid. Checks against the dense map can omit clog lookups because we use special values to signify aborted and uncommitted transactions. We can even defer clog updates until the corresponding slots are evicted from the dense map.
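Roughly, the lookup path could look like the sketch below, reusing the structure sketch from above. pg_read_barrier() is the existing barrier primitive; XidToCSN is a made-up name, the version counter read assumes 64-bit atomic loads, and the XID comparison would of course have to be wraparound-aware in a real patch:

    /* Sketch only: map an XID to a CSN for visibility checking.  A result of
     * 0 here means "not found, ask the clog as we do today". */
    CommitSeqNo
    XidToCSN(TransactionId xid, DenseCSNMap *dense, SparseCSNMap *sparse)
    {
        CommitSeqNo csn = dense->slots[xid % dense->nslots];

        pg_read_barrier();
        if (xid >= dense->eviction_horizon)   /* wraparound-aware in reality */
            return csn;        /* slot still valid; may be a special value */

        /* Evicted from the dense map: retry the sparse lookup until we read
         * a consistent view, since an eviction pass may swap the buffers. */
        for (;;)
        {
            uint64_t ver = sparse->version;    /* 64-bit atomic read assumed */
            int      active = (int) (ver & 1);

            pg_read_barrier();
            csn = 0;
            for (uint32_t i = 0; i < sparse->nentries[active]; i++)
            {
                /* Linear scan for brevity; a real patch would binary search. */
                if (sparse->entries[active][i].xid == xid)
                {
                    csn = sparse->entries[active][i].csn;
                    break;
                }
            }

            pg_read_barrier();
            if (sparse->version == ver)
                return csn;
        }
    }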
Assigning XIDs and evicting from CSN buffers
--------------------------------------------

XID assignment will need an additional step to allocate a CSN slot for the transaction. If the dense map ring buffer has filled up, this might require evicting some entries from the dense CSN map. For less contention and better efficiency it would be a good idea to evict larger blocks at a time; one eighth or one sixteenth of the dense map per eviction might be a good balance here. Eviction is done under a new CSNEvictionLock.

First we publish the point up to which we are evicting from the dense map, to notify committing backends of the hazard. We then read the current nextCSN value and publish it as the largest possible global_csn_min value we can arrive at, so that a backend in the middle of taking a snapshot that has fetched a CSN but hasn't yet updated its procarray entry will notice that it is at risk and will retry. Using nextCSN is very conservative, but as far as I can see the invalidations will be rare and cheap. We have to issue a full memory barrier here so that either the snapshot taker sees our value or we see its csn_min.

If there is enough space, we just use global_csn_min as the sparse map eviction horizon. If there is not enough free space in the currently active sparse map to guarantee that the dense map block will fit, we scan both the active sparse array and the to-be-evicted block, collecting the missing space's worth of xid-csn pairs with the smallest CSN values into a heap, shrinking the heap for every xid/csn pair we can omit because the transaction was aborted or is visible to all active snapshots. When we finish scanning and the heap isn't empty, the largest value in the heap is the sparse map eviction horizon.

After arriving at the correct sparse map eviction horizon, we iterate through the sparse map and the dense map block to be evicted, copying all active or not-all-visible CSN slots to the inactive sparse map. We also update the clog for every committed transaction that we found in the dense map. If we decided to evict some values, we write them to the CSN log here and update the sparse map eviction horizon with the CSN we arrived at. At this point the state in the currently active sparse map and in the evictable dense map block is duplicated into the inactive sparse map and the CSN log. We now need to make the new state visible to visibility checkers in a safe order, issuing a write barrier before each step so the previous changes are visible:

* Notify all backends that have csn_min <= sparse map eviction horizon that their snapshots are invalid, and at which CSN log location they can start reading to find their concurrent xids.
* Increment the sparse map version counter to switch sparse maps.
* Raise the dense map eviction horizon to free up the dense map block.
* Overwrite the dense map block with special UncommittedCSN values tagged with their XID (more on that later).

At this point we can consider the block cleared up for further use.

Because we don't want to lock shared structures for snapshotting, we need to maintain a global xmin value. To do this we acquire the spinlock on the global xmin value and check if it is empty; if no other transaction is running, we replace it with our xid. At this point we know the minimum CSN of any unconverted snapshots, so it is also a good time to clean up unnecessary CSN log. Finally we are done, and we can release CSNEvictionLock and XIDGenLock.
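The publish step at the end of an eviction pass could then be ordered roughly as in the sketch below; pg_write_barrier() is the existing primitive, while NotifyInvalidatedSnapshots and the parameter names are made up for illustration:

    /* Sketch only: make the results of an eviction pass visible in a safe order. */
    static void
    PublishEviction(DenseCSNMap *dense, SparseCSNMap *sparse,
                    CommitSeqNo sparse_horizon, TransactionId new_dense_horizon,
                    TransactionId block_first_xid, uint32_t block_len)
    {
        /* Copies into the inactive sparse map and the CSN log happened earlier;
         * make sure they are visible before anything below. */
        pg_write_barrier();

        /* 1. Backends with csn_min <= sparse_horizon must convert their
         *    snapshots; tell them where in the CSN log to start reading. */
        NotifyInvalidatedSnapshots(sparse_horizon);

        pg_write_barrier();
        /* 2. Switch the active sparse map. */
        sparse->version++;

        pg_write_barrier();
        /* 3. Free the dense block by raising the dense eviction horizon. */
        dense->eviction_horizon = new_dense_horizon;

        pg_write_barrier();
        /* 4. Reset the freed slots, tagging each with the XID that will next
         *    map to it, so that a late stamper of an evicted XID fails its CAS
         *    instead of overwriting someone else's slot.  (XID wraparound is
         *    ignored here for brevity.) */
        for (uint32_t i = 0; i < block_len; i++)
        {
            TransactionId next_xid = block_first_xid + dense->nslots + i;

            dense->slots[next_xid % dense->nslots] = UNCOMMITTED_CSN(next_xid);
        }
    }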
Committing a transaction
------------------------

Most of the commit sequence remains the same, but where we currently do ProcArrayEndTransaction, we will acquire a new LWLock protecting the nextCSN value; I'll call it CSNGenLock for now. We first read the dense array eviction horizon, then stamp all of our subtransaction slots that are non-overflowed and still in the dense map with the special SubcommittedCSN value, then stamp the top level xid with nextCSN. We then issue a full memory barrier and check the soon-to-be-evicted pointer into the dense map. If it overlaps with any of the XIDs we just tagged, the backend doing the eviction might have missed our update, and we have to wait for CSNEvictionLock to become free and restamp the relevant XIDs in the sparse map.

To stamp an XID with the commit CSN, we compare the XID to the dense map eviction horizon. If the XID still maps to the dense array, we use CAS to swap out UncommittedCSN(xid) for the value we need. If the CAS fails, then between reading the dense array horizon and the actual stamping an eviction process replaced our slot with a newer one. If we hadn't tagged the slots with the XID value, we might accidentally stamp another transaction's slot. If the XID maps to the sparse array, we have to take CSNEvictionLock so the sparse array doesn't get swapped out underneath us, then look up the slot, stamp it and update the clog before releasing the lock. Lock contention shouldn't be a problem here, as only long running transactions map to the sparse array.

When done with the stamping we check the global xmin value; if it is our xid, we grab the spinlock, scan through the procarray for the next xmin value, update it and release the lock.
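The stamping itself might look roughly like the sketch below, where pg_atomic_cas_u64 stands for whichever 64-bit CAS primitive the platform provides (see the atomic primitives section below), and sparse_lookup and SetClogCommitted are stand-ins for the sparse array search and the clog update:

    /* Sketch only: stamp a transaction's CSN slot at commit time. */
    static void
    SetXidCSN(TransactionId xid, CommitSeqNo csn,
              DenseCSNMap *dense, SparseCSNMap *sparse)
    {
        if (xid >= dense->eviction_horizon)
        {
            /* Still in the dense ring: CAS from the tagged uncommitted value.
             * If the CAS fails, an eviction pass moved the XID to the sparse
             * map after we read the horizon, so fall through to that path. */
            CommitSeqNo *slot = &dense->slots[xid % dense->nslots];

            if (pg_atomic_cas_u64(slot, UNCOMMITTED_CSN(xid), csn))
                return;
        }

        /* Evicted XID: hold CSNEvictionLock so the sparse array cannot be
         * swapped out underneath us, stamp the slot and update the clog
         * before releasing the lock. */
        LWLockAcquire(CSNEvictionLock, LW_EXCLUSIVE);
        sparse_lookup(sparse, xid)->csn = csn;
        SetClogCommitted(xid);
        LWLockRelease(CSNEvictionLock);
    }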
Rolling back a transaction
--------------------------

Rolling back a transaction is basically the same as committing, but the CSN slots are stamped with AbortedCSN.

Subtransactions
---------------

Because of the limited size of the sparse array, we cannot keep open CSN slots there for a potentially unbounded number of subtransactions. I propose something similar to what is currently done with PGPROC_MAX_CACHED_SUBXIDS. When we assign xids to subtransactions above this limit, we tag them in the dense array with a special OverflowedSubxidCSN value. When evicting subtransactions from the dense array, non-overflowed subtransaction slots are handled like regular slots; overflowed slots are simply discarded. We also keep track of the lowest and the highest overflowed subxid; the lowest overflowed subxid is reset when, before an eviction, the highest overflowed subxid is lower than the smallest xid in the sparse array (i.e. we know that the XID region covered by the sparse array doesn't contain any overflowed subxids). When constructing a regular snapshot we can then detect that we don't have full information about subtransactions and correctly set the overflowed flag on the snapshot. Similarly, visibility checks can omit subxid lookups for xids missing from the sparse array when they know that the xid can't be an overflowed subxid.

Prepared transactions
---------------------

Prepared transactions are handled basically like regular transactions. When starting up with prepared transactions, they are inserted into the sparse array; when they commit they get stamped with CSNs and become visible as usual. We just need to account for them when sizing the sparse array.

Crash safety and checkpoints
----------------------------

Because clog updates are delayed for transactions in the dense map, checkpoints need to flush the dense array before writing out the clog. Otherwise the data structures are fully transient and don't need any special treatment.

Hot standby
-----------

I haven't yet worked out how CSN based snapshots best integrate with hot standby. As far as I can tell, we could just use the current KnownAssignedXidsGetAndSetXmin mechanism and get regular snapshots. But I think there is an opportunity here to get rid of most or all of the KnownAssignedXids mechanism if we WAL log the CSNs that are assigned to commits (or use a side channel as proposed by Robert). The extra write is not great, but it might not be a large problem after the WAL insertion scaling work is done. Another option would be to use the LSN of the commit record as the next CSN value. The locking in that case requires some further thought to guarantee that commits are stamped in the same order as their WAL records are inserted, without holding WALInsertLock for too long. That seems doable by inserting committing backends into a queue under WALInsertLock and then having them wait for the transaction in front of them to commit once WALInsertLock has been released.

Serializable transactions
-------------------------

I won't pretend to be familiar with the SSI code, but as far as I can tell serializable transactions don't need any modifications to work with the CSN based snapshot scheme. There actually already is a commit sequence number in the SSI code that could be replaced by the actual CSN. IIRC, one of the problems with serializable transactions on hot standby was that transaction visibility order on the standby is different from the master; if we use CSNs for visibility on the standby then we can actually provide an identical visibility order.

Required atomic primitives
--------------------------

Besides the copious number of memory barriers required for correctness, we will need the following lockless primitives:

* 64bit atomic read
* 64bit atomic write
* 64bit CAS

Are there any supported platforms where it would be impossible to have those? AFAIK everything from 32bit x86 through POWER and MIPS to ARM can do it. If there are any platforms that can't handle 64bit atomics, would it be ok to run on them with reduced concurrency/performance?

Sizing the CSN maps
-------------------

The CSN maps need to be sized to accommodate the number of backends. The dense array should be sized so that most xids commit before being evicted from the dense map; the sparse array will then contain the slots necessary for long running transactions or for long snapshots not yet converted to XID based ones. I did a few quick simulations to measure the dynamics. If we assume a power law distribution of transaction lengths, with snapshots held for the full length of transactions and no think time, then 16 slots per backend is enough to make the probability of eviction before commit less than 1e-6, and the chance of a slot still being needed at eviction because of a snapshot about 1:10000. In the real world very long transactions are more likely than the model predicts, but at least the normal variation is mostly buffered by this size. 16 slots = 128 bytes per backend ends up at a 12.5KB buffer for the default of 100 backends, or 125KB for 1000 backends.

The sparse buffer needs to be at least big enough to fit CSN slots for the xids of all active transactions and non-overflowed subtransactions. At the current PGPROC_MAX_CACHED_SUBXIDS = 64, the minimum comes out at 16 bytes * (64 + 1) slots * 100 backends = 101.6KB per buffer, or 203KB total in the default configuration.
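Restating that sizing arithmetic as constants, just to make the numbers easy to check (the names are made up; only the figures come from the text above):

    /* Sketch only: shared memory sizing for the default of 100 backends. */
    #define MAX_BACKENDS               100
    #define DENSE_SLOTS_PER_BACKEND    16
    #define CSN_SLOT_BYTES             8    /* one 64-bit CSN per slot */
    #define SPARSE_ENTRY_BYTES         16   /* xid + csn pair */
    #define PGPROC_MAX_CACHED_SUBXIDS  64

    /* Dense map: 16 * 8 = 128 bytes per backend, 12800 bytes (~12.5KB) total. */
    #define DENSE_MAP_BYTES \
        (MAX_BACKENDS * DENSE_SLOTS_PER_BACKEND * CSN_SLOT_BYTES)

    /* Sparse map: (64 + 1) slots per backend and two buffers,
     * 16 * 65 * 100 = 104000 bytes (~101.6KB) per buffer, ~203KB total. */
    #define SPARSE_BUFFER_BYTES \
        (MAX_BACKENDS * (PGPROC_MAX_CACHED_SUBXIDS + 1) * SPARSE_ENTRY_BYTES)
    #define SPARSE_TOTAL_BYTES  (2 * SPARSE_BUFFER_BYTES)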
Performance discussion
----------------------

Taking a snapshot is extremely cheap in this scheme; I think the cost will mostly be in publishing the csn_min and rechecking its validity afterwards. Taking snapshots in the shadow of another snapshot (often the case for the proposed MVCC catalog access) will be even cheaper, as we don't have to advertise the new snapshot. The delayed XID based snapshot construction should be rare, and even when it does happen the costs are similar to GetSnapshotData, except that we don't need to hold a lock. Visibility checking will also be simpler: for the cases where the snapshot is covered by the dense array it only requires two memory lookups and comparisons.

The main extra work for writing transactions is the eviction process. The amortized worst case extra work per xid is dominated by copying the sparse buffer back and forth and spooling out the CSN log. We need to write out 16 bytes per xid and copy (sparse buffer size / eviction block size) sparse buffer slots. If we evict 1/8th of the dense map at each eviction, it works out to about 520 bytes copied per xid assigned, which is in the same ballpark as GetSnapshotData is now.

With the described scheme, long snapshots will cause the sparse buffer to quickly fill up and then spool out to the CSN log until the backend wakes up, converts its snapshots and releases the eviction process from writing the log. It would be more efficient to be slightly more proactive and tell backends to convert their snapshots earlier, so that if they manage to be timely with the conversion we can avoid writing any CSN log at all.

I'm not particularly pleased that both xid assignment and committing can block behind the eviction lock. On the other hand, plugging in some numbers, with 100 backends doing 100k xid assignments/s the lock will be acquired 1000 times per second, for less than 100us at a time. The contention might not be bad enough to warrant extra complexity to deal with it; if it does turn out to be a problem, I have some ideas for how to cope with it. Having to do CSN log writes while holding an LWLock might not be the best of ideas; to combat that we can either add one more buffer so that the actual write syscall can be done after CSNEvictionLock is released, or we can reuse the SLRU machinery to handle this.

Overall it looks to be a big win for typical workloads. Workloads using large numbers of subtransactions might not be as well off, but I doubt there will be a regression.
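As a rough cross-check of the 520 bytes figure, using the sizing numbers from the previous section (the inputs come from the text, the arithmetic below is just my own sanity check):

    /* Sketch only: amortized per-xid eviction cost for the default sizing. */
    #include <stdio.h>

    int main(void)
    {
        const int backends     = 100;
        const int dense_slots  = backends * 16;       /* 1600 slots           */
        const int evict_block  = dense_slots / 8;     /*  200 xids per pass   */
        const int sparse_bytes = backends * 65 * 16;  /* 104000 byte buffer   */

        /* One sparse buffer copy per eviction pass, spread over the block. */
        printf("copy per xid: %d bytes, CSN log per xid: 16 bytes\n",
               sparse_bytes / evict_block);           /* 520 bytes            */
        return 0;
    }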
At this point I don't see any major issues with this approach. If the ensuing discussion doesn't find any major showstoppers then I will start to work on a patch bit by bit. It might take a while though, as my free hacking time has been severely cut down since we have a small one in the family.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de