Global snapshots - Mailing list pgsql-hackers
From: Stas Kelvich
Subject: Global snapshots
Msg-id: 21BC916B-80A1-43BF-8650-3363CCDAE09C@postgrespro.ru
List: pgsql-hackers
Global snapshots
================

Proposed here is a set of patches that achieve proper snapshot isolation semantics for cross-node transactions. They provide infrastructure to synchronize snapshots between different Postgres instances, and a way to commit such transactions atomically with respect to other concurrent global and local transactions. Global transactions can be coordinated outside of Postgres using the provided SQL functions, or through postgres_fdw, which uses these functions on remote nodes transparently.

Background
----------

Several years ago an extensible transaction manager (XTM) patch [1,2,3] was proposed, which allowed extensions to hook into and override transaction-related functions such as xid assignment, snapshot acquisition, and transaction tree commit/abort. Two transaction management extensions were created on top of that API: pg_dtm and pg_tsdtm [4].

The first one, pg_dtm, was inspired by Postgres-XL and is a direct generalization of Postgres MVCC to the cross-node scenario: a standalone service (DTM) maintains a xid counter and an array of running backends across all nodes. Every node keeps an open connection to the DTM and asks for xids and Postgres-style snapshots via network calls. While this approach is reasonable for low transaction rates (common for OLAP-like loads), it quickly falls behind in the OLTP case.

The second one, pg_tsdtm, based on the Clock-SI paper [5], implements a distributed snapshot synchronization protocol without the need for a central service like the DTM. It uses CSN-based visibility and requires two rounds of communication during transaction commit. Since a commit between nodes, each of which can abort the transaction, is usually done via the 2PC protocol anyway, those two rounds of communication are already needed.

The current patch is a reimplementation of pg_tsdtm moved directly into core, without using any API. The decision to drop the XTM API was based on two considerations.
First, the XTM API itself needed a huge pile of work to become an actual API, instead of a set of hooks over the current MVCC implementation, with extensions responsible for handling assorted static variables like RecentGlobalXmin and friends, which other parts of the backend rely on. Second, in all of our tests pg_tsdtm was better and faster, but that wasn't obvious at the beginning of the work on XTM/DTM. So we decided it is better to focus on a good implementation of the Clock-SI approach.

Algorithm
---------

Clock-SI is described in [5]; here I provide a small overview, which should be enough to convey the idea. Assume that each node runs Commit Sequence Number (CSN) based visibility: the database tracks one counter for each transaction start (xid) and another counter for each transaction commit (csn). In such a setting, a snapshot is just a single number -- a copy of the current CSN at the moment the snapshot was taken. Visibility rules boil down to checking whether the current tuple's CSN is less than our snapshot's csn. It is also worth mentioning that for the last 5 years there has been an active proposal to switch Postgres to CSN-based visibility [6].

Next, assume that the CSN is the current physical time on the node, and call it GlobalCSN. If the physical time on different nodes were perfectly synchronized, then a snapshot exported on one node could be used on other nodes. But physical clocks are never perfectly in sync and can drift, so that has to be taken into account. Also, there is no easy notion of a lock or atomic operation in a distributed environment, so commit atomicity across nodes with respect to concurrent snapshot acquisition has to be handled somehow. Clock-SI addresses this in the following way:

1) To achieve commit atomicity across nodes, an intermediate step is introduced: the running transaction is first marked as InDoubt on all nodes, and only after that does each node commit it and stamp it with a given GlobalCSN.
   All readers who run into tuples of an InDoubt transaction should wait until it ends and then recheck visibility. The same technique was used in [6] to achieve atomicity of a subtransaction tree commit.

2) When the coordinator marks a transaction as InDoubt on the other nodes, it collects a ProposedGlobalCSN from each participant, which is the local time on that node. It then selects the maximum of all ProposedGlobalCSNs and commits the transaction on all nodes with that maximal GlobalCSN, even if that value is greater than the current time on a node due to clock drift. So the GlobalCSN for a given transaction will be the same on all nodes.

3) When a local transaction imports a foreign global snapshot with some GlobalCSN and the current time on this node is smaller than the incoming GlobalCSN, the transaction should wait until that GlobalCSN time is reached on the local clock.

Rules 2) and 3) protect against time drift. Paper [5] proves that this is enough to guarantee proper Snapshot Isolation.

Implementation
--------------

The main design goal of this patch is to affect the performance of local transactions as little as possible while providing a way to run global transactions. The GUC variable track_global_snapshots enables/disables this feature.

Patch 0001-GlobalCSNLog introduces a new SLRU instance that maps xids to GlobalCSNs. The GlobalCSNLog code is pretty straightforward and more or less copied from the SUBTRANS log, which is also not persistent. I also kept an eye on the corresponding part of Heikki's original CSN patch in [6] and on commit_ts.c.

Patch 0002-Global-snapshots provides the snapshot sync functions and global commit functions. It consists of several parts, which are enabled when the GUC track_global_snapshots is on:

* Each Postgres snapshot acquisition is accompanied by taking the current GlobalCSN under the same shared ProcArray lock.

* Each transaction commit also writes the current GlobalCSN into GlobalCSNLog.
  To avoid writes to the SLRU under an exclusive ProcArray lock (which would be a major hit on commit performance), a trick with an intermediate InDoubt state is used: before calling ProcArrayEndTransaction() the backend writes the InDoubt state to the SLRU, then inside ProcArrayEndTransaction(), under the ProcArray lock, the GlobalCSN is assigned, and after the lock is released the assigned GlobalCSN value is written to the GlobalCSNLog SLRU. This approach ensures that XIP-based and GlobalCSN-based snapshots will see the same subset of tuple versions, without putting too much extra contention on the ProcArray lock.

* XidInMVCCSnapshot() can use both XIP-based and GlobalCSN-based snapshots. If the current transaction is a local one, an XIP-based check is performed first; if the tuple is visible, no further checks are done. If the xid is in-progress, we need to fetch its GlobalCSN from the SLRU and recheck it under the GlobalCSN-based visibility rules, as it may be part of a global InDoubt transaction. So, for local transactions, XidInMVCCSnapshot() will fetch an SLRU entry only for in-progress transactions. This can be optimized further: it is possible to store a flag in the snapshot indicating whether any global transactions were active when it was taken; if there were none, the SLRU access in XidInMVCCSnapshot() can be skipped entirely. Otherwise, if the current backend has a snapshot that was imported from a foreign node, only the GlobalCSN-based visibility rules are used (as we have no information about what the XIP array looked like when that GlobalCSN was taken).

* Import/export routines are provided for global snapshots. Export just returns the current snapshot's global_csn. Import, on the other hand, is more complicated. Since an imported GlobalCSN usually points into the past, we should hold back OldestXmin and ensure that tuple versions for the given GlobalCSN do not pass the OldestXmin boundary.
  To achieve that, a mechanism like the one in SnapshotTooOld is used: on each snapshot creation the current oldestXmin is written to a sparse ring buffer which holds oldestXmin entries for the last global_snapshot_defer_time seconds, and GetOldestXmin() delays its result down to the oldest entry in this ring buffer. If we are asked to import a snapshot that is older than current_time - global_snapshot_defer_time, we just error out with a "global snapshot too old" message. Otherwise, we have enough info to set a proper xmin in our proc entry to defuse GetOldestXmin().

* The following routines for commit are provided:

  * pg_global_snaphot_prepare(gid) sets the InDoubt state for the given global tx and returns a proposed GlobalCSN.

  * pg_global_snaphot_assign(gid, global_csn) assigns the given global_csn to the given global tx. The subsequent COMMIT PREPARED will use it.

The import/export and commit routines are implemented as SQL functions. IMO it would be better to turn them into IMPORT GLOBAL SNAPSHOT / EXPORT GLOBAL SNAPSHOT / PREPARE GLOBAL / COMMIT GLOBAL PREPARED statements, but at this stage I don't want to clutter this patch with new grammar.

Patch 0003-postgres_fdw-support-for-global-snapshots uses the above infrastructure to achieve isolation for the generated transactions. Right now it is a minimalistic implementation of 2PC like the one in [7], but it doesn't write anything about remote prepares to the WAL on the coordinator and doesn't handle recovery. Its main purpose is to test the global snapshot patch, as that is easier to do with TAP tests over several postgres_fdw-connected instances. Tests are included.

Usage
-----

A distributed transaction can be coordinated by an external coordinator. In that case the normal scenario would be the following:

1) Start a transaction on the first node, do some queries if needed, and call pg_global_snaphot_export().

2) On the other node, open a transaction and call pg_global_snaphot_import(global_csn), where global_csn comes from the previous export.

... do some useful work ...
3) Issue PREPARE for all participants.

4) Call pg_global_snaphot_prepare(gid) on all participants, store the returned global_csn's, and select the maximum of them.

5) Call pg_global_snaphot_assign(gid, max_global_csn) on all participants.

6) Issue COMMIT PREPARED for all participants.

As said earlier, steps 4) and 5) could be melded into 3) and 6) respectively, but let's go for grammar changes after (and if) there is agreement on the overall concept. postgres_fdw in 003-GlobalSnapshotsForPostgresFdw does the same thing, but transparently to the user.

Tests
-----

Each XidInMVCCSnapshot() call in this patch is accompanied by an assert checking that the same things are visible under the XIP-based check and the GlobalCSN-based one. Patch 003-GlobalSnapshotsForPostgresFdw includes numerous variants of a banking test: it spins up several Postgres instances and several concurrent pgbench runs which simulate cross-node bank account transfers, while the test constantly checks that the total balance is correct. There are also tests that an imported snapshot sees the same checksum of data as it did during the import. I think this feature also deserves a separate test module that wobbles the clock, but that isn't done yet.

Attributions
------------

The original XTM, pg_dtm, and pg_tsdtm were written by Konstantin Knizhnik, Constantin Pan, and me. This version is mostly mine, with some input from Konstantin Knizhnik and Arseny Sher.
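As an illustration of the CSN-based visibility rules and the InDoubt state described in the Algorithm section above, here is a toy Python model. All names are hypothetical; the actual patch implements this in C inside the server, and a real backend would wait on InDoubt rather than raise.

```python
# Toy model of CSN-based visibility with the Clock-SI InDoubt state.
# A snapshot is just a copy of the current GlobalCSN; a tuple's xid is
# visible iff its commit CSN is less than the snapshot's CSN.

IN_DOUBT = object()  # marker: transaction is between prepare and commit

class Node:
    def __init__(self):
        self.clock = 0       # "physical time", modeled as a logical counter
        self.csn_log = {}    # xid -> GlobalCSN or IN_DOUBT (the SLRU analog)

    def now(self):
        self.clock += 1
        return self.clock

    def take_snapshot(self):
        # A snapshot is a single number: the current GlobalCSN.
        return self.now()

    def is_visible(self, xid, snapshot_csn):
        csn = self.csn_log.get(xid)
        if csn is None:
            return False     # still running: invisible
        if csn is IN_DOUBT:
            # A real reader would block until the commit finishes,
            # then recheck; the toy model just reports the state.
            raise RuntimeError("InDoubt: wait for commit and recheck")
        return csn < snapshot_csn

node = Node()
snap = node.take_snapshot()        # snapshot csn = 1
node.csn_log[42] = node.now()      # xid 42 commits with csn = 2
print(node.is_visible(42, snap))   # False: committed after our snapshot
```

A snapshot taken after xid 42's commit would see it, since 42's CSN would then be below the new snapshot's CSN.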
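The coordinator flow from the Usage section above (prepare everywhere, collect ProposedGlobalCSNs, assign the maximum, commit) and the import wait of rule 3) can be sketched as a toy Python simulation. The class and function names are hypothetical stand-ins for the pg_global_snaphot_prepare()/pg_global_snaphot_assign() SQL calls, not the patch's actual code.

```python
# Toy sketch of the Clock-SI commit protocol run by an external
# coordinator over participants whose clocks drift.

class Participant:
    def __init__(self, clock):
        self.clock = clock   # local "physical time", drifting per node
        self.state = {}      # gid -> "InDoubt" | GlobalCSN

    def prepare(self, gid):
        # Mark InDoubt and propose the local clock as the GlobalCSN
        # (the pg_global_snaphot_prepare(gid) analog).
        self.state[gid] = "InDoubt"
        return self.clock

    def assign(self, gid, global_csn):
        # Stamp the transaction with the coordinator-chosen CSN, even if
        # it is ahead of the local clock (this protects against drift);
        # the pg_global_snaphot_assign(gid, csn) analog.
        self.state[gid] = global_csn

def global_commit(gid, participants):
    # Step 4: prepare everywhere, collecting ProposedGlobalCSNs.
    proposals = [p.prepare(gid) for p in participants]
    # Step 5: pick the maximum and assign it on every node, so the
    # transaction gets the same GlobalCSN on all of them.
    csn = max(proposals)
    for p in participants:
        p.assign(gid, csn)
    return csn

def import_snapshot(node, global_csn):
    # Rule 3: if the local clock is behind the incoming GlobalCSN, wait
    # until it catches up before using the imported snapshot.
    while node.clock < global_csn:
        node.clock += 1      # stand-in for waiting on the real clock
    return global_csn

# Three nodes with drifting clocks still agree on one GlobalCSN.
nodes = [Participant(100), Participant(103), Participant(98)]
csn = global_commit("tx1", nodes)
print(csn)                   # 103: the maximal proposal
```

Issuing PREPARE before step 4 and COMMIT PREPARED after step 5, as in the Usage section, completes the 2PC exchange; the sketch only models the CSN bookkeeping.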
[1] https://www.postgresql.org/message-id/flat/20160223164335.GA11285%40momjian.us
    (originally the thread was about fdw-based sharding, but it later got "hijacked" by xtm)
[2] https://www.postgresql.org/message-id/flat/F2766B97-555D-424F-B29F-E0CA0F6D1D74%40postgrespro.ru
[3] https://www.postgresql.org/message-id/flat/56E3CAAE.6060407%40postgrespro.ru
[4] https://wiki.postgresql.org/wiki/DTM
[5] https://dl.acm.org/citation.cfm?id=2553434
[6] https://www.postgresql.org/message-id/flat/CA%2BCSw_tEpJ%3Dmd1zgxPkjH6CWDnTDft4gBi%3D%2BP9SnoC%2BWy3pKdA%40mail.gmail.com
[7] https://www.postgresql.org/message-id/flat/CAFjFpRfPoDAzf3x-fs86nDwJ4FAwn2cZ%2BxdmbdDPSChU-kt7%2BQ%40mail.gmail.com