>From b4e663f53f92a727f6f4d9832542546cbff977c8 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Tue, 11 Jun 2013 23:25:27 +0200 Subject: [PATCH 17/17] wal_decoding: design document v2.4 and snapshot building design doc v0.5 --- src/backend/replication/logical/DESIGN.txt | 593 +++++++++++++++++++++ src/backend/replication/logical/Makefile | 6 + .../replication/logical/README.SNAPBUILD.txt | 241 +++++++++ 3 files changed, 840 insertions(+) create mode 100644 src/backend/replication/logical/DESIGN.txt create mode 100644 src/backend/replication/logical/README.SNAPBUILD.txt diff --git a/src/backend/replication/logical/DESIGN.txt b/src/backend/replication/logical/DESIGN.txt new file mode 100644 index 0000000..d76fdb4 --- /dev/null +++ b/src/backend/replication/logical/DESIGN.txt @@ -0,0 +1,593 @@ +//-*- mode: adoc -*- += High Level Design for Logical Replication in Postgres = +:copyright: PostgreSQL Global Development Group 2012 +:author: Andres Freund, 2ndQuadrant Ltd. +:email: andres@2ndQuadrant.com + +== Introduction == + +This document aims to first explain why we think postgres needs another +replication solution and what that solution needs to offer in our opinion. Then +it sketches out our proposed implementation. + +In contrast to an earlier version of the design document which talked about the +implementation of four parts of replication solutions: + +1. Source data generation +1. Transportation of that data +1. Applying the changes +1. Conflict resolution + +this version only plans to talk about the first part in detail as it is an +independent and complex part usable for a wide range of use cases which we want +to get included into postgres in a first step. + +=== Previous discussions === + +There are two rather large threads discussing several parts of the initial +prototype and proposed architecture: + +- http://archives.postgresql.org/message-id/201206131327.24092.andres@2ndquadrant.com[Logical Replication/BDR prototype and architecture] +- http://archives.postgresql.org/message-id/201206211341.25322.andres@2ndquadrant.com[Catalog/Metadata consistency during changeset extraction from WAL] + +Those discussions lead to some fundamental design changes which are presented in this document. + +=== Changes from v1 === +* At least a partial decoding step required/possible on the source system +* No intermediate ("schema only") instances required +* DDL handling, without event triggers +* A very simple text conversion is provided for debugging/demo purposes +* Smaller scope + +== Existing approaches to replication in Postgres == + +If any currently used approach to replication can be made to support every +use-case/feature we need, it likely is not a good idea to implement something +different. Currently three basic approaches are in use in/around postgres +today: + +. Trigger based +. Recovery based/Physical footnote:[Often referred to by terms like Hot Standby, Streaming Replication, Point In Time Recovery] +. Statement based + +Statement based replication has obvious and known problems with consistency and +correctness making it hard to use in the general case so we will not further +discuss it here. 
+
+Let's have a look at the advantages/disadvantages of the other approaches:
+
+=== Trigger based Replication ===
+
+This variant has a multitude of significant advantages:
+
+* implementable in userspace
+* easy to customize
+* just about everything can be made configurable
+* cross version support
+* cross architecture support
+* can feed into systems other than postgres
+* no overhead from writes to non-replicated tables
+* writable standbys
+* mature solutions
+* multimaster implementations possible & existing
+
+But also a number of disadvantages, some of them very hard to solve:
+
+* essentially duplicates the amount of writes (or even more!)
+* synchronous replication hard or impossible to implement
+* noticeable CPU overhead
+** trigger functions
+** text conversion of data
+* complex parts implemented in several solutions
+* not in core
+
+Especially the higher amount of writes might seem easy to solve at first
+glance, but a solution not using a normal transactional table for its log/queue
+has to solve a lot of problems. The major ones are:
+
+* crash safety, restartability & spilling to disk
+* consistency with the commit status of transactions
+* only a minimal amount of synchronous work should be done inside individual
+transactions
+
+In our opinion those problems are restricting progress/wider distribution of
+this class of solutions. It is our aim though that existing solutions in this
+space - most prominently Slony and Londiste - can benefit from the work we are
+doing & planning to do by incorporating at least parts of the changeset
+generation infrastructure.
+
+=== Recovery based Replication ===
+
+This type of solution, being built into postgres and of increasing popularity,
+has and will have its use cases, and we do not aim to replace it but to
+complement it. We plan to reuse some of the infrastructure and to make it
+possible to mix both modes of replication.
+
+Advantages:
+
+* builtin
+* built on existing infrastructure from crash recovery
+* efficient
+** minimal CPU, memory overhead on primary
+** low amount of additional writes
+* synchronous operation mode
+* low maintenance once set up
+* handles DDL
+
+Disadvantages:
+
+* standbys are read only
+* no cross version support
+* no cross architecture support
+* no replication into foreign systems
+* hard to customize
+* not configurable on the level of databases, tables, ...
+
+== Goals ==
+
+As seen in the previous short survey of the two major interesting classes of
+replication solutions, there is a significant gap between them. Our aim is to
+make it smaller.
+
+We aim for:
+
+* in core
+* low CPU overhead
+* low storage overhead
+* asynchronous, optionally synchronous operation modes
+* robust
+* modular
+* basis for other technologies (sharding, replication into other DBMSs, ...)
+* basis for at least one multi-master solution
+* make the implementation as unintrusive as possible, but not more
+
+== New Architecture ==
+
+=== Overview ===
+
+Our proposal is to reuse the basic principle of WAL based replication, namely
+reusing data that already needs to be written for another purpose, and extend
+it to allow most, but not all, of the flexibility of trigger based solutions.
+We want to do that by decoding the WAL back into a non-physical form.
+
+To get the flexibility we and others want, we propose that the last step of
+changeset generation, transforming it into a format that can be used by the
+replication consumer, is done in an extensible manner.
In the schema the part +that does that is described as 'Output Plugin'. To keep the amount of +duplication between different plugins as low as possible the plugin should only +do a a very limited amount of work. + +The following paragraphs contain reasoning for the individual design decisions +made and their highlevel design. + +=== Schematics === + +The basic proposed architecture for changeset extraction is presented in the +following diagram. The first part should look familiar to anyone knowing +postgres' architecture. The second is where most of the new magic happens. + +[[basic-schema]] +.Architecture Schema +["ditaa"] +------------------------------------------------------------------------------ + Traditional Stuff + + +---------+---------+---------+---------+----+ + | Backend | Backend | Backend | Autovac | ...| + +----+----+---+-----+----+----+----+----+-+--+ + | | | | | + +------+ | +--------+ | | + +-+ | | | +----------------+ | + | | | | | | + | v v v v | + | +------------+ | + | | WAL writer |<------------------+ + | +------------+ + | | | | | | + v v v v v v +-------------------+ ++--------+ +---------+ +->| Startup/Recovery | +|{s} | |{s} | | +-------------------+ +|Catalog | | WAL |---+->| SR/Hot Standby | +| | | | | +-------------------+ ++--------+ +---------+ +->| Point in Time | + ^ | +-------------------+ + ---|----------|-------------------------------- + | New Stuff ++---+ | +| v Running separately +| +----------------+ +=-------------------------+ +| | Walsender | | | | +| | v | | +-------------------+ | +| +-------------+ | | +->| Logical Rep. | | +| | WAL | | | | +-------------------+ | ++-| decoding | | | +->| Multimaster | | +| +------+------/ | | | +-------------------+ | +| | | | | +->| Slony | | +| | v | | | +-------------------+ | +| +-------------+ | | +->| Auditing | | +| | TX | | | | +-------------------+ | ++-| reassembly | | | +->| Mysql/... | | +| +-------------/ | | | +-------------------+ | +| | | | | +->| Custom Solutions | | +| | v | | | +-------------------+ | +| +-------------+ | | +->| Debugging | | +| | Output | | | | +-------------------+ | ++-| Plugin |--|--|-+->| Data Recovery | | + +-------------/ | | +-------------------+ | + | | | | + +----------------+ +--------------------------| +------------------------------------------------------------------------------ + +=== WAL enrichement === + +To be able to decode individual WAL records at the very minimal they need to +contain enough information to reconstruct what has happened to which row. The +action is already encoded in the WAL records header in most of the cases. + +As an example of missing data, the WAL record emitted when a row gets deleted, +only contains its physical location. At the very least we need a way to +identify the deleted row: in a relational database the minimal amount of data +that does that should be the primary key footnote:[Yes, there are use cases +where the whole row is needed, or where no primary key can be found]. + +We propose that for now it is enough to extend the relevant WAL record with +additional data when the newly introduced 'WAL_level = logical' is set. + +Previously it has been argued on the hackers mailing list that a generic 'WAL +record annotation' mechanism might be a good thing. That mechanism would allow +to attach arbitrary data to individual wal records making it easier to extend +postgres to support something like what we propose.. 
While we don't oppose that
+idea, we think it is a largely orthogonal issue to this proposal as a whole
+because the format of WAL records is version dependent by nature and the
+necessary changes for our easy way are small, so not much effort is lost.
+
+A full annotation capability is a complex endeavour on its own as the parts of
+the code generating the relevant WAL records have somewhat complex requirements
+and cannot easily be configured from the outside.
+
+Currently this is contained in the http://archives.postgresql.org/message-id/1347669575-14371-6-git-send-email-andres@2ndquadrant.com[Log enough data into the wal to reconstruct logical changes from it] patch.
+
+=== WAL parsing & decoding ===
+
+The main complexity when reading the WAL as stored on disk is that the format
+is somewhat complex and the existing parser is too deeply integrated into the
+recovery system to be directly reusable. Once a reusable parser exists, decoding
+the binary data into individual WAL records is a small problem.
+
+Currently two competing proposals for this module exist, each having its own
+merits. In the grand scheme of this proposal it is irrelevant which one gets
+picked as long as the functionality gets integrated.
+
+The mailing list post
+http://archives.postgresql.org/message-id/1347669575-14371-3-git-send-email-andres@2ndquadrant.com[Add
+support for a generic wal reading facility dubbed XLogReader] contains both
+competing patches and discussion around which one is preferable.
+
+Once the WAL has been decoded into individual records, two major issues exist:
+
+1. records from different transactions and even individual user level actions
+are intermingled
+1. the data attached to records cannot be interpreted on its own, it is only
+meaningful with a lot of required information (including table, columns, types
+and more)
+
+The solution to the first issue is described in the next section: <<tx-reassembly>>
+
+The second problem is probably the reason why no mature solution to reuse the
+WAL for logical changeset generation exists today. See the <<snapbuilder>>
+paragraph for some details.
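+
+To illustrate how the pieces are intended to fit together, here is a rough
+sketch of the decode loop feeding per-transaction reassembly. It is only an
+illustration: 'WalReader', 'read_next_record' and the 'reassembly_*' names are
+made-up placeholders, not the API of the patches referenced above.
+
+.Sketch: decoding loop feeding reassembly (illustrative names only)
+------------------------------------------------------------------------------
+/* placeholder types/functions, see the text above */
+typedef struct WalReader WalReader;             /* generic WAL reader */
+typedef struct ReassemblyState ReassemblyState; /* per-xid change streams */
+
+typedef struct DecodedRecord
+{
+    TransactionId   xid;        /* transaction that produced the record */
+    XLogRecPtr      lsn;        /* position of the record in the WAL */
+    bool            is_commit;  /* does this record end the transaction? */
+    char           *data;       /* the (still physical) payload */
+} DecodedRecord;
+
+extern bool read_next_record(WalReader *reader, DecodedRecord *rec);
+extern void reassembly_add_change(ReassemblyState *rs, TransactionId xid,
+                                  DecodedRecord *rec);
+extern void reassembly_commit(ReassemblyState *rs, TransactionId xid,
+                              XLogRecPtr commit_lsn);
+
+static void
+decode_loop(WalReader *reader, ReassemblyState *reassembly)
+{
+    DecodedRecord rec;
+
+    while (read_next_record(reader, &rec))
+    {
+        if (rec.is_commit)
+            /* transaction complete: hand its queued changes out in order */
+            reassembly_commit(reassembly, rec.xid, rec.lsn);
+        else
+            /* queue the change under its xid until the commit arrives */
+            reassembly_add_change(reassembly, rec.xid, &rec);
+    }
+}
+------------------------------------------------------------------------------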
+ +As decoding, Transaction reassembly and Snapshot building are interdependent +they currently are implemented in the same patch: +http://archives.postgresql.org/message-id/1347669575-14371-8-git-send-email-andres@2ndquadrant.com[Introduce +wal decoding via catalog timetravel] + +That patch also includes a small demonstration that the approach works in the +presence of DDL: + +[[example-of-decoding]] +.Decoding example +[NOTE] +--------------------------- +/* just so we keep a sensible xmin horizon */ +ROLLBACK PREPARED 'f'; +BEGIN; +CREATE TABLE keepalive(); +PREPARE TRANSACTION 'f'; + +DROP TABLE IF EXISTS replication_example; + +SELECT pg_current_xlog_insert_location(); +CHECKPOINT; +CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text +varchar(120)); +begin; +INSERT INTO replication_example(somedata, text) VALUES (1, 1); +INSERT INTO replication_example(somedata, text) VALUES (1, 2); +commit; + + +ALTER TABLE replication_example ADD COLUMN bar int; + +INSERT INTO replication_example(somedata, text, bar) VALUES (2, 1, 4); + +BEGIN; +INSERT INTO replication_example(somedata, text, bar) VALUES (2, 2, 4); +INSERT INTO replication_example(somedata, text, bar) VALUES (2, 3, 4); +INSERT INTO replication_example(somedata, text, bar) VALUES (2, 4, NULL); +COMMIT; + +/* slightly more complex schema change, still no table rewrite */ +ALTER TABLE replication_example DROP COLUMN bar; +INSERT INTO replication_example(somedata, text) VALUES (3, 1); + +BEGIN; +INSERT INTO replication_example(somedata, text) VALUES (3, 2); +INSERT INTO replication_example(somedata, text) VALUES (3, 3); +commit; + +ALTER TABLE replication_example RENAME COLUMN text TO somenum; + +INSERT INTO replication_example(somedata, somenum) VALUES (4, 1); + +/* complex schema change, changing types of existing column, rewriting the table */ +ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING +(somenum::int4); + +INSERT INTO replication_example(somedata, somenum) VALUES (5, 1); + +SELECT pg_current_xlog_insert_location(); + +/* now decode what has been written to the WAL during that time */ + +SELECT decode_xlog('0/1893D78', '0/18BE398'); + +WARNING: BEGIN +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:1 somedata[int4]:1 text[varchar]:1 +WARNING: tuple is: id[int4]:2 somedata[int4]:1 text[varchar]:2 +WARNING: COMMIT +WARNING: BEGIN +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:3 somedata[int4]:2 text[varchar]:1 bar[int4]:4 +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:4 somedata[int4]:2 text[varchar]:2 bar[int4]:4 +WARNING: tuple is: id[int4]:5 somedata[int4]:2 text[varchar]:3 bar[int4]:4 +WARNING: tuple is: id[int4]:6 somedata[int4]:2 text[varchar]:4 bar[int4]: +(null) +WARNING: COMMIT +WARNING: BEGIN +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:7 somedata[int4]:3 text[varchar]:1 +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:8 somedata[int4]:3 text[varchar]:2 +WARNING: tuple is: id[int4]:9 somedata[int4]:3 text[varchar]:3 +WARNING: COMMIT +WARNING: BEGIN +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:10 somedata[int4]:4 somenum[varchar]:1 +WARNING: COMMIT +WARNING: BEGIN +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:11 somedata[int4]:5 somenum[int4]:1 +WARNING: COMMIT + +--------------------------- + +[[tx-reassembly]] +=== TX reassembly === + +In order to make usage of the decoded stream easy we want to present the user +level code with a correctly ordered image of individual 
transactions at once because otherwise every user will have to reassemble
+transactions themselves.
+
+Transaction reassembly needs to solve several problems:
+
+1. changes inside a transaction can be interspersed with other transactions
+1. a top level transaction only knows which subtransactions belong to it when
+it reads the commit record
+1. individual user level actions can be smeared over multiple records (TOAST)
+
+Our proposed module solves 1) and 2) by building individual streams of records
+split by xid. While not fully implemented yet, we plan to spill those individual
+xid streams to disk after a certain amount of memory is used. This can be
+implemented without any change in the external interface.
+
+As all the individual streams are already sorted by LSN by definition - we
+build them from the wal in a FIFO manner, and the position in the WAL is the
+definition of the LSN footnote:[the LSN is just the byte position in the WAL
+stream] - the individual changes can be merged efficiently by a k-way merge
+(without sorting!) by keeping the individual streams in a binary heap.
+
+To manipulate the binary heap a generic implementation is proposed. Several
+independent implementations of binary heaps already exist in the postgres code,
+but none of them is generic. The patch is available at
+http://archives.postgresql.org/message-id/1347669575-14371-2-git-send-email-andres@2ndquadrant.com[Add
+minimal binary heap implementation].
+
+[NOTE]
+============
+The reassembly component was previously coined ApplyCache because it was
+proposed to run on replication consumers just before applying changes. This is
+not the case anymore.
+
+It is still called that way in the source of the patch recently submitted.
+============
+
+[[snapbuilder]]
+=== Snapshot building ===
+
+To make sense of wal records describing data changes we need to decode and
+transform their contents. A single tuple is stored in a data structure
+called HeapTuple. As stored on disk that structure doesn't contain any
+information about the format of its contents.
+
+The basic problem is twofold:
+
+1. The wal records only contain the relfilenode, not the relation oid, of a
+table; worse, the relfilenode changes whenever an action performing a full
+table rewrite is performed
+1. To interpret a HeapTuple correctly the exact schema definition from back
+when the wal record was inserted into the wal stream needs to be available
+
+We chose to implement timetraveling access to the system catalog using
+postgres' MVCC nature & implementation because of the following advantages:
+
+* low amount of additional data in wal
+* genericity
+* similarity of implementation to Hot Standby, quite a bit of the infrastructure is reusable
+* all kinds of DDL can be handled in a reliable manner
+* extensibility to user defined catalog-like tables
+
+Timetravel access to the catalog means that we are able to look at the catalog
+just as it looked when the changes were generated. That allows us to get the
+correct information about the contents of the aforementioned HeapTuples so we
+can decode them reliably.
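+
+To make concrete what that buys us, consider how a single change record would
+be turned into text. The following is not the code from the patch, just an
+illustration built from existing backend functions (error handling and
+#includes omitted); the important point is that both the tuple descriptor and
+the type output functions must come from the catalog as it looked when the
+record was written, which is exactly what the timetravel snapshot provides.
+
+.Sketch: printing a decoded tuple under the timetravel snapshot
+------------------------------------------------------------------------------
+static void
+print_tuple(Relation rel, HeapTuple tuple)
+{
+    TupleDesc   desc = RelationGetDescr(rel);
+    Datum       values[MaxHeapAttributeNumber];
+    bool        isnull[MaxHeapAttributeNumber];
+    int         i;
+
+    heap_deform_tuple(tuple, desc, values, isnull);
+
+    for (i = 0; i < desc->natts; i++)
+    {
+        Form_pg_attribute attr = desc->attrs[i];
+        Oid         outfunc;
+        bool        isvarlena;
+
+        if (isnull[i])
+            continue;
+
+        /* catalog lookups -- only correct under the timetravel snapshot */
+        getTypeOutputInfo(attr->atttypid, &outfunc, &isvarlena);
+        elog(WARNING, "%s[%s]: %s",
+             NameStr(attr->attname),
+             format_type_be(attr->atttypid),
+             OidOutputFunctionCall(outfunc, values[i]));
+    }
+}
+------------------------------------------------------------------------------
+
+The 'rel' passed in has to have been opened while the timetravel snapshot was
+active; with a current snapshot the descriptor could describe a newer,
+incompatible version of the table.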
+
+Other solutions we thought about that fell through:
+
+* catalog only proxy instances that apply schema changes exactly to the point
+  we're decoding, using ``old fashioned'' wal replay
+* do the decoding on a 2nd machine, replicating all DDL exactly, rely on the catalog there
+* do not allow DDL at all
+* always add enough data into the WAL to allow decoding
+* build a fully versioned catalog
+
+The email thread available under
+http://archives.postgresql.org/message-id/201206211341.25322.andres@2ndquadrant.com[Catalog/Metadata
+consistency during changeset extraction from WAL] contains some details,
+advantages and disadvantages of the different possible implementations.
+
+How we build snapshots is somewhat intricate and complicated and seems to be
+out of scope for this document. We will provide a second document discussing
+the implementation in detail. Let's just assume it is possible from here on.
+
+[NOTE]
+Some details are already available in comments inside 'src/backend/replication/logical/snapbuild.{c,h}'.
+
+=== Output Plugin ===
+
+As already mentioned, our aim is to make the implementation of output
+plugins as simple and non-redundant as possible as we expect several different
+ones with different use cases to emerge quickly. See <<basic-schema>> for a
+list of possible output plugins that we think might emerge.
+
+Although for now we only plan to tackle logical replication, and based on that
+a multi-master implementation in the near future, we definitely aim to provide
+all use-cases with something easily usable!
+
+To decode and translate local transactions an output plugin needs to be able to
+transform transactions as a whole so it can apply them as a meaningful
+transaction on the other side.
+
+To provide that, every time we find a transaction commit - and thus have
+completed reassembling the transaction - we start to provide the individual
+changes to the output plugin. It currently only has to fill out 3
+callbacks:
+
+[options="header"]
+|=====================================================================================================================================
+|Callback |Passed Parameters |Called per TX |Use
+|begin |xid |once |Begin of a reassembled transaction
+|change |xid, subxid, change, mvcc snapshot |every change |Gets passed every change so it can transform it to the target format
+|commit |xid |once |End of a reassembled transaction
+|=====================================================================================================================================
+
+A minimal sketch of such a plugin is shown below.
+
+During each of those callbacks an appropriate timetraveling SnapshotNow snapshot
+is set up so the callbacks can perform all read-only catalog accesses they need,
+including using the sys/rel/catcache. For obvious reasons only read access is
+allowed.
+
+The snapshot guarantees that the results of lookups are the same as they
+were/would have been when the change was originally created.
+
+Additionally the callbacks get passed an MVCC snapshot, to e.g. run sql queries
+on catalogs or similar.
+
+[IMPORTANT]
+============
+At the moment none of these snapshots can be used to access normal user
+tables. Adding additional tables to the allowed set is easy implementation
+wise, but every transaction changing such tables incurs a noticeably higher
+overhead.
+============
+
+For now transactions won't be decoded/output in parallel. There are ideas to
+improve on this, but we don't think the complexity is appropriate for the first
+release of this feature.
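+
+To give an impression of what filling out these callbacks looks like, here is
+a trimmed-down sketch in the spirit of the textual demo shown earlier. The
+exact signatures and the 'Change' structure are illustrative, not the final
+API; 'change_to_string' is a made-up helper.
+
+.Sketch: a minimal textual output plugin (illustrative signatures)
+------------------------------------------------------------------------------
+typedef struct Change Change;                   /* placeholder: one reassembled change */
+extern char *change_to_string(Change *change);  /* made-up helper */
+
+static void
+demo_begin(TransactionId xid)
+{
+    elog(WARNING, "BEGIN %u", xid);
+}
+
+static void
+demo_change(TransactionId xid, TransactionId subxid,
+            Change *change, Snapshot mvcc_snapshot)
+{
+    /*
+     * The timetraveling SnapshotNow is already set up here, so syscache and
+     * relcache lookups (e.g. fetching the tuple descriptor) see the catalog
+     * as it was when this change was written to the WAL.
+     */
+    elog(WARNING, "tuple is: %s", change_to_string(change));
+}
+
+static void
+demo_commit(TransactionId xid)
+{
+    elog(WARNING, "COMMIT %u", xid);
+}
+------------------------------------------------------------------------------
+
+A replication consumer would serialize each change into its wire format
+instead of printing it; the callback structure, and the one-transaction-at-a-time
+handoff described above, stay the same.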
+ +This is an adoption barrier for databases where large amounts of data get +loaded/written in one transaction. + +=== Setup of replication nodes === + +When setting up a new standby/consumer of a primary some problem exist +independent of the implementation of the consumer. The gist of the problem is +that when making a base backup and starting to stream all changes since that +point transactions that were running during all this cannot be included: + +* Transaction that have not committed before starting to dump a database are + invisible to the dumping process + +* Transactions that began before the point from which on the WAL is being + decoded are incomplete and cannot be replayed + +Our proposal for a solution to this is to detect points in the WAL stream where we can provide: + +. A snapshot exported similarly to pg_export_snapshot() footnote:[http://www.postgresql.org/docs/devel/static/functions-admin.html#FUNCTIONS-SNAPSHOT-SYNCHRONIZATION] that can be imported with +SET TRANSACTION SNAPSHOT+ footnote:[http://www.postgresql.org/docs/devel/static/sql-set-transaction.html] +. A stream of changes that will include the complete data of all transactions seen as running by the snapshot generated in 1) + +See the diagram. + +[[setup-schema]] +.Control flow during setup of a new node +["ditaa",scaling="0.7"] +------------------------------------------------------------------------------ ++----------------+ +| Walsender | | +------------+ +| v | | Consumer | ++-------------+ |<--IDENTIFY_SYSTEM-------------| | +| WAL | | | | +| decoding | |----....---------------------->| | ++------+------/ | | | +| | | | | +| v | | | ++-------------+ |<--INIT_LOGICAL $PLUGIN--------| | +| TX | | | | +| reassembly | |---FOUND_STARTING %X/%X------->| | ++-------------/ | | | +| | |---FOUND_CONSISTENT %X/%X----->| | +| v |---pg_dump snapshot----------->| | ++-------------+ |---replication slot %P-------->| | +| Output | | | | +| Plugin | | ^ | | ++-------------/ | | | | +| | +-run pg_dump separately --| | +| | | | +| |<--STREAM_DATA-----------------| | +| | | | +| |---data ---------------------->| | +| | | | +| | | | +| | ---- SHUTDOWN ------------- | | +| | | | +| | | | +| |<--RESTART_LOGICAL $PLUGIN %P--| | +| | | | +| |---data----------------------->| | +| | | | +| | | | ++----------------+ +------------+ + +------------------------------------------------------------------------------ + +=== Disadvantages of the approach === + +* somewhat intricate code for snapshot timetravel +* output plugins/walsenders need to work per database as they access the catalog +* when sending to multiple standbys some work is done multiple times +* decoding/applying multiple transactions in parallel is somewhat hard diff --git a/src/backend/replication/logical/Makefile b/src/backend/replication/logical/Makefile index 310a45c..6fae278 100644 --- a/src/backend/replication/logical/Makefile +++ b/src/backend/replication/logical/Makefile @@ -17,3 +17,9 @@ override CPPFLAGS := -I$(srcdir) $(CPPFLAGS) OBJS = decode.o logical.o logicalfuncs.o reorderbuffer.o snapbuild.o include $(top_srcdir)/src/backend/common.mk + +DESIGN.pdf: DESIGN.txt + a2x -v --fop -f pdf -D $(shell pwd) $< + +README.SNAPBUILD.pdf: README.SNAPBUILD.txt + a2x -v --fop -f pdf -D $(shell pwd) $< diff --git a/src/backend/replication/logical/README.SNAPBUILD.txt b/src/backend/replication/logical/README.SNAPBUILD.txt new file mode 100644 index 0000000..b6c7470 --- /dev/null +++ b/src/backend/replication/logical/README.SNAPBUILD.txt @@ -0,0 +1,241 @@ += Snapshot 
Building =
+:author: Andres Freund, 2ndQuadrant Ltd
+
+== Why do we need timetravel catalog access ==
+
+When doing WAL decoding (see DESIGN.txt for reasons to do so), we need to know
+how the catalog looked at the point a record was inserted into the WAL, because
+without that information we don't know much more about the record other than
+its length. It's just an arbitrary bunch of bytes without further information.
+Unfortunately, due to the possibility that the table definition might change we
+cannot just access a newer version of the catalog and assume the table
+definition continues to be the same.
+
+If only the type information were required, it might be enough to annotate the
+wal records with a bit more information (table oid, table name, column name,
+column type) --- but as we want to be able to convert the output to more useful
+formats such as text, we additionally need to be able to call output functions.
+Those need a normal environment including the usual caches and normal catalog
+access to look up operators, functions and other types.
+
+Our solution to this is to add the capability to access the catalog as it
+was at the time the record was inserted into the WAL. The locking used during
+WAL generation guarantees the catalog is/was in a consistent state at that
+point. We call this 'time-travel catalog access'.
+
+Interesting cases include:
+
+- enums
+- composite types
+- extension types
+- non-C functions
+- relfilenode to table OID mapping
+
+Due to postgres' non-overwriting storage manager, regular modifications of a
+table's content are theoretically non-destructive. The problem is that there is
+no way to access an arbitrary point in time even if the data for it is there.
+
+This module adds the capability to do so in the very limited set of
+circumstances we need it in for WAL decoding. It does *not* provide a general
+time-travelling facility.
+
+A 'Snapshot' is the data structure used in postgres to describe which tuples
+are visible and which are not. We need to build a Snapshot which can be used to
+access the catalog the way it looked when the wal record was inserted.
+
+Restrictions:
+
+- Only works for catalog tables or tables explicitly marked as such.
+- Snapshot modifications are somewhat expensive
+- it cannot build initial visibility information for every point in time, it
+  needs specific circumstances to start.
+
+== How are time-travel snapshots built ==
+
+'Hot Standby' added infrastructure to build snapshots from WAL during recovery in
+the 9.0 release. Most of that can be reused for our purposes.
+
+We cannot reuse all of the hot standby infrastructure because:
+
+- we are not in recovery
+- we need to look at interim states *inside* a transaction
+- we need the capability to have multiple different snapshots around at the same time
+
+Normally the catalog is accessed using SnapshotNow, which can legally be
+replaced by an MVCC snapshot taken at the start of a scan. So catalog
+timetravel contains infrastructure to make SnapshotNow catalog access use
+appropriate MVCC snapshots. They aren't generated with GetSnapshotData()
+though, but reassembled from WAL contents.
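+
+The next section describes how the fields of such a reassembled snapshot are
+(re)used; under that convention, with +xip+ holding the *committed*
+catalog-modifying transactions, a deliberately simplified visibility check for
+a catalog tuple's xmin could look like the sketch below. It leaves out clog
+checks, subtransactions and command id handling.
+
+------------------------------------------------------------------------------
+static bool
+xid_visible_for_decoding(Snapshot snap, TransactionId xid)
+{
+    int     i;
+
+    /* transactions that finished before we started collecting are visible */
+    if (TransactionIdPrecedes(xid, snap->xmin))
+        return true;
+
+    /* nothing at or after xmax can be visible yet */
+    if (TransactionIdFollowsOrEquals(xid, snap->xmax))
+        return false;
+
+    /* in between: visible only if we saw the transaction commit */
+    for (i = 0; i < snap->xcnt; i++)
+    {
+        if (TransactionIdEquals(snap->xip[i], xid))
+            return true;
+    }
+
+    return false;
+}
+------------------------------------------------------------------------------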
+
+We collect our data in a normal struct SnapshotData, repurposing some fields
+creatively:
+
+- +Snapshot->xip+ contains all transactions we consider committed
+- +Snapshot->subxip+ contains all transactions belonging to our transaction,
+  including the toplevel one
+- +Snapshot->active_count+ is used as a refcount
+
+The meaning of +xip+ is inverted in comparison with non-timetravel snapshots in
+the sense that members of the array are the committed transactions, not the in
+progress ones. Because usually only a tiny percentage of committed transactions
+will have modified the catalog between xmin and xmax this allows us to keep the
+array small in the usual cases. It also makes subtransaction handling easier
+since we neither need to query pg_subtrans (which we couldn't anyway since it's
+truncated at restart) nor have problems with suboverflowed snapshots.
+
+== Building of initial snapshot ==
+
+We can start building an initial snapshot as soon as we find either an
++XLOG_RUNNING_XACTS+ or an +XLOG_CHECKPOINT_SHUTDOWN+ record because they allow us
+to know how many transactions are running.
+
+We need to know which transactions were running when we start to build a
+snapshot/start decoding as we don't have enough information about them (they
+could have done catalog modifications before we started watching). Also, we
+wouldn't have the complete contents of those transactions, because we started
+reading after they began. (The latter is also important when building
+snapshots that can be used to build a consistent initial clone.)
+
+There also is the problem that +XLOG_RUNNING_XACTS+ records can be
+'suboverflowed' which means there were more running subtransactions than
+fit into shared memory. In that case we use the same incremental building
+trick hot standby uses, which is to either
+
+1. wait until further +XLOG_RUNNING_XACTS+ records have a running->oldestRunningXid
+beyond the initial xl_running_xacts->nextXid
+2. wait for a further +XLOG_RUNNING_XACTS+ that is not overflowed or
+a +XLOG_CHECKPOINT_SHUTDOWN+
+
+When we start building a snapshot we are in the +SNAPBUILD_START+ state. As
+soon as we find any visibility information, even if incomplete, we change to
++SNAPBUILD_INITIAL_POINT+.
+
+When we have collected enough information to decode any transaction starting
+after that point in time we switch to +SNAPBUILD_FULL_SNAPSHOT+. If those
+transactions commit before the next state is reached, we throw their complete
+contents away.
+
+As soon as all transactions that were running when we switched over to
++SNAPBUILD_FULL_SNAPSHOT+ commit, we change state to +SNAPBUILD_CONSISTENT+.
+Every transaction that commits from now on gets handed to the output plugin.
+When doing the switch to +SNAPBUILD_CONSISTENT+ we optionally export a snapshot
+which makes all transactions that committed up to this point visible. This
+exported snapshot can be used to run pg_dump; replaying all changes emitted
+by the output plugin on a database restored from such a dump will result in
+a consistent clone.
+
+["ditaa",scaling="0.8"]
+---------------
+
+       +-------------------------+
+  +----|SNAPBUILD_START          |-------------+
+  |    +-------------------------+             |
+  |                 |                          |
+  |                 |                          |
+  |    running_xacts with running xacts        |
+  |                 |                          |
+  |                 |                          |
+  |                 v                          |
+  |    +-------------------------+             v
+  |    |SNAPBUILD_FULL_SNAPSHOT  |------------>|
+  |    +-------------------------+             |
+XLOG_RUNNING_XACTS  |                   saved snapshot
+ with zero xacts    |               at running_xacts's lsn
+  |                 |                          |
+  |   all running toplevel TXNs finished       |
+  |                 |                          |
+  |                 v                          |
+  |    +-------------------------+             |
+  +--->|SNAPBUILD_CONSISTENT     |<------------+
+       +-------------------------+
+
+---------------
+
+== Snapshot Management ==
+
+Whenever a transaction is detected as having started during decoding in
++SNAPBUILD_FULL_SNAPSHOT+ state, we distribute the currently maintained
+snapshot to it (i.e. call ReorderBufferSetBaseSnapshot). This serves as its
+initial snapshot. Unless there are concurrent catalog changes, that snapshot
+will be used for decoding the entire transaction's changes.
+
+Whenever a transaction-with-catalog-changes commits, we iterate over all
+concurrently active transactions and add a new SnapshotNow to each of them
+(ReorderBufferAddSnapshot(current_lsn)). This is required because any row
+written from that point on will have used the changed catalog contents.
+
+When decoding a transaction that made catalog changes itself, we tell that
+transaction about them (ReorderBufferAddNewCommandId(current_lsn)), which will
+cause the decoding to use the appropriate command id from that point on.
+
+These SnapshotNows need to be set up globally so the syscache and other pieces
+access them transparently. This is done using two new tqual.h functions:
+SetupDecodingSnapshots() and RevertFromDecodingSnapshots().
+
+== Catalog/User Table Detection ==
+
+Since we only want to store committed transactions that actually modified the
+catalog we need a way to detect that from WAL:
+
+Right now, we assume that every transaction that commits before we reach
++SNAPBUILD_CONSISTENT+ state has made catalog modifications since we can't rely
+on having seen the entire transaction before that. That's not harmful besides
+incurring some price in memory usage and runtime.
+
+After having reached consistency we recognize catalog modifying transactions
+via the HEAP2_NEW_CID and HEAP_INPLACE records that are logged by catalog
+modifying actions.
+
+== mixed DDL/DML transaction handling ==
+
+When a transaction uses both DDL and DML, things get a bit more complicated
+because we need to handle CommandIds and ComboCids, as we need to use the
+correct version of the catalog when decoding the individual tuples.
+
+For that we emit the new HEAP2_NEW_CID records, which contain the physical tuple
+location, cmin and cmax, when the catalog is modified. If we need to detect
+visibility of a catalog tuple that has been modified in our own transaction -
+which we can detect via xmin/xmax - we look in a hash table using the location
+as key to get the correct cmin/cmax values.
+From those values we can also extract the commandid that generated the record.
+
+All this only needs to happen in the transaction performing the DDL.
+
+== Cache Handling ==
+
+As we allow usage of the normal {sys,cat,rel,..}cache we also need to integrate
+cache invalidation. For transactions that only do DDL that's easy as everything
+is already provided by HS. Every time we read a commit record we apply the
+sinval messages contained therein.
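+
+A sketch of that simple case, assuming the invalidation messages have already
+been extracted from the decoded commit record:
+
+------------------------------------------------------------------------------
+static void
+apply_commit_invalidations(SharedInvalidationMessage *msgs, int nmsgs)
+{
+    int     i;
+
+    /* replay the sinval traffic so stale syscache/relcache entries go away */
+    for (i = 0; i < nmsgs; i++)
+        LocalExecuteInvalidationMessage(&msgs[i]);
+}
+------------------------------------------------------------------------------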
+
+For transactions that contain DDL and DML, cache invalidation needs to happen
+more frequently because we need to tear down all caches that just got
+modified. To do that we simply take all invalidation messages that got
+collected at the end of the transaction and apply them every time we've decoded
+a single change. At some point this can get optimized by generating new local
+invalidation messages, but that seems too complicated for now.
+
+XXX: talk about syscache handling of relmapped relations.
+
+== xmin Horizon Handling ==
+
+Reusing MVCC for timetravel access has one obvious major problem: VACUUM. Rows
+we still need for decoding cannot be removed but at the same time we cannot
+keep data in the catalog indefinitely.
+
+For that we peg the xmin horizon that's used to decide which rows can be
+removed. We only need to prevent removal of those rows for catalog-like
+relations, not for all user tables. For that reason a separate xmin horizon
+RecentGlobalDataXmin got introduced.
+
+Since we need to persist that knowledge across restarts we keep the xmin in
+the logical slots, which are saved in a crash-safe manner. They are restored
+from disk into memory at server startup.
+
+== Restartable Decoding ==
+
+As we want to generate a consistent stream of changes we need to have the
+ability to start from a previously decoded location without waiting possibly
+very long to reach consistency. For that reason we dump the current visibility
+information to disk every time we read an xl_running_xacts record.
+
-- 
1.8.2.rc2.4.g7799588.dirty