>From b4e663f53f92a727f6f4d9832542546cbff977c8 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Tue, 11 Jun 2013 23:25:27 +0200 Subject: [PATCH 17/17] wal_decoding: design document v2.4 and snapshot building design doc v0.5 --- src/backend/replication/logical/DESIGN.txt | 593 +++++++++++++++++++++ src/backend/replication/logical/Makefile | 6 + .../replication/logical/README.SNAPBUILD.txt | 241 +++++++++ 3 files changed, 840 insertions(+) create mode 100644 src/backend/replication/logical/DESIGN.txt create mode 100644 src/backend/replication/logical/README.SNAPBUILD.txt diff --git a/src/backend/replication/logical/DESIGN.txt b/src/backend/replication/logical/DESIGN.txt new file mode 100644 index 0000000..d76fdb4 --- /dev/null +++ b/src/backend/replication/logical/DESIGN.txt @@ -0,0 +1,593 @@ +//-*- mode: adoc -*- += High Level Design for Logical Replication in Postgres = +:copyright: PostgreSQL Global Development Group 2012 +:author: Andres Freund, 2ndQuadrant Ltd. +:email: andres@2ndQuadrant.com + +== Introduction == + +This document aims to first explain why we think postgres needs another +replication solution and what that solution needs to offer in our opinion. Then +it sketches out our proposed implementation. + +In contrast to an earlier version of the design document which talked about the +implementation of four parts of replication solutions: + +1. Source data generation +1. Transportation of that data +1. Applying the changes +1. Conflict resolution + +this version only plans to talk about the first part in detail as it is an +independent and complex part usable for a wide range of use cases which we want +to get included into postgres in a first step. + +=== Previous discussions === + +There are two rather large threads discussing several parts of the initial +prototype and proposed architecture: + +- http://archives.postgresql.org/message-id/201206131327.24092.andres@2ndquadrant.com[Logical Replication/BDR prototype and architecture] +- http://archives.postgresql.org/message-id/201206211341.25322.andres@2ndquadrant.com[Catalog/Metadata consistency during changeset extraction from WAL] + +Those discussions lead to some fundamental design changes which are presented in this document. + +=== Changes from v1 === +* At least a partial decoding step required/possible on the source system +* No intermediate ("schema only") instances required +* DDL handling, without event triggers +* A very simple text conversion is provided for debugging/demo purposes +* Smaller scope + +== Existing approaches to replication in Postgres == + +If any currently used approach to replication can be made to support every +use-case/feature we need, it likely is not a good idea to implement something +different. Currently three basic approaches are in use in/around postgres +today: + +. Trigger based +. Recovery based/Physical footnote:[Often referred to by terms like Hot Standby, Streaming Replication, Point In Time Recovery] +. Statement based + +Statement based replication has obvious and known problems with consistency and +correctness making it hard to use in the general case so we will not further +discuss it here. 
+
+Let's have a look at the advantages/disadvantages of the other approaches:
+
+=== Trigger based Replication ===
+
+This variant has a multitude of significant advantages:
+
+* implementable in userspace
+* easy to customize
+* just about everything can be made configurable
+* cross version support
+* cross architecture support
+* can feed into systems other than postgres
+* no overhead from writes to non-replicated tables
+* writable standbys
+* mature solutions
+* multimaster implementations possible & existing
+
+But also a number of disadvantages, some of them very hard to solve:
+
+* essentially duplicates the amount of writes (or even more!)
+* synchronous replication hard or impossible to implement
+* noticeable CPU overhead
+** trigger functions
+** text conversion of data
+* complex parts implemented in several solutions
+* not in core
+
+Especially the higher amount of writes might seem easy to solve at first
+glance, but a solution not using a normal transactional table for its log/queue
+has to solve a lot of problems. The major ones are:
+
+* crash safety, restartability & spilling to disk
+* consistency with the commit status of transactions
+* only a minimal amount of synchronous work should be done inside individual
+transactions
+
+In our opinion those problems are restricting progress/wider distribution of
+this class of solutions. It is our aim though that existing solutions in this
+space - most prominently Slony and Londiste - can benefit from the work we are
+doing & planning to do by incorporating at least parts of the changeset
+generation infrastructure.
+
+=== Recovery based Replication ===
+
+This type of solution, being built into postgres and of increasing popularity,
+has and will have its use cases, and we do not aim to replace it but to
+complement it. We plan to reuse some of the infrastructure and to make it
+possible to mix both modes of replication.
+
+Advantages:
+
+* builtin
+* built on existing infrastructure from crash recovery
+* efficient
+** minimal CPU, memory overhead on primary
+** low amount of additional writes
+* synchronous operation mode
+* low maintenance once set up
+* handles DDL
+
+Disadvantages:
+
+* standbys are read only
+* no cross version support
+* no cross architecture support
+* no replication into foreign systems
+* hard to customize
+* not configurable on the level of databases, tables, ...
+
+== Goals ==
+
+As seen in the previous short survey of the two major interesting classes of
+replication solutions, there is a significant gap between them. Our aim is to
+make it smaller.
+
+We aim for:
+
+* in core
+* low CPU overhead
+* low storage overhead
+* asynchronous, optionally synchronous operation modes
+* robust
+* modular
+* basis for other technologies (sharding, replication into other DBMSs, ...)
+* basis for at least one multi-master solution
+* make the implementation as unintrusive as possible, but not more
+
+== New Architecture ==
+
+=== Overview ===
+
+Our proposal is to reuse the basic principle of WAL based replication, namely
+reusing data that already needs to be written for another purpose, and extend
+it to allow most, but not all, of the flexibility of trigger based solutions.
+We want to do that by decoding the WAL back into a non-physical form.
+
+To get the flexibility we and others want, we propose that the last step of
+changeset generation, transforming it into a format that can be used by the
+replication consumer, is done in an extensible manner.
In the schema the part +that does that is described as 'Output Plugin'. To keep the amount of +duplication between different plugins as low as possible the plugin should only +do a a very limited amount of work. + +The following paragraphs contain reasoning for the individual design decisions +made and their highlevel design. + +=== Schematics === + +The basic proposed architecture for changeset extraction is presented in the +following diagram. The first part should look familiar to anyone knowing +postgres' architecture. The second is where most of the new magic happens. + +[[basic-schema]] +.Architecture Schema +["ditaa"] +------------------------------------------------------------------------------ + Traditional Stuff + + +---------+---------+---------+---------+----+ + | Backend | Backend | Backend | Autovac | ...| + +----+----+---+-----+----+----+----+----+-+--+ + | | | | | + +------+ | +--------+ | | + +-+ | | | +----------------+ | + | | | | | | + | v v v v | + | +------------+ | + | | WAL writer |<------------------+ + | +------------+ + | | | | | | + v v v v v v +-------------------+ ++--------+ +---------+ +->| Startup/Recovery | +|{s} | |{s} | | +-------------------+ +|Catalog | | WAL |---+->| SR/Hot Standby | +| | | | | +-------------------+ ++--------+ +---------+ +->| Point in Time | + ^ | +-------------------+ + ---|----------|-------------------------------- + | New Stuff ++---+ | +| v Running separately +| +----------------+ +=-------------------------+ +| | Walsender | | | | +| | v | | +-------------------+ | +| +-------------+ | | +->| Logical Rep. | | +| | WAL | | | | +-------------------+ | ++-| decoding | | | +->| Multimaster | | +| +------+------/ | | | +-------------------+ | +| | | | | +->| Slony | | +| | v | | | +-------------------+ | +| +-------------+ | | +->| Auditing | | +| | TX | | | | +-------------------+ | ++-| reassembly | | | +->| Mysql/... | | +| +-------------/ | | | +-------------------+ | +| | | | | +->| Custom Solutions | | +| | v | | | +-------------------+ | +| +-------------+ | | +->| Debugging | | +| | Output | | | | +-------------------+ | ++-| Plugin |--|--|-+->| Data Recovery | | + +-------------/ | | +-------------------+ | + | | | | + +----------------+ +--------------------------| +------------------------------------------------------------------------------ + +=== WAL enrichement === + +To be able to decode individual WAL records at the very minimal they need to +contain enough information to reconstruct what has happened to which row. The +action is already encoded in the WAL records header in most of the cases. + +As an example of missing data, the WAL record emitted when a row gets deleted, +only contains its physical location. At the very least we need a way to +identify the deleted row: in a relational database the minimal amount of data +that does that should be the primary key footnote:[Yes, there are use cases +where the whole row is needed, or where no primary key can be found]. + +We propose that for now it is enough to extend the relevant WAL record with +additional data when the newly introduced 'WAL_level = logical' is set. + +Previously it has been argued on the hackers mailing list that a generic 'WAL +record annotation' mechanism might be a good thing. That mechanism would allow +to attach arbitrary data to individual wal records making it easier to extend +postgres to support something like what we propose.. 
While we don't oppose that
+idea, we think it is a largely orthogonal issue to this proposal as a whole
+because the format of WAL records is version dependent by nature and the
+necessary changes for our easy way are small, so not much effort is lost.
+
+A full annotation capability is a complex endeavour on its own as the parts of
+the code generating the relevant WAL records have somewhat complex requirements
+and cannot easily be configured from the outside.
+
+Currently this is contained in the http://archives.postgresql.org/message-id/1347669575-14371-6-git-send-email-andres@2ndquadrant.com[Log enough data into the wal to reconstruct logical changes from it] patch.
+
+=== WAL parsing & decoding ===
+
+The main complexity when reading the WAL as stored on disk is that the format
+is somewhat complex and the existing parser is too deeply integrated into the
+recovery system to be directly reusable. Once a reusable parser exists, decoding
+the binary data into individual WAL records is a small problem.
+
+Currently two competing proposals for this module exist, each having its own
+merits. In the grand scheme of this proposal it is irrelevant which one gets
+picked as long as the functionality gets integrated.
+
+The mailing list post
+http://archives.postgresql.org/message-id/1347669575-14371-3-git-send-email-andres@2ndquadrant.com[Add
+support for a generic wal reading facility dubbed XLogReader] contains both
+competing patches and discussion around which one is preferable.
+
+Once the WAL has been decoded into individual records, two major issues exist:
+
+1. records from different transactions and even individual user level actions
+are intermingled
+1. the data attached to records cannot be interpreted on its own, it is only
+meaningful with a lot of required information (including table, columns, types
+and more)
+
+The solution to the first issue is described in the next section: <<tx-reassembly>>
+
+The second problem is probably the reason why no mature solution to reuse the
+WAL for logical changeset generation exists today. See the <<snapbuilder>>
+paragraph for some details.
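+
+To illustrate how the pieces are intended to fit together, here is a rough
+sketch of the decode loop feeding per-transaction reassembly. It is only an
+illustration: 'WalReader', 'read_next_record' and the 'reassembly_*' names are
+made-up placeholders, not the API of the patches referenced above.
+
+.Sketch: decoding loop feeding reassembly (illustrative names only)
+------------------------------------------------------------------------------
+/* placeholder types/functions, see the text above */
+typedef struct WalReader WalReader;             /* generic WAL reader */
+typedef struct ReassemblyState ReassemblyState; /* per-xid change streams */
+
+typedef struct DecodedRecord
+{
+    TransactionId   xid;        /* transaction that produced the record */
+    XLogRecPtr      lsn;        /* position of the record in the WAL */
+    bool            is_commit;  /* does this record end the transaction? */
+    char           *data;       /* the (still physical) payload */
+} DecodedRecord;
+
+extern bool read_next_record(WalReader *reader, DecodedRecord *rec);
+extern void reassembly_add_change(ReassemblyState *rs, TransactionId xid,
+                                  DecodedRecord *rec);
+extern void reassembly_commit(ReassemblyState *rs, TransactionId xid,
+                              XLogRecPtr commit_lsn);
+
+static void
+decode_loop(WalReader *reader, ReassemblyState *reassembly)
+{
+    DecodedRecord rec;
+
+    while (read_next_record(reader, &rec))
+    {
+        if (rec.is_commit)
+            /* transaction complete: hand its queued changes out in order */
+            reassembly_commit(reassembly, rec.xid, rec.lsn);
+        else
+            /* queue the change under its xid until the commit arrives */
+            reassembly_add_change(reassembly, rec.xid, &rec);
+    }
+}
+------------------------------------------------------------------------------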
+ +As decoding, Transaction reassembly and Snapshot building are interdependent +they currently are implemented in the same patch: +http://archives.postgresql.org/message-id/1347669575-14371-8-git-send-email-andres@2ndquadrant.com[Introduce +wal decoding via catalog timetravel] + +That patch also includes a small demonstration that the approach works in the +presence of DDL: + +[[example-of-decoding]] +.Decoding example +[NOTE] +--------------------------- +/* just so we keep a sensible xmin horizon */ +ROLLBACK PREPARED 'f'; +BEGIN; +CREATE TABLE keepalive(); +PREPARE TRANSACTION 'f'; + +DROP TABLE IF EXISTS replication_example; + +SELECT pg_current_xlog_insert_location(); +CHECKPOINT; +CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text +varchar(120)); +begin; +INSERT INTO replication_example(somedata, text) VALUES (1, 1); +INSERT INTO replication_example(somedata, text) VALUES (1, 2); +commit; + + +ALTER TABLE replication_example ADD COLUMN bar int; + +INSERT INTO replication_example(somedata, text, bar) VALUES (2, 1, 4); + +BEGIN; +INSERT INTO replication_example(somedata, text, bar) VALUES (2, 2, 4); +INSERT INTO replication_example(somedata, text, bar) VALUES (2, 3, 4); +INSERT INTO replication_example(somedata, text, bar) VALUES (2, 4, NULL); +COMMIT; + +/* slightly more complex schema change, still no table rewrite */ +ALTER TABLE replication_example DROP COLUMN bar; +INSERT INTO replication_example(somedata, text) VALUES (3, 1); + +BEGIN; +INSERT INTO replication_example(somedata, text) VALUES (3, 2); +INSERT INTO replication_example(somedata, text) VALUES (3, 3); +commit; + +ALTER TABLE replication_example RENAME COLUMN text TO somenum; + +INSERT INTO replication_example(somedata, somenum) VALUES (4, 1); + +/* complex schema change, changing types of existing column, rewriting the table */ +ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING +(somenum::int4); + +INSERT INTO replication_example(somedata, somenum) VALUES (5, 1); + +SELECT pg_current_xlog_insert_location(); + +/* now decode what has been written to the WAL during that time */ + +SELECT decode_xlog('0/1893D78', '0/18BE398'); + +WARNING: BEGIN +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:1 somedata[int4]:1 text[varchar]:1 +WARNING: tuple is: id[int4]:2 somedata[int4]:1 text[varchar]:2 +WARNING: COMMIT +WARNING: BEGIN +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:3 somedata[int4]:2 text[varchar]:1 bar[int4]:4 +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:4 somedata[int4]:2 text[varchar]:2 bar[int4]:4 +WARNING: tuple is: id[int4]:5 somedata[int4]:2 text[varchar]:3 bar[int4]:4 +WARNING: tuple is: id[int4]:6 somedata[int4]:2 text[varchar]:4 bar[int4]: +(null) +WARNING: COMMIT +WARNING: BEGIN +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:7 somedata[int4]:3 text[varchar]:1 +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:8 somedata[int4]:3 text[varchar]:2 +WARNING: tuple is: id[int4]:9 somedata[int4]:3 text[varchar]:3 +WARNING: COMMIT +WARNING: BEGIN +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:10 somedata[int4]:4 somenum[varchar]:1 +WARNING: COMMIT +WARNING: BEGIN +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:11 somedata[int4]:5 somenum[int4]:1 +WARNING: COMMIT + +--------------------------- + +[[tx-reassembly]] +=== TX reassembly === + +In order to make usage of the decoded stream easy we want to present the user +level code with a correctly ordered image of individual 
transactions at once because otherwise every user will have to reassemble
+transactions themselves.
+
+Transaction reassembly needs to solve several problems:
+
+1. changes inside a transaction can be interspersed with other transactions
+1. a top level transaction only knows which subtransactions belong to it when
+it reads the commit record
+1. individual user level actions can be smeared over multiple records (TOAST)
+
+Our proposed module solves 1) and 2) by building individual streams of records
+split by xid. While not fully implemented yet, we plan to spill those individual
+xid streams to disk after a certain amount of memory is used. This can be
+implemented without any change in the external interface.
+
+As all the individual streams are already sorted by LSN by definition - we
+build them from the wal in a FIFO manner, and the position in the WAL is the
+definition of the LSN footnote:[the LSN is just the byte position in the WAL
+stream] - the individual changes can be merged efficiently by a k-way merge
+(without sorting!) by keeping the individual streams in a binary heap.
+
+To manipulate the binary heap a generic implementation is proposed. Several
+independent implementations of binary heaps already exist in the postgres code,
+but none of them is generic. The patch is available at
+http://archives.postgresql.org/message-id/1347669575-14371-2-git-send-email-andres@2ndquadrant.com[Add
+minimal binary heap implementation].
+
+[NOTE]
+============
+The reassembly component was previously coined ApplyCache because it was
+proposed to run on replication consumers just before applying changes. This is
+not the case anymore.
+
+It is still called that way in the source of the patch recently submitted.
+============
+
+[[snapbuilder]]
+=== Snapshot building ===
+
+To make sense of wal records describing data changes we need to decode and
+transform their contents. A single tuple is stored in a data structure
+called HeapTuple. As stored on disk that structure doesn't contain any
+information about the format of its contents.
+
+The basic problem is twofold:
+
+1. The wal records only contain the relfilenode, not the relation oid, of a
+table; worse, the relfilenode changes whenever an action performing a full
+table rewrite is performed
+1. To interpret a HeapTuple correctly the exact schema definition from back
+when the wal record was inserted into the wal stream needs to be available
+
+We chose to implement timetraveling access to the system catalog using
+postgres' MVCC nature & implementation because of the following advantages:
+
+* low amount of additional data in wal
+* genericity
+* similarity of implementation to Hot Standby, quite a bit of the infrastructure is reusable
+* all kinds of DDL can be handled in a reliable manner
+* extensibility to user defined catalog-like tables
+
+Timetravel access to the catalog means that we are able to look at the catalog
+just as it looked when the changes were generated. That allows us to get the
+correct information about the contents of the aforementioned HeapTuples so we
+can decode them reliably.
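+
+To make concrete what that buys us, consider how a single change record would
+be turned into text. The following is not the code from the patch, just an
+illustration built from existing backend functions (error handling and
+#includes omitted); the important point is that both the tuple descriptor and
+the type output functions must come from the catalog as it looked when the
+record was written, which is exactly what the timetravel snapshot provides.
+
+.Sketch: printing a decoded tuple under the timetravel snapshot
+------------------------------------------------------------------------------
+static void
+print_tuple(Relation rel, HeapTuple tuple)
+{
+    TupleDesc   desc = RelationGetDescr(rel);
+    Datum       values[MaxHeapAttributeNumber];
+    bool        isnull[MaxHeapAttributeNumber];
+    int         i;
+
+    heap_deform_tuple(tuple, desc, values, isnull);
+
+    for (i = 0; i < desc->natts; i++)
+    {
+        Form_pg_attribute attr = desc->attrs[i];
+        Oid         outfunc;
+        bool        isvarlena;
+
+        if (isnull[i])
+            continue;
+
+        /* catalog lookups -- only correct under the timetravel snapshot */
+        getTypeOutputInfo(attr->atttypid, &outfunc, &isvarlena);
+        elog(WARNING, "%s[%s]: %s",
+             NameStr(attr->attname),
+             format_type_be(attr->atttypid),
+             OidOutputFunctionCall(outfunc, values[i]));
+    }
+}
+------------------------------------------------------------------------------
+
+The 'rel' passed in has to have been opened while the timetravel snapshot was
+active; with a current snapshot the descriptor could describe a newer,
+incompatible version of the table.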
+
+Other solutions we thought about that fell through:
+
+* catalog only proxy instances that apply schema changes exactly to the point
+  we're decoding, using ``old fashioned'' wal replay
+* do the decoding on a 2nd machine, replicating all DDL exactly, rely on the catalog there
+* do not allow DDL at all
+* always add enough data into the WAL to allow decoding
+* build a fully versioned catalog
+
+The email thread available under
+http://archives.postgresql.org/message-id/201206211341.25322.andres@2ndquadrant.com[Catalog/Metadata
+consistency during changeset extraction from WAL] contains some details,
+advantages and disadvantages of the different possible implementations.
+
+How we build snapshots is somewhat intricate and complicated and seems to be
+out of scope for this document. We will provide a second document discussing
+the implementation in detail. Let's just assume it is possible from here on.
+
+[NOTE]
+Some details are already available in comments inside 'src/backend/replication/logical/snapbuild.{c,h}'.
+
+=== Output Plugin ===
+
+As already mentioned, our aim is to make the implementation of output
+plugins as simple and non-redundant as possible as we expect several different
+ones with different use cases to emerge quickly. See <<basic-schema>> for a
+list of possible output plugins that we think might emerge.
+
+Although for now we only plan to tackle logical replication, and based on that
+a multi-master implementation in the near future, we definitely aim to provide
+all use-cases with something easily usable!
+
+To decode and translate local transactions an output plugin needs to be able to
+transform transactions as a whole so it can apply them as a meaningful
+transaction on the other side.
+
+To provide that, every time we find a transaction commit - and thus have
+completed reassembling the transaction - we start to provide the individual
+changes to the output plugin. It currently only has to fill out 3
+callbacks:
+
+[options="header"]
+|=====================================================================================================================================
+|Callback |Passed Parameters |Called per TX |Use
+|begin |xid |once |Begin of a reassembled transaction
+|change |xid, subxid, change, mvcc snapshot |every change |Gets passed every change so it can transform it to the target format
+|commit |xid |once |End of a reassembled transaction
+|=====================================================================================================================================
+
+A minimal sketch of such a plugin is shown below.
+
+During each of those callbacks an appropriate timetraveling SnapshotNow snapshot
+is set up so the callbacks can perform all read-only catalog accesses they need,
+including using the sys/rel/catcache. For obvious reasons only read access is
+allowed.
+
+The snapshot guarantees that the results of lookups are the same as they
+were/would have been when the change was originally created.
+
+Additionally the callbacks get passed an MVCC snapshot, to e.g. run sql queries
+on catalogs or similar.
+
+[IMPORTANT]
+============
+At the moment none of these snapshots can be used to access normal user
+tables. Adding additional tables to the allowed set is easy implementation
+wise, but every transaction changing such tables incurs a noticeably higher
+overhead.
+============
+
+For now transactions won't be decoded/output in parallel. There are ideas to
+improve on this, but we don't think the complexity is appropriate for the first
+release of this feature.
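+
+To give an impression of what filling out these callbacks looks like, here is
+a trimmed-down sketch in the spirit of the textual demo shown earlier. The
+exact signatures and the 'Change' structure are illustrative, not the final
+API; 'change_to_string' is a made-up helper.
+
+.Sketch: a minimal textual output plugin (illustrative signatures)
+------------------------------------------------------------------------------
+typedef struct Change Change;                   /* placeholder: one reassembled change */
+extern char *change_to_string(Change *change);  /* made-up helper */
+
+static void
+demo_begin(TransactionId xid)
+{
+    elog(WARNING, "BEGIN %u", xid);
+}
+
+static void
+demo_change(TransactionId xid, TransactionId subxid,
+            Change *change, Snapshot mvcc_snapshot)
+{
+    /*
+     * The timetraveling SnapshotNow is already set up here, so syscache and
+     * relcache lookups (e.g. fetching the tuple descriptor) see the catalog
+     * as it was when this change was written to the WAL.
+     */
+    elog(WARNING, "tuple is: %s", change_to_string(change));
+}
+
+static void
+demo_commit(TransactionId xid)
+{
+    elog(WARNING, "COMMIT %u", xid);
+}
+------------------------------------------------------------------------------
+
+A replication consumer would serialize each change into its wire format
+instead of printing it; the callback structure, and the one-transaction-at-a-time
+handoff described above, stay the same.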
+ +This is an adoption barrier for databases where large amounts of data get +loaded/written in one transaction. + +=== Setup of replication nodes === + +When setting up a new standby/consumer of a primary some problem exist +independent of the implementation of the consumer. The gist of the problem is +that when making a base backup and starting to stream all changes since that +point transactions that were running during all this cannot be included: + +* Transaction that have not committed before starting to dump a database are + invisible to the dumping process + +* Transactions that began before the point from which on the WAL is being + decoded are incomplete and cannot be replayed + +Our proposal for a solution to this is to detect points in the WAL stream where we can provide: + +. A snapshot exported similarly to pg_export_snapshot() footnote:[http://www.postgresql.org/docs/devel/static/functions-admin.html#FUNCTIONS-SNAPSHOT-SYNCHRONIZATION] that can be imported with +SET TRANSACTION SNAPSHOT+ footnote:[http://www.postgresql.org/docs/devel/static/sql-set-transaction.html] +. A stream of changes that will include the complete data of all transactions seen as running by the snapshot generated in 1) + +See the diagram. + +[[setup-schema]] +.Control flow during setup of a new node +["ditaa",scaling="0.7"] +------------------------------------------------------------------------------ ++----------------+ +| Walsender | | +------------+ +| v | | Consumer | ++-------------+ |<--IDENTIFY_SYSTEM-------------| | +| WAL | | | | +| decoding | |----....---------------------->| | ++------+------/ | | | +| | | | | +| v | | | ++-------------+ |<--INIT_LOGICAL $PLUGIN--------| | +| TX | | | | +| reassembly | |---FOUND_STARTING %X/%X------->| | ++-------------/ | | | +| | |---FOUND_CONSISTENT %X/%X----->| | +| v |---pg_dump snapshot----------->| | ++-------------+ |---replication slot %P-------->| | +| Output | | | | +| Plugin | | ^ | | ++-------------/ | | | | +| | +-run pg_dump separately --| | +| | | | +| |<--STREAM_DATA-----------------| | +| | | | +| |---data ---------------------->| | +| | | | +| | | | +| | ---- SHUTDOWN ------------- | | +| | | | +| | | | +| |<--RESTART_LOGICAL $PLUGIN %P--| | +| | | | +| |---data----------------------->| | +| | | | +| | | | ++----------------+ +------------+ + +------------------------------------------------------------------------------ + +=== Disadvantages of the approach === + +* somewhat intricate code for snapshot timetravel +* output plugins/walsenders need to work per database as they access the catalog +* when sending to multiple standbys some work is done multiple times +* decoding/applying multiple transactions in parallel is somewhat hard diff --git a/src/backend/replication/logical/Makefile b/src/backend/replication/logical/Makefile index 310a45c..6fae278 100644 --- a/src/backend/replication/logical/Makefile +++ b/src/backend/replication/logical/Makefile @@ -17,3 +17,9 @@ override CPPFLAGS := -I$(srcdir) $(CPPFLAGS) OBJS = decode.o logical.o logicalfuncs.o reorderbuffer.o snapbuild.o include $(top_srcdir)/src/backend/common.mk + +DESIGN.pdf: DESIGN.txt + a2x -v --fop -f pdf -D $(shell pwd) $< + +README.SNAPBUILD.pdf: README.SNAPBUILD.txt + a2x -v --fop -f pdf -D $(shell pwd) $< diff --git a/src/backend/replication/logical/README.SNAPBUILD.txt b/src/backend/replication/logical/README.SNAPBUILD.txt new file mode 100644 index 0000000..b6c7470 --- /dev/null +++ b/src/backend/replication/logical/README.SNAPBUILD.txt @@ -0,0 +1,241 @@ += Snapshot 
Building =
+:author: Andres Freund, 2ndQuadrant Ltd
+
+== Why do we need timetravel catalog access ==
+
+When doing WAL decoding (see DESIGN.txt for reasons to do so), we need to know
+how the catalog looked at the point a record was inserted into the WAL, because
+without that information we don't know much more about the record other than
+its length. It's just an arbitrary bunch of bytes without further information.
+Unfortunately, due to the possibility that the table definition might change we
+cannot just access a newer version of the catalog and assume the table
+definition continues to be the same.
+
+If only the type information were required, it might be enough to annotate the
+wal records with a bit more information (table oid, table name, column name,
+column type) --- but as we want to be able to convert the output to more useful
+formats such as text, we additionally need to be able to call output functions.
+Those need a normal environment including the usual caches and normal catalog
+access to look up operators, functions and other types.
+
+Our solution to this is to add the capability to access the catalog as it
+was at the time the record was inserted into the WAL. The locking used during
+WAL generation guarantees the catalog is/was in a consistent state at that
+point. We call this 'time-travel catalog access'.
+
+Interesting cases include:
+
+- enums
+- composite types
+- extension types
+- non-C functions
+- relfilenode to table OID mapping
+
+Due to postgres' non-overwriting storage manager, regular modifications of a
+table's content are theoretically non-destructive. The problem is that there is
+no way to access an arbitrary point in time even if the data for it is there.
+
+This module adds the capability to do so in the very limited set of
+circumstances we need it in for WAL decoding. It does *not* provide a general
+time-travelling facility.
+
+A 'Snapshot' is the data structure used in postgres to describe which tuples
+are visible and which are not. We need to build a Snapshot which can be used to
+access the catalog the way it looked when the wal record was inserted.
+
+Restrictions:
+
+- Only works for catalog tables or tables explicitly marked as such.
+- Snapshot modifications are somewhat expensive
+- it cannot build initial visibility information for every point in time, it
+  needs specific circumstances to start.
+
+== How are time-travel snapshots built ==
+
+'Hot Standby' added infrastructure to build snapshots from WAL during recovery in
+the 9.0 release. Most of that can be reused for our purposes.
+
+We cannot reuse all of the hot standby infrastructure because:
+
+- we are not in recovery
+- we need to look at interim states *inside* a transaction
+- we need the capability to have multiple different snapshots around at the same time
+
+Normally the catalog is accessed using SnapshotNow, which can legally be
+replaced by an MVCC snapshot taken at the start of a scan. So catalog
+timetravel contains infrastructure to make SnapshotNow catalog access use
+appropriate MVCC snapshots. They aren't generated with GetSnapshotData()
+though, but reassembled from WAL contents.
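+
+The next section describes how the fields of such a reassembled snapshot are
+(re)used; under that convention, with +xip+ holding the *committed*
+catalog-modifying transactions, a deliberately simplified visibility check for
+a catalog tuple's xmin could look like the sketch below. It leaves out clog
+checks, subtransactions and command id handling.
+
+------------------------------------------------------------------------------
+static bool
+xid_visible_for_decoding(Snapshot snap, TransactionId xid)
+{
+    int     i;
+
+    /* transactions that finished before we started collecting are visible */
+    if (TransactionIdPrecedes(xid, snap->xmin))
+        return true;
+
+    /* nothing at or after xmax can be visible yet */
+    if (TransactionIdFollowsOrEquals(xid, snap->xmax))
+        return false;
+
+    /* in between: visible only if we saw the transaction commit */
+    for (i = 0; i < snap->xcnt; i++)
+    {
+        if (TransactionIdEquals(snap->xip[i], xid))
+            return true;
+    }
+
+    return false;
+}
+------------------------------------------------------------------------------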
+
+We collect our data in a normal struct SnapshotData, repurposing some fields
+creatively:
+
+- +Snapshot->xip+ contains all transactions we consider committed
+- +Snapshot->subxip+ contains all transactions belonging to our transaction,
+  including the toplevel one
+- +Snapshot->active_count+ is used as a refcount
+
+The meaning of +xip+ is inverted in comparison with non-timetravel snapshots in
+the sense that members of the array are the committed transactions, not the in
+progress ones. Because usually only a tiny percentage of committed transactions
+will have modified the catalog between xmin and xmax this allows us to keep the
+array small in the usual cases. It also makes subtransaction handling easier
+since we neither need to query pg_subtrans (which we couldn't anyway since it's
+truncated at restart) nor have problems with suboverflowed snapshots.
+
+== Building of initial snapshot ==
+
+We can start building an initial snapshot as soon as we find either an
++XLOG_RUNNING_XACTS+ or an +XLOG_CHECKPOINT_SHUTDOWN+ record because they allow us
+to know how many transactions are running.
+
+We need to know which transactions were running when we start to build a
+snapshot/start decoding as we don't have enough information about them (they
+could have done catalog modifications before we started watching). Also, we
+wouldn't have the complete contents of those transactions, because we started
+reading after they began. (The latter is also important when building
+snapshots that can be used to build a consistent initial clone.)
+
+There also is the problem that +XLOG_RUNNING_XACTS+ records can be
+'suboverflowed' which means there were more running subtransactions than
+fit into shared memory. In that case we use the same incremental building
+trick hot standby uses, which is to either
+
+1. wait until further +XLOG_RUNNING_XACTS+ records have a running->oldestRunningXid
+beyond the initial xl_running_xacts->nextXid
+2. wait for a further +XLOG_RUNNING_XACTS+ that is not overflowed or
+a +XLOG_CHECKPOINT_SHUTDOWN+
+
+When we start building a snapshot we are in the +SNAPBUILD_START+ state. As
+soon as we find any visibility information, even if incomplete, we change to
++SNAPBUILD_INITIAL_POINT+.
+
+When we have collected enough information to decode any transaction starting
+after that point in time we switch to +SNAPBUILD_FULL_SNAPSHOT+. If those
+transactions commit before the next state is reached, we throw their complete
+contents away.
+
+As soon as all transactions that were running when we switched over to
++SNAPBUILD_FULL_SNAPSHOT+ commit, we change state to +SNAPBUILD_CONSISTENT+.
+Every transaction that commits from now on gets handed to the output plugin.
+When doing the switch to +SNAPBUILD_CONSISTENT+ we optionally export a snapshot
+which makes all transactions that committed up to this point visible. This
+exported snapshot can be used to run pg_dump; replaying all changes emitted
+by the output plugin on a database restored from such a dump will result in
+a consistent clone.
+
+["ditaa",scaling="0.8"]
+---------------
+
+       +-------------------------+
+  +----|SNAPBUILD_START          |-------------+
+  |    +-------------------------+             |
+  |                 |                          |
+  |                 |                          |
+  |    running_xacts with running xacts        |
+  |                 |                          |
+  |                 |                          |
+  |                 v                          |
+  |    +-------------------------+             v
+  |    |SNAPBUILD_FULL_SNAPSHOT  |------------>|
+  |    +-------------------------+             |
+XLOG_RUNNING_XACTS  |                   saved snapshot
+ with zero xacts    |               at running_xacts's lsn
+  |                 |                          |
+  |   all running toplevel TXNs finished       |
+  |                 |                          |
+  |                 v                          |
+  |    +-------------------------+             |
+  +--->|SNAPBUILD_CONSISTENT     |<------------+
+       +-------------------------+
+
+---------------
+
+== Snapshot Management ==
+
+Whenever a transaction is detected as having started during decoding in
++SNAPBUILD_FULL_SNAPSHOT+ state, we distribute the currently maintained
+snapshot to it (i.e. call ReorderBufferSetBaseSnapshot). This serves as its
+initial snapshot. Unless there are concurrent catalog changes, that snapshot
+will be used for decoding the entire transaction's changes.
+
+Whenever a transaction-with-catalog-changes commits, we iterate over all
+concurrently active transactions and add a new SnapshotNow to each of them
+(ReorderBufferAddSnapshot(current_lsn)). This is required because any row
+written from that point on will have used the changed catalog contents.
+
+When decoding a transaction that made catalog changes itself, we tell that
+transaction about them (ReorderBufferAddNewCommandId(current_lsn)), which will
+cause the decoding to use the appropriate command id from that point on.
+
+These SnapshotNows need to be set up globally so the syscache and other pieces
+access them transparently. This is done using two new tqual.h functions:
+SetupDecodingSnapshots() and RevertFromDecodingSnapshots().
+
+== Catalog/User Table Detection ==
+
+Since we only want to store committed transactions that actually modified the
+catalog we need a way to detect that from WAL:
+
+Right now, we assume that every transaction that commits before we reach
++SNAPBUILD_CONSISTENT+ state has made catalog modifications since we can't rely
+on having seen the entire transaction before that. That's not harmful besides
+incurring some price in memory usage and runtime.
+
+After having reached consistency we recognize catalog modifying transactions
+via the HEAP2_NEW_CID and HEAP_INPLACE records that are logged by catalog
+modifying actions.
+
+== mixed DDL/DML transaction handling ==
+
+When a transaction uses both DDL and DML, things get a bit more complicated
+because we need to handle CommandIds and ComboCids, as we need to use the
+correct version of the catalog when decoding the individual tuples.
+
+For that we emit the new HEAP2_NEW_CID records, which contain the physical tuple
+location, cmin and cmax, when the catalog is modified. If we need to detect
+visibility of a catalog tuple that has been modified in our own transaction -
+which we can detect via xmin/xmax - we look in a hash table using the location
+as key to get the correct cmin/cmax values.
+From those values we can also extract the commandid that generated the record.
+
+All this only needs to happen in the transaction performing the DDL.
+
+== Cache Handling ==
+
+As we allow usage of the normal {sys,cat,rel,..}cache we also need to integrate
+cache invalidation. For transactions that only do DDL that's easy as everything
+is already provided by HS. Every time we read a commit record we apply the
+sinval messages contained therein.
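+
+A sketch of that simple case, assuming the invalidation messages have already
+been extracted from the decoded commit record:
+
+------------------------------------------------------------------------------
+static void
+apply_commit_invalidations(SharedInvalidationMessage *msgs, int nmsgs)
+{
+    int     i;
+
+    /* replay the sinval traffic so stale syscache/relcache entries go away */
+    for (i = 0; i < nmsgs; i++)
+        LocalExecuteInvalidationMessage(&msgs[i]);
+}
+------------------------------------------------------------------------------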
+
+For transactions that contain DDL and DML, cache invalidation needs to happen
+more frequently because we need to tear down all caches that just got
+modified. To do that we simply take all invalidation messages that got
+collected at the end of the transaction and apply them every time we've decoded
+a single change. At some point this can get optimized by generating new local
+invalidation messages, but that seems too complicated for now.
+
+XXX: talk about syscache handling of relmapped relations.
+
+== xmin Horizon Handling ==
+
+Reusing MVCC for timetravel access has one obvious major problem: VACUUM. Rows
+we still need for decoding cannot be removed but at the same time we cannot
+keep data in the catalog indefinitely.
+
+For that we peg the xmin horizon that's used to decide which rows can be
+removed. We only need to prevent removal of those rows for catalog-like
+relations, not for all user tables. For that reason a separate xmin horizon
+RecentGlobalDataXmin got introduced.
+
+Since we need to persist that knowledge across restarts we keep the xmin in
+the logical slots, which are saved in a crash-safe manner. They are restored
+from disk into memory at server startup.
+
+== Restartable Decoding ==
+
+As we want to generate a consistent stream of changes we need to have the
+ability to start from a previously decoded location without waiting possibly
+very long to reach consistency. For that reason we dump the current visibility
+information to disk every time we read an xl_running_xacts record.
+
-- 
1.8.2.rc2.4.g7799588.dirty