Home > mailing lists
Re: [PATCHES] Infrastructure changes for recovery (v8) - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: [PATCHES] Infrastructure changes for recovery (v8)
Date	December 22, 2008 16:19:13
Msg-id	494FF631.90908@enterprisedb.com Whole thread Raw
In response to	Re: [PATCHES] Infrastructure changes for recovery (v8) (Simon Riggs <simon@2ndQuadrant.com>)
Responses	Re: [PATCHES] Infrastructure changes for recovery (v8) Re: [PATCHES] Infrastructure changes for recovery (v8)
List	pgsql-hackers
Tree view
Simon Riggs wrote:
> On Wed, 2008-12-17 at 23:32 -0300, Alvaro Herrera wrote:
>> Simon Riggs escribió:
>>
>>> Please let me know how I can make the reviewer's job easier. Diagrams,
>>> writeups, whatever. Thanks,
>> A link perhaps?
>
> There is much confusion on this point for which I'm very sorry.
>
> I originally wrote "infra" patch to allow it to be committed separately
> in the Sept commitfest, to reduce size of the forthcoming hotstandby
> patch. That didn't happen (no moans there) so the eventual "hotstandby"
> patch includes all of what was the infra patch, plus the new code.
>
> So currently there is no separate "infra" patch. The two line items on
> the CommitFest page are really just one large project. I would be in
> favour of removing the "infra" lines from the CommitFest page.

I think it's useful to review the "infra" part of the patch separately,
so I split it out of the big patch again. I haven't looked at this in
detail yet, but it compiles and passes regression tests.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
*** src/backend/access/transam/clog.c
--- src/backend/access/transam/clog.c
***************
*** 475,480 **** ZeroCLOGPage(int pageno, bool writeXlog)
--- 475,483 ----
  /*
   * This must be called ONCE during postmaster or standalone-backend startup,
   * after StartupXLOG has initialized ShmemVariableCache->nextXid.
+  *
+  * We access just a single clog page, so this action is atomic and safe
+  * for use if other processes are active during recovery.
   */
  void
  StartupCLOG(void)
*** src/backend/access/transam/multixact.c
--- src/backend/access/transam/multixact.c
***************
*** 1413,1420 **** ZeroMultiXactMemberPage(int pageno, bool writeXlog)
   * MultiXactSetNextMXact and/or MultiXactAdvanceNextMXact.    Note that we
   * may already have replayed WAL data into the SLRU files.
   *
!  * We don't need any locks here, really; the SLRU locks are taken
!  * only because slru.c expects to be called with locks held.
   */
  void
  StartupMultiXact(void)
--- 1413,1423 ----
   * MultiXactSetNextMXact and/or MultiXactAdvanceNextMXact.    Note that we
   * may already have replayed WAL data into the SLRU files.
   *
!  * We want this operation to be atomic to ensure that other processes can
!  * use MultiXact while we complete recovery. We access one page only from the
!  * offset and members buffers, so once locks are acquired they will not be
!  * dropped and re-acquired by SLRU code. So we take both locks at start, then
!  * hold them all the way to the end.
   */
  void
  StartupMultiXact(void)
***************
*** 1426,1431 **** StartupMultiXact(void)
--- 1429,1435 ----

      /* Clean up offsets state */
      LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);
+     LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);

      /*
       * Initialize our idea of the latest page number.
***************
*** 1452,1461 **** StartupMultiXact(void)
          MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
      }

-     LWLockRelease(MultiXactOffsetControlLock);
-
      /* And the same for members */
-     LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);

      /*
       * Initialize our idea of the latest page number.
--- 1456,1462 ----
***************
*** 1483,1488 **** StartupMultiXact(void)
--- 1484,1490 ----
      }

      LWLockRelease(MultiXactMemberControlLock);
+     LWLockRelease(MultiXactOffsetControlLock);

      /*
       * Initialize lastTruncationPoint to invalid, ensuring that the first
***************
*** 1542,1549 **** CheckPointMultiXact(void)
       * isn't valid (because StartupMultiXact hasn't been called yet) and so
       * SimpleLruTruncate would get confused.  It seems best not to risk
       * removing any data during recovery anyway, so don't truncate.
       */
!     if (!InRecovery)
          TruncateMultiXact();

      TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
--- 1544,1552 ----
       * isn't valid (because StartupMultiXact hasn't been called yet) and so
       * SimpleLruTruncate would get confused.  It seems best not to risk
       * removing any data during recovery anyway, so don't truncate.
+      * We are executing in the bgwriter, so we must access shared status.
       */
!     if (!IsRecoveryProcessingMode())
          TruncateMultiXact();

      TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
*** src/backend/access/transam/slru.c
--- src/backend/access/transam/slru.c
***************
*** 598,604 **** SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
       * commands to set the commit status of transactions whose bits are in
       * already-truncated segments of the commit log (see notes in
       * SlruPhysicalWritePage).    Hence, if we are InRecovery, allow the case
!      * where the file doesn't exist, and return zeroes instead.
       */
      fd = BasicOpenFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
      if (fd < 0)
--- 598,605 ----
       * commands to set the commit status of transactions whose bits are in
       * already-truncated segments of the commit log (see notes in
       * SlruPhysicalWritePage).    Hence, if we are InRecovery, allow the case
!      * where the file doesn't exist, and return zeroes instead. We also
!      * return a zeroed page when seek and read fails.
       */
      fd = BasicOpenFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
      if (fd < 0)
***************
*** 619,624 **** SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
--- 620,633 ----

      if (lseek(fd, (off_t) offset, SEEK_SET) < 0)
      {
+         if (InRecovery)
+         {
+             ereport(LOG,
+                     (errmsg("file \"%s\" doesn't exist, reading as zeroes",
+                             path)));
+             MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
+             return true;
+         }
          slru_errcause = SLRU_SEEK_FAILED;
          slru_errno = errno;
          close(fd);
***************
*** 628,633 **** SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
--- 637,650 ----
      errno = 0;
      if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
      {
+         if (InRecovery)
+         {
+             ereport(LOG,
+                     (errmsg("file \"%s\" doesn't exist, reading as zeroes",
+                             path)));
+             MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
+             return true;
+         }
          slru_errcause = SLRU_READ_FAILED;
          slru_errno = errno;
          close(fd);
*** src/backend/access/transam/subtrans.c
--- src/backend/access/transam/subtrans.c
***************
*** 223,255 **** ZeroSUBTRANSPage(int pageno)
  /*
   * This must be called ONCE during postmaster or standalone-backend startup,
   * after StartupXLOG has initialized ShmemVariableCache->nextXid.
-  *
-  * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
-  * if there are none.
   */
  void
  StartupSUBTRANS(TransactionId oldestActiveXID)
  {
!     int            startPage;
!     int            endPage;

-     /*
-      * Since we don't expect pg_subtrans to be valid across crashes, we
-      * initialize the currently-active page(s) to zeroes during startup.
-      * Whenever we advance into a new page, ExtendSUBTRANS will likewise zero
-      * the new page without regard to whatever was previously on disk.
-      */
      LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE);

!     startPage = TransactionIdToPage(oldestActiveXID);
!     endPage = TransactionIdToPage(ShmemVariableCache->nextXid);
!
!     while (startPage != endPage)
!     {
!         (void) ZeroSUBTRANSPage(startPage);
!         startPage++;
!     }
!     (void) ZeroSUBTRANSPage(startPage);

      LWLockRelease(SubtransControlLock);
  }
--- 223,241 ----
  /*
   * This must be called ONCE during postmaster or standalone-backend startup,
   * after StartupXLOG has initialized ShmemVariableCache->nextXid.
   */
  void
  StartupSUBTRANS(TransactionId oldestActiveXID)
  {
!     TransactionId xid = ShmemVariableCache->nextXid;
!     int            pageno = TransactionIdToPage(xid);

      LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE);

!     /*
!      * Initialize our idea of the latest page number.
!      */
!     SubTransCtl->shared->latest_page_number = pageno;

      LWLockRelease(SubtransControlLock);
  }
*** src/backend/access/transam/xact.c
--- src/backend/access/transam/xact.c
***************
*** 40,45 ****
--- 40,46 ----
  #include "storage/fd.h"
  #include "storage/lmgr.h"
  #include "storage/procarray.h"
+ #include "storage/sinval.h"
  #include "storage/sinvaladt.h"
  #include "storage/smgr.h"
  #include "utils/combocid.h"
*** src/backend/access/transam/xlog.c
--- src/backend/access/transam/xlog.c
***************
*** 114,120 **** CheckpointStatsData CheckpointStats;

  /*
   * ThisTimeLineID will be same in all backends --- it identifies current
!  * WAL timeline for the database system.
   */
  TimeLineID    ThisTimeLineID = 0;

--- 114,121 ----

  /*
   * ThisTimeLineID will be same in all backends --- it identifies current
!  * WAL timeline for the database system. Zero is always a bug, so we
!  * start with that to allow us to spot any errors.
   */
  TimeLineID    ThisTimeLineID = 0;

***************
*** 122,128 **** TimeLineID    ThisTimeLineID = 0;
  bool        InRecovery = false;

  /* Are we recovering using offline XLOG archives? */
! static bool InArchiveRecovery = false;

  /* Was the last xlog file restored from archive, or local? */
  static bool restoredFromArchive = false;
--- 123,136 ----
  bool        InRecovery = false;

  /* Are we recovering using offline XLOG archives? */
! bool         InArchiveRecovery = false;
!
! /* Local copy of shared RecoveryProcessingMode state */
! static bool LocalRecoveryProcessingMode = true;
! static bool knownProcessingMode = false;
!
! /* is the database proven consistent yet? */
! bool    reachedSafeStartPoint = false;

  /* Was the last xlog file restored from archive, or local? */
  static bool restoredFromArchive = false;
***************
*** 241,250 **** static XLogRecPtr RedoRecPtr;
   * ControlFileLock: must be held to read/update control file or create
   * new log file.
   *
!  * CheckpointLock: must be held to do a checkpoint (ensures only one
!  * checkpointer at a time; currently, with all checkpoints done by the
!  * bgwriter, this is just pro forma).
   *
   *----------
   */

--- 249,278 ----
   * ControlFileLock: must be held to read/update control file or create
   * new log file.
   *
!  * CheckpointLock: must be held to do a checkpoint or restartpoint, ensuring
!  * we get just one of those at any time. In 8.4+ recovery, both startup and
!  * bgwriter processes may take restartpoints, so this locking must be strict
!  * to ensure there are no mistakes.
   *
+  * In 8.4 we progress through a number of states at startup. Initially, the
+  * postmaster is in PM_STARTUP state and spawns the Startup process. We then
+  * progress until the database is in a consistent state, then if we are in
+  * InArchiveRecovery we go into PM_RECOVERY state. The bgwriter then starts
+  * up and takes over responsibility for performing restartpoints. We then
+  * progress until the end of recovery when we enter PM_RUN state upon
+  * termination of the Startup process. In summary:
+  *
+  * PM_STARTUP state:    Startup process performs restartpoints
+  * PM_RECOVERY state:    bgwriter process performs restartpoints
+  * PM_RUN state:         bgwriter process performs checkpoints
+  *
+  * These transitions are fairly delicate, with many things that need to
+  * happen at the same time in order to change state successfully throughout
+  * the system. Changing PM_STARTUP to PM_RECOVERY only occurs when we can
+  * prove the databases are in a consistent state. Changing from PM_RECOVERY
+  * to PM_RUN happens whenever recovery ends, which could be forced upon us
+  * externally or it can occur because of damage or termination of the WAL
+  * sequence.
   *----------
   */

***************
*** 312,317 **** typedef struct XLogCtlData
--- 340,372 ----
      int            XLogCacheBlck;    /* highest allocated xlog buffer index */
      TimeLineID    ThisTimeLineID;

+     /*
+      * IsRecoveryProcessingMode shows whether the postmaster is in a
+      * postmaster state earlier than PM_RUN, or not. This is a globally
+      * accessible state to allow EXEC_BACKEND case.
+      *
+      * We also retain a local state variable InRecovery. InRecovery=true
+      * means the code is being executed by Startup process and therefore
+      * always during Recovery Processing Mode. This allows us to identify
+      * code executed *during* Recovery Processing Mode but not necessarily
+      * by Startup process itself.
+      *
+      * This is only written to by the startup process, so no need for locking.
+      */
+     bool        SharedRecoveryProcessingMode;
+
+     /*
+      * recovery target control information
+      *
+      * Protected by info_lck
+      */
+     TransactionId    recoveryTargetXid;
+     TimestampTz        recoveryTargetTime;
+     int                recoveryTargetAdvance;
+
+     TimestampTz     recoveryLastXTime;
+     TransactionId     recoveryLastXid;
+
      slock_t        info_lck;        /* locks shared variables shown above */
  } XLogCtlData;

***************
*** 398,405 **** static void XLogArchiveCleanup(const char *xlog);
--- 453,462 ----
  static void readRecoveryCommandFile(void);
  static void exitArchiveRecovery(TimeLineID endTLI,
                      uint32 endLogId, uint32 endLogSeg);
+ static void exitRecovery(void);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
  static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
+ static XLogRecPtr GetRedoLocationForCheckpoint(void);

  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
                  XLogRecPtr *lsn, BkpBlock *bkpb);
***************
*** 482,487 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
--- 539,552 ----
      bool        updrqst;
      bool        doPageWrites;
      bool        isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+     bool        isRecoveryEnd = (rmid == RM_XLOG_ID &&
+                                     (info == XLOG_RECOVERY_END ||
+                                      info == XLOG_CHECKPOINT_ONLINE));
+
+     /* cross-check on whether we should be here or not */
+     if (IsRecoveryProcessingMode() && !isRecoveryEnd)
+         elog(FATAL, "cannot make new WAL entries during recovery "
+                     "(RMgrId = %d info = %d)", rmid, info);

      /* info's high bits are reserved for use by me */
      if (info & XLR_INFO_MASK)
***************
*** 1728,1735 **** XLogFlush(XLogRecPtr record)
      XLogRecPtr    WriteRqstPtr;
      XLogwrtRqst WriteRqst;

!     /* Disabled during REDO */
!     if (InRedo)
          return;

      /* Quick exit if already known flushed */
--- 1793,1799 ----
      XLogRecPtr    WriteRqstPtr;
      XLogwrtRqst WriteRqst;

!     if (IsRecoveryProcessingMode())
          return;

      /* Quick exit if already known flushed */
***************
*** 1817,1826 **** XLogFlush(XLogRecPtr record)
       * the bad page is encountered again during recovery then we would be
       * unable to restart the database at all!  (This scenario has actually
       * happened in the field several times with 7.1 releases. Note that we
!      * cannot get here while InRedo is true, but if the bad page is brought in
!      * and marked dirty during recovery then CreateCheckPoint will try to
!      * flush it at the end of recovery.)
       *
       * The current approach is to ERROR under normal conditions, but only
       * WARNING during recovery, so that the system can be brought up even if
       * there's a corrupt LSN.  Note that for calls from xact.c, the ERROR will
--- 1881,1891 ----
       * the bad page is encountered again during recovery then we would be
       * unable to restart the database at all!  (This scenario has actually
       * happened in the field several times with 7.1 releases. Note that we
!      * cannot get here while IsRecoveryProcessingMode(), but if the bad page is
!      * brought in and marked dirty during recovery, the next checkpoint after
!      * recovery will try to flush it.
       *
+      * XXX obsolete comment
       * The current approach is to ERROR under normal conditions, but only
       * WARNING during recovery, so that the system can be brought up even if
       * there's a corrupt LSN.  Note that for calls from xact.c, the ERROR will
***************
*** 1829,1835 **** XLogFlush(XLogRecPtr record)
       * and so we will not force a restart for a bad LSN on a data page.
       */
      if (XLByteLT(LogwrtResult.Flush, record))
!         elog(InRecovery ? WARNING : ERROR,
          "xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
               record.xlogid, record.xrecoff,
               LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
--- 1894,1900 ----
       * and so we will not force a restart for a bad LSN on a data page.
       */
      if (XLByteLT(LogwrtResult.Flush, record))
!         elog(ERROR,
          "xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
               record.xlogid, record.xrecoff,
               LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
***************
*** 2102,2108 **** XLogFileInit(uint32 log, uint32 seg,
          unlink(tmppath);
      }

!     elog(DEBUG2, "done creating and filling new WAL file");

      /* Set flag to tell caller there was no existent file */
      *use_existent = false;
--- 2167,2174 ----
          unlink(tmppath);
      }

!     XLogFileName(tmppath, ThisTimeLineID, log, seg);
!     elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);

      /* Set flag to tell caller there was no existent file */
      *use_existent = false;
***************
*** 2408,2413 **** XLogFileRead(uint32 log, uint32 seg, int emode)
--- 2474,2501 ----
                       xlogfname);
              set_ps_display(activitymsg, false);

+             /*
+              * Calculate and write out a new safeStartPoint. This defines
+              * the latest LSN that might appear on-disk while we apply
+              * the WAL records in this file. If we crash during recovery
+              * we must reach this point again before we can prove
+              * database consistency. Not a restartpoint! Restart points
+              * define where we should start recovery from, if we crash.
+              */
+             if (InArchiveRecovery)
+             {
+                 uint32 nextLog = log;
+                 uint32 nextSeg = seg;
+
+                 NextLogSeg(nextLog, nextSeg);
+
+                 LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+                 ControlFile->minSafeStartPoint.xlogid = nextLog;
+                 ControlFile->minSafeStartPoint.xrecoff = nextSeg * XLogSegSize;
+                 UpdateControlFile();
+                 LWLockRelease(ControlFileLock);
+             }
+
              return fd;
          }
          if (errno != ENOENT)    /* unexpected failure? */
***************
*** 4733,4754 **** exitArchiveRecovery(TimeLineID endTLI, uint32 endLogId, uint32 endLogSeg)
      unlink(recoveryPath);        /* ignore any error */

      /*
!      * Rename the config file out of the way, so that we don't accidentally
!      * re-enter archive recovery mode in a subsequent crash.
       */
-     unlink(RECOVERY_COMMAND_DONE);
-     if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
-         ereport(FATAL,
-                 (errcode_for_file_access(),
-                  errmsg("could not rename file \"%s\" to \"%s\": %m",
-                         RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));

      ereport(LOG,
              (errmsg("archive recovery complete")));
  }

  /*
!  * For point-in-time recovery, this function decides whether we want to
   * stop applying the XLOG at or after the current record.
   *
   * Returns TRUE if we are stopping, FALSE otherwise.  On TRUE return,
--- 4821,4840 ----
      unlink(recoveryPath);        /* ignore any error */

      /*
!      * As of 8.4 we no longer rename the recovery.conf file out of the
!      * way until after we have performed a full checkpoint. This ensures
!      * that any crash between now and the end of the checkpoint does not
!      * attempt to restart from a WAL file that is no longer available to us.
!      * As soon as we remove recovery.conf we lose our recovery_command and
!      * cannot reaccess WAL files from the archive.
       */

      ereport(LOG,
              (errmsg("archive recovery complete")));
  }

  /*
!  * For archive recovery, this function decides whether we want to
   * stop applying the XLOG at or after the current record.
   *
   * Returns TRUE if we are stopping, FALSE otherwise.  On TRUE return,
***************
*** 4876,4881 **** StartupXLOG(void)
--- 4962,4968 ----
      CheckPoint    checkPoint;
      bool        wasShutdown;
      bool        reachedStopPoint = false;
+     bool        performedRecovery = false;
      bool        haveBackupLabel = false;
      XLogRecPtr    RecPtr,
                  LastRec,
***************
*** 4888,4893 **** StartupXLOG(void)
--- 4975,4982 ----
      uint32        freespace;
      TransactionId oldestActiveXID;

+     XLogCtl->SharedRecoveryProcessingMode = true;
+
      /*
       * Read control file and check XLOG status looks valid.
       *
***************
*** 5108,5116 **** StartupXLOG(void)
--- 5197,5211 ----
          if (minRecoveryLoc.xlogid != 0 || minRecoveryLoc.xrecoff != 0)
              ControlFile->minRecoveryPoint = minRecoveryLoc;
          ControlFile->time = (pg_time_t) time(NULL);
+         /* No need to hold ControlFileLock yet, we aren't up far enough */
          UpdateControlFile();

          /*
+          * Reset pgstat data, because it may be invalid after recovery.
+          */
+         pgstat_reset_all();
+
+         /*
           * If there was a backup label file, it's done its job and the info
           * has now been propagated into pg_control.  We must get rid of the
           * label file so that if we crash during recovery, we'll pick up at
***************
*** 5220,5225 **** StartupXLOG(void)
--- 5315,5348 ----

                  LastRec = ReadRecPtr;

+                 /*
+                  * Can we signal Postmaster to enter consistent recovery mode?
+                  *
+                  * There are two points in the log that we must pass. The first
+                  * is minRecoveryPoint, which is the LSN at the time the
+                  * base backup was taken that we are about to rollforward from.
+                  * If recovery has ever crashed or was stopped there is also
+                  * another point also: minSafeStartPoint, which we know the
+                  * latest LSN that recovery could have reached prior to crash.
+                  *
+                  * We must also have assembled sufficient information about
+                  * transaction state to allow valid snapshots to be taken.
+                  */
+                 if (!reachedSafeStartPoint &&
+                      XLByteLE(ControlFile->minSafeStartPoint, EndRecPtr) &&
+                      XLByteLE(ControlFile->minRecoveryPoint, EndRecPtr))
+                 {
+                     reachedSafeStartPoint = true;
+                     if (InArchiveRecovery)
+                     {
+                         ereport(LOG,
+                             (errmsg("database has now reached consistent state at %X/%X",
+                                 EndRecPtr.xlogid, EndRecPtr.xrecoff)));
+                         if (IsUnderPostmaster)
+                             SendPostmasterSignal(PMSIGNAL_RECOVERY_START);
+                     }
+                 }
+
                  record = ReadRecord(NULL, LOG);
              } while (record != NULL && recoveryContinue);

***************
*** 5241,5246 **** StartupXLOG(void)
--- 5364,5370 ----
              /* there are no WAL records following the checkpoint */
              ereport(LOG,
                      (errmsg("redo is not required")));
+             reachedSafeStartPoint = true;
          }
      }

***************
*** 5254,5269 **** StartupXLOG(void)

      /*
       * Complain if we did not roll forward far enough to render the backup
!      * dump consistent.
       */
!     if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint))
      {
          if (reachedStopPoint)    /* stopped because of stop request */
              ereport(FATAL,
                      (errmsg("requested recovery stop point is before end time of backup dump")));
          else    /* ran off end of WAL */
              ereport(FATAL,
!                     (errmsg("WAL ends before end time of backup dump")));
      }

      /*
--- 5378,5393 ----

      /*
       * Complain if we did not roll forward far enough to render the backup
!      * dump consistent and start safely.
       */
!     if (InArchiveRecovery && !reachedSafeStartPoint)
      {
          if (reachedStopPoint)    /* stopped because of stop request */
              ereport(FATAL,
                      (errmsg("requested recovery stop point is before end time of backup dump")));
          else    /* ran off end of WAL */
              ereport(FATAL,
!                     (errmsg("end of WAL reached before end time of backup dump")));
      }

      /*
***************
*** 5378,5416 **** StartupXLOG(void)
          XLogCheckInvalidPages();

          /*
!          * Reset pgstat data, because it may be invalid after recovery.
           */
!         pgstat_reset_all();

!         /*
!          * Perform a checkpoint to update all our recovery activity to disk.
!          *
!          * Note that we write a shutdown checkpoint rather than an on-line
!          * one. This is not particularly critical, but since we may be
!          * assigning a new TLI, using a shutdown checkpoint allows us to have
!          * the rule that TLI only changes in shutdown checkpoints, which
!          * allows some extra error checking in xlog_redo.
!          */
!         CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
      }

-     /*
-      * Preallocate additional log files, if wanted.
-      */
-     PreallocXlogFiles(EndOfLog);
-
-     /*
-      * Okay, we're officially UP.
-      */
-     InRecovery = false;
-
-     ControlFile->state = DB_IN_PRODUCTION;
-     ControlFile->time = (pg_time_t) time(NULL);
-     UpdateControlFile();
-
-     /* start the archive_timeout timer running */
-     XLogCtl->Write.lastSegSwitchTime = ControlFile->time;
-
      /* initialize shared-memory copy of latest checkpoint XID/epoch */
      XLogCtl->ckptXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
      XLogCtl->ckptXid = ControlFile->checkPointCopy.nextXid;
--- 5502,5515 ----
          XLogCheckInvalidPages();

          /*
!          * Finally exit recovery and mark that in WAL. Pre-8.4 we wrote
!          * a shutdown checkpoint here, but we ask bgwriter to do that now.
           */
!         exitRecovery();

!         performedRecovery = true;
      }

      /* initialize shared-memory copy of latest checkpoint XID/epoch */
      XLogCtl->ckptXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
      XLogCtl->ckptXid = ControlFile->checkPointCopy.nextXid;
***************
*** 5444,5449 **** StartupXLOG(void)
--- 5543,5641 ----
          readRecordBuf = NULL;
          readRecordBufSize = 0;
      }
+
+     /*
+      * Prior to 8.4 we wrote a Shutdown Checkpoint at the end of recovery.
+      * This could add minutes to the startup time, so we want bgwriter
+      * to perform it. This then frees the Startup process to complete so we can
+      * allow transactions and WAL inserts. We still write a checkpoint, but
+      * it will be an online checkpoint. Online checkpoints have a redo
+      * location that can be prior to the actual checkpoint record. So we want
+      * to derive that redo location *before* we let anybody else write WAL,
+      * otherwise we might miss some WAL records if we crash.
+      */
+     if (performedRecovery)
+     {
+         XLogRecPtr    redo;
+
+         /*
+          * We must grab the pointer before anybody writes WAL
+          */
+         redo = GetRedoLocationForCheckpoint();
+
+         /*
+          * Set up information for the bgwriter, but if it is not active
+          * for whatever reason, perform the checkpoint ourselves.
+          */
+         if (SetRedoLocationForArchiveCheckpoint(redo))
+         {
+             /*
+              * Okay, we can come up now. Allow others to write WAL.
+              */
+             XLogCtl->SharedRecoveryProcessingMode = false;
+
+             /*
+              * Now request checkpoint from bgwriter.
+              */
+             RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_IMMEDIATE);
+         }
+         else
+         {
+             /*
+              * Startup process performs the checkpoint, but defers
+              * the change in processing mode until afterwards.
+              */
+             CreateCheckPoint(CHECKPOINT_FORCE | CHECKPOINT_IMMEDIATE);
+         }
+     }
+     else
+     {
+         /*
+          * No recovery, so lets just get on with it.
+          */
+         LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+         ControlFile->state = DB_IN_PRODUCTION;
+         ControlFile->time = (pg_time_t) time(NULL);
+         UpdateControlFile();
+         LWLockRelease(ControlFileLock);
+     }
+
+     /*
+      * Okay, we can come up now. Allow others to write WAL.
+      */
+     XLogCtl->SharedRecoveryProcessingMode = false;
+
+     /* start the archive_timeout timer running */
+     XLogCtl->Write.lastSegSwitchTime = (pg_time_t) time(NULL);
+ }
+
+ /*
+  * IsRecoveryProcessingMode()
+  *
+  * Fast test for whether we're still in recovery or not. We test the shared
+  * state each time only until we leave recovery mode. After that we never
+  * look again, relying upon the settings of our local state variables. This
+  * is designed to avoid the need for a separate initialisation step.
+  */
+ bool
+ IsRecoveryProcessingMode(void)
+ {
+     if (knownProcessingMode && !LocalRecoveryProcessingMode)
+         return false;
+
+     {
+         /* use volatile pointer to prevent code rearrangement */
+         volatile XLogCtlData *xlogctl = XLogCtl;
+
+         if (xlogctl == NULL)
+             return false;
+
+         LocalRecoveryProcessingMode = XLogCtl->SharedRecoveryProcessingMode;
+     }
+
+     knownProcessingMode = true;
+
+     return LocalRecoveryProcessingMode;
  }

  /*
***************
*** 5701,5720 **** ShutdownXLOG(int code, Datum arg)
  static void
  LogCheckpointStart(int flags)
  {
!     elog(LOG, "checkpoint starting:%s%s%s%s%s%s",
!          (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
!          (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
!          (flags & CHECKPOINT_FORCE) ? " force" : "",
!          (flags & CHECKPOINT_WAIT) ? " wait" : "",
!          (flags & CHECKPOINT_CAUSE_XLOG) ? " xlog" : "",
!          (flags & CHECKPOINT_CAUSE_TIME) ? " time" : "");
  }

  /*
   * Log end of a checkpoint.
   */
  static void
! LogCheckpointEnd(void)
  {
      long        write_secs,
                  sync_secs,
--- 5893,5916 ----
  static void
  LogCheckpointStart(int flags)
  {
!     if (flags & CHECKPOINT_RESTARTPOINT)
!         elog(LOG, "restartpoint starting:%s",
!             (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "");
!     else
!         elog(LOG, "checkpoint starting:%s%s%s%s%s%s",
!              (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
!              (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
!              (flags & CHECKPOINT_FORCE) ? " force" : "",
!              (flags & CHECKPOINT_WAIT) ? " wait" : "",
!              (flags & CHECKPOINT_CAUSE_XLOG) ? " xlog" : "",
!              (flags & CHECKPOINT_CAUSE_TIME) ? " time" : "");
  }

  /*
   * Log end of a checkpoint.
   */
  static void
! LogCheckpointEnd(int flags)
  {
      long        write_secs,
                  sync_secs,
***************
*** 5737,5753 **** LogCheckpointEnd(void)
                          CheckpointStats.ckpt_sync_end_t,
                          &sync_secs, &sync_usecs);

!     elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
!          "%d transaction log file(s) added, %d removed, %d recycled; "
!          "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
!          CheckpointStats.ckpt_bufs_written,
!          (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
!          CheckpointStats.ckpt_segs_added,
!          CheckpointStats.ckpt_segs_removed,
!          CheckpointStats.ckpt_segs_recycled,
!          write_secs, write_usecs / 1000,
!          sync_secs, sync_usecs / 1000,
!          total_secs, total_usecs / 1000);
  }

  /*
--- 5933,5958 ----
                          CheckpointStats.ckpt_sync_end_t,
                          &sync_secs, &sync_usecs);

!     if (flags & CHECKPOINT_RESTARTPOINT)
!         elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
!              "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
!              CheckpointStats.ckpt_bufs_written,
!              (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
!              write_secs, write_usecs / 1000,
!              sync_secs, sync_usecs / 1000,
!              total_secs, total_usecs / 1000);
!     else
!         elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
!              "%d transaction log file(s) added, %d removed, %d recycled; "
!              "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
!              CheckpointStats.ckpt_bufs_written,
!              (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
!              CheckpointStats.ckpt_segs_added,
!              CheckpointStats.ckpt_segs_removed,
!              CheckpointStats.ckpt_segs_recycled,
!              write_secs, write_usecs / 1000,
!              sync_secs, sync_usecs / 1000,
!              total_secs, total_usecs / 1000);
  }

  /*
***************
*** 5772,5788 **** CreateCheckPoint(int flags)
      XLogRecPtr    recptr;
      XLogCtlInsert *Insert = &XLogCtl->Insert;
      XLogRecData rdata;
-     uint32        freespace;
      uint32        _logId;
      uint32        _logSeg;
      TransactionId *inCommitXids;
      int            nInCommit;

      /*
       * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
!      * (This is just pro forma, since in the present system structure there is
!      * only one process that is allowed to issue checkpoints at any given
!      * time.)
       */
      LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

--- 5977,5992 ----
      XLogRecPtr    recptr;
      XLogCtlInsert *Insert = &XLogCtl->Insert;
      XLogRecData rdata;
      uint32        _logId;
      uint32        _logSeg;
      TransactionId *inCommitXids;
      int            nInCommit;
+     bool        leavingArchiveRecovery = false;

      /*
       * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
!      * That shouldn't be happening, but checkpoints are an important aspect
!      * of our resilience, so we take no chances.
       */
      LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

***************
*** 5797,5811 **** CreateCheckPoint(int flags)
--- 6001,6024 ----
      CheckpointStats.ckpt_start_t = GetCurrentTimestamp();

      /*
+      * Find out if this is the first checkpoint after archive recovery.
+      */
+     LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+     leavingArchiveRecovery = (ControlFile->state == DB_IN_ARCHIVE_RECOVERY);
+     LWLockRelease(ControlFileLock);
+
+     /*
       * Use a critical section to force system panic if we have trouble.
       */
      START_CRIT_SECTION();

      if (shutdown)
      {
+         LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
          ControlFile->state = DB_SHUTDOWNING;
          ControlFile->time = (pg_time_t) time(NULL);
          UpdateControlFile();
+         LWLockRelease(ControlFileLock);
      }

      /*
***************
*** 5861,5901 **** CreateCheckPoint(int flags)
          }
      }

!     /*
!      * Compute new REDO record ptr = location of next XLOG record.
!      *
!      * NB: this is NOT necessarily where the checkpoint record itself will be,
!      * since other backends may insert more XLOG records while we're off doing
!      * the buffer flush work.  Those XLOG records are logically after the
!      * checkpoint, even though physically before it.  Got that?
!      */
!     freespace = INSERT_FREESPACE(Insert);
!     if (freespace < SizeOfXLogRecord)
!     {
!         (void) AdvanceXLInsertBuffer(false);
!         /* OK to ignore update return flag, since we will do flush anyway */
!         freespace = INSERT_FREESPACE(Insert);
!     }
!     INSERT_RECPTR(checkPoint.redo, Insert, Insert->curridx);
!
!     /*
!      * Here we update the shared RedoRecPtr for future XLogInsert calls; this
!      * must be done while holding the insert lock AND the info_lck.
!      *
!      * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
!      * pointing past where it really needs to point.  This is okay; the only
!      * consequence is that XLogInsert might back up whole buffers that it
!      * didn't really need to.  We can't postpone advancing RedoRecPtr because
!      * XLogInserts that happen while we are dumping buffers must assume that
!      * their buffer changes are not included in the checkpoint.
!      */
      {
!         /* use volatile pointer to prevent code rearrangement */
!         volatile XLogCtlData *xlogctl = XLogCtl;
!
!         SpinLockAcquire(&xlogctl->info_lck);
!         RedoRecPtr = xlogctl->Insert.RedoRecPtr = checkPoint.redo;
!         SpinLockRelease(&xlogctl->info_lck);
      }

      /*
--- 6074,6092 ----
          }
      }

!     if (leavingArchiveRecovery)
!         checkPoint.redo = GetRedoLocationForArchiveCheckpoint();
!     else
      {
!         /*
!          * Compute new REDO record ptr = location of next XLOG record.
!          *
!          * NB: this is NOT necessarily where the checkpoint record itself will be,
!          * since other backends may insert more XLOG records while we're off doing
!          * the buffer flush work.  Those XLOG records are logically after the
!          * checkpoint, even though physically before it.  Got that?
!          */
!         checkPoint.redo = GetRedoLocationForCheckpoint();
      }

      /*
***************
*** 6013,6023 **** CreateCheckPoint(int flags)
      XLByteToSeg(ControlFile->checkPointCopy.redo, _logId, _logSeg);

      /*
!      * Update the control file.
       */
      LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
      if (shutdown)
          ControlFile->state = DB_SHUTDOWNED;
      ControlFile->prevCheckPoint = ControlFile->checkPoint;
      ControlFile->checkPoint = ProcLastRecPtr;
      ControlFile->checkPointCopy = checkPoint;
--- 6204,6221 ----
      XLByteToSeg(ControlFile->checkPointCopy.redo, _logId, _logSeg);

      /*
!      * Update the control file. In 8.4, this routine becomes the primary
!      * point for recording changes of state in the control file at the
!      * end of recovery. Postmaster state already shows us being in
!      * normal running mode, but it is only after this point that we
!      * are completely free of reperforming a recovery if we crash.  Note
!      * that this is executed by bgwriter after the death of Startup process.
       */
      LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
      if (shutdown)
          ControlFile->state = DB_SHUTDOWNED;
+     else
+         ControlFile->state = DB_IN_PRODUCTION;
      ControlFile->prevCheckPoint = ControlFile->checkPoint;
      ControlFile->checkPoint = ProcLastRecPtr;
      ControlFile->checkPointCopy = checkPoint;
***************
*** 6025,6030 **** CreateCheckPoint(int flags)
--- 6223,6243 ----
      UpdateControlFile();
      LWLockRelease(ControlFileLock);

+     if (leavingArchiveRecovery)
+     {
+         /*
+          * Rename the config file out of the way, so that we don't accidentally
+          * re-enter archive recovery mode in a subsequent crash. Prior to
+          * 8.4 this step was performed at end of exitArchiveRecovery().
+          */
+         unlink(RECOVERY_COMMAND_DONE);
+         if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
+             ereport(ERROR,
+                     (errcode_for_file_access(),
+                      errmsg("could not rename file \"%s\" to \"%s\": %m",
+                                 RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));
+     }
+
      /* Update shared-memory copy of checkpoint XID/epoch */
      {
          /* use volatile pointer to prevent code rearrangement */
***************
*** 6068,6082 **** CreateCheckPoint(int flags)
       * Truncate pg_subtrans if possible.  We can throw away all data before
       * the oldest XMIN of any running transaction.    No future transaction will
       * attempt to reference any pg_subtrans entry older than that (see Asserts
!      * in subtrans.c).    During recovery, though, we mustn't do this because
!      * StartupSUBTRANS hasn't been called yet.
       */
!     if (!InRecovery)
          TruncateSUBTRANS(GetOldestXmin(true, false));

      /* All real work is done, but log before releasing lock. */
      if (log_checkpoints)
!         LogCheckpointEnd();

          TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
                                  NBuffers, CheckpointStats.ckpt_segs_added,
--- 6281,6294 ----
       * Truncate pg_subtrans if possible.  We can throw away all data before
       * the oldest XMIN of any running transaction.    No future transaction will
       * attempt to reference any pg_subtrans entry older than that (see Asserts
!      * in subtrans.c).
       */
!     if (!shutdown)
          TruncateSUBTRANS(GetOldestXmin(true, false));

      /* All real work is done, but log before releasing lock. */
      if (log_checkpoints)
!         LogCheckpointEnd(flags);

          TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
                                  NBuffers, CheckpointStats.ckpt_segs_added,
***************
*** 6085,6090 **** CreateCheckPoint(int flags)
--- 6297,6347 ----

      LWLockRelease(CheckpointLock);
  }
+
+ /*
+  * GetRedoLocationForCheckpoint()
+  *
+  * When !IsRecoveryProcessingMode() this must be called while holding
+  * WALInsertLock().
+  */
+ static XLogRecPtr
+ GetRedoLocationForCheckpoint()
+ {
+     XLogCtlInsert  *Insert = &XLogCtl->Insert;
+     uint32                  freespace;
+     XLogRecPtr              redo;
+
+     freespace = INSERT_FREESPACE(Insert);
+     if (freespace < SizeOfXLogRecord)
+     {
+             (void) AdvanceXLInsertBuffer(false);
+             /* OK to ignore update return flag, since we will do flush anyway */
+             freespace = INSERT_FREESPACE(Insert);
+     }
+     INSERT_RECPTR(redo, Insert, Insert->curridx);
+
+     /*
+      * Here we update the shared RedoRecPtr for future XLogInsert calls; this
+      * must be done while holding the insert lock AND the info_lck.
+      *
+      * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
+      * pointing past where it really needs to point.  This is okay; the only
+      * consequence is that XLogInsert might back up whole buffers that it
+      * didn't really need to.  We can't postpone advancing RedoRecPtr because
+      * XLogInserts that happen while we are dumping buffers must assume that
+      * their buffer changes are not included in the checkpoint.
+      */
+     {
+             /* use volatile pointer to prevent code rearrangement */
+             volatile XLogCtlData *xlogctl = XLogCtl;
+
+         SpinLockAcquire(&xlogctl->info_lck);
+         RedoRecPtr = xlogctl->Insert.RedoRecPtr = redo;
+         SpinLockRelease(&xlogctl->info_lck);
+     }
+
+     return redo;
+ }

  /*
   * Flush all data in shared memory to disk, and fsync
***************
*** 6150,6180 **** RecoveryRestartPoint(const CheckPoint *checkPoint)
              }
      }

      /*
!      * OK, force data out to disk
       */
!     CheckPointGuts(checkPoint->redo, CHECKPOINT_IMMEDIATE);

      /*
!      * Update pg_control so that any subsequent crash will restart from this
!      * checkpoint.    Note: ReadRecPtr gives the XLOG address of the checkpoint
!      * record itself.
       */
-     ControlFile->prevCheckPoint = ControlFile->checkPoint;
-     ControlFile->checkPoint = ReadRecPtr;
-     ControlFile->checkPointCopy = *checkPoint;
-     ControlFile->time = (pg_time_t) time(NULL);
-     UpdateControlFile();

      ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
!             (errmsg("recovery restart point at %X/%X",
!                     checkPoint->redo.xlogid, checkPoint->redo.xrecoff)));
!     if (recoveryLastXTime)
!         ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
!                 (errmsg("last completed transaction was at log time %s",
!                         timestamptz_to_str(recoveryLastXTime))));
! }

  /*
   * Write a NEXTOID log record
   */
--- 6407,6477 ----
              }
      }

+     RequestRestartPoint(ReadRecPtr, checkPoint, reachedSafeStartPoint);
+ }
+
+ /*
+ * As of 8.4, RestartPoints are always created by the bgwriter
+ * once we have reachedSafeStartPoint. We use bgwriter's shared memory
+ * area wherever we call it from, to keep better code structure.
+ */
+ void
+ CreateRestartPoint(const XLogRecPtr ReadPtr, const CheckPoint *restartPoint, int flags)
+ {
+     if (recoveryLogRestartpoints || log_checkpoints)
+     {
+           /*
+          * Prepare to accumulate statistics.
+            */
+
+         MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
+         CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
+
+         LogCheckpointStart(CHECKPOINT_RESTARTPOINT | flags);
+     }
+
+       /*
+      * Acquire CheckpointLock to ensure only one restartpoint happens at a time.
+      * We rely on this lock to ensure that the startup process doesn't exit
+      * Recovery while we are half way through a restartpoint.
+        */
+     LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
+
+     CheckPointGuts(restartPoint->redo, CHECKPOINT_RESTARTPOINT | flags);
+
      /*
!      * Update pg_control, using current time
       */
!     LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
!       ControlFile->prevCheckPoint = ControlFile->checkPoint;
!     ControlFile->checkPoint = ReadPtr;
!     ControlFile->checkPointCopy = *restartPoint;
!       ControlFile->time = (pg_time_t) time(NULL);
!       UpdateControlFile();
!     LWLockRelease(ControlFileLock);

      /*
!      * Currently, there is no need to truncate pg_subtrans during recovery.
!      * If we did do that, we will need to have called StartupSUBTRANS()
!      * already and then TruncateSUBTRANS() would go here.
       */

+     /* All real work is done, but log before releasing lock. */
+     if (recoveryLogRestartpoints || log_checkpoints)
+         LogCheckpointEnd(CHECKPOINT_RESTARTPOINT);
+
      ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
!               (errmsg("recovery restart point at %X/%X",
!                     restartPoint->redo.xlogid, restartPoint->redo.xrecoff)));

+       if (recoveryLastXTime)
+           ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
+             (errmsg("last completed transaction was at log time %s",
+                     timestamptz_to_str(recoveryLastXTime))));
+
+     LWLockRelease(CheckpointLock);
+ }
+
  /*
   * Write a NEXTOID log record
   */
***************
*** 6237,6243 **** RequestXLogSwitch(void)
  }

  /*
!  * XLOG resource manager's routines
   */
  void
  xlog_redo(XLogRecPtr lsn, XLogRecord *record)
--- 6534,6596 ----
  }

  /*
!  * exitRecovery()
!  *
!  * Exit recovery state and write a XLOG_RECOVERY_END record. This is the
!  * only record type that can record a change of timelineID. We assume
!  * caller has already set ThisTimeLineID, if appropriate.
!  */
! static void
! exitRecovery(void)
! {
!     XLogRecData rdata;
!
!     rdata.buffer = InvalidBuffer;
!     rdata.data = (char *) (&ThisTimeLineID);
!     rdata.len = sizeof(TimeLineID);
!     rdata.next = NULL;
!
!     /*
!      * If a restartpoint is in progress, we will not be able to successfully
!      * acquire CheckpointLock. If bgwriter is still in progress then send
!      * a second signal to nudge bgwriter to go faster so we can avoid delay.
!      * Then wait for lock, so we know the restartpoint has completed. We do
!      * this because we don't want to interrupt the restartpoint half way
!      * through, which might leave us in a mess and we want to be robust. We're
!      * going to checkpoint soon anyway, so not it's not wasted effort.
!      */
!     if (LWLockConditionalAcquire(CheckpointLock, LW_EXCLUSIVE))
!         LWLockRelease(CheckpointLock);
!     else
!     {
!         RequestRestartPointCompletion();
!         ereport(DEBUG1,
!             (errmsg("startup process waiting for restartpoint to complete")));
!         LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
!         LWLockRelease(CheckpointLock);
!     }
!
!     /*
!      * This is the only type of WAL message that can be inserted during
!      * recovery. This ensures that we don't allow others to get access
!      * until after we have changed state.
!      */
!     (void) XLogInsert(RM_XLOG_ID, XLOG_RECOVERY_END, &rdata);
!
!     /*
!      * We don't XLogFlush() here otherwise we'll end up zeroing the WAL
!      * file ourselves. So just let bgwriter's forthcoming checkpoint do
!      * that for us.
!      */
!
!     InRecovery = false;
! }
!
! /*
!  * XLOG resource manager's routines.
!  *
!  * Definitions of message info are in include/catalog/pg_control.h,
!  * though not all messages relate to control file processing.
   */
  void
  xlog_redo(XLogRecPtr lsn, XLogRecord *record)
***************
*** 6271,6293 **** xlog_redo(XLogRecPtr lsn, XLogRecord *record)
          ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
          ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;

!         /*
!          * TLI may change in a shutdown checkpoint, but it shouldn't decrease
           */
-         if (checkPoint.ThisTimeLineID != ThisTimeLineID)
-         {
-             if (checkPoint.ThisTimeLineID < ThisTimeLineID ||
-                 !list_member_int(expectedTLIs,
-                                  (int) checkPoint.ThisTimeLineID))
-                 ereport(PANIC,
-                         (errmsg("unexpected timeline ID %u (after %u) in checkpoint record",
-                                 checkPoint.ThisTimeLineID, ThisTimeLineID)));
-             /* Following WAL records should be run with new TLI */
-             ThisTimeLineID = checkPoint.ThisTimeLineID;
-         }

          RecoveryRestartPoint(&checkPoint);
      }
      else if (info == XLOG_CHECKPOINT_ONLINE)
      {
          CheckPoint    checkPoint;
--- 6624,6663 ----
          ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
          ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;

!           /*
!          * TLI no longer changes at shutdown checkpoint, since as of 8.4,
!          * shutdown checkpoints only occur at shutdown. Much less confusing.
           */

          RecoveryRestartPoint(&checkPoint);
      }
+     else if (info == XLOG_RECOVERY_END)
+     {
+         TimeLineID    tli;
+
+         memcpy(&tli, XLogRecGetData(record), sizeof(TimeLineID));
+
+         /*
+          * TLI may change when recovery ends, but it shouldn't decrease.
+          *
+          * This is the only WAL record that can tell us to change timelineID
+          * while we process WAL records.
+          *
+          * We can *choose* to stop recovery at any point, generating a
+          * new timelineID which is recorded using this record type.
+          */
+         if (tli != ThisTimeLineID)
+           {
+             if (tli < ThisTimeLineID ||
+                   !list_member_int(expectedTLIs,
+                                  (int) tli))
+                   ereport(PANIC,
+                         (errmsg("unexpected timeline ID %u (after %u) at recovery end record",
+                                 tli, ThisTimeLineID)));
+               /* Following WAL records should be run with new TLI */
+             ThisTimeLineID = tli;
+           }
+       }
      else if (info == XLOG_CHECKPOINT_ONLINE)
      {
          CheckPoint    checkPoint;
***************
*** 6309,6315 **** xlog_redo(XLogRecPtr lsn, XLogRecord *record)
          ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
          ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;

!         /* TLI should not change in an on-line checkpoint */
          if (checkPoint.ThisTimeLineID != ThisTimeLineID)
              ereport(PANIC,
                      (errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
--- 6679,6685 ----
          ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
          ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;

!         /* TLI must not change at a checkpoint */
          if (checkPoint.ThisTimeLineID != ThisTimeLineID)
              ereport(PANIC,
                      (errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
***************
*** 6545,6550 **** pg_start_backup(PG_FUNCTION_ARGS)
--- 6915,6926 ----
                   errhint("archive_command must be defined before "
                           "online backups can be made safely.")));

+     if (IsRecoveryProcessingMode())
+         ereport(ERROR,
+                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                  errmsg("recovery is in progress"),
+                  errhint("WAL control functions cannot be executed during recovery.")));
+
      backupidstr = text_to_cstring(backupid);

      /*
***************
*** 6710,6715 **** pg_stop_backup(PG_FUNCTION_ARGS)
--- 7086,7097 ----
                   errmsg("WAL archiving is not active"),
                   errhint("archive_mode must be enabled at server start.")));

+     if (IsRecoveryProcessingMode())
+         ereport(ERROR,
+                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                  errmsg("recovery is in progress"),
+                  errhint("WAL control functions cannot be executed during recovery.")));
+
      /*
       * OK to clear forcePageWrites
       */
***************
*** 6865,6870 **** pg_switch_xlog(PG_FUNCTION_ARGS)
--- 7247,7258 ----
                  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
               (errmsg("must be superuser to switch transaction log files"))));

+     if (IsRecoveryProcessingMode())
+         ereport(ERROR,
+                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                  errmsg("recovery is in progress"),
+                  errhint("WAL control functions cannot be executed during recovery.")));
+
      switchpoint = RequestXLogSwitch();

      /*
***************
*** 6887,6892 **** pg_current_xlog_location(PG_FUNCTION_ARGS)
--- 7275,7286 ----
  {
      char        location[MAXFNAMELEN];

+     if (IsRecoveryProcessingMode())
+         ereport(ERROR,
+                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                  errmsg("recovery is in progress"),
+                  errhint("WAL control functions cannot be executed during recovery.")));
+
      /* Make sure we have an up-to-date local LogwrtResult */
      {
          /* use volatile pointer to prevent code rearrangement */
***************
*** 6914,6919 **** pg_current_xlog_insert_location(PG_FUNCTION_ARGS)
--- 7308,7319 ----
      XLogRecPtr    current_recptr;
      char        location[MAXFNAMELEN];

+     if (IsRecoveryProcessingMode())
+         ereport(ERROR,
+                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                  errmsg("recovery is in progress"),
+                  errhint("WAL control functions cannot be executed during recovery.")));
+
      /*
       * Get the current end-of-WAL position ... shared lock is sufficient
       */
*** src/backend/commands/dbcommands.c
--- src/backend/commands/dbcommands.c
***************
*** 1976,1981 **** dbase_redo(XLogRecPtr lsn, XLogRecord *record)
--- 1976,1986 ----
           * We don't need to copy subdirectories
           */
          copydir(src_path, dst_path, false);
+
+         /*
+          * Flat files are updated immediately following transaction commit.
+           * Nothing to do here.
+          */
      }
      else if (info == XLOG_DBASE_DROP)
      {
***************
*** 1998,2003 **** dbase_redo(XLogRecPtr lsn, XLogRecord *record)
--- 2003,2012 ----
              ereport(WARNING,
                      (errmsg("some useless files may be left behind in old database directory \"%s\"",
                              dst_path)));
+         /*
+          * Flat files are updated immediately following transaction commit.
+           * Nothing to do here.
+          */
      }
      else
          elog(PANIC, "dbase_redo: unknown op code %u", info);
*** src/backend/postmaster/bgwriter.c
--- src/backend/postmaster/bgwriter.c
***************
*** 49,54 ****
--- 49,55 ----
  #include <unistd.h>

  #include "access/xlog_internal.h"
+ #include "catalog/pg_control.h"
  #include "libpq/pqsignal.h"
  #include "miscadmin.h"
  #include "pgstat.h"
***************
*** 129,134 **** typedef struct
--- 130,142 ----

      int            ckpt_flags;        /* checkpoint flags, as defined in xlog.h */

+     /*
+      * When the Startup process wants bgwriter to perform a restartpoint, it
+      * sets these fields so that we can update the control file afterwards.
+      */
+     XLogRecPtr    ReadPtr;        /* Requested log pointer */
+     CheckPoint  restartPoint;    /* restartPoint data for ControlFile */
+
      uint32        num_backend_writes;        /* counts non-bgwriter buffer writes */

      int            num_requests;    /* current # of requests */
***************
*** 165,171 **** static bool ckpt_active = false;

  /* these values are valid when ckpt_active is true: */
  static pg_time_t ckpt_start_time;
! static XLogRecPtr ckpt_start_recptr;
  static double ckpt_cached_elapsed;

  static pg_time_t last_checkpoint_time;
--- 173,179 ----

  /* these values are valid when ckpt_active is true: */
  static pg_time_t ckpt_start_time;
! static XLogRecPtr ckpt_start_recptr;    /* not used if IsRecoveryProcessingMode */
  static double ckpt_cached_elapsed;

  static pg_time_t last_checkpoint_time;
***************
*** 197,202 **** BackgroundWriterMain(void)
--- 205,211 ----
  {
      sigjmp_buf    local_sigjmp_buf;
      MemoryContext bgwriter_context;
+     bool        BgWriterRecoveryMode;

      BgWriterShmem->bgwriter_pid = MyProcPid;
      am_bg_writer = true;
***************
*** 355,370 **** BackgroundWriterMain(void)
       */
      PG_SETMASK(&UnBlockSig);

      /*
       * Loop forever
       */
      for (;;)
      {
-         bool        do_checkpoint = false;
-         int            flags = 0;
-         pg_time_t    now;
-         int            elapsed_secs;
-
          /*
           * Emergency bailout if postmaster has died.  This is to avoid the
           * necessity for manual cleanup of all postmaster children.
--- 364,380 ----
       */
      PG_SETMASK(&UnBlockSig);

+     BgWriterRecoveryMode = IsRecoveryProcessingMode();
+
+     if (BgWriterRecoveryMode)
+         elog(DEBUG1, "bgwriter starting during recovery, pid = %u",
+             BgWriterShmem->bgwriter_pid);
+
      /*
       * Loop forever
       */
      for (;;)
      {
          /*
           * Emergency bailout if postmaster has died.  This is to avoid the
           * necessity for manual cleanup of all postmaster children.
***************
*** 372,499 **** BackgroundWriterMain(void)
          if (!PostmasterIsAlive(true))
              exit(1);

-         /*
-          * Process any requests or signals received recently.
-          */
-         AbsorbFsyncRequests();
-
          if (got_SIGHUP)
          {
              got_SIGHUP = false;
              ProcessConfigFile(PGC_SIGHUP);
          }
-         if (checkpoint_requested)
-         {
-             checkpoint_requested = false;
-             do_checkpoint = true;
-             BgWriterStats.m_requested_checkpoints++;
-         }
-         if (shutdown_requested)
-         {
-             /*
-              * From here on, elog(ERROR) should end with exit(1), not send
-              * control back to the sigsetjmp block above
-              */
-             ExitOnAnyError = true;
-             /* Close down the database */
-             ShutdownXLOG(0, 0);
-             /* Normal exit from the bgwriter is here */
-             proc_exit(0);        /* done */
-         }

!         /*
!          * Force a checkpoint if too much time has elapsed since the last one.
!          * Note that we count a timed checkpoint in stats only when this
!          * occurs without an external request, but we set the CAUSE_TIME flag
!          * bit even if there is also an external request.
!          */
!         now = (pg_time_t) time(NULL);
!         elapsed_secs = now - last_checkpoint_time;
!         if (elapsed_secs >= CheckPointTimeout)
          {
!             if (!do_checkpoint)
!                 BgWriterStats.m_timed_checkpoints++;
!             do_checkpoint = true;
!             flags |= CHECKPOINT_CAUSE_TIME;
!         }

!         /*
!          * Do a checkpoint if requested, otherwise do one cycle of
!          * dirty-buffer writing.
!          */
!         if (do_checkpoint)
!         {
!             /* use volatile pointer to prevent code rearrangement */
!             volatile BgWriterShmemStruct *bgs = BgWriterShmem;

!             /*
!              * Atomically fetch the request flags to figure out what kind of a
!              * checkpoint we should perform, and increase the started-counter
!              * to acknowledge that we've started a new checkpoint.
!              */
!             SpinLockAcquire(&bgs->ckpt_lck);
!             flags |= bgs->ckpt_flags;
!             bgs->ckpt_flags = 0;
!             bgs->ckpt_started++;
!             SpinLockRelease(&bgs->ckpt_lck);

!             /*
!              * We will warn if (a) too soon since last checkpoint (whatever
!              * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
!              * since the last checkpoint start.  Note in particular that this
!              * implementation will not generate warnings caused by
!              * CheckPointTimeout < CheckPointWarning.
!              */
!             if ((flags & CHECKPOINT_CAUSE_XLOG) &&
!                 elapsed_secs < CheckPointWarning)
!                 ereport(LOG,
!                         (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
!                                 elapsed_secs),
!                          errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));

              /*
!              * Initialize bgwriter-private variables used during checkpoint.
               */
!             ckpt_active = true;
!             ckpt_start_recptr = GetInsertRecPtr();
!             ckpt_start_time = now;
!             ckpt_cached_elapsed = 0;

!             /*
!              * Do the checkpoint.
!              */
!             CreateCheckPoint(flags);

              /*
!              * After any checkpoint, close all smgr files.    This is so we
!              * won't hang onto smgr references to deleted files indefinitely.
               */
!             smgrcloseall();

              /*
!              * Indicate checkpoint completion to any waiting backends.
               */
!             SpinLockAcquire(&bgs->ckpt_lck);
!             bgs->ckpt_done = bgs->ckpt_started;
!             SpinLockRelease(&bgs->ckpt_lck);

!             ckpt_active = false;

!             /*
!              * Note we record the checkpoint start time not end time as
!              * last_checkpoint_time.  This is so that time-driven checkpoints
!              * happen at a predictable spacing.
!              */
!             last_checkpoint_time = now;
          }
-         else
-             BgBufferSync();
-
-         /* Check for archive_timeout and switch xlog files if necessary. */
-         CheckArchiveTimeout();
-
-         /* Nap for the configured time. */
-         BgWriterNap();
      }
  }

--- 382,595 ----
          if (!PostmasterIsAlive(true))
              exit(1);

          if (got_SIGHUP)
          {
              got_SIGHUP = false;
              ProcessConfigFile(PGC_SIGHUP);
          }

!         if (BgWriterRecoveryMode)
          {
!             if (shutdown_requested)
!             {
!                 /*
!                  * From here on, elog(ERROR) should end with exit(1), not send
!                  * control back to the sigsetjmp block above
!                  */
!                 ExitOnAnyError = true;
!                 /* Normal exit from the bgwriter is here */
!                 proc_exit(0);        /* done */
!             }

!             if (!IsRecoveryProcessingMode())
!             {
!                 elog(DEBUG2, "bgwriter changing from recovery to normal mode");
!
!                 InitXLOGAccess();
!                 BgWriterRecoveryMode = false;
!
!                 /*
!                  * Start time-driven events from now
!                  */
!                 last_checkpoint_time = last_xlog_switch_time = (pg_time_t) time(NULL);
!
!                 /*
!                  * Notice that we do *not* act on a checkpoint_requested
!                  * state at this point. We have changed mode, so we wish to
!                  * perform a checkpoint not a restartpoint.
!                  */
!                 continue;
!             }

!             if (checkpoint_requested)
!             {
!                 XLogRecPtr        ReadPtr;
!                 CheckPoint        restartPoint;
!
!                 checkpoint_requested = false;
!
!                 /*
!                  * Initialize bgwriter-private variables used during checkpoint.
!                  */
!                 ckpt_active = true;
!                 ckpt_start_time = (pg_time_t) time(NULL);
!                 ckpt_cached_elapsed = 0;
!
!                 /*
!                  * Get the requested values from shared memory that the
!                  * Startup process has put there for us.
!                  */
!                 SpinLockAcquire(&BgWriterShmem->ckpt_lck);
!                 ReadPtr = BgWriterShmem->ReadPtr;
!                 memcpy(&restartPoint, &BgWriterShmem->restartPoint, sizeof(CheckPoint));
!                 SpinLockRelease(&BgWriterShmem->ckpt_lck);
!
!                 /* Use smoothed writes, until interrupted if ever */
!                 CreateRestartPoint(ReadPtr, &restartPoint, 0);
!
!                 /*
!                  * After any checkpoint, close all smgr files.    This is so we
!                  * won't hang onto smgr references to deleted files indefinitely.
!                  */
!                 smgrcloseall();
!
!                 ckpt_active = false;
!                 checkpoint_requested = false;
!             }
!             else
!             {
!                 /* Clean buffers dirtied by recovery */
!                 BgBufferSync();

!                 /* Nap for the configured time. */
!                 BgWriterNap();
!             }
!         }
!         else    /* Normal processing */
!         {
!             bool        do_checkpoint = false;
!             int            flags = 0;
!             pg_time_t    now;
!             int            elapsed_secs;

              /*
!              * Process any requests or signals received recently.
               */
!             AbsorbFsyncRequests();

!             if (checkpoint_requested)
!             {
!                 checkpoint_requested = false;
!                 do_checkpoint = true;
!                 BgWriterStats.m_requested_checkpoints++;
!             }
!             if (shutdown_requested)
!             {
!                 /*
!                  * From here on, elog(ERROR) should end with exit(1), not send
!                  * control back to the sigsetjmp block above
!                  */
!                 ExitOnAnyError = true;
!                 /* Close down the database */
!                 ShutdownXLOG(0, 0);
!                 /* Normal exit from the bgwriter is here */
!                 proc_exit(0);        /* done */
!             }

              /*
!              * Force a checkpoint if too much time has elapsed since the last one.
!              * Note that we count a timed checkpoint in stats only when this
!              * occurs without an external request, but we set the CAUSE_TIME flag
!              * bit even if there is also an external request.
               */
!             now = (pg_time_t) time(NULL);
!             elapsed_secs = now - last_checkpoint_time;
!             if (elapsed_secs >= CheckPointTimeout)
!             {
!                 if (!do_checkpoint)
!                     BgWriterStats.m_timed_checkpoints++;
!                 do_checkpoint = true;
!                 flags |= CHECKPOINT_CAUSE_TIME;
!             }

              /*
!              * Do a checkpoint if requested, otherwise do one cycle of
!              * dirty-buffer writing.
               */
!             if (do_checkpoint)
!             {
!                 /* use volatile pointer to prevent code rearrangement */
!                 volatile BgWriterShmemStruct *bgs = BgWriterShmem;
!
!                 /*
!                  * Atomically fetch the request flags to figure out what kind of a
!                  * checkpoint we should perform, and increase the started-counter
!                  * to acknowledge that we've started a new checkpoint.
!                  */
!                 SpinLockAcquire(&bgs->ckpt_lck);
!                 flags |= bgs->ckpt_flags;
!                 bgs->ckpt_flags = 0;
!                 bgs->ckpt_started++;
!                 SpinLockRelease(&bgs->ckpt_lck);
!
!                 /*
!                  * We will warn if (a) too soon since last checkpoint (whatever
!                  * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
!                  * since the last checkpoint start.  Note in particular that this
!                  * implementation will not generate warnings caused by
!                  * CheckPointTimeout < CheckPointWarning.
!                  */
!                 if ((flags & CHECKPOINT_CAUSE_XLOG) &&
!                     elapsed_secs < CheckPointWarning)
!                     ereport(LOG,
!                             (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
!                                     elapsed_secs),
!                              errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
!
!                 /*
!                  * Initialize bgwriter-private variables used during checkpoint.
!                  */
!                 ckpt_active = true;
!                 ckpt_start_recptr = GetInsertRecPtr();
!                 ckpt_start_time = now;
!                 ckpt_cached_elapsed = 0;
!
!                 /*
!                  * Do the checkpoint.
!                  */
!                 CreateCheckPoint(flags);
!
!                 /*
!                  * After any checkpoint, close all smgr files.    This is so we
!                  * won't hang onto smgr references to deleted files indefinitely.
!                  */
!                 smgrcloseall();
!
!                 /*
!                  * Indicate checkpoint completion to any waiting backends.
!                  */
!                 SpinLockAcquire(&bgs->ckpt_lck);
!                 bgs->ckpt_done = bgs->ckpt_started;
!                 SpinLockRelease(&bgs->ckpt_lck);
!
!                 ckpt_active = false;
!
!                 /*
!                  * Note we record the checkpoint start time not end time as
!                  * last_checkpoint_time.  This is so that time-driven checkpoints
!                  * happen at a predictable spacing.
!                  */
!                 last_checkpoint_time = now;
!             }
!             else
!                 BgBufferSync();

!             /* Check for archive_timeout and switch xlog files if necessary. */
!             CheckArchiveTimeout();

!             /* Nap for the configured time. */
!             BgWriterNap();
          }
      }
  }

***************
*** 586,592 **** BgWriterNap(void)
          (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
              break;
          pg_usleep(1000000L);
!         AbsorbFsyncRequests();
          udelay -= 1000000L;
      }

--- 682,689 ----
          (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
              break;
          pg_usleep(1000000L);
!         if (!IsRecoveryProcessingMode())
!             AbsorbFsyncRequests();
          udelay -= 1000000L;
      }

***************
*** 640,645 **** CheckpointWriteDelay(int flags, double progress)
--- 737,755 ----
      if (!am_bg_writer)
          return;

+     /* Perform minimal duties during recovery and skip wait if requested */
+     if (IsRecoveryProcessingMode())
+     {
+         BgBufferSync();
+
+         if (!shutdown_requested &&
+             !checkpoint_requested &&
+             IsCheckpointOnSchedule(progress))
+             BgWriterNap();
+
+         return;
+     }
+
      /*
       * Perform the usual bgwriter duties and take a nap, unless we're behind
       * schedule, in which case we just try to catch up as quickly as possible.
***************
*** 714,729 **** IsCheckpointOnSchedule(double progress)
       * However, it's good enough for our purposes, we're only calculating an
       * estimate anyway.
       */
!     recptr = GetInsertRecPtr();
!     elapsed_xlogs =
!         (((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
!          ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
!         CheckPointSegments;
!
!     if (progress < elapsed_xlogs)
      {
!         ckpt_cached_elapsed = elapsed_xlogs;
!         return false;
      }

      /*
--- 824,842 ----
       * However, it's good enough for our purposes, we're only calculating an
       * estimate anyway.
       */
!     if (!IsRecoveryProcessingMode())
      {
!         recptr = GetInsertRecPtr();
!         elapsed_xlogs =
!             (((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
!              ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
!             CheckPointSegments;
!
!         if (progress < elapsed_xlogs)
!         {
!             ckpt_cached_elapsed = elapsed_xlogs;
!             return false;
!         }
      }

      /*
***************
*** 989,994 **** RequestCheckpoint(int flags)
--- 1102,1180 ----
  }

  /*
+  * Always runs in Startup process (see xlog.c)
+  */
+ void
+ RequestRestartPoint(const XLogRecPtr ReadPtr, const CheckPoint *restartPoint, bool sendToBGWriter)
+ {
+     /*
+      * Should we just do it ourselves?
+      */
+     if (!IsPostmasterEnvironment || !sendToBGWriter)
+     {
+         CreateRestartPoint(ReadPtr, restartPoint, CHECKPOINT_IMMEDIATE);
+         return;
+     }
+
+     /*
+      * Push requested values into shared memory, then signal to request restartpoint.
+      */
+     if (BgWriterShmem->bgwriter_pid == 0)
+         elog(LOG, "could not request restartpoint because bgwriter not running");
+
+     SpinLockAcquire(&BgWriterShmem->ckpt_lck);
+     BgWriterShmem->ReadPtr = ReadPtr;
+     memcpy(&BgWriterShmem->restartPoint, restartPoint, sizeof(CheckPoint));
+     SpinLockRelease(&BgWriterShmem->ckpt_lck);
+
+     if (kill(BgWriterShmem->bgwriter_pid, SIGINT) != 0)
+         elog(LOG, "could not signal for restartpoint: %m");
+ }
+
+ /*
+  * Sends another checkpoint request signal to bgwriter, which causes it
+  * to avoid smoothed writes and continue processing as if it had been
+  * called with CHECKPOINT_IMMEDIATE. This is used at the end of recovery.
+  */
+ void
+ RequestRestartPointCompletion(void)
+ {
+     if (BgWriterShmem->bgwriter_pid != 0 &&
+         kill(BgWriterShmem->bgwriter_pid, SIGINT) != 0)
+         elog(LOG, "could not signal for restartpoint immediate: %m");
+ }
+
+ XLogRecPtr
+ GetRedoLocationForArchiveCheckpoint(void)
+ {
+     XLogRecPtr    redo;
+
+     SpinLockAcquire(&BgWriterShmem->ckpt_lck);
+     redo = BgWriterShmem->ReadPtr;
+     SpinLockRelease(&BgWriterShmem->ckpt_lck);
+
+     return redo;
+ }
+
+ /*
+  * Store the information needed for a checkpoint at the end of recovery.
+  * Returns true if bgwriter can perform checkpoint, or false if bgwriter
+  * not active or otherwise unable to comply.
+  */
+ bool
+ SetRedoLocationForArchiveCheckpoint(XLogRecPtr redo)
+ {
+     SpinLockAcquire(&BgWriterShmem->ckpt_lck);
+     BgWriterShmem->ReadPtr = redo;
+     SpinLockRelease(&BgWriterShmem->ckpt_lck);
+
+     if (BgWriterShmem->bgwriter_pid == 0 || !IsPostmasterEnvironment)
+         return false;
+
+     return true;
+ }
+
+ /*
   * ForwardFsyncRequest
   *        Forward a file-fsync request from a backend to the bgwriter
   *
*** src/backend/postmaster/postmaster.c
--- src/backend/postmaster/postmaster.c
***************
*** 230,237 **** static bool FatalError = false; /* T if recovering from backend crash */
   * We use a simple state machine to control startup, shutdown, and
   * crash recovery (which is rather like shutdown followed by startup).
   *
!  * Normal child backends can only be launched when we are in PM_RUN state.
!  * (We also allow it in PM_WAIT_BACKUP state, but only for superusers.)
   * In other states we handle connection requests by launching "dead_end"
   * child processes, which will simply send the client an error message and
   * quit.  (We track these in the BackendList so that we can know when they
--- 230,239 ----
   * We use a simple state machine to control startup, shutdown, and
   * crash recovery (which is rather like shutdown followed by startup).
   *
!  * Normal child backends can only be launched when we are in PM_RUN or
!  * PM_RECOVERY state. Any transaction started in PM_RECOVERY state will
!  * be read-only for the whole of its life.  (We also allow launch of normal
!  * child backends in PM_WAIT_BACKUP state, but only for superusers.)
   * In other states we handle connection requests by launching "dead_end"
   * child processes, which will simply send the client an error message and
   * quit.  (We track these in the BackendList so that we can know when they
***************
*** 254,259 **** typedef enum
--- 256,266 ----
  {
      PM_INIT,                    /* postmaster starting */
      PM_STARTUP,                    /* waiting for startup subprocess */
+     PM_RECOVERY,                /* consistent recovery mode; state only
+                                  * entered for archive and streaming recovery,
+                                  * and only after the point where the
+                                  * all data is in consistent state.
+                                  */
      PM_RUN,                        /* normal "database is alive" state */
      PM_WAIT_BACKUP,                /* waiting for online backup mode to end */
      PM_WAIT_BACKENDS,            /* waiting for live backends to exit */
***************
*** 1302,1308 **** ServerLoop(void)
           * state that prevents it, start one.  It doesn't matter if this
           * fails, we'll just try again later.
           */
!         if (BgWriterPID == 0 && pmState == PM_RUN)
              BgWriterPID = StartBackgroundWriter();

          /*
--- 1309,1315 ----
           * state that prevents it, start one.  It doesn't matter if this
           * fails, we'll just try again later.
           */
!         if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
              BgWriterPID = StartBackgroundWriter();

          /*
***************
*** 1651,1661 **** retry1:
                      (errcode(ERRCODE_CANNOT_CONNECT_NOW),
                       errmsg("the database system is shutting down")));
              break;
-         case CAC_RECOVERY:
-             ereport(FATAL,
-                     (errcode(ERRCODE_CANNOT_CONNECT_NOW),
-                      errmsg("the database system is in recovery mode")));
-             break;
          case CAC_TOOMANY:
              ereport(FATAL,
                      (errcode(ERRCODE_TOO_MANY_CONNECTIONS),
--- 1658,1663 ----
***************
*** 1664,1669 **** retry1:
--- 1666,1672 ----
          case CAC_WAITBACKUP:
              /* OK for now, will check in InitPostgres */
              break;
+         case CAC_RECOVERY:
          case CAC_OK:
              break;
      }
***************
*** 1982,1991 **** pmdie(SIGNAL_ARGS)
              ereport(LOG,
                      (errmsg("received smart shutdown request")));

!             if (pmState == PM_RUN)
              {
                  /* autovacuum workers are told to shut down immediately */
!                 SignalAutovacWorkers(SIGTERM);
                  /* and the autovac launcher too */
                  if (AutoVacPID != 0)
                      signal_child(AutoVacPID, SIGTERM);
--- 1985,1995 ----
              ereport(LOG,
                      (errmsg("received smart shutdown request")));

!             if (pmState == PM_RUN || pmState == PM_RECOVERY)
              {
                  /* autovacuum workers are told to shut down immediately */
!                 if (pmState == PM_RUN)
!                     SignalAutovacWorkers(SIGTERM);
                  /* and the autovac launcher too */
                  if (AutoVacPID != 0)
                      signal_child(AutoVacPID, SIGTERM);
***************
*** 2019,2025 **** pmdie(SIGNAL_ARGS)

              if (StartupPID != 0)
                  signal_child(StartupPID, SIGTERM);
!             if (pmState == PM_RUN || pmState == PM_WAIT_BACKUP)
              {
                  ereport(LOG,
                          (errmsg("aborting any active transactions")));
--- 2023,2029 ----

              if (StartupPID != 0)
                  signal_child(StartupPID, SIGTERM);
!             if (pmState == PM_RUN || pmState == PM_RECOVERY || pmState == PM_WAIT_BACKUP)
              {
                  ereport(LOG,
                          (errmsg("aborting any active transactions")));
***************
*** 2115,2122 **** reaper(SIGNAL_ARGS)
           */
          if (pid == StartupPID)
          {
              StartupPID = 0;
!             Assert(pmState == PM_STARTUP);

              /* FATAL exit of startup is treated as catastrophic */
              if (!EXIT_STATUS_0(exitstatus))
--- 2119,2129 ----
           */
          if (pid == StartupPID)
          {
+             bool    leavingRecovery = (pmState == PM_RECOVERY);
+
              StartupPID = 0;
!             Assert(pmState == PM_STARTUP || pmState == PM_RECOVERY ||
!                    pmState == PM_WAIT_BACKUP || pmState == PM_WAIT_BACKENDS);

              /* FATAL exit of startup is treated as catastrophic */
              if (!EXIT_STATUS_0(exitstatus))
***************
*** 2124,2130 **** reaper(SIGNAL_ARGS)
                  LogChildExit(LOG, _("startup process"),
                               pid, exitstatus);
                  ereport(LOG,
!                 (errmsg("aborting startup due to startup process failure")));
                  ExitPostmaster(1);
              }

--- 2131,2137 ----
                  LogChildExit(LOG, _("startup process"),
                               pid, exitstatus);
                  ereport(LOG,
!                         (errmsg("aborting startup due to startup process failure")));
                  ExitPostmaster(1);
              }

***************
*** 2157,2166 **** reaper(SIGNAL_ARGS)
              load_role();

              /*
!              * Crank up the background writer.    It doesn't matter if this
!              * fails, we'll just try again later.
               */
!             Assert(BgWriterPID == 0);
              BgWriterPID = StartBackgroundWriter();

              /*
--- 2164,2173 ----
              load_role();

              /*
!              * Check whether we need to start background writer, if not
!              * already running.
               */
!             if (BgWriterPID == 0)
              BgWriterPID = StartBackgroundWriter();

              /*
***************
*** 2177,2184 **** reaper(SIGNAL_ARGS)
                  PgStatPID = pgstat_start();

              /* at this point we are really open for business */
!             ereport(LOG,
!                  (errmsg("database system is ready to accept connections")));

              continue;
          }
--- 2184,2195 ----
                  PgStatPID = pgstat_start();

              /* at this point we are really open for business */
!             if (leavingRecovery)
!                 ereport(LOG,
!                      (errmsg("database can now be accessed with read and write transactions")));
!             else
!                 ereport(LOG,
!                      (errmsg("database system is ready to accept connections")));

              continue;
          }
***************
*** 2898,2904 **** BackendStartup(Port *port)
      bn->pid = pid;
      bn->cancel_key = MyCancelKey;
      bn->is_autovacuum = false;
!     bn->dead_end = (port->canAcceptConnections != CAC_OK &&
                      port->canAcceptConnections != CAC_WAITBACKUP);
      DLAddHead(BackendList, DLNewElem(bn));
  #ifdef EXEC_BACKEND
--- 2909,2916 ----
      bn->pid = pid;
      bn->cancel_key = MyCancelKey;
      bn->is_autovacuum = false;
!     bn->dead_end = (!(port->canAcceptConnections == CAC_RECOVERY ||
!                       port->canAcceptConnections == CAC_OK) &&
                      port->canAcceptConnections != CAC_WAITBACKUP);
      DLAddHead(BackendList, DLNewElem(bn));
  #ifdef EXEC_BACKEND
***************
*** 3845,3850 **** sigusr1_handler(SIGNAL_ARGS)
--- 3857,3910 ----

      PG_SETMASK(&BlockSig);

+     if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_START))
+     {
+         Assert(pmState == PM_STARTUP);
+
+         /*
+          * Go to shutdown mode if a shutdown request was pending.
+          */
+         if (Shutdown > NoShutdown)
+         {
+             pmState = PM_WAIT_BACKENDS;
+             /* PostmasterStateMachine logic does the rest */
+         }
+         else
+         {
+             /*
+              * Startup process has entered recovery
+              */
+             pmState = PM_RECOVERY;
+
+             /*
+              * Load the flat authorization file into postmaster's cache. The
+              * startup process won't have recomputed this from the database
+              * yet, so it may change following recovery. We'll reload it
+              * after the startup process ends.
+              */
+             load_role();
+
+             /*
+              * Crank up the background writer.    It doesn't matter if this
+              * fails, we'll just try again later.
+              */
+             Assert(BgWriterPID == 0);
+             BgWriterPID = StartBackgroundWriter();
+
+             /*
+              * Likewise, start other special children as needed.
+              */
+             Assert(PgStatPID == 0);
+             PgStatPID = pgstat_start();
+
+             /* We can now accept read-only connections */
+             ereport(LOG,
+                  (errmsg("database system is ready to accept connections")));
+             ereport(LOG,
+                  (errmsg("database can now be accessed with read only transactions")));
+         }
+     }
+
      if (CheckPostmasterSignal(PMSIGNAL_PASSWORD_CHANGE))
      {
          /*
*** src/backend/utils/init/flatfiles.c
--- src/backend/utils/init/flatfiles.c
***************
*** 678,686 **** write_auth_file(Relation rel_authid, Relation rel_authmem)
  /*
   * This routine is called once during database startup, after completing
   * WAL replay if needed.  Its purpose is to sync the flat files with the
!  * current state of the database tables.  This is particularly important
!  * during PITR operation, since the flat files will come from the
!  * base backup which may be far out of sync with the current state.
   *
   * In theory we could skip rebuilding the flat files if no WAL replay
   * occurred, but it seems best to just do it always.  We have to
--- 678,687 ----
  /*
   * This routine is called once during database startup, after completing
   * WAL replay if needed.  Its purpose is to sync the flat files with the
!  * current state of the database tables.
!  *
!  * In 8.4 we also run this during xact_redo_commit() if the transaction
!  * wrote a new database or auth flat file.
   *
   * In theory we could skip rebuilding the flat files if no WAL replay
   * occurred, but it seems best to just do it always.  We have to
***************
*** 716,723 **** BuildFlatFiles(bool database_only)
      /*
       * We don't have any hope of running a real relcache, but we can use the
       * same fake-relcache facility that WAL replay uses.
-      *
-      * No locking is needed because no one else is alive yet.
       */
      rel_db = CreateFakeRelcacheEntry(rnode);
      write_database_file(rel_db, true);
--- 717,722 ----
***************
*** 832,845 **** AtEOXact_UpdateFlatFiles(bool isCommit)
      /* Okay to write the files */
      if (database_file_update_subid != InvalidSubTransactionId)
      {
!         database_file_update_subid = InvalidSubTransactionId;
          write_database_file(drel, false);
          heap_close(drel, NoLock);
      }

      if (auth_file_update_subid != InvalidSubTransactionId)
      {
!         auth_file_update_subid = InvalidSubTransactionId;
          write_auth_file(arel, mrel);
          heap_close(arel, NoLock);
          heap_close(mrel, NoLock);
--- 831,844 ----
      /* Okay to write the files */
      if (database_file_update_subid != InvalidSubTransactionId)
      {
!         /* reset database_file_update_subid later during commit */
          write_database_file(drel, false);
          heap_close(drel, NoLock);
      }

      if (auth_file_update_subid != InvalidSubTransactionId)
      {
!         /* reset auth_file_update_subid later during commit */
          write_auth_file(arel, mrel);
          heap_close(arel, NoLock);
          heap_close(mrel, NoLock);
*** src/include/access/xact.h
--- src/include/access/xact.h
***************
*** 17,22 ****
--- 17,23 ----
  #include "access/xlog.h"
  #include "nodes/pg_list.h"
  #include "storage/relfilenode.h"
+ #include "utils/snapshot.h"
  #include "utils/timestamp.h"


***************
*** 84,111 **** typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
  #define XLOG_XACT_ABORT                0x20
  #define XLOG_XACT_COMMIT_PREPARED    0x30
  #define XLOG_XACT_ABORT_PREPARED    0x40

  typedef struct xl_xact_commit
  {
!     TimestampTz xact_time;        /* time of commit */
!     int            nrels;            /* number of RelFileNodes */
!     int            nsubxacts;        /* number of subtransaction XIDs */
!     /* Array of RelFileNode(s) to drop at commit */
!     RelFileNode    xnodes[1];        /* VARIABLE LENGTH ARRAY */
!     /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
  } xl_xact_commit;

  #define MinSizeOfXactCommit offsetof(xl_xact_commit, xnodes)

  typedef struct xl_xact_abort
  {
      TimestampTz xact_time;        /* time of abort */
      int            nrels;            /* number of RelFileNodes */
      int            nsubxacts;        /* number of subtransaction XIDs */
      /* Array of RelFileNode(s) to drop at abort */
      RelFileNode    xnodes[1];        /* VARIABLE LENGTH ARRAY */
      /* ARRAY OF ABORTED SUBTRANSACTION XIDs FOLLOWS */
  } xl_xact_abort;

  #define MinSizeOfXactAbort offsetof(xl_xact_abort, xnodes)

--- 85,162 ----
  #define XLOG_XACT_ABORT                0x20
  #define XLOG_XACT_COMMIT_PREPARED    0x30
  #define XLOG_XACT_ABORT_PREPARED    0x40
+ #define XLOG_XACT_ASSIGNMENT        0x50
+ #define XLOG_XACT_RUNNING_XACTS        0x60
+ /* 0x70 can also be used, if required */
+
+ typedef struct xl_xact_assignment
+ {
+     TransactionId    xassign;    /* assigned xid */
+     TransactionId    xparent;    /* assigned xids parent, if any */
+     bool            isSubXact;    /* is a subtransaction */
+     int                slotId;        /* slotId in procarray */
+ } xl_xact_assignment;
+
+ /*
+  * xl_xact_running_xacts is in utils/snapshot.h so it can be passed
+  * around to the same places as snapshots. Not snapmgr.h
+  */

  typedef struct xl_xact_commit
  {
!       TimestampTz xact_time;        /* time of commit */
!      int            slotId;            /* slotId in procarray */
!      uint        xinfo;            /* info flags */
!       int            nrels;            /* number of RelFileForks */
!       int            nsubxacts;        /* number of subtransaction XIDs */
!     int            nmsgs;            /* number of shared inval msgs */
!       /* Array of RelFileFork(s) to drop at commit */
!       RelFileNode    xnodes[1];        /* VARIABLE LENGTH ARRAY */
!       /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
!     /* ARRAY OF SHARED INVALIDATION MESSAGES FOLLOWS */
  } xl_xact_commit;

  #define MinSizeOfXactCommit offsetof(xl_xact_commit, xnodes)
+ #define OffsetSharedInvalInXactCommit() \
+ ( \
+     MinSizeOfXactCommit +  \
+     (xlrec->nsubxacts * sizeof(TransactionId)) + \
+     (xlrec->nrels * sizeof(RelFileNode)) \
+ )
+
+ /*
+  * These flags are set in the xinfo fields of transaction
+  * completion WAL records. They indicate a number of actions
+  * that need to occur when emulating transaction completion.
+  * They are named XactCompletion... to differentiate them from
+  * EOXact... routines which run at the end of the original
+  * transaction completion.
+  */
+ #define XACT_COMPLETION_UNMARKED_SUBXIDS        0x01
+
+ /* These next states only occur on commit record types */
+ #define XACT_COMPLETION_UPDATE_DB_FILE            0x02
+ #define XACT_COMPLETION_UPDATE_AUTH_FILE        0x04
+ #define XACT_COMPLETION_UPDATE_RELCACHE_FILE    0x08
+
+ /* Access macros for above flags */
+ #define XactCompletionHasUnMarkedSubxids(xlrec)        ((xlrec)->xinfo & XACT_COMPLETION_UNMARKED_SUBXIDS)
+ #define XactCompletionUpdateDBFile(xlrec)             ((xlrec)->xinfo & XACT_COMPLETION_UPDATE_DB_FILE)
+ #define XactCompletionUpdateAuthFile(xlrec)         ((xlrec)->xinfo & XACT_COMPLETION_UPDATE_AUTH_FILE)
+ #define XactCompletionRelcacheInitFileInval(xlrec)    ((xlrec)->xinfo & XACT_COMPLETION_UPDATE_RELCACHE_FILE)

  typedef struct xl_xact_abort
  {
      TimestampTz xact_time;        /* time of abort */
+     int            slotId;            /* slotId in procarray */
+     uint        xinfo;            /* info flags */
      int            nrels;            /* number of RelFileNodes */
      int            nsubxacts;        /* number of subtransaction XIDs */
      /* Array of RelFileNode(s) to drop at abort */
      RelFileNode    xnodes[1];        /* VARIABLE LENGTH ARRAY */
      /* ARRAY OF ABORTED SUBTRANSACTION XIDs FOLLOWS */
  } xl_xact_abort;
+ /* Note the intentional lack of an invalidation message array c.f. commit */

  #define MinSizeOfXactAbort offsetof(xl_xact_abort, xnodes)

***************
*** 185,190 **** extern TransactionId RecordTransactionCommit(void);
--- 236,252 ----

  extern int    xactGetCommittedChildren(TransactionId **ptr);

+ extern void LogCurrentRunningXacts(void);
+ extern bool IsRunningXactDataValid(void);
+ extern void GetStandbyInfoForTransaction(RmgrId rmid, uint8 info,
+                             XLogRecData *rdata,
+                             TransactionId *xid2,
+                             uint16 *info2);
+
+ extern void InitRecoveryTransactionEnvironment(void);
+ extern void XactResolveRecoveryConflicts(TransactionId latestRemovedXid, Oid recDatabaseOid);
+ extern void RecordKnownAssignedTransactionIds(XLogRecPtr lsn, XLogRecord *record);
+
  extern void xact_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void xact_desc(StringInfo buf, uint8 xl_info, char *rec);

*** src/include/access/xlog.h
--- src/include/access/xlog.h
***************
*** 133,139 **** typedef struct XLogRecData
  } XLogRecData;

  extern TimeLineID ThisTimeLineID;        /* current TLI */
! extern bool InRecovery;
  extern XLogRecPtr XactLastRecEnd;

  /* these variables are GUC parameters related to XLOG */
--- 133,147 ----
  } XLogRecData;

  extern TimeLineID ThisTimeLineID;        /* current TLI */
! /*
!  * Prior to 8.4, all activity during recovery were carried out by Startup
!  * process. This local variable continues to be used in many parts of the
!  * code to indicate actions taken by RecoveryManagers. Other processes who
!  * potentially perform work during recovery should check
!  * IsRecoveryProcessingMode(), see XLogCtl notes in xlog.c
!  */
! extern bool InRecovery;
! extern bool InArchiveRecovery;
  extern XLogRecPtr XactLastRecEnd;

  /* these variables are GUC parameters related to XLOG */
***************
*** 166,171 **** extern bool XLOG_DEBUG;
--- 174,180 ----
  /* These indicate the cause of a checkpoint request */
  #define CHECKPOINT_CAUSE_XLOG    0x0010    /* XLOG consumption */
  #define CHECKPOINT_CAUSE_TIME    0x0020    /* Elapsed time */
+ #define CHECKPOINT_RESTARTPOINT    0x0040    /* Restartpoint during recovery */

  /* Checkpoint statistics */
  typedef struct CheckpointStatsData
***************
*** 197,202 **** extern void XLogSetAsyncCommitLSN(XLogRecPtr record);
--- 206,214 ----
  extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);

+ extern bool IsRecoveryProcessingMode(void);
+
+
  extern void UpdateControlFile(void);
  extern Size XLOGShmemSize(void);
  extern void XLOGShmemInit(void);
*** src/include/access/xlog_internal.h
--- src/include/access/xlog_internal.h
***************
*** 17,22 ****
--- 17,23 ----
  #define XLOG_INTERNAL_H

  #include "access/xlog.h"
+ #include "catalog/pg_control.h"
  #include "fmgr.h"
  #include "pgtime.h"
  #include "storage/block.h"
***************
*** 245,250 **** extern const RmgrData RmgrTable[];
--- 246,254 ----
  extern pg_time_t GetLastSegSwitchTime(void);
  extern XLogRecPtr RequestXLogSwitch(void);

+ extern void CreateRestartPoint(const XLogRecPtr ReadPtr,
+                 const CheckPoint *restartPoint, int flags);
+
  /*
   * These aren't in xlog.h because I'd rather not include fmgr.h there.
   */
*** src/include/catalog/pg_control.h
--- src/include/catalog/pg_control.h
***************
*** 46,51 **** typedef struct CheckPoint
--- 46,52 ----
  #define XLOG_NOOP                        0x20
  #define XLOG_NEXTOID                    0x30
  #define XLOG_SWITCH                        0x40
+ #define XLOG_RECOVERY_END                0x50


  /* System status indicator */
***************
*** 102,107 **** typedef struct ControlFileData
--- 103,109 ----
      CheckPoint    checkPointCopy; /* copy of last check point record */

      XLogRecPtr    minRecoveryPoint;        /* must replay xlog to here */
+     XLogRecPtr    minSafeStartPoint;        /* safe point after recovery crashes */

      /*
       * This data is used to check for hardware-architecture compatibility of
*** src/include/postmaster/bgwriter.h
--- src/include/postmaster/bgwriter.h
***************
*** 12,17 ****
--- 12,18 ----
  #ifndef _BGWRITER_H
  #define _BGWRITER_H

+ #include "catalog/pg_control.h"
  #include "storage/block.h"
  #include "storage/relfilenode.h"

***************
*** 25,30 **** extern double CheckPointCompletionTarget;
--- 26,36 ----
  extern void BackgroundWriterMain(void);

  extern void RequestCheckpoint(int flags);
+ extern void RequestRestartPoint(const XLogRecPtr ReadPtr, const CheckPoint *restartPoint, bool sendToBGWriter);
+ extern void RequestRestartPointCompletion(void);
+ extern XLogRecPtr GetRedoLocationForArchiveCheckpoint(void);
+ extern bool SetRedoLocationForArchiveCheckpoint(XLogRecPtr redo);
+
  extern void CheckpointWriteDelay(int flags, double progress);

  extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
*** src/include/storage/pmsignal.h
--- src/include/storage/pmsignal.h
***************
*** 22,27 ****
--- 22,28 ----
   */
  typedef enum
  {
+     PMSIGNAL_RECOVERY_START,    /* move to PM_RECOVERY state */
      PMSIGNAL_PASSWORD_CHANGE,    /* pg_auth file has changed */
      PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
      PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
pgsql-hackers by date:
From: Jeff Davis
Date: 22 December 2008, 16:17:05
Subject: Re: affected rows count
From: Andrew Dunstan
Date: 22 December 2008, 17:22:20
Subject: HAVE_FSEEKO for WIN32
Re: [PATCHES] Infrastructure changes for recovery (v8) - Mailing list pgsql-hackers

Previous

Next