Re: Spread checkpoint sync - Mailing list pgsql-hackers
From: Greg Smith
Subject: Re: Spread checkpoint sync
Msg-id: 4D4F9A3D.5070700@2ndquadrant.com
In response to: Re: Spread checkpoint sync (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Spread checkpoint sync; Re: Spread checkpoint sync
List: pgsql-hackers
Robert Haas wrote:
> With the fsync queue compaction patch applied, I think most of this is
> now not needed.  Attached please find an attempt to isolate the
> portion that looks like it might still be useful.  The basic idea of
> what remains here is to make the background writer still do its normal
> stuff even when it's checkpointing.  In particular, with this patch
> applied, PG will:
>
> 1. Absorb fsync requests a lot more often during the sync phase.
> 2. Still try to run the cleaning scan during the sync phase.
> 3. Pause for 3 seconds after every fsync.

Yes, the bits you extracted were the remaining useful parts from the
original patch.  Today was quiet here because there were sports on or
something, so I added full auto-tuning magic to the attached update.  I
need to kick off benchmarks and report back tomorrow on how well this
does, but any additional patch here would only be cleanup of the messy
stuff I did in here (plus a proper implementation of the pair of GUCs).
This has finally gotten to the exact logic I've been meaning to complete
as spread sync since the idea was first postponed in 8.3, with the
benefit of some fsync absorption improvements along the way, too.

The automatic timing is modeled on the existing
checkpoint_completion_target concept, except with a new tunable (not yet
added as a GUC) currently called CheckPointSyncTarget, set to 0.8 right
now.  What I think I want to do is make the existing
checkpoint_completion_target be the target for the end of the sync
phase, matching its name; people who bumped it up won't necessarily even
have to change anything.  The new GUC can then be
checkpoint_write_target, representing the target that is in there right
now.

I tossed the earlier idea of counting relations to sync based on the
write phase data as too inaccurate after testing, and with it for now
goes checkpoint sorting.
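To make the proposed renaming concrete, a postgresql.conf fragment under that scheme might look like the sketch below.  Both the checkpoint_write_target name and these defaults are speculative, taken only from the discussion above; neither setting exists in released PostgreSQL in this form.

```
# Hypothetical settings under the proposed GUC pair; values mirror the
# defaults discussed in this message.
checkpoint_write_target = 0.5        # fraction of the interval for the write phase
checkpoint_completion_target = 0.8   # now the target for the end of the sync phase
```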
Instead, I just take a first pass over pendingOpsTable to get a total
number of things to sync, which will always match the real count barring
strange circumstances (like dropping a table).

As for automatically determining the interval:  I take the number of
syncs that have finished so far, divide by the total, and get a number
between 0.0 and 1.0 that represents progress on the sync phase.  I then
use the same basic CheckpointWriteDelay logic that is there for
spreading writes out, except with the new sync target.  I realized that
if we assume the checkpoint writes should have finished in
CheckPointCompletionTarget worth of time or segments, we can compute a
new progress metric with the formula:

    progress = CheckPointCompletionTarget +
        (1.0 - CheckPointCompletionTarget) * finished / goal;

where "finished" is the number of segments synced so far, while "goal"
is the total.  To turn this into an example, let's say the default
parameters are set, we've finished the writes, and finished 1 out of 4
syncs; that much work will be considered:

    progress = 0.5 + (1.0 - 0.5) * 1 / 4 = 0.625

on a scale that effectively aims to have the sync work finished by 0.8.

I don't use quite the same logic as CheckpointWriteDelay, though.  It
turns out the existing checkpoint_completion implementation doesn't
always work like I thought it did, which provided some very interesting
insight into why my attempts to work around checkpoint problems haven't
worked as well as expected the last few years.  I thought that what it
did was wait until an amount of time determined by the target was
reached before it did the next write.  That's not quite it; what it
actually does is check progress against the target, then sleep exactly
one nap interval if it is ahead of schedule.  That is only the same
thing if you have a lot of buffers to write relative to the amount of
time involved.
There's some alternative logic if you don't have bgwriter_lru_maxpages
set, but in the normal situation it effectively means that:

    maximum write spread time = bgwriter_delay * checkpoint dirty blocks

no matter how far apart you try to spread the checkpoints.  Now,
typically, when people run into these checkpoint spikes in production,
reducing shared_buffers improves that.  But I now realize that doing so
will then reduce the average number of dirty blocks participating in the
checkpoint, and therefore potentially pull the spread down at the same
time!  Also, if you try to tune bgwriter_delay down to get better
background cleaning, you're also reducing the maximum spread.  Between
this issue and the bad behavior when the fsync queue fills, no wonder
this has been so hard to tune out of production systems.  At some point,
the reduction in spread defeats further attempts to reduce the size of
what's written at checkpoint time by lowering the amount of data
involved.

What I do instead is nap until just after the planned schedule, then
execute the sync.  What ends up happening then is that there can be a
long pause between the end of the write phase and when syncs start to
happen, which I consider a good thing.  It gives the kernel a little
more time to try to get writes moving out to disk.
Here's what that looks like on my development desktop:

2011-02-07 00:46:24 EST: LOG:  checkpoint starting: time
2011-02-07 00:48:04 EST: DEBUG:  checkpoint sync: estimated segments=10
2011-02-07 00:48:24 EST: DEBUG:  checkpoint sync: naps=99
2011-02-07 00:48:36 EST: DEBUG:  checkpoint sync: number=1 file=base/16736/16749.1 time=12033.898 msec
2011-02-07 00:48:36 EST: DEBUG:  checkpoint sync: number=2 file=base/16736/16749 time=60.799 msec
2011-02-07 00:48:48 EST: DEBUG:  checkpoint sync: naps=59
2011-02-07 00:48:48 EST: DEBUG:  checkpoint sync: number=3 file=base/16736/16756 time=0.003 msec
2011-02-07 00:49:00 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:49:00 EST: DEBUG:  checkpoint sync: number=4 file=base/16736/16750 time=0.003 msec
2011-02-07 00:49:12 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:49:12 EST: DEBUG:  checkpoint sync: number=5 file=base/16736/16737 time=0.004 msec
2011-02-07 00:49:24 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:49:24 EST: DEBUG:  checkpoint sync: number=6 file=base/16736/16749_fsm time=0.004 msec
2011-02-07 00:49:36 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:49:36 EST: DEBUG:  checkpoint sync: number=7 file=base/16736/16740 time=0.003 msec
2011-02-07 00:49:48 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:49:48 EST: DEBUG:  checkpoint sync: number=8 file=base/16736/16749_vm time=0.003 msec
2011-02-07 00:50:00 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:50:00 EST: DEBUG:  checkpoint sync: number=9 file=base/16736/16752 time=0.003 msec
2011-02-07 00:50:12 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:50:12 EST: DEBUG:  checkpoint sync: number=10 file=base/16736/16754 time=0.003 msec
2011-02-07 00:50:12 EST: LOG:  checkpoint complete: wrote 14335 buffers (43.7%); 0 transaction log file(s) added, 0 removed, 64 recycled; write=47.873 s, sync=127.819 s, total=227.990 s; sync files=10, longest=12.033 s, average=1.209 s

Since this is ext3 the spike during the first sync is brutal anyway, but it tried very hard
to avoid that:  it waited 99 * 200ms = 19.8 seconds between writing the
last buffer and when it started syncing them (00:48:04 to 00:48:24).
Given the slow write for #1, it was then behind, so it immediately moved
on to #2.  But after that, it was able to insert a moderate nap time
between successive syncs--60 naps is 12 seconds, and it keeps that pace
for the remainder of the sync phase.  This is the same sort of thing I'd
worked out as optimal on the system this patch originated from, except
that system had a lot more dirty relations; that's why its naptime was
the 3 seconds hard-coded into earlier versions of this patch.  Results
on XFS with mini-server class hardware should be interesting...

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 4df69c2..f58ac3e 100644
*** a/src/backend/postmaster/bgwriter.c
--- b/src/backend/postmaster/bgwriter.c
*************** static bool am_bg_writer = false;
*** 168,173 ****
--- 168,175 ----
  static bool ckpt_active = false;
  
+ static int checkpoint_flags = 0;
+ 
  /* these values are valid when ckpt_active is true: */
  static pg_time_t ckpt_start_time;
  static XLogRecPtr ckpt_start_recptr;
*************** static pg_time_t last_xlog_switch_time;
*** 180,186 ****
  static void CheckArchiveTimeout(void);
  static void BgWriterNap(void);
! static bool IsCheckpointOnSchedule(double progress);
  static bool ImmediateCheckpointRequested(void);
  static bool CompactBgwriterRequestQueue(void);
--- 182,188 ----
  static void CheckArchiveTimeout(void);
  static void BgWriterNap(void);
! static bool IsCheckpointOnSchedule(double progress,double target);
  static bool ImmediateCheckpointRequested(void);
  static bool CompactBgwriterRequestQueue(void);
*************** CheckpointWriteDelay(int flags, double p
*** 691,696 ****
--- 693,701 ----
  	if (!am_bg_writer)
  		return;
  
+ 	/* Cache this value for a later spread sync */
+ 	checkpoint_flags=flags;
+ 
  	/*
  	 * Perform the usual bgwriter duties and take a nap, unless we're behind
  	 * schedule, in which case we just try to catch up as quickly as possible.
*************** CheckpointWriteDelay(int flags, double p
*** 698,704 ****
  	if (!(flags & CHECKPOINT_IMMEDIATE) &&
  		!shutdown_requested &&
  		!ImmediateCheckpointRequested() &&
! 		IsCheckpointOnSchedule(progress))
  	{
  		if (got_SIGHUP)
  		{
--- 703,709 ----
  	if (!(flags & CHECKPOINT_IMMEDIATE) &&
  		!shutdown_requested &&
  		!ImmediateCheckpointRequested() &&
! 		IsCheckpointOnSchedule(progress,CheckPointCompletionTarget))
  	{
  		if (got_SIGHUP)
  		{
*************** CheckpointWriteDelay(int flags, double p
*** 726,731 ****
--- 731,799 ----
  }
  
  /*
+  * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint
+  *
+  * This function is called after each file sync performed by mdsync().
+  * It is responsible for keeping the bgwriter's normal activities in
+  * progress during a long checkpoint.
+  */
+ void
+ CheckpointSyncDelay(int finished,int goal)
+ {
+ 	int			flags = checkpoint_flags;
+ 	int			nap_count = 0;
+ 	double		progress;
+ 	double		CheckPointSyncTarget = 0.8;
+ 
+ 	/* Do nothing if checkpoint is being executed by non-bgwriter process */
+ 	if (!am_bg_writer)
+ 		return;
+ 
+ 	/*
+ 	 * Limit progress to the goal, which may be possible if the
+ 	 * segments to sync were calculated wrong.
+ 	 */
+ 	ckpt_cached_elapsed = 0;
+ 	if (finished > goal)
+ 		finished=goal;
+ 
+ 	/*
+ 	 * Base our progress on the assumption that the write took
+ 	 * checkpoint_completion_target worth of time, and that sync
+ 	 * progress is advancing from beyond that point.
+ 	 */
+ 	progress = CheckPointCompletionTarget +
+ 		(1.0 - CheckPointCompletionTarget) * finished / goal;
+ 
+ 	/*
+ 	 * Perform the usual bgwriter duties and nap until we've just
+ 	 * crossed our deadline.
+ 	 */
+ 	elog(DEBUG2,"checkpoint sync: considering a nap after progress=%.1f",progress);
+ 	while (!(flags & CHECKPOINT_IMMEDIATE) &&
+ 		   !shutdown_requested &&
+ 		   !ImmediateCheckpointRequested() &&
+ 		   IsCheckpointOnSchedule(progress,CheckPointSyncTarget))
+ 	{
+ 		if (got_SIGHUP)
+ 		{
+ 			got_SIGHUP = false;
+ 			ProcessConfigFile(PGC_SIGHUP);
+ 		}
+ 
+ 		elog(DEBUG2,"checkpoint sync: nap count=%d",nap_count);
+ 		nap_count++;
+ 
+ 		AbsorbFsyncRequests();
+ 
+ 		BgBufferSync();
+ 		CheckArchiveTimeout();
+ 		BgWriterNap();
+ 	}
+ 	if (nap_count > 0)
+ 		elog(DEBUG1,"checkpoint sync: naps=%d",nap_count);
+ }
+ 
+ /*
   * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
   * in time?
   *
*************** CheckpointWriteDelay(int flags, double p
*** 734,740 ****
   * than the elapsed time/segments.
   */
  static bool
! IsCheckpointOnSchedule(double progress)
  {
  	XLogRecPtr	recptr;
  	struct timeval now;
--- 802,808 ----
   * than the elapsed time/segments.
   */
  static bool
! IsCheckpointOnSchedule(double progress,double target)
  {
  	XLogRecPtr	recptr;
  	struct timeval now;
*************** IsCheckpointOnSchedule(double progress)
*** 743,750 ****
  	Assert(ckpt_active);
  
! 	/* Scale progress according to checkpoint_completion_target. */
! 	progress *= CheckPointCompletionTarget;
  
  	/*
  	 * Check against the cached value first. Only do the more expensive
--- 811,820 ----
  	Assert(ckpt_active);
  
! 	/* Scale progress according to given target. */
! 	progress *= target;
! 
! 	elog(DEBUG2,"checkpoint schedule check: scaled progress=%.1f target=%.1f",progress,target);
  
  	/*
  	 * Check against the cached value first. Only do the more expensive
*************** IsCheckpointOnSchedule(double progress)
*** 773,778 ****
--- 843,850 ----
  			((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) /
  			XLogSegSize) / CheckPointSegments;
  
+ 		elog(DEBUG2,"checkpoint schedule: elapsed xlogs=%.1f",elapsed_xlogs);
+ 
  		if (progress < elapsed_xlogs)
  		{
  			ckpt_cached_elapsed = elapsed_xlogs;
*************** IsCheckpointOnSchedule(double progress)
*** 787,792 ****
--- 859,866 ----
  	elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
  					now.tv_usec / 1000000.0) / CheckPointTimeout;
  
+ 	elog(DEBUG2,"checkpoint schedule: elapsed time=%.1f",elapsed_time);
+ 
  	if (progress < elapsed_time)
  	{
  		ckpt_cached_elapsed = elapsed_time;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 9d585b6..f294f6f 100644
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 31,39 ****
  #include "pg_trace.h"
  
- /* interval for calling AbsorbFsyncRequests in mdsync */
- #define FSYNCS_PER_ABSORB 10
- 
  /*
   * Special values for the segno arg to RememberFsyncRequest.
   *
--- 31,36 ----
*************** mdsync(void)
*** 932,938 ****
  	HASH_SEQ_STATUS hstat;
  	PendingOperationEntry *entry;
- 	int			absorb_counter;
  
  	/* Statistics on sync times */
  	int			processed = 0;
--- 929,934 ----
*************** mdsync(void)
*** 943,948 ****
--- 939,948 ----
  	uint64		longest = 0;
  	uint64		total_elapsed = 0;
  
+ 	/* Sync spreading counters */
+ 	int			sync_segments = 0;
+ 	int			current_segment = 0;
+ 
  	/*
  	 * This is only called during checkpoints, and checkpoints should only
  	 * occur in processes that have created a pendingOpsTable.
*************** mdsync(void)
*** 1001,1008 ****
  	/* Set flag to detect failure if we don't reach the end of the loop */
  	mdsync_in_progress = true;
  
  	/* Now scan the hashtable for fsync requests to process */
- 	absorb_counter = FSYNCS_PER_ABSORB;
  	hash_seq_init(&hstat, pendingOpsTable);
  	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
  	{
--- 1001,1033 ----
  	/* Set flag to detect failure if we don't reach the end of the loop */
  	mdsync_in_progress = true;
  
+ 	/* For spread sync timing purposes, make a scan through the
+ 	 * hashtable to count its entries.  Using hash_get_num_entries
+ 	 * instead would require a stronger lock than we want to have at
+ 	 * this point, and we don't want to count requests destined for
+ 	 * next cycle anyway
+ 	 *
+ 	 * XXX Should we skip this if there is no sync spreading, or if
+ 	 * fsync is off?
+ 	 */
+ 	hash_seq_init(&hstat, pendingOpsTable);
+ 	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
+ 	{
+ 		if (entry->cycle_ctr == mdsync_cycle_ctr)
+ 			continue;
+ 		sync_segments++;
+ 	}
+ 
+ 	/*
+ 	 * In the unexpected situation where the above estimate says there
+ 	 * is nothing to sync, avoid division by zero errors in the
+ 	 * progress computation below.
+ 	 */
+ 	if (sync_segments == 0)
+ 		sync_segments = 1;
+ 	elog(DEBUG1, "checkpoint sync: estimated segments=%d",sync_segments);
+ 
  	/* Now scan the hashtable for fsync requests to process */
  	hash_seq_init(&hstat, pendingOpsTable);
  	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
  	{
*************** mdsync(void)
*** 1027,1043 ****
  			int			failures;
  
  			/*
! 			 * If in bgwriter, we want to absorb pending requests every so
! 			 * often to prevent overflow of the fsync request queue.  It is
! 			 * unspecified whether newly-added entries will be visited by
! 			 * hash_seq_search, but we don't care since we don't need to
! 			 * process them anyway.
  			 */
! 			if (--absorb_counter <= 0)
! 			{
! 				AbsorbFsyncRequests();
! 				absorb_counter = FSYNCS_PER_ABSORB;
! 			}
  
  			/*
  			 * The fsync table could contain requests to fsync segments that
--- 1052,1060 ----
  			int			failures;
  
  			/*
! 			 * If in bgwriter, perform normal duties.
  			 */
! 			CheckpointSyncDelay(current_segment,sync_segments);
  
  			/*
  			 * The fsync table could contain requests to fsync segments that
*************** mdsync(void)
*** 1131,1140 ****
  			pfree(path);
  
  			/*
! 			 * Absorb incoming requests and check to see if canceled.
  			 */
! 			AbsorbFsyncRequests();
! 			absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
  
  			if (entry->canceled)
  				break;
--- 1148,1156 ----
  			pfree(path);
  
  			/*
! 			 * If in bgwriter, perform normal duties.
  			 */
! 			CheckpointSyncDelay(current_segment,sync_segments);
  
  			if (entry->canceled)
  				break;
*************** mdsync(void)
*** 1149,1154 ****
--- 1165,1172 ----
  			if (hash_search(pendingOpsTable, &entry->tag, HASH_REMOVE,
  							NULL) == NULL)
  				elog(ERROR, "pendingOpsTable corrupted");
+ 
+ 			current_segment++;
  		}						/* end loop over hashtable entries */
  
  	/* Return sync performance metrics for report at checkpoint end */
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index eaf2206..5da0aa2 100644
*** a/src/include/postmaster/bgwriter.h
--- b/src/include/postmaster/bgwriter.h
*************** extern void BackgroundWriterMain(void);
*** 26,31 ****
--- 26,32 ----
  extern void RequestCheckpoint(int flags);
  
  extern void CheckpointWriteDelay(int flags, double progress);
+ extern void CheckpointSyncDelay(int finished,int goal);
  
  extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
  								BlockNumber segno);