Re: Spread checkpoint sync - Mailing list pgsql-hackers
From: Greg Smith
Subject: Re: Spread checkpoint sync
Msg-id: 4D4F9A3D.5070700@2ndquadrant.com
In response to: Re: Spread checkpoint sync (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Spread checkpoint sync; Re: Spread checkpoint sync
List: pgsql-hackers
Robert Haas wrote:
> With the fsync queue compaction patch applied, I think most of this is
> now not needed.  Attached please find an attempt to isolate the
> portion that looks like it might still be useful.  The basic idea of
> what remains here is to make the background writer still do its normal
> stuff even when it's checkpointing.  In particular, with this patch
> applied, PG will:
>
> 1. Absorb fsync requests a lot more often during the sync phase.
> 2. Still try to run the cleaning scan during the sync phase.
> 3. Pause for 3 seconds after every fsync.

Yes, the bits you extracted were the remaining useful parts from the
original patch.  Today was quiet here because there were sports on or
something, so I added full auto-tuning magic to the attached update.  I
need to kick off benchmarks and report back tomorrow on how well this
does, but any additional patch here would only be cleanup of the messy
stuff I did in here (plus a proper implementation of the pair of GUCs).
This has finally gotten to the exact logic I've been meaning to complete
as spread sync since the idea was first postponed in 8.3, with the
benefit of some fsync absorption improvements along the way, too.

The automatic timing is modeled on the existing
checkpoint_completion_target concept, except with a new tunable (not yet
added as a GUC) currently called CheckPointSyncTarget, set to 0.8 right
now.  What I think I want to do is make the existing
checkpoint_completion_target be the target for the end of the sync
phase, matching its name; people who bumped it up won't necessarily even
have to change anything.  The new GUC can then be
checkpoint_write_target, representing the target that is in there right
now.

I tossed the earlier idea of counting relations to sync based on the
write phase data as too inaccurate after testing, and with it for now
goes checkpoint sorting.
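To make the proposed renaming concrete, a postgresql.conf fragment under that scheme might look like the sketch below.  Both the checkpoint_write_target name and these defaults are speculative, taken only from the discussion above; neither setting exists in released PostgreSQL in this form.

```
# Hypothetical settings under the proposed GUC pair; values mirror the
# defaults discussed in this message.
checkpoint_write_target = 0.5        # fraction of the interval for the write phase
checkpoint_completion_target = 0.8   # now the target for the end of the sync phase
```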
Instead, I just take a first pass over pendingOpsTable to get a total
number of things to sync, which will always match the real count barring
strange circumstances (like dropping a table).

As for automatically determining the interval:  I take the number of
syncs that have finished so far, divide by the total, and get a number
between 0.0 and 1.0 that represents progress on the sync phase.  I then
use the same basic CheckpointWriteDelay logic that is there for
spreading writes out, except with the new sync target.  I realized that
if we assume the checkpoint writes should have finished in
CheckPointCompletionTarget worth of time or segments, we can compute a
new progress metric with the formula:

    progress = CheckPointCompletionTarget +
        (1.0 - CheckPointCompletionTarget) * finished / goal;

where "finished" is the number of segments synced so far, while "goal"
is the total.  To turn this into an example, let's say the default
parameters are set, we've finished the writes, and finished 1 out of 4
syncs; that much work will be considered:

    progress = 0.5 + (1.0 - 0.5) * 1 / 4 = 0.625

on a scale that effectively aims to have the sync work finished by 0.8.

I don't use quite the same logic as CheckpointWriteDelay, though.  It
turns out the existing checkpoint_completion implementation doesn't
always work like I thought it did, which provided some very interesting
insight into why my attempts to work around checkpoint problems haven't
worked as well as expected the last few years.  I thought that what it
did was wait until an amount of time determined by the target was
reached before it did the next write.  That's not quite it; what it
actually does is check progress against the target, then sleep exactly
one nap interval if it is ahead of schedule.  That is only the same
thing if you have a lot of buffers to write relative to the amount of
time involved.
There's some alternative logic if you don't have bgwriter_lru_maxpages
set, but in the normal situation it effectively means that:

    maximum write spread time = bgwriter_delay * checkpoint dirty blocks

no matter how far apart you try to spread the checkpoints.  Now,
typically, when people run into these checkpoint spikes in production,
reducing shared_buffers improves that.  But I now realize that doing so
will then reduce the average number of dirty blocks participating in the
checkpoint, and therefore potentially pull the spread down at the same
time!  Also, if you try to tune bgwriter_delay down to get better
background cleaning, you're also reducing the maximum spread.  Between
this issue and the bad behavior when the fsync queue fills, no wonder
this has been so hard to tune out of production systems.  At some point,
the reduction in spread defeats further attempts to reduce the size of
what's written at checkpoint time by lowering the amount of data
involved.

What I do instead is nap until just after the planned schedule, then
execute the sync.  What ends up happening then is that there can be a
long pause between the end of the write phase and when syncs start to
happen, which I consider a good thing.  It gives the kernel a little
more time to try to get writes moving out to disk.
Here's what that looks like on my development desktop:

2011-02-07 00:46:24 EST: LOG:  checkpoint starting: time
2011-02-07 00:48:04 EST: DEBUG:  checkpoint sync: estimated segments=10
2011-02-07 00:48:24 EST: DEBUG:  checkpoint sync: naps=99
2011-02-07 00:48:36 EST: DEBUG:  checkpoint sync: number=1 file=base/16736/16749.1 time=12033.898 msec
2011-02-07 00:48:36 EST: DEBUG:  checkpoint sync: number=2 file=base/16736/16749 time=60.799 msec
2011-02-07 00:48:48 EST: DEBUG:  checkpoint sync: naps=59
2011-02-07 00:48:48 EST: DEBUG:  checkpoint sync: number=3 file=base/16736/16756 time=0.003 msec
2011-02-07 00:49:00 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:49:00 EST: DEBUG:  checkpoint sync: number=4 file=base/16736/16750 time=0.003 msec
2011-02-07 00:49:12 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:49:12 EST: DEBUG:  checkpoint sync: number=5 file=base/16736/16737 time=0.004 msec
2011-02-07 00:49:24 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:49:24 EST: DEBUG:  checkpoint sync: number=6 file=base/16736/16749_fsm time=0.004 msec
2011-02-07 00:49:36 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:49:36 EST: DEBUG:  checkpoint sync: number=7 file=base/16736/16740 time=0.003 msec
2011-02-07 00:49:48 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:49:48 EST: DEBUG:  checkpoint sync: number=8 file=base/16736/16749_vm time=0.003 msec
2011-02-07 00:50:00 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:50:00 EST: DEBUG:  checkpoint sync: number=9 file=base/16736/16752 time=0.003 msec
2011-02-07 00:50:12 EST: DEBUG:  checkpoint sync: naps=60
2011-02-07 00:50:12 EST: DEBUG:  checkpoint sync: number=10 file=base/16736/16754 time=0.003 msec
2011-02-07 00:50:12 EST: LOG:  checkpoint complete: wrote 14335 buffers (43.7%); 0 transaction log file(s) added, 0 removed, 64 recycled; write=47.873 s, sync=127.819 s, total=227.990 s; sync files=10, longest=12.033 s, average=1.209 s

Since this is ext3 the spike during the first sync is brutal anyway, but it tried very hard
to avoid that:  it waited 99 * 200ms = 19.8 seconds between writing the
last buffer and when it started syncing them (00:48:04 to 00:48:24).
Given the slow write for #1, it was then behind, so it immediately moved
on to #2.  But after that, it was able to insert a moderate nap time
between successive syncs--60 naps is 12 seconds, and it keeps that pace
for the remainder of the sync phase.  This is the same sort of thing I'd
worked out as optimal on the system this patch originated from, except
that system had a lot more dirty relations; that's why its naptime was
the 3 seconds hard-coded into earlier versions of this patch.  Results
on XFS with mini-server class hardware should be interesting...

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 4df69c2..f58ac3e 100644
*** a/src/backend/postmaster/bgwriter.c
--- b/src/backend/postmaster/bgwriter.c
*************** static bool am_bg_writer = false;
*** 168,173 ****
--- 168,175 ----
  static bool ckpt_active = false;
  
+ static int checkpoint_flags = 0;
+ 
  /* these values are valid when ckpt_active is true: */
  static pg_time_t ckpt_start_time;
  static XLogRecPtr ckpt_start_recptr;
*************** static pg_time_t last_xlog_switch_time;
*** 180,186 ****
  static void CheckArchiveTimeout(void);
  static void BgWriterNap(void);
! static bool IsCheckpointOnSchedule(double progress);
  static bool ImmediateCheckpointRequested(void);
  static bool CompactBgwriterRequestQueue(void);
--- 182,188 ----
  static void CheckArchiveTimeout(void);
  static void BgWriterNap(void);
! static bool IsCheckpointOnSchedule(double progress,double target);
  static bool ImmediateCheckpointRequested(void);
  static bool CompactBgwriterRequestQueue(void);
*************** CheckpointWriteDelay(int flags, double p
*** 691,696 ****
--- 693,701 ----
  	if (!am_bg_writer)
  		return;
  
+ 	/* Cache this value for a later spread sync */
+ 	checkpoint_flags=flags;
+ 
  	/*
  	 * Perform the usual bgwriter duties and take a nap, unless we're behind
  	 * schedule, in which case we just try to catch up as quickly as possible.
*************** CheckpointWriteDelay(int flags, double p
*** 698,704 ****
  	if (!(flags & CHECKPOINT_IMMEDIATE) &&
  		!shutdown_requested &&
  		!ImmediateCheckpointRequested() &&
! 		IsCheckpointOnSchedule(progress))
  	{
  		if (got_SIGHUP)
  		{
--- 703,709 ----
  	if (!(flags & CHECKPOINT_IMMEDIATE) &&
  		!shutdown_requested &&
  		!ImmediateCheckpointRequested() &&
! 		IsCheckpointOnSchedule(progress,CheckPointCompletionTarget))
  	{
  		if (got_SIGHUP)
  		{
*************** CheckpointWriteDelay(int flags, double p
*** 726,731 ****
--- 731,799 ----
  }
  
  /*
+  * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint
+  *
+  * This function is called after each file sync performed by mdsync().
+  * It is responsible for keeping the bgwriter's normal activities in
+  * progress during a long checkpoint.
+  */
+ void
+ CheckpointSyncDelay(int finished,int goal)
+ {
+ 	int			flags = checkpoint_flags;
+ 	int			nap_count = 0;
+ 	double		progress;
+ 	double		CheckPointSyncTarget = 0.8;
+ 
+ 	/* Do nothing if checkpoint is being executed by non-bgwriter process */
+ 	if (!am_bg_writer)
+ 		return;
+ 
+ 	/*
+ 	 * Limit progress to the goal, which may be possible if the
+ 	 * segments to sync were calculated wrong.
+ 	 */
+ 	ckpt_cached_elapsed = 0;
+ 	if (finished > goal)
+ 		finished=goal;
+ 
+ 	/*
+ 	 * Base our progress on the assumption that the write took
+ 	 * checkpoint_completion_target worth of time, and that sync
+ 	 * progress is advancing from beyond that point.
+ 	 */
+ 	progress = CheckPointCompletionTarget +
+ 		(1.0 - CheckPointCompletionTarget) * finished / goal;
+ 
+ 	/*
+ 	 * Perform the usual bgwriter duties and nap until we've just
+ 	 * crossed our deadline.
+ 	 */
+ 	elog(DEBUG2,"checkpoint sync: considering a nap after progress=%.1f",progress);
+ 	while (!(flags & CHECKPOINT_IMMEDIATE) &&
+ 		   !shutdown_requested &&
+ 		   !ImmediateCheckpointRequested() &&
+ 		   IsCheckpointOnSchedule(progress,CheckPointSyncTarget))
+ 	{
+ 		if (got_SIGHUP)
+ 		{
+ 			got_SIGHUP = false;
+ 			ProcessConfigFile(PGC_SIGHUP);
+ 		}
+ 
+ 		elog(DEBUG2,"checkpoint sync: nap count=%d",nap_count);
+ 		nap_count++;
+ 
+ 		AbsorbFsyncRequests();
+ 
+ 		BgBufferSync();
+ 		CheckArchiveTimeout();
+ 		BgWriterNap();
+ 	}
+ 	if (nap_count > 0)
+ 		elog(DEBUG1,"checkpoint sync: naps=%d",nap_count);
+ }
+ 
+ /*
   * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
   * in time?
   *
*************** CheckpointWriteDelay(int flags, double p
*** 734,740 ****
   * than the elapsed time/segments.
   */
  static bool
! IsCheckpointOnSchedule(double progress)
  {
  	XLogRecPtr	recptr;
  	struct timeval now;
--- 802,808 ----
   * than the elapsed time/segments.
   */
  static bool
! IsCheckpointOnSchedule(double progress,double target)
  {
  	XLogRecPtr	recptr;
  	struct timeval now;
*************** IsCheckpointOnSchedule(double progress)
*** 743,750 ****
  	Assert(ckpt_active);
  
! 	/* Scale progress according to checkpoint_completion_target. */
! 	progress *= CheckPointCompletionTarget;
  
  	/*
  	 * Check against the cached value first. Only do the more expensive
--- 811,820 ----
  	Assert(ckpt_active);
  
! 	/* Scale progress according to given target. */
! 	progress *= target;
! 
! 	elog(DEBUG2,"checkpoint schedule check: scaled progress=%.1f target=%.1f",progress,target);
  
  	/*
  	 * Check against the cached value first. Only do the more expensive
*************** IsCheckpointOnSchedule(double progress)
*** 773,778 ****
--- 843,850 ----
  			((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) /
  			XLogSegSize) / CheckPointSegments;
  
+ 		elog(DEBUG2,"checkpoint schedule: elapsed xlogs=%.1f",elapsed_xlogs);
+ 
  		if (progress < elapsed_xlogs)
  		{
  			ckpt_cached_elapsed = elapsed_xlogs;
*************** IsCheckpointOnSchedule(double progress)
*** 787,792 ****
--- 859,866 ----
  	elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
  					now.tv_usec / 1000000.0) / CheckPointTimeout;
  
+ 	elog(DEBUG2,"checkpoint schedule: elapsed time=%.1f",elapsed_time);
+ 
  	if (progress < elapsed_time)
  	{
  		ckpt_cached_elapsed = elapsed_time;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 9d585b6..f294f6f 100644
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 31,39 ****
  #include "pg_trace.h"
  
- /* interval for calling AbsorbFsyncRequests in mdsync */
- #define FSYNCS_PER_ABSORB 10
- 
  /*
   * Special values for the segno arg to RememberFsyncRequest.
   *
--- 31,36 ----
*************** mdsync(void)
*** 932,938 ****
  	HASH_SEQ_STATUS hstat;
  	PendingOperationEntry *entry;
- 	int			absorb_counter;
  
  	/* Statistics on sync times */
  	int			processed = 0;
--- 929,934 ----
*************** mdsync(void)
*** 943,948 ****
--- 939,948 ----
  	uint64		longest = 0;
  	uint64		total_elapsed = 0;
  
+ 	/* Sync spreading counters */
+ 	int			sync_segments = 0;
+ 	int			current_segment = 0;
+ 
  	/*
  	 * This is only called during checkpoints, and checkpoints should only
  	 * occur in processes that have created a pendingOpsTable.
*************** mdsync(void)
*** 1001,1008 ****
  	/* Set flag to detect failure if we don't reach the end of the loop */
  	mdsync_in_progress = true;
  
  	/* Now scan the hashtable for fsync requests to process */
- 	absorb_counter = FSYNCS_PER_ABSORB;
  	hash_seq_init(&hstat, pendingOpsTable);
  	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
  	{
--- 1001,1033 ----
  	/* Set flag to detect failure if we don't reach the end of the loop */
  	mdsync_in_progress = true;
  
+ 	/* For spread sync timing purposes, make a scan through the
+ 	 * hashtable to count its entries.  Using hash_get_num_entries
+ 	 * instead would require a stronger lock than we want to have at
+ 	 * this point, and we don't want to count requests destined for
+ 	 * next cycle anyway
+ 	 *
+ 	 * XXX Should we skip this if there is no sync spreading, or if
+ 	 * fsync is off?
+ 	 */
+ 	hash_seq_init(&hstat, pendingOpsTable);
+ 	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
+ 	{
+ 		if (entry->cycle_ctr == mdsync_cycle_ctr)
+ 			continue;
+ 		sync_segments++;
+ 	}
+ 
+ 	/*
+ 	 * In the unexpected situation where the above estimate says there
+ 	 * is nothing to sync, avoid division by zero errors in the
+ 	 * progress computation below.
+ 	 */
+ 	if (sync_segments == 0)
+ 		sync_segments = 1;
+ 	elog(DEBUG1, "checkpoint sync: estimated segments=%d",sync_segments);
+ 
  	/* Now scan the hashtable for fsync requests to process */
  	hash_seq_init(&hstat, pendingOpsTable);
  	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
  	{
*************** mdsync(void)
*** 1027,1043 ****
  			int			failures;
  
  			/*
! 			 * If in bgwriter, we want to absorb pending requests every so
! 			 * often to prevent overflow of the fsync request queue.  It is
! 			 * unspecified whether newly-added entries will be visited by
! 			 * hash_seq_search, but we don't care since we don't need to
! 			 * process them anyway.
  			 */
! 			if (--absorb_counter <= 0)
! 			{
! 				AbsorbFsyncRequests();
! 				absorb_counter = FSYNCS_PER_ABSORB;
! 			}
  
  			/*
  			 * The fsync table could contain requests to fsync segments that
--- 1052,1060 ----
  			int			failures;
  
  			/*
! 			 * If in bgwriter, perform normal duties.
  			 */
! 			CheckpointSyncDelay(current_segment,sync_segments);
  
  			/*
  			 * The fsync table could contain requests to fsync segments that
*************** mdsync(void)
*** 1131,1140 ****
  			pfree(path);
  
  			/*
! 			 * Absorb incoming requests and check to see if canceled.
  			 */
! 			AbsorbFsyncRequests();
! 			absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
  
  			if (entry->canceled)
  				break;
--- 1148,1156 ----
  			pfree(path);
  
  			/*
! 			 * If in bgwriter, perform normal duties.
  			 */
! 			CheckpointSyncDelay(current_segment,sync_segments);
  
  			if (entry->canceled)
  				break;
*************** mdsync(void)
*** 1149,1154 ****
--- 1165,1172 ----
  			if (hash_search(pendingOpsTable, &entry->tag, HASH_REMOVE,
  							NULL) == NULL)
  				elog(ERROR, "pendingOpsTable corrupted");
+ 
+ 			current_segment++;
  		}						/* end loop over hashtable entries */
  
  	/* Return sync performance metrics for report at checkpoint end */
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index eaf2206..5da0aa2 100644
*** a/src/include/postmaster/bgwriter.h
--- b/src/include/postmaster/bgwriter.h
*************** extern void BackgroundWriterMain(void);
*** 26,31 ****
--- 26,32 ----
  extern void RequestCheckpoint(int flags);
  
  extern void CheckpointWriteDelay(int flags, double progress);
+ extern void CheckpointSyncDelay(int finished,int goal);
  
  extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
  								BlockNumber segno);