Load Distributed Checkpoints, take 3 - Mailing list pgsql-patches

From:      Heikki Linnakangas
Subject:   Load Distributed Checkpoints, take 3
Msg-id:    46792FF3.8000301@enterprisedb.com
List:      pgsql-patches
Responses:
  Re: Load Distributed Checkpoints, take 3
  Re: Load Distributed Checkpoints, take 3
Here's an updated WIP patch for load distributed checkpoints.

I added a spinlock to protect the signaling fields between bgwriter and backends. The current non-locking approach gets really difficult as the patch adds two new flags, both more important than the existing ckpt_time_warn flag. In fact, I think there's a small race condition in CVS HEAD:

1. pg_start_backup() is called, which calls RequestCheckpoint.
2. RequestCheckpoint takes note of the old value of ckpt_started.
3. bgwriter wakes up from pg_usleep and sees that we've exceeded checkpoint_timeout.
4. bgwriter increases ckpt_started to note that a new checkpoint has started.
5. RequestCheckpoint signals bgwriter to start a new checkpoint.
6. bgwriter calls CreateCheckpoint with the force-flag set to false, because this checkpoint was triggered by timeout.
7. RequestCheckpoint sees that ckpt_started has increased, and starts to wait for ckpt_done to reach the new value.
8. CreateCheckpoint finishes immediately, because there was no XLOG activity since the last checkpoint.
9. RequestCheckpoint sees that ckpt_done matches ckpt_started, and returns.
10. pg_start_backup() continues, with potentially the same redo location, and thus the same history file name, as the previous backup.

Now, I admit that the chances of that happening are extremely small: people don't usually issue two pg_start_backup calls without *any* WAL-logged activity in between them, for example. But as we add the new flags, avoiding scenarios like that becomes harder.

Since the last patch, I did some cleanup and refactoring, and added a bunch of comments and user documentation. I haven't yet changed GetInsertRecPtr to use the almost up-to-date value protected by info_lck, per Simon's suggestion, and I still need to do some correctness testing. After that, I'm done with the patch.

PS. In case you wonder what took me so long since the last revision: I've spent a lot of time reviewing HOT.
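The spinlock-protected request/acknowledge protocol described above can be sketched in miniature. This is a hypothetical stand-alone illustration, not the patch itself: `DemoShmem`, `request_checkpoint`, and `start_checkpoint` loosely mirror `BgWriterShmemStruct` and the backend/bgwriter sides of `RequestCheckpoint`, and a portable pthread mutex stands in for the PostgreSQL spinlock:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Miniature stand-in for the ckpt_* signaling fields of BgWriterShmemStruct.
 * A pthread mutex plays the role of the patch's spinlock. */
typedef struct
{
	pthread_mutex_t ckpt_lck;	/* protects all fields below */
	int		ckpt_started;		/* advances when a checkpoint starts */
	int		ckpt_done;			/* advances when a checkpoint finishes */
	bool	ckpt_rqst_immediate;
	bool	ckpt_rqst_force;
} DemoShmem;

/* Backend side: set request flags and snapshot the started-counter in one
 * critical section, so that once ckpt_started advances past the snapshot,
 * the requester knows its flags have been seen by the bgwriter. */
static int
request_checkpoint(DemoShmem *bgs, bool immediate, bool force)
{
	int		old_started;

	pthread_mutex_lock(&bgs->ckpt_lck);
	old_started = bgs->ckpt_started;
	/* OR the flags in, so a "stronger" pending request is never lost */
	bgs->ckpt_rqst_immediate = bgs->ckpt_rqst_immediate || immediate;
	bgs->ckpt_rqst_force = bgs->ckpt_rqst_force || force;
	pthread_mutex_unlock(&bgs->ckpt_lck);

	return old_started;
}

/* Bgwriter side: consume the flags and acknowledge the request by
 * advancing ckpt_started, again in a single critical section. */
static void
start_checkpoint(DemoShmem *bgs, bool *immediate, bool *force)
{
	pthread_mutex_lock(&bgs->ckpt_lck);
	*immediate = bgs->ckpt_rqst_immediate;
	bgs->ckpt_rqst_immediate = false;
	*force = bgs->ckpt_rqst_force;
	bgs->ckpt_rqst_force = false;
	bgs->ckpt_started++;
	pthread_mutex_unlock(&bgs->ckpt_lck);
}
```

Because flags and counter move together under the lock, the timeout-vs-request race in steps 2-6 above cannot occur: a backend that sees ckpt_started advance is guaranteed its flags were consumed by that checkpoint or a later one.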
--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

Index: doc/src/sgml/config.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/config.sgml,v
retrieving revision 1.126
diff -c -r1.126 config.sgml
*** doc/src/sgml/config.sgml	7 Jun 2007 19:19:56 -0000	1.126
--- doc/src/sgml/config.sgml	19 Jun 2007 14:24:31 -0000
***************
*** 1565,1570 ****
--- 1565,1608 ----
       </listitem>
      </varlistentry>

+     <varlistentry id="guc-checkpoint-smoothing" xreflabel="checkpoint_smoothing">
+      <term><varname>checkpoint_smoothing</varname> (<type>floating point</type>)</term>
+      <indexterm>
+       <primary><varname>checkpoint_smoothing</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Specifies the target length of checkpoints, as a fraction of
+        the checkpoint interval. The default is 0.3.
+
+        This parameter can only be set in the <filename>postgresql.conf</>
+        file or on the server command line.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-checkpoint-rate" xreflabel="checkpoint_rate">
+      <term><varname>checkpoint_rate</varname> (<type>floating point</type>)</term>
+      <indexterm>
+       <primary><varname>checkpoint_rate</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Specifies the minimum I/O rate used to flush dirty buffers during a
+        checkpoint, when there are not many dirty buffers in the buffer cache.
+        The default is 512 KB/s.
+
+        Note: the accuracy of this setting depends on
+        <varname>bgwriter_delay</varname>. This value is converted internally
+        to pages / bgwriter_delay, so if for example the minimum allowed
+        bgwriter_delay setting of 10ms is used, the effective minimum
+        checkpoint I/O rate is 1 page / 10 ms, or 800 KB/s.
+
+        This parameter can only be set in the <filename>postgresql.conf</>
+        file or on the server command line.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
       <term><varname>checkpoint_warning</varname> (<type>integer</type>)</term>
       <indexterm>

Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.43
diff -c -r1.43 wal.sgml
*** doc/src/sgml/wal.sgml	31 Jan 2007 20:56:19 -0000	1.43
--- doc/src/sgml/wal.sgml	19 Jun 2007 14:26:45 -0000
***************
*** 217,225 ****
   </para>

   <para>
    There will be at least one WAL segment file, and will normally
    not be more than 2 * <varname>checkpoint_segments</varname> + 1
!   files.  Each segment file is normally 16 MB (though this size can be
    altered when building the server).  You can use this to estimate space
    requirements for <acronym>WAL</acronym>.
    Ordinarily, when old log segment files are no longer needed, they
--- 217,245 ----
   </para>

   <para>
+   If there are a lot of dirty buffers in the buffer cache, flushing them
+   all at checkpoint time will cause a heavy burst of I/O that can disrupt
+   other activity in the system. To avoid that, the checkpoint I/O can be
+   distributed over a longer period of time, defined with
+   <varname>checkpoint_smoothing</varname>. It's given as a fraction of the
+   checkpoint interval, as defined by <varname>checkpoint_timeout</varname>
+   and <varname>checkpoint_segments</varname>. The WAL segment consumption
+   and elapsed time are monitored, and the I/O rate is adjusted during the
+   checkpoint so that it's finished when the given fraction of elapsed time
+   or WAL segments has passed, whichever is sooner. However, that could lead
+   to unnecessarily prolonged checkpoints when there are not many dirty
+   buffers in the cache. To avoid that, <varname>checkpoint_rate</varname>
+   can be used to set the minimum I/O rate used.
+   Note that prolonging checkpoints
+   affects recovery time, because the longer the checkpoint takes, the more
+   WAL needs to be kept around and replayed in recovery.
+  </para>
+
+  <para>
    There will be at least one WAL segment file, and will normally
    not be more than 2 * <varname>checkpoint_segments</varname> + 1
!   files, though there can be more if a large
!   <varname>checkpoint_smoothing</varname> setting is used.
!   Each segment file is normally 16 MB (though this size can be
    altered when building the server).  You can use this to estimate space
    requirements for <acronym>WAL</acronym>.
    Ordinarily, when old log segment files are no longer needed, they

Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.272
diff -c -r1.272 xlog.c
*** src/backend/access/transam/xlog.c	31 May 2007 15:13:01 -0000	1.272
--- src/backend/access/transam/xlog.c	20 Jun 2007 10:44:40 -0000
***************
*** 398,404 ****
  static void exitArchiveRecovery(TimeLineID endTLI,
  				   uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo);
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
--- 398,404 ----
  static void exitArchiveRecovery(TimeLineID endTLI,
  				   uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate);
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
***************
*** 1608,1614 ****
  		if (XLOG_DEBUG)
  			elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
! 			RequestCheckpoint(false, true);
  		}
  	}
  }
--- 1608,1614 ----
  		if (XLOG_DEBUG)
  			elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
!
RequestXLogFillCheckpoint(); } } } *************** *** 5110,5116 **** * the rule that TLI only changes in shutdown checkpoints, which * allows some extra error checking in xlog_redo. */ ! CreateCheckPoint(true, true); /* * Close down recovery environment --- 5110,5116 ---- * the rule that TLI only changes in shutdown checkpoints, which * allows some extra error checking in xlog_redo. */ ! CreateCheckPoint(true, true, true); /* * Close down recovery environment *************** *** 5319,5324 **** --- 5319,5340 ---- } /* + * GetInsertRecPtr -- Returns the current insert position. + */ + XLogRecPtr + GetInsertRecPtr(void) + { + XLogCtlInsert *Insert = &XLogCtl->Insert; + XLogRecPtr recptr; + + LWLockAcquire(WALInsertLock, LW_SHARED); + INSERT_RECPTR(recptr, Insert, Insert->curridx); + LWLockRelease(WALInsertLock); + + return recptr; + } + + /* * Get the time of the last xlog segment switch */ time_t *************** *** 5383,5389 **** ereport(LOG, (errmsg("shutting down"))); ! CreateCheckPoint(true, true); ShutdownCLOG(); ShutdownSUBTRANS(); ShutdownMultiXact(); --- 5399,5405 ---- ereport(LOG, (errmsg("shutting down"))); ! CreateCheckPoint(true, true, true); ShutdownCLOG(); ShutdownSUBTRANS(); ShutdownMultiXact(); *************** *** 5395,5405 **** /* * Perform a checkpoint --- either during shutdown, or on-the-fly * * If force is true, we force a checkpoint regardless of whether any XLOG * activity has occurred since the last one. */ void ! CreateCheckPoint(bool shutdown, bool force) { CheckPoint checkPoint; XLogRecPtr recptr; --- 5411,5424 ---- /* * Perform a checkpoint --- either during shutdown, or on-the-fly * + * If immediate is true, we try to finish the checkpoint as fast as we can, + * ignoring checkpoint_smoothing parameter. + * * If force is true, we force a checkpoint regardless of whether any XLOG * activity has occurred since the last one. */ void ! 
CreateCheckPoint(bool shutdown, bool immediate, bool force) { CheckPoint checkPoint; XLogRecPtr recptr; *************** *** 5591,5597 **** */ END_CRIT_SECTION(); ! CheckPointGuts(checkPoint.redo); START_CRIT_SECTION(); --- 5610,5616 ---- */ END_CRIT_SECTION(); ! CheckPointGuts(checkPoint.redo, immediate); START_CRIT_SECTION(); *************** *** 5693,5708 **** /* * Flush all data in shared memory to disk, and fsync * * This is the common code shared between regular checkpoints and * recovery restartpoints. */ static void ! CheckPointGuts(XLogRecPtr checkPointRedo) { CheckPointCLOG(); CheckPointSUBTRANS(); CheckPointMultiXact(); ! FlushBufferPool(); /* performs all required fsyncs */ /* We deliberately delay 2PC checkpointing as long as possible */ CheckPointTwoPhase(checkPointRedo); } --- 5712,5730 ---- /* * Flush all data in shared memory to disk, and fsync * + * If immediate is true, try to finish as quickly as possible, ignoring + * the GUC variables to throttle checkpoint I/O. + * * This is the common code shared between regular checkpoints and * recovery restartpoints. */ static void ! CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate) { CheckPointCLOG(); CheckPointSUBTRANS(); CheckPointMultiXact(); ! FlushBufferPool(immediate); /* performs all required fsyncs */ /* We deliberately delay 2PC checkpointing as long as possible */ CheckPointTwoPhase(checkPointRedo); } *************** *** 5710,5716 **** /* * Set a recovery restart point if appropriate * ! * This is similar to CreateCheckpoint, but is used during WAL recovery * to establish a point from which recovery can roll forward without * replaying the entire recovery log. This function is called each time * a checkpoint record is read from XLOG; it must determine whether a --- 5732,5738 ---- /* * Set a recovery restart point if appropriate * ! 
* This is similar to CreateCheckPoint, but is used during WAL recovery * to establish a point from which recovery can roll forward without * replaying the entire recovery log. This function is called each time * a checkpoint record is read from XLOG; it must determine whether a *************** *** 5751,5757 **** /* * OK, force data out to disk */ ! CheckPointGuts(checkPoint->redo); /* * Update pg_control so that any subsequent crash will restart from this --- 5773,5779 ---- /* * OK, force data out to disk */ ! CheckPointGuts(checkPoint->redo, true); /* * Update pg_control so that any subsequent crash will restart from this *************** *** 6177,6183 **** * have different checkpoint positions and hence different history * file names, even if nothing happened in between. */ ! RequestCheckpoint(true, false); /* * Now we need to fetch the checkpoint record location, and also its --- 6199,6205 ---- * have different checkpoint positions and hence different history * file names, even if nothing happened in between. */ ! RequestLazyCheckpoint(); /* * Now we need to fetch the checkpoint record location, and also its Index: src/backend/bootstrap/bootstrap.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/bootstrap/bootstrap.c,v retrieving revision 1.233 diff -c -r1.233 bootstrap.c *** src/backend/bootstrap/bootstrap.c 7 Mar 2007 13:35:02 -0000 1.233 --- src/backend/bootstrap/bootstrap.c 19 Jun 2007 15:29:51 -0000 *************** *** 489,495 **** /* Perform a checkpoint to ensure everything's down to disk */ SetProcessingMode(NormalProcessing); ! CreateCheckPoint(true, true); /* Clean up and exit */ cleanup(); --- 489,495 ---- /* Perform a checkpoint to ensure everything's down to disk */ SetProcessingMode(NormalProcessing); ! 
CreateCheckPoint(true, true, true); /* Clean up and exit */ cleanup(); Index: src/backend/commands/dbcommands.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/dbcommands.c,v retrieving revision 1.195 diff -c -r1.195 dbcommands.c *** src/backend/commands/dbcommands.c 1 Jun 2007 19:38:07 -0000 1.195 --- src/backend/commands/dbcommands.c 20 Jun 2007 09:36:24 -0000 *************** *** 404,410 **** * up-to-date for the copy. (We really only need to flush buffers for the * source database, but bufmgr.c provides no API for that.) */ ! BufferSync(); /* * Once we start copying subdirectories, we need to be able to clean 'em --- 404,410 ---- * up-to-date for the copy. (We really only need to flush buffers for the * source database, but bufmgr.c provides no API for that.) */ ! BufferSync(true); /* * Once we start copying subdirectories, we need to be able to clean 'em *************** *** 507,513 **** * Perhaps if we ever implement CREATE DATABASE in a less cheesy way, * we can avoid this. */ ! RequestCheckpoint(true, false); /* * Close pg_database, but keep lock till commit (this is important to --- 507,513 ---- * Perhaps if we ever implement CREATE DATABASE in a less cheesy way, * we can avoid this. */ ! RequestImmediateCheckpoint(); /* * Close pg_database, but keep lock till commit (this is important to *************** *** 661,667 **** * open files, which would cause rmdir() to fail. */ #ifdef WIN32 ! RequestCheckpoint(true, false); #endif /* --- 661,667 ---- * open files, which would cause rmdir() to fail. */ #ifdef WIN32 ! RequestImmediateCheckpoint(); #endif /* *************** *** 1427,1433 **** * up-to-date for the copy. (We really only need to flush buffers for * the source database, but bufmgr.c provides no API for that.) */ ! BufferSync(); /* * Copy this subdirectory to the new location --- 1427,1433 ---- * up-to-date for the copy. 
(We really only need to flush buffers for * the source database, but bufmgr.c provides no API for that.) */ ! BufferSync(true); /* * Copy this subdirectory to the new location Index: src/backend/postmaster/bgwriter.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/postmaster/bgwriter.c,v retrieving revision 1.38 diff -c -r1.38 bgwriter.c *** src/backend/postmaster/bgwriter.c 27 May 2007 03:50:39 -0000 1.38 --- src/backend/postmaster/bgwriter.c 20 Jun 2007 12:58:20 -0000 *************** *** 44,49 **** --- 44,50 ---- #include "postgres.h" #include <signal.h> + #include <sys/time.h> #include <time.h> #include <unistd.h> *************** *** 59,64 **** --- 60,66 ---- #include "storage/pmsignal.h" #include "storage/shmem.h" #include "storage/smgr.h" + #include "storage/spin.h" #include "tcop/tcopprot.h" #include "utils/guc.h" #include "utils/memutils.h" *************** *** 112,122 **** { pid_t bgwriter_pid; /* PID of bgwriter (0 if not started) */ ! sig_atomic_t ckpt_started; /* advances when checkpoint starts */ ! sig_atomic_t ckpt_done; /* advances when checkpoint done */ ! sig_atomic_t ckpt_failed; /* advances when checkpoint fails */ ! sig_atomic_t ckpt_time_warn; /* warn if too soon since last ckpt? */ int num_requests; /* current # of requests */ int max_requests; /* allocated array size */ --- 114,128 ---- { pid_t bgwriter_pid; /* PID of bgwriter (0 if not started) */ ! slock_t ckpt_lck; /* protects all the ckpt_* fields */ ! int ckpt_started; /* advances when checkpoint starts */ ! int ckpt_done; /* advances when checkpoint done */ ! int ckpt_failed; /* advances when checkpoint fails */ ! ! bool ckpt_rqst_time_warn; /* warn if too soon since last ckpt */ ! bool ckpt_rqst_immediate; /* an immediate ckpt has been requested */ ! 
bool ckpt_rqst_force; /* checkpoint even if no WAL activity */ int num_requests; /* current # of requests */ int max_requests; /* allocated array size */ *************** *** 131,136 **** --- 137,143 ---- int BgWriterDelay = 200; int CheckPointTimeout = 300; int CheckPointWarning = 30; + double CheckPointSmoothing = 0.3; /* * Flags set by interrupt handlers for later service in the main loop. *************** *** 146,154 **** --- 153,176 ---- static bool ckpt_active = false; + /* Current time and WAL insert location when checkpoint was started */ + static time_t ckpt_start_time; + static XLogRecPtr ckpt_start_recptr; + + static double ckpt_cached_elapsed; + static time_t last_checkpoint_time; static time_t last_xlog_switch_time; + /* Prototypes for private functions */ + + static void RequestCheckpoint(bool waitforit, bool warnontime, bool immediate, bool force); + static void CheckArchiveTimeout(void); + static void BgWriterNap(void); + static bool IsCheckpointOnSchedule(double progress); + static bool ImmediateCheckpointRequested(void); + + /* Signal handlers */ static void bg_quickdie(SIGNAL_ARGS); static void BgSigHupHandler(SIGNAL_ARGS); *************** *** 170,175 **** --- 192,198 ---- Assert(BgWriterShmem != NULL); BgWriterShmem->bgwriter_pid = MyProcPid; + SpinLockInit(&BgWriterShmem->ckpt_lck); am_bg_writer = true; /* *************** *** 281,288 **** --- 304,314 ---- /* use volatile pointer to prevent code rearrangement */ volatile BgWriterShmemStruct *bgs = BgWriterShmem; + SpinLockAcquire(&BgWriterShmem->ckpt_lck); bgs->ckpt_failed++; bgs->ckpt_done = bgs->ckpt_started; + SpinLockRelease(&bgs->ckpt_lck); + ckpt_active = false; } *************** *** 328,337 **** for (;;) { bool do_checkpoint = false; - bool force_checkpoint = false; time_t now; int elapsed_secs; - long udelay; /* * Emergency bailout if postmaster has died. 
This is to avoid the --- 354,361 ---- *************** *** 354,360 **** { checkpoint_requested = false; do_checkpoint = true; - force_checkpoint = true; BgWriterStats.m_requested_checkpoints++; } if (shutdown_requested) --- 378,383 ---- *************** *** 377,387 **** */ now = time(NULL); elapsed_secs = now - last_checkpoint_time; ! if (elapsed_secs >= CheckPointTimeout) { do_checkpoint = true; ! if (!force_checkpoint) ! BgWriterStats.m_timed_checkpoints++; } /* --- 400,409 ---- */ now = time(NULL); elapsed_secs = now - last_checkpoint_time; ! if (!do_checkpoint && elapsed_secs >= CheckPointTimeout) { do_checkpoint = true; ! BgWriterStats.m_timed_checkpoints++; } /* *************** *** 390,395 **** --- 412,445 ---- */ if (do_checkpoint) { + /* use volatile pointer to prevent code rearrangement */ + volatile BgWriterShmemStruct *bgs = BgWriterShmem; + bool time_warn; + bool immediate; + bool force; + + /* + * Atomically check the request flags to figure out what + * kind of a checkpoint we should perform, and increase the + * started-counter to acknowledge that we've started + * a new checkpoint. + */ + + SpinLockAcquire(&bgs->ckpt_lck); + + time_warn = bgs->ckpt_rqst_time_warn; + bgs->ckpt_rqst_time_warn = false; + + immediate = bgs->ckpt_rqst_immediate; + bgs->ckpt_rqst_immediate = false; + + force = bgs->ckpt_rqst_force; + bgs->ckpt_rqst_force = false; + + bgs->ckpt_started++; + + SpinLockRelease(&bgs->ckpt_lck); + /* * We will warn if (a) too soon since last checkpoint (whatever * caused it) and (b) somebody has set the ckpt_time_warn flag *************** *** 397,417 **** * implementation will not generate warnings caused by * CheckPointTimeout < CheckPointWarning. */ ! if (BgWriterShmem->ckpt_time_warn && elapsed_secs < CheckPointWarning) ereport(LOG, (errmsg("checkpoints are occurring too frequently (%d seconds apart)", elapsed_secs), errhint("Consider increasing the configuration parameter \"checkpoint_segments\"."))); ! 
BgWriterShmem->ckpt_time_warn = false; /* * Indicate checkpoint start to any waiting backends. */ ckpt_active = true; - BgWriterShmem->ckpt_started++; ! CreateCheckPoint(false, force_checkpoint); /* * After any checkpoint, close all smgr files. This is so we --- 447,474 ---- * implementation will not generate warnings caused by * CheckPointTimeout < CheckPointWarning. */ ! if (time_warn && elapsed_secs < CheckPointWarning) ereport(LOG, (errmsg("checkpoints are occurring too frequently (%d seconds apart)", elapsed_secs), errhint("Consider increasing the configuration parameter \"checkpoint_segments\"."))); ! /* * Indicate checkpoint start to any waiting backends. */ ckpt_active = true; ! ckpt_start_recptr = GetInsertRecPtr(); ! ckpt_start_time = now; ! ckpt_cached_elapsed = 0; ! ! elog(DEBUG1, "CHECKPOINT: start"); ! ! CreateCheckPoint(false, immediate, force); ! ! elog(DEBUG1, "CHECKPOINT: end"); /* * After any checkpoint, close all smgr files. This is so we *************** *** 422,428 **** /* * Indicate checkpoint completion to any waiting backends. */ ! BgWriterShmem->ckpt_done = BgWriterShmem->ckpt_started; ckpt_active = false; /* --- 479,487 ---- /* * Indicate checkpoint completion to any waiting backends. */ ! SpinLockAcquire(&bgs->ckpt_lck); ! bgs->ckpt_done = bgs->ckpt_started; ! SpinLockRelease(&bgs->ckpt_lck); ckpt_active = false; /* *************** *** 433,446 **** last_checkpoint_time = now; } else ! BgBufferSync(); /* ! * Check for archive_timeout, if so, switch xlog files. First we do a ! * quick check using possibly-stale local state. */ ! if (XLogArchiveTimeout > 0 && ! (int) (now - last_xlog_switch_time) >= XLogArchiveTimeout) { /* * Update local state ... note that last_xlog_switch_time is the --- 492,530 ---- last_checkpoint_time = now; } else ! { ! BgAllSweep(); ! BgLruSweep(); ! } /* ! * Check for archive_timeout and switch xlog files if necessary. */ ! CheckArchiveTimeout(); ! ! /* Nap for the configured time. */ ! BgWriterNap(); ! } ! } ! ! 
/* ! * CheckArchiveTimeout -- check for archive_timeout and switch xlog files ! * if needed ! */ ! static void ! CheckArchiveTimeout(void) ! { ! time_t now; ! ! if (XLogArchiveTimeout <= 0) ! return; ! ! now = time(NULL); ! ! /* First we do a quick check using possibly-stale local state. */ ! if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout) ! return; ! { /* * Update local state ... note that last_xlog_switch_time is the *************** *** 450,459 **** last_xlog_switch_time = Max(last_xlog_switch_time, last_time); - /* if we did a checkpoint, 'now' might be stale too */ - if (do_checkpoint) - now = time(NULL); - /* Now we can do the real check */ if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout) { --- 534,539 ---- *************** *** 478,483 **** --- 558,572 ---- last_xlog_switch_time = now; } } + } + + /* + * BgWriterNap -- Nap for the configured time or until a signal is received. + */ + static void + BgWriterNap(void) + { + long udelay; /* * Send off activity statistics to the stats collector *************** *** 496,502 **** * We absorb pending requests after each short sleep. */ if ((bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > 0) || ! (bgwriter_lru_percent > 0.0 && bgwriter_lru_maxpages > 0)) udelay = BgWriterDelay * 1000L; else if (XLogArchiveTimeout > 0) udelay = 1000000L; /* One second */ --- 585,592 ---- * We absorb pending requests after each short sleep. */ if ((bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > 0) || ! (bgwriter_lru_percent > 0.0 && bgwriter_lru_maxpages > 0) || ! ckpt_active) udelay = BgWriterDelay * 1000L; else if (XLogArchiveTimeout > 0) udelay = 1000000L; /* One second */ *************** *** 505,522 **** while (udelay > 999999L) { ! if (got_SIGHUP || checkpoint_requested || shutdown_requested) break; pg_usleep(1000000L); AbsorbFsyncRequests(); udelay -= 1000000L; } ! 
if (!(got_SIGHUP || checkpoint_requested || shutdown_requested)) pg_usleep(udelay); } } /* -------------------------------- * signal handler routines --- 595,766 ---- while (udelay > 999999L) { ! /* If a checkpoint is active, postpone reloading the config ! * until the checkpoint is finished, and don't care about ! * non-immediate checkpoint requests. ! */ ! if (shutdown_requested || ! (!ckpt_active && (got_SIGHUP || checkpoint_requested)) || ! (ckpt_active && ImmediateCheckpointRequested())) break; + pg_usleep(1000000L); AbsorbFsyncRequests(); udelay -= 1000000L; } ! ! if (!(shutdown_requested || ! (!ckpt_active && (got_SIGHUP || checkpoint_requested)) || ! (ckpt_active && ImmediateCheckpointRequested()))) pg_usleep(udelay); + } + + /* + * Returns true if an immediate checkpoint request is pending. + */ + static bool + ImmediateCheckpointRequested() + { + if (checkpoint_requested) + { + volatile BgWriterShmemStruct *bgs = BgWriterShmem; + + /* + * We're only looking at a single field, so we don't need to + * acquire the lock in this case. + */ + if (bgs->ckpt_rqst_immediate) + return true; } + return false; } + /* + * CheckpointWriteDelay -- periodical sleep in checkpoint write phase + * + * During checkpoint, this is called periodically by the buffer manager while + * writing out dirty buffers from the shared buffer cache. We estimate if we've + * made enough progress so that we're going to finish this checkpoint in time + * before the next one is due, taking checkpoint_smoothing into account. + * If so, we perform one round of normal bgwriter activity including LRU- + * cleaning of buffer cache, switching xlog segment if archive_timeout has + * passed, and sleeping for BgWriterDelay msecs. + * + * 'progress' is an estimate of how much of the writes has been done, as a + * fraction between 0.0 meaning none, and 1.0 meaning all done. + */ + void + CheckpointWriteDelay(double progress) + { + /* + * Return immediately if we should finish the checkpoint ASAP. 
+ */ + if (!am_bg_writer || CheckPointSmoothing <= 0 || shutdown_requested || + ImmediateCheckpointRequested()) + return; + + elog(DEBUG1, "CheckpointWriteDelay: progress=%.3f", progress); + + /* Take a nap and perform the usual bgwriter duties, unless we're behind + * schedule, in which case we just try to catch up as quickly as possible. + */ + if (IsCheckpointOnSchedule(progress)) + { + CheckArchiveTimeout(); + BgLruSweep(); + BgWriterNap(); + } + } + + /* + * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint + * in time? + * + * Compares the current progress against the time/segments elapsed since last + * checkpoint, and returns true if the progress we've made this far is greater + * than the elapsed time/segments. + * + * If another checkpoint has already been requested, always return false. + */ + static bool + IsCheckpointOnSchedule(double progress) + { + struct timeval now; + XLogRecPtr recptr; + double progress_in_time, + progress_in_xlog; + + Assert(ckpt_active); + + /* scale progress according to CheckPointSmoothing */ + progress *= CheckPointSmoothing; + + /* + * Check against the cached value first. Only do the more expensive + * calculations once we reach the target previously calculated. Since + * neither time or WAL insert pointer moves backwards, a freshly + * calculated value can only be greater than or equal to the cached value. + */ + if (progress < ckpt_cached_elapsed) + { + elog(DEBUG2, "IsCheckpointOnSchedule: Still behind cached=%.3f, progress=%.3f", + ckpt_cached_elapsed, progress); + return false; + } + + gettimeofday(&now, NULL); + + /* + * Check progress against time elapsed and checkpoint_timeout. 
+ */ + progress_in_time = ((double) (now.tv_sec - ckpt_start_time) + + now.tv_usec / 1000000.0) / CheckPointTimeout; + + if (progress < progress_in_time) + { + elog(DEBUG2, "IsCheckpointOnSchedule: Behind checkpoint_timeout, time=%.3f, progress=%.3f", + progress_in_time, progress); + + ckpt_cached_elapsed = progress_in_time; + + return false; + } + + /* + * Check progress against WAL segments written and checkpoint_segments. + * + * We compare the current WAL insert location against the location + * computed before calling CreateCheckPoint. The code in XLogInsert that + * actually triggers a checkpoint when checkpoint_segments is exceeded + * compares against RedoRecptr, so this is not completely accurate. + * However, it's good enough for our purposes, we're only calculating + * an estimate anyway. + */ + recptr = GetInsertRecPtr(); + progress_in_xlog = + (((double) recptr.xlogid - (double) ckpt_start_recptr.xlogid) * XLogSegsPerFile + + ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) / + CheckPointSegments; + + if (progress < progress_in_xlog) + { + elog(DEBUG2, "IsCheckpointOnSchedule: Behind checkpoint_segments, xlog=%.3f, progress=%.3f", + progress_in_xlog, progress); + + ckpt_cached_elapsed = progress_in_xlog; + + return false; + } + + + /* It looks like we're on schedule. */ + + elog(DEBUG2, "IsCheckpointOnSchedule: on schedule, time=%.3f, xlog=%.3f progress=%.3f", + progress_in_time, progress_in_xlog, progress); + + return true; + } /* -------------------------------- * signal handler routines *************** *** 618,625 **** } /* * RequestCheckpoint ! * Called in backend processes to request an immediate checkpoint * * If waitforit is true, wait until the checkpoint is completed * before returning; otherwise, just signal the request and return --- 862,910 ---- } /* + * RequestImmediateCheckpoint + * Called in backend processes to request an immediate checkpoint. + * + * Returns when the checkpoint is finished. 
+  */
+ void
+ RequestImmediateCheckpoint()
+ {
+ 	RequestCheckpoint(true, false, true, true);
+ }
+
+ /*
+  * RequestLazyCheckpoint
+  *		Called in backend processes to request a lazy checkpoint.
+  *
+  * This is essentially the same as RequestImmediateCheckpoint, except
+  * that this form obeys the checkpoint_smoothing GUC variable, and
+  * can therefore take a lot longer.
+  *
+  * Returns when the checkpoint is finished.
+  */
+ void
+ RequestLazyCheckpoint()
+ {
+ 	RequestCheckpoint(true, false, false, true);
+ }
+
+ /*
+  * RequestXLogFillCheckpoint
+  *		Signals the bgwriter that we've reached checkpoint_segments
+  *
+  * Unlike RequestImmediateCheckpoint and RequestLazyCheckpoint, this
+  * returns immediately without waiting for the checkpoint to finish.
+  */
+ void
+ RequestXLogFillCheckpoint()
+ {
+ 	RequestCheckpoint(false, true, false, false);
+ }
+
+ /*
   * RequestCheckpoint
!  *		Common subroutine for all the above Request*Checkpoint variants.
   *
   * If waitforit is true, wait until the checkpoint is completed
   * before returning; otherwise, just signal the request and return
***************
*** 628,648 ****
   * If warnontime is true, and it's "too soon" since the last checkpoint,
   * the bgwriter will log a warning.
This should be true only for checkpoints * caused due to xlog filling, else the warning will be misleading. + * + * If immediate is true, the checkpoint should be finished ASAP. + * + * If force is true, force a checkpoint even if no XLOG activity has occured + * since the last one. */ ! static void ! RequestCheckpoint(bool waitforit, bool warnontime, bool immediate, bool force) { /* use volatile pointer to prevent code rearrangement */ volatile BgWriterShmemStruct *bgs = BgWriterShmem; ! int old_failed, old_started; /* * If in a standalone backend, just do it ourselves. */ if (!IsPostmasterEnvironment) { ! /* ! * There's no point in doing lazy checkpoints in a standalone ! * backend, because there's no other backends the checkpoint could ! * disrupt. ! */ ! CreateCheckPoint(false, true, true); /* * After any checkpoint, close all smgr files. This is so we won't *************** *** 653,661 **** return; } ! /* Set warning request flag if appropriate */ if (warnontime) ! bgs->ckpt_time_warn = true; /* * Send signal to request checkpoint. When waitforit is false, we --- 947,974 ---- return; } ! /* ! * Atomically set the request flags, and take a snapshot of the counters. ! * This ensures that when we see that ckpt_started > old_started, ! * we know the flags we set here have been seen by bgwriter. ! * ! * Note that we effectively OR the flags with any existing flags, to ! * avoid overriding a "stronger" request by another backend. ! */ ! SpinLockAcquire(&bgs->ckpt_lck); ! ! old_failed = bgs->ckpt_failed; ! old_started = bgs->ckpt_started; ! ! /* Set request flags as appropriate */ if (warnontime) ! bgs->ckpt_rqst_time_warn = true; ! if (immediate) ! bgs->ckpt_rqst_immediate = true; ! if (force) ! bgs->ckpt_rqst_force = true; ! ! SpinLockRelease(&bgs->ckpt_lck); /* * Send signal to request checkpoint. When waitforit is false, we *************** *** 674,701 **** */ if (waitforit) { ! 
while (bgs->ckpt_started == old_started) { CHECK_FOR_INTERRUPTS(); pg_usleep(100000L); } - old_started = bgs->ckpt_started; /* ! * We are waiting for ckpt_done >= old_started, in a modulo sense. ! * This is a little tricky since we don't know the width or signedness ! * of sig_atomic_t. We make the lowest common denominator assumption ! * that it is only as wide as "char". This means that this algorithm ! * will cope correctly as long as we don't sleep for more than 127 ! * completed checkpoints. (If we do, we will get another chance to ! * exit after 128 more checkpoints...) */ ! while (((signed char) (bgs->ckpt_done - old_started)) < 0) { CHECK_FOR_INTERRUPTS(); pg_usleep(100000L); } ! if (bgs->ckpt_failed != old_failed) ereport(ERROR, (errmsg("checkpoint request failed"), errhint("Consult recent messages in the server log for details."))); --- 987,1031 ---- */ if (waitforit) { ! int new_started, new_failed; ! ! /* Wait for a new checkpoint to start. */ ! for(;;) { + SpinLockAcquire(&bgs->ckpt_lck); + new_started = bgs->ckpt_started; + SpinLockRelease(&bgs->ckpt_lck); + + if (new_started != old_started) + break; + CHECK_FOR_INTERRUPTS(); pg_usleep(100000L); } /* ! * We are waiting for ckpt_done >= new_started, in a modulo sense. ! * This algorithm will cope correctly as long as we don't sleep for ! * more than MAX_INT completed checkpoints. (If we do, we will get ! * another chance to exit after MAX_INT more checkpoints...) */ ! for(;;) { + int new_done; + + SpinLockAcquire(&bgs->ckpt_lck); + new_done = bgs->ckpt_done; + new_failed = bgs->ckpt_failed; + SpinLockRelease(&bgs->ckpt_lck); + + if(new_done - new_started >= 0) + break; + CHECK_FOR_INTERRUPTS(); pg_usleep(100000L); } ! ! 
if (new_failed != old_failed) ereport(ERROR, (errmsg("checkpoint request failed"), errhint("Consult recent messages in the server log for details."))); Index: src/backend/storage/buffer/bufmgr.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v retrieving revision 1.220 diff -c -r1.220 bufmgr.c *** src/backend/storage/buffer/bufmgr.c 30 May 2007 20:11:58 -0000 1.220 --- src/backend/storage/buffer/bufmgr.c 20 Jun 2007 12:47:43 -0000 *************** *** 32,38 **** * * BufferSync() -- flush all dirty buffers in the buffer pool. * ! * BgBufferSync() -- flush some dirty buffers in the buffer pool. * * InitBufferPool() -- Init the buffer module. * --- 32,40 ---- * * BufferSync() -- flush all dirty buffers in the buffer pool. * ! * BgAllSweep() -- write out some dirty buffers in the pool. ! * ! * BgLruSweep() -- write out some lru dirty buffers in the pool. * * InitBufferPool() -- Init the buffer module. * *************** *** 74,79 **** --- 76,82 ---- double bgwriter_all_percent = 0.333; int bgwriter_lru_maxpages = 5; int bgwriter_all_maxpages = 5; + int checkpoint_rate = 512; /* in pages/s */ long NDirectFileRead; /* some I/O's are direct file access. bypass *************** *** 645,651 **** * at 1 so that the buffer can survive one clock-sweep pass.) */ buf->tag = newTag; ! buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR); buf->flags |= BM_TAG_VALID; buf->usage_count = 1; --- 648,654 ---- * at 1 so that the buffer can survive one clock-sweep pass.) */ buf->tag = newTag; ! buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR); buf->flags |= BM_TAG_VALID; buf->usage_count = 1; *************** *** 1000,1037 **** * BufferSync -- Write out all dirty buffers in the pool. * * This is called at checkpoint time to write out all dirty shared buffers. */ void ! BufferSync(void) { ! 
int buf_id; int num_to_scan; int absorb_counter; /* * Find out where to start the circular scan. */ ! buf_id = StrategySyncStart(); /* Make sure we can handle the pin inside SyncOneBuffer */ ResourceOwnerEnlargeBuffers(CurrentResourceOwner); /* ! * Loop over all buffers. */ num_to_scan = NBuffers; absorb_counter = WRITES_PER_ABSORB; while (num_to_scan-- > 0) { ! if (SyncOneBuffer(buf_id, false)) { BgWriterStats.m_buf_written_checkpoints++; /* * If in bgwriter, absorb pending fsync requests after each * WRITES_PER_ABSORB write operations, to prevent overflow of the * fsync request queue. If not in bgwriter process, this is a * no-op. */ if (--absorb_counter <= 0) { --- 1003,1127 ---- * BufferSync -- Write out all dirty buffers in the pool. * * This is called at checkpoint time to write out all dirty shared buffers. + * If 'immediate' is true, write them all ASAP, otherwise throttle the + * I/O rate according to checkpoint_write_rate GUC variable, and perform + * normal bgwriter duties periodically. */ void ! BufferSync(bool immediate) { ! int buf_id, start_id; int num_to_scan; + int num_to_write; + int num_written; int absorb_counter; + int num_written_since_nap; + int writes_per_nap; + + /* + * Convert checkpoint_write_rate to number writes of writes to perform in + * a period of BgWriterDelay. The result is an integer, so we lose some + * precision here. There's a lot of other factors as well that affect the + * real rate, for example granularity of OS timer used for BgWriterDelay, + * whether any of the writes block, and time spent in CheckpointWriteDelay + * performing normal bgwriter duties. + */ + writes_per_nap = Min(1, checkpoint_rate / BgWriterDelay); /* * Find out where to start the circular scan. */ ! start_id = StrategySyncStart(); /* Make sure we can handle the pin inside SyncOneBuffer */ ResourceOwnerEnlargeBuffers(CurrentResourceOwner); /* ! * Loop over all buffers, and mark the ones that need to be written with ! * BM_CHECKPOINT_NEEDED. 
Count them as we go (num_to_write), so that we ! * can estimate how much work needs to be done. ! * ! * This allows us to only write those pages that were dirty when the ! * checkpoint began, and haven't been flushed to disk since. Whenever a ! * page with BM_CHECKPOINT_NEEDED is written out by normal backends or ! * the bgwriter LRU-scan, the flag is cleared, and any pages dirtied after ! * this scan don't have the flag set. ! */ ! num_to_scan = NBuffers; ! num_to_write = 0; ! buf_id = start_id; ! while (num_to_scan-- > 0) ! { ! volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id]; ! ! /* ! * Header spinlock is enough to examine BM_DIRTY, see comment in ! * SyncOneBuffer. ! */ ! LockBufHdr(bufHdr); ! ! if (bufHdr->flags & BM_DIRTY) ! { ! bufHdr->flags |= BM_CHECKPOINT_NEEDED; ! num_to_write++; ! } ! ! UnlockBufHdr(bufHdr); ! ! if (++buf_id >= NBuffers) ! buf_id = 0; ! } ! ! elog(DEBUG1, "CHECKPOINT: %d / %d buffers to write", num_to_write, NBuffers); ! ! /* ! * Loop over all buffers again, and write the ones (still) marked with ! * BM_CHECKPOINT_NEEDED. */ num_to_scan = NBuffers; + num_written = num_written_since_nap = 0; absorb_counter = WRITES_PER_ABSORB; + buf_id = start_id; while (num_to_scan-- > 0) { ! volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id]; ! bool needs_flush; ! ! /* We don't need to acquire the lock here, because we're ! * only looking at a single bit. It's possible that someone ! * else writes the buffer and clears the flag right after we ! * check, but that doesn't matter. This assumes that no-one ! * clears the flag and sets it again while holding info_lck, ! * expecting no-one to see the intermediary state. ! */ ! needs_flush = (bufHdr->flags & BM_CHECKPOINT_NEEDED) != 0; ! ! if (needs_flush && SyncOneBuffer(buf_id, false)) { BgWriterStats.m_buf_written_checkpoints++; + num_written++; + + /* + * Perform normal bgwriter duties and sleep to throttle + * our I/O rate. 
+ */ + if (!immediate && ++num_written_since_nap >= writes_per_nap) + { + num_written_since_nap = 0; + CheckpointWriteDelay((double) (num_written) / num_to_write); + } /* * If in bgwriter, absorb pending fsync requests after each * WRITES_PER_ABSORB write operations, to prevent overflow of the * fsync request queue. If not in bgwriter process, this is a * no-op. + * + * AbsorbFsyncRequests is also called inside CheckpointWriteDelay, + * so this is partially redundant. However, we can't totally trust + * on the call in CheckpointWriteDelay, because it's only made + * before sleeping. In case CheckpointWriteDelay doesn't sleep, + * we need to absorb pending requests ourselves. */ if (--absorb_counter <= 0) { *************** *** 1045,1059 **** } /* ! * BgBufferSync -- Write out some dirty buffers in the pool. * * This is called periodically by the background writer process. */ void ! BgBufferSync(void) { static int buf_id1 = 0; - int buf_id2; int num_to_scan; int num_written; --- 1135,1152 ---- } /* ! * BgAllSweep -- Write out some dirty buffers in the pool. * + * Runs the bgwriter all-sweep algorithm to write dirty buffers to + * minimize work at checkpoint time. * This is called periodically by the background writer process. + * + * XXX: Is this really needed with load distributed checkpoints? */ void ! BgAllSweep(void) { static int buf_id1 = 0; int num_to_scan; int num_written; *************** *** 1063,1072 **** /* * To minimize work at checkpoint time, we want to try to keep all the * buffers clean; this motivates a scan that proceeds sequentially through ! * all buffers. But we are also charged with ensuring that buffers that ! * will be recycled soon are clean when needed; these buffers are the ones ! * just ahead of the StrategySyncStart point. We make a separate scan ! * through those. */ /* --- 1156,1162 ---- /* * To minimize work at checkpoint time, we want to try to keep all the * buffers clean; this motivates a scan that proceeds sequentially through ! 
* all buffers. */ /* *************** *** 1098,1103 **** --- 1188,1210 ---- } BgWriterStats.m_buf_written_all += num_written; } + } + + /* + * BgLruSweep -- Write out some lru dirty buffers in the pool. + */ + void + BgLruSweep(void) + { + int buf_id2; + int num_to_scan; + int num_written; + + /* + * The purpose of this sweep is to ensure that buffers that + * will be recycled soon are clean when needed; these buffers are the ones + * just ahead of the StrategySyncStart point. + */ /* * This loop considers only unpinned buffers close to the clock sweep *************** *** 1341,1349 **** * flushed. */ void ! FlushBufferPool(void) { ! BufferSync(); smgrsync(); } --- 1448,1459 ---- * flushed. */ void ! FlushBufferPool(bool immediate) { ! elog(DEBUG1, "CHECKPOINT: write phase"); ! BufferSync(immediate || CheckPointSmoothing <= 0); ! ! elog(DEBUG1, "CHECKPOINT: sync phase"); smgrsync(); } *************** *** 2132,2138 **** Assert(buf->flags & BM_IO_IN_PROGRESS); buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR); if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED)) ! buf->flags &= ~BM_DIRTY; buf->flags |= set_flag_bits; UnlockBufHdr(buf); --- 2242,2248 ---- Assert(buf->flags & BM_IO_IN_PROGRESS); buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR); if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED)) ! buf->flags &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED); buf->flags |= set_flag_bits; UnlockBufHdr(buf); Index: src/backend/tcop/utility.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/tcop/utility.c,v retrieving revision 1.280 diff -c -r1.280 utility.c *** src/backend/tcop/utility.c 30 May 2007 20:12:01 -0000 1.280 --- src/backend/tcop/utility.c 20 Jun 2007 09:36:31 -0000 *************** *** 1089,1095 **** ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), errmsg("must be superuser to do CHECKPOINT"))); ! 
RequestCheckpoint(true, false); break; case T_ReindexStmt: --- 1089,1095 ---- ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), errmsg("must be superuser to do CHECKPOINT"))); ! RequestImmediateCheckpoint(); break; case T_ReindexStmt: Index: src/backend/utils/misc/guc.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v retrieving revision 1.396 diff -c -r1.396 guc.c *** src/backend/utils/misc/guc.c 8 Jun 2007 18:23:52 -0000 1.396 --- src/backend/utils/misc/guc.c 20 Jun 2007 10:14:06 -0000 *************** *** 1487,1492 **** --- 1487,1503 ---- 30, 0, INT_MAX, NULL, NULL }, + + { + {"checkpoint_rate", PGC_SIGHUP, WAL_CHECKPOINTS, + gettext_noop("Minimum I/O rate used to write dirty buffers during checkpoints."), + NULL, + GUC_UNIT_BLOCKS + }, + &checkpoint_rate, + 100, 0.0, 100000, NULL, NULL + }, + { {"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS, gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."), *************** *** 1866,1871 **** --- 1877,1891 ---- 0.1, 0.0, 100.0, NULL, NULL }, + { + {"checkpoint_smoothing", PGC_SIGHUP, WAL_CHECKPOINTS, + gettext_noop("Time spent flushing dirty buffers during checkpoint, as fraction of checkpoint interval."), + NULL + }, + &CheckPointSmoothing, + 0.3, 0.0, 0.9, NULL, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL Index: src/backend/utils/misc/postgresql.conf.sample =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/postgresql.conf.sample,v retrieving revision 1.216 diff -c -r1.216 postgresql.conf.sample *** src/backend/utils/misc/postgresql.conf.sample 3 Jun 2007 17:08:15 -0000 1.216 --- src/backend/utils/misc/postgresql.conf.sample 20 Jun 2007 10:03:17 -0000 *************** *** 168,173 **** --- 168,175 ---- #checkpoint_segments = 3 # in logfile segments, min 1, 16MB 
each #checkpoint_timeout = 5min # range 30s-1h + #checkpoint_smoothing = 0.3 # checkpoint duration, range 0.0 - 0.9 + #checkpoint_rate = 512.0KB # min. checkpoint write rate per second #checkpoint_warning = 30s # 0 is off # - Archiving - Index: src/include/access/xlog.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v retrieving revision 1.78 diff -c -r1.78 xlog.h *** src/include/access/xlog.h 30 May 2007 20:12:02 -0000 1.78 --- src/include/access/xlog.h 19 Jun 2007 14:10:07 -0000 *************** *** 171,179 **** extern void StartupXLOG(void); extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); ! extern void CreateCheckPoint(bool shutdown, bool force); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr GetRedoRecPtr(void); extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch); #endif /* XLOG_H */ --- 171,180 ---- extern void StartupXLOG(void); extern void ShutdownXLOG(int code, Datum arg); extern void InitXLOGAccess(void); ! extern void CreateCheckPoint(bool shutdown, bool immediate, bool force); extern void XLogPutNextOid(Oid nextOid); extern XLogRecPtr GetRedoRecPtr(void); + extern XLogRecPtr GetInsertRecPtr(void); extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch); #endif /* XLOG_H */ Index: src/include/postmaster/bgwriter.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/postmaster/bgwriter.h,v retrieving revision 1.9 diff -c -r1.9 bgwriter.h *** src/include/postmaster/bgwriter.h 5 Jan 2007 22:19:57 -0000 1.9 --- src/include/postmaster/bgwriter.h 20 Jun 2007 09:27:20 -0000 *************** *** 20,29 **** extern int BgWriterDelay; extern int CheckPointTimeout; extern int CheckPointWarning; extern void BackgroundWriterMain(void); ! 
extern void RequestCheckpoint(bool waitforit, bool warnontime); extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno); extern void AbsorbFsyncRequests(void); --- 20,33 ---- extern int BgWriterDelay; extern int CheckPointTimeout; extern int CheckPointWarning; + extern double CheckPointSmoothing; extern void BackgroundWriterMain(void); ! extern void RequestImmediateCheckpoint(void); ! extern void RequestLazyCheckpoint(void); ! extern void RequestXLogFillCheckpoint(void); ! extern void CheckpointWriteDelay(double progress); extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno); extern void AbsorbFsyncRequests(void); Index: src/include/storage/buf_internals.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v retrieving revision 1.90 diff -c -r1.90 buf_internals.h *** src/include/storage/buf_internals.h 30 May 2007 20:12:03 -0000 1.90 --- src/include/storage/buf_internals.h 12 Jun 2007 11:42:23 -0000 *************** *** 35,40 **** --- 35,41 ---- #define BM_IO_ERROR (1 << 4) /* previous I/O failed */ #define BM_JUST_DIRTIED (1 << 5) /* dirtied since write started */ #define BM_PIN_COUNT_WAITER (1 << 6) /* have waiter for sole pin */ + #define BM_CHECKPOINT_NEEDED (1 << 7) /* this needs to be written in checkpoint */ typedef bits16 BufFlags; Index: src/include/storage/bufmgr.h =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v retrieving revision 1.104 diff -c -r1.104 bufmgr.h *** src/include/storage/bufmgr.h 30 May 2007 20:12:03 -0000 1.104 --- src/include/storage/bufmgr.h 20 Jun 2007 10:28:43 -0000 *************** *** 36,41 **** --- 36,42 ---- extern double bgwriter_all_percent; extern int bgwriter_lru_maxpages; extern int bgwriter_all_maxpages; + extern int checkpoint_rate; /* in buf_init.c */ extern DLLIMPORT char *BufferBlocks; 
***************
*** 136,142 ****
  extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(void);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
--- 137,143 ----
  extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(bool immediate);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
***************
*** 161,168 ****
  extern void AbortBufferIO(void);
  
  extern void BufmgrCommit(void);
! extern void BufferSync(void);
! extern void BgBufferSync(void);
  
  extern void AtProcExit_LocalBuffers(void);
--- 162,170 ----
  extern void AbortBufferIO(void);
  
  extern void BufmgrCommit(void);
! extern void BufferSync(bool immediate);
! extern void BgAllSweep(void);
! extern void BgLruSweep(void);
  
  extern void AtProcExit_LocalBuffers(void);
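Ps. For anyone puzzled by the modulo comparison in the RequestCheckpoint wait loop (new_done - new_started >= 0), here's a minimal standalone sketch of the idea. The function name is made up for illustration, it's not part of the patch, and unlike the patch it does the subtraction in unsigned arithmetic, so the wraparound is well defined rather than relying on signed overflow behaving sanely:

```c
#include <assert.h>
#include <limits.h>

/*
 * Returns 1 once the free-running 'done' counter has caught up with the
 * 'started' value we sampled earlier, treating both as int counters that
 * may wrap around (the same idea as comparing ckpt_done against
 * ckpt_started). The subtraction wraps cleanly in unsigned arithmetic;
 * reinterpreting the result as signed gives the modulo distance.
 */
int
checkpoint_caught_up(int started, int done)
{
	return (int) ((unsigned int) done - (unsigned int) started) >= 0;
}
```

As the comment in the patch says, this only misbehaves if a backend sleeps through more than INT_MAX completed checkpoints, in which case it just gets another chance INT_MAX checkpoints later.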