Thread: Too frequent checkpoints ?
I was looking at the vacuum/visibility bug that Jeff Janes reported and brought up the server with the data directory he has shared. With his configuration, # - Checkpoints - checkpoint_segments = 1 # in logfile segments, min 1, 16MB each checkpoint_timeout = 30s # range 30s-1h # - Archiving - archive_mode = on # allows archiving to be done # (change requires restart) archive_command = 'echo archive_command %p %f `date`' # command to use to archive a logfile segment # placeholders: %p = path of file to archive # %f = file name only # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f' archive_timeout = 30 # force a logfile segment switch after this # number of seconds; 0 disables I see the following messages coming on the console: 3092 2013-02-08 02:29:01.239 PST:LOG: checkpoints are occurring too frequently (0 seconds apart) 3092 2013-02-08 02:29:01.239 PST:HINT: Consider increasing the configuration parameter "checkpoint_segments". 3092 2013-02-08 02:29:01.239 PST:LOG: checkpoint starting: xlog 3092 2013-02-08 02:29:01.246 PST:LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.007 s; sync files=0, longest=0.000 s, average=0.000 s archive_command pg_xlog/0000000100000057000000F5 0000000100000057000000F5 Fri Feb 8 15:59:01 IST 2013 3092 2013-02-08 02:29:31.278 PST:LOG: checkpoints are occurring too frequently (0 seconds apart) 3092 2013-02-08 02:29:31.278 PST:HINT: Consider increasing the configuration parameter "checkpoint_segments". 3092 2013-02-08 02:29:31.278 PST:LOG: checkpoint starting: xlog 3092 2013-02-08 02:29:31.285 PST:LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.006 s; sync files=0, longest=0.000 s, average=0.000 s archive_command pg_xlog/0000000100000057000000F6 0000000100000057000000F6 Fri Feb 8 15:59:31 IST 2013 3092 2013-02-08 02:30:01.316 PST:LOG: checkpoints are occurring too frequently (0 seconds apart) 3092 2013-02-08 02:30:01.316 PST:HINT: Consider increasing the configuration parameter "checkpoint_segments". 3092 2013-02-08 02:30:01.316 PST:LOG: checkpoint starting: xlog archive_command pg_xlog/0000000100000057000000F7 0000000100000057000000F7 Fri Feb 8 16:00:01 IST 2013 3092 2013-02-08 02:30:01.323 PST:LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.006 s; sync files=0, longest=0.000 s, average=0.000 s 3092 2013-02-08 02:30:31.327 PST:LOG: checkpoints are occurring too frequently (0 seconds apart) 3092 2013-02-08 02:30:31.327 PST:HINT: Consider increasing the configuration parameter "checkpoint_segments". 3092 2013-02-08 02:30:31.327 PST:LOG: checkpoint starting: xlog archive_command pg_xlog/0000000100000057000000F8 0000000100000057000000F8 Fri Feb 8 16:00:31 IST 2013 3092 2013-02-08 02:30:31.334 PST:LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.007 s; sync files=0, longest=0.000 s, average=0.000 s I wonder if this is all expected. The database is getting ZERO activity. There are no connections open at this time. The checkpoints are happening at every 30 seconds and new WAL files are being created, AFAIK because the old ones are getting archived. Can't we be smart about not archiving new files if we did not generate any new WAL since the last archive ? Also, the log says "checkpoints are occurring too frequently (0 seconds apart)". But that looks wrong too. Checkpoints are really happening at 30 seconds apart and not 0 as the log message claims. Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
On 02/08/2013 02:37 AM, Pavan Deolasee wrote: > I wonder if this is all expected. The database is getting ZERO > activity. There are no connections open at this time. The checkpoints > are happening at every 30 seconds and new WAL files are being created, > AFAIK because the old ones are getting archived. Can't we be smart > about not archiving new files if we did not generate any new WAL since > the last archive ? It is because you have it set up so that the longest time you can go without checkpoints is 30 seconds. So no matter what, if you have 1000 checkpoint segments you are still going to checkpoint ever 30 seconds. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC @cmdpromptinc - 509-416-6579
On Fri, Feb 8, 2013 at 7:37 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > I wonder if this is all expected. The database is getting ZERO > activity. There are no connections open at this time. The checkpoints > are happening at every 30 seconds and new WAL files are being created, > AFAIK because the old ones are getting archived. Can't we be smart > about not archiving new files if we did not generate any new WAL since > the last archive ? AFAIR this problem has been discussed several times before, but has not been fixed yet. BTW, the cause of the problem is that the following sequences happens. 1. archive_timeout switches WAL file because checkpoint WAL record has has been written since last switch 2. Checkpoint occurs and writes its WAL record because new WAL file was generated since last checkpoint, i.e., checkpointthinks that there is at least some activity since last checkpoint. 3. Back to #1 after archive_timeout passed through. Regards, -- Fujii Masao
On Fri, Feb 8, 2013 at 2:37 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > I was looking at the vacuum/visibility bug that Jeff Janes reported > and brought up the server with the data directory he has shared. With > his configuration, > .... > 3092 2013-02-08 02:30:31.327 PST:LOG: checkpoints are occurring too > frequently (0 seconds apart) > 3092 2013-02-08 02:30:31.327 PST:HINT: Consider increasing the > configuration parameter "checkpoint_segments". ... > > I wonder if this is all expected. The database is getting ZERO > activity. There are no connections open at this time. The checkpoints > are happening at every 30 seconds and new WAL files are being created, > AFAIK because the old ones are getting archived. Can't we be smart > about not archiving new files if we did not generate any new WAL since > the last archive ? Yeah, this is a known issue. We skip two consecutive checkpoints (forced by checkpoint_timeout) with no intervening activity, and we skip two consecutive log switches (forced by archive_timeout) with no intervening activity, but when you put them both together neither one gets skipped because each one counts as "intervening activity" from the other ones perspective. I think that at one point this was considered desirable, as the arrival of log files in the archive, even if they had no real content, was viewed as proof of life. You can set up archive_command to send you an email if an archive fails, but the same thing that prevents the archive might prevent the email from arriving. > Also, the log says "checkpoints are occurring too frequently (0 > seconds apart)". But that looks wrong too. Checkpoints are really > happening at 30 seconds apart and not 0 as the log message claims. I noticed that as well, and I think it might be a regression. I'll run a bisection on it if I get a chance. Cheers, Jeff
> > >> Also, the log says "checkpoints are occurring too frequently (0 >> seconds apart)". But that looks wrong too. Checkpoints are really >> happening at 30 seconds apart and not 0 as the log message claims. > > I noticed that as well, and I think it might be a regression. I'll > run a bisection on it if I get a chance. > Looking further into it, seems there are two checkpoints happening almost immediately. The first happens because of timeoutand the second happens immediately because of checkpoint segments (set to 1 in this case) Given that no one has bothered to fix these issues, may be they are non issues after all. Thanks, Pavan > Cheers, > > Jeff
On Fri, Feb 8, 2013 at 10:25 PM, Fujii Masao <masao.fujii@gmail.com> wrote: .> > BTW, the cause of the problem is that the following sequences happens. > > 1. archive_timeout switches WAL file because checkpoint WAL record has > has been written since last switch Thank you for explaining that. I also noticed that the WAL file switch happens after archive_timeout seconds irrespective of whether archive_mode is turned ON or not. This happens because we don't check if XLogArchivingActive() in CheckArchiveTimeout() function. It looks wrong to me. But if it's that way for a good reason, shouldn't we at least document this behaviour ? Thanks, Pavan -- Pavan Deolasee http://www.linkedin.com/in/pavandeolasee
On Mon, Feb 11, 2013 at 5:46 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > On Fri, Feb 8, 2013 at 10:25 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > .> >> BTW, the cause of the problem is that the following sequences happens. >> >> 1. archive_timeout switches WAL file because checkpoint WAL record has >> has been written since last switch > > Thank you for explaining that. I also noticed that the WAL file switch > happens after archive_timeout seconds irrespective of whether > archive_mode is turned ON or not. This happens because we don't check > if XLogArchivingActive() in CheckArchiveTimeout() function. It looks > wrong to me. +1 to fix this. I've not heard the use case where archive_timeout needs to be used even in not archive mode... Regards, -- Fujii Masao