Thread: Too frequent checkpoints ?

Too frequent checkpoints ?

From

Pavan Deolasee

Date:

08 February 2013, 10:38:18

I was looking at the vacuum/visibility bug that Jeff Janes reported
and brought up the server with the data directory he has shared. With
his configuration,

# - Checkpoints -

checkpoint_segments = 1     # in logfile segments, min 1, 16MB each
checkpoint_timeout = 30s        # range 30s-1h

# - Archiving -

archive_mode = on       # allows archiving to be done               # (change requires restart)
archive_command = 'echo archive_command %p %f `date`'       # command
to use to archive a logfile segment               # placeholders: %p = path of file to archive               #
    %f = file name only               # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p
 
/mnt/server/archivedir/%f'
archive_timeout = 30        # force a logfile segment switch after this               # number of seconds; 0 disables


I see the following messages coming on the console:

3092  2013-02-08 02:29:01.239 PST:LOG:  checkpoints are occurring too
frequently (0 seconds apart)
3092  2013-02-08 02:29:01.239 PST:HINT:  Consider increasing the
configuration parameter "checkpoint_segments".
3092  2013-02-08 02:29:01.239 PST:LOG:  checkpoint starting: xlog
3092  2013-02-08 02:29:01.246 PST:LOG:  checkpoint complete: wrote 0
buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1
recycled; write=0.000 s, sync=0.000 s, total=0.007 s; sync files=0,
longest=0.000 s, average=0.000 s
archive_command pg_xlog/0000000100000057000000F5
0000000100000057000000F5 Fri Feb 8 15:59:01 IST 2013
3092  2013-02-08 02:29:31.278 PST:LOG:  checkpoints are occurring too
frequently (0 seconds apart)
3092  2013-02-08 02:29:31.278 PST:HINT:  Consider increasing the
configuration parameter "checkpoint_segments".
3092  2013-02-08 02:29:31.278 PST:LOG:  checkpoint starting: xlog
3092  2013-02-08 02:29:31.285 PST:LOG:  checkpoint complete: wrote 0
buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1
recycled; write=0.000 s, sync=0.000 s, total=0.006 s; sync files=0,
longest=0.000 s, average=0.000 s
archive_command pg_xlog/0000000100000057000000F6
0000000100000057000000F6 Fri Feb 8 15:59:31 IST 2013
3092  2013-02-08 02:30:01.316 PST:LOG:  checkpoints are occurring too
frequently (0 seconds apart)
3092  2013-02-08 02:30:01.316 PST:HINT:  Consider increasing the
configuration parameter "checkpoint_segments".
3092  2013-02-08 02:30:01.316 PST:LOG:  checkpoint starting: xlog
archive_command pg_xlog/0000000100000057000000F7
0000000100000057000000F7 Fri Feb 8 16:00:01 IST 2013
3092  2013-02-08 02:30:01.323 PST:LOG:  checkpoint complete: wrote 0
buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1
recycled; write=0.000 s, sync=0.000 s, total=0.006 s; sync files=0,
longest=0.000 s, average=0.000 s
3092  2013-02-08 02:30:31.327 PST:LOG:  checkpoints are occurring too
frequently (0 seconds apart)
3092  2013-02-08 02:30:31.327 PST:HINT:  Consider increasing the
configuration parameter "checkpoint_segments".
3092  2013-02-08 02:30:31.327 PST:LOG:  checkpoint starting: xlog
archive_command pg_xlog/0000000100000057000000F8
0000000100000057000000F8 Fri Feb 8 16:00:31 IST 2013
3092  2013-02-08 02:30:31.334 PST:LOG:  checkpoint complete: wrote 0
buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1
recycled; write=0.000 s, sync=0.000 s, total=0.007 s; sync files=0,
longest=0.000 s, average=0.000 s

I wonder if this is all expected. The database is getting ZERO
activity. There are no connections open at this time. The checkpoints
are happening at every 30 seconds and new WAL files are being created,
AFAIK because the old ones are getting archived. Can't we be smart
about not archiving new files if we did not generate any new WAL since
the last archive ?

Also, the log says "checkpoints are occurring too frequently (0
seconds apart)". But that looks wrong too. Checkpoints are really
happening at 30 seconds apart and not 0 as the log message claims.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: Too frequent checkpoints ?

From

"Joshua D. Drake"

Date:

08 February 2013, 16:12:20

On 02/08/2013 02:37 AM, Pavan Deolasee wrote:

> I wonder if this is all expected. The database is getting ZERO
> activity. There are no connections open at this time. The checkpoints
> are happening at every 30 seconds and new WAL files are being created,
> AFAIK because the old ones are getting archived. Can't we be smart
> about not archiving new files if we did not generate any new WAL since
> the last archive ?

It is because you have it set up so that the longest time you can go 
without checkpoints is 30 seconds. So no matter what, if you have 1000 
checkpoint segments you are still going to checkpoint ever 30 seconds.

JD

-- 
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC
@cmdpromptinc - 509-416-6579

Re: Too frequent checkpoints ?

From

Fujii Masao

Date:

08 February 2013, 16:55:36

On Fri, Feb 8, 2013 at 7:37 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> I wonder if this is all expected. The database is getting ZERO
> activity. There are no connections open at this time. The checkpoints
> are happening at every 30 seconds and new WAL files are being created,
> AFAIK because the old ones are getting archived. Can't we be smart
> about not archiving new files if we did not generate any new WAL since
> the last archive ?

AFAIR this problem has been discussed several times before, but has not
been fixed yet.

BTW, the cause of the problem is that the following sequences happens.

1. archive_timeout switches WAL file because checkpoint WAL record has   has been written since last switch
2. Checkpoint occurs and writes its WAL record because new WAL file was   generated since last checkpoint, i.e.,
checkpointthinks that

there is at least   some activity since last checkpoint.
3. Back to #1 after archive_timeout passed through.

Regards,

-- 
Fujii Masao

Re: Too frequent checkpoints ?

From

Jeff Janes

Date:

08 February 2013, 17:02:00

On Fri, Feb 8, 2013 at 2:37 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> I was looking at the vacuum/visibility bug that Jeff Janes reported
> and brought up the server with the data directory he has shared. With
> his configuration,
>
....
> 3092  2013-02-08 02:30:31.327 PST:LOG:  checkpoints are occurring too
> frequently (0 seconds apart)
> 3092  2013-02-08 02:30:31.327 PST:HINT:  Consider increasing the
> configuration parameter "checkpoint_segments".
...
>
> I wonder if this is all expected. The database is getting ZERO
> activity. There are no connections open at this time. The checkpoints
> are happening at every 30 seconds and new WAL files are being created,
> AFAIK because the old ones are getting archived. Can't we be smart
> about not archiving new files if we did not generate any new WAL since
> the last archive ?

Yeah, this is a known issue.  We skip two consecutive checkpoints
(forced by checkpoint_timeout) with no intervening activity, and we
skip two consecutive log switches (forced by archive_timeout) with no
intervening activity, but when you put them both together neither one
gets skipped because each one counts as "intervening activity" from
the other ones perspective.  I think that at one point this was
considered desirable, as the arrival of log files in the archive, even
if they had no real content, was viewed as proof of life.  You can set
up archive_command to send you an email if an archive fails, but the
same thing that prevents the archive might prevent the email from
arriving.

> Also, the log says "checkpoints are occurring too frequently (0
> seconds apart)". But that looks wrong too. Checkpoints are really
> happening at 30 seconds apart and not 0 as the log message claims.

I noticed that as well, and I think it might be a regression.  I'll
run a bisection on it if I get a chance.

Cheers,

Jeff

Re: Too frequent checkpoints ?

From

Pavan Deolasee

Date:

09 February 2013, 05:28:18

>
>
>> Also, the log says "checkpoints are occurring too frequently (0
>> seconds apart)". But that looks wrong too. Checkpoints are really
>> happening at 30 seconds apart and not 0 as the log message claims.
>
> I noticed that as well, and I think it might be a regression.  I'll
> run a bisection on it if I get a chance.
>

Looking further into it, seems there are two checkpoints happening almost immediately. The first happens because of
timeoutand the second happens immediately because of checkpoint segments (set to 1 in this case) 

Given that no one has bothered to fix these issues, may be they are non issues after all.

Thanks,
Pavan

> Cheers,
>
> Jeff

Re: Too frequent checkpoints ?

From

Pavan Deolasee

Date:

11 February 2013, 08:46:31

On Fri, Feb 8, 2013 at 10:25 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
.>
> BTW, the cause of the problem is that the following sequences happens.
>
> 1. archive_timeout switches WAL file because checkpoint WAL record has
>     has been written since last switch

Thank you for explaining that. I also noticed that the WAL file switch
happens after archive_timeout seconds irrespective of whether
archive_mode is turned ON or not. This happens because we don't check
if XLogArchivingActive() in CheckArchiveTimeout() function. It looks
wrong to me. But if it's that way for a good reason, shouldn't we at
least document this behaviour ?

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: Too frequent checkpoints ?

From

Fujii Masao

Date:

14 February 2013, 18:19:02

On Mon, Feb 11, 2013 at 5:46 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> On Fri, Feb 8, 2013 at 10:25 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> .>
>> BTW, the cause of the problem is that the following sequences happens.
>>
>> 1. archive_timeout switches WAL file because checkpoint WAL record has
>>     has been written since last switch
>
> Thank you for explaining that. I also noticed that the WAL file switch
> happens after archive_timeout seconds irrespective of whether
> archive_mode is turned ON or not. This happens because we don't check
> if XLogArchivingActive() in CheckArchiveTimeout() function. It looks
> wrong to me.

+1 to fix this. I've not heard the use case where archive_timeout needs to
be used even in not archive mode...

Regards,

-- 
Fujii Masao