Controlling Load Distributed Checkpoints - Mailing list pgsql-hackers

From:    Heikki Linnakangas
Subject: Controlling Load Distributed Checkpoints
Msg-id:  4666B450.8070506@enterprisedb.com
List:    pgsql-hackers
I'm again looking at the way the GUC variables work in the load distributed checkpoints patch. We've discussed them a lot already, but I don't think they're quite right yet.

Write phase
-----------

I like the way the write phase is controlled in general. Writes are throttled so that we spend the specified percentage of the checkpoint interval doing the writes. But we always write at a specified minimum rate to avoid spreading out the writes unnecessarily when there's little work to do. The original patch uses bgwriter_all_max_pages to set the minimum rate. I think we should have a separate variable, checkpoint_write_min_rate, in KB/s, instead.

Nap phase
---------

This is trickier. The purpose of the sleep between writes and fsyncs is to give the OS a chance to flush the pages to disk at its own pace, hopefully limiting the effect on concurrent activity. The sleep shouldn't last too long, because concurrent activity keeps dirtying and writing more pages, and we might end up fsyncing more than necessary, which is bad for performance. The optimal delay depends on many factors, but I believe it's somewhere between 0 and 30 seconds on any reasonable system.

In the current patch, the duration of the sleep between the write and sync phases is controlled as a percentage of the checkpoint interval. Given that the optimal delay is in the range of seconds, and checkpoint_timeout can be up to 60 minutes, the useful values of that percentage would be very small, like 0.5% or even less. Furthermore, the optimal value doesn't depend much on the checkpoint interval; it depends more on your OS and memory configuration. We should therefore give the delay as a number of seconds instead of as a percentage of the checkpoint interval.

Sync phase
----------

This is also tricky. As with the nap phase, we don't want to spend too much time fsyncing, because concurrent activity will write more dirty pages and we might just end up doing more work. And we don't know how much work an fsync performs. The patch uses the file size as a measure of that, but as we discussed, that doesn't necessarily have anything to do with reality: fsyncing a 1 GB file with one dirty block isn't any more expensive than fsyncing a small file with a single dirty block.

Another problem is the granularity of an fsync. If we fsync a 1 GB file that's full of dirty pages, we can't limit the effect on other activity. The best we can do is to sleep between fsyncs, but sleeping more than a few seconds is hardly going to be useful, no matter how bad an I/O storm each fsync causes.

Because of the above, I'm thinking we should ditch the checkpoint_sync_percentage variable, in favor of:

checkpoint_fsync_period  # duration of the fsync phase, in seconds
checkpoint_fsync_delay   # max. sleep between fsyncs, in milliseconds

In all phases, the normal bgwriter activities are performed: lru-cleaning, and switching xlog segments if archive_timeout expires. If a new checkpoint request arrives while the previous one is still in progress, we skip all the delays and finish the previous checkpoint as soon as possible.
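To make the proposed pacing concrete, here is a rough standalone sketch. This is not code from the patch: the dirty-buffer count, the number of files to fsync, and the write/fsync stand-ins are made up for illustration, and the GUC values are the suggested defaults from the summary below.

/*
 * Standalone sketch of the proposed checkpoint pacing -- NOT code from the
 * patch.  GUC values are the suggested defaults; the dirty-buffer count, the
 * number of files to fsync, and the write/fsync stand-ins are made up.
 */
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ_KB 8                                   /* 8 kB blocks */

static double checkpoint_timeout = 300.0;             /* seconds */
static double checkpoint_write_percent = 50.0;        /* % of interval */
static double checkpoint_write_min_rate = 1000.0;     /* kB/s */
static double checkpoint_nap_duration = 2.0;          /* seconds */
static double checkpoint_fsync_period = 30.0;         /* seconds */
static double checkpoint_fsync_delay = 500.0;         /* milliseconds */

static void write_buffer(int buf)  { (void) buf; }    /* stand-in for one buffer write */
static void fsync_file(int file)   { (void) file; }   /* stand-in for one file fsync */

int
main(void)
{
    int         num_dirty = 1000;   /* dirty buffers at checkpoint start (assumed) */
    int         num_files = 20;     /* files needing fsync (assumed) */

    /*
     * Write phase: spread the writes over checkpoint_write_percent of the
     * checkpoint interval, but never drop below checkpoint_write_min_rate.
     */
    double      window = checkpoint_timeout * checkpoint_write_percent / 100.0;
    double      needed_rate = num_dirty * BLCKSZ_KB / window;      /* kB/s */
    double      rate = needed_rate > checkpoint_write_min_rate ?
                       needed_rate : checkpoint_write_min_rate;
    useconds_t  write_delay = (useconds_t) (BLCKSZ_KB / rate * 1e6);

    for (int buf = 0; buf < num_dirty; buf++)
    {
        write_buffer(buf);
        usleep(write_delay);        /* lru-cleaning etc. would also run here */
    }

    /* Nap phase: let the OS flush at its own pace for a fixed time. */
    sleep((unsigned int) checkpoint_nap_duration);

    /*
     * Sync phase: spread the fsyncs over checkpoint_fsync_period, but sleep
     * at most checkpoint_fsync_delay between any two of them.
     */
    double      fsync_sleep = checkpoint_fsync_period / num_files; /* seconds */

    if (fsync_sleep > checkpoint_fsync_delay / 1000.0)
        fsync_sleep = checkpoint_fsync_delay / 1000.0;

    for (int file = 0; file < num_files; file++)
    {
        fsync_file(file);
        usleep((useconds_t) (fsync_sleep * 1e6));
    }

    printf("write phase: %.0f kB/s over %.0f s; nap %.0f s; %d fsyncs, %.1f s apart\n",
           rate, window, checkpoint_nap_duration, num_files, fsync_sleep);
    return 0;
}

Note how the minimum write rate takes over when there are few dirty buffers: with only 1000 buffers to write, the rate needed to fill the 150-second window would be a pointless 53 kB/s, so the floor of 1000 kB/s applies instead and the write phase simply finishes early.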
GUC summary and suggested default values
----------------------------------------

checkpoint_write_percent = 50     # % of checkpoint interval to spread out writes
checkpoint_write_min_rate = 1000  # minimum I/O rate to write dirty buffers at checkpoint (KB/s)
checkpoint_nap_duration = 2       # delay between write and sync phase, in seconds
checkpoint_fsync_period = 30      # duration of the sync phase, in seconds
checkpoint_fsync_delay = 500      # max. delay between fsyncs, in milliseconds
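As a rough sanity check of these defaults, assume the stock 5-minute checkpoint_timeout and, say, 160 MB of dirty buffers at checkpoint time (both numbers are just for illustration):

    writes spread over 50% of 300 s = 150 s, i.e. 160 MB / 150 s ~ 1100 kB/s,
        just above the 1000 kB/s floor
    nap: 2 s
    sync phase: at most 30 s, with no more than 500 ms between fsyncs

That's roughly 150 + 2 + 30 = 182 s of checkpoint activity, comfortably inside the 300 s interval.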
I don't like adding that many GUC variables, but I don't really see a way to tune them automatically. Maybe we could just hard-code the last one; it doesn't seem that critical, but that still leaves us with four variables.

Thoughts?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com