Re: Redesigning checkpoint_segments - Mailing list pgsql-hackers
From: Heikki Linnakangas
Subject: Re: Redesigning checkpoint_segments
Msg-id: 5217CF4E.5080709@vmware.com
In response to: Re: Redesigning checkpoint_segments (Peter Eisentraut <peter_e@gmx.net>)
List: pgsql-hackers
On 03.07.2013 21:28, Peter Eisentraut wrote:
> On 6/6/13 4:09 PM, Heikki Linnakangas wrote:
>> Here's a patch implementing that. Docs not updated yet. I did not change
>> the way checkpoint_segments triggers checkpoints - that can be a separate
>> patch. This only decouples the segment preallocation behavior from
>> checkpoint_segments. With the patch, you can set checkpoint_segments
>> really high, without consuming that much disk space all the time.
>
> I don't understand what this patch, by itself, will accomplish in terms
> of the originally stated goals of making checkpoint_segments easier to
> tune, and controlling disk space used. To some degree, it makes both of
> these things worse, because you can no longer use checkpoint_segments to
> control the disk space. Instead, it is replaced by magic.

The patch addressed the third point in my first post:

> A third point is that even if you have 10 GB of disk space reserved
> for WAL, you don't want to actually consume all that 10 GB, if it's
> not required to run the database smoothly. There are several reasons
> for that: backups based on a filesystem-level snapshot are larger
> than necessary, if there are a lot of preallocated WAL segments. And
> in a virtualized or shared system, there might be other VMs or
> applications that could make use of the disk space. On the other
> hand, you don't want to run out of disk space while writing WAL -
> that can lead to a PANIC in the worst case.

> What sort of behavior are you expecting to come out of this? In testing,
> I didn't see much of a difference. Although I'd expect that this would
> actually preallocate fewer segments than the old formula.

For example, say you set checkpoint_segments to 200, and you temporarily generate 100 segments of WAL during an initial data load, while the normal workload generates only 20 segments between checkpoints. Without the patch, you will permanently have about 120 segments in pg_xlog, created by the spike. With the patch, the extra segments will be gradually removed after the data load, down to the level needed by the constant workload. That would be about 50 segments, assuming the default checkpoint_completion_target=0.5.

Here's a bigger patch, which does more. It is based on the ideas in the post I started this thread with, with feedback incorporated from the long discussion. With this patch, WAL disk space usage is controlled by two GUCs:

min_recycle_wal_size
checkpoint_wal_size

These GUCs act as soft minimum and maximum limits on overall WAL size. At each checkpoint, the checkpointer removes enough old WAL files to keep pg_xlog usage below checkpoint_wal_size, and recycles enough new WAL files to reach min_recycle_wal_size. Between those limits, there is a self-tuning mechanism to recycle just enough WAL files to get to the end of the next checkpoint without running out of preallocated WAL files. To estimate how many files are needed for that, a moving average of the amount of WAL generated between checkpoints is maintained. The moving average is updated with "fast-rise slow-decline" behavior, to cater for peak rather than true average use to some extent.

As today, checkpoints are triggered based on time or WAL usage, whichever comes first. WAL-based checkpoints are triggered by the good old formula: CheckPointSegments = (checkpoint_wal_size / (2.0 + checkpoint_completion_target)) / 16MB. CheckPointSegments controls that as before, but it is now an internal variable derived from checkpoint_wal_size, not visible to users.
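To make those two pieces concrete, here is a minimal C sketch of deriving CheckPointSegments from checkpoint_wal_size and of a "fast-rise slow-decline" moving average. The function names and the 0.9/0.1 smoothing weights are illustrative assumptions, not the actual code in the patch.

#include <stdio.h>

#define XLOG_SEG_SIZE (16 * 1024 * 1024)    /* 16 MB WAL segment */

/*
 * Derive the internal CheckPointSegments from checkpoint_wal_size (in
 * bytes), using the formula above:
 *   CheckPointSegments = (checkpoint_wal_size / (2.0 + completion_target)) / 16MB
 */
static int
calc_checkpoint_segments(double checkpoint_wal_size, double completion_target)
{
    int segs;

    segs = (int) ((checkpoint_wal_size / (2.0 + completion_target)) / XLOG_SEG_SIZE);
    return (segs > 1) ? segs : 1;
}

/*
 * "Fast-rise slow-decline" moving average of the WAL written between
 * checkpoints: jump straight to a new peak, decay slowly otherwise.
 */
static double
update_wal_usage_estimate(double prev_estimate, double wal_since_last_ckpt)
{
    if (wal_since_last_ckpt > prev_estimate)
        return wal_since_last_ckpt;     /* rise immediately to the peak */
    return 0.9 * prev_estimate + 0.1 * wal_since_last_ckpt;    /* decline slowly */
}

int
main(void)
{
    double checkpoint_wal_size = 1024.0 * 1024 * 1024;    /* 1 GB */
    double est;

    printf("CheckPointSegments = %d\n",
           calc_checkpoint_segments(checkpoint_wal_size, 0.5));

    est = 20.0 * XLOG_SEG_SIZE;     /* steady state: ~20 segments per cycle */
    est = update_wal_usage_estimate(est, 100.0 * XLOG_SEG_SIZE);   /* data load */
    est = update_wal_usage_estimate(est, 20.0 * XLOG_SEG_SIZE);    /* back to normal */
    printf("estimated need: ~%.0f segments\n", est / XLOG_SEG_SIZE);
    return 0;
}

With checkpoint_wal_size = 1 GB and the default checkpoint_completion_target = 0.5, that works out to 1024 MB / 2.5 / 16 MB, i.e. about 25 segments between WAL-triggered checkpoints, and the estimate only gradually settles back toward the steady-state 20 segments after the data-load spike.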
These settings are fairly intuitive for a DBA to tune. You begin by figuring out how much disk space you can afford to spend on WAL, and set checkpoint_wal_size to that (with some safety margin, of course). Then you set checkpoint_timeout based on how long you're willing to wait for recovery to finish. Finally, if you have infrequent batch jobs that need a lot more WAL than the system otherwise needs, you can set min_recycle_wal_size to keep enough WAL preallocated for those spikes.

You can also set min_recycle_wal_size = checkpoint_wal_size, which gets you the same behavior as without the patch, except that it's more intuitive to set it in terms of "MB of WAL space required" instead of "# of segments between checkpoints".

Does that make sense? I'd love to hear feedback from people setting up production databases on how they would like to tune these things.

The reason for the auto-tuning between the min and the max is to be able to set reasonable defaults, e.g. for embedded systems that don't have a DBA to do tuning. Currently, it's very difficult to come up with a reasonable default value for checkpoint_segments that works well for a wide range of systems. The PostgreSQL default of 3 is far too low for most systems. On the other hand, if you set it to, say, 20, that's a lot of wasted space for a small database that's not updated much. With this patch, you can set checkpoint_wal_size=1GB, and if the database ends up actually needing only 100 MB of WAL, it will only use that much and not waste 900 MB on useless preallocated WAL files.

These GUCs are still soft limits. If the system is busy enough that the checkpointer can't reach its target, it can exceed checkpoint_wal_size. Making it a hard limit is a much bigger task than I'm willing to tackle right now.

- Heikki
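To illustrate the soft min/max behavior described above, here is a rough C sketch of the per-checkpoint decision of how many segments to keep preallocated: the self-tuned estimate clamped between the two GUCs. The function name and expressing the GUCs as segment counts are simplifying assumptions for the example, not code from the patch.

#include <stdio.h>

/*
 * Per-checkpoint decision: how many WAL segments to keep preallocated
 * (recycled) for the next cycle.  Everything is expressed in 16 MB
 * segments here; the real GUCs are sizes, not segment counts.
 */
static int
segments_to_keep(int estimated_need,    /* from the moving average */
                 int min_recycle_segs,  /* min_recycle_wal_size, in segments */
                 int max_wal_segs)      /* checkpoint_wal_size, in segments */
{
    int keep = estimated_need;

    if (keep < min_recycle_segs)
        keep = min_recycle_segs;    /* always keep at least the soft minimum */
    if (keep > max_wal_segs)
        keep = max_wal_segs;        /* try to stay below the soft maximum */
    return keep;
}

int
main(void)
{
    /* Self-tuning between the limits: estimate 50, min 5, max 64 -> 50. */
    printf("auto-tuned: %d\n", segments_to_keep(50, 5, 64));

    /* min == max pins preallocation at a fixed size, as without the patch. */
    printf("min == max: %d\n", segments_to_keep(50, 64, 64));
    return 0;
}

Setting min_recycle_wal_size equal to checkpoint_wal_size collapses the clamp to a fixed size, which is the "same behavior as without the patch" case above; and because the clamp only governs what the checkpointer recycles, not the WAL the server must write in the meantime, checkpoint_wal_size remains a soft limit.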