Thread: Avoiding adjacent checkpoint records
In commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724, Simon modified the rule for when to skip checkpoints on the grounds that not enough activity has happened since the last one. However, that commit left the comment block about it in a nonsensical state:

 * If this isn't a shutdown or forced checkpoint, and we have not switched
 * to the next WAL file since the start of the last checkpoint, skip the
 * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
 * when the system is idle.  That wastes log space, and more importantly it
 * exposes us to possible loss of both current and previous checkpoint
 * records if the machine crashes just as we're writing the update.
 * (Perhaps it'd make even more sense to checkpoint only when the previous
 * checkpoint record is in a different xlog page?)

The new code entirely fails to prevent writing adjacent checkpoint records, because what it checks is the distance from the previous checkpoint's REDO pointer, not the previous checkpoint record itself. So the concern raised in the last two sentences of the comment isn't being addressed at all: if we corrupt the current page of WAL while trying to write the new checkpoint record, we risk losing the previous checkpoint record too. Should the system then crash, there is enough logic to back up to the second previous checkpoint record and roll forward from there --- but since we've lost the last checkpoint and up to one page's worth of preceding WAL records, there is no guarantee that we'll manage to reach a database state that is consistent with data already flushed out to disk during the last checkpoint.

I started to make a quick patch to add an additional check on the location of the previous checkpoint record, so that we'd skip a new checkpoint unless we'd moved to a new page of WAL. However, if we really want to take this risk seriously, ISTM that allowing adjacent checkpoint records is bad all the time, not only for non-forced checkpoints. What I'm now thinking is that a more appropriate way to address that risk is to force a skip to a new page (not segment) of WAL after we write a checkpoint record. This won't waste much WAL space in view of the new rule to avoid checkpoints more than once per segment on average.

On the other hand, you could argue that this concern is entirely hypothetical, and we're already basically assuming that once a WAL record has been flushed to disk it's safe there even if we're still writing more stuff into the same page. If we don't want to assume that, then any XLogFlush would have to include skip-to-new-page, and that's not going to be cheap.

Thoughts?

			regards, tom lane
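For reference, a minimal sketch contrasting the two tests at issue. This is not the committed code: XLogRecPtr is treated here as a flat byte position for readability (the 9.2 struct form needs the XLByte* comparison macros), and the helper name is invented; XLogSegSize and XLOG_BLCKSZ are the real symbols.

    /* Sketch only: the segment-distance test that commit 18fb9d8d actually
     * performs, versus the page-level test the old comment asks for. */
    static bool
    checkpoint_looks_redundant(XLogRecPtr curInsert,
                               XLogRecPtr prevRedo,     /* prev ckpt's REDO ptr */
                               XLogRecPtr prevCkptRec)  /* prev ckpt record itself */
    {
        /* What the commit checks: are we still in the same WAL segment as
         * the previous checkpoint's REDO pointer? */
        if (curInsert / XLogSegSize == prevRedo / XLogSegSize)
            return true;

        /* What the comment worries about: would the new checkpoint record
         * land on the same 8K xlog page as the previous checkpoint record,
         * so that a single torn page could destroy both? */
        if (curInsert / XLOG_BLCKSZ == prevCkptRec / XLOG_BLCKSZ)
            return true;

        return false;
    }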
On Wed, Jun 6, 2012 at 3:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724, Simon modified the
> rule for when to skip checkpoints on the grounds that not enough
> activity has happened since the last one. However, that commit left the
> comment block about it in a nonsensical state:
>
> * If this isn't a shutdown or forced checkpoint, and we have not switched
> * to the next WAL file since the start of the last checkpoint, skip the
> * checkpoint. The idea here is to avoid inserting duplicate checkpoints
> * when the system is idle. That wastes log space, and more importantly it
> * exposes us to possible loss of both current and previous checkpoint
> * records if the machine crashes just as we're writing the update.
> * (Perhaps it'd make even more sense to checkpoint only when the previous
> * checkpoint record is in a different xlog page?)

IIRC, the inspiration for the change was that we were getting a never-ending series of checkpoints even when nothing was happening at all:

http://archives.postgresql.org/pgsql-hackers/2011-10/msg00207.php

I felt (and still feel) that this was misguided. I understand why people don't want a completely idle system to checkpoint; but I can't recall a single complaint about a checkpoint on a system with low but not zero activity. Checkpoints are pretty cheap when there isn't much data to flush.

The flip side is that I know of real customers who would have suffered real data loss had this code been present in the server version they were using. Checkpoints are the *only* mechanism by which SLRU pages get flushed to disk on a mostly-idle system. That means if something happens to your pg_xlog directory, and you haven't had a checkpoint, you're screwed. Letting data sit in memory for hours, days, weeks, or months because we haven't filled up a WAL segment is just terrible. The first user who loses a transaction that was committed a month ago after running pg_resetxlog is going to hit the ceiling, and I don't blame them. It wouldn't be so bad if we had background writing for SLRU pages, because then you could figure that the OS would eventually have a chance to write the page out... but we don't. It'll just sit there in shared memory, dirty, forever. CLOG data in particular is FAR too precious to take that kind of chance with.

I don't think there's much sense in doing push-ups to avoid having the current and previous checkpoint records on the same XLOG page. If the system is so nearly idle that you get two checkpoint records in the same 8k block, and that block gets corrupted, it is extremely likely that you can run pg_resetxlog and be OK. If not, that means there were more XLOG records after the corrupted page, and you're not going to be able to replay those anyway, whether the checkpoint records are in the same 8k block or not. So I'm not seeing how your proposal is buying us any additional measure of safety that we don't already have. Of course, if we had a way to skip over the corrupted portion of WAL and pick up replaying records after that, that would be very useful (even though you'd have to view the resulting database with extreme suspicion), but without that I don't see that finding the previous checkpoint record is doing much for us. Either way, you've potentially lost changes that were covered by WAL records emitted after the most recent checkpoint. The only thing we can really do is make sure that there aren't likely to be too many more unreplayed records after the last checkpoint, which goes back to my previous complaint.

As a side point, another reason not to make the checkpoint record consume the rest of the page is that, for scalability reasons, we want to minimize the amount of calculation that has to be done while holding WALInsertLock, and have as much of the computation as possible get done before acquiring it. XLogInsert() is already way more complicated than anything anyone ought to be doing while holding a heavily-contended LWLock.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jun 6, 2012 at 3:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724, Simon modified the
>> rule for when to skip checkpoints on the grounds that not enough
>> activity has happened since the last one.

> IIRC, the inspiration for the change was that we were getting a
> never-ending series of checkpoints even when nothing was happening at
> all:
> http://archives.postgresql.org/pgsql-hackers/2011-10/msg00207.php

Right.

> I felt (and still feel) that this was misguided.

Looking at it again, I'm inclined to agree. The behavior was entirely correct up until somebody decided to emit a continuing stream of XLOG_RUNNING_XACTS WAL records even when the system is idle. Why did we not fix it by fixing that?

> I don't think there's much sense in doing push-ups to avoid having the
> current and previous checkpoint records on the same XLOG page.

Perhaps not. I only got exercised about it after noting that the commit hadn't updated the comment about it to match what the code is doing. If we end up reverting that commit and doing something else to fix the useless-checkpoint problem, I'm happy to let the subject rest, at least until we get some evidence of a real problem in the area.

			regards, tom lane
On Wed, Jun 6, 2012 at 4:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I felt (and still feel) that this was misguided.
>
> Looking at it again, I'm inclined to agree. The behavior was entirely
> correct up until somebody decided to emit a continuing stream of
> XLOG_RUNNING_XACTS WAL records even when the system is idle. Why did
> we not fix it by fixing that?

That's exactly what I think we should have done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jun 6, 2012 at 4:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I felt (and still feel) that this was misguided.

>> Looking at it again, I'm inclined to agree. The behavior was entirely
>> correct up until somebody decided to emit a continuing stream of
>> XLOG_RUNNING_XACTS WAL records even when the system is idle. Why did
>> we not fix it by fixing that?

> That's exactly what I think we should have done.

Actually, it looks like there is an extremely simple way to handle this, which is to move the call of LogStandbySnapshot (which generates the WAL record in question) to before the checkpoint's REDO pointer is set, but after we have decided that we need a checkpoint. That will result in later iterations not thinking that some work had happened while the checkpoint is in progress.

It looks like we would need an extra release/reacquire of WALInsertLock to avoid holding that lock while doing LogStandbySnapshot, but that seems relatively negligible in comparison to the total cost of a checkpoint.

There might be some still-better way to manage all this, but this one seems safe enough to consider as a post-beta patch. So I recommend we revert the change in the when-to-skip-checkpoint test in favor of reordering these operations.

			regards, tom lane
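A minimal sketch of the reordering being proposed, with most of CreateCheckpoint() elided. The skip-test helper and the insert-position call are invented names; WALInsertLock, XLogStandbyInfoActive() and LogStandbySnapshot() are the real 9.2-era symbols.

    /* Inside CreateCheckpoint(), sketch only */
    LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
    if (nothing_logged_since_last_checkpoint())     /* hypothetical helper */
    {
        LWLockRelease(WALInsertLock);
        return;                         /* skip this checkpoint entirely */
    }
    LWLockRelease(WALInsertLock);       /* don't hold it across WAL inserts */

    /*
     * Write the standby snapshot BEFORE establishing the REDO pointer, so
     * the XLOG_RUNNING_XACTS record no longer looks like "work done since
     * the last checkpoint" to the next cycle's skip test.
     */
    if (XLogStandbyInfoActive())
        LogStandbySnapshot();

    /* Reacquire the lock and only now fix the REDO pointer */
    LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
    checkPoint.redo = current_insert_position();    /* hypothetical */
    LWLockRelease(WALInsertLock);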
I wrote:
> Actually, it looks like there is an extremely simple way to handle this,
> which is to move the call of LogStandbySnapshot (which generates the WAL
> record in question) to before the checkpoint's REDO pointer is set, but
> after we have decided that we need a checkpoint.

On further contemplation, there is a downside to that idea, which probably explains why the code was written as it was: if we place the XLOG_RUNNING_XACTS WAL record emitted during a checkpoint before rather than after the checkpoint's REDO point, then a hot standby slave starting up from that checkpoint won't process the XLOG_RUNNING_XACTS record. That means its KnownAssignedXids machinery won't be fully operational until the master starts another checkpoint, which might be awhile. So this could result in undesirable delay in hot standby mode becoming active.

I am not sure how significant this really is though. Comments?

If we don't like that, I can think of a couple of other ways to get there, but they have their own downsides:

* Instead of trying to detect after-the-fact whether any concurrent WAL activity happened during the last checkpoint, we could detect it during the checkpoint and then keep the info in a static variable in the checkpointer process until next time. However, I don't see any bulletproof way to do this without adding at least one or two lines of code within XLogInsert, which I'm sure Robert will complain about.

* We could expand checkpoint records to contain two different REDO pointers, one to be used by hot standby slaves and one for normal crash recovery. (The LogStandbySnapshot records would appear between these two points; we'd still be moving them up to the start of the checkpoint sequence.) This is a relatively clean solution but would force pg_upgrade between beta2 and beta3, so that's not so nice.

* Combining the two ideas, we could take the nominal REDO pointer, run LogStandbySnapshot, make a fresh note of where the insert point is (the real REDO point, which is what we publish in shared memory for the bufmgr to compare LSNs to), complete the checkpoint, and write the checkpoint record using the nominal REDO pointer so that that's where any crash or HS slave starts from. But save the real REDO pointer in checkpointer static state, and in the next checkpoint use that rather than the nominal pointer to decide if anything's happened that would force a new checkpoint. I think this dodges both of the above complaints, but it feels pretty baroque.

Thoughts, other ideas?

			regards, tom lane
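A sketch of that third, combined option; every name below is invented (no such fields or helpers exist) and all locking is omitted.

    static XLogRecPtr lastRealRedo;     /* static state in the checkpointer */

    void
    CreateCheckpoint_sketch(void)
    {
        XLogRecPtr  nominalRedo = current_insert_position();
        XLogRecPtr  realRedo;

        /* The skip test uses the REAL redo point remembered from last cycle */
        if (nothing_inserted_since(lastRealRedo))
            return;

        LogStandbySnapshot();           /* lands after nominalRedo */

        /* The point bufmgr/XLogFlush compare page LSNs against */
        realRedo = current_insert_position();
        publish_redo_pointer(realRedo);

        /* ... flush dirty buffers etc ... */

        /* The on-disk record carries the NOMINAL pointer, so both crash
         * recovery and a hot-standby slave replay the snapshot records. */
        write_checkpoint_record(nominalRedo);

        lastRealRedo = realRedo;        /* remember for the next skip test */
    }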
On 6 June 2012 20:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724, Simon modified the
> rule for when to skip checkpoints on the grounds that not enough
> activity has happened since the last one. However, that commit left the
> comment block about it in a nonsensical state:
>
> * If this isn't a shutdown or forced checkpoint, and we have not switched
> * to the next WAL file since the start of the last checkpoint, skip the
> * checkpoint. The idea here is to avoid inserting duplicate checkpoints
> * when the system is idle. That wastes log space, and more importantly it
> * exposes us to possible loss of both current and previous checkpoint
> * records if the machine crashes just as we're writing the update.
> * (Perhaps it'd make even more sense to checkpoint only when the previous
> * checkpoint record is in a different xlog page?)
>
> The new code entirely fails to prevent writing adjacent checkpoint
> records, because what it checks is the distance from the previous
> checkpoint's REDO pointer, not the previous checkpoint record itself.

You were completely involved in the original discussion of this, as were others. I made the change to use the redo pointer at your request. So when you say Simon did this, what you mean is Simon acted according to group consensus on an agreed feature.

How come this is a valid discussion? Why does making changes here make sense when other changes are said to destabilise the code and delay release? Should I revisit all the things others have done that I disagree with as well? No, I should not. Nor should we be trawling through changes made by me either.

I'll look some more at this, but you should consider why this thread even exists.

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> there is no guarantee that we'll manage to reach a database state
> that is consistent with data already flushed out to disk during
> the last checkpoint.

Robert Haas <robertmhaas@gmail.com> wrote:

> I know of real customers who would have suffered real data loss
> had this code been present in the server version they were using.
> Checkpoints are the *only* mechanism by which SLRU pages get
> flushed to disk on a mostly-idle system. That means if something
> happens to your pg_xlog directory, and you haven't had a
> checkpoint, you're screwed.

Simon Riggs <simon@2ndQuadrant.com> wrote:

> How come this is a valid discussion? Why does making changes here
> make sense when other changes are said to destabilise the code and
> delay release?

I think the standard has been pretty clear -- at this point in a release we fix bugs and regressions from previous releases. The bar for introducing anything which doesn't qualify under either of those is very high in terms of being obviously useful and *very* low risk.

> Should I revisit all the things others have done that I disagree
> with as well?

If you can spot any serious bugs, I would sure appreciate it if you point them out while we're still in beta. I think we all would.

> I'll look some more at this, but you should consider why this
> thread even exists.

Because the beta release contains a new data loss bug which needs to be fixed.

-Kevin
On Wed, Jun 6, 2012 at 6:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Actually, it looks like there is an extremely simple way to handle this,
>> which is to move the call of LogStandbySnapshot (which generates the WAL
>> record in question) to before the checkpoint's REDO pointer is set, but
>> after we have decided that we need a checkpoint.
>
> On further contemplation, there is a downside to that idea, which
> probably explains why the code was written as it was: if we place the
> XLOG_RUNNING_XACTS WAL record emitted during a checkpoint before rather
> than after the checkpoint's REDO point, then a hot standby slave
> starting up from that checkpoint won't process the XLOG_RUNNING_XACTS
> record. That means its KnownAssignedXids machinery won't be fully
> operational until the master starts another checkpoint, which might be
> awhile. So this could result in undesirable delay in hot standby mode
> becoming active.
>
> I am not sure how significant this really is though. Comments?

I suspect that's pretty significant.

> If we don't like that, I can think of a couple of other ways to get there,
> but they have their own downsides:
>
> * Instead of trying to detect after-the-fact whether any concurrent
> WAL activity happened during the last checkpoint, we could detect it
> during the checkpoint and then keep the info in a static variable in
> the checkpointer process until next time. However, I don't see any
> bulletproof way to do this without adding at least one or two lines
> of code within XLogInsert, which I'm sure Robert will complain about.

My main concern here is to avoid doing anything that will make things harder for Heikki's WAL insert scaling patch, which I'm hoping will get done for 9.3. What do you have in mind, exactly?

I feel like ProcLastRecPtr might be enough information. After logging running xacts, we can check whether ProcLastRecPtr is equal to the redo pointer. If so, then nothing got written to WAL between the time we began the checkpoint and the time we wrote that record. If, through further, similar gyrations, we can then work out whether the checkpoint record immediately follows the running-xacts record, we're there. That's pretty ugly, I guess, but it seems possible.

Alternatively, we could keep a flag in XLogCtl->Insert indicating whether anything that requires a new checkpoint has happened since the last checkpoint. This could be set inside the existing block that tests whether RedoRecPtr is out of date, so any given backend would only do it once per checkpoint cycle. We'd have a hard-coded special case that would skip setting the flag for a running-xacts record. I kind of hate to shove even that much extra code in there, but as long as it doesn't mess up what Heikki has in mind maybe it'd be OK...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
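A sketch of that second idea, hung off the existing RedoRecPtr-staleness test in XLogInsert(). The flag name and the running-xacts exemption are hypothetical; XLByteEQ, the backend-local RedoRecPtr, Insert->RedoRecPtr, and the rmid parameter with RM_STANDBY_ID are real 9.2-era symbols.

    /* In XLogInsert(rmid, info, rdata), sketch only */
    if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr))
    {
        /* Existing code: refresh the backend-local copy once per cycle */
        RedoRecPtr = Insert->RedoRecPtr;

        /*
         * New (hypothetical): remember that real work happened, unless this
         * record is the checkpoint's own running-xacts snapshot.
         */
        if (rmid != RM_STANDBY_ID)
            Insert->anyActivitySinceCkpt = true;
    }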
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jun 6, 2012 at 6:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If we don't like that, I can think of a couple of other ways to get there,
>> but they have their own downsides:
>>
>> * Instead of trying to detect after-the-fact whether any concurrent
>> WAL activity happened during the last checkpoint, we could detect it
>> during the checkpoint and then keep the info in a static variable in
>> the checkpointer process until next time. However, I don't see any
>> bulletproof way to do this without adding at least one or two lines
>> of code within XLogInsert, which I'm sure Robert will complain about.

> My main concern here is to avoid doing anything that will make things
> harder for Heikki's WAL insert scaling patch, which I'm hoping will
> get done for 9.3.

Yeah, I'm not very happy either with adding new requirements to XLogInsert, even if it is just tracking one more bit of information.

> What do you have in mind, exactly? I feel like ProcLastRecPtr might
> be enough information. After logging running xacts, we can check
> whether ProcLastRecPtr is equal to the redo pointer. If so, then
> nothing got written to WAL between the time we began the checkpoint
> and the time we wrote that record. If, through further, similar
> gyrations, we can then work out whether the checkpoint record
> immediately follows the running-xacts record, we're there. That's
> pretty ugly, I guess, but it seems possible.

It's fairly messy because of the issues around "holes" being left at page ends etc --- so the fact that ProcLastRecPtr is different from the previous insert pointer doesn't immediately prove whether something else got inserted first, or we just had to leave some dead space. Heikki mentioned this morning that he'd like to remove some of those rules in 9.3, but that doesn't help us today. Note also that a closer look at LogStandbySnapshot shows it emits more than one WAL record, meaning we'd have to do this dance more than once, and the changes to do that would be pretty deadly to the modularity of the functions LogStandbySnapshot calls.

The conclusion I'd come to yesterday was that we'd want XLogInsert to do something like

	if (Insert->PrevRecord is different from ProcLastRecPtr)
		SomebodyElseWroteWAL = true;

where SomebodyElseWroteWAL is a process-local boolean that we reset at the start of a checkpoint sequence, and then check after we've written out the LogStandbySnapshot records and the checkpoint record. (We'd also have to hack ProcLastRecPtr by initializing it to Insert->PrevRecord at the time we reset SomebodyElseWroteWAL, which is sort of ugly in that it messes up the relatively clean definition of that variable.) So that's not exactly a lot of new code in the critical section, but it's still new code.

In the end I think I like the last idea I mentioned (keeping two different "REDO" values during a checkpoint) the best. It's a bit complicated but the grottiness is localized in CreateCheckpoint. Or at least I think it will be, without having written a patch yet.

			regards, tom lane
On 7 June 2012 14:59, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> there is no guarantee that we'll manage to reach a database state
>> that is consistent with data already flushed out to disk during
>> the last checkpoint.
>
> Robert Haas <robertmhaas@gmail.com> wrote:
>
>> I know of real customers who would have suffered real data loss
>> had this code been present in the server version they were using.
>> Checkpoints are the *only* mechanism by which SLRU pages get
>> flushed to disk on a mostly-idle system. That means if something
>> happens to your pg_xlog directory, and you haven't had a
>> checkpoint, you're screwed.

If that is the concern, then it's a one-line fix to add the missing clog flush. The other suggestions I've skim-read seem fairly invasive at this stage of the release.

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
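Presumably the one-liner would look something along these lines; the spelling of the skip test and the lock bookkeeping are simplified guesses, though CheckPointCLOG() is the real clog-flushing routine from access/transam/clog.c. Whether flushing clog alone is actually sufficient is exactly what gets questioned next.

    /* In CreateCheckpoint(), at the skip branch -- sketch only */
    if (checkpoint_can_be_skipped)          /* whatever form the test takes */
    {
        /* flush dirty clog pages even though we skip the full checkpoint */
        CheckPointCLOG();

        LWLockRelease(CheckpointLock);
        return;                 /* no checkpoint record is written */
    }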
Simon Riggs <simon@2ndQuadrant.com> writes:
>> Robert Haas <robertmhaas@gmail.com> wrote:
>>> I know of real customers who would have suffered real data loss
>>> had this code been present in the server version they were using.

> If that is the concern, then it's a one-line fix to add the missing clog flush.

To where, and what performance impact will that have?

> The other suggestions I've skim-read seem fairly invasive at this
> stage of the release.

The issue here is that we committed a not-very-well-thought-out fix to the original problem. I think a reasonable argument could be made for simply reverting commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724 and postponing any of these other ideas to 9.3. The useless-checkpoints problem has been there since 9.0 and can surely wait another release cycle to get fixed. But I concur with Robert that changing the system behavior so that checkpointing of committed changes might be delayed indefinitely is a high-risk choice.

			regards, tom lane
On 7 June 2012 17:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> Robert Haas <robertmhaas@gmail.com> wrote:
>>>> I know of real customers who would have suffered real data loss
>>>> had this code been present in the server version they were using.
>
>> If that is the concern, then it's a one-line fix to add the missing clog flush.
>
> To where, and what performance impact will that have?

To the point where we decide to skip the other parts of the checkpoint. How would that cause a performance impact exactly? It's less work than the original behaviour would be.

>> The other suggestions I've skim-read seem fairly invasive at this
>> stage of the release.
>
> The issue here is that we committed a not-very-well-thought-out fix
> to the original problem. I think a reasonable argument could be made
> for simply reverting commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724
> and postponing any of these other ideas to 9.3. The useless-checkpoints
> problem has been there since 9.0 and can surely wait another release
> cycle to get fixed. But I concur with Robert that changing the system
> behavior so that checkpointing of committed changes might be delayed
> indefinitely is a high-risk choice.

Clearly, delaying checkpoints indefinitely would be a high-risk choice. But they won't be delayed indefinitely, since changes cause WAL records to be written and that would soon cause another checkpoint.

Robert has shown a bug exists, so it should be fixed directly, especially if we believe in least-invasive changes. If not-fixing-the-actual-bug happened before, it's happening again here as well, with some poor-sounding logic to justify it. Please act as you see fit.

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 7, 2012 at 12:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Clearly, delaying checkpoints indefinitely would be a high-risk choice.
> But they won't be delayed indefinitely, since changes cause WAL
> records to be written and that would soon cause another checkpoint.

But that's exactly the problem - it might not be soon at all. We have customers who process about one write transaction per day. The current state of play in 9.2 is that we'll checkpoint when we get to the next WAL segment. But at one write transaction per day, that could take weeks or months. Someone invented a setting called 'checkpoint_timeout' for a very good reason - people don't want huge amounts of time to pass between checkpoints. That setting is currently capped at one hour; the proposition that someone who sets it to 5 minutes on a low-write-volume system is OK with the effective value being 5 months does not seem likely to me.

Your suggestion of just checkpointing CLOG seems like it would mitigate the worst of the damage, but I think I agree with Tom that just reverting the whole patch would be a better solution, if we can't figure out anything cleaner. It's better to have a few unnecessary checkpoints than to risk losing somebody's data, especially since the unnecessary checkpoints only happen with wal_level=hot_standby, but the data loss risk exists for everyone.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 7 June 2012 17:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> If that is the concern, then it's a one-line fix to add the missing clog flush.

>> To where, and what performance impact will that have?

> To the point where we decide to skip the other parts of the
> checkpoint. How would that cause a performance impact exactly? It's
> less work than the original behaviour would be.

It's not clear to me exactly which parts of the checkpoint process would need to be duplicated here to be safe. What you're proposing is basically a partial checkpoint, which would need quite a lot of thought to be sure it did everything that should be done and nothing that shouldn't be. And it would be a continuing risk spot for future bugs of omission.

>> The issue here is that we committed a not-very-well-thought-out fix
>> to the original problem. I think a reasonable argument could be made
>> for simply reverting commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724
>> and postponing any of these other ideas to 9.3. The useless-checkpoints
>> problem has been there since 9.0 and can surely wait another release
>> cycle to get fixed. But I concur with Robert that changing the system
>> behavior so that checkpointing of committed changes might be delayed
>> indefinitely is a high-risk choice.

> Clearly, delaying checkpoints indefinitely would be a high-risk choice.
> But they won't be delayed indefinitely, since changes cause WAL
> records to be written and that would soon cause another checkpoint.

No, the situation of most concern is where we make some change and then nothing happens for a very long time. If there's a continuing flow of updates, then yes a checkpoint will happen ... eventually. Even with the assumption of continuing updates, the time until a commit is checkpointed might be vastly longer than the configured checkpoint interval, and that's an interval in which loss of the WAL log is more dangerous than it was in any previous release.

So basically, this fix is giving up some hard-to-quantify amount of data security. Per this thread, there are other ways in which we can fix the useless-checkpoint problem without giving up any such security.

			regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
> ... It's better to have a few unnecessary
> checkpoints than to risk losing somebody's data, especially since the
> unnecessary checkpoints only happen with wal_level=hot_standby, but
> the data loss risk exists for everyone.

Yeah, that's another point here: the benefit of the patch accrues to a different set of people than the ones paying the penalty. If you've got hot standby enabled, presumably you are replicating to at least one slave and so the prospect of data loss via WAL loss is mitigated for you.

I also note that the other work done in 9.2 to reduce idle-system load did not address replication configurations at all; I think we still have time-driven wakeups in walsender and walreceiver for instance. So I'd rather revert the patch now, and consider that a better fix will be part of a future round of work to reduce the idle-system load in replication setups.

			regards, tom lane
On 7 June 2012 18:03, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 7, 2012 at 12:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Clearly, delaying checkpoints indefinitely would be a high-risk choice.
>> But they won't be delayed indefinitely, since changes cause WAL
>> records to be written and that would soon cause another checkpoint.
>
> But that's exactly the problem - it might not be soon at all. We have
> customers who process about one write transaction per day. The
> current state of play in 9.2 is that we'll checkpoint when we get to
> the next WAL segment. But at one write transaction per day, that
> could take weeks or months. Someone invented a setting called
> 'checkpoint_timeout' for a very good reason - people don't want huge
> amounts of time to pass between checkpoints. That setting is
> currently capped at one hour; the proposition that someone who sets it
> to 5 minutes on a low-write-volume system is OK with the effective
> value being 5 months does not seem likely to me.

We discussed that in the original thread. The purpose of a checkpoint is to reduce recovery time. Nobody actually wants a checkpoint just for its own sake, so why would we care how many months pass between them? I grant that it takes a little while to come to terms with that thought, but that doesn't make the patch wrong.

The only risk of data loss is in the case where someone deletes their pg_xlog and didn't take a backup in all that time, which is hardly recommended behaviour. We're at exactly the same risk of data loss if someone deletes their pg_clog. Too-frequent checkpoints actually make the data loss risk from deleted pg_clog greater, so the balance of data loss risk doesn't seem to have altered.

> Your suggestion of just checkpointing CLOG seems like it would
> mitigate the worst of the damage, but I think I agree with Tom that
> just reverting the whole patch would be a better solution, if we can't
> figure out anything cleaner. It's better to have a few unnecessary
> checkpoints than to risk losing somebody's data, especially since the
> unnecessary checkpoints only happen with wal_level=hot_standby, but
> the data loss risk exists for everyone.

We're not at risk of losing anyone's data. There is no aspect of the normal running system that is at a higher risk of data loss. I don't think it's fair comment to claim the patch has issues because you've found a guy with very, very, very low write rates, no backup strategy, and a propensity to delete essential parts of his database who will be adversely affected.

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 7 June 2012 18:03, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 7, 2012 at 12:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Clearly, delaying checkpoints indefinitely would be a high-risk choice.
>> But they won't be delayed indefinitely, since changes cause WAL
>> records to be written and that would soon cause another checkpoint.
>
> But that's exactly the problem - it might not be soon at all. We have
> customers who process about one write transaction per day. The
> current state of play in 9.2 is that we'll checkpoint when we get to
> the next WAL segment. But at one write transaction per day, that
> could take weeks or months. Someone invented a setting called
> 'checkpoint_timeout' for a very good reason - people don't want huge
> amounts of time to pass between checkpoints. That setting is
> currently capped at one hour; the proposition that someone who sets it
> to 5 minutes on a low-write-volume system is OK with the effective
> value being 5 months does not seem likely to me.

ISTM that this low-volume system accepts approximately the same risk of data loss as a high-volume system that writes the same volume of data, but in 5 minutes rather than 5 months. You might argue that the scope of things that can go wrong in 5 months is far greater than the scope of 5 minutes. I might counter that the mechanical wear attendant to constant writes was a more important factor, or flash memory having gone through one too many P/E cycles, becoming unreliable.

The bottom line is that once you've flushed WAL, you're by definition home free - if you're relying on a checkpoint to save you from data loss, you're in big trouble anyway. Checkpoints are theoretically just for putting some cap on recovery time. That's pretty much how they're described in the literature.

Your customer's use-case seems very narrow, and your complaint seems unusual to me, but couldn't you just get the customer to force checkpoints in a cronjob or something? CheckPointStmt will force, provided !RecoveryInProgress().

-- 
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Peter Geoghegan <peter@2ndquadrant.com> writes:
> On 7 June 2012 18:03, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Jun 7, 2012 at 12:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> Clearly, delaying checkpoints indefinitely would be a high-risk choice.
>>> But they won't be delayed indefinitely, since changes cause WAL
>>> records to be written and that would soon cause another checkpoint.

>> But that's exactly the problem - it might not be soon at all.

> Your customer's use-case seems very narrow, and your complaint seems
> unusual to me, but couldn't you just get the customer to force
> checkpoints in a cronjob or something? CheckPointStmt will force,
> provided !RecoveryInProgress().

I think both you and Simon have completely missed the point. This is not a "use case" in the sense of someone doing it deliberately. This is about data redundancy, ie, if you lose your WAL through some unfortunate mishap, are you totally screwed or is there a reasonable chance that the data is on-disk in the main data store? I would guess that the incidents Robert has been talking about were cases where EDB were engaged to do crash recovery, and were successful precisely because PG wasn't wholly dependent on the WAL copy of the data.

This project has always put reliability first. It's not clear to me why we would compromise that across-the-board in order to slightly reduce idle load in replication configurations. Yeah, it's probably not a *large* compromise ... but it is a compromise, and one that doesn't seem to me to be very well-advised. We can fix the idle-load issue without compromising on this basic goal; it will just take more than a ten-line patch to do it.

			regards, tom lane
On 8 June 2012 05:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <peter@2ndquadrant.com> writes:
>> Your customer's use-case seems very narrow, and your complaint seems
>> unusual to me, but couldn't you just get the customer to force
>> checkpoints in a cronjob or something? CheckPointStmt will force,
>> provided !RecoveryInProgress().
>
> I think both you and Simon have completely missed the point. This
> is not a "use case" in the sense of someone doing it deliberately.
> This is about data redundancy, ie, if you lose your WAL through some
> unfortunate mishap, are you totally screwed or is there a reasonable
> chance that the data is on-disk in the main data store? I would guess
> that the incidents Robert has been talking about were cases where EDB
> were engaged to do crash recovery, and were successful precisely because
> PG wasn't wholly dependent on the WAL copy of the data.

Apart from the likely point that hint bits exist on disk...

> This project has always put reliability first. It's not clear to me why
> we would compromise that across-the-board in order to slightly reduce
> idle load in replication configurations. Yeah, it's probably not a
> *large* compromise ... but it is a compromise, and one that doesn't
> seem to me to be very well-advised. We can fix the idle-load issue
> without compromising on this basic goal; it will just take more than
> a ten-line patch to do it.

So now the standard for my patches is that I must consider what will happen if the xlog is deleted? Tell me such a rule is applied uniformly to all patches and I would be happy.

I will revert, not because of the sense in this argument but because you personally ask for it.

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 7, 2012 at 9:25 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The only risk of data loss is in the case where someone deletes their
> pg_xlog and didn't take a backup in all that time, which is hardly
> recommended behaviour. We're at exactly the same risk of data loss if
> someone deletes their pg_clog. Too-frequent checkpoints actually make
> the data loss risk from deleted pg_clog greater, so the balance of
> data loss risk doesn't seem to have altered.

This doesn't match my experience. pg_xlog is often located on a separate disk, which significantly increases the chances of something bad happening to it, either through user error or because, uh, disks sometimes fail. Now, granted, you can also lose your data directory (including pg_clog) this way, but just because we lose data in that situation doesn't mean we should be happy about also losing data when pg_xlog goes down the toilet, especially when we can easily prevent it by going back to the behavior we've had in every previous release.

Now, I have had customers lose pg_clog data, and it does suck, but it's usually a safe bet that most of the missing transactions committed, so you can pad out the missing files with 0x55, and probably get your data back. On the other hand, it's impossible to guess what any missing pg_xlog data might have been. Perhaps if the data pages are on disk and only CLOG didn't get written you could somehow figure out which bits you need to flip in CLOG to get your data back, but that's pretty heavy brain surgery, and if autovacuum or even just a HOT prune runs before you realize that you need to do it then you're toast. OTOH, if the database has checkpointed, pg_resetxlog is remarkably successful in letting you pick up the pieces and go on with your life.

All that having been said, it wouldn't be a stupid idea to have a little more redundancy in our CLOG mechanism than we do right now. Hint bits help, as does the predictability of the data, but it's still awfully scary to have that much critical data packed into that small a space. I'd love to see us checksum those pages, or store the data in some redundant location that makes it unlikely we'll lose both copies, or ship a utility that will scan all your heap pages and try to find hint bits that reveal which transactions committed and which ones aborted, or all of the above. But until then, I'd like to make sure that we at least have the data on the disk instead of sitting dirty in memory forever.

As a general thought about disaster recovery, my experience is that if you can tell a customer to run a command (like pg_resetxlog), or - not quite as good - if you can tell them to run some script that you email them (like my pad-out-the-CLOG-with-0x55 script), then they're willing to do that, and it usually works, and they're as happy as they're going to be. But if you tell them that they have to send you all their data files or let you log into the machine and poke around for $X/hour * many hours, then they typically don't want to do that. Sometimes it's legally or procedurally impossible for them; even if not, it's cheaper to find some other way to cope with the situation, so they do, but now - the way they view it - the database lost their data. Even if the problem was entirely self-inflicted, like an intentional deletion of pg_xlog, and even if they therefore understand that it was entirely their own stupid fault that the data got eaten, it's a bad experience.

For that reason, I think we should be looking for opportunities to increase the recoverability of the database in every area. I'm sure that everyone on this list who works with customers on a regular basis has had customers who lost pg_xlog, who lost pg_clog (or portions thereof), who dropped their main table, who lost the backing files for pg_class and/or pg_attribute, whose database ended up in lost+found, who had a break in WAL, who had individual blocks corrupted or unreadable within some important table, who were missing TOAST chunks, who took a pg_basebackup and failed to create recovery.conf, who had a corrupted index on a critical system table, who had inconsistent system catalog contents. Some of these problems are caused by bad hardware or bugs, but the most common cause is user error. Regardless of the cause, the user wants to get as much of their data back as possible as quickly and as easily and as reliably as possible. To the extent that we can transform situations that would have required consulting hours into situations from which a semi-automated recovery is possible, or situations that would have required many consulting hours into ones that require only a few, that's a huge win. Of course, we shouldn't place that goal above all else; and of course, this is only one small piece of that. But it is a piece, and it has a tangible benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
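As a footnote to the 0x55 trick above: 0x55 is binary 01010101, i.e. four two-bit TRANSACTION_STATUS_COMMITTED slots per byte, and a clog segment is 32 pages of 8 kB = 256 kB, so a replacement file is easy to fabricate. A standalone sketch (not the actual script referred to above, which isn't shown here):

    #include <stdio.h>
    #include <string.h>

    /* Fabricate a missing pg_clog segment filled with 0x55, so every
     * transaction in its range reads back as committed. */
    int
    main(int argc, char **argv)
    {
        char    page[8192];     /* BLCKSZ */
        FILE   *f;
        int     i;

        if (argc != 2)
        {
            fprintf(stderr, "usage: %s path/to/pg_clog/SEGMENT\n", argv[0]);
            return 1;
        }
        memset(page, 0x55, sizeof(page));
        if ((f = fopen(argv[1], "wb")) == NULL)
        {
            perror(argv[1]);
            return 1;
        }
        for (i = 0; i < 32; i++)    /* 32 pages per SLRU segment */
            if (fwrite(page, 1, sizeof(page), f) != sizeof(page))
            {
                perror("fwrite");
                return 1;
            }
        fclose(f);
        return 0;
    }

Run it against, say, a missing pg_clog/0012 and retry; any transaction that actually aborted will now read as committed, which is why "probably get your data back" is the right amount of hedging.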
Robert Haas <robertmhaas@gmail.com> wrote:

> if the database has checkpointed

I haven't been exactly clear on the risks about which Tom and Robert have been concerned; is it a question about whether we change the meaning of these settings to something more complicated?:

checkpoint_segments (integer)
    Maximum number of log file segments between automatic WAL checkpoints

checkpoint_timeout (integer)
    Maximum time between automatic WAL checkpoints

I can see possibly changing the latter when absolutely nothing has been written to WAL since the last checkpoint, although I'm not sure that should suppress flushing dirty pages (e.g., from hinting) to disk. Such a change seems like it would be of pretty minimal benefit, though, and not something for which it is worth taking on any risk at all. Any other change to the semantics of these settings seems ill-advised on the face of it.

... or am I not grasping the issue properly?

-Kevin
On Fri, Jun 8, 2012 at 12:24 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> I haven't been exactly clear on the risks about which Tom and Robert
> have been concerned; is it a question about whether we change the
> meaning of these settings to something more complicated?:
>
> checkpoint_segments (integer)
>     Maximum number of log file segments between automatic WAL
>     checkpoints
>
> checkpoint_timeout (integer)
>     Maximum time between automatic WAL checkpoints

The issue is that, in the tip of the 9.2 branch, checkpoint_timeout is no longer the maximum time between automatic WAL checkpoints. Instead, the checkpoint is skipped if we're still in the same WAL segment that we were in when we did the last checkpoint. Therefore, there is absolutely no upper bound on the amount of time that can pass between checkpoints. If someone does one transaction, which happens not to cross a WAL segment boundary, we will never automatically checkpoint that transaction. A checkpoint will ONLY be triggered when we have enough write-ahead log volume to get us into the next segment. I am arguing (and Tom is now agreeing) that this is bad, and that the patch which made this change needs either some kind of fix, or to be reverted completely.

The original motivation for the patch was that the code to suppress duplicate checkpoints stopped working correctly when Hot Standby was committed. The previous coding (before the commit at issue) skips a checkpoint if no write-ahead log records at all have been emitted since the start of the preceding checkpoint. I believe this is the correct behavior, but there's a problem: when wal_level = hot_standby, we emit an XLOG_RUNNING_XACTS record during every checkpoint cycle. So, if wal_level = hot_standby, the test always concludes that something has happened, and the system never quiesces: every checkpoint cycle contains at least the XLOG_RUNNING_XACTS record, even if nothing else, so we never get to skip any checkpoints. When wal_level < hot_standby, the problem does not exist and redundant checkpoints are suppressed just as we would hope.

While Simon's patch does fix the problem, I believe that making checkpoint_timeout anything less than a hard timeout is unwise. The previous behavior - at least one checkpoint per checkpoint_timeout - is easy to understand and plan for; I believe the new behavior will be an unpleasant surprise for users who care about checkpointing regularly, which I think most do, whether they are here to be represented in this conversation or not. So I think we need a different fix for the problem that wal_level = hot_standby defeats the redundant-checkpoint-detection code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Simon Riggs <simon@2ndQuadrant.com> writes:
> So now the standard for my patches is that I must consider what will
> happen if the xlog is deleted?

When you're messing around with something that affects data integrity, yes.

The long and the short of it is that this patch does reduce our ability to recover from easily-foreseeable disasters. The problem it was meant to solve is not dire enough to justify that, and other fixes are possible that don't require any compromises in this dimension. So please revert. We can revisit the original complaint in 9.3.

			regards, tom lane
On Wed, Jun 06, 2012 at 06:46:37PM -0400, Tom Lane wrote:
> I wrote:
> > Actually, it looks like there is an extremely simple way to handle this,
> > which is to move the call of LogStandbySnapshot (which generates the WAL
> > record in question) to before the checkpoint's REDO pointer is set, but
> > after we have decided that we need a checkpoint.
>
> On further contemplation, there is a downside to that idea, which
> probably explains why the code was written as it was: if we place the
> XLOG_RUNNING_XACTS WAL record emitted during a checkpoint before rather
> than after the checkpoint's REDO point, then a hot standby slave
> starting up from that checkpoint won't process the XLOG_RUNNING_XACTS
> record. That means its KnownAssignedXids machinery won't be fully
> operational until the master starts another checkpoint, which might be
> awhile. So this could result in undesirable delay in hot standby mode
> becoming active.

Stupid question, but why are we not just setting a boolean variable in shared memory if we WAL-write a non-XLOG_RUNNING_XACTS record, and only checkpoint if that is true?

-- 
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> writes:
> Stupid question, but why are we not just setting a boolean variable in
> shared memory if we WAL-write a non-XLOG_RUNNING_XACTS record, and only
> checkpoint if that is true?

Well, (1) we are trying to avoid adding such logic to the critical section inside XLogInsert, and (2) XLOG_RUNNING_XACTS is not the only problematic record type, there's at least one other.

			regards, tom lane
On Mon, Aug 13, 2012 at 6:19 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Sat, Jun 9, 2012 at 5:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> So now the standard for my patches is that I must consider what will
>>> happen if the xlog is deleted?
>>
>> When you're messing around with something that affects data integrity, yes.
>> The long and the short of it is that this patch does reduce our ability
>> to recover from easily-foreseeable disasters. The problem it was meant
>> to solve is not dire enough to justify that, and other fixes are
>> possible that don't require any compromises in this dimension.
>> So please revert. We can revisit the original complaint in 9.3.
>
> This reversion was done, so
> b8b69d89905e04b910bcd Wed Jun 13, 2012
> reverted:
> 18fb9d8d21a28caddb72 Wed Nov 2, 2011.
>
> However, the corresponding doc changes 43342891861cc2d08de and
> bd2396988a1afbcb6424 were not reverted.
>
> A simple reversion is probably not the right thing, because the
> original docs seemed rather inadequate.
>
> I've attached an attempt to fix this. I also changed "WAL shipping"
> to "WAL archiving", as the reason for setting archive_timeout applies
> to all WAL archiving, not just the special case of warm standby.

Committed, thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company