Thread: Avoiding adjacent checkpoint records
In commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724, Simon modified the rule for when to skip checkpoints on the grounds that not enough activity has happened since the last one. However, that commit left the comment block about it in a nonsensical state:

 * If this isn't a shutdown or forced checkpoint, and we have not switched
 * to the next WAL file since the start of the last checkpoint, skip the
 * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
 * when the system is idle.  That wastes log space, and more importantly it
 * exposes us to possible loss of both current and previous checkpoint
 * records if the machine crashes just as we're writing the update.
 * (Perhaps it'd make even more sense to checkpoint only when the previous
 * checkpoint record is in a different xlog page?)

The new code entirely fails to prevent writing adjacent checkpoint records, because what it checks is the distance from the previous checkpoint's REDO pointer, not the previous checkpoint record itself. So the concern raised in the last two sentences of the comment isn't being addressed at all: if we corrupt the current page of WAL while trying to write the new checkpoint record, we risk losing the previous checkpoint record too. Should the system then crash, there is enough logic to back up to the second previous checkpoint record and roll forward from there --- but since we've lost the last checkpoint and up to one page's worth of preceding WAL records, there is no guarantee that we'll manage to reach a database state that is consistent with data already flushed out to disk during the last checkpoint.

I started to make a quick patch to add an additional check on the location of the previous checkpoint record, so that we'd skip a new checkpoint unless we'd moved to a new page of WAL. However, if we really want to take this risk seriously, ISTM that allowing adjacent checkpoint records is bad all the time, not only for non-forced checkpoints. What I'm now thinking is that a more appropriate way to address that risk is to force a skip to a new page (not segment) of WAL after we write a checkpoint record. This won't waste much WAL space in view of the new rule to avoid checkpoints more than once per segment on average.

On the other hand, you could argue that this concern is entirely hypothetical, and we're already basically assuming that once a WAL record has been flushed to disk it's safe there even if we're still writing more stuff into the same page. If we don't want to assume that, then any XLogFlush would have to include skip-to-new-page, and that's not going to be cheap.

Thoughts?

			regards, tom lane
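For reference, a minimal sketch contrasting the two tests at issue. This is not the committed code: XLogRecPtr is treated here as a flat byte position for readability (the 9.2 struct form needs the XLByte* comparison macros), and the helper name is invented; XLogSegSize and XLOG_BLCKSZ are the real symbols.

    /* Sketch only: the segment-distance test that commit 18fb9d8d actually
     * performs, versus the page-level test the old comment asks for. */
    static bool
    checkpoint_looks_redundant(XLogRecPtr curInsert,
                               XLogRecPtr prevRedo,     /* prev ckpt's REDO ptr */
                               XLogRecPtr prevCkptRec)  /* prev ckpt record itself */
    {
        /* What the commit checks: are we still in the same WAL segment as
         * the previous checkpoint's REDO pointer? */
        if (curInsert / XLogSegSize == prevRedo / XLogSegSize)
            return true;

        /* What the comment worries about: would the new checkpoint record
         * land on the same 8K xlog page as the previous checkpoint record,
         * so that a single torn page could destroy both? */
        if (curInsert / XLOG_BLCKSZ == prevCkptRec / XLOG_BLCKSZ)
            return true;

        return false;
    }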
On Wed, Jun 6, 2012 at 3:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724, Simon modified the
> rule for when to skip checkpoints on the grounds that not enough
> activity has happened since the last one. However, that commit left the
> comment block about it in a nonsensical state:
>
> * If this isn't a shutdown or forced checkpoint, and we have not switched
> * to the next WAL file since the start of the last checkpoint, skip the
> * checkpoint. The idea here is to avoid inserting duplicate checkpoints
> * when the system is idle. That wastes log space, and more importantly it
> * exposes us to possible loss of both current and previous checkpoint
> * records if the machine crashes just as we're writing the update.
> * (Perhaps it'd make even more sense to checkpoint only when the previous
> * checkpoint record is in a different xlog page?)

IIRC, the inspiration for the change was that we were getting a never-ending series of checkpoints even when nothing was happening at all:

http://archives.postgresql.org/pgsql-hackers/2011-10/msg00207.php

I felt (and still feel) that this was misguided. I understand why people don't want a completely idle system to checkpoint; but I can't recall a single complaint about a checkpoint on a system with low but not zero activity. Checkpoints are pretty cheap when there isn't much data to flush.

The flip side is that I know of real customers who would have suffered real data loss had this code been present in the server version they were using. Checkpoints are the *only* mechanism by which SLRU pages get flushed to disk on a mostly-idle system. That means if something happens to your pg_xlog directory, and you haven't had a checkpoint, you're screwed. Letting data sit in memory for hours, days, weeks, or months because we haven't filled up a WAL segment is just terrible. The first user who loses a transaction that was committed a month ago after running pg_resetxlog is going to hit the ceiling, and I don't blame them. It wouldn't be so bad if we had background writing for SLRU pages, because then you could figure that the OS would eventually have a chance to write the page out... but we don't. It'll just sit there in shared memory, dirty, forever. CLOG data in particular is FAR too precious to take that kind of chance with.

I don't think there's much sense in doing push-ups to avoid having the current and previous checkpoint records on the same XLOG page. If the system is so nearly idle that you get two checkpoint records in the same 8k block, and that block gets corrupted, it is extremely likely that you can run pg_resetxlog and be OK. If not, that means there were more XLOG records after the corrupted page, and you're not going to be able to replay those anyway, whether the checkpoint records are in the same 8k block or not. So I'm not seeing how your proposal is buying us any additional measure of safety that we don't already have. Of course, if we had a way to skip over the corrupted portion of WAL and pick up replaying records after that, that would be very useful (even though you'd have to view the resulting database with extreme suspicion), but without that I don't see that finding the previous checkpoint record is doing much for us. Either way, you've potentially lost changes that were covered by WAL records emitted after the most recent checkpoint. The only thing we can really do is make sure that there aren't likely to be too many more unreplayed records after the last checkpoint, which goes back to my previous complaint.

As a side point, another reason not to make the checkpoint record consume the rest of the page is that, for scalability reasons, we want to minimize the amount of calculation that has to be done while holding WALInsertLock, and have as much of the computation as possible get done before acquiring it. XLogInsert() is already way more complicated than anything anyone ought to be doing while holding a heavily-contended LWLock.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jun 6, 2012 at 3:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724, Simon modified the
>> rule for when to skip checkpoints on the grounds that not enough
>> activity has happened since the last one.

> IIRC, the inspiration for the change was that we were getting a
> never-ending series of checkpoints even when nothing was happening at
> all:
> http://archives.postgresql.org/pgsql-hackers/2011-10/msg00207.php

Right.

> I felt (and still feel) that this was misguided.

Looking at it again, I'm inclined to agree. The behavior was entirely correct up until somebody decided to emit a continuing stream of XLOG_RUNNING_XACTS WAL records even when the system is idle. Why did we not fix it by fixing that?

> I don't think there's much sense in doing push-ups to avoid having the
> current and previous checkpoint records on the same XLOG page.

Perhaps not. I only got exercised about it after noting that the commit hadn't updated the comment about it to match what the code is doing. If we end up reverting that commit and doing something else to fix the useless-checkpoint problem, I'm happy to let the subject rest, at least until we get some evidence of a real problem in the area.

			regards, tom lane
On Wed, Jun 6, 2012 at 4:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I felt (and still feel) that this was misguided.
>
> Looking at it again, I'm inclined to agree. The behavior was entirely
> correct up until somebody decided to emit a continuing stream of
> XLOG_RUNNING_XACTS WAL records even when the system is idle. Why did
> we not fix it by fixing that?

That's exactly what I think we should have done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jun 6, 2012 at 4:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I felt (and still feel) that this was misguided.

>> Looking at it again, I'm inclined to agree. The behavior was entirely
>> correct up until somebody decided to emit a continuing stream of
>> XLOG_RUNNING_XACTS WAL records even when the system is idle. Why did
>> we not fix it by fixing that?

> That's exactly what I think we should have done.

Actually, it looks like there is an extremely simple way to handle this, which is to move the call of LogStandbySnapshot (which generates the WAL record in question) to before the checkpoint's REDO pointer is set, but after we have decided that we need a checkpoint. That will result in later iterations not thinking that some work had happened while the checkpoint is in progress.

It looks like we would need an extra release/reacquire of WALInsertLock to avoid holding that lock while doing LogStandbySnapshot, but that seems relatively negligible in comparison to the total cost of a checkpoint.

There might be some still-better way to manage all this, but this one seems safe enough to consider as a post-beta patch. So I recommend we revert the change in the when-to-skip-checkpoint test in favor of reordering these operations.

			regards, tom lane
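A minimal sketch of the reordering being proposed, with most of CreateCheckpoint() elided. The skip-test helper and the insert-position call are invented names; WALInsertLock, XLogStandbyInfoActive() and LogStandbySnapshot() are the real 9.2-era symbols.

    /* Inside CreateCheckpoint(), sketch only */
    LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
    if (nothing_logged_since_last_checkpoint())     /* hypothetical helper */
    {
        LWLockRelease(WALInsertLock);
        return;                         /* skip this checkpoint entirely */
    }
    LWLockRelease(WALInsertLock);       /* don't hold it across WAL inserts */

    /*
     * Write the standby snapshot BEFORE establishing the REDO pointer, so
     * the XLOG_RUNNING_XACTS record no longer looks like "work done since
     * the last checkpoint" to the next cycle's skip test.
     */
    if (XLogStandbyInfoActive())
        LogStandbySnapshot();

    /* Reacquire the lock and only now fix the REDO pointer */
    LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
    checkPoint.redo = current_insert_position();    /* hypothetical */
    LWLockRelease(WALInsertLock);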
I wrote:
> Actually, it looks like there is an extremely simple way to handle this,
> which is to move the call of LogStandbySnapshot (which generates the WAL
> record in question) to before the checkpoint's REDO pointer is set, but
> after we have decided that we need a checkpoint.

On further contemplation, there is a downside to that idea, which probably explains why the code was written as it was: if we place the XLOG_RUNNING_XACTS WAL record emitted during a checkpoint before rather than after the checkpoint's REDO point, then a hot standby slave starting up from that checkpoint won't process the XLOG_RUNNING_XACTS record. That means its KnownAssignedXids machinery won't be fully operational until the master starts another checkpoint, which might be awhile. So this could result in undesirable delay in hot standby mode becoming active.

I am not sure how significant this really is though. Comments?

If we don't like that, I can think of a couple of other ways to get there, but they have their own downsides:

* Instead of trying to detect after-the-fact whether any concurrent WAL activity happened during the last checkpoint, we could detect it during the checkpoint and then keep the info in a static variable in the checkpointer process until next time. However, I don't see any bulletproof way to do this without adding at least one or two lines of code within XLogInsert, which I'm sure Robert will complain about.

* We could expand checkpoint records to contain two different REDO pointers, one to be used by hot standby slaves and one for normal crash recovery. (The LogStandbySnapshot records would appear between these two points; we'd still be moving them up to the start of the checkpoint sequence.) This is a relatively clean solution but would force pg_upgrade between beta2 and beta3, so that's not so nice.

* Combining the two ideas, we could take the nominal REDO pointer, run LogStandbySnapshot, make a fresh note of where the insert point is (the real REDO point, which is what we publish in shared memory for the bufmgr to compare LSNs to), complete the checkpoint, and write the checkpoint record using the nominal REDO pointer so that that's where any crash or HS slave starts from. But save the real REDO pointer in checkpointer static state, and in the next checkpoint use that rather than the nominal pointer to decide if anything's happened that would force a new checkpoint. I think this dodges both of the above complaints, but it feels pretty baroque.

Thoughts, other ideas?

			regards, tom lane
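A sketch of that third, combined option; every name below is invented (no such fields or helpers exist) and all locking is omitted.

    static XLogRecPtr lastRealRedo;     /* static state in the checkpointer */

    void
    CreateCheckpoint_sketch(void)
    {
        XLogRecPtr  nominalRedo = current_insert_position();
        XLogRecPtr  realRedo;

        /* The skip test uses the REAL redo point remembered from last cycle */
        if (nothing_inserted_since(lastRealRedo))
            return;

        LogStandbySnapshot();           /* lands after nominalRedo */

        /* The point bufmgr/XLogFlush compare page LSNs against */
        realRedo = current_insert_position();
        publish_redo_pointer(realRedo);

        /* ... flush dirty buffers etc ... */

        /* The on-disk record carries the NOMINAL pointer, so both crash
         * recovery and a hot-standby slave replay the snapshot records. */
        write_checkpoint_record(nominalRedo);

        lastRealRedo = realRedo;        /* remember for the next skip test */
    }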
On 6 June 2012 20:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724, Simon modified the
> rule for when to skip checkpoints on the grounds that not enough
> activity has happened since the last one. However, that commit left the
> comment block about it in a nonsensical state:
>
> * If this isn't a shutdown or forced checkpoint, and we have not switched
> * to the next WAL file since the start of the last checkpoint, skip the
> * checkpoint. The idea here is to avoid inserting duplicate checkpoints
> * when the system is idle. That wastes log space, and more importantly it
> * exposes us to possible loss of both current and previous checkpoint
> * records if the machine crashes just as we're writing the update.
> * (Perhaps it'd make even more sense to checkpoint only when the previous
> * checkpoint record is in a different xlog page?)
>
> The new code entirely fails to prevent writing adjacent checkpoint
> records, because what it checks is the distance from the previous
> checkpoint's REDO pointer, not the previous checkpoint record itself.

You were completely involved in the original discussion of this, as were others. I made the change to use the redo pointer at your request. So when you say Simon did this, what you mean is Simon acted according to group consensus on an agreed feature.

How come this is a valid discussion? Why does making changes here make sense when other changes are said to destabilise the code and delay release? Should I revisit all the things others have done that I disagree with as well? No, I should not. Nor should we be trawling through changes made by me either.

I'll look some more at this, but you should consider why this thread even exists.

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> there is no guarantee that we'll manage to reach a database state
> that is consistent with data already flushed out to disk during
> the last checkpoint.

Robert Haas <robertmhaas@gmail.com> wrote:

> I know of real customers who would have suffered real data loss
> had this code been present in the server version they were using.
> Checkpoints are the *only* mechanism by which SLRU pages get
> flushed to disk on a mostly-idle system. That means if something
> happens to your pg_xlog directory, and you haven't had a
> checkpoint, you're screwed.

Simon Riggs <simon@2ndQuadrant.com> wrote:

> How come this is a valid discussion? Why does making changes here
> make sense when other changes are said to destabilise the code and
> delay release?

I think the standard has been pretty clear -- at this point in a release we fix bugs and regressions from previous releases. The bar for introducing anything which doesn't qualify under either of those is very high in terms of being obviously useful and *very* low risk.

> Should I revisit all the things others have done that I disagree
> with as well?

If you can spot any serious bugs, I would sure appreciate it if you point them out while we're still in beta. I think we all would.

> I'll look some more at this, but you should consider why this
> thread even exists.

Because the beta release contains a new data loss bug which needs to be fixed.

-Kevin
On Wed, Jun 6, 2012 at 6:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Actually, it looks like there is an extremely simple way to handle this,
>> which is to move the call of LogStandbySnapshot (which generates the WAL
>> record in question) to before the checkpoint's REDO pointer is set, but
>> after we have decided that we need a checkpoint.
>
> On further contemplation, there is a downside to that idea, which
> probably explains why the code was written as it was: if we place the
> XLOG_RUNNING_XACTS WAL record emitted during a checkpoint before rather
> than after the checkpoint's REDO point, then a hot standby slave
> starting up from that checkpoint won't process the XLOG_RUNNING_XACTS
> record. That means its KnownAssignedXids machinery won't be fully
> operational until the master starts another checkpoint, which might be
> awhile. So this could result in undesirable delay in hot standby mode
> becoming active.
>
> I am not sure how significant this really is though. Comments?

I suspect that's pretty significant.

> If we don't like that, I can think of a couple of other ways to get there,
> but they have their own downsides:
>
> * Instead of trying to detect after-the-fact whether any concurrent
> WAL activity happened during the last checkpoint, we could detect it
> during the checkpoint and then keep the info in a static variable in
> the checkpointer process until next time. However, I don't see any
> bulletproof way to do this without adding at least one or two lines
> of code within XLogInsert, which I'm sure Robert will complain about.

My main concern here is to avoid doing anything that will make things harder for Heikki's WAL insert scaling patch, which I'm hoping will get done for 9.3. What do you have in mind, exactly?

I feel like ProcLastRecPtr might be enough information. After logging running xacts, we can check whether ProcLastRecPtr is equal to the redo pointer. If so, then nothing got written to WAL between the time we began the checkpoint and the time we wrote that record. If, through further, similar gyrations, we can then work out whether the checkpoint record immediately follows the running-xacts record, we're there. That's pretty ugly, I guess, but it seems possible.

Alternatively, we could keep a flag in XLogCtl->Insert indicating whether anything that requires a new checkpoint has happened since the last checkpoint. This could be set inside the existing block that tests whether RedoRecPtr is out of date, so any given backend would only do it once per checkpoint cycle. We'd have a hard-coded special case that would skip setting the flag for a running-xacts record. I kind of hate to shove even that much extra code in there, but as long as it doesn't mess up what Heikki has in mind maybe it'd be OK...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
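A sketch of that second idea, hung off the existing RedoRecPtr-staleness test in XLogInsert(). The flag name and the running-xacts exemption are hypothetical; XLByteEQ, the backend-local RedoRecPtr, Insert->RedoRecPtr, and the rmid parameter with RM_STANDBY_ID are real 9.2-era symbols.

    /* In XLogInsert(rmid, info, rdata), sketch only */
    if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr))
    {
        /* Existing code: refresh the backend-local copy once per cycle */
        RedoRecPtr = Insert->RedoRecPtr;

        /*
         * New (hypothetical): remember that real work happened, unless this
         * record is the checkpoint's own running-xacts snapshot.
         */
        if (rmid != RM_STANDBY_ID)
            Insert->anyActivitySinceCkpt = true;
    }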
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jun 6, 2012 at 6:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If we don't like that, I can think of a couple of other ways to get there,
>> but they have their own downsides:
>>
>> * Instead of trying to detect after-the-fact whether any concurrent
>> WAL activity happened during the last checkpoint, we could detect it
>> during the checkpoint and then keep the info in a static variable in
>> the checkpointer process until next time. However, I don't see any
>> bulletproof way to do this without adding at least one or two lines
>> of code within XLogInsert, which I'm sure Robert will complain about.

> My main concern here is to avoid doing anything that will make things
> harder for Heikki's WAL insert scaling patch, which I'm hoping will
> get done for 9.3.

Yeah, I'm not very happy either with adding new requirements to XLogInsert, even if it is just tracking one more bit of information.

> What do you have in mind, exactly? I feel like ProcLastRecPtr might
> be enough information. After logging running xacts, we can check
> whether ProcLastRecPtr is equal to the redo pointer. If so, then
> nothing got written to WAL between the time we began the checkpoint
> and the time we wrote that record. If, through further, similar
> gyrations, we can then work out whether the checkpoint record
> immediately follows the running-xacts record, we're there. That's
> pretty ugly, I guess, but it seems possible.

It's fairly messy because of the issues around "holes" being left at page ends etc --- so the fact that ProcLastRecPtr is different from the previous insert pointer doesn't immediately prove whether something else got inserted first, or we just had to leave some dead space. Heikki mentioned this morning that he'd like to remove some of those rules in 9.3, but that doesn't help us today. Note also that a closer look at LogStandbySnapshot shows it emits more than one WAL record, meaning we'd have to do this dance more than once, and the changes to do that would be pretty deadly to the modularity of the functions LogStandbySnapshot calls.

The conclusion I'd come to yesterday was that we'd want XLogInsert to do something like

	if (Insert->PrevRecord is different from ProcLastRecPtr)
		SomebodyElseWroteWAL = true;

where SomebodyElseWroteWAL is a process-local boolean that we reset at the start of a checkpoint sequence, and then check after we've written out the LogStandbySnapshot records and the checkpoint record. (We'd also have to hack ProcLastRecPtr by initializing it to Insert->PrevRecord at the time we reset SomebodyElseWroteWAL, which is sort of ugly in that it messes up the relatively clean definition of that variable.) So that's not exactly a lot of new code in the critical section, but it's still new code.

In the end I think I like the last idea I mentioned (keeping two different "REDO" values during a checkpoint) the best. It's a bit complicated but the grottiness is localized in CreateCheckpoint. Or at least I think it will be, without having written a patch yet.

			regards, tom lane
On 7 June 2012 14:59, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> there is no guarantee that we'll manage to reach a database state
>> that is consistent with data already flushed out to disk during
>> the last checkpoint.
>
> Robert Haas <robertmhaas@gmail.com> wrote:
>
>> I know of real customers who would have suffered real data loss
>> had this code been present in the server version they were using.
>> Checkpoints are the *only* mechanism by which SLRU pages get
>> flushed to disk on a mostly-idle system. That means if something
>> happens to your pg_xlog directory, and you haven't had a
>> checkpoint, you're screwed.

If that is the concern, then it's a one-line fix to add the missing clog flush. The other suggestions I've skim-read seem fairly invasive at this stage of the release.

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
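Presumably the one-liner would look something along these lines; the spelling of the skip test and the lock bookkeeping are simplified guesses, though CheckPointCLOG() is the real clog-flushing routine from access/transam/clog.c. Whether flushing clog alone is actually sufficient is exactly what gets questioned next.

    /* In CreateCheckpoint(), at the skip branch -- sketch only */
    if (checkpoint_can_be_skipped)          /* whatever form the test takes */
    {
        /* flush dirty clog pages even though we skip the full checkpoint */
        CheckPointCLOG();

        LWLockRelease(CheckpointLock);
        return;                 /* no checkpoint record is written */
    }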
Simon Riggs <simon@2ndQuadrant.com> writes:
>> Robert Haas <robertmhaas@gmail.com> wrote:
>>> I know of real customers who would have suffered real data loss
>>> had this code been present in the server version they were using.

> If that is the concern, then it's a one-line fix to add the missing clog flush.

To where, and what performance impact will that have?

> The other suggestions I've skim-read seem fairly invasive at this
> stage of the release.

The issue here is that we committed a not-very-well-thought-out fix to the original problem. I think a reasonable argument could be made for simply reverting commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724 and postponing any of these other ideas to 9.3. The useless-checkpoints problem has been there since 9.0 and can surely wait another release cycle to get fixed. But I concur with Robert that changing the system behavior so that checkpointing of committed changes might be delayed indefinitely is a high-risk choice.

			regards, tom lane
On 7 June 2012 17:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> Robert Haas <robertmhaas@gmail.com> wrote:
>>>> I know of real customers who would have suffered real data loss
>>>> had this code been present in the server version they were using.
>
>> If that is the concern, then it's a one-line fix to add the missing clog flush.
>
> To where, and what performance impact will that have?

To the point where we decide to skip the other parts of the checkpoint. How would that cause a performance impact exactly? It's less work than the original behaviour would be.

>> The other suggestions I've skim-read seem fairly invasive at this
>> stage of the release.
>
> The issue here is that we committed a not-very-well-thought-out fix
> to the original problem. I think a reasonable argument could be made
> for simply reverting commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724
> and postponing any of these other ideas to 9.3. The useless-checkpoints
> problem has been there since 9.0 and can surely wait another release
> cycle to get fixed. But I concur with Robert that changing the system
> behavior so that checkpointing of committed changes might be delayed
> indefinitely is a high-risk choice.

Clearly, delaying checkpoints indefinitely would be a high-risk choice. But they won't be delayed indefinitely, since changes cause WAL records to be written and that would soon cause another checkpoint.

Robert has shown a bug exists, so it should be fixed directly, especially if we believe in least-invasive changes. If not-fixing-the-actual-bug happened before, it's happening again here as well, with some poor-sounding logic to justify it. Please act as you see fit.

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 7, 2012 at 12:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Clearly, delaying checkpoints indefinitely would be a high-risk choice.
> But they won't be delayed indefinitely, since changes cause WAL
> records to be written and that would soon cause another checkpoint.

But that's exactly the problem - it might not be soon at all. We have customers who process about one write transaction per day. The current state of play in 9.2 is that we'll checkpoint when we get to the next WAL segment. But at one write transaction per day, that could take weeks or months. Someone invented a setting called 'checkpoint_timeout' for a very good reason - people don't want huge amounts of time to pass between checkpoints. That setting is currently capped at one hour; the proposition that someone who sets it to 5 minutes on a low-write-volume system is OK with the effective value being 5 months does not seem likely to me.

Your suggestion of just checkpointing CLOG seems like it would mitigate the worst of the damage, but I think I agree with Tom that just reverting the whole patch would be a better solution, if we can't figure out anything cleaner. It's better to have a few unnecessary checkpoints than to risk losing somebody's data, especially since the unnecessary checkpoints only happen with wal_level=hot_standby, but the data loss risk exists for everyone.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 7 June 2012 17:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> If that is the concern, then it's a one-line fix to add the missing clog flush.

>> To where, and what performance impact will that have?

> To the point where we decide to skip the other parts of the
> checkpoint. How would that cause a performance impact exactly? It's
> less work than the original behaviour would be.

It's not clear to me exactly which parts of the checkpoint process would need to be duplicated here to be safe. What you're proposing is basically a partial checkpoint, which would need quite a lot of thought to be sure it did everything that should be done and nothing that shouldn't be. And it would be a continuing risk spot for future bugs of omission.

>> The issue here is that we committed a not-very-well-thought-out fix
>> to the original problem. I think a reasonable argument could be made
>> for simply reverting commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724
>> and postponing any of these other ideas to 9.3. The useless-checkpoints
>> problem has been there since 9.0 and can surely wait another release
>> cycle to get fixed. But I concur with Robert that changing the system
>> behavior so that checkpointing of committed changes might be delayed
>> indefinitely is a high-risk choice.

> Clearly, delaying checkpoints indefinitely would be a high-risk choice.
> But they won't be delayed indefinitely, since changes cause WAL
> records to be written and that would soon cause another checkpoint.

No, the situation of most concern is where we make some change and then nothing happens for a very long time. If there's a continuing flow of updates, then yes a checkpoint will happen ... eventually. Even with the assumption of continuing updates, the time until a commit is checkpointed might be vastly longer than the configured checkpoint interval, and that's an interval in which loss of the WAL log is more dangerous than it was in any previous release.

So basically, this fix is giving up some hard-to-quantify amount of data security. Per this thread, there are other ways in which we can fix the useless-checkpoint problem without giving up any such security.

			regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
> ... It's better to have a few unnecessary
> checkpoints than to risk losing somebody's data, especially since the
> unnecessary checkpoints only happen with wal_level=hot_standby, but
> the data loss risk exists for everyone.

Yeah, that's another point here: the benefit of the patch accrues to a different set of people than the ones paying the penalty. If you've got hot standby enabled, presumably you are replicating to at least one slave and so the prospect of data loss via WAL loss is mitigated for you.

I also note that the other work done in 9.2 to reduce idle-system load did not address replication configurations at all; I think we still have time-driven wakeups in walsender and walreceiver for instance. So I'd rather revert the patch now, and consider that a better fix will be part of a future round of work to reduce the idle-system load in replication setups.

			regards, tom lane
On 7 June 2012 18:03, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 7, 2012 at 12:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Clearly, delaying checkpoints indefinitely would be a high-risk choice.
>> But they won't be delayed indefinitely, since changes cause WAL
>> records to be written and that would soon cause another checkpoint.
>
> But that's exactly the problem - it might not be soon at all. We have
> customers who process about one write transaction per day. The
> current state of play in 9.2 is that we'll checkpoint when we get to
> the next WAL segment. But at one write transaction per day, that
> could take weeks or months. Someone invented a setting called
> 'checkpoint_timeout' for a very good reason - people don't want huge
> amounts of time to pass between checkpoints. That setting is
> currently capped at one hour; the proposition that someone who sets it
> to 5 minutes on a low-write-volume system is OK with the effective
> value being 5 months does not seem likely to me.

We discussed that in the original thread. The purpose of a checkpoint is to reduce recovery time. Nobody actually wants a checkpoint just for its own sake, so why would we care how many months pass between them? I grant that it takes a little while to come to terms with that thought, but that doesn't make the patch wrong.

The only risk of data loss is in the case where someone deletes their pg_xlog and didn't take a backup in all that time, which is hardly recommended behaviour. We're at exactly the same risk of data loss if someone deletes their pg_clog. Too-frequent checkpoints actually make the data loss risk from deleted pg_clog greater, so the balance of data loss risk doesn't seem to have altered.

> Your suggestion of just checkpointing CLOG seems like it would
> mitigate the worst of the damage, but I think I agree with Tom that
> just reverting the whole patch would be a better solution, if we can't
> figure out anything cleaner. It's better to have a few unnecessary
> checkpoints than to risk losing somebody's data, especially since the
> unnecessary checkpoints only happen with wal_level=hot_standby, but
> the data loss risk exists for everyone.

We're not at risk of losing anyone's data. There is no aspect of the normal running system that is at a higher risk of data loss. I don't think it's fair comment to claim the patch has issues because you've found a guy with very, very, very low write rates, no backup strategy, and a propensity to delete essential parts of his database who will be adversely affected.

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 7 June 2012 18:03, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 7, 2012 at 12:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Clearly, delaying checkpoints indefinitely would be a high-risk choice.
>> But they won't be delayed indefinitely, since changes cause WAL
>> records to be written and that would soon cause another checkpoint.
>
> But that's exactly the problem - it might not be soon at all. We have
> customers who process about one write transaction per day. The
> current state of play in 9.2 is that we'll checkpoint when we get to
> the next WAL segment. But at one write transaction per day, that
> could take weeks or months. Someone invented a setting called
> 'checkpoint_timeout' for a very good reason - people don't want huge
> amounts of time to pass between checkpoints. That setting is
> currently capped at one hour; the proposition that someone who sets it
> to 5 minutes on a low-write-volume system is OK with the effective
> value being 5 months does not seem likely to me.

ISTM that this low-volume system accepts approximately the same risk of data loss as a high-volume system that writes the same volume of data, but in 5 minutes rather than 5 months. You might argue that the scope of things that can go wrong in 5 months is far greater than the scope of 5 minutes. I might counter that the mechanical wear attendant to constant writes was a more important factor, or flash memory having gone through one too many P/E cycles, becoming unreliable.

The bottom line is that once you've flushed WAL, you're by definition home free - if you're relying on a checkpoint to save you from data loss, you're in big trouble anyway. Checkpoints are theoretically just for putting some cap on recovery time. That's pretty much how they're described in the literature.

Your customer's use-case seems very narrow, and your complaint seems unusual to me, but couldn't you just get the customer to force checkpoints in a cronjob or something? CheckPointStmt will force, provided !RecoveryInProgress().

-- 
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Peter Geoghegan <peter@2ndquadrant.com> writes:
> On 7 June 2012 18:03, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Jun 7, 2012 at 12:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> Clearly, delaying checkpoints indefinitely would be a high-risk choice.
>>> But they won't be delayed indefinitely, since changes cause WAL
>>> records to be written and that would soon cause another checkpoint.

>> But that's exactly the problem - it might not be soon at all.

> Your customer's use-case seems very narrow, and your complaint seems
> unusual to me, but couldn't you just get the customer to force
> checkpoints in a cronjob or something? CheckPointStmt will force,
> provided !RecoveryInProgress().

I think both you and Simon have completely missed the point. This is not a "use case" in the sense of someone doing it deliberately. This is about data redundancy, ie, if you lose your WAL through some unfortunate mishap, are you totally screwed or is there a reasonable chance that the data is on-disk in the main data store? I would guess that the incidents Robert has been talking about were cases where EDB were engaged to do crash recovery, and were successful precisely because PG wasn't wholly dependent on the WAL copy of the data.

This project has always put reliability first. It's not clear to me why we would compromise that across-the-board in order to slightly reduce idle load in replication configurations. Yeah, it's probably not a *large* compromise ... but it is a compromise, and one that doesn't seem to me to be very well-advised. We can fix the idle-load issue without compromising on this basic goal; it will just take more than a ten-line patch to do it.

			regards, tom lane
On 8 June 2012 05:01, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <peter@2ndquadrant.com> writes:
>> Your customer's use-case seems very narrow, and your complaint seems
>> unusual to me, but couldn't you just get the customer to force
>> checkpoints in a cronjob or something? CheckPointStmt will force,
>> provided !RecoveryInProgress().
>
> I think both you and Simon have completely missed the point. This
> is not a "use case" in the sense of someone doing it deliberately.
> This is about data redundancy, ie, if you lose your WAL through some
> unfortunate mishap, are you totally screwed or is there a reasonable
> chance that the data is on-disk in the main data store? I would guess
> that the incidents Robert has been talking about were cases where EDB
> were engaged to do crash recovery, and were successful precisely because
> PG wasn't wholly dependent on the WAL copy of the data.

Apart from the likely point that hint bits exist on disk...

> This project has always put reliability first. It's not clear to me why
> we would compromise that across-the-board in order to slightly reduce
> idle load in replication configurations. Yeah, it's probably not a
> *large* compromise ... but it is a compromise, and one that doesn't
> seem to me to be very well-advised. We can fix the idle-load issue
> without compromising on this basic goal; it will just take more than
> a ten-line patch to do it.

So now the standard for my patches is that I must consider what will happen if the xlog is deleted? Tell me such a rule is applied uniformly to all patches and I would be happy.

I will revert, not because of the sense in this argument but because you personally ask for it.

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 7, 2012 at 9:25 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The only risk of data loss is in the case where someone deletes their
> pg_xlog and didn't take a backup in all that time, which is hardly
> recommended behaviour. We're at exactly the same risk of data loss if
> someone deletes their pg_clog. Too-frequent checkpoints actually make
> the data loss risk from deleted pg_clog greater, so the balance of
> data loss risk doesn't seem to have altered.

This doesn't match my experience. pg_xlog is often located on a separate disk, which significantly increases the chances of something bad happening to it, either through user error or because, uh, disks sometimes fail. Now, granted, you can also lose your data directory (including pg_clog) this way, but just because we lose data in that situation doesn't mean we should be happy about also losing data when pg_xlog goes down the toilet, especially when we can easily prevent it by going back to the behavior we've had in every previous release.

Now, I have had customers lose pg_clog data, and it does suck, but it's usually a safe bet that most of the missing transactions committed, so you can pad out the missing files with 0x55, and probably get your data back. On the other hand, it's impossible to guess what any missing pg_xlog data might have been. Perhaps if the data pages are on disk and only CLOG didn't get written you could somehow figure out which bits you need to flip in CLOG to get your data back, but that's pretty heavy brain surgery, and if autovacuum or even just a HOT prune runs before you realize that you need to do it then you're toast. OTOH, if the database has checkpointed, pg_resetxlog is remarkably successful in letting you pick up the pieces and go on with your life.

All that having been said, it wouldn't be a stupid idea to have a little more redundancy in our CLOG mechanism than we do right now. Hint bits help, as does the predictability of the data, but it's still awfully scary to have that much critical data packed into that small a space. I'd love to see us checksum those pages, or store the data in some redundant location that makes it unlikely we'll lose both copies, or ship a utility that will scan all your heap pages and try to find hint bits that reveal which transactions committed and which ones aborted, or all of the above. But until then, I'd like to make sure that we at least have the data on the disk instead of sitting dirty in memory forever.

As a general thought about disaster recovery, my experience is that if you can tell a customer to run a command (like pg_resetxlog), or - not quite as good - if you can tell them to run some script that you email them (like my pad-out-the-CLOG-with-0x55 script), then they're willing to do that, and it usually works, and they're as happy as they're going to be. But if you tell them that they have to send you all their data files or let you log into the machine and poke around for $X/hour * many hours, then they typically don't want to do that. Sometimes it's legally or procedurally impossible for them; even if not, it's cheaper to find some other way to cope with the situation, so they do, but now - the way they view it - the database lost their data. Even if the problem was entirely self-inflicted, like an intentional deletion of pg_xlog, and even if they therefore understand that it was entirely their own stupid fault that the data got eaten, it's a bad experience.

For that reason, I think we should be looking for opportunities to increase the recoverability of the database in every area. I'm sure that everyone on this list who works with customers on a regular basis has had customers who lost pg_xlog, who lost pg_clog (or portions thereof), who dropped their main table, who lost the backing files for pg_class and/or pg_attribute, whose database ended up in lost+found, who had a break in WAL, who had individual blocks corrupted or unreadable within some important table, who were missing TOAST chunks, who took a pg_basebackup and failed to create recovery.conf, who had a corrupted index on a critical system table, who had inconsistent system catalog contents. Some of these problems are caused by bad hardware or bugs, but the most common cause is user error. Regardless of the cause, the user wants to get as much of their data back as possible as quickly and as easily and as reliably as possible. To the extent that we can transform situations that would have required consulting hours into situations from which a semi-automated recovery is possible, or situations that would have required many consulting hours into ones that require only a few, that's a huge win. Of course, we shouldn't place that goal above all else; and of course, this is only one small piece of that. But it is a piece, and it has a tangible benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
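As a footnote to the 0x55 trick above: 0x55 is binary 01010101, i.e. four two-bit TRANSACTION_STATUS_COMMITTED slots per byte, and a clog segment is 32 pages of 8 kB = 256 kB, so a replacement file is easy to fabricate. A standalone sketch (not the actual script referred to above, which isn't shown here):

    #include <stdio.h>
    #include <string.h>

    /* Fabricate a missing pg_clog segment filled with 0x55, so every
     * transaction in its range reads back as committed. */
    int
    main(int argc, char **argv)
    {
        char    page[8192];     /* BLCKSZ */
        FILE   *f;
        int     i;

        if (argc != 2)
        {
            fprintf(stderr, "usage: %s path/to/pg_clog/SEGMENT\n", argv[0]);
            return 1;
        }
        memset(page, 0x55, sizeof(page));
        if ((f = fopen(argv[1], "wb")) == NULL)
        {
            perror(argv[1]);
            return 1;
        }
        for (i = 0; i < 32; i++)    /* 32 pages per SLRU segment */
            if (fwrite(page, 1, sizeof(page), f) != sizeof(page))
            {
                perror("fwrite");
                return 1;
            }
        fclose(f);
        return 0;
    }

Run it against, say, a missing pg_clog/0012 and retry; any transaction that actually aborted will now read as committed, which is why "probably get your data back" is the right amount of hedging.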
Robert Haas <robertmhaas@gmail.com> wrote:

> if the database has checkpointed

I haven't been exactly clear on the risks about which Tom and Robert have been concerned; is it a question about whether we change the meaning of these settings to something more complicated?:

checkpoint_segments (integer)
    Maximum number of log file segments between automatic WAL checkpoints

checkpoint_timeout (integer)
    Maximum time between automatic WAL checkpoints

I can see possibly changing the latter when absolutely nothing has been written to WAL since the last checkpoint, although I'm not sure that should suppress flushing dirty pages (e.g., from hinting) to disk. Such a change seems like it would be of pretty minimal benefit, though, and not something for which it is worth taking on any risk at all. Any other change to the semantics of these settings seems ill-advised on the face of it.

... or am I not grasping the issue properly?

-Kevin
On Fri, Jun 8, 2012 at 12:24 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> I haven't been exactly clear on the risks about which Tom and Robert
> have been concerned; is it a question about whether we change the
> meaning of these settings to something more complicated?:
>
> checkpoint_segments (integer)
>     Maximum number of log file segments between automatic WAL
>     checkpoints
>
> checkpoint_timeout (integer)
>     Maximum time between automatic WAL checkpoints

The issue is that, in the tip of the 9.2 branch, checkpoint_timeout is no longer the maximum time between automatic WAL checkpoints. Instead, the checkpoint is skipped if we're still in the same WAL segment that we were in when we did the last checkpoint. Therefore, there is absolutely no upper bound on the amount of time that can pass between checkpoints. If someone does one transaction, which happens not to cross a WAL segment boundary, we will never automatically checkpoint that transaction. A checkpoint will ONLY be triggered when we have enough write-ahead log volume to get us into the next segment. I am arguing (and Tom is now agreeing) that this is bad, and that the patch which made this change needs either some kind of fix, or to be reverted completely.

The original motivation for the patch was that the code to suppress duplicate checkpoints stopped working correctly when Hot Standby was committed. The previous coding (before the commit at issue) skips a checkpoint if no write-ahead log records at all have been emitted since the start of the preceding checkpoint. I believe this is the correct behavior, but there's a problem: when wal_level = hot_standby, we emit an XLOG_RUNNING_XACTS record during every checkpoint cycle. So, if wal_level = hot_standby, the test always concludes that something has happened, and the system never quiesces: every checkpoint cycle contains at least the XLOG_RUNNING_XACTS record, even if nothing else, so we never get to skip any checkpoints. When wal_level < hot_standby, the problem does not exist and redundant checkpoints are suppressed just as we would hope.

While Simon's patch does fix the problem, I believe that making checkpoint_timeout anything less than a hard timeout is unwise. The previous behavior - at least one checkpoint per checkpoint_timeout - is easy to understand and plan for; I believe the new behavior will be an unpleasant surprise for users who care about checkpointing regularly, which I think most do, whether they are here to be represented in this conversation or not. So I think we need a different fix for the problem that wal_level = hot_standby defeats the redundant-checkpoint-detection code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Simon Riggs <simon@2ndQuadrant.com> writes:
> So now the standard for my patches is that I must consider what will
> happen if the xlog is deleted?

When you're messing around with something that affects data integrity, yes.

The long and the short of it is that this patch does reduce our ability to recover from easily-foreseeable disasters. The problem it was meant to solve is not dire enough to justify that, and other fixes are possible that don't require any compromises in this dimension. So please revert. We can revisit the original complaint in 9.3.

			regards, tom lane
On Wed, Jun 06, 2012 at 06:46:37PM -0400, Tom Lane wrote:
> I wrote:
> > Actually, it looks like there is an extremely simple way to handle this,
> > which is to move the call of LogStandbySnapshot (which generates the WAL
> > record in question) to before the checkpoint's REDO pointer is set, but
> > after we have decided that we need a checkpoint.
>
> On further contemplation, there is a downside to that idea, which
> probably explains why the code was written as it was: if we place the
> XLOG_RUNNING_XACTS WAL record emitted during a checkpoint before rather
> than after the checkpoint's REDO point, then a hot standby slave
> starting up from that checkpoint won't process the XLOG_RUNNING_XACTS
> record. That means its KnownAssignedXids machinery won't be fully
> operational until the master starts another checkpoint, which might be
> awhile. So this could result in undesirable delay in hot standby mode
> becoming active.

Stupid question, but why are we not just setting a boolean variable in shared memory if we WAL-write a non-XLOG_RUNNING_XACTS record, and only checkpoint if that is true?

-- 
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> writes:
> Stupid question, but why are we not just setting a boolean variable in
> shared memory if we WAL-write a non-XLOG_RUNNING_XACTS record, and only
> checkpoint if that is true?

Well, (1) we are trying to avoid adding such logic to the critical section inside XLogInsert, and (2) XLOG_RUNNING_XACTS is not the only problematic record type, there's at least one other.

			regards, tom lane
On Mon, Aug 13, 2012 at 6:19 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Sat, Jun 9, 2012 at 5:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> So now the standard for my patches is that I must consider what will
>>> happen if the xlog is deleted?
>>
>> When you're messing around with something that affects data integrity, yes.
>> The long and the short of it is that this patch does reduce our ability
>> to recover from easily-foreseeable disasters. The problem it was meant
>> to solve is not dire enough to justify that, and other fixes are
>> possible that don't require any compromises in this dimension.
>> So please revert. We can revisit the original complaint in 9.3.
>
> This reversion was done, so
> b8b69d89905e04b910bcd Wed Jun 13, 2012
> reverted:
> 18fb9d8d21a28caddb72 Wed Nov 2, 2011.
>
> However, the corresponding doc changes 43342891861cc2d08de and
> bd2396988a1afbcb6424 were not reverted.
>
> A simple reversion is probably not the right thing, because the
> original docs seemed rather inadequate.
>
> I've attached an attempt to fix this. I also changed "WAL shipping"
> to "WAL archiving", as the reason for setting archive_timeout applies
> to all WAL archiving, not just the special case of warm standby.

Committed, thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company