Thread: pg_upgrade resets timeline to 1
commit 4c5e060049a3714dd27b7f4732fe922090edea69
Author: Bruce Momjian <bruce@momjian.us>
Date: Sat May 16 00:40:18 2015 -0400

pg_upgrade: force timeline 1 in the new cluster

Previously, this prevented promoted standby servers from being upgraded
because of a missing WAL history file. (Timeline 1 doesn't need a
history file, and we don't copy WAL files anyway.)

Pardon me for starting a fresh thread, but I couldn't find where this
was discussed.

I've just had trouble getting barman to work again after a 9.1->9.4.2
upgrade, and I think part of the problem was that the WAL for this
cluster got reset from timeline 2 to 1, which made barman's incoming
WALs processor drop the files, probably because the new filename
0001... is now "less" than the 0002... before.

I don't expect to be able to recover through a pg_upgrade operation,
but pg_upgrade shouldn't make things more complicated than they should
be for backup tools. (If there's a problem with the history files,
shouldn't pg_upgrade copy them instead?)

In fact, I'm wondering if pg_upgrade shouldn't rather *increase* the
timeline to make sure the archive_command doesn't clobber any files
from the old cluster when reused in the new cluster?

https://bugs.debian.org/786993

Christoph
--
cb@df7cb.de | http://www.df7cb.de/
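To make the "0001... is less than 0002..." remark concrete, here is a minimal sketch of how a WAL segment file name is put together: 24 hexadecimal digits, of which the first eight are the timeline ID. The helper names `wal_segment_name` and `parse_timeline` are made up for this illustration; they are not barman or PostgreSQL functions, and whether barman compares names exactly this way is only the guess made in the message above.

```python
def wal_segment_name(timeline: int, log_id: int, segment: int) -> str:
    """Build a 24-hex-digit WAL segment name: timeline, log id, segment."""
    return f"{timeline:08X}{log_id:08X}{segment:08X}"


def parse_timeline(name: str) -> int:
    """Return the timeline ID encoded in the first 8 hex digits."""
    return int(name[:8], 16)


# Last segment seen before the upgrade (timeline 2) versus the first
# segment produced afterwards (timeline reset to 1, later WAL position).
before_upgrade = wal_segment_name(2, 0, 0x10)
after_upgrade = wal_segment_name(1, 0, 0x11)

# A plain string comparison is dominated by the timeline prefix, so the
# newer file sorts *before* the older one.
print(before_upgrade, after_upgrade, after_upgrade < before_upgrade)
```

A tool that only accepts files sorting after the last one it archived would therefore treat everything from the upgraded cluster as stale.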
On Wed, May 27, 2015 at 05:40:09PM +0200, Christoph Berg wrote: > commit 4c5e060049a3714dd27b7f4732fe922090edea69 > Author: Bruce Momjian <bruce@momjian.us> > Date: Sat May 16 00:40:18 2015 -0400 > > pg_upgrade: force timeline 1 in the new cluster > > Previously, this prevented promoted standby servers from being upgraded > because of a missing WAL history file. (Timeline 1 doesn't need a > history file, and we don't copy WAL files anyway.) > > Pardon me for starting a fresh thread, but I couldn't find where this > was discussed. > > I've just had trouble getting barman to work again after a 9.1->9.4.2 > upgrade, and I think part of the problem was that the WAL for this > cluster got reset from timeline 2 to 1, which made barman's incoming > WALs processor drop the files, probably because the new filename > 0001... is now "less" than the 0002... before. > > I don't expect to be able to recover through a pg_upgrade operation, > but pg_upgrade shouldn't make things more complicated than they should > be for backup tools. (If there's a problem with the history files, > shouldn't pg_upgrade copy them instead?) > > In fact, I'm wondering if pg_upgrade shouldn't rather *increase* the > timeline to make sure the archive_command doesn't clobber any files > from the old cluster when reused in the new cluster? > > https://bugs.debian.org/786993 Uh, WAL files and WAL history files are not compatible across PG major versions so you should never be fetching them after a major upgrade. I have noticed some people are putting their WAL files in directories with PG major version numbers to avoid this problem. We could have pg_upgrade increment the timeline and allow for missing history files, but that doesn't fix problems with non-pg_upgrade upgrades, which also should never be sharing WAL files from previous major versions. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: Bruce Momjian 2015-05-27 <20150527174244.GB31835@momjian.us>
> > In fact, I'm wondering if pg_upgrade shouldn't rather *increase* the
> > timeline to make sure the archive_command doesn't clobber any files
> > from the old cluster when reused in the new cluster?
> >
> > https://bugs.debian.org/786993
>
> Uh, WAL files and WAL history files are not compatible across PG major
> versions so you should never be fetching them after a major upgrade. I
> have noticed some people are putting their WAL files in directories with
> PG major version numbers to avoid this problem.

I guess I could rename all the barman server definitions to
$server-$version, yes. My point is mostly that if I choose to continue
to use the same backup store (knowing that I can't recover across the
upgrade point), pg_upgrade shouldn't make things more complicated than
they need to be. This change broke barman and probably most of the other
WAL backup helpers out there, and IMHO shouldn't have been backpatched
to released branches.

> We could have pg_upgrade increment the timeline and allow for missing
> history files, but that doesn't fix problems with non-pg_upgrade
> upgrades, which also should never be sharing WAL files from previous
> major versions.

pg_upgrade-style upgrades have a chance to know which timeline to use.
That other methods have less knowledge about the "old" system
shouldn't mean that pg_upgrade shouldn't care.

(Wishlist idea: an initdb option to choose the timeline to start with)

Christoph
--
cb@df7cb.de | http://www.df7cb.de/
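The "$server-$version" idea can be sketched in a few lines. This is only an illustration of the naming scheme discussed above; `versioned_archive_dir` and the `/srv/wal` base path are hypothetical and do not come from barman.

```python
from pathlib import PurePosixPath


def versioned_archive_dir(base: str, server: str, major: str) -> PurePosixPath:
    """Return a per-server, per-major-version archive directory."""
    return PurePosixPath(base) / f"{server}-{major}"


# WAL from before and after the major upgrade lands in different
# directories, so equal file names can never clash.
old_dir = versioned_archive_dir("/srv/wal", "main", "9.2")
new_dir = versioned_archive_dir("/srv/wal", "main", "9.4")
print(old_dir, new_dir)
```

The trade-off is that every major upgrade needs a configuration change on the backup side, which is the extra work the message above would like to avoid.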
On Wed, May 27, 2015 at 10:06:03PM +0200, Christoph Berg wrote: > Re: Bruce Momjian 2015-05-27 <20150527174244.GB31835@momjian.us> > > > In fact, I'm wondering if pg_upgrade shouldn't rather *increase* the > > > timeline to make sure the archive_command doesn't clobber any files > > > from the old cluster when reused in the new cluster? > > > > > > https://bugs.debian.org/786993 > > > > Uh, WAL files and WAL history files are not compatible across PG major > > versions so you should never be fetching them after a major upgrade. I > > have noticed some people are putting their WAL files in directories with > > PG major version numbers to avoid this problem. > > I guess I could rename all the barman server definitions to > $server-$version, yes. My point is mostly that if I chose to continue > to use the same backup store (knowing that I can't recover across the > upgrade point), pg_upgrade shouldn't make things more complicated than > they need to. This change broke barman and probably most of the other > WAL backup helpers out there, and IMHO shouldn't have been backpatched > to released branches. Well, if you used pg_dump/pg_restore, you would have had even larger problems as the file names would have matched. > > We could have pg_upgrade increment the timeline and allow for missing > > history files, but that doesn't fix problems with non-pg_upgrade > > upgrades, which also should never be sharing WAL files from previous > > major versions. > > pg_upgrade-style upgrades have a chance to know which timeline to use. > That other methods have less knowledge about the "old" system > shouldn't mean that pg_upgrade shouldn't care. That is an open question, whether pg_upgrade should try to avoid this, even if other methods do not, or should we better document not to do this. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Wed, May 27, 2015 at 05:40:09PM +0200, Christoph Berg wrote: > commit 4c5e060049a3714dd27b7f4732fe922090edea69 > Author: Bruce Momjian <bruce@momjian.us> > Date: Sat May 16 00:40:18 2015 -0400 > > pg_upgrade: force timeline 1 in the new cluster > > Previously, this prevented promoted standby servers from being upgraded > because of a missing WAL history file. (Timeline 1 doesn't need a > history file, and we don't copy WAL files anyway.) > > Pardon me for starting a fresh thread, but I couldn't find where this > was discussed. > > I've just had trouble getting barman to work again after a 9.1->9.4.2 > upgrade, and I think part of the problem was that the WAL for this > cluster got reset from timeline 2 to 1, which made barman's incoming > WALs processor drop the files, probably because the new filename > 0001... is now "less" than the 0002... before. It looks like an upgrade from 9.1.x to 9.3.0 or later has always set the new timeline identifier (TLI) to 1. My testing confirms this for an upgrade from 9.1.16 to 9.4.1 and for an upgrade from 9.1.16 to 9.4.2, so I failed to reproduce your report. Would you verify the versions you used? If you were upgrading from 9.3.x, I _can_ reproduce that. Since the 2015-05-16 commits you cite, pg_upgrade always sets TLI=1. Behavior before those commits depended on the source and destination major versions. PostgreSQL 9.0, 9.1 and 9.2 restored the TLI regardless of source version. PostgreSQL 9.3 and 9.4 restored the TLI when upgrading from 9.3 or 9.4, but they set TLI=1 when upgrading from 9.2 or earlier. (Commit 038f3a0 introduced this inconsistent behavior of 9.3 and later.) The commit you cite fixed this symptom: http://www.postgresql.org/message-id/flat/D5359E0908278642BB1747131D62694DAB22560F@AUSMXMBX01.mrws.biz I'm attaching a test script that I used to observe TLI assignment and to test for that problem. pg_upgrade has been restoring TLI without history files since 9.0.0 or earlier, and that was always risky. 
The reported symptom became possible with the introduction of the TIMELINE_HISTORY walsender command in 9.3.0. (It was hard to encounter before 9.4, because 9.3 to 9.3 pg_upgrade runs are rare outside of hacker testing.) Since you observed barman breakage less than a week after a release that changed the post-pg_upgrade TLI, it seems prudent to figure that other folks will be affected. At the same time, I don't understand why that release would prompt the first report. Any upgrade from {9.0,9.1,9.2} to {9.3,9.4} already had the behavior you experienced. Ideas? > I don't expect to be able to recover through a pg_upgrade operation, > but pg_upgrade shouldn't make things more complicated than they should > be for backup tools. (If there's a problem with the history files, > shouldn't pg_upgrade copy them instead?) > > In fact, I'm wondering if pg_upgrade shouldn't rather *increase* the > timeline to make sure the archive_command doesn't clobber any files > from the old cluster when reused in the new cluster? It's worth considering that, as a major-release change. Do note this in the documentation, though: The archive command should generally be designed to refuse to overwrite any pre-existing archive file. This is an important safety feature to preserve the integrity of your archive in case of administrator error (such as sending the output of two different servers to the same archive directory). -- http://www.postgresql.org/docs/devel/static/continuous-archiving.html
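The documentation passage quoted above asks for an archive command that refuses to overwrite. A minimal sketch of that rule follows, assuming a Python helper is called from archive_command; the function name `archive_wal` and its boolean return convention are assumptions for this example, and a production script would also need fsync and cleanup of partially written files.

```python
import os
import shutil
import tempfile


def archive_wal(src_path: str, archive_dir: str) -> bool:
    """Copy src_path into archive_dir; return False instead of overwriting."""
    dest = os.path.join(archive_dir, os.path.basename(src_path))
    try:
        # Mode "xb" creates the file exclusively and raises
        # FileExistsError if a file of that name is already archived.
        with open(dest, "xb") as out, open(src_path, "rb") as src:
            shutil.copyfileobj(src, out)
    except FileExistsError:
        return False
    return True


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "archive")
    os.mkdir(archive)
    segment = os.path.join(tmp, "000000010000000000000011")
    with open(segment, "wb") as f:
        f.write(b"new cluster WAL")
    print(archive_wal(segment, archive))  # first attempt is accepted
    print(archive_wal(segment, archive))  # second attempt is refused
```

A wrapper would turn a `False` result into a non-zero exit status so that the server keeps the segment and retries, rather than silently replacing a file from the old cluster.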
Attachment
On 27 May 2015 at 18:42, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, May 27, 2015 at 05:40:09PM +0200, Christoph Berg wrote:
> > commit 4c5e060049a3714dd27b7f4732fe922090edea69
> > Author: Bruce Momjian <bruce@momjian.us>
> > Date: Sat May 16 00:40:18 2015 -0400
> >
> > pg_upgrade: force timeline 1 in the new cluster
> >
> > Previously, this prevented promoted standby servers from being upgraded
> > because of a missing WAL history file. (Timeline 1 doesn't need a
> > history file, and we don't copy WAL files anyway.)
> >
> > Pardon me for starting a fresh thread, but I couldn't find where this
> > was discussed.
> >
> > I've just had trouble getting barman to work again after a 9.1->9.4.2
> > upgrade, and I think part of the problem was that the WAL for this
> > cluster got reset from timeline 2 to 1, which made barman's incoming
> > WALs processor drop the files, probably because the new filename
> > 0001... is now "less" than the 0002... before.
> >
> > I don't expect to be able to recover through a pg_upgrade operation,
> > but pg_upgrade shouldn't make things more complicated than they should
> > be for backup tools. (If there's a problem with the history files,
> > shouldn't pg_upgrade copy them instead?)
> >
> > In fact, I'm wondering if pg_upgrade shouldn't rather *increase* the
> > timeline to make sure the archive_command doesn't clobber any files
> > from the old cluster when reused in the new cluster?
> >
> > https://bugs.debian.org/786993
>
> Uh, WAL files and WAL history files are not compatible across PG major
> versions so you should never be fetching them after a major upgrade. I
> have noticed some people are putting their WAL files in directories with
> PG major version numbers to avoid this problem.
>
> We could have pg_upgrade increment the timeline and allow for missing
> history files, but that doesn't fix problems with non-pg_upgrade
> upgrades, which also should never be sharing WAL files from previous
> major versions.
Maybe, but I thought we had a high respect for backwards compatibility and we clearly just broke quite a few things that didn't need to be broken.
Hmm, it looks like the change to TimeLine 1 is just a kludge anyway. The rule that TimeLine 1 doesn't need a history file is itself a hack.
What we should be saying is that the last timeline doesn't need a history file. Then no change is needed here.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Bruce Momjian 2015-05-28 <20150527221607.GA7964@momjian.us> > Well, if you used pg_dump/pg_restore, you would have had even larger > problems as the file names would have matched. True, but even here it's possible that files get overwritten. If you had a server running on TL 1 for files 0001001..00010020, and then did a PITR at location 10, you'll have a server writing to 00020010. If you pg_upgrade that, it will keep its WAL position, but start at 1 again, overwriting files 00010011 and following. > > > We could have pg_upgrade increment the timeline and allow for missing > > > history files, but that doesn't fix problems with non-pg_upgrade > > > upgrades, which also should never be sharing WAL files from previous > > > major versions. > > > > pg_upgrade-style upgrades have a chance to know which timeline to use. > > That other methods have less knowledge about the "old" system > > shouldn't mean that pg_upgrade shouldn't care. > > That is an open question, whether pg_upgrade should try to avoid this, > even if other methods do not, or should we better document not to do > this. Actually, if initdb could be told to start at an arbitrary timeline, it would be trivial to avoid the problem with pg_dump upgrades as well. Christoph -- cb@df7cb.de | http://www.df7cb.de/
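The overwrite scenario in this message can be walked through with a small model. It uses the thread's shorthand names (four digits of timeline plus four digits of segment), not real 24-digit WAL names, and it assumes for illustration that the upgraded cluster goes on to write segments 11 through 25; `short_name` and `collisions` are hypothetical helpers.

```python
def short_name(tli: int, seg: int) -> str:
    """Thread shorthand: 4-digit timeline + 4-digit segment number."""
    return f"{tli:04d}{seg:04d}"


def archived_before_upgrade() -> set:
    """Archive contents: TL 1 segments 1..20, then PITR at 10 onto TL 2."""
    timeline_1 = {short_name(1, seg) for seg in range(1, 21)}
    timeline_2 = {short_name(2, 10)}
    return timeline_1 | timeline_2


def collisions(new_tli: int, first_seg: int = 11, last_seg: int = 25) -> list:
    """Names the upgraded cluster would write that already exist."""
    written = {short_name(new_tli, seg) for seg in range(first_seg, last_seg + 1)}
    return sorted(written & archived_before_upgrade())


print(collisions(new_tli=1))  # timeline reset to 1 by pg_upgrade
print(collisions(new_tli=3))  # timeline incremented instead
```

With the timeline reset to 1, the ten names 00010011 through 00010020 collide with the abandoned tail of the old timeline 1; with an incremented timeline the intersection is empty, which is the argument for incrementing.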
Re: Simon Riggs 2015-05-28 <CANP8+j+JTCk+MTh30UgBaDaq6WmOxK3xbHFdT=O9fFpXoOLCCw@mail.gmail.com> > Hmm, it looks like the change to TimeLine 1 is just a kludge anyway. The > rule that TimeLine 1 doesn't need a history file is itself a hack. > > What we should be saying is that the last timeline doesn't need a history > file. Then no change is needed here. IMHO it's as simple as that, yes. Christoph -- cb@df7cb.de | http://www.df7cb.de/
Re: Noah Misch 2015-05-28 <20150528072721.GA4102649@tornado.leadboat.com>
> > I've just had trouble getting barman to work again after a 9.1->9.4.2
> > upgrade, and I think part of the problem was that the WAL for this
> > cluster got reset from timeline 2 to 1, which made barman's incoming
> > WALs processor drop the files, probably because the new filename
> > 0001... is now "less" than the 0002... before.
>
> It looks like an upgrade from 9.1.x to 9.3.0 or later has always set the new
> timeline identifier (TLI) to 1. My testing confirms this for an upgrade from
> 9.1.16 to 9.4.1 and for an upgrade from 9.1.16 to 9.4.2, so I failed to
> reproduce your report. Would you verify the versions you used? If you were
> upgrading from 9.3.x, I _can_ reproduce that.

Sorry, the "9.1" was a typo, the system was on 9.2.11 before/during
pg_upgrade.

> Do note this in the documentation, though:
>
> The archive command should generally be designed to refuse to overwrite any
> pre-existing archive file. This is an important safety feature to preserve
> the integrity of your archive in case of administrator error (such as
> sending the output of two different servers to the same archive directory).
> -- http://www.postgresql.org/docs/devel/static/continuous-archiving.html

(Except that this wasn't possible in practice since ~9.2 until very
recently because some files got archived again during a timeline
switch :-/ )

Christoph
--
cb@df7cb.de | http://www.df7cb.de/
On Thu, May 28, 2015 at 08:47:07AM +0100, Simon Riggs wrote: > We could have pg_upgrade increment the timeline and allow for missing > history files, but that doesn't fix problems with non-pg_upgrade > upgrades, which also should never be sharing WAL files from previous > major versions. > > > Maybe, but I thought we had a high respect for backwards compatibility and we > clearly just broke quite a few things that didn't need to be broken. I can't break something that was never intended to work, and mixing WAL from previous major versions was never designed to work. > Hmm, it looks like the change to TimeLine 1 is just a kludge anyway. The rule > that TimeLine 1 doesn't need a history file is itself a hack. > > What we should be saying is that the last timeline doesn't need a history file. > Then no change is needed here. Yes, that would make a lot more sense than what we have now, but this had to be backpatched, so reverting to the 9.3 and earlier behavior seemed logical. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Thu, May 28, 2015 at 10:13:14AM +0200, Christoph Berg wrote: > Re: Bruce Momjian 2015-05-28 <20150527221607.GA7964@momjian.us> > > Well, if you used pg_dump/pg_restore, you would have had even larger > > problems as the file names would have matched. > > True, but even here it's possible that files get overwritten. If you > had a server running on TL 1 for files 0001001..00010020, and then did > a PITR at location 10, you'll have a server writing to 00020010. > If you pg_upgrade that, it will keep its WAL position, but start at 1 > again, overwriting files 00010011 and following. > > > > > We could have pg_upgrade increment the timeline and allow for missing > > > > history files, but that doesn't fix problems with non-pg_upgrade > > > > upgrades, which also should never be sharing WAL files from previous > > > > major versions. > > > > > > pg_upgrade-style upgrades have a chance to know which timeline to use. > > > That other methods have less knowledge about the "old" system > > > shouldn't mean that pg_upgrade shouldn't care. > > > > That is an open question, whether pg_upgrade should try to avoid this, > > even if other methods do not, or should we better document not to do > > this. > > Actually, if initdb could be told to start at an arbitrary timeline, > it would be trivial to avoid the problem with pg_dump upgrades as > well. Yes, that would make sense. Perhaps we should revisit this for 9.6. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Thu, May 28, 2015 at 10:18:18AM +0200, Christoph Berg wrote: > Re: Noah Misch 2015-05-28 <20150528072721.GA4102649@tornado.leadboat.com> > > > I've just had trouble getting barman to work again after a 9.1->9.4.2 > > > upgrade, and I think part of the problem was that the WAL for this > > > cluster got reset from timeline 2 to 1, which made barman's incoming > > > WALs processor drop the files, probably because the new filename > > > 0001... is now "less" than the 0002... before. > > > > It looks like an upgrade from 9.1.x to 9.3.0 or later has always set the new > > timeline identifier (TLI) to 1. My testing confirms this for an upgrade from > > 9.1.16 to 9.4.1 and for an upgrade from 9.1.16 to 9.4.2, so I failed to > > reproduce your report. Would you verify the versions you used? If you were > > upgrading from 9.3.x, I _can_ reproduce that. > > Sorry, the "9.1" was a typo, the system was on 9.2.11 before/during > pg_upgrade. I ran 9.2.11-to-9.4.1 and 9.2.11-to-9.4.2 upgrades through my script. Both of them set TLI=1. I would be inclined to restore compatibility if this were a 9.4.2 regression, but upgrades from 9.2 to 9.4 have always done that.
On Thu, May 28, 2015 at 10:39:15AM -0400, Noah Misch wrote: > > > It looks like an upgrade from 9.1.x to 9.3.0 or later has always set the new > > > timeline identifier (TLI) to 1. My testing confirms this for an upgrade from > > > 9.1.16 to 9.4.1 and for an upgrade from 9.1.16 to 9.4.2, so I failed to > > > reproduce your report. Would you verify the versions you used? If you were > > > upgrading from 9.3.x, I _can_ reproduce that. > > > > Sorry, the "9.1" was a typo, the system was on 9.2.11 before/during > > pg_upgrade. > > I ran 9.2.11-to-9.4.1 and 9.2.11-to-9.4.2 upgrades through my script. Both of > them set TLI=1. I would be inclined to restore compatibility if this were a > 9.4.2 regression, but upgrades from 9.2 to 9.4 have always done that. Right, it was only 9.3 to 9.4.0 (and 9.4.1) that restored the timeline. Restores to 9.4.2 do not. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Thu, May 28, 2015 at 10:20:58AM -0400, Bruce Momjian wrote: > On Thu, May 28, 2015 at 08:47:07AM +0100, Simon Riggs wrote: > > What we should be saying is that the last timeline doesn't need a history file. > > Then no change is needed here. > > Yes, that would make a lot more sense than what we have now, but this > had to be backpatched, so reverting to the 9.3 and earlier behavior > seemed logical. To clarify for the archives, the 2015-05-16 changes did not revert to 9.3 and earlier behavior. Rather, they standardized on the {9.0,9.1,9.2}-to-{9.3,9.4} upgrade behavior, bringing that behavior to all supported branches and source versions. Here is the history of timeline restoration in pg_upgrade: On Thu, May 28, 2015 at 03:27:21AM -0400, Noah Misch wrote: > Since the 2015-05-16 commits you cite, pg_upgrade always sets TLI=1. Behavior > before those commits depended on the source and destination major versions. > PostgreSQL 9.0, 9.1 and 9.2 restored the TLI regardless of source version. > PostgreSQL 9.3 and 9.4 restored the TLI when upgrading from 9.3 or 9.4, but > they set TLI=1 when upgrading from 9.2 or earlier. (Commit 038f3a0 introduced > this inconsistent behavior of 9.3 and later.)
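The history quoted above amounts to a small decision table, sketched here as a function. It encodes only what the quoted summary states, for destination versions 9.0 through 9.4; the function name and the tuple encoding of major versions are choices made for this illustration and are not pg_upgrade code.

```python
def tli_after_pg_upgrade(src, dst, old_tli, has_2015_05_16_fix):
    """Timeline ID of the new cluster, per the summary quoted above.

    src and dst are major versions as tuples, e.g. (9, 2); dst selects
    the pg_upgrade binary in use. Only 9.0 through 9.4 are covered.
    """
    if has_2015_05_16_fix:
        return 1            # since the 2015-05-16 commits: always TLI=1
    if dst <= (9, 2):
        return old_tli      # 9.0, 9.1, 9.2 restored the old TLI
    if src >= (9, 3):
        return old_tli      # 9.3/9.4 restored it for 9.3/9.4 sources
    return 1                # 9.3/9.4 upgrading from 9.2 or earlier


# The case reported in this thread: 9.2.11 to 9.4.x, old cluster on TLI 2.
print(tli_after_pg_upgrade((9, 2), (9, 4), 2, has_2015_05_16_fix=False))
print(tli_after_pg_upgrade((9, 2), (9, 4), 2, has_2015_05_16_fix=True))
```

Both calls give the same result, which matches the finding that a 9.2-to-9.4 upgrade behaved identically before and after the commits.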
Re: Noah Misch 2015-05-28 <20150528150234.GA4111886@tornado.leadboat.com> > On Thu, May 28, 2015 at 10:20:58AM -0400, Bruce Momjian wrote: > > On Thu, May 28, 2015 at 08:47:07AM +0100, Simon Riggs wrote: > > > What we should be saying is that the last timeline doesn't need a history file. > > > Then no change is needed here. > > > > Yes, that would make a lot more sense than what we have now, but this > > had to be backpatched, so reverting to the 9.3 and earlier behavior > > seemed logical. > > To clarify for the archives, the 2015-05-16 changes did not revert to 9.3 and > earlier behavior. Rather, they standardized on the {9.0,9.1,9.2}-to-{9.3,9.4} > upgrade behavior, bringing that behavior to all supported branches and source > versions. Here is the history of timeline restoration in pg_upgrade: Ok, sorry for the noise then. It's not a regression, but I still think the behavior needs improvement, but this is indeed 9.6 material. Christoph -- cb@df7cb.de | http://www.df7cb.de/
On Thu, May 28, 2015 at 05:26:56PM +0200, Christoph Berg wrote: > Re: Noah Misch 2015-05-28 <20150528150234.GA4111886@tornado.leadboat.com> > > To clarify for the archives, the 2015-05-16 changes did not revert to 9.3 and > > earlier behavior. Rather, they standardized on the {9.0,9.1,9.2}-to-{9.3,9.4} > > upgrade behavior, bringing that behavior to all supported branches and source > > versions. Here is the history of timeline restoration in pg_upgrade: > > Ok, sorry for the noise then. It's not a regression, but I still think > the behavior needs improvement, but this is indeed 9.6 material. No, thank you for the report. It had strong signs of being a regression, considering recent changes and the timing of your discovery.
On 29/05/15 12:59, Noah Misch wrote:
> On Thu, May 28, 2015 at 05:26:56PM +0200, Christoph Berg wrote:
>> Re: Noah Misch 2015-05-28 <20150528150234.GA4111886@tornado.leadboat.com>
>>> To clarify for the archives, the 2015-05-16 changes did not revert to 9.3 and
>>> earlier behavior. Rather, they standardized on the {9.0,9.1,9.2}-to-{9.3,9.4}
>>> upgrade behavior, bringing that behavior to all supported branches and source
>>> versions. Here is the history of timeline restoration in pg_upgrade:
>> Ok, sorry for the noise then. It's not a regression, but I still think
>> the behavior needs improvement, but this is indeed 9.6 material.
> No, thank you for the report. It had strong signs of being a regression,
> considering recent changes and the timing of your discovery.
>
From my experience, I would far rather a user raise concerns that are
important to them, and find there is no real problem, than have users not
raise things and a serious bug or system shortcoming go unnoticed.

This is a major concern of mine. For example, in my current project users
were NOT raising problems in a timely manner, which caused unnecessary work
later in the project than I would have liked!

So not just for PostgreSQL, but in general: if a user has concerns, please
raise them!!!

Cheers,
Gavin