Thread: pg_internal.init is hazardous to your health
Dirk Lutzebaeck and I just spent a tense couple of hours trying to figure out why a large database Down Under wasn't coming up after being reloaded from a base backup plus PITR recovery. The symptoms were that the recovery went fine, but backend processes would fail at startup or soon after with "could not open relation XX/XX/XX: No such file" type of errors.

The answer that ultimately emerged was that they'd been running a nightly maintenance script that did REINDEX SYSTEM (among other things I suppose). The PITR base backup included pg_internal.init files that were appropriate when it was taken, and the PITR recovery process did nothing whatsoever to update 'em :-(. So incoming backends picked up init files with obsolete relfilenode values.

We don't actually need to *update* the file, per se, we only need to remove it if no longer valid --- the next incoming backend will rebuild it. I could see fixing this by making WAL recovery run around and zap all the .init files (only problem is to find 'em), or we could add a new kind of WAL record saying "remove the .init file for database XYZ" to be emitted whenever someone removes the active one. Thoughts?

Meanwhile, if you're trying to recover from a PITR backup and it's not working, try removing any pg_internal.init files you can find.

regards, tom lane
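The workaround in the last paragraph above ("removing any pg_internal.init files you can find") can be sketched as a small script. This is only an illustration, not something posted in the thread: the function names and the dry_run flag are hypothetical, and it assumes the server is stopped and that you pass in the root of the restored data directory. It walks the whole tree rather than assuming where the init files live, since finding them is exactly the stated problem.

```python
import os

INIT_FILE_NAME = "pg_internal.init"


def find_init_files(data_dir):
    """Return the sorted paths of every pg_internal.init under data_dir."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(data_dir):
        if INIT_FILE_NAME in filenames:
            found.append(os.path.join(dirpath, INIT_FILE_NAME))
    return sorted(found)


def remove_init_files(data_dir, dry_run=True):
    """List the init files under data_dir and, unless dry_run, delete them.

    Hypothetical helper: deleting is safe only in the sense described in
    the message above, i.e. the next incoming backend rebuilds the file.
    """
    paths = find_init_files(data_dir)
    if not dry_run:
        for path in paths:
            os.remove(path)
    return paths
```

Calling remove_init_files(data_dir) first with the default dry_run=True shows what would be deleted; passing dry_run=False actually removes the files. No other file in the tree is touched.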
On Tue, 17 Oct 2006, Tom Lane wrote:

> Dirk Lutzebaeck and I just spent a tense couple of hours trying to
> figure out why a large database Down Under wasn't coming up after being
> reloaded from a base backup plus PITR recovery. The symptoms were that
> the recovery went fine, but backend processes would fail at startup or
> soon after with "could not open relation XX/XX/XX: No such file" type of
> errors.
>
> The answer that ultimately emerged was that they'd been running a
> nightly maintenance script that did REINDEX SYSTEM (among other things
> I suppose). The PITR base backup included pg_internal.init files that
> were appropriate when it was taken, and the PITR recovery process did
> nothing whatsoever to update 'em :-(. So incoming backends picked up
> init files with obsolete relfilenode values.

Ouch.

> We don't actually need to *update* the file, per se, we only need to
> remove it if no longer valid --- the next incoming backend will rebuild
> it. I could see fixing this by making WAL recovery run around and zap
> all the .init files (only problem is to find 'em), or we could add a new
> kind of WAL record saying "remove the .init file for database XYZ"
> to be emitted whenever someone removes the active one. Thoughts?

The latter seems the Right Way except, I guess, that the decision to remove the file is buried deep inside inval.c.

Thanks,

Gavin
On Tue, 2006-10-17 at 22:29 -0400, Tom Lane wrote:

> Dirk Lutzebaeck and I just spent a tense couple of hours trying to
> figure out why a large database Down Under wasn't coming up after being
> reloaded from a base backup plus PITR recovery. The symptoms were that
> the recovery went fine, but backend processes would fail at startup or
> soon after with "could not open relation XX/XX/XX: No such file" type of
> errors.

Understand the tension...

> The answer that ultimately emerged was that they'd been running a
> nightly maintenance script that did REINDEX SYSTEM (among other things
> I suppose). The PITR base backup included pg_internal.init files that
> were appropriate when it was taken, and the PITR recovery process did
> nothing whatsoever to update 'em :-(. So incoming backends picked up
> init files with obsolete relfilenode values.

OK, I'm looking at this now for later discussion.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Wed, 2006-10-18 at 12:49 +1000, Gavin Sherry wrote:

> > We don't actually need to *update* the file, per se, we only need to
> > remove it if no longer valid --- the next incoming backend will rebuild
> > it. I could see fixing this by making WAL recovery run around and zap
> > all the .init files (only problem is to find 'em), or we could add a new
> > kind of WAL record saying "remove the .init file for database XYZ"
> > to be emitted whenever someone removes the active one. Thoughts?

Yes, that assessment seems good.

> The latter seems the Right Way except, I guess, that the decision to
> remove the file is buried deep inside inval.c.

I'd prefer the zap everything approach, but emitting a WAL record looks mostly straightforward and just as good.

RelationCacheInitFileInvalidate() can easily emit a WAL record. This is called twice in succession, so we would emit WAL on the RelationCacheInitFileInvalidate(true) call only. I'll work out a patch for that: XLOG_XACT_RELCACHE_INVALIDATE.

RelationCacheInitFileInvalidate() is also called on each FinishPreparedTransaction(). If that is called 100% of the time, then we can skip writing an additional record for prepared transactions by triggering the removal of pg_internal.init when we see a XLOG_XACT_COMMIT_PREPARED during replay.

Not sure whether we need to do that, Heikki? Anyone? I'm guessing no, but it seems sensible to check.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes:

> RelationCacheInitFileInvalidate() is also called on each
> FinishPreparedTransaction().

Surely not...

regards, tom lane
On Wed, 2006-10-18 at 13:24 -0400, Tom Lane wrote:

> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > RelationCacheInitFileInvalidate() is also called on each
> > FinishPreparedTransaction().
>
> Surely not...

I take that to mean there's nothing special about prepared transactions and invalidating the rel cache, so we *do* need to have a separate WAL record in all cases.

OK, I'll write up a patch later today (working in the US for a few days).

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote:

> RelationCacheInitFileInvalidate() is also called on each
> FinishPreparedTransaction().

It's only called if the prepared transaction invalidated the init file.

> If that is called 100% of the time, then we
> can skip writing an additional record for prepared transactions by
> triggering the removal of pg_internal.init when we see a
> XLOG_XACT_COMMIT_PREPARED during replay.
> Not sure whether we need to do that, Heikki? Anyone?
> I'm guessing no, but it seems sensible to check.

If you write the WAL record in RelationCacheInitFileInvalidate(true), that's enough. No extra handling for prepared transactions is needed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
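The scheme agreed on above (log a record on the first of the two invalidation calls, and remove the init file when that record is replayed) can be illustrated with a toy model. This is a sketch under loud assumptions, not PostgreSQL code and not the patch under discussion: the "WAL" is a plain Python list, the record tag, the function names, the first_call parameter, and the base/<database id>/ layout used here are all hypothetical stand-ins.

```python
import os

INIT_FILE_NAME = "pg_internal.init"
RELCACHE_INVALIDATE = "RELCACHE_INVALIDATE"  # hypothetical record tag


def init_file_path(data_dir, db_id):
    """Where this toy model keeps the init file for one database."""
    return os.path.join(data_dir, "base", str(db_id), INIT_FILE_NAME)


def remove_init_file(data_dir, db_id):
    """Delete the init file if present; a missing file is not an error."""
    try:
        os.remove(init_file_path(data_dir, db_id))
    except FileNotFoundError:
        pass


def invalidate_init_file(data_dir, db_id, wal, first_call):
    """Toy stand-in for the invalidation call, which runs twice in a row.

    Only the first call appends a log record, so each invalidation is
    logged exactly once; both calls remove the file.
    """
    if first_call:
        wal.append((RELCACHE_INVALIDATE, db_id))
    remove_init_file(data_dir, db_id)


def replay(data_dir, wal):
    """Toy recovery: drop the init file named by each invalidation record."""
    for record_type, db_id in wal:
        if record_type == RELCACHE_INVALIDATE:
            remove_init_file(data_dir, db_id)
```

In this model a base backup taken before the invalidation still contains the stale init file, and replaying the log over it removes that file, so the next backend would rebuild it. Init files of databases that logged no invalidation are left alone, which is the advantage over zapping everything.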
On Wed, 2006-10-18 at 15:56 +0100, Simon Riggs wrote:

> On Tue, 2006-10-17 at 22:29 -0400, Tom Lane wrote:
> > The answer that ultimately emerged was that they'd been running a
> > nightly maintenance script that did REINDEX SYSTEM (among other things
> > I suppose). The PITR base backup included pg_internal.init files that
> > were appropriate when it was taken, and the PITR recovery process did
> > nothing whatsoever to update 'em :-(. So incoming backends picked up
> > init files with obsolete relfilenode values.
>
> OK, I'm looking at this now for later discussion.

I've coded a patch and am just testing now.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com