Thread: Missing pg_control crashes postmaster

Missing pg_control crashes postmaster

From

Brian Faherty

Date:

23 July 2018, 22:31:13

Hey Hackers,

If a postmaster is running and the pg_control file is removed postgres
will PANIC.

Steps to recreate:

1.) start a new cluster
2.) rm $DATADIR/pg_control
3.) psql => CHECKPOINT;

PANIC:  could not open control file "global/pg_control": No such file
or directory

After the PANIC there is no pg_control. Recovery would be difficult
without a replica or a backup. Instead of crashing we can just write a
new pg_control file since all the data is in memory at the time.

There does not really seem to be a need for this behavior as all the
information postgres needs is in memory at this point. I propose with
a patch to just recreate pg_control on updates if it does not exist.

-- 
Brian Faherty

Attachment

0001-Create-pg_control-on-update-if-not-exists.patch

Re: Missing pg_control crashes postmaster

From

"David G. Johnston"

Date:

23 July 2018, 22:49:33

On Mon, Jul 23, 2018 at 12:31 PM, Brian Faherty <anothergenericuser@gmail.com> wrote:

There does not really seem to be a need for this behavior as all the
information postgres needs is in memory at this point. I propose with
a patch to just recreate pg_control on updates if it does not exist.

Or at minimum create said file with a different name in PGDATA so an admin can rename it should they wish to accept the in memory version as being a valid replacement for whatever ended up happening to the original.

Even if it can be safely rebuilt having pg_control removed out from under a running server seems like something that shouldn't happen and the server is in its rights to panic if it does.

David J.

Re: Missing pg_control crashes postmaster

From

Andres Freund

Date:

23 July 2018, 22:55:48


On July 23, 2018 12:31:13 PM PDT, Brian Faherty <anothergenericuser@gmail.com> wrote:
>Hey Hackers,
>
>If a postmaster is running and the pg_control file is removed postgres
>will PANIC.
>
>Steps to recreate:
>
>1.) start a new cluster
>2.) rm $DATADIR/pg_control
>3.) psql => CHECKPOINT;
>
>PANIC:  could not open control file "global/pg_control": No such file
>or directory
>
>After the PANIC there is no pg_control. Recovery would be difficult
>without a replica or a backup. Instead of crashing we can just write a
>new pg_control file since all the data is in memory at the time.
>
>There does not really seem to be a need for this behavior as all the
>information postgres needs is in memory at this point. I propose with
>a patch to just recreate pg_control on updates if it does not exist.

What's the issue this would solve? Given that there's moments, until the control file is rewritten, where you would be
toasteither way, I don't buy this gives much added safety. Nor have you explained which realistic scenarios lead to the
filemissing, without much broader problems being present. 


Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: Missing pg_control crashes postmaster

From

Tom Lane

Date:

24 July 2018, 02:00:30

Brian Faherty <anothergenericuser@gmail.com> writes:
> If a postmaster is running and the pg_control file is removed postgres
> will PANIC.

That's very intentional.  Don't do it.

> There does not really seem to be a need for this behavior as all the
> information postgres needs is in memory at this point. I propose with
> a patch to just recreate pg_control on updates if it does not exist.

I would vote to reject any such patch; it's too likely to cause more
problems than it solves.  Generally, if critical files like that one
have disappeared, trying to write new data isn't going to be enough
to fix it and could well result in more corruption.

As an example, imagine that you do "rm -rf $PGDATA; initdb" without
remembering to shut down the old postmaster first.  Currently, the
old postmaster will panic/quit fairly promptly and no harm done.
The more aggressive it is at trying to "recover" from the situation,
the more likely it is to corrupt the new installation.

(Note that you would have to break a few other things in order to
make this particular scenario actually hazardous.  My point is just
that there *are* reasons not to try to recover automatically.)

            regards, tom lane

Re: Missing pg_control crashes postmaster

From

Michael Paquier

Date:

24 July 2018, 04:58:03

On Mon, Jul 23, 2018 at 07:00:30PM -0400, Tom Lane wrote:
> I would vote to reject any such patch; it's too likely to cause more
> problems than it solves.  Generally, if critical files like that one
> have disappeared, trying to write new data isn't going to be enough
> to fix it and could well result in more corruption.

+1.
--
Michael

Attachment

signature.asc

Re: Missing pg_control crashes postmaster

From

Craig Ringer

Date:

24 July 2018, 05:03:49

On 24 July 2018 at 03:31, Brian Faherty <anothergenericuser@gmail.com> wrote:

Hey Hackers,

If a postmaster is running and the pg_control file is removed postgres
will PANIC.

How does that happen?

"Don't go deleting stuff in pgdata" is pretty fundamental.

Re: Missing pg_control crashes postmaster

From

David Steele

Date:

25 July 2018, 17:18:30

On 7/23/18 7:00 PM, Tom Lane wrote:
> Brian Faherty <anothergenericuser@gmail.com> writes:
 >
>> There does not really seem to be a need for this behavior as all the
>> information postgres needs is in memory at this point. I propose with
>> a patch to just recreate pg_control on updates if it does not exist.
> 
> I would vote to reject any such patch; it's too likely to cause more
> problems than it solves.  Generally, if critical files like that one
> have disappeared, trying to write new data isn't going to be enough
> to fix it and could well result in more corruption.
> 
> As an example, imagine that you do "rm -rf $PGDATA; initdb" without
> remembering to shut down the old postmaster first.  Currently, the
> old postmaster will panic/quit fairly promptly and no harm done.
> The more aggressive it is at trying to "recover" from the situation,
> the more likely it is to corrupt the new installation.

It seems much more likely that a missing/modified postmaster.pid will 
cause postgres to panic than it is for a missing pg_control to do so.

Older versions of postgres don't panic until the next checkpoint and 
newer versions won't panic at all on an idle system since we fixed 
redundant checkpoints in 9.6 (6ef2eba3).  An idle postgres 11 cluster 
seems happy enough to run without a pg_control file indefinitely (or at 
least 10 minutes, which is past the default checkpoint time).  As soon 
as I write data or perform a checkpoint it does panic, of course.

Conversely, removing/modifying postmaster.pid causes postgres to panic 
very quickly on the versions I tested, 9.4 and 11.

It seems to me that doing the postmaster.pid test at checkpoint time (if 
we don't already) would be enough to protect pg_control against 
unintentionally replaced clusters.

Or perhaps writing to an alternate file as David J suggests would do the 
trick.

It seems like an easy win if we can find a safe way to do it, though I 
admit that this is only a benefit in corner cases.

Regards,
-- 
-David
david@pgmasters.net

Re: Missing pg_control crashes postmaster

From

Andres Freund

Date:

25 July 2018, 17:37:31


On July 25, 2018 7:18:30 AM PDT, David Steele <david@pgmasters.net> wrote:
>On 7/23/18 7:00 PM, Tom Lane wrote:
>> Brian Faherty <anothergenericuser@gmail.com> writes:
> >
>>> There does not really seem to be a need for this behavior as all the
>>> information postgres needs is in memory at this point. I propose
>with
>>> a patch to just recreate pg_control on updates if it does not exist.
>>
>> I would vote to reject any such patch; it's too likely to cause more
>> problems than it solves.  Generally, if critical files like that one
>> have disappeared, trying to write new data isn't going to be enough
>> to fix it and could well result in more corruption.
>>
>> As an example, imagine that you do "rm -rf $PGDATA; initdb" without
>> remembering to shut down the old postmaster first.  Currently, the
>> old postmaster will panic/quit fairly promptly and no harm done.
>> The more aggressive it is at trying to "recover" from the situation,
>> the more likely it is to corrupt the new installation.
>
>It seems much more likely that a missing/modified postmaster.pid will
>cause postgres to panic than it is for a missing pg_control to do so.
>
>Older versions of postgres don't panic until the next checkpoint and
>newer versions won't panic at all on an idle system since we fixed
>redundant checkpoints in 9.6 (6ef2eba3).  An idle postgres 11 cluster
>seems happy enough to run without a pg_control file indefinitely (or at
>
>least 10 minutes, which is past the default checkpoint time).  As soon
>as I write data or perform a checkpoint it does panic, of course.
>
>Conversely, removing/modifying postmaster.pid causes postgres to panic
>very quickly on the versions I tested, 9.4 and 11.
>
>It seems to me that doing the postmaster.pid test at checkpoint time
>(if
>we don't already) would be enough to protect pg_control against
>unintentionally replaced clusters.
>
>Or perhaps writing to an alternate file as David J suggests would do
>the
>trick.
>
>It seems like an easy win if we can find a safe way to do it, though I
>admit that this is only a benefit in corner cases.

What would we win here? Which scenario that's not contrived would be less bad due to the proposed change.  This seems
complexityfor it's own sake. 

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: Missing pg_control crashes postmaster

From

David Steele

Date:

25 July 2018, 17:52:08

On 7/25/18 10:37 AM, Andres Freund wrote:
> On July 25, 2018 7:18:30 AM PDT, David Steele <david@pgmasters.net> wrote:
>>
>> It seems like an easy win if we can find a safe way to do it, though I
>> admit that this is only a benefit in corner cases.
> 
> What would we win here? Which scenario that's not contrived would be less bad due to the proposed change.  This seems
complexityfor it's own sake.

I think it's worth preserving pg_control even in the case where there is 
other damage to the cluster.  The alternative in this case (if no backup 
exists) is to run pg_resetwal which means data since the last checkpoint 
will not be written out causing even more data loss.  I have run 
clusters with checkpoint_timeout = 60m so data loss in this case is a 
real concern.

I favor the contrived scenario that helps preserve the current cluster 
instead of a hypothetical newly init'd one.  I also don't think that 
users deleting files out of a cluster is all that contrived.

Adding O_CREATE to open() doesn't seem too complex to me.  I'm not 
really in favor of the renaming idea, but I'm not against it either if 
it gets me a copy of the pg_control file.

Regards,
-- 
-David
david@pgmasters.net

Re: Missing pg_control crashes postmaster

From

Andres Freund

Date:

25 July 2018, 18:09:52

Hi,

On 2018-07-25 10:52:08 -0400, David Steele wrote:
> On 7/25/18 10:37 AM, Andres Freund wrote:
> > On July 25, 2018 7:18:30 AM PDT, David Steele <david@pgmasters.net> wrote:
> > > 
> > > It seems like an easy win if we can find a safe way to do it, though I
> > > admit that this is only a benefit in corner cases.
> > 
> > What would we win here? Which scenario that's not contrived would be less bad due to the proposed change.  This
seemscomplexity for it's own sake.
 
> 
> I think it's worth preserving pg_control even in the case where there is
> other damage to the cluster.  The alternative in this case (if no backup
> exists) is to run pg_resetwal which means data since the last checkpoint
> will not be written out causing even more data loss.  I have run clusters
> with checkpoint_timeout = 60m so data loss in this case is a real concern.

Wait, what? How is "data loss in this case is a real concern." - no
even a remotely realistic scenario has been described where this matters
so far.


> I favor the contrived scenario that helps preserve the current cluster
> instead of a hypothetical newly init'd one.  I also don't think that users
> deleting files out of a cluster is all that contrived.

But trying to limp on in that case, and that being helpful, is.


> Adding O_CREATE to open() doesn't seem too complex to me.  I'm not really in
> favor of the renaming idea, but I'm not against it either if it gets me a
> copy of the pg_control file.

The problem is that that'll just hide the issue for a bit longer, while
continuing (due to the O_CREAT we'll not PANIC anymore).  Which can lead
to a lot of followup issues, like checkpoints removing old WAL that'd
have been useful for data recovery.

Greetings,

Andres Freund

Re: Missing pg_control crashes postmaster

From

David Steele

Date:

25 July 2018, 18:19:49

On 7/25/18 11:09 AM, Andres Freund wrote:
> On 2018-07-25 10:52:08 -0400, David Steele wrote:
> 
>> I favor the contrived scenario that helps preserve the current cluster
>> instead of a hypothetical newly init'd one.  I also don't think that users
>> deleting files out of a cluster is all that contrived.
> 
> But trying to limp on in that case, and that being helpful, is.

OK, I can't argue with that.  It would be wrong to continue operating 
without knowing what the damage is.

>> Adding O_CREATE to open() doesn't seem too complex to me.  I'm not really in
>> favor of the renaming idea, but I'm not against it either if it gets me a
>> copy of the pg_control file.
> 
> The problem is that that'll just hide the issue for a bit longer, while
> continuing (due to the O_CREAT we'll not PANIC anymore).  Which can lead
> to a lot of followup issues, like checkpoints removing old WAL that'd
> have been useful for data recovery.

So if a panic is the best thing to do, it might still be good to write 
out a copy of pg_control to another file and let the user know that it's 
there.  More information seems better than less to me.

Regards,
-- 
-David
david@pgmasters.net

Re: Missing pg_control crashes postmaster

From

Tom Lane

Date:

25 July 2018, 18:23:12

David Steele <david@pgmasters.net> writes:
> On 7/25/18 10:37 AM, Andres Freund wrote:
>> What would we win here? Which scenario that's not contrived would be less bad due to the proposed change.  This
seemscomplexity for it's own sake. 

> I favor the contrived scenario that helps preserve the current cluster
> instead of a hypothetical newly init'd one.  I also don't think that
> users deleting files out of a cluster is all that contrived.

What's not contrived about it?  Particularly the case of *only* deleting
pg_control and not any other critical file?  I don't recall having heard
any such stories in the last twenty years.

Also, even if we added the complexity needed to write data into say
pg_control.bak, what then?  You think that that would be any less
prone to indiscriminate rm'ing?  Where and how would we document this,
and what's the odds that someone dumb enough to remove pg_control would
ever have read that part of the documentation?

I'm with Andres: this is a solution in search of a problem.

            regards, tom lane

Re: Missing pg_control crashes postmaster

From

Andres Freund

Date:

25 July 2018, 18:26:14

Hi,

On 2018-07-25 11:19:49 -0400, David Steele wrote:
> On 7/25/18 11:09 AM, Andres Freund wrote:
> > On 2018-07-25 10:52:08 -0400, David Steele wrote:
> > The problem is that that'll just hide the issue for a bit longer, while
> > continuing (due to the O_CREAT we'll not PANIC anymore).  Which can lead
> > to a lot of followup issues, like checkpoints removing old WAL that'd
> > have been useful for data recovery.
> 
> So if a panic is the best thing to do, it might still be good to write out a
> copy of pg_control to another file and let the user know that it's there.
> More information seems better than less to me.

Sure, if that were the only concern. But code complexity and testability
also is. This means we need to maintain code for an edge case that
nobody has come up with a reason for. And at least during development,
we need to come up with test cases. Or we continually run the test case
as part of the regression tests for something without any sort of
practical relevance.

If one wanted to improve recoverability in scenarios like this, there'd
be actually useful things like adding the option to extract control
files, FPIs, clog contents from the WAL with pg_waldump.

Greetings,

Andres Freund

Re: Missing pg_control crashes postmaster

From

Tom Lane

Date:

25 July 2018, 18:33:14

David Steele <david@pgmasters.net> writes:
> On 7/25/18 11:09 AM, Andres Freund wrote:
>> The problem is that that'll just hide the issue for a bit longer, while
>> continuing (due to the O_CREAT we'll not PANIC anymore).  Which can lead
>> to a lot of followup issues, like checkpoints removing old WAL that'd
>> have been useful for data recovery.

> So if a panic is the best thing to do, it might still be good to write 
> out a copy of pg_control to another file and let the user know that it's 
> there.  More information seems better than less to me.

I'm still dubious that this is fixing any real-world problem that is
more pressing than the problems it would create.  If you're asked to
resuscitate a dead cluster, do you trust pg_control.bak if you find
it?  Maybe it's horribly out of date (consider likelihood that someone
removed pg_control more than once, having got away with that the first
time).  If there's both that and pg_control, which do you trust?

            regards, tom lane

Re: Missing pg_control crashes postmaster

From

David Steele

Date:

25 July 2018, 18:33:47

On 7/25/18 11:26 AM, Andres Freund wrote:
> On 2018-07-25 11:19:49 -0400, David Steele wrote:
> 
> If one wanted to improve recoverability in scenarios like this, there'd
> be actually useful things like adding the option to extract control
> files, FPIs, clog contents from the WAL with pg_waldump.

I think you'd need to be a bit careful about what to trust in the WAL 
but presumably the last checkpoint LSN would be safe enough.

At the end of the day good backups are the best protection against an 
issue like this.

Regards,
-- 
-David
david@pgmasters.net

Re: Missing pg_control crashes postmaster

From

David Steele

Date:

25 July 2018, 18:38:25

On 7/25/18 11:33 AM, Tom Lane wrote:
> David Steele <david@pgmasters.net> writes:
>> On 7/25/18 11:09 AM, Andres Freund wrote:
>>> The problem is that that'll just hide the issue for a bit longer, while
>>> continuing (due to the O_CREAT we'll not PANIC anymore).  Which can lead
>>> to a lot of followup issues, like checkpoints removing old WAL that'd
>>> have been useful for data recovery.
> 
>> So if a panic is the best thing to do, it might still be good to write
>> out a copy of pg_control to another file and let the user know that it's
>> there.  More information seems better than less to me.
> 
> I'm still dubious that this is fixing any real-world problem that is
> more pressing than the problems it would create.  If you're asked to
> resuscitate a dead cluster, do you trust pg_control.bak if you find
> it?  Maybe it's horribly out of date (consider likelihood that someone
> removed pg_control more than once, having got away with that the first
> time).  If there's both that and pg_control, which do you trust?

It would need to be a manual operation.  I don't think automating it 
would be a good idea for the reasons that Andres has enumerated.

Perhaps making pg_resetwal a bit smarter in these scenarios would be the 
way to go.  It's already the tool of last resort so this kind of 
manipulation might be a better fit there.

Regards,
-- 
-David
david@pgmasters.net