Thread: BUG #14999: pg_rewind corrupts control file global/pg_control

BUG #14999: pg_rewind corrupts control file global/pg_control

From

PG Bug reporting form

Date:

04 January 2018, 23:06:33

The following bug has been logged on the website:

Bug reference:      14999
Logged by:          Christian H.
Email address:      office@tiptop-labs.com
PostgreSQL version: 10.1
Operating system:   e.g. Debian Buster
Description:

I have encountered a bug in PostgreSQL 10.1: when the target directory for
pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with
"could not open target file" (legitimate) and corrupts the control file
global/pg_control to size 0 (bug). From now on, pg_rewind always exits with
"unexpected control file size 0, expected 8192" and a restore from
pg_basebackup is needed.

A patch for branch REL_10_STABLE of repository
https://github.com/postgres/postgres, a README, and Dockerfiles for
demonstrating both bug and patch are available from
https://github.com/tiptop-labs/postgres-patches .

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Michael Paquier

Date:

05 January 2018, 04:26:37

On Fri, Jan 5, 2018 at 5:06 AM, PG Bug reporting form
<noreply@postgresql.org> wrote:
> I have encountered a bug in PostgreSQL 10.1: when the target directory for
> pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with
> "could not open target file" (legitimate) and corrupts the control file
> global/pg_control to size 0 (bug). From now on, pg_rewind always exits with
> "unexpected control file size 0, expected 8192" and a restore from
> pg_basebackup is needed.

Likely that's reproducible down to 9.5 where pg_rewind has been
introduced. I agree that we should do better with failure handling
here. Corrupting the control file is not cool.

> A patch for branch REL_10_STABLE of repository
> https://github.com/postgres/postgres, a README, and Dockerfiles for
> demonstrating both bug and patch are available from
> https://github.com/tiptop-labs/postgres-patches .

You may not know when github.com is gone, which would cause the data
you are attaching here to go away. What looks interesting is
pg_rewind.patch, which is what you are proposing as a fix, right? I am
attaching it here for PG archives.

-               open_target_file(filename, false);
+               /* Trunc target file for action FILE_ACTION_COPY. */
+               open_target_file(filename, chunkoff == 0);

                write_target_range(chunk, chunkoff, chunksize);
Hm. Let me think more about that as there are quite a few
distributions that link to SSL files that cannot be written, and
pg_rewind copies all configuration files in full.

-        "FROM fetchchunks\n";
+        "FROM fetchchunks\n"
+        "ORDER BY path, begin\n";
If this is aimed at improving the performance of pg_rewind by making
chunk writes more sequential, it should be a different patch. I would
see value in that. If this is aimed to change the order of the files
to avoid the write corruption, this is wrong.

I would imagine that you could see something similar with the offline
mode, haven't tested yet though.
-- 
Michael

Attachment

pg_rewind.patch

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

TipTop Labs

Date:

05 January 2018, 06:36:22

> On Jan 5, 2018, at 2:26 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Fri, Jan 5, 2018 at 5:06 AM, PG Bug reporting form
> <noreply@postgresql.org> wrote:
>> I have encountered a bug in PostgreSQL 10.1: when the target directory for
>> pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with
>> "could not open target file" (legitimate) and corrupts the control file
>> global/pg_control to size 0 (bug). From now on, pg_rewind always exits with
>> "unexpected control file size 0, expected 8192" and a restore from
>> pg_basebackup is needed.
>
> Likely that's reproducible down to 9.5 where pg_rewind has been
> introduced. I agree that we should do better with failure handling
> here. Corrupting the control file is not cool.

I can already confirm that this also occurs with PostgreSQL 9.6.

>> A patch for branch REL_10_STABLE of repository
>> https://github.com/postgres/postgres, a README, and Dockerfiles for
>> demonstrating both bug and patch are available from
>> https://github.com/tiptop-labs/postgres-patches .
>
> You may not know when github.com is gone, which would cause the data
> you are attaching here to go away. What looks interesting is
> pg_rewind.patch, which is what you are proposing as a fix, right? I am
> attaching it here for PG archives.

Yes, it's pg_rewind.patch. There is a bit of rationale in README, feel free to also attach here if/as needed.

>
> -               open_target_file(filename, false);
> +               /* Trunc target file for action FILE_ACTION_COPY. */
> +               open_target_file(filename, chunkoff == 0);
>
>                write_target_range(chunk, chunkoff, chunksize);
> Hm. Let me think more about that as there are quite a few
> distributions that link to SSL files that cannot be written, and
> pg_rewind copies all configuration files in full.

Could you please identify a specific distribution with such properties; I'd like to have a look there too. Thanks.

> -        "FROM fetchchunks\n";
> +        "FROM fetchchunks\n"
> +        "ORDER BY path, begin\n";
> If this is aimed at improving the performance of pg_rewind by making
> chunk writes more sequential, it should be a different patch. I would
> see value in that. If this is aimed to change the order of the files
> to avoid the write corruption, this is wrong.

It may be wrong then; from README: Truncation of the file was moved to the second loop. Truncation occurs there if
chunksare written into files at offset 0. This is the case for FILE_ACTION_COPY. An additional SQL "ORDER BY path,
begin"ensures that these chunks are processed first. 

> I would imagine that you could see something similar with the offline
> mode, haven't tested yet though.
> --
> Michael
> <pg_rewind.patch>

--
Christian

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Michael Paquier

Date:

05 January 2018, 06:41:55

On Fri, Jan 5, 2018 at 12:36 PM, TipTop Labs <office@tiptop-labs.com> wrote:
> Could you please identify a specific distribution with such properties; I'd like to have a look there too. Thanks.

I recall that Ubuntu and Debian do that. I don't use them but others
on this list will likely correct me. The point is that this is
authorized.
-- 
Michael

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Michael Paquier

Date:

15 January 2018, 10:25:37

On Fri, Jan 05, 2018 at 04:36:22AM +0100, TipTop Labs wrote:
>> On Jan 5, 2018, at 2:26 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> On Fri, Jan 5, 2018 at 5:06 AM, PG Bug reporting form
>> <noreply@postgresql.org> wrote:
>>> I have encountered a bug in PostgreSQL 10.1: when the target directory for
>>> pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with
>>> "could not open target file" (legitimate) and corrupts the control file
>>> global/pg_control to size 0 (bug). From now on, pg_rewind always exits with
>>> "unexpected control file size 0, expected 8192" and a restore from
>>> pg_basebackup is needed.
>>
>> Likely that's reproducible down to 9.5 where pg_rewind has been
>> introduced. I agree that we should do better with failure handling
>> here. Corrupting the control file is not cool.
>
> I can already confirm that this also occurs with PostgreSQL 9.6.

As far as I can see, this happens because of the way the 'remote' mode
of pg_rewind considers a set of files to truncate or not. Attached is a
patch which does things in a more correct way. The key point here is to
make the difference between a relation file and a configuration file
when building the filemap for copying a file in full. When a
configuration file is not readable, then trying to open it should not be
a failure. When using a relation file, there should be failures. There
are still two things I am not completely happy about:
- pg_control is considered as a configuration file with this patch,
however we should fail it pg_control is not readable. In practice I
guess that this should not happen, and the patch produces a warning
message. I think that we should consider add a special handling for
things like pg_control, filenode.map. etc. so as you get a hard failure
if those are not readable, so they should enter in the category of
FILE_ACTION_COPY_DATA. This needs a bit more thoughts in
process_source_file().
- open_target_file() resets manually errno. This is necessary as this
gets called continuously when processing the same file in remote mode. I
tried as well to make open_target_file and close_target_file one-time
operations for each file, but the result was even more ugly, so I went
up with this solution.

I am adding that to the next CF to not forget about it. This approach
is back-patchable down to 9.5 in my opinion. I have added as well a TAP
test in the patch which is able to reproduce the problem.

Thoughts?
--
Michael

Attachment

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

TipTop Labs

Date:

26 February 2018, 19:28:53

1. I can confirm that your patch is effective also in my Docker-based test setup and with the current REL_10_STABLE code base (i.e. PostgreSQL 10.2).

2. Your patch is more encompassing than the one I had submitted earlier, and it may be the right way to go. It is cleaner but more "complicated" in that it may require enlisting/recognizing all those special files (pg_control, filenode.map, etc). IMO the earlier patch would already/tolerate handle those, because the distinction it makes is not based on whether something is a configuration file, but purely on whether it is writable.

3. Sorry for the late response :)

On Jan 15, 2018, at 8:25 AM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Fri, Jan 05, 2018 at 04:36:22AM +0100, TipTop Labs wrote:
On Jan 5, 2018, at 2:26 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Fri, Jan 5, 2018 at 5:06 AM, PG Bug reporting form
<noreply@postgresql.org> wrote:
I have encountered a bug in PostgreSQL 10.1: when the target directory for
pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with
"could not open target file" (legitimate) and corrupts the control file
global/pg_control to size 0 (bug). From now on, pg_rewind always exits with
"unexpected control file size 0, expected 8192" and a restore from
pg_basebackup is needed.

Likely that's reproducible down to 9.5 where pg_rewind has been
introduced. I agree that we should do better with failure handling
here. Corrupting the control file is not cool.

I can already confirm that this also occurs with PostgreSQL 9.6.

As far as I can see, this happens because of the way the 'remote' mode
of pg_rewind considers a set of files to truncate or not. Attached is a
patch which does things in a more correct way. The key point here is to
make the difference between a relation file and a configuration file
when building the filemap for copying a file in full. When a
configuration file is not readable, then trying to open it should not be
a failure. When using a relation file, there should be failures. There
are still two things I am not completely happy about:
- pg_control is considered as a configuration file with this patch,
however we should fail it pg_control is not readable. In practice I
guess that this should not happen, and the patch produces a warning
message. I think that we should consider add a special handling for
things like pg_control, filenode.map. etc. so as you get a hard failure
if those are not readable, so they should enter in the category of
FILE_ACTION_COPY_DATA. This needs a bit more thoughts in
process_source_file().
- open_target_file() resets manually errno. This is necessary as this
gets called continuously when processing the same file in remote mode. I
tried as well to make open_target_file and close_target_file one-time
operations for each file, but the result was even more ugly, so I went
up with this solution.

I am adding that to the next CF to not forget about it. This approach
is back-patchable down to 9.5 in my opinion. I have added as well a TAP
test in the patch which is able to reproduce the problem.

Thoughts?
--
Michael
<rewind-readonly-fix-v1.patch>

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Michael Paquier

Date:

27 February 2018, 04:24:41

On Mon, Feb 26, 2018 at 05:28:53PM +0100, TipTop Labs wrote:
> 1. I can confirm that your patch is effective also in my Docker-based
> test setup and with the current REL_10_STABLE code base
> (i.e. PostgreSQL 10.2).

Thanks for checking.  Note that I am still not completely happy with the
handling in errno in some newly-added code paths..

> 2. Your patch is more encompassing than the one I had submitted
> earlier, and it may be the right way to go. It is cleaner but more
> "complicated" in that it may require enlisting/recognizing all those
> special files (pg_control, filenode.map, etc). IMO the earlier patch
> would already/tolerate handle those, because the distinction it makes
> is not based on whether something is a configuration file, but purely
> on whether it is writable.

You are basically looking for that I think:
https://www.postgresql.org/message-id/20180205071022.GA17337%40paquier.xyz
You cannot ignore pg_control and filenode.map though as those are
critical data so they have to be updated.  So if those files are not
writable, you actually have more problems than you think as the cluster
would not be able to correctly start.
--
Michael

Attachment

signature.asc

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Dmitry Dolgov

Date:

02 March 2018, 18:35:07

> On 27 February 2018 at 02:24, Michael Paquier <michael@paquier.xyz> wrote:
>
> Note that I am still not completely happy with the
> handling in errno in some newly-added code paths..

It maybe a stupid question, but why do you need to reset errno here? Is it to
avoid its value being carried from previous calls of `open_target_file`? In
this case if you put the code with `errno == EACCESS` under the if condition
`if (dstfd < 0)`, then as far as I remember you should always get relevant
errno. At the same time maybe it's valid to reset `errno` before `open`, like
with `getpriority`:

    For  some system calls and library functions (e.g., getpriority(2)), -1 is
    a valid return on success.  In such cases, a successful return can be
    distinguished from an error return by setting errno to zero before the
    call, and then, if the call returns a status that indicates that an error
    may have occurred, checking to see if errno has a nonzero value.

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Michael Paquier

Date:

05 March 2018, 10:54:31

On Fri, Mar 02, 2018 at 04:35:07PM +0100, Dmitry Dolgov wrote:
> It maybe a stupid question, but why do you need to reset errno here? Is it to
> avoid its value being carried from previous calls of `open_target_file`? In
> this case if you put the code with `errno == EACCESS` under the if condition
> `if (dstfd < 0)`, then as far as I remember you should always get relevant
> errno. At the same time maybe it's valid to reset `errno` before `open`, like
> with `getpriority`:

[ ... reads and feels stupid ...]

Of course all the checks should be where dstno is negative...

I have done a second pass on the patch, and attached is a new version.
This fixes this handling of errno and addresses some typos.  I have also
fixed the test case where one of the read-only files was not properly
switched to do so.  I have also added a commit log message.
--
Michael

Attachment

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Masahiko Sawada

Date:

05 March 2018, 11:47:06

On Mon, Mar 5, 2018 at 4:54 PM, Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Mar 02, 2018 at 04:35:07PM +0100, Dmitry Dolgov wrote:
>> It maybe a stupid question, but why do you need to reset errno here? Is it to
>> avoid its value being carried from previous calls of `open_target_file`? In
>> this case if you put the code with `errno == EACCESS` under the if condition
>> `if (dstfd < 0)`, then as far as I remember you should always get relevant
>> errno. At the same time maybe it's valid to reset `errno` before `open`, like
>> with `getpriority`:
>
> [ ... reads and feels stupid ...]
>
> Of course all the checks should be where dstno is negative...
>
> I have done a second pass on the patch, and attached is a new version.
> This fixes this handling of errno and addresses some typos.  I have also
> fixed the test case where one of the read-only files was not properly
> switched to do so.  I have also added a commit log message.

@@ -24,7 +24,10 @@
 typedef enum
 {
        FILE_ACTION_CREATE,                     /* create local
directory or symbolic link */
-       FILE_ACTION_COPY,                       /* copy whole file,
overwriting if exists */
+       FILE_ACTION_COPY_CONFIG,        /* copy whole configuration
file, overwriting
+                                                                * if exists */
+       FILE_ACTION_COPY_DATA,          /* copy whole relation file,
overwriting if
+                                                                * exists */
        FILE_ACTION_COPY_TAIL,          /* copy tail from 'oldsize' to
'newsize' */
        FILE_ACTION_NONE,                       /* no action (we might
still copy modified
                                                                 *
blocks based on the parsed WAL) */

Since files other than relation files such as vm, fsm, WAL are
categorize as FILE_ACTION_COPY_CONFIG, I think the name would cause
misreading. For example, we can use FILE_ACTION_COPY_DATA and
FILE_ACTION_COPY_OTHER?

As you mentioned before, with this patch we end up ignoring file-open
error of database files other than relation files. I guess it's a bit
risky. We can enter them to COPY_DATA but I'd defer it to committer.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Michael Paquier

Date:

05 March 2018, 12:01:11

On Mon, Mar 05, 2018 at 05:47:06PM +0900, Masahiko Sawada wrote:
> Since files other than relation files such as vm, fsm, WAL are
> categorize as FILE_ACTION_COPY_CONFIG, I think the name would cause
> misreading. For example, we can use FILE_ACTION_COPY_DATA and
> FILE_ACTION_COPY_OTHER?

I am not stopped on a given name.

> As you mentioned before, with this patch we end up ignoring file-open
> error of database files other than relation files. I guess it's a bit
> risky. We can enter them to COPY_DATA but I'd defer it to committer.

COPY_DATA is aimed at being used on files which can be parsed by
blocks.  You cannot do that with VMs and FSMs.  Now you are true that we
could complain for non-configuration files which need to be
fully-copied, still Postgres issues a fsync on all the files of the data
folder when beginning recovery, so you would bump on problems anyway
after restarting the rewound instance.
--
Michael

Attachment

signature.asc

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Masahiko Sawada

Date:

06 March 2018, 04:39:45

On Mon, Mar 5, 2018 at 6:01 PM, Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Mar 05, 2018 at 05:47:06PM +0900, Masahiko Sawada wrote:
>> Since files other than relation files such as vm, fsm, WAL are
>> categorize as FILE_ACTION_COPY_CONFIG, I think the name would cause
>> misreading. For example, we can use FILE_ACTION_COPY_DATA and
>> FILE_ACTION_COPY_OTHER?
>
> I am not stopped on a given name.

Hmm, when I used pg_rewind --debug with this patch the debug message
made me confused because almost database files appears with
COPY_CONFIG. Also looking at the source code comment, it says
COPY_CONFIG is aimed to configuration files. But these file are not
configuration files actually. COPY_DATA makes sense to me, but
COPY_CONFIG doesn't.

Anyway, other than that the patch looks good to me. I'd like to wait
for the Dmitory's review comment before marking this as "Ready for
Commiter".

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Dmitry Dolgov

Date:

06 March 2018, 23:37:34

> On 6 March 2018 at 02:39, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Mar 5, 2018 at 6:01 PM, Michael Paquier <michael@paquier.xyz> wrote:
>> On Mon, Mar 05, 2018 at 05:47:06PM +0900, Masahiko Sawada wrote:
>>> Since files other than relation files such as vm, fsm, WAL are
>>> categorize as FILE_ACTION_COPY_CONFIG, I think the name would cause
>>> misreading. For example, we can use FILE_ACTION_COPY_DATA and
>>> FILE_ACTION_COPY_OTHER?
>>
>> I am not stopped on a given name.
>
> Hmm, when I used pg_rewind --debug with this patch the debug message
> made me confused because almost database files appears with
> COPY_CONFIG. Also looking at the source code comment, it says
> COPY_CONFIG is aimed to configuration files. But these file are not
> configuration files actually. COPY_DATA makes sense to me, but
> COPY_CONFIG doesn't.
>
> Anyway, other than that the patch looks good to me. I'd like to wait
> for the Dmitory's review comment before marking this as "Ready for
> Commiter".

Thank you for waiting. Yes, it also looks good for me, but I'm wondering about
one thing - does it make sense to think about other error codes here, not only
about `EACCESS`? E.g. if a file was removed during the process (so, it should
be `ENOENT`), or something more exotic happened, like there are too many
symbolic links were encountered in resolving a pathname (`ELOOP`)?

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Michael Paquier

Date:

07 March 2018, 04:46:24

On Tue, Mar 06, 2018 at 09:37:34PM +0100, Dmitry Dolgov wrote:
> Thank you for waiting. Yes, it also looks good for me, but I'm wondering about
> one thing - does it make sense to think about other error codes here, not only
> about `EACCESS`? E.g. if a file was removed during the process (so, it should
> be `ENOENT`), or something more exotic happened, like there are too many
> symbolic links were encountered in resolving a pathname (`ELOOP`)?

The presence of the file is ensured in the pre-phase which builds the
file map (see process_source_file), and actions are taken depending on
the presence of a file on the source and the target.  So a file missing
on the target after those pre-checks have ensured that it was actually
existing should be reported with ENOENT.  ELOOP would as well be faced
on the backend before seeing it in pg_rewind, no?  In short, it seems to
me that it is better to keep the code simple.
--
Michael

Attachment

signature.asc

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Dmitry Dolgov

Date:

07 March 2018, 13:49:04

> On 7 March 2018 at 02:46, Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Mar 06, 2018 at 09:37:34PM +0100, Dmitry Dolgov wrote:
>> Thank you for waiting. Yes, it also looks good for me, but I'm wondering about
>> one thing - does it make sense to think about other error codes here, not only
>> about `EACCESS`? E.g. if a file was removed during the process (so, it should
>> be `ENOENT`), or something more exotic happened, like there are too many
>> symbolic links were encountered in resolving a pathname (`ELOOP`)?
>
> The presence of the file is ensured in the pre-phase which builds the
> file map (see process_source_file), and actions are taken depending on
> the presence of a file on the source and the target.  So a file missing
> on the target after those pre-checks have ensured that it was actually
> existing should be reported with ENOENT.  ELOOP would as well be faced
> on the backend before seeing it in pg_rewind, no?  In short, it seems to
> me that it is better to keep the code simple.

Ok, I agree. Then yes, this patch can be probably marked as ready.

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Tom Lane

Date:

03 April 2018, 02:27:55

Dmitry Dolgov <9erthalion6@gmail.com> writes:
> Ok, I agree. Then yes, this patch can be probably marked as ready.

I started to look over this patch, and soon decided that the
commentary in pg_rewind is impossibly bad :-(. You need to study
libpq_executeFileMap for a long time before you are even sure which side
of the transfer it's executing on; the fact that it's doing a copy *TO*
the remote server is incredibly misleading, and the comments are not
anywhere near good enough to de-confuse somebody coming upon the code for
the first time. Not to mention that the function's name conveys nothing
whatever. Maybe it's not the job of this patch to fix that, but I can't
help thinking that poor design and documentation are a big part of why
this bug exists in the first place.

But enough of that rant. Now that I've more or less wrapped my head
around what's happening, I think this is the core of the problem:
libpq_executeFileMap truncates every non-data file in the local data
directory before it's fetched anything at all from the remote server.
This seems to me to be utterly cavalier and brittle, because ANY FAILURE
AT ALL in the fetch sequence leaves you with a data directory that's been
trashed in whole or in part --- and that sequence could span a good long
time, if there's a lot to copy. The originally complained-of problem
(that pg_control gets nuked) is just the tip of that iceberg, and I'm
afraid that the proposed patch is only removing a single potential
failure mode.

I think that a saner design would involve not truncating any particular
target file until we first get some data to write to it. Christian's
original patch shown at
https://www.postgresql.org/message-id/CAB7nPqTyq3cjphJmoCPc0n8r90O4j1BpZC2BMaj8dKZ7g0-mtw%40mail.gmail.com
seems to be headed in that direction, in that it forces ordering of the
data fetch (which is likely a good idea anyway to ensure locality of
reference) and then performs the truncation when we get the first chunk
of data for a file.

Michael's proposal to have separate error handling for data and config
files is not bad in itself, and it does improve the behavior for one
foreseeable trouble case. But I think we need the other thing as well,
or some variant of it, before we can regard pg_rewind as anything except
a chainsaw with the safety catch missing.

regards, tom lane

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Michael Paquier

Date:

03 April 2018, 05:41:10

On Mon, Apr 02, 2018 at 07:27:55PM -0400, Tom Lane wrote:
> I think that a saner design would involve not truncating any particular
> target file until we first get some data to write to it.  Christian's
> original patch shown at
> https://www.postgresql.org/message-id/CAB7nPqTyq3cjphJmoCPc0n8r90O4j1BpZC2BMaj8dKZ7g0-mtw%40mail.gmail.com
> seems to be headed in that direction, in that it forces ordering of the
> data fetch (which is likely a good idea anyway to ensure locality of
> reference) and then performs the truncation when we get the first chunk
> of data for a file.

Please note that Christopher's patch is awkwardly making the difference
between a configuration file and a relation file by abusing of chunkoff.
It also blows up on the test t/006_readonly.pl included in some of my
previous patches.

> Michael's proposal to have separate error handling for data and config
> files is not bad in itself, and it does improve the behavior for one
> foreseeable trouble case.  But I think we need the other thing as well,
> or some variant of it, before we can regard pg_rewind as anything except
> a chainsaw with the safety catch missing.

I think that I agree with your proposal to delay the file truncation up
to the moment where the first chunk for a file is received, as well as I
agree to the fact that enforcing the ordering of the files would give a
more efficient work base.  However, it seems to me that both concepts
should be included to get a complete patch.  In short, a complete patch
could do in its most complex shape:
1) Ordering of the data fetched by libpq.
2) Truncation of files a first chunk is received only if it is a
configuration file.
3) Discard any data related to files that cannot be opened:
3-1) If the file worked on is a relation file, we have a hard failure.
3-2) If the file worked on is a configuration file, then a warning is
issued and we move on to the next file.  This fixes the case of
read-only files without being user-intrusive.

In the case of 3-2) having a warning instead of a failure is subject to
discussion though.  Perhaps we could live with a simple failure, but
that's rather user-unfriendly, still we could document the limitation
and tell users to not put read-only files on the target which is also
present on the source.

At the same time pg_rewind considers in my patch things like FSM and VM
files as configuration files, which you want to truncate out of the way
and fully copy, but also want to fail hard if they cannot be written to!

So at the end a failure for everything may make the most sense, if we
document that people don't put read-only files in the target's data
folder which can cause failures.

Note also that if we use isRelDataFile() in receiveFileChunks() it is
possible to decide if a file needs to be truncated when receiving its
first chunk, so it is possible to drop completely the concept of file
type as well.  This would make the patch even more simple, and there is
no need to rely on other weird tweaks like chunkoff or such.

The ordering of the items when doing a local copy is not mandatory as
files are processed one by one, still I think that it could be more
solid to reuse isRelDataFile() in rewind_copy_file_range() and drop the
"trunc" argument.

Thoughts?
--
Michael

Attachment

signature.asc

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

TipTop Labs

Date:

03 April 2018, 23:13:05

On Apr 3, 2018, at 4:41 AM, Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Apr 02, 2018 at 07:27:55PM -0400, Tom Lane wrote:
I think that a saner design would involve not truncating any particular
target file until we first get some data to write to it. Christian's
original patch shown at
https://www.postgresql.org/message-id/CAB7nPqTyq3cjphJmoCPc0n8r90O4j1BpZC2BMaj8dKZ7g0-mtw%40mail.gmail.com
seems to be headed in that direction, in that it forces ordering of the
data fetch (which is likely a good idea anyway to ensure locality of
reference) and then performs the truncation when we get the first chunk
of data for a file.

Please note that Christopher's patch is awkwardly making the difference
between a configuration file and a relation file by abusing of chunkoff.
It also blows up on the test t/006_readonly.pl included in some of my
previous patches.

IMO this is a better characterization of my original patch (from https://github.com/tiptop-labs/postgres-patches/blob/master/bug_14999/README):

bug
===

libpq_fetch.c loops twice over files in pgdata, a first time in
libpq_executeFileMap(), and then a second time (for files with action
FILE_ACTION_COPY or FILE_ACTION_COPY_TAIL) in receiveFileChunks().
For FILE_ACTION_COPY, it deletes (truncates) the file from the target directory
already during the first loop. If pg_rewind then encounters a read-only file
(e.g. server.key) still during the first loop, it exits with pg_fatal
("could not open target file"). After such truncation of global/pg_control
pg_rewind cannot run again ("unexpected control file size 0, , expected 8192")
and a restore from pg_basebackup is needed.

patch
=====

Truncation of the file was moved to the second loop. Truncation occurs there if
chunks are written into files at offset 0. This is the case for
FILE_ACTION_COPY. An additional SQL "ORDER BY path, begin" ensures that these
chunks are processed first.

Michael's proposal to have separate error handling for data and config
files is not bad in itself, and it does improve the behavior for one
foreseeable trouble case. But I think we need the other thing as well,
or some variant of it, before we can regard pg_rewind as anything except
a chainsaw with the safety catch missing.

I think that I agree with your proposal to delay the file truncation up
to the moment where the first chunk for a file is received, as well as I
agree to the fact that enforcing the ordering of the files would give a
more efficient work base. However, it seems to me that both concepts
should be included to get a complete patch. In short, a complete patch
could do in its most complex shape:
1) Ordering of the data fetched by libpq.
2) Truncation of files a first chunk is received only if it is a
configuration file.
3) Discard any data related to files that cannot be opened:
3-1) If the file worked on is a relation file, we have a hard failure.
3-2) If the file worked on is a configuration file, then a warning is
issued and we move on to the next file. This fixes the case of
read-only files without being user-intrusive.

In the case of 3-2) having a warning instead of a failure is subject to
discussion though. Perhaps we could live with a simple failure, but
that's rather user-unfriendly, still we could document the limitation
and tell users to not put read-only files on the target which is also
present on the source.

At the same time pg_rewind considers in my patch things like FSM and VM
files as configuration files, which you want to truncate out of the way
and fully copy, but also want to fail hard if they cannot be written to!

So at the end a failure for everything may make the most sense, if we
document that people don't put read-only files in the target's data
folder which can cause failures.

Note also that if we use isRelDataFile() in receiveFileChunks() it is
possible to decide if a file needs to be truncated when receiving its
first chunk, so it is possible to drop completely the concept of file
type as well. This would make the patch even more simple, and there is
no need to rely on other weird tweaks like chunkoff or such.

The ordering of the items when doing a local copy is not mandatory as
files are processed one by one, still I think that it could be more
solid to reuse isRelDataFile() in rewind_copy_file_range() and drop the
"trunc" argument.

Thoughts?
--
Michael

Regards,

Christian

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Michael Paquier

Date:

04 April 2018, 10:26:25

On Tue, Apr 03, 2018 at 10:13:05PM +0200, TipTop Labs wrote:
> Truncation of the file was moved to the second loop. Truncation occurs
> there if chunks are written into files at offset 0. This is the case
> for FILE_ACTION_COPY. An additional SQL "ORDER BY path, begin" ensures
> that these chunks are processed first.

Yes.  I have spent a large portion of my day looking at that.  And
actually this is wrong as well.  First, please find attached a patch I
have written while testing your changes.  There is nothing much new into
it, except that I added more comments and a test (other tests fail as
well, that does not matter).

First, I have looked at a way to not rely on the tweak with chunkoff by
extending the temporary table used to register the chunks with an extra
column which tracks the type of action which is taken on the file.
Another one I looked at was to pass down the action type taken on the
entry directly to receiveFileChunks().  However both of them have been
grotty.

So after that I falled back to your patch and began testing it, which is
where I noticed that we can *never* give the insurance to recover a data
folder on which an error has happened in the middle of a pg_rewind.  The
reason for that is quite simple: even if the truncation has been moved
down to the moment where the first chunk of a file is received, you may
have already done work on some relation files.  Particularly, some of
them may have been truncated down to a given size without a new range of
blocks fetched from the source.  So the data folder would be in an
inconsistent state if trying to rewind it again.

If the control file from the source has been received and then a
read-only error is found, then in order to perform a subsequent rewind
you need to start and stop the target cluster cleanly, or update
manually the control file and mark it as cleanly stopped.  And this is a
recipe for data corruption as we now start a rewind based on the
guarantee that a target cluster has been able to cleanly stop, with a
shutdown checkpoint issued.  You could always try to start and stop the
cluster cleanly, but if your target instance replays WAL on data blocks
which have been truncated by a previous rewind, you would need to
re-create an instance using a new base backup, which is actually harder
to diagnose.

The patch I sent prevously which makes the difference between relation
files and "configuration" files helps a bit, still it makes me uneasy
because it will not be able to handle correctly failures on files which
are part of a relation but need to be fully copied.  So I think that we
should drop this approach as well for its complexity.

Another thing that could definitely help is to work on a patch which
checks that *all* files are writable before doing any operations for
files which are present on both the source and the target, and make the
rewind nicely fail with the source still intact.  That's quite a bit of
refactoring ahead for little benefit I think in the long run.

What we could simply do is to document the limitation.  Hence, if both
the target and the source have read-only files, then the user needs to
be sure to remove them from the target before doing the rewind, and do
the re-linking after the operation.  If an error is found while
processing the rewind, then a new base backup needs to be taken.

The refactoring I am mentioning two paragraphs above could definitely be
done to ease user's life, but I would treat that as a new feature and
not a bug.  Another approach, way more simple that we could do is to
scan the data folder of the target before doing anything and see if some
files are not writable, and then report this list back to the user.  If
we do that even for files which are on the target but not the source
then that would be a loss for some cases, however I would like to
imagine that when using pg_rewind the configuration file set is
consistent across nodes in a majority of cases, so a user could remove
those non-writable files manually.  This has also the advantage to be
non-intrusive, so a back-patch is definitely possible.

Thoughts?
--
Michael

Attachment

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Tom Lane

Date:

04 April 2018, 21:50:12

Michael Paquier <michael@paquier.xyz> writes:
> So after that I falled back to your patch and began testing it, which is
> where I noticed that we can *never* give the insurance to recover a data
> folder on which an error has happened in the middle of a pg_rewind.  The
> reason for that is quite simple: even if the truncation has been moved
> down to the moment where the first chunk of a file is received, you may
> have already done work on some relation files.  Particularly, some of
> them may have been truncated down to a given size without a new range of
> blocks fetched from the source.  So the data folder would be in an
> inconsistent state if trying to rewind it again.

Yes, we certainly cannot guarantee that failure partway through pg_rewind
leaves a consistent state of the target data directory.  It is likely
worth pointing that out in the documentation.  Whether we can or should
do anything about it is a different question.

When I first started looking at this thread, I wondered if maybe somebody
had had in mind to create an active defense against starting a postmaster
in an inconsistent target cluster, by dint of intentionally truncating
pg_control before the transfer starts and not making it valid again till
the very end.  It's now clear from looking at the code that that's not
what's going on :-(.  But I wonder how hard it would be to make it so,
and whether that'd be worth doing if it's not too hard.

Actually, probably a safer way to attack that would be to remove or
rename the topmost PG_VERSION file, and then put it back afterwards.
That'd be far easier to recover from manually, if need be, than
clobbering pg_control.

In any case, that seems separate from the question of what to do with
read-only files in the data directory.  Should we push forward with
committing Michael's previous patch, and leave that issue for later?

            regards, tom lane

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Michael Paquier

Date:

05 April 2018, 04:38:18

On Wed, Apr 04, 2018 at 02:50:12PM -0400, Tom Lane wrote:
> When I first started looking at this thread, I wondered if maybe somebody
> had had in mind to create an active defense against starting a postmaster
> in an inconsistent target cluster, by dint of intentionally truncating
> pg_control before the transfer starts and not making it valid again till
> the very end.  It's now clear from looking at the code that that's not
> what's going on :-(.  But I wonder how hard it would be to make it so,
> and whether that'd be worth doing if it's not too hard.

One safeguard I can see is to add an extra loop just before processing
the file map to check if a file planned for copy can be sanely opened or
not.  That's simple to do with open_target_file(), followed by
close_target_file().  If there is an EACCES error, then we complain at
this early stage, and the data folder can be reused after cleaning up
the data folder from those files.  That's not actually hard to do at
quick glance, but I may be missing something so let's be careful.

Another approach that we could consider, HEAD-only though, is to extend
pg_stat_file so as an entry's st_mode is available and we could use that
to not copy the file's data from the start.  That would be actually
quite clean, particularly for large read-only files.

> Actually, probably a safer way to attack that would be to remove or
> rename the topmost PG_VERSION file, and then put it back afterwards.
> That'd be far easier to recover from manually, if need be, than
> clobbering pg_control.
>
> In any case, that seems separate from the question of what to do with
> read-only files in the data directory.  Should we push forward with
> committing Michael's previous patch, and leave that issue for later?

That one is not good either as long as we don't make the difference in
the data folder between three types of files:
1) Relation files.
2) System files which are critical at diverse degrees for the system.
3) Custom configuration files.

pg_rewind is able to consider 1) as a separate category as it usually
needs to fetch a range set of blocks, and puts 2) and 3) in the same
bucket.  It could be possible to split 2) and 3) but the maintenance
cost is high as for each new system file added in Postgres we would need
to maintain an extra filtering logic in pg_rewind, something which is
most likely going to rot, leading to potential corruption problems.

My point is that I have seen on some VMware test beds a kernel going
rogue because of ESX servers, causing a PostgreSQL-related partition to
go in read-only mode.  So if you use my previous patch, and that the
partition where the target server's data folder is located turns
all-of-a-sudden to read-only, there is a risk of silencing real
failures.  Even if the fetched data is ordered, paths like pg_xact and
pg_twophase are processed after all relation files, so failing to write
them correctly would silently break a set of transactions :(

So please let me suggest a set of two patches:
1) 0001 is a documentation patch which should be back-patched.  This
tells two things:
1-1) If pg_rewind fails in the middle of processing, then the best thing
to do is to re-create the data folder using a new fresh backup.
1-2) Be careful of files that the process running pg_rewind is not able
to write to.  If those are located on both the target and the source,
then pg_rewind will copy it from the source to the target, leading to an
immediate failure.  After removing those unwritable files, be sure to
clean up anything which has been copied from the source and should be
read-only, then rebuild the links.
2) 0002 is a patch for HEAD to order the set of chunks fetched, because
this makes the data writes sequentials which is good for performance and
limits the number of open()/close() on the files of the target's data
folder.

I think that this is the best answer we can give now as there are a
couple of ways to improving the handling of any checks, however there
are side effects to any of them.  This also feels a bit like a new
feature, while the documentation in pg_rewind sucks now for such
details.  I have also spent some time detailing the commit message of
the first patch about all that, that would be nice to keep in the git
history.

If the extra checks I am mentioning at the top of this email are worth
adding, then it is possible to drop 1-2) partially, rewriting it to
mention that pg_rewind tells the user at early stages about the failure
and that a data folder can be again reused.  This does not change the
fact that a pg_rewind breaking mid-flight should result in a new base
backup, so documentation is mandatory anyway.  And there are a couple of
approaches that we could consider here.  At least I saw two of them.
Other folks may see better approaches than I did, who knows..

Thanks,
--
Michael

Attachment

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

TipTop Labs

Date:

05 April 2018, 08:04:40

> On Apr 5, 2018, at 3:38 AM, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Apr 04, 2018 at 02:50:12PM -0400, Tom Lane wrote:
>> When I first started looking at this thread, I wondered if maybe somebody
>> had had in mind to create an active defense against starting a postmaster
>> in an inconsistent target cluster, by dint of intentionally truncating
>> pg_control before the transfer starts and not making it valid again till
>> the very end.  It's now clear from looking at the code that that's not
>> what's going on :-(.  But I wonder how hard it would be to make it so,
>> and whether that'd be worth doing if it's not too hard.
>
> One safeguard I can see is to add an extra loop just before processing
> the file map to check if a file planned for copy can be sanely opened or
> not.  That's simple to do with open_target_file(), followed by
> close_target_file().  If there is an EACCES error, then we complain at
> this early stage, and the data folder can be reused after cleaning up
> the data folder from those files.  That's not actually hard to do at
> quick glance, but I may be missing something so let's be careful.
>
> Another approach that we could consider, HEAD-only though, is to extend
> pg_stat_file so as an entry's st_mode is available and we could use that
> to not copy the file's data from the start.  That would be actually
> quite clean, particularly for large read-only files.
>
>> Actually, probably a safer way to attack that would be to remove or
>> rename the topmost PG_VERSION file, and then put it back afterwards.
>> That'd be far easier to recover from manually, if need be, than
>> clobbering pg_control.
>>
>> In any case, that seems separate from the question of what to do with
>> read-only files in the data directory.  Should we push forward with
>> committing Michael's previous patch, and leave that issue for later?
>
> That one is not good either as long as we don't make the difference in
> the data folder between three types of files:
> 1) Relation files.
> 2) System files which are critical at diverse degrees for the system.
> 3) Custom configuration files.
>
> pg_rewind is able to consider 1) as a separate category as it usually
> needs to fetch a range set of blocks, and puts 2) and 3) in the same
> bucket.  It could be possible to split 2) and 3) but the maintenance
> cost is high as for each new system file added in Postgres we would need
> to maintain an extra filtering logic in pg_rewind, something which is
> most likely going to rot, leading to potential corruption problems.
>
> My point is that I have seen on some VMware test beds a kernel going
> rogue because of ESX servers, causing a PostgreSQL-related partition to
> go in read-only mode.  So if you use my previous patch, and that the
> partition where the target server's data folder is located turns
> all-of-a-sudden to read-only, there is a risk of silencing real
> failures.  Even if the fetched data is ordered, paths like pg_xact and
> pg_twophase are processed after all relation files, so failing to write
> them correctly would silently break a set of transactions :(
>
> So please let me suggest a set of two patches:
> 1) 0001 is a documentation patch which should be back-patched.  This
> tells two things:
> 1-1) If pg_rewind fails in the middle of processing, then the best thing
> to do is to re-create the data folder using a new fresh backup.
> 1-2) Be careful of files that the process running pg_rewind is not able
> to write to.  If those are located on both the target and the source,
> then pg_rewind will copy it from the source to the target, leading to an
> immediate failure.  After removing those unwritable files, be sure to
> clean up anything which has been copied from the source and should be
> read-only, then rebuild the links.
> 2) 0002 is a patch for HEAD to order the set of chunks fetched, because
> this makes the data writes sequentials which is good for performance and
> limits the number of open()/close() on the files of the target's data
> folder.

Thanks for crediting me in patch 0002.

One final word about my original patch, since the commit message for 0001 refers to its limitations: I acknowledge that
itcannot cover situations where the order of processing files is such that the second loop detects non-writable files
onlyafter it had already done work on some others. It was meant to fix the specific problem I ran into without breaking
theregression tests that ship with the source code ("make check"). The main reason that it is not more elaborate is
thatI did not want to submit anything affecting more than a few lines of code without prior input from the community. 

So from my perspective I welcome both 0001 and 0002.

> I think that this is the best answer we can give now as there are a
> couple of ways to improving the handling of any checks, however there
> are side effects to any of them.  This also feels a bit like a new
> feature, while the documentation in pg_rewind sucks now for such
> details.  I have also spent some time detailing the commit message of
> the first patch about all that, that would be nice to keep in the git
> history.
>
> If the extra checks I am mentioning at the top of this email are worth
> adding, then it is possible to drop 1-2) partially, rewriting it to
> mention that pg_rewind tells the user at early stages about the failure
> and that a data folder can be again reused.  This does not change the
> fact that a pg_rewind breaking mid-flight should result in a new base
> backup, so documentation is mandatory anyway.  And there are a couple of
> approaches that we could consider here.  At least I saw two of them.
> Other folks may see better approaches than I did, who knows..
> Thanks,
> --
> Michael
>
<0001-Add-note-in-pg_rewind-documentation-about-read-only-.patch><0002-Enforce-ordering-of-data-chunks-fetched-in-pg_rewind.patch>

--
Christian

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Michael Paquier

Date:

05 April 2018, 09:58:40

On Thu, Apr 05, 2018 at 07:04:40AM +0200, TipTop Labs wrote:
> One final word about my original patch, since the commit message for
> 0001 refers to its limitations: I acknowledge that it cannot cover
> situations where the order of processing files is such that the second
> loop detects non-writable files only after it had already done work on
> some others. It was meant to fix the specific problem I ran into
> without breaking the regression tests that ship with the source code
> ("make check"). The main reason that it is not more elaborate is that
> I did not want to submit anything affecting more than a few lines of
> code without prior input from the community.

That definitely makes sense to me.  Reviews and discussions can take a
long time to settle so this method looks right to me.  It is better not
spend too much time on a patch if you unsure of its shape, and much
better to ask for input first before diving more into the details.  This
stuff is really complicated and the devil is in the details.

> So from my perspective I welcome both 0001 and 0002.

OK, thanks for confirming!
--
Michael

Attachment

signature.asc

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Tom Lane

Date:

07 April 2018, 18:51:32

Thinking about this some more, I wondered if useful behavior with
read-only config files would be to check whether they have the same
contents on both servers, and give a warning or error if not.

Now, that would only be useful if you suppose that they actually should be
the same on both.  For the particular case of server private keys, maybe
they typically would be different.  So I'm not real sure about this idea,
but wanted to toss it out for discussion.

In any case, after another look at the code, it seems to me that it'd be
pretty easy to get pg_rewind to notice at the start whether a target file
it plans to overwrite is read-only: process_source_file is already doing
an lstat per file, so I think a permissions check there wouldn't add much
overhead.  So at that point we could either bail out without damage, or
decide to skip the file.  We could still lose if permissions change midway
through the run, but that's the sort of situation I think is OK to fail
in.

Also, I'm having second thoughts about the usefulness of adding an ORDER
BY to the fetch query just on (untested) performance grounds.  It looks
to me like the chunks within a file already will be inserted into the
fetchchunks table in order, and I think within-file order is all that's
worth worrying about.  In principle the remote server might read out the
rows in some other order than we stored them, but since fetchchunks is a
temp table and not subject to synchronize_seqscans, I don't think we
really need to worry about that.

Furthermore, the patch as given has got a serious performance gotcha:

     sql =
         "SELECT path, begin,\n"
         "  pg_read_binary_file(path, begin, len, true) AS chunk\n"
-        "FROM fetchchunks\n";
+        "FROM fetchchunks\n"
+        "ORDER BY path, begin\n";

While 9.6 and later will do this in a sane way, older versions will
execute pg_read_binary_file() ahead of the sort step, like so:

# explain verbose SELECT path, begin, pg_read_binary_file(path, begin, len, true) AS chunk FROM fetchchunks ORDER BY
path,begin; 
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Sort  (cost=74639.34..75889.41 rows=500025 width=44)
   Output: path, begin, (pg_read_binary_file(path, begin, (len)::bigint, true))
   Sort Key: fetchchunks.path, fetchchunks.begin
   ->  Seq Scan on pg_temp_2.fetchchunks  (cost=0.00..11925.38 rows=500025 width=44)
         Output: path, begin, pg_read_binary_file(path, begin, (len)::bigint, true)
(5 rows)

with the effect that we're passing the entire contents of the source data
directory (or at least, as much of it as pg_rewind needs to read) through
the sort.  Yipes.  So it'd be safer to do it like this:

# explain verbose SELECT path, begin, pg_read_binary_file(path, begin, len, true) AS chunk FROM (select * from
fetchchunksORDER BY path, begin) ss; 
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Subquery Scan on ss  (cost=72139.22..80889.66 rows=500025 width=44)
   Output: ss.path, ss.begin, pg_read_binary_file(ss.path, ss.begin, (ss.len)::bigint, true)
   ->  Sort  (cost=72139.22..73389.28 rows=500025 width=44)
         Output: fetchchunks.path, fetchchunks.begin, fetchchunks.len
         Sort Key: fetchchunks.path, fetchchunks.begin
         ->  Seq Scan on pg_temp_2.fetchchunks  (cost=0.00..9425.25 rows=500025 width=44)
               Output: fetchchunks.path, fetchchunks.begin, fetchchunks.len
(7 rows)

But the real bottom line here is I don't feel a need to touch the fetch
query at all without some actual performance measurements proving there's
a benefit.

            regards, tom lane

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Michael Paquier

Date:

08 April 2018, 01:22:58

On Sat, Apr 07, 2018 at 11:51:32AM -0400, Tom Lane wrote:
> In any case, after another look at the code, it seems to me that it'd be
> pretty easy to get pg_rewind to notice at the start whether a target file
> it plans to overwrite is read-only: process_source_file is already doing
> an lstat per file, so I think a permissions check there wouldn't add much
> overhead.  So at that point we could either bail out without damage, or
> decide to skip the file.

Yes, that's one of the methods I mentioned in my previous email.  Now,
do we want to fail the run immediately if such a file is found?  Or do
we want to report at once all such files so as the user does not need to
run multiple times pg_rewind.  The latter would be much more useful.

A downside here is that the file from the source would still be fetched
and copied on the target.  So a file which was in read-only would become
basically writable.  Don't think that it is a big deal as the has
superuser rights at this point, but that's worth mentioning.

> We could still lose if permissions change midway
> through the run, but that's the sort of situation I think is OK to fail
> in.

If the file system switches to read-only in the middle of the run, I see
no way to recover from that.  That would be bad luck, but it is way
better to fail hard and let the user that something has gone wrong.  I
still think that we want to document that if pg_rewind fails hard
midflight, then the only thing safe enough moving forward is to use a
new base backup.
--
Michael

Attachment

signature.asc

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Michael Paquier

Date:

09 April 2018, 07:58:26

On Sun, Apr 08, 2018 at 07:22:58AM +0900, Michael Paquier wrote:
> Yes, that's one of the methods I mentioned in my previous email.  Now,
> do we want to fail the run immediately if such a file is found?  Or do
> we want to report at once all such files so as the user does not need to
> run multiple times pg_rewind.

So I have been playing with that part...

> The latter would be much more useful.

While that would be nice, the result is rather ugly and this needs to
generate the same error message in two code paths.  So I have chewed
things in such a way that one failure found makes the whole rewind to
just fail, and allows retries to work.  The attached has a test as well
as documentation.

Thoughts?

> A downside here is that the file from the source would still be fetched
> and copied on the target.  So a file which was in read-only would become
> basically writable.  Don't think that it is a big deal as the has
> superuser rights at this point, but that's worth mentioning.

That's mentioned in the docs of the attached.
--
Michael

Attachment

Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From

Michael Paquier

Date:

19 June 2018, 10:09:56

On Mon, Apr 09, 2018 at 01:58:26PM +0900, Michael Paquier wrote:
> That's mentioned in the docs of the attached.

So, I have read again this thread, and it seems to me that there are
many approaches, but all of them have downsides.  Hence at the end I
would suggest to document the limitation with for example the attached,
and call it a day.

Thoughts?
--
Michael