Thread: BUG #14999: pg_rewind corrupts control file global/pg_control
The following bug has been logged on the website: Bug reference: 14999 Logged by: Christian H. Email address: office@tiptop-labs.com PostgreSQL version: 10.1 Operating system: e.g. Debian Buster Description: I have encountered a bug in PostgreSQL 10.1: when the target directory for pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with "could not open target file" (legitimate) and corrupts the control file global/pg_control to size 0 (bug). From now on, pg_rewind always exits with "unexpected control file size 0, expected 8192" and a restore from pg_basebackup is needed. A patch for branch REL_10_STABLE of repository https://github.com/postgres/postgres, a README, and Dockerfiles for demonstrating both bug and patch are available from https://github.com/tiptop-labs/postgres-patches .
On Fri, Jan 5, 2018 at 5:06 AM, PG Bug reporting form <noreply@postgresql.org> wrote: > I have encountered a bug in PostgreSQL 10.1: when the target directory for > pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with > "could not open target file" (legitimate) and corrupts the control file > global/pg_control to size 0 (bug). From now on, pg_rewind always exits with > "unexpected control file size 0, expected 8192" and a restore from > pg_basebackup is needed. Likely that's reproducible down to 9.5 where pg_rewind has been introduced. I agree that we should do better with failure handling here. Corrupting the control file is not cool. > A patch for branch REL_10_STABLE of repository > https://github.com/postgres/postgres, a README, and Dockerfiles for > demonstrating both bug and patch are available from > https://github.com/tiptop-labs/postgres-patches . You may not know when github.com is gone, which would cause the data you are attaching here to go away. What looks interesting is pg_rewind.patch, which is what you are proposing as a fix, right? I am attaching it here for PG archives. - open_target_file(filename, false); + /* Trunc target file for action FILE_ACTION_COPY. */ + open_target_file(filename, chunkoff == 0); write_target_range(chunk, chunkoff, chunksize); Hm. Let me think more about that as there are quite a few distributions that link to SSL files that cannot be written, and pg_rewind copies all configuration files in full. - "FROM fetchchunks\n"; + "FROM fetchchunks\n" + "ORDER BY path, begin\n"; If this is aimed at improving the performance of pg_rewind by making chunk writes more sequential, it should be a different patch. I would see value in that. If this is aimed to change the order of the files to avoid the write corruption, this is wrong. I would imagine that you could see something similar with the offline mode, haven't tested yet though. -- Michael
Attachment
> On Jan 5, 2018, at 2:26 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > > On Fri, Jan 5, 2018 at 5:06 AM, PG Bug reporting form > <noreply@postgresql.org> wrote: >> I have encountered a bug in PostgreSQL 10.1: when the target directory for >> pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with >> "could not open target file" (legitimate) and corrupts the control file >> global/pg_control to size 0 (bug). From now on, pg_rewind always exits with >> "unexpected control file size 0, expected 8192" and a restore from >> pg_basebackup is needed. > > Likely that's reproducible down to 9.5 where pg_rewind has been > introduced. I agree that we should do better with failure handling > here. Corrupting the control file is not cool. I can already confirm that this also occurs with PostgreSQL 9.6. >> A patch for branch REL_10_STABLE of repository >> https://github.com/postgres/postgres, a README, and Dockerfiles for >> demonstrating both bug and patch are available from >> https://github.com/tiptop-labs/postgres-patches . > > You may not know when github.com is gone, which would cause the data > you are attaching here to go away. What looks interesting is > pg_rewind.patch, which is what you are proposing as a fix, right? I am > attaching it here for PG archives. Yes, it's pg_rewind.patch. There is a bit of rationale in README, feel free to also attach here if/as needed. > > - open_target_file(filename, false); > + /* Trunc target file for action FILE_ACTION_COPY. */ > + open_target_file(filename, chunkoff == 0); > > write_target_range(chunk, chunkoff, chunksize); > Hm. Let me think more about that as there are quite a few > distributions that link to SSL files that cannot be written, and > pg_rewind copies all configuration files in full. Could you please identify a specific distribution with such properties; I'd like to have a look there too. Thanks. > - "FROM fetchchunks\n"; > + "FROM fetchchunks\n" > + "ORDER BY path, begin\n"; > If this is aimed at improving the performance of pg_rewind by making > chunk writes more sequential, it should be a different patch. I would > see value in that. If this is aimed to change the order of the files > to avoid the write corruption, this is wrong. It may be wrong then; from README: Truncation of the file was moved to the second loop. Truncation occurs there if chunksare written into files at offset 0. This is the case for FILE_ACTION_COPY. An additional SQL "ORDER BY path, begin"ensures that these chunks are processed first. > I would imagine that you could see something similar with the offline > mode, haven't tested yet though. > -- > Michael > <pg_rewind.patch> -- Christian
On Fri, Jan 5, 2018 at 12:36 PM, TipTop Labs <office@tiptop-labs.com> wrote: > Could you please identify a specific distribution with such properties; I'd like to have a look there too. Thanks. I recall that Ubuntu and Debian do that. I don't use them but others on this list will likely correct me. The point is that this is authorized. -- Michael
On Fri, Jan 05, 2018 at 04:36:22AM +0100, TipTop Labs wrote: >> On Jan 5, 2018, at 2:26 AM, Michael Paquier <michael.paquier@gmail.com> wrote: >> On Fri, Jan 5, 2018 at 5:06 AM, PG Bug reporting form >> <noreply@postgresql.org> wrote: >>> I have encountered a bug in PostgreSQL 10.1: when the target directory for >>> pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with >>> "could not open target file" (legitimate) and corrupts the control file >>> global/pg_control to size 0 (bug). From now on, pg_rewind always exits with >>> "unexpected control file size 0, expected 8192" and a restore from >>> pg_basebackup is needed. >> >> Likely that's reproducible down to 9.5 where pg_rewind has been >> introduced. I agree that we should do better with failure handling >> here. Corrupting the control file is not cool. > > I can already confirm that this also occurs with PostgreSQL 9.6. As far as I can see, this happens because of the way the 'remote' mode of pg_rewind considers a set of files to truncate or not. Attached is a patch which does things in a more correct way. The key point here is to make the difference between a relation file and a configuration file when building the filemap for copying a file in full. When a configuration file is not readable, then trying to open it should not be a failure. When using a relation file, there should be failures. There are still two things I am not completely happy about: - pg_control is considered as a configuration file with this patch, however we should fail it pg_control is not readable. In practice I guess that this should not happen, and the patch produces a warning message. I think that we should consider add a special handling for things like pg_control, filenode.map. etc. so as you get a hard failure if those are not readable, so they should enter in the category of FILE_ACTION_COPY_DATA. This needs a bit more thoughts in process_source_file(). - open_target_file() resets manually errno. This is necessary as this gets called continuously when processing the same file in remote mode. I tried as well to make open_target_file and close_target_file one-time operations for each file, but the result was even more ugly, so I went up with this solution. I am adding that to the next CF to not forget about it. This approach is back-patchable down to 9.5 in my opinion. I have added as well a TAP test in the patch which is able to reproduce the problem. Thoughts? -- Michael
Attachment
1. I can confirm that your patch is effective also in my Docker-based test setup and with the current REL_10_STABLE code base (i.e. PostgreSQL 10.2).
2. Your patch is more encompassing than the one I had submitted earlier, and it may be the right way to go. It is cleaner but more "complicated" in that it may require enlisting/recognizing all those special files (pg_control, filenode.map, etc). IMO the earlier patch would already/tolerate handle those, because the distinction it makes is not based on whether something is a configuration file, but purely on whether it is writable.
3. Sorry for the late response :)
2. Your patch is more encompassing than the one I had submitted earlier, and it may be the right way to go. It is cleaner but more "complicated" in that it may require enlisting/recognizing all those special files (pg_control, filenode.map, etc). IMO the earlier patch would already/tolerate handle those, because the distinction it makes is not based on whether something is a configuration file, but purely on whether it is writable.
3. Sorry for the late response :)
On Jan 15, 2018, at 8:25 AM, Michael Paquier <michael.paquier@gmail.com> wrote:On Fri, Jan 05, 2018 at 04:36:22AM +0100, TipTop Labs wrote:On Jan 5, 2018, at 2:26 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Fri, Jan 5, 2018 at 5:06 AM, PG Bug reporting form
<noreply@postgresql.org> wrote:I have encountered a bug in PostgreSQL 10.1: when the target directory for
pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with
"could not open target file" (legitimate) and corrupts the control file
global/pg_control to size 0 (bug). From now on, pg_rewind always exits with
"unexpected control file size 0, expected 8192" and a restore from
pg_basebackup is needed.
Likely that's reproducible down to 9.5 where pg_rewind has been
introduced. I agree that we should do better with failure handling
here. Corrupting the control file is not cool.
I can already confirm that this also occurs with PostgreSQL 9.6.
As far as I can see, this happens because of the way the 'remote' mode
of pg_rewind considers a set of files to truncate or not. Attached is a
patch which does things in a more correct way. The key point here is to
make the difference between a relation file and a configuration file
when building the filemap for copying a file in full. When a
configuration file is not readable, then trying to open it should not be
a failure. When using a relation file, there should be failures. There
are still two things I am not completely happy about:
- pg_control is considered as a configuration file with this patch,
however we should fail it pg_control is not readable. In practice I
guess that this should not happen, and the patch produces a warning
message. I think that we should consider add a special handling for
things like pg_control, filenode.map. etc. so as you get a hard failure
if those are not readable, so they should enter in the category of
FILE_ACTION_COPY_DATA. This needs a bit more thoughts in
process_source_file().
- open_target_file() resets manually errno. This is necessary as this
gets called continuously when processing the same file in remote mode. I
tried as well to make open_target_file and close_target_file one-time
operations for each file, but the result was even more ugly, so I went
up with this solution.
I am adding that to the next CF to not forget about it. This approach
is back-patchable down to 9.5 in my opinion. I have added as well a TAP
test in the patch which is able to reproduce the problem.
Thoughts?
--
Michael
<rewind-readonly-fix-v1.patch>
On Mon, Feb 26, 2018 at 05:28:53PM +0100, TipTop Labs wrote: > 1. I can confirm that your patch is effective also in my Docker-based > test setup and with the current REL_10_STABLE code base > (i.e. PostgreSQL 10.2). Thanks for checking. Note that I am still not completely happy with the handling in errno in some newly-added code paths.. > 2. Your patch is more encompassing than the one I had submitted > earlier, and it may be the right way to go. It is cleaner but more > "complicated" in that it may require enlisting/recognizing all those > special files (pg_control, filenode.map, etc). IMO the earlier patch > would already/tolerate handle those, because the distinction it makes > is not based on whether something is a configuration file, but purely > on whether it is writable. You are basically looking for that I think: https://www.postgresql.org/message-id/20180205071022.GA17337%40paquier.xyz You cannot ignore pg_control and filenode.map though as those are critical data so they have to be updated. So if those files are not writable, you actually have more problems than you think as the cluster would not be able to correctly start. -- Michael
Attachment
> On 27 February 2018 at 02:24, Michael Paquier <michael@paquier.xyz> wrote: > > Note that I am still not completely happy with the > handling in errno in some newly-added code paths.. It maybe a stupid question, but why do you need to reset errno here? Is it to avoid its value being carried from previous calls of `open_target_file`? In this case if you put the code with `errno == EACCESS` under the if condition `if (dstfd < 0)`, then as far as I remember you should always get relevant errno. At the same time maybe it's valid to reset `errno` before `open`, like with `getpriority`: For some system calls and library functions (e.g., getpriority(2)), -1 is a valid return on success. In such cases, a successful return can be distinguished from an error return by setting errno to zero before the call, and then, if the call returns a status that indicates that an error may have occurred, checking to see if errno has a nonzero value.
On Fri, Mar 02, 2018 at 04:35:07PM +0100, Dmitry Dolgov wrote: > It maybe a stupid question, but why do you need to reset errno here? Is it to > avoid its value being carried from previous calls of `open_target_file`? In > this case if you put the code with `errno == EACCESS` under the if condition > `if (dstfd < 0)`, then as far as I remember you should always get relevant > errno. At the same time maybe it's valid to reset `errno` before `open`, like > with `getpriority`: [ ... reads and feels stupid ...] Of course all the checks should be where dstno is negative... I have done a second pass on the patch, and attached is a new version. This fixes this handling of errno and addresses some typos. I have also fixed the test case where one of the read-only files was not properly switched to do so. I have also added a commit log message. -- Michael
Attachment
On Mon, Mar 5, 2018 at 4:54 PM, Michael Paquier <michael@paquier.xyz> wrote: > On Fri, Mar 02, 2018 at 04:35:07PM +0100, Dmitry Dolgov wrote: >> It maybe a stupid question, but why do you need to reset errno here? Is it to >> avoid its value being carried from previous calls of `open_target_file`? In >> this case if you put the code with `errno == EACCESS` under the if condition >> `if (dstfd < 0)`, then as far as I remember you should always get relevant >> errno. At the same time maybe it's valid to reset `errno` before `open`, like >> with `getpriority`: > > [ ... reads and feels stupid ...] > > Of course all the checks should be where dstno is negative... > > I have done a second pass on the patch, and attached is a new version. > This fixes this handling of errno and addresses some typos. I have also > fixed the test case where one of the read-only files was not properly > switched to do so. I have also added a commit log message. @@ -24,7 +24,10 @@ typedef enum { FILE_ACTION_CREATE, /* create local directory or symbolic link */ - FILE_ACTION_COPY, /* copy whole file, overwriting if exists */ + FILE_ACTION_COPY_CONFIG, /* copy whole configuration file, overwriting + * if exists */ + FILE_ACTION_COPY_DATA, /* copy whole relation file, overwriting if + * exists */ FILE_ACTION_COPY_TAIL, /* copy tail from 'oldsize' to 'newsize' */ FILE_ACTION_NONE, /* no action (we might still copy modified * blocks based on the parsed WAL) */ Since files other than relation files such as vm, fsm, WAL are categorize as FILE_ACTION_COPY_CONFIG, I think the name would cause misreading. For example, we can use FILE_ACTION_COPY_DATA and FILE_ACTION_COPY_OTHER? As you mentioned before, with this patch we end up ignoring file-open error of database files other than relation files. I guess it's a bit risky. We can enter them to COPY_DATA but I'd defer it to committer. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mon, Mar 05, 2018 at 05:47:06PM +0900, Masahiko Sawada wrote: > Since files other than relation files such as vm, fsm, WAL are > categorize as FILE_ACTION_COPY_CONFIG, I think the name would cause > misreading. For example, we can use FILE_ACTION_COPY_DATA and > FILE_ACTION_COPY_OTHER? I am not stopped on a given name. > As you mentioned before, with this patch we end up ignoring file-open > error of database files other than relation files. I guess it's a bit > risky. We can enter them to COPY_DATA but I'd defer it to committer. COPY_DATA is aimed at being used on files which can be parsed by blocks. You cannot do that with VMs and FSMs. Now you are true that we could complain for non-configuration files which need to be fully-copied, still Postgres issues a fsync on all the files of the data folder when beginning recovery, so you would bump on problems anyway after restarting the rewound instance. -- Michael
Attachment
On Mon, Mar 5, 2018 at 6:01 PM, Michael Paquier <michael@paquier.xyz> wrote: > On Mon, Mar 05, 2018 at 05:47:06PM +0900, Masahiko Sawada wrote: >> Since files other than relation files such as vm, fsm, WAL are >> categorize as FILE_ACTION_COPY_CONFIG, I think the name would cause >> misreading. For example, we can use FILE_ACTION_COPY_DATA and >> FILE_ACTION_COPY_OTHER? > > I am not stopped on a given name. Hmm, when I used pg_rewind --debug with this patch the debug message made me confused because almost database files appears with COPY_CONFIG. Also looking at the source code comment, it says COPY_CONFIG is aimed to configuration files. But these file are not configuration files actually. COPY_DATA makes sense to me, but COPY_CONFIG doesn't. Anyway, other than that the patch looks good to me. I'd like to wait for the Dmitory's review comment before marking this as "Ready for Commiter". Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
> On 6 March 2018 at 02:39, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Mon, Mar 5, 2018 at 6:01 PM, Michael Paquier <michael@paquier.xyz> wrote: >> On Mon, Mar 05, 2018 at 05:47:06PM +0900, Masahiko Sawada wrote: >>> Since files other than relation files such as vm, fsm, WAL are >>> categorize as FILE_ACTION_COPY_CONFIG, I think the name would cause >>> misreading. For example, we can use FILE_ACTION_COPY_DATA and >>> FILE_ACTION_COPY_OTHER? >> >> I am not stopped on a given name. > > Hmm, when I used pg_rewind --debug with this patch the debug message > made me confused because almost database files appears with > COPY_CONFIG. Also looking at the source code comment, it says > COPY_CONFIG is aimed to configuration files. But these file are not > configuration files actually. COPY_DATA makes sense to me, but > COPY_CONFIG doesn't. > > Anyway, other than that the patch looks good to me. I'd like to wait > for the Dmitory's review comment before marking this as "Ready for > Commiter". Thank you for waiting. Yes, it also looks good for me, but I'm wondering about one thing - does it make sense to think about other error codes here, not only about `EACCESS`? E.g. if a file was removed during the process (so, it should be `ENOENT`), or something more exotic happened, like there are too many symbolic links were encountered in resolving a pathname (`ELOOP`)?
On Tue, Mar 06, 2018 at 09:37:34PM +0100, Dmitry Dolgov wrote: > Thank you for waiting. Yes, it also looks good for me, but I'm wondering about > one thing - does it make sense to think about other error codes here, not only > about `EACCESS`? E.g. if a file was removed during the process (so, it should > be `ENOENT`), or something more exotic happened, like there are too many > symbolic links were encountered in resolving a pathname (`ELOOP`)? The presence of the file is ensured in the pre-phase which builds the file map (see process_source_file), and actions are taken depending on the presence of a file on the source and the target. So a file missing on the target after those pre-checks have ensured that it was actually existing should be reported with ENOENT. ELOOP would as well be faced on the backend before seeing it in pg_rewind, no? In short, it seems to me that it is better to keep the code simple. -- Michael
Attachment
> On 7 March 2018 at 02:46, Michael Paquier <michael@paquier.xyz> wrote: > On Tue, Mar 06, 2018 at 09:37:34PM +0100, Dmitry Dolgov wrote: >> Thank you for waiting. Yes, it also looks good for me, but I'm wondering about >> one thing - does it make sense to think about other error codes here, not only >> about `EACCESS`? E.g. if a file was removed during the process (so, it should >> be `ENOENT`), or something more exotic happened, like there are too many >> symbolic links were encountered in resolving a pathname (`ELOOP`)? > > The presence of the file is ensured in the pre-phase which builds the > file map (see process_source_file), and actions are taken depending on > the presence of a file on the source and the target. So a file missing > on the target after those pre-checks have ensured that it was actually > existing should be reported with ENOENT. ELOOP would as well be faced > on the backend before seeing it in pg_rewind, no? In short, it seems to > me that it is better to keep the code simple. Ok, I agree. Then yes, this patch can be probably marked as ready.
Dmitry Dolgov <9erthalion6@gmail.com> writes: > Ok, I agree. Then yes, this patch can be probably marked as ready. I started to look over this patch, and soon decided that the commentary in pg_rewind is impossibly bad :-(. You need to study libpq_executeFileMap for a long time before you are even sure which side of the transfer it's executing on; the fact that it's doing a copy *TO* the remote server is incredibly misleading, and the comments are not anywhere near good enough to de-confuse somebody coming upon the code for the first time. Not to mention that the function's name conveys nothing whatever. Maybe it's not the job of this patch to fix that, but I can't help thinking that poor design and documentation are a big part of why this bug exists in the first place. But enough of that rant. Now that I've more or less wrapped my head around what's happening, I think this is the core of the problem: libpq_executeFileMap truncates every non-data file in the local data directory before it's fetched anything at all from the remote server. This seems to me to be utterly cavalier and brittle, because ANY FAILURE AT ALL in the fetch sequence leaves you with a data directory that's been trashed in whole or in part --- and that sequence could span a good long time, if there's a lot to copy. The originally complained-of problem (that pg_control gets nuked) is just the tip of that iceberg, and I'm afraid that the proposed patch is only removing a single potential failure mode. I think that a saner design would involve not truncating any particular target file until we first get some data to write to it. Christian's original patch shown at https://www.postgresql.org/message-id/CAB7nPqTyq3cjphJmoCPc0n8r90O4j1BpZC2BMaj8dKZ7g0-mtw%40mail.gmail.com seems to be headed in that direction, in that it forces ordering of the data fetch (which is likely a good idea anyway to ensure locality of reference) and then performs the truncation when we get the first chunk of data for a file. Michael's proposal to have separate error handling for data and config files is not bad in itself, and it does improve the behavior for one foreseeable trouble case. But I think we need the other thing as well, or some variant of it, before we can regard pg_rewind as anything except a chainsaw with the safety catch missing. regards, tom lane
On Mon, Apr 02, 2018 at 07:27:55PM -0400, Tom Lane wrote: > I think that a saner design would involve not truncating any particular > target file until we first get some data to write to it. Christian's > original patch shown at > https://www.postgresql.org/message-id/CAB7nPqTyq3cjphJmoCPc0n8r90O4j1BpZC2BMaj8dKZ7g0-mtw%40mail.gmail.com > seems to be headed in that direction, in that it forces ordering of the > data fetch (which is likely a good idea anyway to ensure locality of > reference) and then performs the truncation when we get the first chunk > of data for a file. Please note that Christopher's patch is awkwardly making the difference between a configuration file and a relation file by abusing of chunkoff. It also blows up on the test t/006_readonly.pl included in some of my previous patches. > Michael's proposal to have separate error handling for data and config > files is not bad in itself, and it does improve the behavior for one > foreseeable trouble case. But I think we need the other thing as well, > or some variant of it, before we can regard pg_rewind as anything except > a chainsaw with the safety catch missing. I think that I agree with your proposal to delay the file truncation up to the moment where the first chunk for a file is received, as well as I agree to the fact that enforcing the ordering of the files would give a more efficient work base. However, it seems to me that both concepts should be included to get a complete patch. In short, a complete patch could do in its most complex shape: 1) Ordering of the data fetched by libpq. 2) Truncation of files a first chunk is received only if it is a configuration file. 3) Discard any data related to files that cannot be opened: 3-1) If the file worked on is a relation file, we have a hard failure. 3-2) If the file worked on is a configuration file, then a warning is issued and we move on to the next file. This fixes the case of read-only files without being user-intrusive. In the case of 3-2) having a warning instead of a failure is subject to discussion though. Perhaps we could live with a simple failure, but that's rather user-unfriendly, still we could document the limitation and tell users to not put read-only files on the target which is also present on the source. At the same time pg_rewind considers in my patch things like FSM and VM files as configuration files, which you want to truncate out of the way and fully copy, but also want to fail hard if they cannot be written to! So at the end a failure for everything may make the most sense, if we document that people don't put read-only files in the target's data folder which can cause failures. Note also that if we use isRelDataFile() in receiveFileChunks() it is possible to decide if a file needs to be truncated when receiving its first chunk, so it is possible to drop completely the concept of file type as well. This would make the patch even more simple, and there is no need to rely on other weird tweaks like chunkoff or such. The ordering of the items when doing a local copy is not mandatory as files are processed one by one, still I think that it could be more solid to reuse isRelDataFile() in rewind_copy_file_range() and drop the "trunc" argument. Thoughts? -- Michael
Attachment
On Apr 3, 2018, at 4:41 AM, Michael Paquier <michael@paquier.xyz> wrote:On Mon, Apr 02, 2018 at 07:27:55PM -0400, Tom Lane wrote:I think that a saner design would involve not truncating any particular
target file until we first get some data to write to it. Christian's
original patch shown at
https://www.postgresql.org/message-id/CAB7nPqTyq3cjphJmoCPc0n8r90O4j1BpZC2BMaj8dKZ7g0-mtw%40mail.gmail.com
seems to be headed in that direction, in that it forces ordering of the
data fetch (which is likely a good idea anyway to ensure locality of
reference) and then performs the truncation when we get the first chunk
of data for a file.
Please note that Christopher's patch is awkwardly making the difference
between a configuration file and a relation file by abusing of chunkoff.
It also blows up on the test t/006_readonly.pl included in some of my
previous patches.
bug
===
libpq_fetch.c loops twice over files in pgdata, a first time in
libpq_executeFileMap(), and then a second time (for files with action
FILE_ACTION_COPY or FILE_ACTION_COPY_TAIL) in receiveFileChunks().
For FILE_ACTION_COPY, it deletes (truncates) the file from the target directory
already during the first loop. If pg_rewind then encounters a read-only file
(e.g. server.key) still during the first loop, it exits with pg_fatal
("could not open target file"). After such truncation of global/pg_control
pg_rewind cannot run again ("unexpected control file size 0, , expected 8192")
and a restore from pg_basebackup is needed.
patch
=====
Truncation of the file was moved to the second loop. Truncation occurs there if
chunks are written into files at offset 0. This is the case for
FILE_ACTION_COPY. An additional SQL "ORDER BY path, begin" ensures that these
chunks are processed first.
===
libpq_fetch.c loops twice over files in pgdata, a first time in
libpq_executeFileMap(), and then a second time (for files with action
FILE_ACTION_COPY or FILE_ACTION_COPY_TAIL) in receiveFileChunks().
For FILE_ACTION_COPY, it deletes (truncates) the file from the target directory
already during the first loop. If pg_rewind then encounters a read-only file
(e.g. server.key) still during the first loop, it exits with pg_fatal
("could not open target file"). After such truncation of global/pg_control
pg_rewind cannot run again ("unexpected control file size 0, , expected 8192")
and a restore from pg_basebackup is needed.
patch
=====
Truncation of the file was moved to the second loop. Truncation occurs there if
chunks are written into files at offset 0. This is the case for
FILE_ACTION_COPY. An additional SQL "ORDER BY path, begin" ensures that these
chunks are processed first.
Michael's proposal to have separate error handling for data and config
files is not bad in itself, and it does improve the behavior for one
foreseeable trouble case. But I think we need the other thing as well,
or some variant of it, before we can regard pg_rewind as anything except
a chainsaw with the safety catch missing.
I think that I agree with your proposal to delay the file truncation up
to the moment where the first chunk for a file is received, as well as I
agree to the fact that enforcing the ordering of the files would give a
more efficient work base. However, it seems to me that both concepts
should be included to get a complete patch. In short, a complete patch
could do in its most complex shape:
1) Ordering of the data fetched by libpq.
2) Truncation of files a first chunk is received only if it is a
configuration file.
3) Discard any data related to files that cannot be opened:
3-1) If the file worked on is a relation file, we have a hard failure.
3-2) If the file worked on is a configuration file, then a warning is
issued and we move on to the next file. This fixes the case of
read-only files without being user-intrusive.
In the case of 3-2) having a warning instead of a failure is subject to
discussion though. Perhaps we could live with a simple failure, but
that's rather user-unfriendly, still we could document the limitation
and tell users to not put read-only files on the target which is also
present on the source.
At the same time pg_rewind considers in my patch things like FSM and VM
files as configuration files, which you want to truncate out of the way
and fully copy, but also want to fail hard if they cannot be written to!
So at the end a failure for everything may make the most sense, if we
document that people don't put read-only files in the target's data
folder which can cause failures.
Note also that if we use isRelDataFile() in receiveFileChunks() it is
possible to decide if a file needs to be truncated when receiving its
first chunk, so it is possible to drop completely the concept of file
type as well. This would make the patch even more simple, and there is
no need to rely on other weird tweaks like chunkoff or such.
The ordering of the items when doing a local copy is not mandatory as
files are processed one by one, still I think that it could be more
solid to reuse isRelDataFile() in rewind_copy_file_range() and drop the
"trunc" argument.
Thoughts?
--
Michael
Regards,
Christian
On Tue, Apr 03, 2018 at 10:13:05PM +0200, TipTop Labs wrote: > Truncation of the file was moved to the second loop. Truncation occurs > there if chunks are written into files at offset 0. This is the case > for FILE_ACTION_COPY. An additional SQL "ORDER BY path, begin" ensures > that these chunks are processed first. Yes. I have spent a large portion of my day looking at that. And actually this is wrong as well. First, please find attached a patch I have written while testing your changes. There is nothing much new into it, except that I added more comments and a test (other tests fail as well, that does not matter). First, I have looked at a way to not rely on the tweak with chunkoff by extending the temporary table used to register the chunks with an extra column which tracks the type of action which is taken on the file. Another one I looked at was to pass down the action type taken on the entry directly to receiveFileChunks(). However both of them have been grotty. So after that I falled back to your patch and began testing it, which is where I noticed that we can *never* give the insurance to recover a data folder on which an error has happened in the middle of a pg_rewind. The reason for that is quite simple: even if the truncation has been moved down to the moment where the first chunk of a file is received, you may have already done work on some relation files. Particularly, some of them may have been truncated down to a given size without a new range of blocks fetched from the source. So the data folder would be in an inconsistent state if trying to rewind it again. If the control file from the source has been received and then a read-only error is found, then in order to perform a subsequent rewind you need to start and stop the target cluster cleanly, or update manually the control file and mark it as cleanly stopped. And this is a recipe for data corruption as we now start a rewind based on the guarantee that a target cluster has been able to cleanly stop, with a shutdown checkpoint issued. You could always try to start and stop the cluster cleanly, but if your target instance replays WAL on data blocks which have been truncated by a previous rewind, you would need to re-create an instance using a new base backup, which is actually harder to diagnose. The patch I sent prevously which makes the difference between relation files and "configuration" files helps a bit, still it makes me uneasy because it will not be able to handle correctly failures on files which are part of a relation but need to be fully copied. So I think that we should drop this approach as well for its complexity. Another thing that could definitely help is to work on a patch which checks that *all* files are writable before doing any operations for files which are present on both the source and the target, and make the rewind nicely fail with the source still intact. That's quite a bit of refactoring ahead for little benefit I think in the long run. What we could simply do is to document the limitation. Hence, if both the target and the source have read-only files, then the user needs to be sure to remove them from the target before doing the rewind, and do the re-linking after the operation. If an error is found while processing the rewind, then a new base backup needs to be taken. The refactoring I am mentioning two paragraphs above could definitely be done to ease user's life, but I would treat that as a new feature and not a bug. Another approach, way more simple that we could do is to scan the data folder of the target before doing anything and see if some files are not writable, and then report this list back to the user. If we do that even for files which are on the target but not the source then that would be a loss for some cases, however I would like to imagine that when using pg_rewind the configuration file set is consistent across nodes in a majority of cases, so a user could remove those non-writable files manually. This has also the advantage to be non-intrusive, so a back-patch is definitely possible. Thoughts? -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes: > So after that I falled back to your patch and began testing it, which is > where I noticed that we can *never* give the insurance to recover a data > folder on which an error has happened in the middle of a pg_rewind. The > reason for that is quite simple: even if the truncation has been moved > down to the moment where the first chunk of a file is received, you may > have already done work on some relation files. Particularly, some of > them may have been truncated down to a given size without a new range of > blocks fetched from the source. So the data folder would be in an > inconsistent state if trying to rewind it again. Yes, we certainly cannot guarantee that failure partway through pg_rewind leaves a consistent state of the target data directory. It is likely worth pointing that out in the documentation. Whether we can or should do anything about it is a different question. When I first started looking at this thread, I wondered if maybe somebody had had in mind to create an active defense against starting a postmaster in an inconsistent target cluster, by dint of intentionally truncating pg_control before the transfer starts and not making it valid again till the very end. It's now clear from looking at the code that that's not what's going on :-(. But I wonder how hard it would be to make it so, and whether that'd be worth doing if it's not too hard. Actually, probably a safer way to attack that would be to remove or rename the topmost PG_VERSION file, and then put it back afterwards. That'd be far easier to recover from manually, if need be, than clobbering pg_control. In any case, that seems separate from the question of what to do with read-only files in the data directory. Should we push forward with committing Michael's previous patch, and leave that issue for later? regards, tom lane
On Wed, Apr 04, 2018 at 02:50:12PM -0400, Tom Lane wrote: > When I first started looking at this thread, I wondered if maybe somebody > had had in mind to create an active defense against starting a postmaster > in an inconsistent target cluster, by dint of intentionally truncating > pg_control before the transfer starts and not making it valid again till > the very end. It's now clear from looking at the code that that's not > what's going on :-(. But I wonder how hard it would be to make it so, > and whether that'd be worth doing if it's not too hard. One safeguard I can see is to add an extra loop just before processing the file map to check if a file planned for copy can be sanely opened or not. That's simple to do with open_target_file(), followed by close_target_file(). If there is an EACCES error, then we complain at this early stage, and the data folder can be reused after cleaning up the data folder from those files. That's not actually hard to do at quick glance, but I may be missing something so let's be careful. Another approach that we could consider, HEAD-only though, is to extend pg_stat_file so as an entry's st_mode is available and we could use that to not copy the file's data from the start. That would be actually quite clean, particularly for large read-only files. > Actually, probably a safer way to attack that would be to remove or > rename the topmost PG_VERSION file, and then put it back afterwards. > That'd be far easier to recover from manually, if need be, than > clobbering pg_control. > > In any case, that seems separate from the question of what to do with > read-only files in the data directory. Should we push forward with > committing Michael's previous patch, and leave that issue for later? That one is not good either as long as we don't make the difference in the data folder between three types of files: 1) Relation files. 2) System files which are critical at diverse degrees for the system. 3) Custom configuration files. pg_rewind is able to consider 1) as a separate category as it usually needs to fetch a range set of blocks, and puts 2) and 3) in the same bucket. It could be possible to split 2) and 3) but the maintenance cost is high as for each new system file added in Postgres we would need to maintain an extra filtering logic in pg_rewind, something which is most likely going to rot, leading to potential corruption problems. My point is that I have seen on some VMware test beds a kernel going rogue because of ESX servers, causing a PostgreSQL-related partition to go in read-only mode. So if you use my previous patch, and that the partition where the target server's data folder is located turns all-of-a-sudden to read-only, there is a risk of silencing real failures. Even if the fetched data is ordered, paths like pg_xact and pg_twophase are processed after all relation files, so failing to write them correctly would silently break a set of transactions :( So please let me suggest a set of two patches: 1) 0001 is a documentation patch which should be back-patched. This tells two things: 1-1) If pg_rewind fails in the middle of processing, then the best thing to do is to re-create the data folder using a new fresh backup. 1-2) Be careful of files that the process running pg_rewind is not able to write to. If those are located on both the target and the source, then pg_rewind will copy it from the source to the target, leading to an immediate failure. After removing those unwritable files, be sure to clean up anything which has been copied from the source and should be read-only, then rebuild the links. 2) 0002 is a patch for HEAD to order the set of chunks fetched, because this makes the data writes sequentials which is good for performance and limits the number of open()/close() on the files of the target's data folder. I think that this is the best answer we can give now as there are a couple of ways to improving the handling of any checks, however there are side effects to any of them. This also feels a bit like a new feature, while the documentation in pg_rewind sucks now for such details. I have also spent some time detailing the commit message of the first patch about all that, that would be nice to keep in the git history. If the extra checks I am mentioning at the top of this email are worth adding, then it is possible to drop 1-2) partially, rewriting it to mention that pg_rewind tells the user at early stages about the failure and that a data folder can be again reused. This does not change the fact that a pg_rewind breaking mid-flight should result in a new base backup, so documentation is mandatory anyway. And there are a couple of approaches that we could consider here. At least I saw two of them. Other folks may see better approaches than I did, who knows.. Thanks, -- Michael
Attachment
> On Apr 5, 2018, at 3:38 AM, Michael Paquier <michael@paquier.xyz> wrote: > > On Wed, Apr 04, 2018 at 02:50:12PM -0400, Tom Lane wrote: >> When I first started looking at this thread, I wondered if maybe somebody >> had had in mind to create an active defense against starting a postmaster >> in an inconsistent target cluster, by dint of intentionally truncating >> pg_control before the transfer starts and not making it valid again till >> the very end. It's now clear from looking at the code that that's not >> what's going on :-(. But I wonder how hard it would be to make it so, >> and whether that'd be worth doing if it's not too hard. > > One safeguard I can see is to add an extra loop just before processing > the file map to check if a file planned for copy can be sanely opened or > not. That's simple to do with open_target_file(), followed by > close_target_file(). If there is an EACCES error, then we complain at > this early stage, and the data folder can be reused after cleaning up > the data folder from those files. That's not actually hard to do at > quick glance, but I may be missing something so let's be careful. > > Another approach that we could consider, HEAD-only though, is to extend > pg_stat_file so as an entry's st_mode is available and we could use that > to not copy the file's data from the start. That would be actually > quite clean, particularly for large read-only files. > >> Actually, probably a safer way to attack that would be to remove or >> rename the topmost PG_VERSION file, and then put it back afterwards. >> That'd be far easier to recover from manually, if need be, than >> clobbering pg_control. >> >> In any case, that seems separate from the question of what to do with >> read-only files in the data directory. Should we push forward with >> committing Michael's previous patch, and leave that issue for later? > > That one is not good either as long as we don't make the difference in > the data folder between three types of files: > 1) Relation files. > 2) System files which are critical at diverse degrees for the system. > 3) Custom configuration files. > > pg_rewind is able to consider 1) as a separate category as it usually > needs to fetch a range set of blocks, and puts 2) and 3) in the same > bucket. It could be possible to split 2) and 3) but the maintenance > cost is high as for each new system file added in Postgres we would need > to maintain an extra filtering logic in pg_rewind, something which is > most likely going to rot, leading to potential corruption problems. > > My point is that I have seen on some VMware test beds a kernel going > rogue because of ESX servers, causing a PostgreSQL-related partition to > go in read-only mode. So if you use my previous patch, and that the > partition where the target server's data folder is located turns > all-of-a-sudden to read-only, there is a risk of silencing real > failures. Even if the fetched data is ordered, paths like pg_xact and > pg_twophase are processed after all relation files, so failing to write > them correctly would silently break a set of transactions :( > > So please let me suggest a set of two patches: > 1) 0001 is a documentation patch which should be back-patched. This > tells two things: > 1-1) If pg_rewind fails in the middle of processing, then the best thing > to do is to re-create the data folder using a new fresh backup. > 1-2) Be careful of files that the process running pg_rewind is not able > to write to. If those are located on both the target and the source, > then pg_rewind will copy it from the source to the target, leading to an > immediate failure. After removing those unwritable files, be sure to > clean up anything which has been copied from the source and should be > read-only, then rebuild the links. > 2) 0002 is a patch for HEAD to order the set of chunks fetched, because > this makes the data writes sequentials which is good for performance and > limits the number of open()/close() on the files of the target's data > folder. Thanks for crediting me in patch 0002. One final word about my original patch, since the commit message for 0001 refers to its limitations: I acknowledge that itcannot cover situations where the order of processing files is such that the second loop detects non-writable files onlyafter it had already done work on some others. It was meant to fix the specific problem I ran into without breaking theregression tests that ship with the source code ("make check"). The main reason that it is not more elaborate is thatI did not want to submit anything affecting more than a few lines of code without prior input from the community. So from my perspective I welcome both 0001 and 0002. > I think that this is the best answer we can give now as there are a > couple of ways to improving the handling of any checks, however there > are side effects to any of them. This also feels a bit like a new > feature, while the documentation in pg_rewind sucks now for such > details. I have also spent some time detailing the commit message of > the first patch about all that, that would be nice to keep in the git > history. > > If the extra checks I am mentioning at the top of this email are worth > adding, then it is possible to drop 1-2) partially, rewriting it to > mention that pg_rewind tells the user at early stages about the failure > and that a data folder can be again reused. This does not change the > fact that a pg_rewind breaking mid-flight should result in a new base > backup, so documentation is mandatory anyway. And there are a couple of > approaches that we could consider here. At least I saw two of them. > Other folks may see better approaches than I did, who knows.. > Thanks, > -- > Michael > <0001-Add-note-in-pg_rewind-documentation-about-read-only-.patch><0002-Enforce-ordering-of-data-chunks-fetched-in-pg_rewind.patch> -- Christian
On Thu, Apr 05, 2018 at 07:04:40AM +0200, TipTop Labs wrote: > One final word about my original patch, since the commit message for > 0001 refers to its limitations: I acknowledge that it cannot cover > situations where the order of processing files is such that the second > loop detects non-writable files only after it had already done work on > some others. It was meant to fix the specific problem I ran into > without breaking the regression tests that ship with the source code > ("make check"). The main reason that it is not more elaborate is that > I did not want to submit anything affecting more than a few lines of > code without prior input from the community. That definitely makes sense to me. Reviews and discussions can take a long time to settle so this method looks right to me. It is better not spend too much time on a patch if you unsure of its shape, and much better to ask for input first before diving more into the details. This stuff is really complicated and the devil is in the details. > So from my perspective I welcome both 0001 and 0002. OK, thanks for confirming! -- Michael
Attachment
Thinking about this some more, I wondered if useful behavior with read-only config files would be to check whether they have the same contents on both servers, and give a warning or error if not. Now, that would only be useful if you suppose that they actually should be the same on both. For the particular case of server private keys, maybe they typically would be different. So I'm not real sure about this idea, but wanted to toss it out for discussion. In any case, after another look at the code, it seems to me that it'd be pretty easy to get pg_rewind to notice at the start whether a target file it plans to overwrite is read-only: process_source_file is already doing an lstat per file, so I think a permissions check there wouldn't add much overhead. So at that point we could either bail out without damage, or decide to skip the file. We could still lose if permissions change midway through the run, but that's the sort of situation I think is OK to fail in. Also, I'm having second thoughts about the usefulness of adding an ORDER BY to the fetch query just on (untested) performance grounds. It looks to me like the chunks within a file already will be inserted into the fetchchunks table in order, and I think within-file order is all that's worth worrying about. In principle the remote server might read out the rows in some other order than we stored them, but since fetchchunks is a temp table and not subject to synchronize_seqscans, I don't think we really need to worry about that. Furthermore, the patch as given has got a serious performance gotcha: sql = "SELECT path, begin,\n" " pg_read_binary_file(path, begin, len, true) AS chunk\n" - "FROM fetchchunks\n"; + "FROM fetchchunks\n" + "ORDER BY path, begin\n"; While 9.6 and later will do this in a sane way, older versions will execute pg_read_binary_file() ahead of the sort step, like so: # explain verbose SELECT path, begin, pg_read_binary_file(path, begin, len, true) AS chunk FROM fetchchunks ORDER BY path,begin; QUERY PLAN ------------------------------------------------------------------------------------- Sort (cost=74639.34..75889.41 rows=500025 width=44) Output: path, begin, (pg_read_binary_file(path, begin, (len)::bigint, true)) Sort Key: fetchchunks.path, fetchchunks.begin -> Seq Scan on pg_temp_2.fetchchunks (cost=0.00..11925.38 rows=500025 width=44) Output: path, begin, pg_read_binary_file(path, begin, (len)::bigint, true) (5 rows) with the effect that we're passing the entire contents of the source data directory (or at least, as much of it as pg_rewind needs to read) through the sort. Yipes. So it'd be safer to do it like this: # explain verbose SELECT path, begin, pg_read_binary_file(path, begin, len, true) AS chunk FROM (select * from fetchchunksORDER BY path, begin) ss; QUERY PLAN --------------------------------------------------------------------------------------------- Subquery Scan on ss (cost=72139.22..80889.66 rows=500025 width=44) Output: ss.path, ss.begin, pg_read_binary_file(ss.path, ss.begin, (ss.len)::bigint, true) -> Sort (cost=72139.22..73389.28 rows=500025 width=44) Output: fetchchunks.path, fetchchunks.begin, fetchchunks.len Sort Key: fetchchunks.path, fetchchunks.begin -> Seq Scan on pg_temp_2.fetchchunks (cost=0.00..9425.25 rows=500025 width=44) Output: fetchchunks.path, fetchchunks.begin, fetchchunks.len (7 rows) But the real bottom line here is I don't feel a need to touch the fetch query at all without some actual performance measurements proving there's a benefit. regards, tom lane
On Sat, Apr 07, 2018 at 11:51:32AM -0400, Tom Lane wrote: > In any case, after another look at the code, it seems to me that it'd be > pretty easy to get pg_rewind to notice at the start whether a target file > it plans to overwrite is read-only: process_source_file is already doing > an lstat per file, so I think a permissions check there wouldn't add much > overhead. So at that point we could either bail out without damage, or > decide to skip the file. Yes, that's one of the methods I mentioned in my previous email. Now, do we want to fail the run immediately if such a file is found? Or do we want to report at once all such files so as the user does not need to run multiple times pg_rewind. The latter would be much more useful. A downside here is that the file from the source would still be fetched and copied on the target. So a file which was in read-only would become basically writable. Don't think that it is a big deal as the has superuser rights at this point, but that's worth mentioning. > We could still lose if permissions change midway > through the run, but that's the sort of situation I think is OK to fail > in. If the file system switches to read-only in the middle of the run, I see no way to recover from that. That would be bad luck, but it is way better to fail hard and let the user that something has gone wrong. I still think that we want to document that if pg_rewind fails hard midflight, then the only thing safe enough moving forward is to use a new base backup. -- Michael
Attachment
On Sun, Apr 08, 2018 at 07:22:58AM +0900, Michael Paquier wrote: > Yes, that's one of the methods I mentioned in my previous email. Now, > do we want to fail the run immediately if such a file is found? Or do > we want to report at once all such files so as the user does not need to > run multiple times pg_rewind. So I have been playing with that part... > The latter would be much more useful. While that would be nice, the result is rather ugly and this needs to generate the same error message in two code paths. So I have chewed things in such a way that one failure found makes the whole rewind to just fail, and allows retries to work. The attached has a test as well as documentation. Thoughts? > A downside here is that the file from the source would still be fetched > and copied on the target. So a file which was in read-only would become > basically writable. Don't think that it is a big deal as the has > superuser rights at this point, but that's worth mentioning. That's mentioned in the docs of the attached. -- Michael
Attachment
On Mon, Apr 09, 2018 at 01:58:26PM +0900, Michael Paquier wrote: > That's mentioned in the docs of the attached. So, I have read again this thread, and it seems to me that there are many approaches, but all of them have downsides. Hence at the end I would suggest to document the limitation with for example the attached, and call it a day. Thoughts? -- Michael