Thread: pg_combinebackup does not detect missing files
Hackers,

I've been playing around with the incremental backup feature trying to get a sense of how it can be practically used. One of the first things I always try is to delete random files and see what happens.

You can delete pretty much anything you want from the most recent incremental backup (not the manifest) and it will not be detected. For example:

$ pg_basebackup -c fast -D test/backup/full -F plain
$ pg_basebackup -c fast -D test/backup/incr1 -F plain -i /home/dev/test/backup/full/backup_manifest
$ rm test/backup/incr1/base/1/INCREMENTAL.2675
$ rm test/backup/incr1/base/1/826
$ /home/dev/test/pg/bin/pg_combinebackup test/backup/full test/backup/incr1 -o test/backup/combine
$ ls test/backup/combine/base/1/2675
No such file or directory
$ ls test/backup/combine/base/1/826
No such file or directory

I can certainly use verify to check the backups individually:

$ /home/dev/test/pg/bin/pg_verifybackup /home/dev/test/backup/incr1
pg_verifybackup: error: "base/1/INCREMENTAL.2675" is present in the manifest but not on disk
pg_verifybackup: error: "base/1/826" is present in the manifest but not on disk

But I can't detect this issue if I use verify on the combined backup:

$ /home/dev/test/pg/bin/pg_verifybackup /home/dev/test/backup/combine
backup successfully verified

Maybe the answer here is to update the docs to specify that pg_verifybackup should be run on all backup directories before pg_combinebackup is run. Right now that is not at all clear. In fact the docs say, "pg_combinebackup will attempt to verify that the backups you specify form a legal backup chain". Which I guess just means the chain is verified and not the files, but it seems easy to misinterpret.

Overall I think it is an issue that the combine is being driven from the most recent incremental directory rather than from the manifest.

Regards,
-David
On Wed, Apr 10, 2024 at 9:36 PM David Steele <david@pgmasters.net> wrote:
> I've been playing around with the incremental backup feature trying to
> get a sense of how it can be practically used. One of the first things I
> always try is to delete random files and see what happens.
>
> You can delete pretty much anything you want from the most recent
> incremental backup (not the manifest) and it will not be detected.

Sure, but you can also delete anything you want from the most recent non-incremental backup and it will also not be detected. There's no reason at all to hold incremental backup to a higher standard than we do in general.

> Maybe the answer here is to update the docs to specify that
> pg_verifybackup should be run on all backup directories before
> pg_combinebackup is run. Right now that is not at all clear.

I don't want to make those kinds of prescriptive statements. If you want to verify the backups that you use as input to pg_combinebackup, you can use pg_verifybackup to do that, but it's not a requirement. I'm not averse to having some kind of statement in the documentation along the lines of "Note that pg_combinebackup does not attempt to verify that the individual backups are intact; for that, use pg_verifybackup." But I think it should be blindingly obvious to everyone that you can't go whacking around the inputs to a program and expect to get perfectly good output. I know it isn't blindingly obvious to everyone, which is why I'm not averse to adding something like what I just mentioned, and maybe it wouldn't be a bad idea to document in a few other places that you shouldn't randomly remove files from the data directory of your live cluster, either, because people seem to keep doing it, but really, the expectation that you can't just blow files away and expect good things to happen afterward should hardly need to be stated.

I think it's very easy to go overboard with warnings of this type. Weird stuff comes to me all the time because people call me when the weird stuff happens, and I'm guessing that your experience is similar. But my actual personal experience, as opposed to the cases reported to me by others, practically never features files evaporating into the ether. If I read a documentation page for PostgreSQL or any other piece of software that made it sound like that was a normal occurrence, I'd question the technical acumen of the authors. And if I saw such warnings only for one particular feature of a piece of software and not anywhere else, I'd wonder why the authors of the software were trying so hard to scare me off the use of that particular feature. I don't trust at all that incremental backup is free of bugs -- but I don't trust that all the code anyone else has written is free of bugs, either.

> Overall I think it is an issue that the combine is being driven from the
> most recent incremental directory rather than from the manifest.

I don't. I considered that design and rejected it for various reasons that I still believe to be good. Even if I was wrong, we're not going to start rewriting the implementation a week after feature freeze.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 4/16/24 23:50, Robert Haas wrote:
> On Wed, Apr 10, 2024 at 9:36 PM David Steele <david@pgmasters.net> wrote:
>> I've been playing around with the incremental backup feature trying to
>> get a sense of how it can be practically used. One of the first things I
>> always try is to delete random files and see what happens.
>>
>> You can delete pretty much anything you want from the most recent
>> incremental backup (not the manifest) and it will not be detected.
>
> Sure, but you can also delete anything you want from the most recent
> non-incremental backup and it will also not be detected. There's no
> reason at all to hold incremental backup to a higher standard than we
> do in general.

Except that we are running pg_combinebackup on the incremental, which the user might reasonably expect to check backup integrity. It actually does a bunch of integrity checks -- but not this one.

>> Maybe the answer here is to update the docs to specify that
>> pg_verifybackup should be run on all backup directories before
>> pg_combinebackup is run. Right now that is not at all clear.
>
> I don't want to make those kinds of prescriptive statements. If you
> want to verify the backups that you use as input to pg_combinebackup,
> you can use pg_verifybackup to do that, but it's not a requirement.
> I'm not averse to having some kind of statement in the documentation
> along the lines of "Note that pg_combinebackup does not attempt to
> verify that the individual backups are intact; for that, use
> pg_verifybackup."

I think we should do this at a minimum.

> But I think it should be blindingly obvious to
> everyone that you can't go whacking around the inputs to a program and
> expect to get perfectly good output. I know it isn't blindingly
> obvious to everyone, which is why I'm not averse to adding something
> like what I just mentioned, and maybe it wouldn't be a bad idea to
> document in a few other places that you shouldn't randomly remove
> files from the data directory of your live cluster, either, because
> people seem to keep doing it, but really, the expectation that you
> can't just blow files away and expect good things to happen afterward
> should hardly need to be stated.

And yet, we see it all the time.

> I think it's very easy to go overboard with warnings of this type.
> Weird stuff comes to me all the time because people call me when the
> weird stuff happens, and I'm guessing that your experience is similar.
> But my actual personal experience, as opposed to the cases reported to
> me by others, practically never features files evaporating into the
> ether.

Same -- if it happens at all it is very rare. Virtually every time I am able to track down the cause of missing files it is because the user deleted them, usually to "save space" or because they "did not seem important".

But given that this occurrence is pretty common in my experience, I think it is smart to mitigate against it, rather than just take it on faith that the user hasn't done anything destructive. Especially given how pg_combinebackup works, backups are going to undergo a lot of user manipulation (pushing to and pulling from storage, decompressing, untarring, etc.) and I think that means we should take extra care.

Regards,
-David
On Tue, Apr 16, 2024 at 7:25 PM David Steele <david@pgmasters.net> wrote:
> Except that we are running pg_combinebackup on the incremental, which
> the user might reasonably expect to check backup integrity. It actually
> does a bunch of integrity checks -- but not this one.

I think it's a bad idea to duplicate all of the checks from pg_verifybackup into pg_combinebackup. I did consider this exact issue during development. These are separate tools with separate purposes. I think that what a user should expect is that each tool has some job and tries to do that job well, while leaving other jobs to other tools. And I think if you think about it that way, it explains a lot about which checks pg_combinebackup does and which checks it does not do.

pg_combinebackup tries to check that it is valid to combine backup A with backup B (and maybe C, D, E ...), and it checks a lot of stuff to try to make sure that such an operation appears to be sensible. Those are checks that pg_verifybackup cannot do, because pg_verifybackup only looks at one backup in isolation. If pg_combinebackup does not do those checks, nobody does. Aside from that, it will also report errors that it can't avoid noticing, even if those are things that pg_verifybackup would also have found, such as, say, the control file not existing. But it will not go out of its way to perform checks that are unrelated to its documented purpose. And I don't think it should, especially if we have another tool that already does that.

> > I'm not averse to having some kind of statement in the documentation
> > along the lines of "Note that pg_combinebackup does not attempt to
> > verify that the individual backups are intact; for that, use
> > pg_verifybackup."
>
> I think we should do this at a minimum.

Here is a patch to do that.

> Especially given how pg_combinebackup works, backups are going to
> undergo a lot of user manipulation (pushing to and pulling from storage,
> decompressing, untarring, etc.) and I think that means we should take
> extra care.

We are in agreement on that point, if nothing else. I am terrified of users having problems with pg_combinebackup and me not being able to tell whether those problems are due to user error, Robert error, or something else. I put a lot of effort into detecting dumb things that I thought a user might do, and a lot of effort into debugging output so that when things do go wrong anyway, we have a reasonable chance of figuring out exactly where they went wrong. We do seem to have a philosophical difference about what the scope of those checks ought to be, and I don't really expect what I wrote above to convince you that my position is correct, but perhaps it will convince you that I have a thoughtful position, as opposed to just having done stuff at random.

--
Robert Haas
EDB: http://www.enterprisedb.com
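P.S. For readers following along, the "valid to combine" checks are roughly of this shape. This is only a loose Python illustration with simplified, made-up field names -- the real checks are in C and work from backup_label and pg_control, and differ in detail:

    # Loose sketch of backup-chain validation. Each dict stands in for
    # metadata the real tool reads from backup_label / pg_control.
    def check_chain(backups):  # backups ordered oldest -> newest
        prev = None
        for b in backups:
            if prev is None:
                if b["is_incremental"]:
                    raise SystemExit("the oldest backup must be a full backup")
            else:
                if not b["is_incremental"]:
                    raise SystemExit("only the oldest backup may be a full backup")
                if b["system_identifier"] != prev["system_identifier"]:
                    raise SystemExit("backups come from different clusters")
                if b["incremental_from_lsn"] != prev["start_lsn"]:
                    raise SystemExit("backup does not chain from its predecessor")
            prev = b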
On 4/18/24 01:03, Robert Haas wrote:
> On Tue, Apr 16, 2024 at 7:25 PM David Steele <david@pgmasters.net> wrote:
>
> But it will not go out of its way to perform checks that are unrelated
> to its documented purpose. And I don't think it should, especially if
> we have another tool that already does that.
>
>>> I'm not averse to having some kind of statement in the documentation
>>> along the lines of "Note that pg_combinebackup does not attempt to
>>> verify that the individual backups are intact; for that, use
>>> pg_verifybackup."
>>
>> I think we should do this at a minimum.
>
> Here is a patch to do that.

I think here:

+        <application>pg_basebackup</application> only attempts to verify

you mean:

+        <application>pg_combinebackup</application> only attempts to verify

Otherwise this looks good to me.

>> Especially given how pg_combinebackup works, backups are going to
>> undergo a lot of user manipulation (pushing to and pulling from storage,
>> decompressing, untarring, etc.) and I think that means we should take
>> extra care.
>
> We are in agreement on that point, if nothing else. I am terrified of
> users having problems with pg_combinebackup and me not being able to
> tell whether those problems are due to user error, Robert error, or
> something else. I put a lot of effort into detecting dumb things that
> I thought a user might do, and a lot of effort into debugging output
> so that when things do go wrong anyway, we have a reasonable chance of
> figuring out exactly where they went wrong. We do seem to have a
> philosophical difference about what the scope of those checks ought to
> be, and I don't really expect what I wrote above to convince you that
> my position is correct, but perhaps it will convince you that I have a
> thoughtful position, as opposed to just having done stuff at random.

Fair enough. I accept that your reasoning is not random, but I'm still not very satisfied that the user needs to run a separate and rather expensive process to do the verification when pg_combinebackup already has the necessary information at hand. My guess is that most users will elect to skip verification.

At least now they'll have the information they need to make an informed choice.

Regards,
-David
On Wed, Apr 17, 2024 at 7:09 PM David Steele <david@pgmasters.net> wrote:
> I think here:
>
> + <application>pg_basebackup</application> only attempts to verify
>
> you mean:
>
> + <application>pg_combinebackup</application> only attempts to verify
>
> Otherwise this looks good to me.

Good catch, thanks. Committed with that change.

> Fair enough. I accept that your reasoning is not random, but I'm still
> not very satisfied that the user needs to run a separate and rather
> expensive process to do the verification when pg_combinebackup already
> has the necessary information at hand. My guess is that most users will
> elect to skip verification.

I think you're probably right that a lot of people will skip it; I'm just less convinced than you are that it's a bad thing. It's not a *great* thing if people skip it, but restore time is actually just about the worst time to find out that you have a problem with your backups. I think users would be better served by verifying stored backups periodically when they *don't* need to restore them. Also, saying that we have all of the information that we need to do the verification is only partially true:

- we do have to parse the manifest anyway, but we don't have to compute checksums anyway, and I think that cost can be significant even for CRC-32C and much more significant for any of the SHA variants

- we don't need to read all of the files in all of the backups. if there's a newer full, the corresponding file in older backups, whether full or incremental, need not be read

- incremental files other than the most recent only need to be read to the extent that we need their data; if some of the same blocks have been changed again, we can economize

How much you save because of these effects is pretty variable. Best case, you have a 2-backup chain with no manifest checksums, and all that verification will have to do that you wouldn't otherwise need to do is walk each older directory tree in toto and cross-check which files exist against the manifest. That's probably cheap enough that nobody would be too fussed. Worst case, you have a 10-backup (or whatever) chain with SHA512 checksums and, say, a 50% turnover rate. In that case, I think having verification happen automatically could be a pretty major hit, both in terms of I/O and CPU. If your database is 1TB, it's ~5.5TB of read I/O (because one 1TB full backup and 9 0.5TB incrementals) instead of ~1TB of read I/O, plus the checksumming.

Now, obviously you can still feel that it's totally worth it, or that someone in that situation shouldn't even be using incremental backups, and it's a value judgement, so fair enough. But my guess is that the efforts that this implementation makes to minimize the amount of I/O required for a restore are going to be important for a lot of people.

> At least now they'll have the information they need to make an informed
> choice.

Right.

--
Robert Haas
EDB: http://www.enterprisedb.com
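P.S. For concreteness, here is the worst-case arithmetic above as a throwaway script. The 1TB database with nine ~50%-turnover incrementals is the hypothetical scenario from this email, not a measurement:

    full_tb = 1.0        # one full backup of a 1TB database
    incr_count = 9       # nine incrementals in the chain
    incr_tb = 0.5        # ~50% turnover => ~0.5TB each

    # Restore reads roughly the newest copy of each block: ~1TB.
    restore_read_tb = full_tb

    # Verifying every backup in the chain reads the full backup plus
    # every incremental in its entirety: ~5.5TB.
    verify_read_tb = full_tb + incr_count * incr_tb

    print(f"restore only: ~{restore_read_tb:.1f}TB read")
    print(f"restore + verify everything: ~{verify_read_tb:.1f}TB read")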
On 4/19/24 00:50, Robert Haas wrote:
> On Wed, Apr 17, 2024 at 7:09 PM David Steele <david@pgmasters.net> wrote:
>
>> Fair enough. I accept that your reasoning is not random, but I'm still
>> not very satisfied that the user needs to run a separate and rather
>> expensive process to do the verification when pg_combinebackup already
>> has the necessary information at hand. My guess is that most users will
>> elect to skip verification.
>
> I think you're probably right that a lot of people will skip it; I'm
> just less convinced than you are that it's a bad thing. It's not a
> *great* thing if people skip it, but restore time is actually just
> about the worst time to find out that you have a problem with your
> backups. I think users would be better served by verifying stored
> backups periodically when they *don't* need to restore them.

Agreed, running verify regularly is a good idea, but in my experience most users are only willing to run verify once they suspect (or know) there is an issue. It's a pretty expensive process depending on how many backups you have and where they are stored.

> Also,
> saying that we have all of the information that we need to do the
> verification is only partially true:
>
> - we do have to parse the manifest anyway, but we don't have to
> compute checksums anyway, and I think that cost can be significant
> even for CRC-32C and much more significant for any of the SHA variants
>
> - we don't need to read all of the files in all of the backups. if
> there's a newer full, the corresponding file in older backups, whether
> full or incremental, need not be read
>
> - incremental files other than the most recent only need to be read to
> the extent that we need their data; if some of the same blocks have
> been changed again, we can economize
>
> How much you save because of these effects is pretty variable. Best
> case, you have a 2-backup chain with no manifest checksums, and all
> that verification will have to do that you wouldn't otherwise need to
> do is walk each older directory tree in toto and cross-check which files
> exist against the manifest. That's probably cheap enough that nobody
> would be too fussed. Worst case, you have a 10-backup (or whatever)
> chain with SHA512 checksums and, say, a 50% turnover rate. In that
> case, I think having verification happen automatically could be a
> pretty major hit, both in terms of I/O and CPU. If your database is
> 1TB, it's ~5.5TB of read I/O (because one 1TB full backup and 9 0.5TB
> incrementals) instead of ~1TB of read I/O, plus the checksumming.
>
> Now, obviously you can still feel that it's totally worth it, or that
> someone in that situation shouldn't even be using incremental backups,
> and it's a value judgement, so fair enough. But my guess is that the
> efforts that this implementation makes to minimize the amount of I/O
> required for a restore are going to be important for a lot of people.

Sure -- pg_combinebackup would only need to verify the data that it uses. I'm not suggesting that it should do an exhaustive verify of every single backup in the chain. Though I can see how it sounded that way since with pg_verifybackup that would pretty much be your only choice.

The beauty of doing verification in pg_combinebackup is that it can do a lot less than running pg_verifybackup against every backup but still get a valid result. All we care about is that the output is correct -- if there is corruption in an unused part of an earlier backup pg_combinebackup doesn't need to care about that.

As far as I can see, pg_combinebackup already checks most of the boxes. The only thing I know that it can't do is detect missing files and that doesn't seem like too big a thing to handle.

Regards,
-David
On Thu, Apr 18, 2024 at 7:36 PM David Steele <david@pgmasters.net> wrote:
> Sure -- pg_combinebackup would only need to verify the data that it
> uses. I'm not suggesting that it should do an exhaustive verify of every
> single backup in the chain. Though I can see how it sounded that way
> since with pg_verifybackup that would pretty much be your only choice.
>
> The beauty of doing verification in pg_combinebackup is that it can do a
> lot less than running pg_verifybackup against every backup but still get
> a valid result. All we care about is that the output is correct -- if
> there is corruption in an unused part of an earlier backup
> pg_combinebackup doesn't need to care about that.
>
> As far as I can see, pg_combinebackup already checks most of the boxes.
> The only thing I know that it can't do is detect missing files and that
> doesn't seem like too big a thing to handle.

Hmm, that's an interesting perspective. I've always been very skeptical of doing verification only around missing files and not anything else. I figured that wouldn't be particularly meaningful, and that's pretty much the only kind of validation that's even theoretically possible without a bunch of extra overhead, since we compute checksums on entire files rather than, say, individual blocks. And you could really only do it for the final backup in the chain, because you should end up accessing all of those files, but the same is not true for the predecessor backups. So it's a very weak form of verification.

But I looked into it and I think you're correct that, if you restrict the scope in the way that you suggest, we can do it without much additional code, or much additional run-time. The cost is basically that, instead of only looking for a backup_manifest entry when we think we can reuse its checksum, we need to do a lookup for every single file in the final input directory. Then, after processing all such files, we need to iterate over the hash table one more time and see what files were never touched. That seems like an acceptably low cost to me. So, here's a patch.

I do think there's some chance that this will encourage people to believe that pg_combinebackup is better at finding problems than it really is or ever will be, and I also question whether it's right to keep changing stuff after feature freeze. But I have a feeling most people here are going to think this is worth including in 17. Let's see what others say.

--
Robert Haas
EDB: http://www.enterprisedb.com
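P.S. In pseudocode terms, the check described above looks something like this -- an illustrative Python sketch, not the committed C code; manifest parsing and path handling are simplified:

    import json, os

    def check_missing_files(final_backup_dir):
        # Build a table of every file the final backup's manifest says
        # should exist, initially marked "not seen".
        with open(os.path.join(final_backup_dir, "backup_manifest")) as f:
            manifest = json.load(f)
        seen = {entry["Path"]: False for entry in manifest["Files"]}

        # Walk the final input directory and mark each manifest entry we
        # encounter (the real tool does this lookup as it decides how to
        # process each file).
        for root, _, files in os.walk(final_backup_dir):
            for name in files:
                rel = os.path.relpath(os.path.join(root, name), final_backup_dir)
                if rel in seen:
                    seen[rel] = True

        # One more pass over the table: anything never touched is present
        # in the manifest but missing on disk.
        for path, found in seen.items():
            if not found:
                print(f'error: "{path}" is present in the manifest but not on disk')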
On 4/20/24 01:47, Robert Haas wrote:
> On Thu, Apr 18, 2024 at 7:36 PM David Steele <david@pgmasters.net> wrote:
>> Sure -- pg_combinebackup would only need to verify the data that it
>> uses. I'm not suggesting that it should do an exhaustive verify of every
>> single backup in the chain. Though I can see how it sounded that way
>> since with pg_verifybackup that would pretty much be your only choice.
>>
>> The beauty of doing verification in pg_combinebackup is that it can do a
>> lot less than running pg_verifybackup against every backup but still get
>> a valid result. All we care about is that the output is correct -- if
>> there is corruption in an unused part of an earlier backup
>> pg_combinebackup doesn't need to care about that.
>>
>> As far as I can see, pg_combinebackup already checks most of the boxes.
>> The only thing I know that it can't do is detect missing files and that
>> doesn't seem like too big a thing to handle.
>
> Hmm, that's an interesting perspective. I've always been very
> skeptical of doing verification only around missing files and not
> anything else.

Yeah, me too. There should also be some verification of the file contents themselves but now I can see we don't have that. For instance, I can do something like this:

cp test/backup/incr1/base/1/3598 test/backup/incr1/base/1/2336

And pg_combinebackup runs without complaint. Maybe missing files are more likely than corrupted files, but it would still be nice to check for both.

> I figured that wouldn't be particularly meaningful, and
> that's pretty much the only kind of validation that's even
> theoretically possible without a bunch of extra overhead, since we
> compute checksums on entire files rather than, say, individual blocks.
> And you could really only do it for the final backup in the chain,
> because you should end up accessing all of those files, but the same
> is not true for the predecessor backups. So it's a very weak form of
> verification.

I don't think it is weak if you can verify that the output is exactly as expected, i.e. all files are present and have the correct contents. But in this case it would be nice to at least know that the source files on disk are valid (which pg_verifybackup does). Without block checksums it is hard to know if the final output is correct or not.

> But I looked into it and I think you're correct that, if you restrict
> the scope in the way that you suggest, we can do it without much
> additional code, or much additional run-time. The cost is basically
> that, instead of only looking for a backup_manifest entry when we
> think we can reuse its checksum, we need to do a lookup for every
> single file in the final input directory. Then, after processing all
> such files, we need to iterate over the hash table one more time and
> see what files were never touched. That seems like an acceptably low
> cost to me. So, here's a patch.

I tested the patch and it works, but there is some noise from WAL files since they are not in the manifest:

$ pg_combinebackup test/backup/full test/backup/incr1 -o test/backup/combine
pg_combinebackup: warning: "pg_wal/000000010000000000000008" is present on disk but not in the manifest
pg_combinebackup: error: "base/1/3596" is present in the manifest but not on disk

Maybe it would be better to omit this warning since it could get very noisy depending on the number of WAL segments generated during backup.

Though I do find it a bit odd that WAL files are not in the source backup manifests but do end up in the manifest after a pg_combinebackup. It doesn't seem harmful, just odd.

> I do think there's some chance that this will encourage people to
> believe that pg_combinebackup is better at finding problems than it
> really is or ever will be,

Given that pg_combinebackup is not verifying checksums, you are probably right.

> and I also question whether it's right to
> keep changing stuff after feature freeze. But I have a feeling most
> people here are going to think this is worth including in 17. Let's
> see what others say.

I think it is a worthwhile change and we are still a month away from beta1. We'll see if anyone disagrees.

Regards,
-David
On Sun, Apr 21, 2024 at 8:47 PM David Steele <david@pgmasters.net> wrote:
> > I figured that wouldn't be particularly meaningful, and
> > that's pretty much the only kind of validation that's even
> > theoretically possible without a bunch of extra overhead, since we
> > compute checksums on entire files rather than, say, individual blocks.
> > And you could really only do it for the final backup in the chain,
> > because you should end up accessing all of those files, but the same
> > is not true for the predecessor backups. So it's a very weak form of
> > verification.
>
> I don't think it is weak if you can verify that the output is exactly as
> expected, i.e. all files are present and have the correct contents.

I don't understand what you mean here. I thought we were in agreement that verifying contents would cost a lot more. The verification that we can actually do without much cost can only check for missing files in the most recent backup, which is quite weak. pg_verifybackup is available if you want more comprehensive verification and you're willing to pay the cost of it.

> I tested the patch and it works, but there is some noise from WAL files
> since they are not in the manifest:
>
> $ pg_combinebackup test/backup/full test/backup/incr1 -o test/backup/combine
> pg_combinebackup: warning: "pg_wal/000000010000000000000008" is present
> on disk but not in the manifest
> pg_combinebackup: error: "base/1/3596" is present in the manifest but
> not on disk

Oops, OK, that can be fixed.

> I think it is a worthwhile change and we are still a month away from
> beta1. We'll see if anyone disagrees.

I don't plan to press forward with this in this release unless we get a couple of +1s from disinterested parties. We're now two weeks after feature freeze and this is design behavior, not a bug. Perhaps the design should have been otherwise, but two weeks after feature freeze is not the time to debate that.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 4/22/24 23:53, Robert Haas wrote:
> On Sun, Apr 21, 2024 at 8:47 PM David Steele <david@pgmasters.net> wrote:
>>> I figured that wouldn't be particularly meaningful, and
>>> that's pretty much the only kind of validation that's even
>>> theoretically possible without a bunch of extra overhead, since we
>>> compute checksums on entire files rather than, say, individual blocks.
>>> And you could really only do it for the final backup in the chain,
>>> because you should end up accessing all of those files, but the same
>>> is not true for the predecessor backups. So it's a very weak form of
>>> verification.
>>
>> I don't think it is weak if you can verify that the output is exactly as
>> expected, i.e. all files are present and have the correct contents.
>
> I don't understand what you mean here. I thought we were in agreement
> that verifying contents would cost a lot more. The verification that
> we can actually do without much cost can only check for missing files
> in the most recent backup, which is quite weak. pg_verifybackup is
> available if you want more comprehensive verification and you're
> willing to pay the cost of it.

I simply meant that it is *possible* to verify the output of pg_combinebackup without explicitly verifying all the backups. There would be overhead, yes, but it would be less than verifying each backup individually. For my 2c that efficiency would make it worth doing verification in pg_combinebackup, with perhaps a switch to turn it off if the user is confident in their sources.

>> I think it is a worthwhile change and we are still a month away from
>> beta1. We'll see if anyone disagrees.
>
> I don't plan to press forward with this in this release unless we get
> a couple of +1s from disinterested parties. We're now two weeks after
> feature freeze and this is design behavior, not a bug. Perhaps the
> design should have been otherwise, but two weeks after feature freeze
> is not the time to debate that.

It doesn't appear that anyone but me is terribly concerned about verification, even in this weak form, so probably best to hold this patch until the next release. As you say, it is late in the game.

Regards,
-David
On Tue, Apr 23, 2024 at 7:23 PM David Steele <david@pgmasters.net> wrote:
> > I don't understand what you mean here. I thought we were in agreement
> > that verifying contents would cost a lot more. The verification that
> > we can actually do without much cost can only check for missing files
> > in the most recent backup, which is quite weak. pg_verifybackup is
> > available if you want more comprehensive verification and you're
> > willing to pay the cost of it.
>
> I simply meant that it is *possible* to verify the output of
> pg_combinebackup without explicitly verifying all the backups. There
> would be overhead, yes, but it would be less than verifying each backup
> individually. For my 2c that efficiency would make it worth doing
> verification in pg_combinebackup, with perhaps a switch to turn it off
> if the user is confident in their sources.

Hmm, can you outline the algorithm that you have in mind? I feel we've misunderstood each other a time or two already on this topic, and I'd like to avoid more of that. Unless you just mean what the patch I posted does (check if anything from the final manifest is missing from the corresponding directory), but that doesn't seem like verifying the output.

> >> I think it is a worthwhile change and we are still a month away from
> >> beta1. We'll see if anyone disagrees.
> >
> > I don't plan to press forward with this in this release unless we get
> > a couple of +1s from disinterested parties. We're now two weeks after
> > feature freeze and this is design behavior, not a bug. Perhaps the
> > design should have been otherwise, but two weeks after feature freeze
> > is not the time to debate that.
>
> It doesn't appear that anyone but me is terribly concerned about
> verification, even in this weak form, so probably best to hold this
> patch until the next release. As you say, it is late in the game.

Added https://commitfest.postgresql.org/48/4951/

--
Robert Haas
EDB: http://www.enterprisedb.com
On 4/25/24 00:05, Robert Haas wrote:
> On Tue, Apr 23, 2024 at 7:23 PM David Steele <david@pgmasters.net> wrote:
>>> I don't understand what you mean here. I thought we were in agreement
>>> that verifying contents would cost a lot more. The verification that
>>> we can actually do without much cost can only check for missing files
>>> in the most recent backup, which is quite weak. pg_verifybackup is
>>> available if you want more comprehensive verification and you're
>>> willing to pay the cost of it.
>>
>> I simply meant that it is *possible* to verify the output of
>> pg_combinebackup without explicitly verifying all the backups. There
>> would be overhead, yes, but it would be less than verifying each backup
>> individually. For my 2c that efficiency would make it worth doing
>> verification in pg_combinebackup, with perhaps a switch to turn it off
>> if the user is confident in their sources.
>
> Hmm, can you outline the algorithm that you have in mind? I feel we've
> misunderstood each other a time or two already on this topic, and I'd
> like to avoid more of that. Unless you just mean what the patch I
> posted does (check if anything from the final manifest is missing from
> the corresponding directory), but that doesn't seem like verifying the
> output.

Yeah, it seems you are right that it is not possible to verify the output in all cases.

However, I think allowing the user to optionally validate the input would be a good feature. Running pg_verifybackup as a separate step is going to be more expensive than verifying/copying at the same time. Even with storage tricks to copy ranges of data, pg_combinebackup is going to be aware of files that do not need to be verified for the current operation, e.g. old copies of free space maps.

Additionally, if pg_combinebackup is updated to work against tar.gz, which I believe will be important going forward, then there would be little penalty to verification since all the required data would be in memory at some point anyway. Though, if the file is compressed it might be redundant since compression formats generally include checksums.

One more thing occurs to me -- if data checksums are enabled then a rough and ready output verification would be to test the checksums during combine. Data checksums aren't very good but something should be triggered if a bunch of pages go wrong, especially since the block offset is part of the checksum. This would be helpful for catching combine bugs.

Regards,
-David
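P.S. To be clear about what I mean by verifying while copying, here is a minimal sketch -- a Python stand-in for the C implementation, assuming the manifest was built with SHA256 checksums (just one of the supported algorithms):

    import hashlib

    def copy_with_verify(src, dst, expected_sha256):
        # Checksum the input bytes in the same pass that produces the
        # output, so verification costs one read instead of two.
        h = hashlib.sha256()
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            for chunk in iter(lambda: fin.read(1024 * 1024), b""):
                h.update(chunk)
                fout.write(chunk)
        if h.hexdigest() != expected_sha256:
            raise RuntimeError(f"checksum mismatch for {src}: input backup may be damaged")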
On Fri, May 17, 2024 at 1:18 AM David Steele <david@pgmasters.net> wrote:
> However, I think allowing the user to optionally validate the input
> would be a good feature. Running pg_verifybackup as a separate step is
> going to be more expensive than verifying/copying at the same time.
> Even with storage tricks to copy ranges of data, pg_combinebackup is
> going to be aware of files that do not need to be verified for the
> current operation, e.g. old copies of free space maps.

In cases where pg_combinebackup reuses a checksum from the input manifest rather than recomputing it, this could accomplish something. However, for any file that's actually reconstructed, pg_combinebackup computes the checksum as it's writing the output file. I don't see how it's sensible to then turn around and verify that the checksum that we just computed is the same one that we now get. It makes sense to run pg_verifybackup on the output of pg_combinebackup at a later time, because that can catch bits that have been flipped on disk in the meanwhile. But running the equivalent of pg_verifybackup during pg_combinebackup would amount to doing the exact same checksum calculation twice and checking that it gets the same answer both times.

> One more thing occurs to me -- if data checksums are enabled then a
> rough and ready output verification would be to test the checksums
> during combine. Data checksums aren't very good but something should be
> triggered if a bunch of pages go wrong, especially since the block
> offset is part of the checksum. This would be helpful for catching
> combine bugs.

I don't know, I'm not very enthused about this. I bet pg_combinebackup has some bugs, and it's possible that one of them could involve putting blocks in the wrong places, but it doesn't seem especially likely. Even if it happens, it's more likely to be that pg_combinebackup thinks it's putting them in the right places but is actually writing them to the wrong offset in the file, in which case a block-checksum calculation inside pg_combinebackup is going to think everything's fine, but a standalone tool that isn't confused will be able to spot the damage.

It's frustrating that we can't do better verification of these things, but to fix that I think we need better infrastructure elsewhere. For instance, if we made pg_basebackup copy blocks from shared_buffers rather than the filesystem, or at least copy them when they weren't being concurrently written to the filesystem, then we'd not have the risk of torn pages producing spurious bad checksums. If we could somehow render a backup consistent when taking it instead of when restoring it, we could verify tons of stuff. If we had some useful markers of how long files were supposed to be and which ones were supposed to be present, we could check a lot of things that are uncheckable today. pg_combinebackup does the best it can -- or the best I could make it do -- but there's a disappointing number of situations where it's like "hmm, in this situation, either something bad happened or it's just the somewhat unusual case where this happens in the normal course of events, and we have no way to tell which it is."

--
Robert Haas
EDB: http://www.enterprisedb.com
On 5/17/24 22:20, Robert Haas wrote:
> On Fri, May 17, 2024 at 1:18 AM David Steele <david@pgmasters.net> wrote:
>> However, I think allowing the user to optionally validate the input
>> would be a good feature. Running pg_verifybackup as a separate step is
>> going to be more expensive than verifying/copying at the same time.
>> Even with storage tricks to copy ranges of data, pg_combinebackup is
>> going to be aware of files that do not need to be verified for the
>> current operation, e.g. old copies of free space maps.
>
> In cases where pg_combinebackup reuses a checksum from the input
> manifest rather than recomputing it, this could accomplish something.
> However, for any file that's actually reconstructed, pg_combinebackup
> computes the checksum as it's writing the output file. I don't see how
> it's sensible to then turn around and verify that the checksum that we
> just computed is the same one that we now get.

Here's an example. First make a few backups:

$ pg_basebackup -c fast -X none -D test/backup/full -F plain
$ pg_basebackup -c fast -D test/backup/incr1 -F plain -i /home/dev/test/backup/full/backup_manifest

Then intentionally corrupt a file in the incr backup:

$ truncate -s 0 test/backup/incr1/base/5/3764_fsm

In this case pg_verifybackup will error:

$ pg_verifybackup test/backup/incr1
pg_verifybackup: error: "base/5/3764_fsm" has size 0 on disk but size 24576 in the manifest

But pg_combinebackup does not complain:

$ pg_combinebackup test/backup/full test/backup/incr1 -o test/backup/combine
$ ls -lah test/backup/combine/base/5/3764_fsm
-rw------- 1 dev dialout 0 May 17 22:08 test/backup/combine/base/5/3764_fsm

It would be nice if pg_combinebackup would (at least optionally but preferably by default) complain in this case rather than the user needing to separately run pg_verifybackup.

Regards,
-David
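P.S. Note that this particular failure is catchable without computing any checksums at all, since the manifest records each file's expected size. A sketch of the kind of cheap check I have in mind (illustrative Python; manifest handling simplified):

    import json, os

    def check_sizes(backup_dir):
        # The manifest records the expected size of every file, so a
        # truncated input can be flagged before any data is copied.
        with open(os.path.join(backup_dir, "backup_manifest")) as f:
            manifest = json.load(f)
        for entry in manifest["Files"]:
            path = os.path.join(backup_dir, entry["Path"])
            if os.path.exists(path):
                actual = os.path.getsize(path)
                if actual != entry["Size"]:
                    print(f'error: "{entry["Path"]}" has size {actual} on disk '
                          f'but size {entry["Size"]} in the manifest')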
On 5/17/24 14:20, Robert Haas wrote:
> On Fri, May 17, 2024 at 1:18 AM David Steele <david@pgmasters.net> wrote:
>> However, I think allowing the user to optionally validate the input
>> would be a good feature. Running pg_verifybackup as a separate step is
>> going to be more expensive than verifying/copying at the same time.
>> Even with storage tricks to copy ranges of data, pg_combinebackup is
>> going to be aware of files that do not need to be verified for the
>> current operation, e.g. old copies of free space maps.
>
> In cases where pg_combinebackup reuses a checksum from the input
> manifest rather than recomputing it, this could accomplish something.
> However, for any file that's actually reconstructed, pg_combinebackup
> computes the checksum as it's writing the output file. I don't see how
> it's sensible to then turn around and verify that the checksum that we
> just computed is the same one that we now get. It makes sense to run
> pg_verifybackup on the output of pg_combinebackup at a later time,
> because that can catch bits that have been flipped on disk in the
> meanwhile. But running the equivalent of pg_verifybackup during
> pg_combinebackup would amount to doing the exact same checksum
> calculation twice and checking that it gets the same answer both
> times.
>
>> One more thing occurs to me -- if data checksums are enabled then a
>> rough and ready output verification would be to test the checksums
>> during combine. Data checksums aren't very good but something should be
>> triggered if a bunch of pages go wrong, especially since the block
>> offset is part of the checksum. This would be helpful for catching
>> combine bugs.
>
> I don't know, I'm not very enthused about this. I bet pg_combinebackup
> has some bugs, and it's possible that one of them could involve
> putting blocks in the wrong places, but it doesn't seem especially
> likely. Even if it happens, it's more likely to be that
> pg_combinebackup thinks it's putting them in the right places but is
> actually writing them to the wrong offset in the file, in which case a
> block-checksum calculation inside pg_combinebackup is going to think
> everything's fine, but a standalone tool that isn't confused will be
> able to spot the damage.

Perhaps more importantly, can you even verify data checksums before the recovery is completed? I don't think you can (pg_checksums certainly does not allow doing that). Because who knows in what shape you copied the block?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 5/18/24 21:06, Tomas Vondra wrote:
> On 5/17/24 14:20, Robert Haas wrote:
>> On Fri, May 17, 2024 at 1:18 AM David Steele <david@pgmasters.net> wrote:
>>> However, I think allowing the user to optionally validate the input
>>> would be a good feature. Running pg_verifybackup as a separate step is
>>> going to be more expensive than verifying/copying at the same time.
>>> Even with storage tricks to copy ranges of data, pg_combinebackup is
>>> going to be aware of files that do not need to be verified for the
>>> current operation, e.g. old copies of free space maps.
>>
>> In cases where pg_combinebackup reuses a checksum from the input
>> manifest rather than recomputing it, this could accomplish something.
>> However, for any file that's actually reconstructed, pg_combinebackup
>> computes the checksum as it's writing the output file. I don't see how
>> it's sensible to then turn around and verify that the checksum that we
>> just computed is the same one that we now get. It makes sense to run
>> pg_verifybackup on the output of pg_combinebackup at a later time,
>> because that can catch bits that have been flipped on disk in the
>> meanwhile. But running the equivalent of pg_verifybackup during
>> pg_combinebackup would amount to doing the exact same checksum
>> calculation twice and checking that it gets the same answer both
>> times.
>>
>>> One more thing occurs to me -- if data checksums are enabled then a
>>> rough and ready output verification would be to test the checksums
>>> during combine. Data checksums aren't very good but something should be
>>> triggered if a bunch of pages go wrong, especially since the block
>>> offset is part of the checksum. This would be helpful for catching
>>> combine bugs.
>>
>> I don't know, I'm not very enthused about this. I bet pg_combinebackup
>> has some bugs, and it's possible that one of them could involve
>> putting blocks in the wrong places, but it doesn't seem especially
>> likely. Even if it happens, it's more likely to be that
>> pg_combinebackup thinks it's putting them in the right places but is
>> actually writing them to the wrong offset in the file, in which case a
>> block-checksum calculation inside pg_combinebackup is going to think
>> everything's fine, but a standalone tool that isn't confused will be
>> able to spot the damage.
>
> Perhaps more importantly, can you even verify data checksums before the
> recovery is completed? I don't think you can (pg_checksums certainly
> does not allow doing that). Because who knows in what shape you copied
> the block?

Yeah, you'd definitely need a list of blocks you knew to be valid at backup time, which sounds like a lot more work than just some overall checksumming scheme.

Regards,
-David
On Fri, May 17, 2024 at 6:14 PM David Steele <david@pgmasters.net> wrote:
> Then intentionally corrupt a file in the incr backup:
>
> $ truncate -s 0 test/backup/incr1/base/5/3764_fsm
>
> In this case pg_verifybackup will error:
>
> $ pg_verifybackup test/backup/incr1
> pg_verifybackup: error: "base/5/3764_fsm" has size 0 on disk but size
> 24576 in the manifest
>
> But pg_combinebackup does not complain:
>
> $ pg_combinebackup test/backup/full test/backup/incr1 -o test/backup/combine
> $ ls -lah test/backup/combine/base/5/3764_fsm
> -rw------- 1 dev dialout 0 May 17 22:08 test/backup/combine/base/5/3764_fsm
>
> It would be nice if pg_combinebackup would (at least optionally but
> preferably by default) complain in this case rather than the user
> needing to separately run pg_verifybackup.

My first reaction here is that it would be better to have people run pg_verifybackup for this. If we try to do this in pg_combinebackup, we're either going to be quite limited in the amount of validation we can do (which might lure users into a false sense of security) or we're going to make things quite a bit more complicated and expensive.

Perhaps there's something here that is worth doing; I haven't thought about this deeply and can't really do so at present. I do believe in reasonable error detection, which I hope goes without saying, but I also believe strongly in orthogonality: a tool should do one job and do it as well as possible.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 5/21/24 03:09, Robert Haas wrote:
> On Fri, May 17, 2024 at 6:14 PM David Steele <david@pgmasters.net> wrote:
>> Then intentionally corrupt a file in the incr backup:
>>
>> $ truncate -s 0 test/backup/incr1/base/5/3764_fsm
>>
>> In this case pg_verifybackup will error:
>>
>> $ pg_verifybackup test/backup/incr1
>> pg_verifybackup: error: "base/5/3764_fsm" has size 0 on disk but size
>> 24576 in the manifest
>>
>> But pg_combinebackup does not complain:
>>
>> $ pg_combinebackup test/backup/full test/backup/incr1 -o test/backup/combine
>> $ ls -lah test/backup/combine/base/5/3764_fsm
>> -rw------- 1 dev dialout 0 May 17 22:08 test/backup/combine/base/5/3764_fsm
>>
>> It would be nice if pg_combinebackup would (at least optionally but
>> preferably by default) complain in this case rather than the user
>> needing to separately run pg_verifybackup.
>
> My first reaction here is that it would be better to have people run
> pg_verifybackup for this. If we try to do this in pg_combinebackup,
> we're either going to be quite limited in the amount of validation we
> can do (which might lure users into a false sense of security) or
> we're going to make things quite a bit more complicated and expensive.
>
> Perhaps there's something here that is worth doing; I haven't thought
> about this deeply and can't really do so at present. I do believe in
> reasonable error detection, which I hope goes without saying, but I
> also believe strongly in orthogonality: a tool should do one job and
> do it as well as possible.

OK, that seems like a good place to leave this for now.

Regards,
-David
On Fri, Apr 19, 2024 at 11:47 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Hmm, that's an interesting perspective. I've always been very
> skeptical of doing verification only around missing files and not
> anything else. I figured that wouldn't be particularly meaningful, and
> that's pretty much the only kind of validation that's even
> theoretically possible without a bunch of extra overhead, since we
> compute checksums on entire files rather than, say, individual blocks.
> And you could really only do it for the final backup in the chain,
> because you should end up accessing all of those files, but the same
> is not true for the predecessor backups. So it's a very weak form of
> verification.
>
> But I looked into it and I think you're correct that, if you restrict
> the scope in the way that you suggest, we can do it without much
> additional code, or much additional run-time. The cost is basically
> that, instead of only looking for a backup_manifest entry when we
> think we can reuse its checksum, we need to do a lookup for every
> single file in the final input directory. Then, after processing all
> such files, we need to iterate over the hash table one more time and
> see what files were never touched. That seems like an acceptably low
> cost to me. So, here's a patch.
>
> I do think there's some chance that this will encourage people to
> believe that pg_combinebackup is better at finding problems than it
> really is or ever will be, and I also question whether it's right to
> keep changing stuff after feature freeze. But I have a feeling most
> people here are going to think this is worth including in 17. Let's
> see what others say.

There was no hue and cry to include this in v17 and I think that ship has sailed at this point, but we could still choose to include this as an enhancement for v18 if people want it. I think David's probably in favor of that (but I'm not 100% sure) and I have mixed feelings about it (explained above) so what I'd really like is some other opinions on whether this idea is good, bad, or indifferent.

Here is a rebased version of the patch. No other changes since v1.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 8/2/24 20:37, Robert Haas wrote:
> On Fri, Apr 19, 2024 at 11:47 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> Hmm, that's an interesting perspective. I've always been very
>> skeptical of doing verification only around missing files and not
>> anything else. I figured that wouldn't be particularly meaningful, and
>> that's pretty much the only kind of validation that's even
>> theoretically possible without a bunch of extra overhead, since we
>> compute checksums on entire files rather than, say, individual blocks.
>> And you could really only do it for the final backup in the chain,
>> because you should end up accessing all of those files, but the same
>> is not true for the predecessor backups. So it's a very weak form of
>> verification.
>>
>> But I looked into it and I think you're correct that, if you restrict
>> the scope in the way that you suggest, we can do it without much
>> additional code, or much additional run-time. The cost is basically
>> that, instead of only looking for a backup_manifest entry when we
>> think we can reuse its checksum, we need to do a lookup for every
>> single file in the final input directory. Then, after processing all
>> such files, we need to iterate over the hash table one more time and
>> see what files were never touched. That seems like an acceptably low
>> cost to me. So, here's a patch.
>>
>> I do think there's some chance that this will encourage people to
>> believe that pg_combinebackup is better at finding problems than it
>> really is or ever will be, and I also question whether it's right to
>> keep changing stuff after feature freeze. But I have a feeling most
>> people here are going to think this is worth including in 17. Let's
>> see what others say.
>
> There was no hue and cry to include this in v17 and I think that ship
> has sailed at this point, but we could still choose to include this as
> an enhancement for v18 if people want it. I think David's probably in
> favor of that (but I'm not 100% sure) and I have mixed feelings about
> it (explained above) so what I'd really like is some other opinions on
> whether this idea is good, bad, or indifferent.

I'm still in favor but if nobody else is interested then I'm not going to push on it.

Regards,
-David
On Fri, Aug 2, 2024 at 7:07 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Apr 19, 2024 at 11:47 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
[...]
> Here is a rebased version of the patch. No other changes since v1.
>

Here are two minor comments on this:

$ pg_combinebackup /tmp/backup_full/ /tmp/backup_incr2/ /tmp/backup_incr3/ -o /tmp/backup_comb
pg_combinebackup: warning: "pg_wal/000000010000000000000020" is present on disk but not in the manifest

This warning shouldn’t be reported, since we don’t include WAL in the backup manifest?

Also, I found that the final resultant manifest includes this WAL entry:

$ head /tmp/backup_comb/backup_manifest | grep pg_wal
{ "Path": "pg_wal/000000010000000000000020", "Size": 16777216, "Last-Modified": "2024-08-06 11:54:16 GMT" },

---

+# Set up another new database instance. force_initdb is used because
+# we want it to be a separate cluster with a different system ID.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init(force_initdb => 1, has_archiving => 1, allows_streaming => 1);
+$node2->append_conf('postgresql.conf', 'summarize_wal = on');
+$node2->start;
+

Unused cluster-node in the test.

Regards,
Amul
On Sun, Aug 4, 2024 at 11:58 PM David Steele <david@pgmasters.net> wrote:
> I'm still in favor but if nobody else is interested then I'm not going
> to push on it.

OK, so since this email was sent, Amul reviewed the patch (thanks, Amul!) but didn't take a position on whether it was a good idea. Nobody else has responded. Hence, I'm withdrawing this patch. If a bunch of people show up to say we should really do this, we can revisit the issue when that happens.

--
Robert Haas
EDB: http://www.enterprisedb.com