Re: backup manifests - Mailing list pgsql-hackers
From | Stephen Frost |
---|---|
Subject | Re: backup manifests |
Date | |
Msg-id | 20200323224217.GN13712@tamriel.snowman.net Whole thread Raw |
In response to | Re: backup manifests (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: backup manifests
|
List | pgsql-hackers |
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > I think I forgot an initializer. Try this version. Just took a quick look through this. I'm pretty sure David wants to look at it too. Anyway, some comments below. > diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml > index f139ba0231..d1ff53e8e8 100644 > --- a/doc/src/sgml/protocol.sgml > +++ b/doc/src/sgml/protocol.sgml > @@ -2466,7 +2466,7 @@ The commands accepted in replication mode are: > </varlistentry> > > <varlistentry id="protocol-replication-base-backup" xreflabel="BASE_BACKUP"> > - <term><literal>BASE_BACKUP</literal> [ <literal>LABEL</literal> <replaceable>'label'</replaceable> ] [ <literal>PROGRESS</literal>] [ <literal>FAST</literal> ] [ <literal>WAL</literal> ] [ <literal>NOWAIT</literal> ] [ <literal>MAX_RATE</literal><replaceable>rate</replaceable> ] [ <literal>TABLESPACE_MAP</literal> ] [ <literal>NOVERIFY_CHECKSUMS</literal>] > + <term><literal>BASE_BACKUP</literal> [ <literal>LABEL</literal> <replaceable>'label'</replaceable> ] [ <literal>PROGRESS</literal>] [ <literal>FAST</literal> ] [ <literal>WAL</literal> ] [ <literal>NOWAIT</literal> ] [ <literal>MAX_RATE</literal><replaceable>rate</replaceable> ] [ <literal>TABLESPACE_MAP</literal> ] [ <literal>NOVERIFY_CHECKSUMS</literal>] [ <literal>MANIFEST</literal> <replaceable>manifest_option</replaceable> ] [ <literal>MANIFEST_CHECKSUMS</literal><replaceable>checksum_algorithm</replaceable> ] > <indexterm><primary>BASE_BACKUP</primary></indexterm> > </term> > <listitem> > @@ -2576,6 +2576,37 @@ The commands accepted in replication mode are: > </para> > </listitem> > </varlistentry> > + > + <varlistentry> > + <term><literal>MANIFEST</literal></term> > + <listitem> > + <para> > + When this option is specified with a value of <literal>ye'</literal> > + or <literal>force-escape</literal>, a backup manifest is created > + and sent along with the backup. The latter value forces all filenames > + to be hex-encoded; otherwise, this type of encoding is performed only > + for files whose names are non-UTF8 octet sequences. > + <literal>force-escape</literal> is intended primarily for testing > + purposes, to be sure that clients which read the backup manifest > + can handle this case. For compatibility with previous releases, > + the default is <literal>MANIFEST 'no'</literal>. > + </para> > + </listitem> > + </varlistentry> > + > + <varlistentry> > + <term><literal>MANIFEST_CHECKSUMS</literal></term> > + <listitem> > + <para> > + Specifies the algorithm that should be used to checksum each file > + for purposes of the backup manifest. Currently, the available > + algorithms are <literal>NONE</literal>, <literal>CRC32C</literal>, > + <literal>SHA224</literal>, <literal>SHA256</literal>, > + <literal>SHA384</literal>, and <literal>SHA512</literal>. > + The default is <literal>CRC32C</literal>. > + </para> > + </listitem> > + </varlistentry> > </variablelist> > </para> > <para> While I get the desire to have a default here that includes checksums, the way the command is structured, it strikes me as odd that the lack of MANIFEST_CHECKSUMS in the command actually results in checksums being included. I would think that we'd either: - have the lack of MANIFEST_CHECKSUMS mean 'No checksums' or - Require MANIFEST_CHECKSUMS to be specified and not have it be optional We aren't expecting people to actually be typing these commands out and so I don't think it's a *huge* deal to have it the way you've written it, but it still strikes me as odd. I don't think I have a real preference between the two options that I suggest above, maybe very slightly in favor of the first. > diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml > index 90638aad0e..bf6963a595 100644 > --- a/doc/src/sgml/ref/pg_basebackup.sgml > +++ b/doc/src/sgml/ref/pg_basebackup.sgml > @@ -561,6 +561,69 @@ PostgreSQL documentation > </para> > </listitem> > </varlistentry> > + > + <varlistentry> > + <term><option>--no-manifest</option></term> > + <listitem> > + <para> > + Disables generation of a backup manifest. If this option is not > + specified, the server will and send generate a backup manifest > + which can be verified using <xref linkend="app-pgvalidatebackup" />. > + </para> > + </listitem> > + </varlistentry> How about "If this option is not specified, the server will generate and send a backup manifest which can be verified using ..." > + <varlistentry> > + <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term> > + <listitem> > + <para> > + Specifies the algorithm that should be used to checksum each file > + for purposes of the backup manifest. Currently, the available > + algorithms are <literal>NONE</literal>, <literal>CRC32C</literal>, > + <literal>SHA224</literal>, <literal>SHA256</literal>, > + <literal>SHA384</literal>, and <literal>SHA512</literal>. > + The default is <literal>CRC32C</literal>. > + </para> As I recall, there was an invitation to argue about the defaults at one point, and so I'm going to say here that I would advocate for a different default than 'crc32c'. Specifically, I would think sha256 or 512 would be better. I don't recall seeing a debate about this that conclusively found crc32c to be better, but I'm happy to go back and reread anything someone wants to point me at. > + <para> > + If <literal>NONE</literal> is selected, the backup manifest will > + not contain any checksums. Otherwise, it will contain a checksum > + of each file in the backup using the specified algorithm. In addition, > + the manifest itself will always contain a <literal>SHA256</literal> > + checksum of its own contents. The <literal>SHA</literal> algorithms > + are significantly more CPU-intensive than <literal>CRC32C</literal>, > + so selecting one of them may increase the time required to complete > + the backup. > + </para> It also seems a bit silly to me that using the defaults means having to deal with two different algorithms- crc32c and sha256. Considering how fast these algorithms are, compared to everything else involved in a backup (particularly one that's likely going across a network...), I wonder if we should say "may slightly increase" above. > + <para> > + On the other hand, <literal>CRC32C</literal> is not a cryptographic > + hash function, so it is only suitable for protecting against > + inadvertent or random modifications to a backup. An adversary > + who can modify the backup could easily do so in such a way that > + the CRC does not change, whereas a SHA collision will be hard > + to manufacture. (However, note that if the attacker also has access > + to modify the backup manifest itself, no checksum algorithm will > + provide any protection.) An additional advantage of the > + <literal>SHA</literal> family of functions is that they output > + a much larger number of bits. > + </para> I'm not really sure that this paragraph is sensible to include.. We certainly don't talk about adversaries and cryptographic hash functions when we talk about our page-level checksums, for example. I'm not completely against including it, but I don't want to give the impression that this is something we routinely consider or that lack of discussion elsewhere implies we have protections against a determined attacker. > diff --git a/doc/src/sgml/ref/pg_validatebackup.sgml b/doc/src/sgml/ref/pg_validatebackup.sgml > new file mode 100644 > index 0000000000..1c171f6970 > --- /dev/null > +++ b/doc/src/sgml/ref/pg_validatebackup.sgml > @@ -0,0 +1,232 @@ > +<!-- > +doc/src/sgml/ref/pg_validatebackup.sgml > +PostgreSQL documentation > +--> > + > +<refentry id="app-pgvalidatebackup"> > + <indexterm zone="app-pgvalidatebackup"> > + <primary>pg_validatebackup</primary> > + </indexterm> > + > + <refmeta> > + <refentrytitle>pg_validatebackup</refentrytitle> > + <manvolnum>1</manvolnum> > + <refmiscinfo>Application</refmiscinfo> > + </refmeta> > + > + <refnamediv> > + <refname>pg_validatebackup</refname> > + <refpurpose>verify the integrity of a base backup of a > + <productname>PostgreSQL</productname> cluster</refpurpose> > + </refnamediv> "verify the integrity of a backup taken using pg_basebackup" > + <refsect1> > + <title> > + Description > + </title> > + <para> > + <application>pg_validatebackup</application> is used to check the integrity > + of a database cluster backup. The backup being checked should have been > + created by <command>pg_basebackup</command> or some other tool that includes > + a <literal>backup_manifest</literal> file with the backup. The backup > + must be stored in the "plain" format; a "tar" format backup can be checked > + after extracting it. Backup manifests are created by the server beginning > + with <productname>PostgreSQL</productname> version 13, so older backups > + cannot be validated using this tool. > + </para> This seems to invite the idea that pg_validatebackup should be able to work with external backup solutions- but I'm a bit concerned by that idea because it seems like it would then mean we'd have to be particularly careful when changing things in this area, and I'm not thrilled by that. I'd like to make sure that new versions of pg_validatebackup work with older backups, and, ideally, older versions of pg_validatebackup would work even with newer backups, all of which I think the json structure of the manifest helps us with, but that's when we're building the manifest and know what it's going to look like. Maybe to put it another way- would a patch be accepted to make pg_validatebackup work with other manifests..? If not, then I'd keep this to the more specific "this tool is used to validate backups taken using pg_basebackup". > + <para> > + <application>pg_validatebackup</application> reads the manifest file of a > + backup, verifies the manifest against its own internal checksum, and then > + verifies that the same files are present in the target directory as in the > + manifest itself. It then verifies that each file has the expected checksum, > + unless the backup was taken the checksum algorithm set to "was taken with the checksum algorithm"... > + <literal>none</literal>, in which case checksum verification is not > + performed. The presence or absence of directories is not checked, except > + indirectly: if a directory is missing, any files it should have contained > + will necessarily also be missing. Certain files and directories are > + excluded from verification: > + </para> > + > + <itemizedlist> > + <listitem> > + <para> > + <literal>backup_manifest</literal> is ignored because the backup > + manifest is logically not part of the backup and does not include > + any entry for itself. > + </para> > + </listitem> This seems a bit confusing, doesn't it? The backup_manifest must exist, and its checksum is internal, and is checked, isn't it? Why say that it's excluded..? > + <listitem> > + <para> > + <literal>pg_wal</literal> is ignored because WAL files are sent > + separately from the backup, and are therefore not described by the > + backup manifest. > + </para> > + </listitem> I don't agree with the choice to exclude the WAL files, considering they're an integral part of a backup, to exclude them means that if they've been corrupted at all then the entire backup is invalid. You don't want to be discovering that when you're trying to do a restore of a backup that you took with pg_basebackup and which pg_validatebackup says is valid. After all, the tool being used here, pg_basebackup, *does* also stream the WAL files- there's no reason why we can't calculate a checksum on them and store that checksum somewhere and use it to validate the WAL files. This, in my opinion, is actually a show-stopper for this feature. Claiming it's a valid backup when we don't check the absolutely necessary-for-restore WAL is making a false claim, no matter how well it's documented. I do understand that it's possible to run pg_basebackup without the WAL files being grabbed as part of that run- in such a case, we should be able to detect that was the case for the backup and when running pg_validatebackup we should issue a WARNING that the WAL files weren't able to be verified (we could have an option to suppress that warning if people feel that's needed). > + <listitem> > + <para> > + <literal>postgesql.auto.conf</literal>, > + <literal>standby.signal</literal>, > + and <literal>recovery.signal</literal> are ignored because they may > + sometimes be created or modified by the backup client itself. > + (For example, <literal>pg_basebackup -R</literal> will modify > + <literal>postgresql.auto.conf</literal> and create > + <literal>standby.signal</literal>.) > + </para> > + </listitem> > + </itemizedlist> > + </refsect1> Not really thrilled with this (pg_basebackup certainly could figure out the checksum for those files...), but I also don't think it's a huge issue as they can be recreated by a user (unlike a WAL file..). I got through most of the pg_basebackup changes, and they looked pretty good in general. Will try to review more tomorrow. Thanks, Stephen
Attachment
pgsql-hackers by date: