Re: backup manifests - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: backup manifests |
| Date | |
| Msg-id | 20200331225034.odt3l2w4h3dr6ggk@alap3.anarazel.de Whole thread Raw |
| In response to | Re: backup manifests (Robert Haas <robertmhaas@gmail.com>) |
| Responses |
Re: backup manifests
Re: backup manifests |
| List | pgsql-hackers |
Hi,
On 2020-03-31 14:10:34 -0400, Robert Haas wrote:
> I made an attempt to implement this.
Awesome!
> In the attached patch set, 0001 I'm going to work on those things. I
> would appreciate *very timely* feedback on anything people do or do
> not like about this, because I want to commit this patch set by the
> end of the work week and that isn't very far away. I would also
> appreciate if people would bear in mind the principle that half a loaf
> is better than none, and further improvements can be made in future
> releases.
>
> As part of my light testing, I tried promoting a standby that was
> running pg_basebackup, and found that pg_basebackup failed like this:
>
> pg_basebackup: error: could not get COPY data stream: ERROR: the
> standby was promoted during online backup
> HINT: This means that the backup being taken is corrupt and should
> not be used. Try taking another online backup.
> pg_basebackup: removing data directory "/Users/rhaas/pgslave2"
>
> My first thought was that this error message is hard to reconcile with
> this comment:
>
> /*
> * Send timeline history files too. Only the latest timeline history
> * file is required for recovery, and even that only if there happens
> * to be a timeline switch in the first WAL segment that contains the
> * checkpoint record, or if we're taking a base backup from a standby
> * server and the target timeline changes while the backup is taken.
> * But they are small and highly useful for debugging purposes, so
> * better include them all, always.
> */
>
> But then it occurred to me that this might be a cascading standby.
Yea. The check just prevents the walsender's database from being
promoted:
/*
* Check if the postmaster has signaled us to exit, and abort with an
* error in that case. The error handler further up will call
* do_pg_abort_backup() for us. Also check that if the backup was
* started while still in recovery, the server wasn't promoted.
* do_pg_stop_backup() will check that too, but it's better to stop
* the backup early than continue to the end and fail there.
*/
CHECK_FOR_INTERRUPTS();
if (RecoveryInProgress() != backup_started_in_recovery)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("the standby was promoted during online backup"),
errhint("This means that the backup being taken is corrupt "
"and should not be used. "
"Try taking another online backup.")));
and
if (strcmp(backupfrom, "standby") == 0 && !backup_started_in_recovery)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("the standby was promoted during online backup"),
errhint("This means that the backup being taken is corrupt "
"and should not be used. "
"Try taking another online backup.")));
So that just prevents promotions of the current node, afaict.
> Regardless, it seems like a really good idea to store a list of WAL
> ranges rather than a single start/end/timeline, because even if it's
> impossible today it might become possible in the future.
Indeed.
> Still, unless there's an easy way to set up a test scenario where
> multiple WAL ranges need to be verified, it may be hard to test that
> this code actually behaves properly.
I think it'd be possible to test without a fully cascading setup, by
creating an initial base backup, then do some work to create a bunch of
new timelines, and then start the initial base backup. That'd have to
follow all those timelines. Not sure that's better than a cascading
setup though.
> +/*
> + * Add information about the WAL that will need to be replayed when restoring
> + * this backup to the manifest.
> + */
> +static void
> +AddWALInfoToManifest(manifest_info *manifest, XLogRecPtr startptr,
> + TimeLineID starttli, XLogRecPtr endptr, TimeLineID endtli)
> +{
> + List *timelines = readTimeLineHistory(endtli);
should probably happen after the manifest->buffile check.
> + ListCell *lc;
> + bool first_wal_range = true;
> + bool found_ending_tli = false;
> +
> + /* If there is no buffile, then the user doesn't want a manifest. */
> + if (manifest->buffile == NULL)
> + return;
Not really about this patch/function specifically: I wonder if this'd
look better if you added ManifestEnabled() macro instead of repeating
the comment repeatedly.
> + /* Unless --no-parse-wal was specified, we will need pg_waldump. */
> + if (!no_parse_wal)
> + {
> + int ret;
> +
> + pg_waldump_path = pg_malloc(MAXPGPATH);
> + ret = find_other_exec(argv[0], "pg_waldump",
> + "pg_waldump (PostgreSQL) " PG_VERSION "\n",
> + pg_waldump_path);
> + if (ret < 0)
> + {
> + char full_path[MAXPGPATH];
> +
> + if (find_my_exec(argv[0], full_path) < 0)
> + strlcpy(full_path, progname, sizeof(full_path));
> + if (ret == -1)
> + pg_log_fatal("The program \"%s\" is needed by %s but was\n"
> + "not found in the same directory as \"%s\".\n"
> + "Check your installation.",
> + "pg_waldump", "pg_validatebackup", full_path);
> + else
> + pg_log_fatal("The program \"%s\" was found by \"%s\" but was\n"
> + "not the same version as %s.\n"
> + "Check your installation.",
> + "pg_waldump", full_path, "pg_validatebackup");
> + }
> + }
ISTM, and this can definitely wait for another time, that we should have
one wrapper doing all of this, instead of having quite a few copies of
very similar logic to the above.
> +/*
> + * Attempt to parse the WAL files required to restore from backup using
> + * pg_waldump.
> + */
> +static void
> +parse_required_wal(validator_context *context, char *pg_waldump_path,
> + char *wal_directory, manifest_wal_range *first_wal_range)
> +{
> + manifest_wal_range *this_wal_range = first_wal_range;
> +
> + while (this_wal_range != NULL)
> + {
> + char *pg_waldump_cmd;
> +
> + pg_waldump_cmd = psprintf("\"%s\" --quiet --path=\"%s\" --timeline=%u --start=%X/%X --end=%X/%X\n",
> + pg_waldump_path, wal_directory, this_wal_range->tli,
> + (uint32) (this_wal_range->start_lsn >> 32),
> + (uint32) this_wal_range->start_lsn,
> + (uint32) (this_wal_range->end_lsn >> 32),
> + (uint32) this_wal_range->end_lsn);
> + if (system(pg_waldump_cmd) != 0)
> + report_backup_error(context,
> + "WAL parsing failed for timeline %u",
> + this_wal_range->tli);
> +
> + this_wal_range = this_wal_range->next;
> + }
> +}
Should we have a function to properly escape paths in cases like this?
Not that it's likely or really problematic, but the quoting for path
could be "circumvented".
Greetings,
Andres Freund
pgsql-hackers by date: