Re: backup manifests - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: backup manifests |
Date | |
Msg-id | 20200331225034.odt3l2w4h3dr6ggk@alap3.anarazel.de Whole thread Raw |
In response to | Re: backup manifests (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: backup manifests
Re: backup manifests |
List | pgsql-hackers |
Hi, On 2020-03-31 14:10:34 -0400, Robert Haas wrote: > I made an attempt to implement this. Awesome! > In the attached patch set, 0001 I'm going to work on those things. I > would appreciate *very timely* feedback on anything people do or do > not like about this, because I want to commit this patch set by the > end of the work week and that isn't very far away. I would also > appreciate if people would bear in mind the principle that half a loaf > is better than none, and further improvements can be made in future > releases. > > As part of my light testing, I tried promoting a standby that was > running pg_basebackup, and found that pg_basebackup failed like this: > > pg_basebackup: error: could not get COPY data stream: ERROR: the > standby was promoted during online backup > HINT: This means that the backup being taken is corrupt and should > not be used. Try taking another online backup. > pg_basebackup: removing data directory "/Users/rhaas/pgslave2" > > My first thought was that this error message is hard to reconcile with > this comment: > > /* > * Send timeline history files too. Only the latest timeline history > * file is required for recovery, and even that only if there happens > * to be a timeline switch in the first WAL segment that contains the > * checkpoint record, or if we're taking a base backup from a standby > * server and the target timeline changes while the backup is taken. > * But they are small and highly useful for debugging purposes, so > * better include them all, always. > */ > > But then it occurred to me that this might be a cascading standby. Yea. The check just prevents the walsender's database from being promoted: /* * Check if the postmaster has signaled us to exit, and abort with an * error in that case. The error handler further up will call * do_pg_abort_backup() for us. Also check that if the backup was * started while still in recovery, the server wasn't promoted. * do_pg_stop_backup() will check that too, but it's better to stop * the backup early than continue to the end and fail there. */ CHECK_FOR_INTERRUPTS(); if (RecoveryInProgress() != backup_started_in_recovery) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("the standby was promoted during online backup"), errhint("This means that the backup being taken is corrupt " "and should not be used. " "Try taking another online backup."))); and if (strcmp(backupfrom, "standby") == 0 && !backup_started_in_recovery) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("the standby was promoted during online backup"), errhint("This means that the backup being taken is corrupt " "and should not be used. " "Try taking another online backup."))); So that just prevents promotions of the current node, afaict. > Regardless, it seems like a really good idea to store a list of WAL > ranges rather than a single start/end/timeline, because even if it's > impossible today it might become possible in the future. Indeed. > Still, unless there's an easy way to set up a test scenario where > multiple WAL ranges need to be verified, it may be hard to test that > this code actually behaves properly. I think it'd be possible to test without a fully cascading setup, by creating an initial base backup, then do some work to create a bunch of new timelines, and then start the initial base backup. That'd have to follow all those timelines. Not sure that's better than a cascading setup though. > +/* > + * Add information about the WAL that will need to be replayed when restoring > + * this backup to the manifest. > + */ > +static void > +AddWALInfoToManifest(manifest_info *manifest, XLogRecPtr startptr, > + TimeLineID starttli, XLogRecPtr endptr, TimeLineID endtli) > +{ > + List *timelines = readTimeLineHistory(endtli); should probably happen after the manifest->buffile check. > + ListCell *lc; > + bool first_wal_range = true; > + bool found_ending_tli = false; > + > + /* If there is no buffile, then the user doesn't want a manifest. */ > + if (manifest->buffile == NULL) > + return; Not really about this patch/function specifically: I wonder if this'd look better if you added ManifestEnabled() macro instead of repeating the comment repeatedly. > + /* Unless --no-parse-wal was specified, we will need pg_waldump. */ > + if (!no_parse_wal) > + { > + int ret; > + > + pg_waldump_path = pg_malloc(MAXPGPATH); > + ret = find_other_exec(argv[0], "pg_waldump", > + "pg_waldump (PostgreSQL) " PG_VERSION "\n", > + pg_waldump_path); > + if (ret < 0) > + { > + char full_path[MAXPGPATH]; > + > + if (find_my_exec(argv[0], full_path) < 0) > + strlcpy(full_path, progname, sizeof(full_path)); > + if (ret == -1) > + pg_log_fatal("The program \"%s\" is needed by %s but was\n" > + "not found in the same directory as \"%s\".\n" > + "Check your installation.", > + "pg_waldump", "pg_validatebackup", full_path); > + else > + pg_log_fatal("The program \"%s\" was found by \"%s\" but was\n" > + "not the same version as %s.\n" > + "Check your installation.", > + "pg_waldump", full_path, "pg_validatebackup"); > + } > + } ISTM, and this can definitely wait for another time, that we should have one wrapper doing all of this, instead of having quite a few copies of very similar logic to the above. > +/* > + * Attempt to parse the WAL files required to restore from backup using > + * pg_waldump. > + */ > +static void > +parse_required_wal(validator_context *context, char *pg_waldump_path, > + char *wal_directory, manifest_wal_range *first_wal_range) > +{ > + manifest_wal_range *this_wal_range = first_wal_range; > + > + while (this_wal_range != NULL) > + { > + char *pg_waldump_cmd; > + > + pg_waldump_cmd = psprintf("\"%s\" --quiet --path=\"%s\" --timeline=%u --start=%X/%X --end=%X/%X\n", > + pg_waldump_path, wal_directory, this_wal_range->tli, > + (uint32) (this_wal_range->start_lsn >> 32), > + (uint32) this_wal_range->start_lsn, > + (uint32) (this_wal_range->end_lsn >> 32), > + (uint32) this_wal_range->end_lsn); > + if (system(pg_waldump_cmd) != 0) > + report_backup_error(context, > + "WAL parsing failed for timeline %u", > + this_wal_range->tli); > + > + this_wal_range = this_wal_range->next; > + } > +} Should we have a function to properly escape paths in cases like this? Not that it's likely or really problematic, but the quoting for path could be "circumvented". Greetings, Andres Freund
pgsql-hackers by date: