Re: trying again to get incremental backup - Mailing list pgsql-hackers

From Robert Haas
Subject Re: trying again to get incremental backup
Date
Msg-id CA+TgmoaiOhjd0pL630MSpZ_fyjFsARFC8RDBWxWL2FXPv65bJA@mail.gmail.com
Whole thread Raw
In response to Re: trying again to get incremental backup  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: trying again to get incremental backup
Re: trying again to get incremental backup
List pgsql-hackers
On Mon, Dec 4, 2023 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Considering all this, what I'm inclined to do is go and put
> UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
> accordingly. But first: does anybody see more problems here that I may
> have missed?

OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a
long comment explaining why that's believed to be necessary and
sufficient. I committed 0001 and 0002 from the previous series also,
since it doesn't seem like anyone has further comments on those
renamings.

This version also improves (at least, IMHO) the way that we wait for
WAL summarization to finish. Previously, you either caught up fully
within 60 seconds or you died. I didn't like that, because it seemed
like some people would get timeouts when the operation was slowly
progressing and would eventually succeed. So what this version does
is:

- Every 10 seconds, it logs a warning saying that it's still waiting
for WAL summarization. That way, a human operator can understand
what's happening easily, and cancel if they want.

- If 60 seconds go by without the WAL summarizer ingesting even a
single WAL record, it times out. That way, if the WAL summarizer is
dead or totally stuck (e.g. debugger attached, hung I/O) the user
won't be left waiting forever even if they never cancel. But if it's
just slow, it probably won't time out, and the operation should
eventually succeed.

To me, this seems like a reasonable compromise. It might be
unreasonable if WAL summarization is proceeding at a very low but
non-zero rate. But it's hard for me to think of a situation where that
will happen, with the exception of when CPU or I/O are badly
overloaded. But in those cases, the WAL generation rate is probably
also not that high, because apparently the system is paralyzed, so
maybe the wait won't even be that bad, especially given that
everything else on the box should be super-slow too. Plus, even if we
did want to time out in such a case, it's hard to know how slow is too
slow. In any event, I think most failures here are likely to be
complete failures, where the WAL summarizer just doesn't, so the fact
that this times out in those cases seems to me to likely be as much as
we need to do here. But if someone sees a problem with this or has a
clever idea how to make it better, I'm all ears.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

pgsql-hackers by date:

Previous
From: Nathan Bossart
Date:
Subject: Re: UBSan pointer overflow in xlogreader.c
Next
From: Robert Haas
Date:
Subject: Re: backtrace_on_internal_error