Re: Timeline issue if StartupXLOG() is interrupted right before end-of-recovery record is done - Mailing list pgsql-hackers

From Andrey M. Borodin
Subject Re: Timeline issue if StartupXLOG() is interrupted right before end-of-recovery record is done
Date
Msg-id 171297CF-48B3-49D1-B5B2-BAFEFCA63712@yandex-team.ru
List pgsql-hackers
Hi Roman!
Thanks for raising the issue. I think the root cause is that many systems assume that a higher timeline ID means a
more recent timeline write. This invariant does not hold. It's not even a more recent timeline start.
"latest timeline" effectively means "random timeline".

> On 17 Jan 2025, at 06:05, Roman Eskin <r.eskin@arenadata.io> wrote:
>
> 5. Switch back instance_1 and instance_2 to the original
> configuration. And here, when we try to start instance_2 as Replica,
> we'll get a FATAL:
> "FATAL: requested timeline 2 is not a child of this server's history
> DETAIL: Latest checkpoint is at 0/303FF90 on timeline 1, but in the
> history of the requested timeline, the server forked off from that
> timeline at 0/3023538."

I think here you can just specify the target timeline for the standby instance_1 and it will continue recovery from
instance_2.
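
For reference, the target timeline is controlled by the recovery_target_timeline setting, which accepts 'current', 'latest' or a numeric timeline ID. The FATAL quoted above comes from a consistency check between the latest checkpoint and the history of the requested timeline. Below is a minimal Python sketch of that check as I understand it. It is a simplified model, not PostgreSQL's actual code, and the names parse_lsn and checkpoint_is_in_history are made up for illustration.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN such as '0/303FF90' into a single integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def checkpoint_is_in_history(history, checkpoint_tli, checkpoint_lsn, target_tli):
    """Return True if the checkpoint lies on the requested timeline's history.

    history is a list of (ancestor_tli, switch_lsn) pairs, modelled on the
    entries of a timeline .history file: each ancestor timeline was left
    at switch_lsn. This layout is an assumption made for the sketch.
    """
    if checkpoint_tli == target_tli:
        return True
    for ancestor_tli, switch_lsn in history:
        if ancestor_tli == checkpoint_tli:
            # A checkpoint written after the fork point belongs to a
            # branch that the requested timeline never followed.
            return checkpoint_lsn <= switch_lsn
    # The checkpoint's timeline is not an ancestor at all.
    return False


# Values taken from the FATAL message quoted above:
# timeline 2 forked off timeline 1 at 0/3023538,
# but the latest checkpoint is at 0/303FF90 on timeline 1.
history_of_tli_2 = [(1, parse_lsn("0/3023538"))]
ok = checkpoint_is_in_history(history_of_tli_2, 1, parse_lsn("0/303FF90"), 2)
print("timeline 2 reachable from checkpoint:", ok)
```

With the values from the report the check fails, because the checkpoint on timeline 1 is past the point where timeline 2 forked off.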

Having said that, I must admit that we observe something similar approximately 2 times a week. We have tried several
fixes, but still have to live with it.
In our case we have a "resetup" cron job, which will automatically rebuild a replica from backup if Postgres cannot start
recovery for some hours.
So in our case this looks like an extra 3 hours of standby downtime.

I'm not sure if this is a result of pgconsul not setting up target timeline or some other error...

Persisting the recovery signal file for some _timeout_ seems super dangerous to me. In distributed systems every extra
_timeout_ is a source of complexity, uncertainty and despair.

Thanks!


Best regards, Andrey Borodin.

