Re: Timeline switching with partial WAL records can break replica recovery - Mailing list pgsql-hackers

From Alyona Vinter
Subject Re: Timeline switching with partial WAL records can break replica recovery
Msg-id CAGWv16JqHWZRnWUcTTEMF=0f+zqpboU4t+eKMANeTJObecYPXA@mail.gmail.com
In response to Re: Timeline switching with partial WAL records can break replica recovery  (Nataliia <k.natalissa@gmail.com>)
List pgsql-hackers
Hi!

I've noticed an issue with pg_rewind caused by my patches. 

Here is the pg_rewind output that demonstrates the issue:
pg_rewind: Source timeline history:
pg_rewind: 1: 0/00000000 - 0/03002048
pg_rewind: 2: 0/03002048 - 0/00000000
pg_rewind: Target timeline history:
pg_rewind: 1: 0/00000000 - 0/00000000
pg_rewind: servers diverged at WAL location 0/03002048 on timeline 1
pg_rewind: error: could not find previous WAL record at 0/03002048: invalid record length at 0/03002048: expected at least 24, got 0

When the common timeline ends with an overwritten contrecord, the divergence point may not be the start of a valid WAL record on the target. pg_rewind then fails while reading WAL at that position, which makes the rewind impossible.
To handle this case, I suggest changing where the search for the checkpoint preceding the divergence point begins: when the common timeline is unfinished on the target, start from the last checkpoint on the target rather than from the divergence point itself. This ensures we always begin from a known-valid position in WAL.
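
To make the idea concrete, here is a toy Python model of the proposed choice of starting position. It is only a sketch, not the patch and not pg_rewind code: the names (`choose_search_start`, `find_checkpoint_before`, `Record`, `TimelineEntry`) and all record positions except the divergence point are made up for illustration, and WAL is reduced to a dictionary of records linked by "previous record" pointers.

```python
from dataclasses import dataclass
from typing import Dict, List

INVALID_LSN = 0  # the logs print an open-ended timeline bound as 0/00000000


def parse_lsn(text: str) -> int:
    """Convert the 'hi/lo' hex notation used in the logs to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


@dataclass(frozen=True)
class TimelineEntry:  # hypothetical stand-in for one timeline history line
    tli: int
    begin: int
    end: int  # INVALID_LSN means the timeline is unfinished


@dataclass(frozen=True)
class Record:  # hypothetical, heavily simplified WAL record
    lsn: int
    prev: int  # LSN of the previous record, INVALID_LSN if none
    is_checkpoint: bool


def choose_search_start(target_history: List[TimelineEntry], common_tli: int,
                        divergence: int, last_checkpoint: int) -> int:
    """Pick where the backward search for a checkpoint begins."""
    entry = next(e for e in target_history if e.tli == common_tli)
    if entry.end == INVALID_LSN:
        # Common timeline is unfinished on the target: the divergence point
        # may not be a record boundary, so start from a known-valid record.
        return last_checkpoint
    return divergence


def find_checkpoint_before(wal: Dict[int, Record], start: int,
                           divergence: int) -> int:
    """Follow prev pointers from start to a checkpoint located before divergence."""
    pos = start
    while pos != INVALID_LSN:
        rec = wal.get(pos)
        if rec is None:
            raise LookupError(f"no valid record at {pos >> 32:X}/{pos & 0xFFFFFFFF:08X}")
        if rec.is_checkpoint and rec.lsn < divergence:
            return rec.lsn
        pos = rec.prev
    raise LookupError("no checkpoint found before the divergence point")


# Illustrative target WAL: no record starts at the divergence point.
DIVERGENCE = parse_lsn("0/03002048")
TARGET_HISTORY = [TimelineEntry(tli=1, begin=INVALID_LSN, end=INVALID_LSN)]
LAST_CHECKPOINT = parse_lsn("0/03001F80")
TARGET_WAL = {r.lsn: r for r in [
    Record(parse_lsn("0/03000028"), INVALID_LSN, True),
    Record(parse_lsn("0/03001000"), parse_lsn("0/03000028"), False),
    Record(LAST_CHECKPOINT, parse_lsn("0/03001000"), True),
]}

start = choose_search_start(TARGET_HISTORY, 1, DIVERGENCE, LAST_CHECKPOINT)
checkpoint = find_checkpoint_before(TARGET_WAL, start, DIVERGENCE)
```

In this model, calling `find_checkpoint_before(TARGET_WAL, DIVERGENCE, DIVERGENCE)` raises, which mirrors the failure in the log above, while starting from the target's last checkpoint succeeds. When the common timeline has a finite end on the target, the sketch keeps the existing behaviour and starts from the divergence point.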

I'd appreciate any feedback!

Best Regards,
Alyona Vinter