Home > mailing lists

Thread: pg_rewind WAL segments deletion pitfall

pg_rewind WAL segments deletion pitfall

From

Полина Бунгина

Date:

23 August 2022, 15:46:30

Hello,

It seems for me that there is currently a pitfall in the pg_rewind implementation.

Imagine the following situation:

There is a cluster consisting of a primary with the following configuration: wal_level=‘replica’, archive_mode=‘on’ and a replica.

The primary that is not fast enough in archiving WAL segments (e.g. network issues, high CPU/Disk load...)
The primary fails
The replica is promoted
We are not lucky enough, the new and the old primary’s timelines diverged, we need to run pg_rewind
We are even less lucky: the old primary still has some WAL segments with .ready signal files that were generated before the point of divergence and were not archived. (e.g. 000000020004D20200000095.done, 000000020004D20200000096.ready, 000000020004D20200000097.ready, 000000020004D20200000098.ready)
The promoted primary runs for some time and recycles the old WAL segments.
We revive the old primary and try to rewind it
When pg_rewind finished successfully, we see that the WAL segments with .ready files are removed, because they were already absent on the promoted replica. We end up in a situation where we completely lose some WAL segments, even though we had a clear sign that they were not archived and more importantly, pg_rewind read these segments while collecting information about the data blocks.
The old primary fails to start because of the missing WAL segments (more strictly, the records between the last common checkpoint and the point of divergence) with the following log record: "ERROR: requested WAL segment 000000020004D20200000096 has already been removed"

In this situation, after pg_rewind:
archived:

000000020004D20200000095

000000020004D20200000099.partial

000000030004D20200000099

the following segments are lost:

000000020004D20200000096

000000020004D20200000097

000000020004D20200000098

Thus, my thoughts are: why can’t pg_rewind be a little bit wiser in terms of creating filemap for WALs? Can it preserve the WAL segments that contain those potentially lost records (> the last common checkpoint and < the point of divergence) on the target? (see the patch attached)

If I am missing something however, please correct me or explain why it is not possible to implement this straightforward solution.

Thank you,

Polina Bungina

Attachment

v1-0001-pg_rewind-wal-deletion.patch

Re: pg_rewind WAL segments deletion pitfall

From

Kyotaro Horiguchi

Date:

25 August 2022, 07:49:17

In the first place, this is not a bug. (At least doesn't seem.)
If you mean to propose behavioral changes, -hackers is the place.

At Tue, 23 Aug 2022 17:46:30 +0200, Полина Бунгина <bungina@gmail.com> wrote in 
>    4. We are not lucky enough, the new and the old primary’s timelines
>    diverged, we need to run pg_rewind
>    5. We are even less lucky: the old primary still has some WAL segments
>    with .ready signal files that were generated before the point of divergence
>    and were not archived.

That dones't harm pg_rewind at all.

>    6. The promoted primary runs for some time and recycles the old WAL
>    segments.
>    7. We revive the old primary and try to rewind it
>    8. When pg_rewind finished successfully, we see that the WAL segments
>    with .ready files are removed, because they were already absent on the
>    promoted replica. We end up in a situation where we completely lose some
>    WAL segments, even though we had a clear sign that they were not
>    archived and
>    more importantly, pg_rewind read these segments while collecting
>    information about the data blocks.

In terms of syncing the old primary to the new primary, no data has
been lost. The "lost" segments are anyway unusable for the new primary
since they no longer compatible with it.  How do you intended to use
the WAL files for the incompatible cluster?

>    9. The old primary fails to start because of the missing WAL segments
>    (more strictly, the records between the last common checkpoint and the
>    point of divergence) with the following log record: "ERROR:  requested WAL
>    segment 000000020004D20200000096 has already been removed"

That means that the tail end of the rewound old primary has been lost
on the new primary's pg_wal.  In that case, you need to somehow
copy-in the archived WAL files on the new primary. You can just do
that or you can set up restore_command properly.
> Thus, my thoughts are: why can’t pg_rewind be a little bit wiser in terms
> of creating filemap for WALs? Can it preserve the WAL segments that contain
> those potentially lost records (> the last common checkpoint and  < the
> point of divergence) on the target? (see the patch attached)

Since they are not really needed once rewind completes.

> If I am missing something however, please correct me or explain why it is
> not possible to implement this straightforward solution.

Maybe you're mistaking the operation.  If I understand the situation
correctly, I think the following steps replays your "issue" and then
resolve that.


# killall -9 postgres
# rm -r oldprim newprim oldarch newarch oldprim.log newprim.log
mkdir newarch oldarch
initdb -k -D oldprim
echo "archive_mode = 'always'">> oldprim/postgresql.conf
echo "archive_command = 'cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf
pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start
psql -p 5432 -c 'create table t(a int)'
pg_basebackup -D newprim -p 5432
echo "primary_conninfo='host=/tmp port=5432'">> oldprim/postgresql.conf
echo "archive_command = 'cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf
touch newprim/standby.signal
pg_ctl -D newprim -o '-p 5433' -l newprim.log start
pg_ctl -D newprim promote
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5432 -c 'checkpoint'
pg_ctl -D oldprim stop
echo "restore_command = 'cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf
# pg_rewind -D oldprim --source-server='port=5433' # fails
pg_rewind -D oldprim --source-server='port=5433' -c
for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5433 -c 'checkpoint'
echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf
touch oldprim/standby.signal

postgres -D oldprim

>  FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000020000000000000003 has already
beenremoved
 

[ctrl-C]
======

Now that the old primary requires older WAL files *on the new
primary*.  Here, define restore command to do that.

=====
echo "restore_command='cp `pwd`/newarch/%f %p'">> oldprim/postgresql.conf
postgres -D oldprim
=====

Now the old primary run as the standby of the new primary.

> LOG:  restored log file "000000020000000000000006" from archive
> LOG:  consistent recovery state reached at 0/30020B0
> LOG:  database system is ready to accept read-only connections

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: pg_rewind WAL segments deletion pitfall

From

Alexander Kukushkin

Date:

25 August 2022, 08:34:40

Hello Kyotaro,

On Thu, 25 Aug 2022 at 09:49, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

In the first place, this is not a bug. (At least doesn't seem.)
If you mean to propose behavioral changes, -hackers is the place.

Well, maybe... We can always change it.

> 8. When pg_rewind finished successfully, we see that the WAL segments
> with .ready files are removed, because they were already absent on the
> promoted replica. We end up in a situation where we completely lose some
> WAL segments, even though we had a clear sign that they were not
> archived and
> more importantly, pg_rewind read these segments while collecting
> information about the data blocks.

In terms of syncing the old primary to the new primary, no data has
been lost. The "lost" segments are anyway unusable for the new primary
since they no longer compatible with it. How do you intended to use
the WAL files for the incompatible cluster?

These files are required for the old primary to start as a replica.

> 9. The old primary fails to start because of the missing WAL segments
> (more strictly, the records between the last common checkpoint and the
> point of divergence) with the following log record: "ERROR: requested WAL
> segment 000000020004D20200000096 has already been removed"

That means that the tail end of the rewound old primary has been lost
on the new primary's pg_wal.

Correct. The old primary was down for about 20m and we have checkpoint_timeout = 5m, so the new primary already recycled them.

In that case, you need to somehow
copy-in the archived WAL files on the new primary. You can just do
that or you can set up restore_command properly.

These files never made it to the archive because the server crashed. The only place where they existed was pg_wal in the old primary.

> Thus, my thoughts are: why can’t pg_rewind be a little bit wiser in terms
> of creating filemap for WALs? Can it preserve the WAL segments that contain
> those potentially lost records (> the last common checkpoint and < the
> point of divergence) on the target? (see the patch attached)

Since they are not really needed once rewind completes.

The pg_rewind creates the backup_label file with START WAL LOCATION and CHECKPOINT LOCATION that point to the last common checkpoint.

Removed files are between the last common checkpoint and diverged WAL location, and therefore are required for Postgres to do successful recovery.

Since these files never made it to the archive and are also absent on the new primary the old primary can't start as a replica.

And I will emphasize one more time, that these files were removed by pg_rewind despite the known fact that they are required to perform a recovery.

> If I am missing something however, please correct me or explain why it is
> not possible to implement this straightforward solution.

Maybe you're mistaking the operation.

We are not (Patroni author is here).

If I understand the situation
correctly, I think the following steps replays your "issue" and then
resolve that.

# killall -9 postgres
# rm -r oldprim newprim oldarch newarch oldprim.log newprim.log
mkdir newarch oldarch
initdb -k -D oldprim
echo "archive_mode = 'always'">> oldprim/postgresql.conf

With archive_mode = always you can't reproduce it.

It is very rarely people set it to always in production due to the overhead.

echo "archive_command = 'cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf
pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start
psql -p 5432 -c 'create table t(a int)'
pg_basebackup -D newprim -p 5432
echo "primary_conninfo='host=/tmp port=5432'">> oldprim/postgresql.conf
echo "archive_command = 'cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf
touch newprim/standby.signal
pg_ctl -D newprim -o '-p 5433' -l newprim.log start
pg_ctl -D newprim promote
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5432 -c 'checkpoint'
pg_ctl -D oldprim stop

The archive_mode has to be set to on and the archive_command should be failing when you do pg_ctl -D oldprim stop

echo "restore_command = 'cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf
# pg_rewind -D oldprim --source-server='port=5433' # fails
pg_rewind -D oldprim --source-server='port=5433' -c
for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5433 -c 'checkpoint'
echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf
touch oldprim/standby.signal

postgres -D oldprim

> FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000000000003 has already been removed

Regards,

Alexander Kukushkin