Re: Newly created replication slot may be invalidated by checkpoint - Mailing list pgsql-hackers

From Vitaly Davydov
Subject Re: Newly created replication slot may be invalidated by checkpoint
Msg-id 15922-68ca9280-4f-37de2c40@245457797
In response to Newly created replication slot may be invalidated by checkpoint  ("suyu.cmj" <mengjuan.cmj@alibaba-inc.com>)
List pgsql-hackers
Hi suyu.cmj

> The commit 2090edc6f32f652a2c introduced a change that the
> minimal restart_lsn is obtained at the start of checkpoint creation. If a
> replication slot is created and performs a WAL reservation concurrently, the
> WAL segment contains the new slot's restart_lsn could be removed by the ongoing
> checkpoint.

Thank you for reporting this issue. I agree: the slot invalidation problem
seems to occur in REL_17_STABLE and earlier, but it is not reproducible in 18
and later because the implementation is different. The problem may appear if
the first persistent slot is created during a checkpoint, when the slots'
oldest LSN is still invalid. I'm not sure how it behaves when other persistent
slots already exist. Invalidation is probably still possible if the reservation
happens with an LSN older than the oldest LSN of the existing slots.

In 17 and earlier versions, when a checkpoint starts, it takes the slots' oldest
LSN using XLogGetReplicationSlotMinimumLSN(). This value is used later for WAL
segment removal. If a new slot reserves WAL between the moment the slots' oldest
LSN is obtained and the WAL removal, it may be invalidated. This happens because
ReplicationSlotReserveWal() checks XLogCtl->lastRemovedSegNo, but the segments
have not been removed yet at that point. There is also a subtle case: the WAL
reservation completes while the checkpointer is between KeepLogSeg and
RemoveOldXlogFiles, where XLogCtl->lastRemovedSegNo is updated. In that case the
slot will not be invalidated, but the segments reserved by the new slot may be
removed, I guess.
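
To make the window easier to see, here is a minimal, single-threaded toy model
of the interleaving described above. It is only a sketch under stated
assumptions: all types and functions (ToyCluster, slot_reserve_wal,
checkpoint_compute_cutoff, and so on) are hypothetical stand-ins, not
PostgreSQL code, and the "recompute" variant is just one way to close the
window inside the model, not a description of the attached patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;    /* hypothetical stand-in for XLogRecPtr */
typedef uint64_t SegNo;  /* hypothetical stand-in for XLogSegNo */

#define INVALID_LSN ((Lsn) 0)
#define SEG_SIZE    ((Lsn) 16 * 1024 * 1024)

typedef struct ToyCluster
{
    Lsn   slot_restart_lsn;   /* INVALID_LSN while no slot reserves WAL */
    Lsn   redo_lsn;           /* redo pointer of the running checkpoint */
    SegNo last_removed_segno; /* models XLogCtl->lastRemovedSegNo */
} ToyCluster;

static SegNo lsn_to_segno(Lsn lsn) { return lsn / SEG_SIZE; }

/* Oldest segment the checkpoint must keep, given a slot minimum LSN. */
static SegNo checkpoint_compute_cutoff(const ToyCluster *c, Lsn slot_min_lsn)
{
    Lsn keep = c->redo_lsn;

    if (slot_min_lsn != INVALID_LSN && slot_min_lsn < keep)
        keep = slot_min_lsn;
    return lsn_to_segno(keep);
}

/* Reservation only checks segments that are ALREADY removed. */
static bool slot_reserve_wal(ToyCluster *c, Lsn restart_lsn)
{
    if (lsn_to_segno(restart_lsn) <= c->last_removed_segno)
        return false;
    c->slot_restart_lsn = restart_lsn;
    return true;
}

/* Remove every segment older than the cutoff. */
static void checkpoint_remove_old_segments(ToyCluster *c, SegNo cutoff)
{
    if (cutoff > 0 && cutoff - 1 > c->last_removed_segno)
        c->last_removed_segno = cutoff - 1;
}

static bool slot_wal_is_gone(const ToyCluster *c)
{
    return c->slot_restart_lsn != INVALID_LSN &&
           lsn_to_segno(c->slot_restart_lsn) <= c->last_removed_segno;
}

/* Returns true if the new slot ends up pointing at removed WAL. */
static bool run_interleaving(bool recompute_before_removal)
{
    /* No slots yet, new redo pointer in segment 10, segments 0..2 removed. */
    ToyCluster c = { INVALID_LSN, 10 * SEG_SIZE, 2 };

    /* 1. Checkpoint start: slot minimum is captured (invalid, no slots). */
    Lsn slot_min = c.slot_restart_lsn;

    /* 2. Concurrently, a new slot reserves WAL in segment 5 and succeeds,
     *    because segment 5 has not been removed yet. */
    bool reserved = slot_reserve_wal(&c, 5 * SEG_SIZE);
    assert(reserved);
    (void) reserved;

    /* Hypothetical mitigation: refresh the minimum just before removal. */
    if (recompute_before_removal)
        slot_min = c.slot_restart_lsn;

    /* 3. Checkpoint removes old segments using its view of the minimum. */
    checkpoint_remove_old_segments(&c, checkpoint_compute_cutoff(&c, slot_min));

    return slot_wal_is_gone(&c);
}
```

Calling run_interleaving(false) models the stale minimum: the checkpoint keeps
only segments from 10 onward, so segment 5 is removed under the new slot.
Calling run_interleaving(true) models a refreshed minimum, where segment 5 is
kept.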

In 17 and earlier we tried to create a compatible solution, in which the oldest
LSN is taken before the slots are synced to disk. In the master branch we added
a new last_saved_restart_lsn field to the ReplicationSlot structure, which
seems to be a better solution.

I prepared a simple fix [1] for 17 and earlier versions. It seems to fix the
problem with the creation of the first persistent slot. I also think it should
work as it did before the patch that introduced this bug.

I also made some changes to the original test script, for versions 17 ([2]) and
18 ([3]).

I am continuing to investigate and test it.

[1] 0001-Fix-invalidation-when-slot-is-created-during-checkpo.patch
[2] v2-17-0001-Newly-created-replication-slot-may-be-invalidated-by.patch
[3] v2-18-0001-Newly-created-replication-slot-may-be-invalidated-by.patch

With best regards,
Vitaly
