The following bug has been logged on the website:
Bug reference: 18640
Logged by: nikhil kotak
Email address: kotak.nikhil@gmail.com
PostgreSQL version: 14.10
Operating system: Red Hat Enterprise Linux 7.9
Description:
We are running a Patroni HA setup with PostgreSQL, with two or more replicas
depending on the application tier. During regular maintenance activities, such
as OS patching or weekend server shutdowns, we stop the Patroni service on the
replicas while the original leader remains up.
These maintenance activities typically last around 2-3 hours. Once our UNIX
System Administrator brings the servers back up, we restart the Patroni
service on the replicas. However, the replicas frequently remain out of sync,
and the following error message appears in the PostgreSQL logs:
"Could not receive data from WAL stream: ERROR: requested WAL segment <123>
has already been removed"
Upon investigation, we observe that the WAL segment no longer exists on the
replica (which was down), but it still exists on the leader. After manually
copying the missing WAL segment from the leader to the replica, the replica
successfully resumes syncing on its own.
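To work out which segment file to look for on each server, the file name can
be derived from a timeline ID and an LSN. The following is only a minimal
sketch, assuming the default 16 MB wal_segment_size; the function names
(parse_lsn, wal_file_name) are hypothetical helpers and not part of PostgreSQL
or Patroni.

```python
# Sketch: derive the WAL segment file name that contains a given LSN.
# Assumption: default wal_segment_size of 16 MB; pass another size if the
# cluster was initialized differently.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Convert an LSN in 'X/Y' hex notation into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(timeline: int, lsn: str,
                  segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the 24-character name of the segment containing the LSN."""
    segment_number = parse_lsn(lsn) // segment_size
    # Number of segments per 4 GB "log id" (256 for 16 MB segments).
    segments_per_log_id = 0x100000000 // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_log_id,
        segment_number % segments_per_log_id,
    )


if __name__ == "__main__":
    # Illustrative values only; not taken from our servers.
    print(wal_file_name(1, "0/3000000"))
```

The resulting name can then be compared against the contents of pg_wal on the
leader and on the replica.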
Issue: The replica does not automatically attempt to fetch the missing WAL
segments from the primary once it is brought back online. We have to
intervene manually, which adds unnecessary complexity and delays restoring HA
functionality after downtime.
Expected Behavior: We expect the replica to automatically request and fetch
the missing WAL segments from the primary (leader) upon startup, ensuring it
can sync up without manual intervention.
Could you please help us understand why this behavior occurs, and whether it
can be addressed within PostgreSQL or Patroni to ensure automatic recovery
for the replicas?
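For reference, here is a rough way to estimate how much WAL the leader would
have to retain to cover a maintenance window of the length described above.
This is only a sketch: the WAL generation rate and the safety factor are
made-up illustrative values, the function name is hypothetical, and whether
raising wal_keep_size (or relying on replication slots) is the right approach
for our setup is part of what we are asking.

```python
import math


def estimate_wal_keep_size_mb(wal_rate_mb_per_s: float,
                              downtime_hours: float,
                              safety_factor: float = 1.5) -> int:
    """Estimate the WAL volume (in MB) generated while a replica is down.

    wal_rate_mb_per_s: assumed average WAL generation rate on the leader.
    downtime_hours:    expected length of the maintenance window.
    safety_factor:     headroom multiplier (assumption, must be >= 1).
    """
    if wal_rate_mb_per_s < 0 or downtime_hours < 0:
        raise ValueError("rate and downtime must be non-negative")
    if safety_factor < 1:
        raise ValueError("safety_factor must be at least 1")
    seconds = downtime_hours * 3600
    return math.ceil(wal_rate_mb_per_s * seconds * safety_factor)


if __name__ == "__main__":
    # Illustrative only: 1 MB/s of WAL over a 3-hour window.
    print(estimate_wal_keep_size_mb(1.0, 3))
```

The result is in megabytes, which is the default unit of the wal_keep_size
setting in PostgreSQL 13 and later.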