Thread: BUG #18640: Replica Sync Failure After Downtime in Patroni HA Setup Due to Missing WAL Segments
From: PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18640
Logged by:          nikhil kotak
Email address:      kotak.nikhil@gmail.com
PostgreSQL version: 14.10
Operating system:   Redhat Enterprise Linux 7.9
Description:

We are running a Patroni HA setup with PostgreSQL in our environment, where we have 2 or more replicas depending on the application tier. During regular maintenance activities, such as OS patching or weekend server shutdowns, we stop the Patroni service on the replicas while the original leader remains up. These maintenance activities typically last around 2-3 hours.

Once the servers are returned to operational status by our UNIX System Administrator, we attempt to restart the Patroni service on the replicas. However, we frequently encounter an issue where the replicas remain out of sync, and the following error message appears in the alert logs:

"Could not receive data from WAL stream: ERROR: requested WAL segment <123> has already been removed"

Upon investigation, we observe that the WAL segment no longer exists on the replica (which was down), but it still exists on the leader. After manually copying the missing WAL segment from the leader to the replica, the replica successfully resumes syncing on its own.

Issue:
The problem is that the replica does not automatically attempt to fetch the missing WAL segments from the primary once it is brought back online. We are forced to manually intervene, which adds unnecessary complexity and delay in restoring HA functionality after downtime.

Expected Behavior:
We expect the replica to automatically request and fetch the missing WAL segments from the primary (leader) upon startup, ensuring it can sync up without manual intervention.

Could you please help us understand why this behavior occurs, and whether it can be addressed within PostgreSQL or Patroni to ensure automatic recovery for the replicas?
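
For anyone trying to reproduce the investigation step described above, a minimal sketch of the check might look like the following. It is not part of the original report: the function names are hypothetical, the sample segment name is illustrative (the report redacts the real one as <123>), and it assumes the standard 24-hex-character WAL segment file name and that the WAL directory of each node is readable from where the script runs.

```python
import re
from pathlib import Path
from typing import Optional

# WAL segment file names are 24 hexadecimal characters
# (timeline ID followed by the segment position).
SEGMENT_RE = re.compile(
    r"requested WAL segment ([0-9A-Fa-f]{24}) has already been removed"
)


def missing_segment(log_line: str) -> Optional[str]:
    """Return the segment name from the error line, or None if absent."""
    match = SEGMENT_RE.search(log_line)
    return match.group(1) if match else None


def segment_present(wal_dir: str, segment: str) -> bool:
    """Check whether a segment file exists in the given WAL directory."""
    return (Path(wal_dir) / segment).is_file()


if __name__ == "__main__":
    # Illustrative log line; the segment name is made up for this sketch.
    line = (
        "Could not receive data from WAL stream: ERROR: requested WAL "
        "segment 0000000100000000000000A2 has already been removed"
    )
    segment = missing_segment(line)
    print("segment named in the error:", segment)
    # Hypothetical path; replace with the node's actual pg_wal directory.
    print("present locally:", segment_present("/path/to/pg_wal", segment))
```

Running the same check against the WAL directory on the leader and on the replica would show on which side the segment named in the error is actually present.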