BUG #18640: Replica Sync Failure After Downtime in Patroni HA Setup Due to Missing WAL Segments - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #18640: Replica Sync Failure After Downtime in Patroni HA Setup Due to Missing WAL Segments
Date
Msg-id 18640-2a2df650791eab97@postgresql.org
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      18640
Logged by:          nikhil kotak
Email address:      kotak.nikhil@gmail.com
PostgreSQL version: 14.10
Operating system:   Red Hat Enterprise Linux 7.9
Description:

We run PostgreSQL in a Patroni HA setup with two or more replicas, depending
on the application tier. During regular maintenance activities, such as OS
patching or weekend server shutdowns, we stop the Patroni service on the
replicas while the original leader remains up.

These maintenance activities typically last around 2-3 hours. Once the
servers are returned to operational status by our UNIX System Administrator,
we attempt to restart the Patroni service on the replicas. However, we
frequently encounter an issue where the replicas remain out of sync, and the
following error message appears in the alert logs:

"Could not receive data from WAL stream: ERROR: requested WAL segment <123>
has already been removed"

Upon investigation, we observe that the WAL segment no longer exists on the
replica (which was down), but it still exists on the leader. After manually
copying the missing WAL segment from the leader to the replica, the replica
successfully resumes syncing on its own.
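
The check described above (segment absent on the replica, present on the
leader) can be sketched as a small Python helper. This is only an
illustration under stated assumptions: the function name, the pg_wal path,
and the segment name are hypothetical, and the helper only lists files in a
directory. It does not talk to PostgreSQL or Patroni.

```python
from pathlib import Path


def missing_segments(requested, wal_dir):
    """Return the names in `requested` that have no file in `wal_dir`."""
    present = {p.name for p in Path(wal_dir).iterdir() if p.is_file()}
    return [name for name in requested if name not in present]


if __name__ == "__main__":
    # Hypothetical segment name and pg_wal path; substitute real values.
    wanted = ["000000010000000000000001"]
    wal_dir = Path("/var/lib/pgsql/14/data/pg_wal")
    if wal_dir.is_dir():
        print(missing_segments(wanted, wal_dir))
```

Run on both hosts with the segment named in the error message, an empty
list on the leader and a non-empty list on the replica would match what we
observe.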

Issue: The replica does not automatically fetch the missing WAL segments
from the primary once it is brought back online. We have to intervene
manually, which adds complexity and delay in restoring HA functionality
after downtime.

Expected Behavior: We expect the replica to automatically request and fetch
the missing WAL segments from the primary (leader) upon startup, ensuring it
can sync up without manual intervention.

Could you please help us understand why this behavior occurs, and whether it
can be addressed within PostgreSQL or Patroni to ensure automatic recovery
for the replicas?

