Re: Unnecessary delay in streaming replication due to replay lag - Mailing list pgsql-hackers

From sunil s
Subject Re: Unnecessary delay in streaming replication due to replay lag
Msg-id CAOG6S4_fFCU6iV4uvrdC8oDRmQqbjn8cBcpTXQjAE0W8sCQrAg@mail.gmail.com
In response to Re: Unnecessary delay in streaming replication due to replay lag  (Fujii Masao <masao.fujii@gmail.com>)
List pgsql-hackers
> When this parameter is set to 'startup' or 'consistency', what happens
> if replication begins early and the startup process fails to replay
> a WAL record—say, due to corruption—before reaching the replication
> start point? In that case, the standby might fail to recover correctly
> because of missing WAL records,

Let's compare the behavior with and without this patch.

Without the patch:
Scenario 1: With a large recovery_min_apply_delay (e.g., 2 hours)
Even in this case, the flush acknowledgment for the streamed WAL is sent, and the primary has already recycled those WAL files.
If a corrupted record is encountered later during replay, re-streaming those records is not possible.
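
For illustration, one way to observe this gap on the primary is the standard pg_stat_replication view (just a monitoring sketch, not part of the patch):

    -- On the primary: the standby has flushed WAL far ahead of what it has
    -- replayed, so segments up to flush_lsn can be recycled even though
    -- replay_lsn (held back by recovery_min_apply_delay) is far behind.
    SELECT application_name,
           flush_lsn,
           replay_lsn,
           pg_wal_lsn_diff(flush_lsn, replay_lsn) AS flushed_but_not_replayed,
           replay_lag
    FROM pg_stat_replication;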

Scenario 2: With recovery_min_apply_delay = 0, or in normal standby operation
In this case the restart_lsn is advanced based on flushPtr, allowing the primary to recycle the corresponding WAL files.
If a corrupt record is encountered while replaying the locally available WAL records, streaming would fail here as well, right?
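
The same point can be checked through the replication slot on the primary; a sketch, assuming the standby streams via a physical slot:

    -- On the primary: restart_lsn follows the standby's flush feedback, not its
    -- replay position, so older WAL can be recycled while replay still lags.
    SELECT s.slot_name,
           s.restart_lsn,
           r.flush_lsn,
           r.replay_lsn
    FROM pg_replication_slots s
    JOIN pg_stat_replication r ON r.pid = s.active_pid;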

With this patch:
Starting the WAL receiver early (say, at the consistency point) allows us to prefetch records earlier in the redo loop instead of waiting until we exhaust the locally available WAL.

Even if the WAL receiver had not started early, those WAL segments would have been recycled, since the restart_lsn would have advanced.
Therefore, the behaviour on record corruption is unchanged, while the patch brings the following benefits (see the example query after the list below):
    • Reduces replay lag when recovery_min_apply_delay is large, as reported in [2].
    • Mitigates lag for standbys that fall behind due to network bandwidth, latency, or slow disk writes (HDD).
    • Provides faster recovery.
    • Currently, until the WAL receiver is started, no commit acknowledgement is sent to a waiting transaction, since the WAL receiver is not running. With this change, the waiting transaction is unblocked as soon as we apply the record.
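
As mentioned above, a rough way to see the effect on the standby side is to compare the receive and replay positions (standard recovery functions, shown only as an illustrative sketch):

    -- On the standby: WAL received (streamed/prefetched) but not yet replayed.
    -- With the early WAL receiver start, WAL keeps arriving locally while the
    -- startup process works through the redo loop, instead of streaming only
    -- beginning once the locally available WAL is exhausted.
    SELECT pg_last_wal_receive_lsn() AS received_lsn,
           pg_last_wal_replay_lsn()  AS replayed_lsn,
           pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                           pg_last_wal_replay_lsn()) AS received_but_not_replayed;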

Under normal conditions as well, the slot is advanced based on flushPtr, even if the mode is remote_apply. We fixed a corruption scenario for a continuation record at the end of the last locally available segment (the cont record case in [1]); previously streaming started only at that last stage, i.e. at the corrupt record, whereas now it starts much earlier.

If the WAL records are still retained on the primary, then in case of a corrupt record we can restart the WAL receiver from the old LSN, which would be older than the LSN we start from as part of early streaming.
The same mechanism is already used on the standby when we switch between WAL sources.
I don’t see any scenario where the new workflow would break existing behavior.
Could you point out the specific case you’re concerned about? Understanding that will help us refine the implementation.
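
Whether the primary still retains the WAL such a restart would need can be checked from the slot; a sketch assuming PostgreSQL 13 or later, where pg_replication_slots exposes wal_status and safe_wal_size:

    -- On the primary: 'reserved' means the WAL needed by the slot is still
    -- retained; 'lost' means it has been removed and cannot be re-streamed.
    SELECT slot_name, restart_lsn, wal_status, safe_wal_size
    FROM pg_replication_slots;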

> while a transaction waiting for synchronous replication may have already been acknowledged as committed.
> Wouldn't that lead to a serious problem?

Without the patch:
If the synchronous replication mode is flush (synchronous_commit = on), then even with recovery_min_apply_delay set to a large value (e.g., 2 hours), the transaction is acknowledged as committed before the record is actually applied on the standby.
If the mode is remote_apply, the primary waits until the record is applied on the standby, which includes waiting for the configured recovery delay.
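
A minimal sketch of the two configurations being compared; 'standby1' is a hypothetical standby name, and the GUCs are the standard ones:

    -- Case 1: flush-level synchronous commit. COMMIT returns once 'standby1'
    -- has flushed the WAL, even if recovery_min_apply_delay holds back replay.
    ALTER SYSTEM SET synchronous_standby_names = 'standby1';
    ALTER SYSTEM SET synchronous_commit = 'on';
    SELECT pg_reload_conf();

    -- Case 2: remote_apply. COMMIT waits until 'standby1' has applied the
    -- record, so it also waits out recovery_min_apply_delay on the standby.
    ALTER SYSTEM SET synchronous_commit = 'remote_apply';
    SELECT pg_reload_conf();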

With the patch:
The behavior remains the same with respect to synchronous_commit: it still depends on whether the mode is flush or remote_apply.

So a similar situation can already be seen today when recovery_min_apply_delay is set to a large value (e.g., 2 hours), or in any slow-apply situation where all the WAL has been streamed but not yet replayed.

AFAIU this patch doesn't introduce any new behavior. In a normal situation where the WAL receiver is continuously streaming, we would have received those WAL segments anyway, without waiting for replay to finish, right?

The only difference is that we initiate the WAL receiver earlier in the recovery loop, which benefits us in several ways. On systems where replay is slow due to low-powered hardware, constrained system resources, low network bandwidth, or slow disk writes (HDD), the standby lags behind the primary.

Prefetching the WAL records early avoids more WAL building up on the primary, which reduces the risk of running out of disk space and also gives us faster standby recovery.
Faster recovery means faster application availability and lower downtime when synchronous commit is enabled.
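
One way to watch that build-up on the primary (again just standard monitoring, assuming a physical replication slot is in use):

    -- On the primary: bytes of WAL that must be retained for each slot.
    -- Earlier streaming and faster replay on the standby let restart_lsn
    -- advance sooner, keeping this number (and pg_wal usage) smaller.
    SELECT slot_name,
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes
    FROM pg_replication_slots;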


> src/test/recovery/t/050_archive_enabled_standby.pl is missing the
> ending newline. Is that intentional?
Thanks for reporting. Fixed in the new rebased patch.

Reference:
[1] https://github.com/postgres/postgres/commit/0668719801838aa6a8bda330ff9b3d20097ea844
[2] https://www.postgresql.org/message-id/201901301432.p3utg64hum27%40alvherre.pgsql

Thanks & Regards,
Sunil S
