Thread: RE: BUG #16922: In cascading replication, a standby server aborted when an upstream standby server promoted

> The replication between "NODE-A" and "NODE-B" is synchronous replication,
> and between "NODE-B" and "NODE-C" is asynchronous.
> 
> "NODE-A" <-[synchronous]-> "NODE-B" <-[non-synchronous]-> "NODE-C"
> 
> When the primary server "NODE-A" crashed due to full WAL storage and
> "NODE-B" promoted, the downstream standby server "NODE-C" aborted with
> following messages.
> 
> 2021-03-11 11:26:28.470 JST [85228] LOG:  invalid contrecord length 26 at
> 0/5FFFFF0
> 2021-03-11 11:26:28.470 JST [85232] FATAL:  terminating walreceiver process
> due to administrator command
> 2021-03-11 11:26:28.470 JST [85228] PANIC:  could not open file
> "pg_wal/000000020000000000000005": No such file or directory
> 2021-03-11 11:26:28.492 JST [85260] LOG:  started streaming WAL from primary
> at 0/5000000 on timeline 2
> 2021-03-11 11:26:29.260 JST [85227] LOG:  startup process (PID 85228) was
> terminated by signal 6: Aborted

I would like to clarify the conditions under which this abort occurs so that I can explain them to the customer.

Based on the pg_waldump results, I think the conditions are as follows:

1) A WAL record that continues into the following segment file is only partially written at the end of a WAL segment file, and
2) the segment file in 1) is the last WAL segment file that the standby server received, and
3) the standby server is promoted.

I think these conditions are met only when the primary server crashes due to full WAL storage.
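
For reference, I checked condition 1) on NODE-C roughly as follows (a sketch; the path and segment name are from my reproduction environment, and the 16MB boundary assumes the default wal_segment_size):

# Show the last readable records of the standby's last received segment.
# In the reproduced case, the final complete record ends just before the 16MB
# segment boundary and the record after it continues into the next segment
# file, which the standby never received.
pg_waldump nodec/pg_wal/000000010000000000000005 2>&1 | tail -n 2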

Is my understanding correct?

Regards.
Yusuke, Egashira.


-----Original Message-----
From: PG Bug reporting form <noreply@postgresql.org> 
Sent: Thursday, March 11, 2021 5:06 PM
To: pgsql-bugs@lists.postgresql.org
Cc: Egashira, Yusuke/江頭 勇佑 <egashira.yusuke@fujitsu.com>
Subject: BUG #16922: In cascading replication, a standby server aborted when an upstream standby server promoted

The following bug has been logged on the website:

Bug reference:      16922
Logged by:          yusuke egashira
Email address:      egashira.yusuke@fujitsu.com
PostgreSQL version: 13.2
Operating system:   Red Hat Enterprise Linux 8
Description:        

Hi.

My customer uses PostgreSQL with a cascading replication.
The replication between "NODE-A" and "NODE-B" is synchronous replication,
and between "NODE-B" and "NODE-C" is asynchronous.

"NODE-A" <-[synchronous]-> "NODE-B" <-[non-synchronous]-> "NODE-C"

When the primary server "NODE-A" crashed due to full WAL storage and
"NODE-B" promoted, the downstream standby server "NODE-C" aborted with
following messages.

2021-03-11 11:26:28.470 JST [85228] LOG:  invalid contrecord length 26 at
0/5FFFFF0
2021-03-11 11:26:28.470 JST [85232] FATAL:  terminating walreceiver process
due to administrator command
2021-03-11 11:26:28.470 JST [85228] PANIC:  could not open file
"pg_wal/000000020000000000000005": No such file or directory
2021-03-11 11:26:28.492 JST [85260] LOG:  started streaming WAL from primary
at 0/5000000 on timeline 2
2021-03-11 11:26:29.260 JST [85227] LOG:  startup process (PID 85228) was
terminated by signal 6: Aborted
2021-03-11 11:26:29.260 JST [85227] LOG:  terminating any other active
server processes
2021-03-11 11:26:29.290 JST [85227] LOG:  database system is shut down


I found this behavior in PostgreSQL 13.2, 12.6, 11.11, 10.16, 9.6.21 and
9.5.24.

The reproduction procedure on PostgreSQL 13.2 on RHEL 8 is:

===========================
# Create temporary wal data device.
mkdir -p /tmp/wal
dd if=/dev/zero of=/tmp/waldata bs=100MB count=1
sudo mkfs -t xfs /tmp/waldata
sudo mount -t xfs -o loop /tmp/waldata /tmp/wal
sudo chown postgres:postgres /tmp/wal
# mkdir -p /tmp/wal/pg_wal

# Create primary NODE-A.
initdb -D /tmp/nodea -X /tmp/wal/pg_wal

cat >>/tmp/nodea/postgresql.conf <<_EOF_
port=5432
synchronous_standby_names = 'nodeb'
restart_after_crash = off
wal_keep_size = 10GB
_EOF_

pg_ctl -D /tmp/nodea start -w -l /tmp/nodea_logfile

# Create standby NODE-B.
pg_basebackup -D /tmp/nodeb -R -X fetch -d "postgresql:///:5432?application_name=nodeb"
echo "port = 5433" >> /tmp/nodeb/postgresql.conf
pg_ctl -D /tmp/nodeb start -w -l /tmp/nodeb_logfile

# Create standby NODE-C
pg_basebackup -D /tmp/nodec -p 5433 -R -X fetch -d "postgresql:///:5433?application_name=nodec"
echo "port = 5434" >> /tmp/nodec/postgresql.conf 
pg_ctl -D /tmp/nodec start -w

# Execute SQL to fill up the WAL storage.
createdb -p 5432
cat <<_psql | psql -p 5432
  create table test (a int, b text);
  insert into test values (generate_series(1,10000000), 'abcdefghijk');
_psql

# ** After primary NODE-A goes down **
pg_ctl -D /tmp/nodeb promote 

# At this point, NODE-C will go down.
===========================
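
(For reference, not part of the procedure above: I confirmed that NODE-C had gone down roughly like this.)

# Check whether NODE-C is still running and inspect its cluster state.
pg_ctl -D /tmp/nodec status
pg_controldata /tmp/nodec | grep "Database cluster state"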

The stack trace of the core dump from a debug build is as follows:
# Note: this stack trace was taken in a different environment from the other
# attached logs.

#0  0x00007f73e3f1f8df in raise () from /lib64/libc.so.6
#1  0x00007f73e3f09cf5 in abort () from /lib64/libc.so.6
#2  0x0000000000878c83 in errfinish (filename=<optimized out>,
filename@entry=0x8cc67e "xlog.c", lineno=lineno@entry=3741,
funcname=funcname@entry=0x8d45c8 <__func__.28536> "XLogFileRead") at
elog.c:592
#3  0x0000000000507730 in XLogFileRead (segno=<optimized out>, tli=2,
source=source@entry=XLOG_FROM_STREAM, notfoundOk=notfoundOk@entry=false,
emode=22) at xlog.c:3739
#4  0x0000000000509c62 in WaitForWALToBecomeAvailable (tliRecPtr=134217712,
fetching_ckpt=<optimized out>, randAccess=false, RecPtr=117448704) at
xlog.c:12411
#5  XLogPageRead (xlogreader=0x282b478, targetPagePtr=<optimized out>,
reqLen=8192, targetRecPtr=134217712, readBuf=0x2858af8 "\006\321\a") at
xlog.c:11908
#6  0x0000000000517e25 in ReadPageInternal (state=state@entry=0x282b478,
pageptr=pageptr@entry=134209536, reqLen=reqLen@entry=8192) at
xlogreader.c:608
#7  0x000000000051886a in XLogReadRecord (state=state@entry=0x282b478,
errormsg=errormsg@entry=0x7ffd148db3f8) at xlogreader.c:329
#8  0x000000000050797e in ReadRecord (xlogreader=xlogreader@entry=0x282b478,
emode=emode@entry=15, fetching_ckpt=fetching_ckpt@entry=false) at
xlog.c:4346
#9  0x000000000050e205 in ReadRecord (fetching_ckpt=false, emode=15,
xlogreader=0x282b478) at xlog.c:4340
#10 StartupXLOG () at xlog.c:7385
#11 0x00000000006f72d1 in StartupProcessMain () at startup.c:204
#12 0x000000000051cf5b in AuxiliaryProcessMain (argc=argc@entry=2,
argv=argv@entry=0x7ffd148df240) at bootstrap.c:443
#13 0x00000000006f343f in StartChildProcess (type=StartupProcess) at
postmaster.c:5492
#14 0x00000000006f6938 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x282acc0) at postmaster.c:1404
#15 0x000000000047ecd9 in main (argc=3, argv=0x282acc0) at main.c:210
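
For reference, the backtrace above was obtained roughly like this (a sketch; the binary path matches my install prefix, and the core file path is a placeholder that depends on the kernel's core_pattern setting):

# Print the backtrace of the crashed startup process from its core dump.
gdb -batch -ex "bt" $HOME/pgsql/bin/postgres /path/to/core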


I also attach the debug5 level logs of NODE-C.
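
(The debug5 log below was taken with the log level raised on NODE-C, roughly like this; a sketch, applied before starting NODE-C.)

# Raise the log level so that WAL source switching is visible in the log.
echo "log_min_messages = debug5" >> /tmp/nodec/postgresql.conf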

2021-03-11 11:06:39.136 JST [84342] LOG:  replication terminated by primary
server
2021-03-11 11:06:39.136 JST [84342] DETAIL:  End of WAL reached on timeline
1 at 0/6000000.
2021-03-11 11:06:39.140 JST [84342] LOG:  fetching timeline history file for
timeline 2 from primary server
2021-03-11 11:06:39.148 JST [84342] DEBUG:  walreceiver ended streaming and
awaits new instructions
2021-03-11 11:06:39.148 JST [84338] LOG:  new target timeline is 2
2021-03-11 11:06:39.148 JST [84338] DEBUG:  switched WAL source from stream
to archive after failure
2021-03-11 11:06:39.148 JST [84338] DEBUG:  could not open file
"pg_wal/000000020000000000000006": No such file or directory
2021-03-11 11:06:39.148 JST [84338] DEBUG:  switched WAL source from archive
to stream after failure
2021-03-11 11:06:39.148 JST [84342] LOG:  restarted WAL streaming at
0/6000000 on timeline 2
2021-03-11 11:06:39.148 JST [84342] DEBUG:  sendtime 2021-03-11
11:06:39.148674+09 receipttime 2021-03-11 11:06:39.148723+09 replication
apply delay 17910 ms transfer latency 1 ms
2021-03-11 11:06:39.148 JST [84342] DEBUG:  creating and filling new WAL
file
2021-03-11 11:06:39.240 JST [84342] DEBUG:  done creating and filling new
WAL file
2021-03-11 11:06:39.240 JST [84342] DEBUG:  sending write 0/6000048 flush
0/5FFFFF0 apply 0/5FFFFF0
2021-03-11 11:06:39.242 JST [84342] DEBUG:  sending write 0/6000048 flush
0/6000048 apply 0/5FFFFF0
2021-03-11 11:06:39.486 JST [84342] DEBUG:  sendtime 2021-03-11
11:06:39.486387+09 receipttime 2021-03-11 11:06:39.486449+09 replication
apply delay 18248 ms transfer latency 1 ms
2021-03-11 11:06:39.486 JST [84342] DEBUG:  sending write 0/6000080 flush
0/6000048 apply 0/5FFFFF0
2021-03-11 11:06:39.486 JST [84342] DEBUG:  sending write 0/6000080 flush
0/6000080 apply 0/5FFFFF0
2021-03-11 11:06:39.489 JST [84338] LOG:  invalid contrecord length 26 at
0/5FFFFF0
2021-03-11 11:06:39.489 JST [84342] FATAL:  terminating walreceiver process
due to administrator command
2021-03-11 11:06:39.489 JST [84342] DEBUG:  shmem_exit(1): 1
before_shmem_exit callbacks to make
2021-03-11 11:06:39.489 JST [84342] DEBUG:  shmem_exit(1): 6 on_shmem_exit
callbacks to make
2021-03-11 11:06:39.489 JST [84342] DEBUG:  proc_exit(1): 2 callbacks to
make
2021-03-11 11:06:39.489 JST [84342] DEBUG:  exit(1)
2021-03-11 11:06:39.489 JST [84342] DEBUG:  shmem_exit(-1): 0
before_shmem_exit callbacks to make
2021-03-11 11:06:39.489 JST [84342] DEBUG:  shmem_exit(-1): 0 on_shmem_exit
callbacks to make
2021-03-11 11:06:39.489 JST [84342] DEBUG:  proc_exit(-1): 0 callbacks to
make
2021-03-11 11:06:39.489 JST [84338] DEBUG:  switched WAL source from stream
to archive after failure
2021-03-11 11:06:39.489 JST [84338] DEBUG:  could not open file
"pg_wal/000000020000000000000005": No such file or directory
2021-03-11 11:06:39.489 JST [84338] DEBUG:  switched WAL source from archive
to stream after failure
2021-03-11 11:06:39.489 JST [84338] PANIC:  could not open file
"pg_wal/000000020000000000000005": No such file or directory
2021-03-11 11:06:39.501 JST [84337] DEBUG:  reaping dead processes
2021-03-11 11:06:39.513 JST [84367] DEBUG:  find_in_dynamic_libpath: trying
"/home/egashira/pgsql/lib/libpqwalreceiver"
2021-03-11 11:06:39.513 JST [84367] DEBUG:  find_in_dynamic_libpath: trying
"/home/egashira/pgsql/lib/libpqwalreceiver.so"
2021-03-11 11:06:39.541 JST [84367] LOG:  started streaming WAL from primary
at 0/5000000 on timeline 2
2021-03-11 11:06:39.542 JST [84367] DEBUG:  sendtime 2021-03-11
11:06:39.536374+09 receipttime 2021-03-11 11:06:39.54219+09 replication
apply delay 56 ms transfer latency 6 ms
2021-03-11 11:06:39.542 JST [84367] DEBUG:  creating and filling new WAL
file
2021-03-11 11:06:39.696 JST [84367] DEBUG:  done creating and filling new
WAL file


In addition, this issue may not occur, depending on the timing.
When the issue did not occur, the last WAL record of NODE-A had not been
transferred to NODE-B.
I attach the pg_waldump results for NODE-A, NODE-B, and NODE-C.

[Reproduced]
$ pg_waldump nodea/pg_wal/000000010000000000000005  | tail -n 4
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05FFFED0, prev 0/05FFFE88, desc: INSERT off 181 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3759
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05FFFF18, prev 0/05FFFED0, desc: INSERT off 182 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3759
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05FFFF60, prev 0/05FFFF18, desc: INSERT off 183 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3759
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05FFFFA8, prev 0/05FFFF60, desc: INSERT off 184 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3759

$ pg_waldump nodeb/pg_wal/000000010000000000000005  | tail -n 4
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05FFFED0, prev 0/05FFFE88, desc: INSERT off 181 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3759
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05FFFF18, prev 0/05FFFED0, desc: INSERT off 182 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3759
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05FFFF60, prev 0/05FFFF18, desc: INSERT off 183 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3759
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05FFFFA8, prev 0/05FFFF60, desc: INSERT off 184 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3759

$ pg_waldump nodec/pg_wal/000000010000000000000005  | tail -n 4
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05FFFED0, prev 0/05FFFE88, desc: INSERT off 181 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3759
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05FFFF18, prev 0/05FFFED0, desc: INSERT off 182 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3759
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05FFFF60, prev 0/05FFFF18, desc: INSERT off 183 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3759
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05FFFFA8, prev 0/05FFFF60, desc: INSERT off 184 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3759


[Not reproduced]
$ pg_waldump nodea/pg_wal/000000010000000000000005  | tail -n 4
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05FFFED0, prev 0/05FFFE88, desc: INSERT off 181 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3759
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05FFFF18, prev 0/05FFFED0, desc: INSERT off 182 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3759
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05FFFF60, prev 0/05FFFF18, desc: INSERT off 183 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3759
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05FFFFA8, prev 0/05FFFF60, desc: INSERT off 184 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3759

$ pg_waldump nodeb/pg_wal/000000010000000000000005  | tail -n 4
pg_waldump: fatal: error in WAL record at 0/5BFFF88: invalid magic number
0000 in log segment 000000010000000000000005, offset 12582912
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05BFFEB0, prev 0/05BFFE68, desc: INSERT off 2 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3446
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05BFFEF8, prev 0/05BFFEB0, desc: INSERT off 3 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3446
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05BFFF40, prev 0/05BFFEF8, desc: INSERT off 4 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3446
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05BFFF88, prev 0/05BFFF40, desc: INSERT off 5 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3446

$ pg_waldump nodec/pg_wal/000000010000000000000005  | tail -n 4
pg_waldump: fatal: error in WAL record at 0/5BFFF88: invalid magic number
0000 in log segment 000000010000000000000005, offset 12582912
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05BFFEB0, prev 0/05BFFE68, desc: INSERT off 2 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3446
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05BFFEF8, prev 0/05BFFEB0, desc: INSERT off 3 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3446
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05BFFF40, prev 0/05BFFEF8, desc: INSERT off 4 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3446
rmgr: Heap        len (rec/tot):     71/    71, tx:        487, lsn:
0/05BFFF88, prev 0/05BFFF40, desc: INSERT off 5 flags 0x00, blkref #0: rel
1663/16384/16385 blk 3446


Hello.

(Mmm. Sorry for the noise, but I have added some people to Cc:)

This is the same issue as the one discussed in [1] and recently
reported in [2].

[1] https://www.postgresql.org/message-id/E63E5670-6CC3-4B09-9686-A77CF94FE4A8%40amazon.com

[2] https://www.postgresql.org/message-id/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com


At Thu, 25 Mar 2021 00:23:52 +0000, "egashira.yusuke@fujitsu.com" <egashira.yusuke@fujitsu.com> wrote in 
> > The replication between "NODE-A" and "NODE-B" is synchronous replication,
> > and between "NODE-B" and "NODE-C" is asynchronous.
> > 
> > "NODE-A" <-[synchronous]-> "NODE-B" <-[non-synchronous]-> "NODE-C"
> > 
> > When the primary server "NODE-A" crashed due to full WAL storage and
> > "NODE-B" promoted, the downstream standby server "NODE-C" aborted with
> > following messages.
> > 
> > 2021-03-11 11:26:28.470 JST [85228] LOG:  invalid contrecord length 26 at
> > 0/5FFFFF0
> > 2021-03-11 11:26:28.470 JST [85232] FATAL:  terminating walreceiver process
> > due to administrator command
> > 2021-03-11 11:26:28.470 JST [85228] PANIC:  could not open file
> > "pg_wal/000000020000000000000005": No such file or directory
> > 2021-03-11 11:26:28.492 JST [85260] LOG:  started streaming WAL from primary
> > at 0/5000000 on timeline 2
> > 2021-03-11 11:26:29.260 JST [85227] LOG:  startup process (PID 85228) was
> > terminated by signal 6: Aborted
> 
> I would like to clarify the conditions under which this abort occurs so that I can explain them to the customer.
> 
> Based on the pg_waldump results, I think the conditions are as follows:
> 
> 1) A WAL record that continues into the following segment file is only partially written at the end of a WAL segment file, and
> 2) the segment file in 1) is the last WAL segment file that the standby server received, and
> 3) the standby server is promoted.
> 
> I think these conditions are met only when the primary server crashes due to full WAL storage.
> 
> Is my understanding correct?

The diagnosis looks correct to me, but the cause of the crash is
irrelevant.  A full disk just makes the crash land exactly on the vital
point.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Thanks for replying.

> This is the same issue with what is discussed in [1] and recently
> reported by [2].
>
> [1] https://www.postgresql.org/message-id/E63E5670-6CC3-4B09-9686-A77CF94FE4A8%40amazon.com
>
> [2] https://www.postgresql.org/message-id/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com

By reading the suggested threads, I understand that the cause of this problem is that the standby instance was
promoted while a WAL record spanning the segment boundary had been only partially transferred.
So the primary server's full WAL storage was not necessarily a factor.

Just the other day, a similar crash occurred at the same customer when an intermediate standby in a cascaded
replication was promoted, even though the WAL storage on the primary was not full.
I think it is the same issue because the WAL position logged at the crash point was exactly at a segment boundary.
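
To check that, I computed the offset of the failing LSN within its WAL segment, roughly as follows (a sketch; it assumes the default 16MB wal_segment_size and uses LSN 0/5FFFFF0 from my reproduction log as an example):

# Byte offset of LSN 0/5FFFFF0 within its 16MB WAL segment; a value close to
# 16777216 means the record begins right at the end of the segment and has to
# continue into the next segment file.
printf '%d\n' $(( 0x05FFFFF0 % 0x1000000 ))    # prints 16777200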


I applied the patches from the suggested threads [1] and [2] to master (055fee7eb4dcc78e58672aef146334275e1cc40d) and tested them.
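
(For reference, I applied each patch roughly as follows; a sketch, where the patch file name is a placeholder for the attachment downloaded from the corresponding thread.)

# Apply a patch from the referenced thread on top of master and rebuild.
cd postgresql
git checkout 055fee7eb4dcc78e58672aef146334275e1cc40d
git apply ../patch-from-thread.patch    # placeholder name for the attachment
./configure --enable-debug --prefix=$HOME/pgsql
make -j4 && make install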

However, with only the patch from [1] applied, the abort was not resolved when I ran my reproduction procedure.

On the other hand, the patch from thread [2] seems to solve this issue.
With patch [2] applied, the cascaded standby server no longer aborted.
The log file also shows that the abort was prevented under the same conditions.
I attach the log file of NODE-C (debug5).

2021-04-02 19:20:27.430 JST [110914] DEBUG:  sending write 0/6000000 flush 0/6000000 apply 0/5FFFFE0 (reply requested)
2021-04-02 19:20:27.430 JST [110914] DEBUG:  sendtime 2021-04-02 19:20:27.430755+09 receipttime 2021-04-02 19:20:27.430776+09 replication apply delay 91585 ms transfer latency 1 ms
2021-04-02 19:20:35.932 JST [110914] LOG:  replication terminated by primary server
2021-04-02 19:20:35.932 JST [110914] DETAIL:  End of WAL reached on timeline 1 at 0/6000000.
2021-04-02 19:20:35.932 JST [110914] LOG:  fetching timeline history file for timeline 2 from primary server
2021-04-02 19:20:35.943 JST [110914] DEBUG:  walreceiver ended streaming and awaits new instructions
2021-04-02 19:20:35.943 JST [110910] LOG:  new target timeline is 2
2021-04-02 19:20:35.943 JST [110910] DEBUG:  switched WAL source from stream to archive after failure
2021-04-02 19:20:35.943 JST [110910] DEBUG:  could not open file "pg_wal/000000020000000000000006": No such file or
directory
2021-04-02 19:20:35.943 JST [110910] DEBUG:  switched WAL source from archive to stream after failure
2021-04-02 19:20:35.943 JST [110914] LOG:  restarted WAL streaming at 0/6000000 on timeline 2
2021-04-02 19:20:35.943 JST [110914] DEBUG:  sendtime 2021-04-02 19:20:35.943432+09 receipttime 2021-04-02 19:20:35.94346+09 replication apply delay 100098 ms transfer latency 1 ms
2021-04-02 19:20:35.943 JST [110914] DEBUG:  creating and filling new WAL file
2021-04-02 19:20:36.032 JST [110914] DEBUG:  done creating and filling new WAL file
2021-04-02 19:20:36.032 JST [110914] DEBUG:  sending write 0/6000038 flush 0/5FFFFE0 apply 0/5FFFFE0
2021-04-02 19:20:36.033 JST [110914] DEBUG:  sending write 0/6000038 flush 0/6000038 apply 0/5FFFFE0

=> After the timeline change, the old segment file was re-read, but this did not cause the abort.
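
(A quick way I checked this in the attached log; the file name is a placeholder for wherever NODE-C's debug5 output was written.)

# With patch [2] applied, no PANIC appears after the timeline switch.
grep -c "PANIC" nodec_debug5.log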

The patch in thread [2] appears to be required to fix this issue. I hope that it will be committed.

Regards.
Yusuke, Egashira.