Thread: BUG #15360: Cascade slave cannot continue replication without manualrestart after master server crash/restart.
BUG #15360: Cascade slave cannot continue replication without manualrestart after master server crash/restart.
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 15360 Logged by: Maxim Boguk Email address: maxim.boguk@gmail.com PostgreSQL version: 9.6.10 Operating system: Linux Ubuntu Description: Hi, I have the following cascading replication configuration (using only streaming replication without wal archive): master db -> main stanby replica db -> multiple readonly replicas. All using 9.6.10 Postgresql. For unknown (yet) reason master server had been stuck (no postgresql but server as whole) and manually restarted. After it very strange thing happen: The main stanby replica connected to restarted master successfully and continued replication, but every readonly replica connected to main standby didn't start replication process until manual restart. After manual restart - they continue replication without any further complains. Readonly replica have the next errors in the log: second cascade slave: 2018-09-01 02:02:59 MSK 28822 @ from [vxid: txid:0] [] DETAIL: last completed transaction was at log time 2018-09-01 02:02:59.501032+03 ... different production selects logged... 2018-09-01 03:36:55 MSK 28819 @ from [vxid:1/0 txid:0] [] LOG: invalid contrecord length 2431 at D9ED/EAA2FB00 2018-09-01 03:36:55 MSK 30264 @ from [vxid: txid:0] [] FATAL: terminating walreceiver process due to administrator command 2018-09-01 03:36:55 MSK 28819 @ from [vxid:1/0 txid:0] [] LOG: invalid contrecord length 2431 at D9ED/EAA2FB00 2018-09-01 03:36:55 MSK 28819 @ from [vxid:1/0 txid:0] [] LOG: invalid contrecord length 2431 at D9ED/EAA2FB00 2018-09-01 03:36:55 MSK 28819 @ from [vxid:1/0 txid:0] [] LOG: invalid contrecord length 2431 at D9ED/EAA2FB00 ... ... and so on until manual restart issued 2018-09-01 04:05:12 MSK 28815 @ from [vxid: txid:0] [] LOG: received fast shutdown request 2018-09-01 04:05:12 MSK 28815 @ from [vxid: txid:0] [] LOG: aborting any active transactions ... 2018-09-01 04:05:12 MSK 28815 @ from [vxid: txid:0] [] LOG: database system is shut down 2018-09-01 04:05:15 MSK 39686 @ from [vxid: txid:0] [] LOG: database system was shut down in recovery at 2018-09-01 04:05:12 MSK 2018-09-01 04:05:15 MSK 39686 @ from [vxid: txid:0] [] LOG: entering standby mode 2018-09-01 04:05:15 MSK 39686 @ from [vxid:1/0 txid:0] [] LOG: redo starts at D9EB/18907C38 2018-09-01 04:07:25 MSK 39686 @ from [vxid:1/0 txid:0] [] LOG: consistent recovery state reached at D9ED/E1FB7A60 2018-09-01 04:07:26 MSK 39686 @ from [vxid:1/0 txid:0] [] LOG: invalid contrecord length 2431 at D9ED/EAA2FB00 2018-09-01 04:07:26 MSK 40123 @ from [vxid: txid:0] [] LOG: started streaming WAL from primary at D9ED/EA000000 on timeline 1 ... no errors and all ok after In the same time main standby replica have the following lines in log: 2018-09-01 02:02:59 MSK 2558 @ from [vxid: txid:0] [] DETAIL: last completed transaction was at log time 2018-09-01 02:02:59.377119+03 ... 2018-09-01 03:18:36 MSK 2971 @ from [vxid: txid:0] [] FATAL: terminating walreceiver due to timeout 2018-09-01 03:28:31 MSK 4754 @ from [vxid: txid:0] [] FATAL: could not connect to the primary server: could not connect to server: No route to host ... 2018-09-01 03:35:10 MSK 6025 @ from [vxid: txid:0] [] FATAL: could not connect to the primary server: FATAL: the database system is starting up ... 2018-09-01 03:36:55 MSK 6276 @ from [vxid: txid:0] [] LOG: started streaming WAL from primary at D9ED/EA000000 on timeline 1 2018-09-01 03:36:55 MSK 2553 @ from [vxid:1/0 txid:0] [] LOG: invalid contrecord length 2431 at D9ED/EAA2FB00 2018-09-01 03:36:55 MSK 6276 @ from [vxid: txid:0] [] FATAL: terminating walreceiver process due to administrator command 2018-09-01 03:36:55 MSK 2553 @ from [vxid:1/0 txid:0] [] LOG: invalid resource manager ID 101 at D9ED/EAA76618 2018-09-01 03:36:55 MSK 6277 @ from [vxid: txid:0] [] LOG: started streaming WAL from primary at D9ED/EA000000 on timeline 1 ... RO replicas restarts now 2018-09-01 03:50:39 MSK 8548 replica@[unknown] from 172.16.18.13 [vxid:222/0 txid:0] [idle] LOG: received replication command: IDENTIFY_SYSTEM 2018-09-01 03:50:39 MSK 8548 replica@[unknown] from 172.16.18.13 [vxid:222/0 txid:0] [idle] LOG: received replication command: START_REPLICATION D9ED/EA000000 TIMELINE 1 ... no errors and all ok after Questions: is this behavior expected or something fishy going on here? I never seen such situation with cascade streaming replication before. Kind Regards, Maxim