[BUGS] BUG #14702: Streaming replication broken after server closedconnection unexpectedly - Mailing list pgsql-bugs
From | girgen@pingpong.net |
---|---|
Subject | [BUGS] BUG #14702: Streaming replication broken after server closedconnection unexpectedly |
Date | |
Msg-id | 20170612215251.26502.58139@wrigleys.postgresql.org Whole thread Raw |
Responses |
Re: [BUGS] BUG #14702: Streaming replication broken after serverclosed connection unexpectedly
|
List | pgsql-bugs |
The following bug has been logged on the website: Bug reference: 14702 Logged by: Palle Girgensohn Email address: girgen@pingpong.net PostgreSQL version: 9.6.2 Operating system: FreeBSD 11.0 Description: Hi! Setup is simple streaming replication: master -> slave. There is a replication slot at the master, so xlogs should not be removed until the client has received them properly. Suddenly, the slave logs this: [75678-1] db=,user= LOG: restartpoint complete: wrote 2795 buffers (1.6%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=269.858 s, sync=3.768 s, total=273.636 s; sync files=242, longest=3.763 s, average=0.015 s; distance=26953 kB, estimate=58915 kB [75679-1] db=,user= LOG: recovery restart point at 226B/E9DF4E30 [75679-2] db=,user= DETAIL: last completed transaction was at log time 2017-05-29 00:57:19.221918+02 [75680-1] db=,user= LOG: restartpoint starting: time [5-1] db=,user= FATAL: could not send data to WAL stream: server closed the connection unexpectedly [5-2] This probably means the server terminated abnormally [5-3] before or while processing the request. [5-4] [246-1] db=,user= FATAL: invalid memory alloc request size 1600487424 The master has not relevant information in its log. After this, the slave could not be started again, each time the same error about "invalid memory alloc request size 1600487424". Looking more closely, the last xlog file, let's call it 0000EB, is currupt on the slave, having a different checksum from the proper one at the master. The oldest xlog saved on the master is 0000EB, all older ones are properly deleted due to retention; they have been delivered to the slave. On the slave, the youngest one is 0000EA (the file before 0000EB). Hence there is no common xlog file on both systems, since the 0000EB file is corrupt on the slave. Removing the corrupt 0000EB from the slave, and starting the slave again, changes the error messate from "invalid memory alloc" to [1-1] db=,user= LOG: ending log output to stderr [1-2] db=,user= HINT: Future log output will go to log destination "syslog". [2-1] db=,user= LOG: database system was interrupted while in recovery at log time 2017-05-29 00:48:11 CEST [2-2] db=,user= HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. [3-1] db=,user= LOG: entering standby mode [4-1] db=,user= LOG: redo starts at 226B/E9DF4E30 [2-1] db=,user= LOG: started streaming WAL from primary at 226B/EA000000 on timeline 1 [3-1] db=,user= FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000226B000000EA has already been removed OK, better... I have the 0000EA on the slave. So I copied that 0000EA over back to the master's pg_xlog directory. The slave was happy without further ado: [2-1] db=,user= LOG: started streaming WAL from primary at 226B/EA000000 on timeline 1 [5-1] db=,user= LOG: consistent recovery state reached at 226B/EBE89CB8 [2-1] db=,user= LOG: database system is ready to accept read only connections [2-1] db=,user= LOG: restartpoint starting: time And life is peachy again. Now, I don't know exactly what happened when the slave lost track, but the bug, I think, is that the streamed WAL was corrupt, and still accepted by the slave *and* hence removed from the master. It required too much experience to fix that. The slave should not accept a not fully transported WAL file. It seems it happened during some connection failure between the slave and master, but still it should preferrably fail more gracefully. Are the mechanisms implemented to support that, and they failed, or is it just not implemented? Palle -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
pgsql-bugs by date: