Physical slot restart_lsn advances incorrectly requiring restore from archive - Mailing list pgsql-hackers
From: Hsu, John
Subject: Physical slot restart_lsn advances incorrectly requiring restore from archive
Date:
Msg-id: 77734732-44A4-4209-8C2F-3AF36C9D4D18@amazon.com
Responses: Re: Physical slot restart_lsn advances incorrectly requiring restore from archive
List: pgsql-hackers
Hi hackers,

We believe we're seeing a problem with how a physical slot's restart_lsn is advanced, leading to replicas needing to restore from archive in order for replication to resume. The logs below are from reproductions against 10.13; I'm still working on reproducing it for 12.3.

Sequence of events:

1. A WAL write spans two WAL segments.
2. The write to the first WAL segment is complete, but the write to the second segment is not.
3. The write to the first WAL segment is acknowledged as flushed by the Postgres replica.
4. The primary restarts before the write to the second segment completes, which also means the complete WAL record was never written.
5. Crash recovery finishes at a record before the incomplete WAL write.
6. The replica now starts the slot at the next WAL segment, since the previous WAL was already acknowledged as flushed.
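To make the segment arithmetic concrete, here is a small standalone sketch (plain C, assuming the default 16 MB WAL segment size used in this reproduction; it is an illustration only, not PostgreSQL source) that maps the two key LSNs from the logs below onto WAL segment files:

#include <stdio.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define WAL_SEGMENT_SIZE    (16 * 1024 * 1024)   /* assumed default: 16 MB */
#define SEGMENTS_PER_XLOGID (UINT64_C(0x100000000) / WAL_SEGMENT_SIZE)

/* Print the LSN in hi/lo notation plus the segment file it falls in. */
static void
show(const char *what, XLogRecPtr lsn, uint32_t timeline)
{
    uint64_t segno = lsn / WAL_SEGMENT_SIZE;    /* real code would use XLByteToSeg() */

    printf("%-24s %X/%08X  -> segment %08X%08X%08X\n",
           what,
           (uint32_t) (lsn >> 32), (uint32_t) lsn,
           timeline,
           (uint32_t) (segno / SEGMENTS_PER_XLOGID),
           (uint32_t) (segno % SEGMENTS_PER_XLOGID));
}

int
main(void)
{
    /* LSN values taken from the logs below. */
    show("flush acked by replica", 0x2C000000, 1);   /* first byte of ...2C */
    show("redo done after crash",  0x2BFFFFB0, 1);   /* still inside ...2B  */
    return 0;
}

The acknowledged flush position 0/2C000000 is the very first byte of segment 00000001000000000000002C, while redo after the crash ends inside 00000001000000000000002B, so a slot that simply follows the acknowledgement no longer protects the segment the standby still needs.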
Primary crashes because it ran out of space:

2020-07-10 01:23:31.399 UTC:10.15.4.83(56430):replication_user@[unknown]:[4554]:DEBUG: write 0/2C000000 flush 0/2BFD2000 apply 0/216ACCD0
2020-07-10 01:23:31.401 UTC:10.15.4.83(56430):replication_user@[unknown]:[4554]:DEBUG: write 0/2C000000 flush 0/2C000000 apply 0/216AE728
2020-07-10 01:23:31.504 UTC::@:[4548]:DEBUG: creating and filling new WAL file
2020-07-10 01:23:31.511 UTC::@:[4548]:PANIC: could not write to file "pg_wal/xlogtemp.4548": No space left on device
2020-07-10 01:23:31.518 UTC::@:[4543]:DEBUG: reaping dead processes

Crash recovery begins:

2020-07-10 01:23:36.074 UTC::@:[8677]:DEBUG: checkpoint record is at 0/2B51B030
2020-07-10 01:23:36.074 UTC::@:[8677]:DEBUG: redo record is at 0/2A65AE08; shutdown FALSE
..
2020-07-10 01:23:36.076 UTC::@:[8677]:DEBUG: starting up replication slots
2020-07-10 01:23:36.076 UTC::@:[8677]:DEBUG: restoring replication slot from "pg_replslot/physical_slot_1/state"
2020-07-10 01:23:36.078 UTC::@:[8677]:LOG: restart_lsn for cp slot physical_slot_1: 0/2BF12000 (extra debug logs I added)
2020-07-10 01:23:36.081 UTC::@:[8677]:LOG: redo starts at 0/2A65AE08
...
2020-07-10 01:23:36.325 UTC::@:[8677]:LOG: redo done at 0/2BFFFFB0
...
2020-07-10 01:23:36.330 UTC::@:[8677]:LOG: checkpoint starting: end-of-recovery immediate
2020-07-10 01:23:36.332 UTC::@:[8677]:DEBUG: performing replication slot checkpoint
...
2020-07-10 01:23:36.380 UTC::@:[8677]:DEBUG: checkpoint sync: number=13 file=base/13934/2662 time=0.001 msec
2020-07-10 01:23:36.380 UTC::@:[8677]:DEBUG: checkpoint sync: number=14 file=base/13934/2663 time=0.001 msec
2020-07-10 01:23:36.380 UTC::@:[8677]:DEBUG: checkpoint sync: number=15 file=base/13934/24586 time=0.001 msec
2020-07-10 01:23:36.385 UTC::@:[8677]:LOG: could not signal for checkpoint: checkpointer is not running
2020-07-10 01:23:36.385 UTC::@:[8677]:DEBUG: creating and filling new WAL file
2020-07-10 01:23:36.397 UTC::@:[8677]:PANIC: could not write to file "pg_wal/xlogtemp.8677": No space left on device

The primary runs out of space during crash recovery. Space is freed up afterwards and crash recovery begins again:

2020-07-10 01:32:45.804 UTC::@:[16329]:DEBUG: checkpoint record is at 0/2B51B030
2020-07-10 01:32:45.805 UTC::@:[16329]:DEBUG: redo record is at 0/2A65AE08; shutdown FALSE
...
2020-07-10 01:32:45.805 UTC::@:[16329]:DEBUG: starting up replication slots
2020-07-10 01:32:45.805 UTC::@:[16329]:DEBUG: restoring replication slot from "pg_replslot/physical_slot_1/state"
2020-07-10 01:32:45.806 UTC::@:[16329]:LOG: restart_lsn for cp slot physical_slot_1: 0/2BF12000
...
2020-07-10 01:32:45.809 UTC::@:[16329]:LOG: redo starts at 0/2A65AE08
2020-07-10 01:32:46.043 UTC::@:[16329]:DEBUG: could not open file "pg_wal/00000001000000000000002C": No such file or directory
2020-07-10 01:32:46.043 UTC::@:[16329]:LOG: redo done at 0/2BFFFFB0

Redo finishes at 0/2BFFFFB0 even though the flush we received from the replica is already at 0/2C000000. This is problematic because the replica reconnects to the slot telling it to start past the new redo point:

2020-07-10 01:32:50.641 UTC:10.15.4.83(56698):replication_user@[unknown]:[16572]:DEBUG: received replication command: START_REPLICATION SLOT "physical_slot_1" 0/2C000000 TIMELINE 1
2020-07-10 01:32:50.641 UTC:10.15.4.83(56698):replication_user@[unknown]:[16572]:DEBUG: "walreceiver" has now caught up with upstream server
2020-07-10 01:32:50.774 UTC:10.15.4.83(56698):replication_user@[unknown]:[16572]:DEBUG: write 0/2C000B80 flush 0/2BFFFFF0 apply 0/2BFFFFF0
2020-07-10 01:32:50.775 UTC:10.15.4.83(56698):replication_user@[unknown]:[16572]:DEBUG: write 0/2C000B80 flush 0/2C000B80 apply 0/2BFFFFF0

This leads to a mismatch between what is at the end of segment 0/2B on the primary and what was streamed to the replica. Replica logs:

2020-07-10 01:32:50.671 UTC::@:[24899]:LOG: started streaming WAL from primary at 0/2C000000 on timeline 1
...
2020-07-10 01:39:32.251 UTC::@:[11703]:DEBUG: could not restore file "00000001000000000000002C" from archive: child process exited with exit code 1
2020-07-10 01:39:32.251 UTC::@:[11703]:DEBUG: invalid contrecord length 90 at 0/2BFFFFF0
2020-07-10 01:39:32.251 UTC::@:[11703]:DEBUG: switched WAL source from archive to stream after failure

The physical slot has now advanced past segment 0/2B, which is what the replica actually needs:

postgres=> select * from pg_replication_slots;
    slot_name    | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 physical_slot_1 |        | physical  |        |          | f         | f      |            |      |              | 0/2C000B80  |
(1 row)

Segment 0/2B can now rotate out on the primary, and restoring from archive is required for replication to resume.

The attached patch (against 10) attempts to address this by keeping track of the first flushLsn in the current segNo and waiting until we receive a flush past it before updating restart_lsn. This prevents the WAL from rotating out of the primary, and a reboot of the replica will fix things instead of needing to restore from archive. (A rough sketch of this idea appears below, after the next log excerpt.)

With the patch, the primary goes into crash recovery and we avoid updating the restart_lsn of the slot:

2020-07-10 18:50:12.686 UTC::@:[6160]:LOG: redo starts at 0/2D417108
2020-07-10 18:50:12.965 UTC::@:[6160]:DEBUG: could not open file "pg_wal/00000001000000000000002F": No such file or directory
2020-07-10 18:50:12.965 UTC::@:[6160]:LOG: redo done at 0/2EFFFF90
...
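For illustration, the guard described above amounts to something like the following standalone sketch (plain C; confirm_flush, restartSegNo, and firstFlushInSeg are placeholder names, and the actual attached patch may differ):

#include <stdio.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define WAL_SEGMENT_SIZE (16 * 1024 * 1024)     /* assumed default: 16 MB */

/* Hypothetical per-slot walsender state, for illustration only. */
static uint64_t   restartSegNo    = 0;  /* segment the slot currently points into   */
static XLogRecPtr firstFlushInSeg = 0;  /* first flush LSN reported in that segment */
static XLogRecPtr restart_lsn     = 0;  /* what would be persisted for the slot     */

/*
 * Apply a standby flush acknowledgement, but only advance restart_lsn once
 * we have seen a flush beyond the first one reported in the current
 * segment.  This mirrors the idea described above, not the actual patch.
 */
static void
confirm_flush(XLogRecPtr flushLsn)
{
    uint64_t flushSegNo = flushLsn / WAL_SEGMENT_SIZE;  /* real code: XLByteToSeg() */

    if (flushSegNo != restartSegNo)
    {
        /* lsn is not in restartSegNo: remember it, but don't advance yet */
        restartSegNo = flushSegNo;
        firstFlushInSeg = flushLsn;
        return;
    }

    if (flushLsn > firstFlushInSeg)
        restart_lsn = flushLsn;         /* safe to advance now */
}

int
main(void)
{
    /* Start roughly where the patched slot sits in the output further below. */
    restart_lsn = 0x2DFF8000;
    restartSegNo = restart_lsn / WAL_SEGMENT_SIZE;
    firstFlushInSeg = restart_lsn;

    confirm_flush(0x2EFFFFD0);  /* first flush after reconnect, new segment: held back */
    confirm_flush(0x2F020000);  /* first flush in segment ...2F: held back             */
    confirm_flush(0x2F020000);  /* stuck replica repeats the same flush: still held    */

    printf("restart_lsn = %X/%08X\n",
           (uint32_t) (restart_lsn >> 32), (uint32_t) restart_lsn);  /* 0/2DFF8000 */
    return 0;
}

Replayed against the flush values seen in the patched run (0/2EFFFFD0, then 0/2F020000 repeated), the guard never advances restart_lsn, consistent with the patched pg_replication_slots output further down where the slot stays at 0/2DFF8000.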
Walsender logs after the replica reconnects to the slot:

2020-07-10 18:59:32.987 UTC:10.15.0.240(9056):replication_user@[unknown]:[19623]:DEBUG: received replication command: START_REPLICATION SLOT "physical_slot_2" 0/2F000000 TIMELINE 1
2020-07-10 18:59:33.937 UTC:10.15.0.240(9056):replication_user@[unknown]:[19623]:DEBUG: write 0/2F020000 flush 0/2EFFFFD0 apply 0/2EFFFFD0
2020-07-10 18:59:33.937 UTC:10.15.0.240(9056):replication_user@[unknown]:[19623]:LOG: lsn is not in restartSegNo, update to match
2020-07-10 18:59:33.938 UTC:10.15.0.240(9056):replication_user@[unknown]:[19623]:DEBUG: write 0/2F020000 flush 0/2F020000 apply 0/2EFFFFD0
2020-07-10 18:59:33.938 UTC:10.15.0.240(9056):replication_user@[unknown]:[19623]:LOG: lsn is not in restartSegNo, update to match

Replica logs:

2020-07-10 18:59:54.040 UTC::@:[12873]:DEBUG: could not restore file "00000001000000000000002F" from archive: child process exited with exit code 1
2020-07-10 18:59:54.040 UTC::@:[12873]:DEBUG: invalid contrecord length 58 at 0/2EFFFFD0

Since the flushLsn hasn't advanced past the first one in the restartSegNo, the slot doesn't get updated in future checkpoints:

postgres=> select * from pg_replication_slots;
    slot_name    | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 physical_slot_2 |        | physical  |        |          | f         | f      |            |      |              | 0/2DFF8000  |
(1 row)

Rebooting the replica allows replication to resume from the slot, and restart_lsn advances normally.

Thanks,
John H