Re: Switching timeline over streaming replication - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Switching timeline over streaming replication |
Date | |
Msg-id | 50745884.6040008@vmware.com |
In response to | Re: Switching timeline over streaming replication (Amit Kapila <amit.kapila@huawei.com>) |
Responses |
Re: Switching timeline over streaming replication
Re: Switching timeline over streaming replication |
List | pgsql-hackers |
On 06.10.2012 15:58, Amit Kapila wrote:
> One more test seems to be failed. Apart from this, other tests are passed.
>
> 2. a. Master M-1
> b. Standby S-1 follows M-1
> c. insert 10 records on M-1. verify all records are visible on M-1,S-1
> d. Stop S-1
> e. insert 2 records on M-1.
> f. Stop M-1
> g. Start S-1
> h. Promote S-1
> i. Make M-1 recovery.conf such that it should connect to S-1
> j. Start M-1. Below error comes on M-1 which is expected as M-1 has more
> data.
> LOG: database system was shut down at 2012-10-05 16:45:39 IST
> LOG: entering standby mode
> LOG: consistent recovery state reached at 0/176A070
> LOG: record with zero length at 0/176A070
> LOG: database system is ready to accept read only connections
> LOG: streaming replication successfully connected to primary
> LOG: fetching timeline history file for timeline 2 from primary
> server
> LOG: replication terminated by primary server
> DETAIL: End of WAL reached on timeline 1
> LOG: walreceiver ended streaming and awaits new instructions
> LOG: new timeline 2 forked off current database system timeline 1
> before current recovery point 0/176A070
> LOG: re-handshaking at position 0/1000000 on tli 1
> LOG: replication terminated by primary server
> DETAIL: End of WAL reached on timeline 1
> LOG: walreceiver ended streaming and awaits new instructions
> LOG: new timeline 2 forked off current database system timeline 1
> before current recovery point 0/176A070
> k. Stop M-1. Start M-1. It is able to successfully connect to S-1 which
> is a problem.
> l. check in S-1. Records inserted in step-e are not present.
> m. Now insert records in S-1. M-1 doesn't recieve any records. On M-1
> server following log is getting printed.
> LOG: out-of-sequence timeline ID 1 (after 2) in log segment
> 000000020000000000000001, offset 0
> LOG: out-of-sequence timeline ID 1 (after 2) in log segment
> 000000020000000000000001, offset 0
> LOG: out-of-sequence timeline ID 1 (after 2) in log segment
> 000000020000000000000001, offset 0
> LOG: out-of-sequence timeline ID 1 (after 2) in log segment
> 000000020000000000000001, offset 0
> LOG: out-of-sequence timeline ID 1 (after 2) in log segment
> 000000020000000000000001, offset 0

Hmm, seems we need to keep track of which timeline we've used to recover before. Before restart, the master correctly notices that timeline 2 forked off earlier in its history, so it cannot recover to that timeline. But after restart the master begins recovery from the previous checkpoint, and because timeline 2 forked off timeline 1 after the checkpoint, it concludes that it can follow that timeline. It doesn't realize that it had already recovered/flushed some WAL in timeline 1 after the fork-point.

Attached is a new version of the patch. I committed the refactoring of XLogPageRead() already, as that was a readability improvement even without this patch. All the reported issues should be fixed now, although I will continue testing this tomorrow.

I added various checks that the correct timeline is followed during recovery. minRecoveryPoint is now accompanied by a timeline ID, so that when we restart recovery, we check that we recover back to minRecoveryPoint along the same timeline as last time. Also, it now checks at the beginning of recovery that the checkpoint record comes from the correct timeline. That fixes the problem that you reported above. I also adjusted the error messages on timeline history problems to be clearer.

- Heikki
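
For readers following the thread, the reasoning above can be illustrated with a small model. This is only a sketch in Python, not code from the patch: the names `parse_lsn`, `lies_on_history`, `naive_can_follow` and `checked_can_follow` are hypothetical, and the fork point (0/1760000) and checkpoint location (0/1700000) are made-up values chosen to match the scenario. Only the recovery point 0/176A070 comes from the quoted log. The sketch assumes a timeline history can be represented as a map from timeline ID to the WAL location where the next timeline forked off.

```python
from typing import Dict, Optional

# Hypothetical model of a timeline history: timeline ID -> WAL location at
# which the next timeline forked off it (None for the newest, open timeline).
History = Dict[int, Optional[int]]


def parse_lsn(text: str) -> int:
    """Convert an 'X/Y' hexadecimal WAL location such as '0/176A070' to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lies_on_history(history: History, lsn: int, tli: int) -> bool:
    """True if WAL position `lsn` on timeline `tli` is part of `history`."""
    if tli not in history:
        return False
    fork = history[tli]
    # WAL written on `tli` past the fork point is not part of the newer timeline.
    return fork is None or lsn <= fork


def naive_can_follow(history: History, checkpoint_lsn: int, checkpoint_tli: int) -> bool:
    """Behaviour after restart as described above: only the checkpoint is checked."""
    return lies_on_history(history, checkpoint_lsn, checkpoint_tli)


def checked_can_follow(history: History,
                       checkpoint_lsn: int, checkpoint_tli: int,
                       min_recovery_lsn: int, min_recovery_tli: int) -> bool:
    """Also require that minRecoveryPoint, with its timeline ID, lies on the history."""
    return (lies_on_history(history, checkpoint_lsn, checkpoint_tli)
            and lies_on_history(history, min_recovery_lsn, min_recovery_tli))


# Assumed values: timeline 2 forked off timeline 1 at 0/1760000, the last
# checkpoint on M-1 is at 0/1700000, and M-1 has WAL up to 0/176A070 on timeline 1.
timeline2_history: History = {1: parse_lsn("0/1760000"), 2: None}
checkpoint = (parse_lsn("0/1700000"), 1)
min_recovery = (parse_lsn("0/176A070"), 1)

naive_result = naive_can_follow(timeline2_history, *checkpoint)
checked_result = checked_can_follow(timeline2_history, *checkpoint, *min_recovery)
```

With these assumed numbers, `naive_result` is true because the checkpoint precedes the fork point, which mirrors the restart in step k. `checked_result` is false because the recorded recovery point on timeline 1 lies past the fork, which is the situation the added timeline ID is meant to catch.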
Attachment