
Re: Switching timeline over streaming replication - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: Switching timeline over streaming replication
Date	October 9, 2012 17:02:15
Msg-id	50745884.6040008@vmware.com
In response to	Re: Switching timeline over streaming replication (Amit Kapila <amit.kapila@huawei.com>)
Responses	Re: Switching timeline over streaming replication Re: Switching timeline over streaming replication
List	pgsql-hackers


On 06.10.2012 15:58, Amit Kapila wrote:
> One more test seems to be failed. Apart from this, other tests are passed.
>
> 2. a. Master M-1
>     b. Standby S-1 follows M-1
>     c. insert 10 records on M-1. verify all records are visible on M-1,S-1
>     d. Stop S-1
>     e. insert 2 records on M-1.
>     f. Stop M-1
>     g. Start S-1
>     h. Promote S-1
>     i. Make M-1 recovery.conf such that it should connect to S-1
>     j. Start M-1. Below error comes on M-1 which is expected as M-1 has more
> data.
>        LOG:  database system was shut down at 2012-10-05 16:45:39 IST
>        LOG:  entering standby mode
>        LOG:  consistent recovery state reached at 0/176A070
>        LOG:  record with zero length at 0/176A070
>        LOG:  database system is ready to accept read only connections
>        LOG:  streaming replication successfully connected to primary
>        LOG:  fetching timeline history file for timeline 2 from primary
> server
>        LOG:  replication terminated by primary server
>        DETAIL:  End of WAL reached on timeline 1
>        LOG:  walreceiver ended streaming and awaits new instructions
>        LOG:  new timeline 2 forked off current database system timeline 1
> before current recovery point 0/176A070
>        LOG:  re-handshaking at position 0/1000000 on tli 1
>        LOG:  replication terminated by primary server
>        DETAIL:  End of WAL reached on timeline 1
>        LOG:  walreceiver ended streaming and awaits new instructions
>        LOG:  new timeline 2 forked off current database system timeline 1
> before current recovery point 0/176A070
>     k. Stop M-1. Start M-1. It is able to successfully connect to S-1 which
> is a problem.
>     l. check in S-1. Records inserted in step-e are not present.
>     m. Now insert records in S-1. M-1 doesn't receive any records. On M-1
> server following log is getting printed.
>        LOG:  out-of-sequence timeline ID 1 (after 2) in log segment
> 000000020000000000000001, offset 0
>        LOG:  out-of-sequence timeline ID 1 (after 2) in log segment
> 000000020000000000000001, offset 0
>        LOG:  out-of-sequence timeline ID 1 (after 2) in log segment
> 000000020000000000000001, offset 0
>        LOG:  out-of-sequence timeline ID 1 (after 2) in log segment
> 000000020000000000000001, offset 0
>        LOG:  out-of-sequence timeline ID 1 (after 2) in log segment
> 000000020000000000000001, offset 0

Hmm, seems we need to keep track of which timeline we've used to recover
before. Before restart, the master correctly notices that timeline 2
forked off earlier in its history, so it cannot recover to that
timeline. But after restart the master begins recovery from the previous
checkpoint, and because timeline 2 forked off timeline 1 after the
checkpoint, it concludes that it can follow that timeline. It doesn't
realize that it had already recovered/flushed some WAL in timeline 1
after the fork-point.
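
To illustrate the difference between the two checks, here is a minimal
sketch in C. It is not code from the patch: the type Lsn, both function
names, and the positions used with them are hypothetical, and a WAL
position is simplified to a single 64-bit integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a WAL position; not the server's own type. */
typedef uint64_t Lsn;

/*
 * Flawed check: only compares the fork point with the checkpoint that
 * recovery restarts from. This is the behaviour seen after the restart.
 */
static bool can_follow_naive(Lsn checkpoint, Lsn fork_point)
{
    return fork_point >= checkpoint;
}

/*
 * Stricter check: also remembers how far WAL was already replayed or
 * flushed on the old timeline. The new timeline can only be followed
 * if it forked off at or after that point.
 */
static bool can_follow_checked(Lsn checkpoint, Lsn replayed_upto,
                               Lsn fork_point)
{
    assert(replayed_upto >= checkpoint);
    return fork_point >= replayed_upto;
}
```

With a checkpoint at 0x1000, a fork point at 0x1500 and WAL already
replayed up to 0x1700 (all made-up values), the naive check accepts the
new timeline while the stricter one rejects it.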

Attached is a new version of the patch. I committed the refactoring of
XLogPageRead() already, as that was a readability improvement even
without this patch. All the reported issues should be fixed now,
although I will continue testing this tomorrow. I added various checks
that the correct timeline is followed during recovery.
minRecoveryPoint is now accompanied by a timeline ID, so that when we
restart recovery, we check that we recover back to minRecoveryPoint
along the same timeline as last time. Also, it now checks at the
beginning of recovery that the checkpoint record comes from the correct
timeline. That fixes the problem that you reported above. I also
adjusted the error messages on timeline history problems to be clearer.
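
The two checks described above could be sketched roughly as follows.
This is an illustration under assumptions, not the code in the attached
patch: HistoryEntry, tli_at() and recovery_target_ok() are hypothetical
names, WAL positions are plain 64-bit integers, and the handling of a
position that falls exactly on a fork point is simplified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;          /* hypothetical WAL position */
typedef uint32_t TimelineId;   /* timeline IDs start at 1; 0 = none */

/* One line of a timeline history: tli covers [begin, end); end == 0
 * means the timeline is still open-ended. */
typedef struct {
    TimelineId tli;
    Lsn begin;
    Lsn end;
} HistoryEntry;

/* Return the timeline that owns position pos in the given history,
 * or 0 if no entry covers it. */
static TimelineId tli_at(const HistoryEntry *h, size_t n, Lsn pos)
{
    assert(h != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        if (pos >= h[i].begin && (h[i].end == 0 || pos < h[i].end))
            return h[i].tli;
    }
    return 0;
}

/* Recovery may proceed along the target history only if both the
 * checkpoint record and the stored minRecoveryPoint lie on the
 * timelines they were written on. */
static bool recovery_target_ok(const HistoryEntry *h, size_t n,
                               Lsn checkpoint, TimelineId checkpoint_tli,
                               Lsn min_recovery_point,
                               TimelineId min_recovery_tli)
{
    return tli_at(h, n, checkpoint) == checkpoint_tli &&
           tli_at(h, n, min_recovery_point) == min_recovery_tli;
}
```

For a history in which timeline 2 forks off timeline 1 at 0x1500
(made-up value), a minRecoveryPoint of 0x1700 recorded on timeline 1
fails the check, because that position belongs to timeline 2 in the
target history.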

- Heikki

Attachment

streaming-tli-switch-4.patch.gz
