Re: Cannot rebuild a standby server - Mailing list pgsql-admin
From: John Scalia
Subject: Re: Cannot rebuild a standby server
Date:
Msg-id: 53A484E8.2060200@gmail.com
In response to: Re: Cannot rebuild a standby server (Kevin Grittner <kgrittn@ymail.com>)
Responses: Re: Cannot rebuild a standby server
List: pgsql-admin
Well, I did finally get it working by adding -X s -c fast to the pg_basebackup command. Kevin, if I didn't copy WALs over, the database still refused to start, as it claimed it was looking for one of the specific files. Also, I've not seen any references to removing certain files, like a backup_label file, in the standby's data directory causing problems. The other files I removed were the old postgresql.pid file from the primary and a file called archiving_active, which I use for controlling whether PostgreSQL writes WAL files or not. It seems a little funny to me that I've done this same procedure for over 4 months with no problems, and today was the first time it bit me.

On 6/20/2014 2:09 PM, Kevin Grittner wrote:
> John Scalia <jayknowsunix@gmail.com> wrote:
>
>> In the true definition of insanity, I've tried to rebuild a standby
>> streaming replication server using the following steps several times:
>>
>> 1) ensure the postgresql data directory, /var/lib/pgsql/9.3/data, is empty.
>> 2) run: pg_basebackup -h <primary server> -D /var/lib/pgsql/9.3/data
>> 3) manually copy the WALs from the primary server's pg_xlog directory
>>    to the directory specified in the standby's recovery.conf restore_command.
>
> Step 3 is enough to cause database corruption on the replica.
>
>> 4) rm any artifacts from the standby's new data directory like the
>>    backup_label file.
>
> So is that.
>
>> 5) copy the saved recovery.conf into the standby's data directory and check
>>    it is accurate.
>> 6) Start the database using "service postgresql-9.3 start"
>>
>> Every time, however, the following appears in the pg_log/postgresql-Fri.log:
>> <timestamp> LOG: entering standby mode
>> <timestamp> LOG: restored log file "00000003.history"
>> <timestamp> LOG: invalid secondary checkpoint record
>> <timestamp> PANIC: could not locate a valid checkpoint record
>
> Yep, that's about the best result you can expect with the above
> procedure; it is also occasionally possible to get it to start, but
> if it did there would almost certainly be data loss or corruption.
>
>> All this was originally caused by testing the failover mechanism in pgpool.
>> That didn't succeed and I'm trying to get the servers back to their original
>> states. I've done this kind of thing before, but don't know what's wrong
>> with this effort. What have I missed?
>
> You should enable WAL archiving and the restore_command in
> recovery.conf should copy WAL files from the archive. The pg_xlog
> directory should be empty when starting recovery unless the primary
> is stopped and you only copy pg_xlog files from the stopped server
> into the pg_xlog directory of the recovery cluster. Don't delete
> the backup_label file, because it has the information recovery
> needs about the point from which it should start WAL replay --
> without it, it will have to guess, and is very likely to get that
> wrong.
>
> The documentation is your friend. It gives pretty specific
> instructions for what to do.
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
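For reference, here is a minimal sketch of the rebuild procedure as it ends up looking with the advice above applied: pg_basebackup streams the WAL it needs (-X s) with an immediate checkpoint (-c fast), backup_label is left alone, and restore_command pulls from a WAL archive instead of files copied out of the primary's pg_xlog. The hostname, replication user, and archive path are assumptions for illustration, not values from this thread.

    # Sketch only -- hostname, user, and archive path are hypothetical.
    # Run on the standby as the postgres user.

    PGDATA=/var/lib/pgsql/9.3/data

    # 1) Stop the standby and clear its data directory.
    service postgresql-9.3 stop
    rm -rf "${PGDATA:?}"/*

    # 2) Take a fresh base backup. -X s (stream) pulls the WAL needed for a
    #    consistent start along with the backup, so nothing has to be copied
    #    by hand from the primary's pg_xlog; -c fast requests an immediate
    #    checkpoint instead of waiting for the next scheduled one.
    pg_basebackup -h primary.example.com -U replication -D "$PGDATA" -X s -c fast -P

    # 3) Do NOT remove backup_label -- recovery reads it to find the correct
    #    starting checkpoint for WAL replay.

    # 4) Drop in a 9.3-era recovery.conf, restoring any archived WAL from a
    #    shared archive rather than from pg_xlog copies.
    cat > "$PGDATA/recovery.conf" <<'EOF'
    standby_mode = 'on'
    primary_conninfo = 'host=primary.example.com user=replication'
    restore_command = 'cp /mnt/wal_archive/%f "%p"'
    EOF

    # 5) Start the standby.
    service postgresql-9.3 start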