Re: Fwd: Data corruption after restarting replica - Mailing list pgsql-general
From | Adrian Klaver |
---|---|
Subject | Re: Fwd: Data corruption after restarting replica |
Date | |
Msg-id | 54E50347.9020906@aklaver.com |
In response to | Fwd: Data corruption after restarting replica (Novák, Petr <novakp@avast.com>) |
Responses | Re: Fwd: Data corruption after restarting replica |
List | pgsql-general |
On 02/16/2015 02:44 AM, Novák, Petr wrote:
> Hello,
>
> sorry for posting to a second list, but as I've received no reply
> there, I'm trying my luck here.
>
> Thanks
> Petr
>
>
> ---------- Forwarded message ----------
> From: Novák, Petr <novakp@avast.com>
> Date: Tue, Feb 10, 2015 at 12:49 PM
> Subject: Data corruption after restarting replica
> To: pgsql-bugs@postgresql.org
>
>
> Hi all,
>
> we're experiencing data corruption after switching a streamed replica to primary.
> This is not the first time I've encountered this issue, so I'll try to
> describe it in more detail.
>
> For this particular cluster we have 6 servers in two datacenters (3 in
> each). There are two instances running on each server, each with its
> own port and datadir. On the first two servers in each datacenter one
> instance is a primary and the other is a replica for the primary from the
> other server. The third server holds two offsite replicas from the other
> datacenter (for DR purposes).
>
> Each replica was set up by taking a pg_basebackup from the primary
> (pg_basebackup -h <hostname> -p 5430 -D /data2/basebackup -P -v -U
> <user> -x -c fast). Then the directories from initdb were replaced with
> the ones from the basebackup (only the configuration files remained) and
> the replica was started and successfully connected to the primary. It was
> running with no problem, keeping up with the primary. We were
> experiencing some connection problems between the two datacenters, but
> replication didn't break.
>
> Then we needed to take one datacenter offline due to hardware
> maintenance. So I switched the applications down, verified that no
> more clients were connected to the primary, then shut the primary down and
> restarted the replica without recovery.conf, and the applications were
> started using the new db with no problem. The other replica even
> successfully reconnected to this new primary.

What other replica?

>
> A few hours after the switch, lines appeared in the server log (which
> didn't appear before), indicating corruption:
>
> ERROR: index "account_username_key" contains unexpected zero page at
> block 1112135
> ERROR: right sibling's left-link doesn't match: block 476354 links to
> 1062443 instead of expected 250322 in index "account_pkey"
>
> ..and many more reporting corruption in several other indexes.

What happened to the primary you shut down?

>
> The issue was resolved by creating new indexes and dropping the
> affected ones, although there were already some duplicates in the
> data that had to be resolved, as some of the indexes were unique.
>
> This particular case uses Postgres 9.1.14 on both primary and replica.
> But I've experienced similar behavior on 9.2.9. The OS is CentOS 6.6 in all
> cases. This may mean that there is something wrong with our
> configuration or the replication setup steps, but I've set up another
> instance using the same steps with no problem.
>
> Fsync-related settings are at their defaults. Data directories are on
> RAID10 arrays with BBUs. The filesystem is ext4 mounted with the nobarrier
> option.
>
> The database is fairly large, ~120GB, with several 50mil+ tables, lots of
> indexes and FK constraints. It is mostly queried;
> updates/inserts/deletes are only several rows/s.
>
> Any help will be appreciated.
>
> Petr Novak
>
> System Engineer
> Avast s.r.o.
>
>

--
Adrian Klaver
adrian.klaver@aklaver.com
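
The quoted report mentions duplicates that had to be cleaned up before the unique indexes could be recreated. A minimal sketch of how such duplicates might be located is below. It only builds the SQL text; the table name `account` and column name `username` are guesses inferred from the index name "account_username_key", and the helper `duplicate_check_sql` is hypothetical, not something from the original setup.

```python
def duplicate_check_sql(table: str, column: str) -> str:
    """Build a query listing values that occur more than once in `column`."""

    def quote(ident: str) -> str:
        # Double-quote the identifier and escape embedded quotes,
        # so mixed-case or reserved names survive intact.
        return '"' + ident.replace('"', '""') + '"'

    return (
        f"SELECT {quote(column)}, count(*) AS copies "
        f"FROM {quote(table)} "
        f"GROUP BY {quote(column)} "
        f"HAVING count(*) > 1 "
        f"ORDER BY copies DESC;"
    )


# Assumed names, inferred from the index "account_username_key".
print(duplicate_check_sql("account", "username"))
```

It is probably safest to run the resulting query only once the damaged index is out of the picture (dropped, or with index scans disabled for the session), so the planner cannot read from it.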