Re: Database corruption event, unlockable rows, possibly bogus virtual xids? Invalid backend server xid - Mailing list pgsql-general
| From | Kevin Grittner |
| --- | --- |
| Subject | Re: Database corruption event, unlockable rows, possibly bogus virtual xids? Invalid backend server xid |
| Date | |
| Msg-id | 1361479613.45008.YahooMailNeo@web162902.mail.bf1.yahoo.com |
| In response to | Database corruption event, unlockable rows, possibly bogus virtual xids? Invalid backend server xid (Ned Wolpert <ned.wolpert@imemories.com>) |
| List | pgsql-general |
Ned Wolpert <ned.wolpert@imemories.com> wrote:

> I'm doing a postmortem on a corruption event we had. I have an
> idea on what happened, but not sure. I figure I'd share what
> happened and see if I'm close to right here.
>
> Running 9.1.6 with hot-standby, archiving 4 months of wal files,
> and even a nightly pg_dumpall.

Those WAL files aren't going to be of much use without an
occasional base backup to apply them to.

> 50G database.  Trying to update or delete a row in a small
> (21-row, but heavily used) table would lock up completely. Never
> finish.

How long is "never" in hours or days?

> Removed all clients, restarted the db instance, no joy.
> Checked pg_stat_activity, and nothing that wasn't idle.... ran
> the delete, locked up.

Did you have any rows in pg_prepared_xacts that had lingered for
longer than you were waiting for the delete?

> Ran (SELECT * FROM pg_class JOIN pg_locks ON pg_locks.relation =
> pg_class.oid;) with no clients touching this row

Of course you need to be connected to the right database when you
run this, and you need to look at relation locks -- row locks don't
show up in that view.
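A version of that query which limits the output to locks with no
backend process behind them (which is how locks held by prepared
transactions show up) might be more telling -- something along
these lines, untested, run while connected to the affected
database:

  SELECT c.relname, l.locktype, l.mode, l.granted,
         l.virtualtransaction
    FROM pg_locks l
    JOIN pg_class c ON c.oid = l.relation
   WHERE l.pid IS NULL;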
> fresh restart of the db, and saw virtualtransactions against this
> same table where the values would be -1/nnnnn, where nnnnn was a
> huge number.  Turns out we had about 159 entries from different
> tables in the database.

Sounds like prepared transactions.  Again, what was in
pg_prepared_xacts?

> Checked hot-standby and, of course, no locks or anything.
> Switched to hot-standby.

Meaning you promoted it to master?

> Hot-standby instantly gained these locks. Also noticed that 2
> prepared transactions migrated to the hot-standby. Binary
> upgraded to 9.1.8, locks still existed. Ended up executing the
> one thing we knew would work. Take the site down, pg_dumpall to
> fresh instance. Everything is fine.

That all sounds consistent with a flaky transaction manager.

> A little more background: We were running 9.1.4 back when 9.1.6
> came out. We saw there were possible corruption issues and did a
> binary upgrade and reindexing. Everything seemed to be fine, but
> I wonder if we really had problems back then. We rebuilt the
> hot-standby after the binary upgrade via normal restore and
> wal-file replays.  I should also note that this row that had the
> lock on it that would not go away was created by an app server
> that was killed (via -9) since it was non-responsive, and the row
> 'create date' (in db and logs) is the exact time the app server
> was killed.

It sounds as though you abruptly terminated the process running
your transaction manager, which left it unaware of one or more
dangling prepared transactions.  Further, it sounds like your
transaction manager doesn't go looking for such detritus.  If it's
not going to watch for such problems, you had better do so.  Any
prepared transaction which is sitting in pg_prepared_xacts for more
than a few seconds, I would consider suspect.  After a few minutes
or hours I would consider them to be a problem.  After a day I would
consider the transaction manager to have fallen on its face, and
would go clean things up by either committing or rolling back the
prepared transaction(s).

> I was wondering if a) these virtualtransactions that start with
> '-1/' indicate a problem, b) if this could have happened from pre
> 9.1.6 corruption that was fixed in 9.1.6. Or, could this have
> occurred when we killed that app server? Or.... am I looking in
> the wrong place.

-1 as the process doesn't indicate a problem per se.  It indicates
the transaction has been "prepared" and is no longer associated
with a backend process or connection.

Something which was assuming the role of a transaction manager told
a transaction (or many of them) to prepare for commit as part of a
distributed transaction.  A transaction which says it successfully
completed such a PREPARE statement must hold all its locks and keep
all changes pending until it is told to commit or roll back, even
across database restarts.  It sounds like things were left in this
state for a very long time, which can lead to all kinds of
problems, notably bloat and blocking.

> I do still have the old data directories so I can start them up
> and check out the dataset. Any advice?

I would start it up and see what's in pg_prepared_xacts.
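When you do, something along these lines (untested, adjust to
taste) will show how long each prepared transaction has been
sitting there:

  SELECT gid, prepared, now() - prepared AS age, owner, database
    FROM pg_prepared_xacts
   ORDER BY prepared;

For any entry the transaction manager has clearly abandoned, you
can resolve it by hand while connected to the database listed for
it, using its gid (the gid below is only a placeholder):

  COMMIT PREPARED 'gid-from-the-query-above';
  -- or
  ROLLBACK PREPARED 'gid-from-the-query-above';

Which of those is right depends on whether the other participants
in the distributed transaction committed their work.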
--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company