Re: Database corruption event, unlockable rows, possibly bogus virtual xids? Invalid backend server xid - Mailing list pgsql-general

From Kevin Grittner
Subject Re: Database corruption event, unlockable rows, possibly bogus virtual xids? Invalid backend server xid
Date
Msg-id 1361479613.45008.YahooMailNeo@web162902.mail.bf1.yahoo.com
Whole thread Raw
In response to Database corruption event, unlockable rows, possibly bogus virtual xids? Invalid backend server xid  (Ned Wolpert <ned.wolpert@imemories.com>)
List pgsql-general
Ned Wolpert <ned.wolpert@imemories.com> wrote:=0A=0A> I'm doing a postmorte=
m on a corruption event we had. I have an=0A> idea on what happened, but no=
t sure. I figure I'd share what=0A> happened and see if I'm close to right =
here.=0A>=0A> Running 9.1.6 with hot-standby, archiving 4 months of wal fil=
es,=0A> and even a nightly pg_dump all.=0A=0AThose WAL files aren't going t=
o be of much use without an=0Aoccasional base backup to apply them to.=0A=
=0A> 50G database.=A0 Trying to update or delete a row in a small (21=0A> r=
ow, but heavily used table) would lock up completely. Never=0A> finish.=0A=
=0AHow long is "never" in hours or days?=0A=0A> Removed all clients, restar=
ted the db instance, no joy.=0A> Check pg_stat_activity, and nothing that w=
asn't idle.... run the=0A> delete, locked up.=0A=0ADid you have any rows in=
 pg_prepared_xacts that had lingered for=0Alonger than you were waiting for=
 the delete?=0A=0A> Ran (SELECT*FROM pg_class JOIN pg_locks ON pg_locks.rel=
ation =3D=0A> pg_class.oid;) with no clients touching this row=0A=0AOf cour=
se you need to be connected to the right database when you=0Arun this, and =
you need to look at relation locks -- row locks don't=0Ashow up in that vie=
w.=0A=0A> fresh restart of the db, and saw virtualtransactions against this=
=0A> same table where the values would be -1/nnnnn were nnnnn was a=0A> hug=
e number.=A0 Turns out we had about 159 entries from different=0A> tables i=
n the database.=0A=0ASounds like prepared transactions.=A0 Again, what was =
in=0Apg_prepared_xacts?=0A=0A> Checked hot-standby and, of course, no locks=
 or anything.=0A> Switched to hot-standby.=0A=0AMeaning you promoted it to =
master?=0A=0A> Hot-standby instantly gained these locks, Also noticed that =
2=0A> prepared transactions migrated to the hot-standby. Binary=0A> upgrade=
d to 9.1.8, locks still existed. Ended up executing the=0A> one thing we kn=
ew would work. Take the site down, pg_dumpall to=0A> fresh instance. Everyt=
hing is fine.=0A=0AThat all sounds consistent with a flaky transaction mana=
ger.=0A=0A> A little more background: We were running 9.1.4 back when 9.1.6=
=0A> came out. We saw there was possible corruption issues and did a=0A> bi=
nary upgrade and reindexing. Everything seemed to be fine, but=0A> I wonder=
 if we really had problems back then. We rebuilt the=0A> hot-standby after =
the binary upgrade via normal restore and=0A> wal-file replays.=A0 I should=
 also note that this row that had the=0A> lock on it that would not go away=
, was created by an app server=0A> that was killed (via -9) since it was no=
n-responsive, and the row=0A> 'create date' (in db and logs) is the exact t=
ime the app server=0A> was killed.=0A=0AIt sounds as though you abruptly te=
rminated the process running=0Ayour transaction manager, which left it unaw=
are of one or more=0Adangling prepared transactions.=A0 Further, it sounds =
like your=0Atransaction manager doesn't go looking for such detritus.=A0 If=
 it's=0Anot going to watch for such problems, you had better do so.=A0 Any=
=0Aprepared transaction which is sitting in pg_prepared_xacts for more=0Ath=
an a few seconds, I would consider suspect.=A0 After a few minutes=0Ahours =
I would consider them to be a problem.=A0 After a day I would=0Aconsider th=
e transaction manager to have fallen on its face, and=0Awould go clean thin=
gs up by either committing or rolling back the=0Aprepared transaction(s).=
=0A=0A> I was wondering if a) these virtualtransactions that start with=0A>=
 '-1/' indicate a problem, b) if this could have happened from pre=0A> 9.1.=
6 corruption that was fixed in 9.1.6. Or, could this have=0A> occurred when=
 we killed that app server? Or.... am I looking in=0A> the wrong place.=0A=
=0A-1 as the process doesn't indicate a problem per se.=A0 It indicates=0At=
he transaction has been "prepared" and is no longer associated=0Awith a bac=
kend process or connection.=0A=0ASomething which was assuming the role of a=
 transaction manager told=0Aa transaction (or many of them) to prepare for =
commit as part of a=0Adistributed transaction.=A0 A transaction which says =
it successfully=0Acompleted such a PREPARE statement must hold all its lock=
s and keep=0Aall changes pending until it is told to commit or roll back, e=
ven=0Aacross database restarts.=A0 It sounds like things were left in this=
=0Astate for a very long time, which can lead to all kinds of=0Aproblems, n=
otably bloat and blocking.=0A=0A> I do still have the old data directories =
so I can start them up=0A> and check out the dataset. Any advice?=0A=0AI wo=
uld start it up and see what's in pg_prepared_xacts.=0A=0A-- =0AKevin Gritt=
ner=0AEnterpriseDB: http://www.enterprisedb.com=0AThe Enterprise PostgreSQL=
 Company

pgsql-general by date:

Previous
From: "James B. Byrne"
Date:
Subject: Re: Need help extripating plpgsql
Next
From: Adrian Klaver
Date:
Subject: Re: Need help extripating plpgsql