Corrupt Table - Mailing list pgsql-general
From | Bryan White |
---|---|
Subject | Corrupt Table |
Date | |
Msg-id | 003b01c01e63$8ce09d20$2dd260d1@arcamax.com Whole thread Raw |
Responses |
Re: Corrupt Table
|
List | pgsql-general |
I have apparently picked up a corrupt record in a table. What happend: Yesterday at one point the database seems to hang. There were three backend processes consuming large amounts of CPU time. I stopped the server and rebooted (3 months since last reboot). The database restarted and seemed to be fine. Then last night the nightly backups failed apparently when reading the 'customer' table. The database restarted itself. There have been a couple of database restarts since then. As far as I can tell it is the customer table that is the problem. Here is what a failure looks like in the log file: -------------------- Server process (pid 2864) exited with status 139 at Thu Sep 14 10:13:11 2000 Terminating any active server processes... 000914.10:13:11.425 [5879] NOTICE: Message from PostgreSQL backend: The Postmaster has informed me that some other backend died abnormally.... ------------------- The last entry is repeated multiple times. I have written a small utility program (pganal). It looks for inconsistancies in page layout and tuple layout. My original intent was to parse the tuple internal structure as well but that proved to be more complex that I was ready to handle at the time. Anyway I stopped the database, copied the customer file to another directory and restarted the database. Here is the pganal output from this copy: -------------------------- Analyzing customer Page 25878 ERROR: pd_lower is too small pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0 Page 31395 ERROR: pd_lower is too small pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0 Page 32950 ERROR: pd_lower is too small pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0 Tuple 71453.0 Ofs=8029 Len=164 Flags=1 Error: tuple overwrites pd_special Tuple 71453.4 Ofs=7346 Len=208 Flags=1 Error: tuple overlaps another Tuple 71453.40 Ofs=1365 Len=160 Flags=1 Error: tuple overlaps another Page 71958 ERROR: pd_lower has odd value pd_lower=11886 pd_upper=24239 pd_special=109 pd_opaque.od_pagesize=0 Page 73622 ERROR: pd_lower is too small pd_lower=0 pd_upper=0 pd_special=0 pd_opaque.od_pagesize=0 Page Summary Data Pages = 76555 Unused Pages = 0 New Pages = 0 Empty Pages = 0 Bad Pages = 5 Total Pages = 76560 Tuple Summary O/L Error Tuples = 1 Overlaped Tuples = 2 Unused Tuples = 47994 Used Tuples = 3698495 Total Tuples = 3746492 -------------------------- I suspect the 'pd_lower is too small' may be just my misunderstanding of the page layout. The three tuple errors (all on the same page) and the 'pd_lower has odd value' error seem to be real. 'pd_lower has odd value' comes from: int nitems = (phd->pd_lower - sizeof(*phd)) / sizeof(ItemIdData); if(nitems * sizeof(ItemIdData) != phd->pd_lower - sizeof(*phd)) pderr = "pd_lower has odd value"; Basically it means the pd_lower did not leave room for an integral number of ItemIDData structures. I seem to have two separate corrupt pages. I can post the full source to pganal if anyone is interested. Its about 300 lines. My question is how do I proceed from here. Going back to the previous day's backup would be very painful in terms of lost data. I suspect the answer is to perform surgery on the bad pages and then rebuild indexes but this is a scary idea. Has anyone else created tools to deal with this kind of problem? Bryan White, ArcaMax.com, VP of Technology You can't deny that it is not impossible, can you.
pgsql-general by date: