8.1 index corruption woes - Mailing list pgsql-hackers
From | Alvaro Herrera |
---|---|
Subject | 8.1 index corruption woes |
Date | |
Msg-id | 20080707211607.GG4715@alvh.no-ip.org Whole thread Raw |
Responses |
Re: 8.1 index corruption woes
|
List | pgsql-hackers |
Hi, We've detected what I think is some sort of index corruption in 8.1. The server is running 8.1.11, so AFAICT the problem with truncated pages in vacuum is already patched and accounted for (i.e. we reindexed, and a bit later the problem presented itself again). There haven't been any relevant fixes after that AFAICT. What we see is that after a bit of updating the index, it starts having tuples that poing to heap entries which are marked unused. I detected one of these almost by accident, and then built a tool to discover them by examining pg_filedump output. Basically what it does is scan the heap, note which heap tuples are marked "unused", and then scan the index and for each index tuple in leaf pages, see if it points to an unused heap tuple. The number of occurrences is amazingly high. Right after a reindex there isn't any occurrence; but after a while of application load, a lot of them appear. I catalogued this as index corruption: the theory is that as soon as the unused heap tuple is reused, the index will have a pointer with the wrong index key pointing to a live heap tuple. (We saw an occurrence of this too, but I wasn't motivated enough to write a tool to verify the data in index vs. heap tuples.) However, seeing the high prevalence of the problem, I started considering whether the tool is misreading the output anyway -- i.e. maybe it's a bug in the tool, or a misconception on my part. The tool output looks like this: INDEX PTR TO UNUSED HEAP: index 273375 (78,18) -> (5530, 17) INDEX PTR TO UNUSED HEAP: index 273375 (96,84) -> (5436, 3) INDEX PTR TO UNUSED HEAP: index 273375 (111,1) -> (1317, 26) INDEX PTR TO UNUSED HEAP: index 273375 (145,1) -> (1665, 26) INDEX PTR TO UNUSED HEAP: index 273375 (174,1) -> (2656, 17) INDEX PTR TO UNUSED HEAP: index 273375 (199,1) -> (1745, 21) INDEX PTR TO UNUSED HEAP: index 273375 (207,1) -> (358, 26) INDEX PTR TO UNUSED HEAP: index 273375 (214,1) -> (2800, 17) (many more lines of the same stuff) What this means (first line) is that on index 273375, page 78, offset 18, there's a pointer to heap page 5530, offset 17; but that heap offset shows up as unused. This is from the heap: Block 5530 ******************************************************** <Header> ----- Block Offset: 0x02b34000 Offsets: Lower 136 (0x0088) Block: Size 8192 Version 3 Upper 992 (0x03e0) LSN: logid 130 recoff 0xd53b3090 Special 8192 (0x2000) Items: 29 Free Space: 856 Length (including item array): 140 [ ... snip other items ...] Item 17 -- Length: 0 Offset: 504 (0x01f8) Flags: 0x00 This is from the index: Block 78 ******************************************************** <Header> ----- Block Offset: 0x0009c000 Offsets: Lower 788 (0x0314) Block: Size 8192 Version 3 Upper 5104 (0x13f0) LSN: logid 131 recoff 0x01b3c6e0 Special 8176 (0x1ff0) Items: 192 Free Space: 4316 Length (including item array): 792 [ ... snip other items ... ] Item 18 -- Length: 16 Offset: 5184 (0x1440) Flags: USED Block Id: 5530 linp Index: 17 Size: 16 Has Nulls: 0 Has Varwidths: 0 Of course, we're very concerned about this problem. Personally I am prepared to believe that this could be a hardware problem, because no one else seems to be reporting this kind of thing, but I don't see how it could cause this particular issue and not more wide-ranging data corruption. I would like comments on whether this is really a problem or I am just misreading pg_filedump output. If we confirm this, we can try to proceed to investigate it further. I attach the checker tool in case it is of any interest. Thanks, -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Attachment
pgsql-hackers by date: