Thread: 8.1 index corruption woes
Hi, We've detected what I think is some sort of index corruption in 8.1. The server is running 8.1.11, so AFAICT the problem with truncated pages in vacuum is already patched and accounted for (i.e. we reindexed, and a bit later the problem presented itself again). There haven't been any relevant fixes after that AFAICT. What we see is that after a bit of updating the index, it starts having tuples that poing to heap entries which are marked unused. I detected one of these almost by accident, and then built a tool to discover them by examining pg_filedump output. Basically what it does is scan the heap, note which heap tuples are marked "unused", and then scan the index and for each index tuple in leaf pages, see if it points to an unused heap tuple. The number of occurrences is amazingly high. Right after a reindex there isn't any occurrence; but after a while of application load, a lot of them appear. I catalogued this as index corruption: the theory is that as soon as the unused heap tuple is reused, the index will have a pointer with the wrong index key pointing to a live heap tuple. (We saw an occurrence of this too, but I wasn't motivated enough to write a tool to verify the data in index vs. heap tuples.) However, seeing the high prevalence of the problem, I started considering whether the tool is misreading the output anyway -- i.e. maybe it's a bug in the tool, or a misconception on my part. The tool output looks like this: INDEX PTR TO UNUSED HEAP: index 273375 (78,18) -> (5530, 17) INDEX PTR TO UNUSED HEAP: index 273375 (96,84) -> (5436, 3) INDEX PTR TO UNUSED HEAP: index 273375 (111,1) -> (1317, 26) INDEX PTR TO UNUSED HEAP: index 273375 (145,1) -> (1665, 26) INDEX PTR TO UNUSED HEAP: index 273375 (174,1) -> (2656, 17) INDEX PTR TO UNUSED HEAP: index 273375 (199,1) -> (1745, 21) INDEX PTR TO UNUSED HEAP: index 273375 (207,1) -> (358, 26) INDEX PTR TO UNUSED HEAP: index 273375 (214,1) -> (2800, 17) (many more lines of the same stuff) What this means (first line) is that on index 273375, page 78, offset 18, there's a pointer to heap page 5530, offset 17; but that heap offset shows up as unused. This is from the heap: Block 5530 ******************************************************** <Header> ----- Block Offset: 0x02b34000 Offsets: Lower 136 (0x0088) Block: Size 8192 Version 3 Upper 992 (0x03e0) LSN: logid 130 recoff 0xd53b3090 Special 8192 (0x2000) Items: 29 Free Space: 856 Length (including item array): 140 [ ... snip other items ...] Item 17 -- Length: 0 Offset: 504 (0x01f8) Flags: 0x00 This is from the index: Block 78 ******************************************************** <Header> ----- Block Offset: 0x0009c000 Offsets: Lower 788 (0x0314) Block: Size 8192 Version 3 Upper 5104 (0x13f0) LSN: logid 131 recoff 0x01b3c6e0 Special 8176 (0x1ff0) Items: 192 Free Space: 4316 Length (including item array): 792 [ ... snip other items ... ] Item 18 -- Length: 16 Offset: 5184 (0x1440) Flags: USED Block Id: 5530 linp Index: 17 Size: 16 Has Nulls: 0 Has Varwidths: 0 Of course, we're very concerned about this problem. Personally I am prepared to believe that this could be a hardware problem, because no one else seems to be reporting this kind of thing, but I don't see how it could cause this particular issue and not more wide-ranging data corruption. I would like comments on whether this is really a problem or I am just misreading pg_filedump output. If we confirm this, we can try to proceed to investigate it further. I attach the checker tool in case it is of any interest. Thanks, -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Attachment
Alvaro Herrera <alvherre@commandprompt.com> writes: > We've detected what I think is some sort of index corruption in 8.1. > The server is running 8.1.11, so AFAICT the problem with truncated pages > in vacuum is already patched and accounted for (i.e. we reindexed, and a > bit later the problem presented itself again). There haven't been any > relevant fixes after that AFAICT. > What we see is that after a bit of updating the index, it starts having > tuples that poing to heap entries which are marked unused. Do you actually see any observed problem, or is this conclusion based entirely on your pg_filedump analysis? I suspect a problem with your analysis script, although a quick scan of the code didn't find an issue. Another point to keep in mind, if you are trying to analyze files belonging to a live database, is that what you can see in the filesystem may not be the "current" contents of every page. For typical access patterns it'd be unsurprising for the visible index pages to lag behind those of the heap, since they'd be "hotter" and tend to stay in shared buffers longer. regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > We've detected what I think is some sort of index corruption in 8.1. > > The server is running 8.1.11, so AFAICT the problem with truncated pages > > in vacuum is already patched and accounted for (i.e. we reindexed, and a > > bit later the problem presented itself again). There haven't been any > > relevant fixes after that AFAICT. > > > What we see is that after a bit of updating the index, it starts having > > tuples that poing to heap entries which are marked unused. > > Do you actually see any observed problem, or is this conclusion based > entirely on your pg_filedump analysis? Well, yeah, this all started because the guys started getting weird results in queries, and found out that disabling index scans returned different results. > I suspect a problem with your analysis script, although a quick scan > of the code didn't find an issue. Well, that's why I posted the corresponding pg_filedump output -- manual examination shows of that shows that the script is drawing the correct conclusions. > Another point to keep in mind, if you are trying to analyze files > belonging to a live database, is that what you can see in the filesystem > may not be the "current" contents of every page. For typical access > patterns it'd be unsurprising for the visible index pages to lag behind > those of the heap, since they'd be "hotter" and tend to stay in shared > buffers longer. Hmm, I think the files come from a PITR slave that's not online. I'll ask to be sure. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> Another point to keep in mind, if you are trying to analyze files >> belonging to a live database, is that what you can see in the filesystem >> may not be the "current" contents of every page. For typical access >> patterns it'd be unsurprising for the visible index pages to lag behind >> those of the heap, since they'd be "hotter" and tend to stay in shared >> buffers longer. > Hmm, I think the files come from a PITR slave that's not online. I'll > ask to be sure. 8.1 didn't have restartpoint code, and of course bgwriter isn't running; so I believe filesystem pages on a PITR slave could be arbitrarily far out of date if the corresponding shared buffer got touched regularly. Try doing the analysis on the master immediately after a CHECKPOINT command. regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Tom Lane wrote: > >> Another point to keep in mind, if you are trying to analyze files > >> belonging to a live database, is that what you can see in the filesystem > >> may not be the "current" contents of every page. For typical access > >> patterns it'd be unsurprising for the visible index pages to lag behind > >> those of the heap, since they'd be "hotter" and tend to stay in shared > >> buffers longer. > > > Hmm, I think the files come from a PITR slave that's not online. I'll > > ask to be sure. > > 8.1 didn't have restartpoint code, and of course bgwriter isn't running; > so I believe filesystem pages on a PITR slave could be arbitrarily far > out of date if the corresponding shared buffer got touched regularly. > Try doing the analysis on the master immediately after a CHECKPOINT > command. Okay. I'll also check out tomorrow the LSN on the involved pages to see if there's anything obviously bogus. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Jul 7, 2008, at 7:39 PM, Tom Lane wrote: > Another point to keep in mind, if you are trying to analyze files > belonging to a live database, is that what you can see in the > filesystem > may not be the "current" contents of every page. For typical access > patterns it'd be unsurprising for the visible index pages to lag > behind > those of the heap, since they'd be "hotter" and tend to stay in shared > buffers longer. None of the tests were done on the production database. Most of the checks were not done on a PITR restore; they were done on a SAN-level snapshot that had been run through the recovery process (startup postmaster on snapshot, let it recover, shut down). I hadn't thought about checkpointing; I'll make sure to do that next time we take a snapshot. We also analyzed a single table from a completely different (much larger) database. In that case the analysis was done on a PITR- recovered "slave" that was up and running, but nothing should have been writing to the table at all, and it would have been up long enough that it would have checkpointed after exiting PITR recovery (though IIRC there's a manual checkpoint done at exit of PITR recovery anyway). That check didn't show as many questionable index pointers, but there were some (again, the bulk of them were index pointers that were using the first line pointer slot in the index page). -- Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828
Alvaro Herrera wrote: > Tom Lane wrote: > > Alvaro Herrera <alvherre@commandprompt.com> writes: > > > We've detected what I think is some sort of index corruption in 8.1. > > > The server is running 8.1.11, so AFAICT the problem with truncated pages > > > in vacuum is already patched and accounted for (i.e. we reindexed, and a > > > bit later the problem presented itself again). There haven't been any > > > relevant fixes after that AFAICT. > > > > > What we see is that after a bit of updating the index, it starts having > > > tuples that poing to heap entries which are marked unused. > > > > Do you actually see any observed problem, or is this conclusion based > > entirely on your pg_filedump analysis? > > Well, yeah, this all started because the guys started getting weird > results in queries, and found out that disabling index scans returned > different results. I neglected to mention that further analysis of the failed index scans showed that index entries were pointing to heap tuples with completely different data. I was tracking one of those, when I noticed that in the same index page was an entry pointing to a heap entry marked as unused. > > I suspect a problem with your analysis script, although a quick scan > > of the code didn't find an issue. I don't want to discard this idea, because we're getting a very unusually high number of bogus entries. However, they are all (or a very high percentage of them) the very first entry on each index page. I want to confirm that the leftmost on a leaf btree page is a valid item, and not something like the lower bound value? (I think we only store high bounds on internal pages, not leaf pages, but I'm not sure). > > Another point to keep in mind, if you are trying to analyze files > > belonging to a live database, is that what you can see in the filesystem > > may not be the "current" contents of every page. For typical access > > patterns it'd be unsurprising for the visible index pages to lag behind > > those of the heap, since they'd be "hotter" and tend to stay in shared > > buffers longer. It was confirmed that the servers are not live, and a checkpoint has been executed by the recovery code. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > I don't want to discard this idea, because we're getting a very > unusually high number of bogus entries. However, they are all (or a > very high percentage of them) the very first entry on each index page. > I want to confirm that the leftmost on a leaf btree page is a valid > item, and not something like the lower bound value? (I think we only > store high bounds on internal pages, not leaf pages, but I'm not sure). Er ... no. Per nbtree/README: : On a page that is not rightmost in its tree level, the "high key" is : kept in the page's first item, and real data items start at item 2. : The link portion of the "high key" item goes unused. A page that is : rightmost has no "high key", so data items start with the first item. : Putting the high key at the left, rather than the right, may seem odd, : but it avoids moving the high key as we add data items. (Right offhand, it looks like _bt_split just copies the item that it's cloning the index key of. Maybe it would be worth setting the item pointer invalid, to prevent confusion in future?) regards, tom lane
On Jul 11, 2008, at 2:17 PM, Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: >> I don't want to discard this idea, because we're getting a very >> unusually high number of bogus entries. However, they are all (or a >> very high percentage of them) the very first entry on each index >> page. >> I want to confirm that the leftmost on a leaf btree page is a valid >> item, and not something like the lower bound value? (I think we only >> store high bounds on internal pages, not leaf pages, but I'm not >> sure). > > Er ... no. Per nbtree/README: > > : On a page that is not rightmost in its tree level, the "high key" is > : kept in the page's first item, and real data items start at item 2. > : The link portion of the "high key" item goes unused. A page that is > : rightmost has no "high key", so data items start with the first > item. > : Putting the high key at the left, rather than the right, may seem > odd, > : but it avoids moving the high key as we add data items. > > (Right offhand, it looks like _bt_split just copies the item that it's > cloning the index key of. Maybe it would be worth setting the item > pointer invalid, to prevent confusion in future?) Another option would be to change pg_filedump to understand what the first item in the index is. -- Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828