Thread: ERROR: invalid memory alloc request size, and others
We recently upgraded from 8.1.4 to 8.2.0 on Fedora Core 6, and are now seeing a few rather ominous-looking messages.

The problem started with this one, during an update involving a rather complex view:

ERROR: invalid memory alloc request size 1174405120

Then, when attempting to re-run the same update statement:

PANIC: cannot abort transaction 8682091, it was already committed

and then again:

PANIC: right sibling's left-link doesn't match

I ran reindex on the tables in question, which fixed the problem in the short term and allowed the update to complete, but then I got the memory allocation error and these overnight:

ERROR: invalid page header in block 3362 of relation "index_clin_dal_staff_id"
ERROR: invalid page header in block 2325 of relation "index_clin_dal_batch"

I will gladly provide additional information to help track down and hopefully solve this problem, but at this point I'm not sure what else to include.

Any help would be much appreciated.

Thanks,
Jonathan
On Tue, 2007-01-09 at 13:38, Jonathan Hedstrom wrote:
> We recently upgraded from 8.1.4 to 8.2.0 on Fedora Core 6, and are now
> seeing a few rather ominous-looking messages.
>
> The problem started with this one, during an update involving a rather
> complex view:
>
> ERROR: invalid memory alloc request size 1174405120
>
> Then, when attempting to re-run the same update statement:
>
> PANIC: cannot abort transaction 8682091, it was already committed
>
> and then again:
>
> PANIC: right sibling's left-link doesn't match
>
>
> I ran reindex on the tables in question, which fixed the problem in the
> short term and allowed the update to complete, but then I got the memory
> allocation error and these overnight:
>
> ERROR: invalid page header in block 3362 of relation
> "index_clin_dal_staff_id"
> ERROR: invalid page header in block 2325 of relation "index_clin_dal_batch"
>
> I will gladly provide additional information to help track down and
> hopefully solve this problem, but at this point I'm not sure what else
> to include.

First step, update to 8.2.1, it came out yesterday, and there were a few bugs that got stomped. Don't know if they are related to your problem, but having the latest version is usually a "good thing".

Also, schedule some maintenance window for your server to run memtest86 and possibly something to check for bad blocks on your drives. Often errors like the invalid memory alloc request size you're seeing, and the link doesn't match one are caused by bad hardware.
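For reference, the size reported in the first error can be checked against the server's allocation limit. This is only a minimal sketch: it assumes the limit is `MaxAllocSize` = 0x3fffffff (1 GB - 1), as defined in the PostgreSQL source, and the helper name `describe_alloc_request` is made up for illustration.

```python
# Assumption: PostgreSQL rejects palloc requests larger than MaxAllocSize,
# taken here to be 0x3fffffff (1 GB - 1).
MAX_ALLOC_SIZE = 0x3FFFFFFF


def describe_alloc_request(size: int) -> dict:
    """Summarize a size taken from an 'invalid memory alloc request size' error."""
    return {
        "size": size,
        "hex": hex(size),
        "mib": size / (1024 * 1024),
        "exceeds_limit": size > MAX_ALLOC_SIZE,
    }


# Size copied from the error message in the original post.
info = describe_alloc_request(1174405120)
print(info)
```

The request is well over the assumed limit and is a suspiciously round number in hex. That pattern is what one would typically expect when a length field is read from damaged data, which fits the hardware-or-kernel suggestion above, though it does not prove it.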
Scott Marlowe wrote:
> On Tue, 2007-01-09 at 13:38, Jonathan Hedstrom wrote:
>
>> We recently upgraded from 8.1.4 to 8.2.0 on Fedora Core 6, and are now
>> seeing a few rather ominous-looking messages.
>>
>> The problem started with this one, during an update involving a rather
>> complex view:
>>
>> ERROR: invalid memory alloc request size 1174405120
>>
>> Then, when attempting to re-run the same update statement:
>>
>> PANIC: cannot abort transaction 8682091, it was already committed
>>
>> and then again:
>>
>> PANIC: right sibling's left-link doesn't match
>>
>>
>> I ran reindex on the tables in question, which fixed the problem in the
>> short term and allowed the update to complete, but then I got the memory
>> allocation error and these overnight:
>>
>> ERROR: invalid page header in block 3362 of relation
>> "index_clin_dal_staff_id"
>> ERROR: invalid page header in block 2325 of relation "index_clin_dal_batch"
>>
>> I will gladly provide additional information to help track down and
>> hopefully solve this problem, but at this point I'm not sure what else
>> to include.
>>
>
> First step, update to 8.2.1, it came out yesterday, and there were a few
> bugs that got stomped. Don't know if they are related to your problem,
> but having the latest version is usually a "good thing".
>

I noticed that 8.2.1 had been released shortly after I sent off my initial email. I've upgraded and hopefully that will take care of it.

> Also, schedule some maintenance window for your server to run memtest86
> and possibly something to check for bad blocks on your drives. Often
> errors like the invalid memory alloc request size you're seeing, and the
> link doesn't match one are caused by bad hardware.

I'll try to run these tests soon, but this is a production server, so scheduling downtime takes a bit of planning.

I should also mention that this is the exact same server we've been running 8.1.4 on for about 6 months without a problem (doing the same nightly update causing the problem etc), so I'm a bit skeptical about having suddenly developed hardware issues.

Thanks,
Jonathan
Scott Marlowe <smarlowe@g2switchworks.com> writes:
> On Tue, 2007-01-09 at 13:38, Jonathan Hedstrom wrote:
>> We recently upgraded from 8.1.4 to 8.2.0 on Fedora Core 6, and are now
>> seeing a few rather ominous-looking messages.

> First step, update to 8.2.1, it came out yesterday, and there were a few
> bugs that got stomped.

Definitely good advice, although none of these messages look related to the known fixes.

> Also, schedule some maintenance window for your server to run memtest86
> and possibly something to check for bad blocks on your drives.

+1 ... I have not seen any instance of "invalid page header" that could be traced to a Postgres bug. The cases I've been able to study all seemed to involve either flaky hardware or kernel-level bugs (such as dumping a fragment of some unrelated file into a Postgres table :-()

regards, tom lane
Tom Lane wrote:
> Scott Marlowe <smarlowe@g2switchworks.com> writes:
>
>> Also, schedule some maintenance window for your server to run memtest86
>> and possibly something to check for bad blocks on your drives.
>>
>
> +1 ... I have not seen any instance of "invalid page header" that could
> be traced to a Postgres bug. The cases I've been able to study all
> seemed to involve either flaky hardware or kernel-level bugs (such as
> dumping a fragment of some unrelated file into a Postgres table :-()
>
> regards, tom lane
>

Since it sounds like this is either a hardware or a kernel issue, we're wondering if our downtime would be better spent rebooting to the standard FC6 kernel, or trying some of the aforementioned hardware tests...

We are running a xen kernel: 2.6.18-1.2798.fc6xen and getting these kernel errors in our logs:

Jan 7 18:51:23 ws116 kernel: SKB BUG: Invalid truesize (4012) len=16384, sizeof(sk_buff)=172
Jan 7 18:51:23 ws116 kernel: SKB BUG: Invalid truesize (4012) len=16384, sizeof(sk_buff)=172
Jan 9 08:52:12 ws116 kernel: SKB BUG: Invalid truesize (4012) len=16384, sizeof(sk_buff)=172
Jan 9 13:07:35 ws116 kernel: SKB BUG: Invalid truesize (4012) len=16384, sizeof(sk_buff)=172

(The memory alloc error first occurred early in the morning on the 8th.)

Thanks,
Jonathan
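To line the kernel messages up against the date of the first Postgres error, the log lines can be grouped by day. This is a rough sketch only: the lines are the four quoted above, the regular expression assumes the usual syslog layout shown there, and `skb_bug_days` is a hypothetical helper name.

```python
import re
from collections import Counter

# The four kernel log lines quoted in the message above.
LOG_LINES = [
    "Jan 7 18:51:23 ws116 kernel: SKB BUG: Invalid truesize (4012) len=16384, sizeof(sk_buff)=172",
    "Jan 7 18:51:23 ws116 kernel: SKB BUG: Invalid truesize (4012) len=16384, sizeof(sk_buff)=172",
    "Jan 9 08:52:12 ws116 kernel: SKB BUG: Invalid truesize (4012) len=16384, sizeof(sk_buff)=172",
    "Jan 9 13:07:35 ws116 kernel: SKB BUG: Invalid truesize (4012) len=16384, sizeof(sk_buff)=172",
]

# Assumed syslog layout: month, day, time, host, then the kernel tag.
PATTERN = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+\S+\s+kernel: SKB BUG"
)


def skb_bug_days(lines):
    """Count 'SKB BUG' kernel messages per (month, day)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.match(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


counts = skb_bug_days(LOG_LINES)
for (month, day), n in sorted(counts.items(), key=lambda item: item[0][1]):
    print(f"{month} {day}: {n} SKB BUG message(s)")
```

In the quoted lines the kernel messages fall on the 7th and the 9th, with none on the 8th when the memory alloc error first appeared, so the timing is suggestive at most.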
Jonathan Hedstrom <jhedstrom@desc.org> writes:
> We are running a xen kernel: 2.6.18-1.2798.fc6xen

When did you start doing that ... any relation to the time when the problems started? I've heard some unkind remarks about the stability of Xen, though I have no direct knowledge about it.

regards, tom lane
Jonathan Hedstrom wrote:
> Tom Lane wrote:
>> Scott Marlowe <smarlowe@g2switchworks.com> writes:
>>
>>> Also, schedule some maintenance window for your server to run memtest86
>>> and possibly something to check for bad blocks on your drives.
>>>
>> +1 ... I have not seen any instance of "invalid page header" that could
>> be traced to a Postgres bug. The cases I've been able to study all
>> seemed to involve either flaky hardware or kernel-level bugs (such as
>> dumping a fragment of some unrelated file into a Postgres table :-()
>>
>> regards, tom lane
>>
> Since it sounds like this is either a hardware or a kernel issue, we're
> wondering if our downtime would be better spent rebooting to the
> standard FC6 kernel, or trying some of the aforementioned hardware tests...
>
> We are running a xen kernel: 2.6.18-1.2798.fc6xen

This is the base Xen kernel from the FC 6 release. There have been 3 updates released since then (most recently 01-Jan). I see a number of Xen fixes in the changelog, and I know that the major factor in the slippage of the FC 6 release was getting Xen into the distro -- so I would definitely expect some Xen bugs in the initial cut from the release.

Simplest advice I can think of: if you don't need Xen, go back to the stock (albeit most recent update) kernel. If you do need Xen, try the most recent update of the stock kernel anyway. If the problems persist, you've at least eliminated one variable. If they go away, you've got the culprit.

Note that I don't use Xen, so I'm not completely up-to-date... Last I knew there were issues preventing Xen from being included in the upstream Linux kernel (vendors are patching it in individually), and that speaks volumes as to the "newness" of the technology.

Andrew
Jonathan Hedstrom wrote:
> Scott Marlowe wrote:
>> On Tue, 2007-01-09 at 13:38, Jonathan Hedstrom wrote:
>>
>>> We recently upgraded from 8.1.4 to 8.2.0 on Fedora Core 6, and are now
>>> seeing a few rather ominous-looking messages.

[ SNIP ]

>> Also, schedule some maintenance window for your server to run memtest86
>> and possibly something to check for bad blocks on your drives. Often
>> errors like the invalid memory alloc request size you're seeing, and the
>> link doesn't match one are caused by bad hardware.
>
> I'll try to run these tests soon, but this is a production server, so
> scheduling downtime takes a bit of planning. I should also mention that
> this is the exact same server we've been running 8.1.4 on for about 6
> months without a problem (doing the same nightly update causing the
> problem etc), so I'm a bit skeptical about having suddenly developed
> hardware issues.

Not having experienced any of the above issues myself, the following are totally off-the-cuff:

You indicated you are now running Fedora Core 6, and that your previous 8.1.4 configuration was running for ~6 months prior to upgrading. I assume that because it is a production server, the FC 6 upgrade was performed at the same time as the 8.2 upgrade.

You indicated that you're running on the exact same hardware as before, so unless you opened the machine, relocated it, etc. as part of the upgrade, hardware issues would seem unlikely.

FC 6 brings a number of changes to the table -- the biggest one being the 2.6.18 Linux kernel. The 2.6.18 kernel updates were released for FC 5 in the middle of October, but as you are talking about a production server, I would imagine you did not upgrade at that time.

Have you applied all current FC 6 updates? I definitely recommend making sure you are current with all updates. A quick check shows that FC 6 has had almost as many package updates since its release as FC 5 has had since its own release (with 6 months more time). I know there were a number of issues that were queued up and released right after the FC 6 release, to avoid slipping the release date any further than it had already slipped.

If you can't attribute the issues to any software problems, I would definitely start looking at the hardware, as Scott & Tom suggested. As unlikely as hardware issues might seem, that seems to be the most frequent time those little buggers manage to show their teeth :^)

Hope this helps...

Andrew
Andrew Kroeger wrote:
> This is the base Xen kernel from the FC 6 release. There have been 3
> updates released since then (most recently 01-Jan). I see a number of
> Xen fixes in the changelog, and I know that the major factor in the
> slippage of the FC 6 release was getting Xen into the distro -- so I
> would definitely expect some Xen bugs in the initial cut from the
> release.
>

We had downloaded the kernel updates, but after doing so, forgot to reboot in order to use the updated kernel...

> Simplest advice I can think of: if you don't need Xen, go back to the
> stock (albeit most recent update) kernel. If you do need Xen, try the
> most recent update of the stock kernel anyway. If the problems
> persist, you've at least eliminated one variable. If they go away,
> you've got the culprit.
>

We downloaded the most recent stock FC6 kernel and rebooted to that. Hopefully this will take care of the issue.

I reindexed the tables related to the failing update. Other than that, is there any cleanup work I should do relating to these errors?

Thanks for all the quick responses.

-Jonathan
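On the reindexing step mentioned above: the two relations named in the earlier "invalid page header" errors can be turned into explicit REINDEX statements. This is not advice from anyone in the thread, just a small sketch. It assumes both relations are indexes (their names suggest so), and `reindex_statements` is a hypothetical helper that only builds SQL text without connecting to a database.

```python
import re

# Error lines copied from the original post.
ERRORS = [
    'ERROR: invalid page header in block 3362 of relation "index_clin_dal_staff_id"',
    'ERROR: invalid page header in block 2325 of relation "index_clin_dal_batch"',
]

RELATION = re.compile(r'invalid page header in block (\d+) of relation "([^"]+)"')


def reindex_statements(messages):
    """Build one REINDEX INDEX statement per distinct relation named in the errors.

    Assumption: every relation named is an index; REINDEX INDEX would fail
    on a table, which would need REINDEX TABLE instead.
    """
    names = []
    for message in messages:
        match = RELATION.search(message)
        if match and match.group(2) not in names:
            names.append(match.group(2))
    return ['REINDEX INDEX "{}";'.format(name) for name in names]


statements = reindex_statements(ERRORS)
for statement in statements:
    print(statement)
```

Rebuilding an index only repairs the index itself. If a table page were damaged, reindexing would not fix it, so a dump of the affected tables is one way to confirm they can still be read end to end.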
Tom Lane wrote:
> Jonathan Hedstrom <jhedstrom@desc.org> writes:
>
>> We are running a xen kernel: 2.6.18-1.2798.fc6xen
>>
>
> When did you start doing that ... any relation to the time when the
> problems started?
>

We started using the xen kernel on the 3rd, and there wasn't any indication of a problem until the 8th, so that means the update ran w/o problem on the 4th and 5th (it doesn't run over the weekend).

We have decided to run the stock fc6 kernel for now.

-Jonathan
Jonathan Hedstrom wrote:
> We downloaded the most recent stock FC6 kernel and rebooted to that.
> Hopefully this will take care of the issue.

We've been up and running for 2 days now on the stock kernel, and haven't seen any of these errors. I'm thinking the issue is resolved.

Thanks again for all the replies.

-Jonathan