Thread: Frustration
Hi

I will admit I am getting reeeeeallllly frustrated right now. Currently PostgreSQL is crashing approximately once every 5 minutes on me:

template1=> select version();
                              version
-------------------------------------------------------------------
 PostgreSQL 6.5.2 on i686-pc-linux-gnu, compiled by gcc egcs-2.91.66
(1 row)

I am not doing anything except very basic commands, things like inserts and updates, and nothing involving too many expressions.

Now, I know nobody can debug anything from what I have just said, but I cannot get a better set of bug reports. I CAN'T get postgres to send out debug. For example, I start it using:

/usr/bin/postmaster -o "-F -S 10240" -d 3 -S -N 512 -B 3000 -D/var/lib/pgsql/data -o -F > /tmp/postmasterout 2> /tmp/postmastererr

Spot that in there I have -d 3 and redirect (this is under /bin/sh) to /tmp. Now, after repeated backend crashes, I have:

[postgres@home bin]$ cat /tmp/postmastererr
FindExec: found "/usr/bin/postgres" using argv[0]
binding ShmemCreate(key=52e2c1, size=31684608)
[postgres@home bin]$ cat /tmp/postmasterout
[postgres@home bin]$

Or exactly NOTHING.

This is out-of-the-box 6.5.2, no changes made, no changes in the config except to make it install into the right place. I just need to get some debug so I can actually report something. Am I doing something very dumb, or SHOULD there be debug here and there isn't?

I am about ready to pull my hair out over this. I NEED to have a stable database, and crashing EVERY five minutes is not helping me at all {:-(

Also, I seem to remember that someone posted here that when one backend crashed, it shouldn't close the other backends any more. Well, mine does:

NOTICE:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.

I am getting this about every five minutes, and I wish I knew what was doing it. Even if the backends recovered and just performed the query again, that would be enough; having to check whether the database has crashed EVERY TIME I start or finish a query is a huge overhead. I can appreciate that the backend that crashed cannot do this, but the others surely can! Rollback and start again, instead of rollback and panic.

Apologies if I sound a bit stressed right now; I was under the impression I had tested my system, so I opened it to the public, and now it is blowing up in my face BADLY.

If someone can tell me WHAT I am doing wrong with getting the debug info, please please do! I am watching it blow up again as we speak, and I must get SOMETHING fixed asap.

~Michael
Michael Simms <grim@argh.demon.co.uk> writes:
> Now, I know nobody can debug anything from what I have just said, but
> I cannot get a better set of bug reports. I CAN'T get postgres to send
> out debug.

> /usr/bin/postmaster -o "-F -S 10240" -d 3 -S -N 512 -B 3000 -D/var/lib/pgsql/data -o -F > /tmp/postmasterout 2> /tmp/postmastererr

Don't use the -S switch (the second one, not the one inside -o). Looking in postmaster.c, I see that it causes the postmaster to redirect stdout/stderr to /dev/null (probably not too hot an idea, but that's doubtless been like that for a *long* time). Instead, launch with something like

	nohup postmaster switches... </dev/null >logfile 2>errfile &

Good luck figuring out where the real problem is...

			regards, tom lane
> Good luck figuring out where the real problem is...
>
> 			regards, tom lane

Well, thanks to Tom, I know what was wrong, and I have found the problem, or one of them at least...

FATAL: s_lock(0c9ef824) at bufmgr.c:1106, stuck spinlock. Aborting.

Okee, that segment of code is, well, it's some deep-down internals that are as clear as mud to me. Anyone in the know have an idea what this does? Just to save you looking, it is included below.

One question: does PostgreSQL Inc have a 'normal person' support level? I ask that cos I was planning on getting some of the commercial support, and whilst it is a reasonable price to pay for corporations or people with truckloads of money, I am a humble developer with more expenses than income, and $600 is just way out of my league {:-( If not, fair enough, just thought I'd ask, cos the support I have had from this list is excellent and I wanted to provide some payback to the development group.

~Michael

/*
 * WaitIO -- Block until the IO_IN_PROGRESS flag on 'buf'
 *	is cleared.  Because IO_IN_PROGRESS conflicts are
 *	expected to be rare, there is only one BufferIO
 *	lock in the entire system.  All processes block
 *	on this semaphore when they try to use a buffer
 *	that someone else is faulting in.  Whenever a
 *	process finishes an IO and someone is waiting for
 *	the buffer, BufferIO is signaled (SignalIO).  All
 *	waiting processes then wake up and check to see
 *	if their buffer is now ready.  This implementation
 *	is simple, but efficient enough if WaitIO is
 *	rarely called by multiple processes simultaneously.
 *
 *	ProcSleep atomically releases the spinlock and goes to
 *	sleep.
 *
 *	Note: there is an easy fix if the queue becomes long.
 *	save the id of the buffer we are waiting for in
 *	the queue structure.  That way signal can figure
 *	out which proc to wake up.
 */
#ifdef HAS_TEST_AND_SET
static void
WaitIO(BufferDesc *buf, SPINLOCK spinlock)
{
	SpinRelease(spinlock);
	S_LOCK(&(buf->io_in_progress_lock));
	S_UNLOCK(&(buf->io_in_progress_lock));
	SpinAcquire(spinlock);
}
Michael Simms <grim@argh.demon.co.uk> writes:
> Well, thanks to Tom, I know what was wrong, and I have found the problem,
> or one of them at least...
> FATAL: s_lock(0c9ef824) at bufmgr.c:1106, stuck spinlock. Aborting.
> Okee, that segment of code is, well, it's some deep-down internals that
> are as clear as mud to me.

Hmph. Apparently, some backend was waiting for some other backend to finish reading a page in or writing it out, and gave up after deciding it had waited an unreasonable amount of time (~ 1 minute, which does seem plenty long enough). Probably, the I/O did in fact finish, but the waiting backend didn't get the word for some reason.

Is it possible that there's something wrong with the spinlock code on your hardware? There are a bunch of different spinlock implementations (assembly code for various hardware) in include/storage/s_lock.h and backend/storage/buffer/s_lock.c. Some of 'em might not be as well tested as others. But you're on PC hardware, right? I would've thought that flavor of the code would be pretty well wrung out.

Another likely explanation is that there's something wrong in bufmgr.c's logic for setting and releasing the io_in_progress lock --- but a quick look doesn't show any obvious error, and I would have thought we'd have found out about any such problem long since. Since we're not being buried in reports of stuck-spinlock errors, I'm guessing there is some platform-specific problem on your machine. No good ideas what it is if it isn't a spinlock failure.

(Finally, are you sure this is the *only* indication of trouble in the logs? If a backend crashed while holding the spinlock, the other ones would eventually die with complaints like this, but that wouldn't make the spinlock code be at fault...)

			regards, tom lane
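For readers unfamiliar with the backend's locking code, the "stuck spinlock" abort Tom mentions comes from a bounded retry loop, not from any real deadlock detection. Below is a simplified, self-contained sketch of that pattern; it is illustrative only -- the real implementation is the platform-specific code in include/storage/s_lock.h and backend/storage/buffer/s_lock.c, and the constants, the tas() placeholder, and the name s_lock_sketch() here are assumptions, not the actual source.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

typedef volatile int slock_t;

#define DELAY_USEC      10000                  /* 10 ms between retries (illustrative) */
#define TIMEOUT_USEC    (60 * 1000000)         /* give up after roughly a minute       */

/* Placeholder test-and-set: returns the previous lock value.  The real
 * backend uses an atomic hardware instruction (e.g. xchgb on x86). */
static int
tas(slock_t *lock)
{
	int		old = *lock;

	*lock = 1;
	return old;
}

void
s_lock_sketch(slock_t *lock, const char *file, int line)
{
	long	waited = 0;

	while (tas(lock))                          /* nonzero => someone else holds it */
	{
		usleep(DELAY_USEC);                    /* back off briefly, then retry */
		waited += DELAY_USEC;

		if (waited > TIMEOUT_USEC)
		{
			/* This is where the message Michael saw comes from. */
			fprintf(stderr, "FATAL: s_lock(%p) at %s:%d, stuck spinlock. Aborting.\n",
					(void *) lock, file, line);
			abort();
		}
	}
	/* lock acquired; the holder releases it by storing 0 again */
}

The point is simply that if whoever holds the lock never releases it, every waiter eventually reaches the abort() path, which is exactly the FATAL line in Michael's log.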
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Tom Lane
> Sent: Friday, September 24, 1999 11:27 PM
> To: Michael Simms
> Cc: pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] Frustration
>
> Michael Simms <grim@argh.demon.co.uk> writes:
> > Well, thanks to Tom, I know what was wrong, and I have found the problem,
> > or one of them at least...
> > FATAL: s_lock(0c9ef824) at bufmgr.c:1106, stuck spinlock. Aborting.
> > Okee, that segment of code is, well, it's some deep-down internals that
> > are as clear as mud to me.
>
> Hmph. Apparently, some backend was waiting for some other backend to
> finish reading a page in or writing it out, and gave up after deciding
> it had waited an unreasonable amount of time (~ 1 minute, which does
> seem plenty long enough). Probably, the I/O did in fact finish, but
> the waiting backend didn't get the word for some reason.
>
> [snip]
>
> Another likely explanation is that there's something wrong in
> bufmgr.c's logic for setting and releasing the io_in_progress lock ---
> but a quick look doesn't show any obvious error, and I would have
> thought we'd have found out about any such problem long since.
> Since we're not being buried in reports of stuck-spinlock errors,
> I'm guessing there is some platform-specific problem on your machine.
> No good ideas what it is if it isn't a spinlock failure.
>

Different from other spinlocks, the io_in_progress spinlock is a per-bufpage spinlock, and ProcReleaseSpins() doesn't release it. If an error (in md.c in most cases) occurred while holding the spinlock, the spinlock would necessarily freeze.

Michael Simms says

	ERROR: cannot read block 641 of server

occurred before the stuck-spinlock abort. Probably it is the original cause of the spinlock freeze.

However, I don't understand the following status of his machine.

Filesystem           1k-blocks       Used  Available Use% Mounted on
/dev/hda3              1109780     704964     347461  67% /
/dev/hda1                33149       6140      25297  20% /boot
/dev/hdc1              9515145    3248272    5773207  36% /home
/dev/hdb1               402852     154144     227903  40% /tmp
/dev/sda1 30356106785018642307 43892061535609608 0 100% /var/lib/pgsql

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
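To make the failure mode Hiroshi describes concrete, here is a small toy model in plain C -- not PostgreSQL code. The names elog_error(), start_buffer_io() and proc_release_spins() are made-up stand-ins for elog(ERROR), the bufmgr I/O start path, and ProcReleaseSpins(). An "error" longjmps out while the per-buffer lock is held, the recovery path releases only the locks it knows about, and a later waiter on the forgotten lock spins until it gives up:

#include <setjmp.h>
#include <stdio.h>

static jmp_buf error_context;

static volatile int bufmgr_lock = 0;          /* released on error (known to recovery)  */
static volatile int io_in_progress_lock = 0;  /* per-buffer lock, unknown to recovery   */

static void
elog_error(const char *msg)
{
	fprintf(stderr, "ERROR:  %s\n", msg);
	longjmp(error_context, 1);                /* mimic elog(ERROR)'s longjmp */
}

static void
start_buffer_io(void)
{
	bufmgr_lock = 1;                          /* like SpinAcquire(BufMgrLock)            */
	io_in_progress_lock = 1;                  /* like S_LOCK(&buf->io_in_progress_lock)  */
	bufmgr_lock = 0;                          /* like SpinRelease(BufMgrLock)            */

	/* the disk read fails, as in Michael's log */
	elog_error("cannot read block 641 of server");

	io_in_progress_lock = 0;                  /* never reached */
}

static void
proc_release_spins(void)
{
	/* releases the spinlocks it knows about -- but not the per-buffer one */
	bufmgr_lock = 0;
}

int
main(void)
{
	if (setjmp(error_context) == 0)
		start_buffer_io();
	else
		proc_release_spins();                 /* error-recovery path */

	/* A second backend now needs the same buffer and waits for the I/O: */
	{
		long	tries = 0;

		while (io_in_progress_lock && ++tries < 60000000L)
			;                                 /* spins "for about a minute"... */

		if (io_in_progress_lock)
			fprintf(stderr, "FATAL: stuck spinlock -- io_in_progress was never released\n");
	}
	return 0;
}

Compiled and run, it prints the ERROR line followed by the stuck-spinlock FATAL -- the same two-step pattern seen in Michael's log.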
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > Different from other spinlocks,io_in_progress spinlock is a per bufpage > spinlock and ProcReleaseSpins() doesn't release the spinlock. > If an error(in md.c in most cases) occured while holding the spinlock > ,the spinlock would necessarily freeze. Oooh, good point. Shouldn't this be fixed? If we don't fix it, then a disk I/O error will translate to an installation-wide shutdown and restart as soon as some backend tries to touch the locked page (as indeed was happening to Michael). That seems a tad extreme. > Michael Simms says > ERROR: cannot read block 641 of server > occured before the spinlock stuck abort. > Probably it is an original cause of the spinlock freeze. I seem to have missed the message containing that bit of info, but it certainly suggests that your diagnosis is correct. > However I don't understand the following status of his machine. > /dev/sda1 30356106785018642307 43892061535609608 0 100% Now that we know the root problem was disk driver flakiness, I think we can write that off as Not Our Fault ;-) regards, tom lane
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Monday, September 27, 1999 10:20 PM
> To: Hiroshi Inoue
> Cc: Michael Simms; pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] Frustration
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > Different from other spinlocks, the io_in_progress spinlock is a per-bufpage
> > spinlock, and ProcReleaseSpins() doesn't release it.
> > If an error (in md.c in most cases) occurred while holding the spinlock,
> > the spinlock would necessarily freeze.
>
> Oooh, good point. Shouldn't this be fixed? If we don't fix it, then

Yes, it's on TODO:

  * spinlock stuck problem when elog(FATAL) and elog(ERROR) inside bufmgr

I would try to fix it.

> a disk I/O error will translate to an installation-wide shutdown and
> restart as soon as some backend tries to touch the locked page (as
> indeed was happening to Michael). That seems a tad extreme.
>

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
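For the curious, one plausible shape for such a fix is sketched below. This is a guess at the approach, not the actual patch: the names StartBufferIO(), TerminateBufferIO(), AbortBufferIO() and InProgressBuf are assumptions for illustration, as is the header included. The idea is to remember which buffer's I/O this backend started, and to have the elog(ERROR)/elog(FATAL) cleanup path release that buffer's io_in_progress lock, next to where ProcReleaseSpins() already runs:

/* Hedged sketch only -- names and placement are assumptions. */
#include "storage/buf_internals.h"   /* BufferDesc, S_LOCK/S_UNLOCK, flags (assumed) */

static BufferDesc *InProgressBuf = NULL;    /* buffer whose I/O we started */

static void
StartBufferIO(BufferDesc *buf)
{
	S_LOCK(&(buf->io_in_progress_lock));
	InProgressBuf = buf;                    /* remember it before smgrread/smgrwrite */
}

static void
TerminateBufferIO(BufferDesc *buf)
{
	InProgressBuf = NULL;
	S_UNLOCK(&(buf->io_in_progress_lock));  /* normal completion: wake up WaitIO()ers */
}

/* Called from the error-cleanup path, alongside ProcReleaseSpins(), so a
 * failed read/write no longer leaves other backends stuck in WaitIO(). */
void
AbortBufferIO(void)
{
	if (InProgressBuf != NULL)
	{
		BufferDesc *buf = InProgressBuf;

		InProgressBuf = NULL;
		buf->flags &= ~BM_IO_IN_PROGRESS;   /* let a waiter retry the I/O itself */
		S_UNLOCK(&(buf->io_in_progress_lock));
	}
}

With something of this shape in the cleanup path, a failed smgr read or write would leave the buffer's I/O flag cleared and its lock free, so a backend sitting in WaitIO() could retry the I/O (or report its own ERROR) instead of dying with a stuck spinlock and forcing an installation-wide restart.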