Thread: fatal error in database
Howdy:
Running PostgreSQL 7.2.1 on RedHat Linux 7.2.
I'm having trouble identifying the cause of the following errors:
[snip]
test=> select count (*) from t_testob;
FATAL 2: open of /raid/pgsql/data/pg_clog/0373 failed: No such file or directory
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: NOTICE: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
Failed.
!>
[/snip]
I have a log file that captures some errors (my debug level is 2)
and it says this:
[snip]
Nov 27 12:51:58 hmp2 postgres[9715]: [4] FATAL 2: open of /raid/pgsql/data/pg_clog/0373 failed: No such file or directory
Nov 27 12:51:58 hmp2 postgres[9715]: [4] FATAL 2: open of /raid/pgsql/data/pg_clog/0373 failed: No such file or directory
Nov 27 12:51:58 hmp2 postgres[9484]: [4] DEBUG: server process (pid 9715) exited with exit code 2
Nov 27 12:51:58 hmp2 postgres[9484]: [5] DEBUG: terminating any other active server processes
Nov 27 12:51:59 hmp2 postgres[9716]: [4-1] NOTICE: Message from PostgreSQL backend:
Nov 27 12:51:59 hmp2 postgres[9716]: [4-2] ^IThe Postmaster has informed me that some other backend
Nov 27 12:51:59 hmp2 postgres[9716]: [4-3] ^Idied abnormally and possibly corrupted shared memory.
Nov 27 12:51:59 hmp2 postgres[9716]: [4-4] ^II have rolled back the current transaction and am
Nov 27 12:51:59 hmp2 postgres[9716]: [4-5] ^Igoing to terminate your database system connection and exit.
Nov 27 12:51:59 hmp2 postgres[9716]: [4-6] ^IPlease reconnect to the database system and repeat your query.
Nov 27 12:51:59 hmp2 postgres[9484]: [6] DEBUG: all server processes terminated; reinitializing shared memory and semaphores
Nov 27 12:51:59 hmp2 postgres[9717]: [7] DEBUG: database system was interrupted at 2002-11-27 12:39:30 EST
Nov 27 12:51:59 hmp2 postgres[9717]: [8] DEBUG: checkpoint record is at 8/21BFC274
Nov 27 12:51:59 hmp2 postgres[9717]: [9] DEBUG: redo record is at 8/21BFC274; undo record is at 0/0; shutdown FALSE
Nov 27 12:51:59 hmp2 postgres[9717]: [10] DEBUG: next transaction id: 15999894; next oid: 138653530
Nov 27 12:51:59 hmp2 postgres[9717]: [11] DEBUG: database system was not properly shut down; automatic recovery in progress
Nov 27 12:51:59 hmp2 postgres[9717]: [12] DEBUG: ReadRecord: record with zero length at 8/21BFC2B4
Nov 27 12:51:59 hmp2 postgres[9717]: [13] DEBUG: redo is not required
Nov 27 12:52:01 hmp2 postgres[9717]: [14] DEBUG: database system is ready
[/snip]
When it says 'corrupted shared memory', I'm *hoping* this has nothing to do
with the physical memory itself.
Can someone tell me how I can stress test PostgreSQL so that I
can find out what the error is really referring to?
Thanks!
-X
"Johnson, Shaunn" <SJohnson6@bcbsm.com> writes: > Running PostgreSQL 7.2.1 on RedHat Linux 7.2. > I'm having a problem trying to identify some of the causes > for the following errors: > [snip] > test=> select count (*) from t_testob; > FATAL 2: open of /raid/pgsql/data/pg_clog/0373 failed: No such file or > directory > server closed the connection unexpectedly This probably means corrupted data in your t_testob table: the system is trying to determine the commit status of a bogus transaction number (I'm assuming that the file names present in pg_clog/ are nowhere near 0373?). This is frequently the first visible failure when trying to read a completely trashed disk page. As to how the page got trashed, it could have been Postgres' fault, but I'm inclined to suspect a disk-hardware or kernel mistake instead. Are you up2date on your kernel version? You can probably find the broken page by looking through t_testob with a tool like pg_filedump (should be available from http://sources.redhat.com/rhdb --- would give you an exact URL except I can't seem to reach that site right now). Look for pages that have page header or item header fields obviously different from the rest; you don't usually have to know much about what you're reading to spot the one that ain't like the others. If the page seems to be completely trashed, which is the usual situation in the cases I've looked at personally, your best bet is to zero it out; this loses the rows that were on that page but makes the rest of the table usable again. You can do that with something like dd bs=8K seek=<target page number> count=1 if=/dev/zero of=<target file> while the postmaster is shut down. (I strongly advise making a backup copy of the target file first, in case you make a mistake ...) BTW, you should definitely upgrade to 7.2.3. There are serious known bugs in 7.2.1 (that's why we put out update releases). regards, tom lane