Thread: Creation of 10000's of empty table segments and more...
============================================================================
                        POSTGRESQL BUG REPORT TEMPLATE
============================================================================

Your name		: Philip Poles
Your email address	: philip@surfen.com

System Configuration
---------------------
  Architecture (example: Intel Pentium)		: dual intel pIII 733
  Operating System (example: Linux 2.0.26 ELF)	: RH 6.2 (Linux 2.2.14-5.0smp)
  PostgreSQL version (example: PostgreSQL-7.0)	: PostgreSQL-7.0.0
  Compiler used (example: gcc 2.8.0)		: gcc 2.91.66

Please enter a FULL description of your problem:
------------------------------------------------
This problem has been reported before for earlier versions of postgres, and
I checked the archives (which are incomplete).  The last report I have of
this problem in my personal archives is from Don Baccus on April 22 of this
year, although that was for version 6.5.3.

Basically, the backend has created a bunch of empty files of the name
<table_name>.<n>, ~32500 for one table, ~50000 for another, ~44000 for a
third, and ~250 for a fourth.  From reading the old thread on this, I
suspect it's being caused by the nightly vacuum we run, and is due to a
corrupted index.

pg_dump now also fails for this database, producing the message:

pqWait() -- connection not open
PQendcopy: resetting connection
SQL query to dump the contents of Table 'respondent_surveys' did not
execute correctly.  After we read all the table contents from the backend,
PQendcopy() failed.  Explanation from backend: 'The Data Base System is in
recovery mode '.
The query was: 'COPY "respondent_surveys" TO stdout;'

respondent_surveys is the table with ~250 empty segments.

Furthermore (and I'm not sure if this is a related issue), the nightly
vacuum analyze now crashes with the message:

NOTICE:  --Relation selected_answer_counts--
NOTICE:  Rel selected_answer_counts: TID 210/347: OID IS INVALID. TUPGONE 0.
NOTICE:  Pages 213: Changed 0, reaped 6, Empty 1, New 0; Tup 132526:
Vac 430, Keep/VTL 0/0, Crash 1, UnUsed 376, MinLen 48, MaxLen 48;
Re-using: Free/Avail. Space 83300/53448; EndEmpty/Avail. Pages 0/5.
CPU 0.05s/0.34u sec.
NOTICE:  Index selected_answer_counts_pkey: Pages 463; Tuples 132526:
Deleted 1. CPU 0.06s/0.13u sec.
ERROR:  No one parent tuple was found
vacuumdb: vacuum failed

The selected_answer_counts table is, oddly, NOT one of the tables with many
empty segments.

Also, during the day before the dump/vacuum began to fail, the backend was
resetting itself every few minutes with the message:

Server process (pid 25155) exited with status 11 at Fri Aug 11 11:47:47 2000
Terminating any active server processes...
NOTICE:  Message from PostgreSQL backend:
	The Postmaster has informed me that some other backend died
	abnormally and possibly corrupted shared memory.
	I have rolled back the current transaction and am going to
	terminate your database system connection and exit.
	Please reconnect to the database system and repeat your query.
etc...

I'm not sure what status 11 means.

This database was no longer usable and had to be restored from a backup,
but I kept it around under a different name if that will help at all.

Other possibly relevant info: the server that postgres is running on has
512 MB of RAM, and postgres is started with the following switches:

pg_ctl -o "-i -N 256 -B 512 -o \"-S 4096\" >>postgres.log 2>&1 </dev/null" start

I'm not sure if all of these problems are directly related, or if I'm
looking at two or more possible bugs.
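(A rough annotation of the startup switches quoted above, added for
reference; the switch meanings are the usual 7.0-era postmaster/postgres
options, and the buffer-size arithmetic assumes the non-default 32 kB
BLCKSZ mentioned later in this thread:)

    # Annotated form of the startup command (sketch only):
    #   -i             accept TCP/IP connections as well as Unix-socket ones
    #   -N 256         allow up to 256 concurrent backend processes
    #   -B 512         512 shared buffers; at a 32 kB BLCKSZ that is 16 MB
    #   -o "-S 4096"   per-backend option: 4096 kB of sort memory
    pg_ctl -o "-i -N 256 -B 512 -o \"-S 4096\" >>postgres.log 2>&1 </dev/null" start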
Please describe a way to repeat the problem.  Please try to provide a
concise reproducible example, if at all possible:
----------------------------------------------------------------------
I am unsure of how to repeat this problem...although I do have a database
which can cause the problem stored locally.  I do know that nothing unusual
was going on at the time - i.e. the same select/insert/update/deletes that
have been running trouble-free for the past 2 months were the only
transactions taking place at the time.

If you know how this problem might be fixed, list the solution below:
----------------------------------------------------------------------
"Philip Poles" <philip@surfen.com> writes: > Basically, the backend has created a bunch of empty files of the name > <table_name>.<n>, ~32500 for one table, ~50000 for another, ~44000 for > a third, and ~250 for a fourth. From reading the old thread on this, > I suspect it's being caused by the nightly vacuum we run, and is due > to a corrupted index. Probably so. The proximate cause of that behavior was that if the low-level file access code (md.c) was handed a ridiculously large intra-file block number, it would try to access the file segment containing that block number --- and merrily create all the intervening segments, though not populate them with any data. So a bad block number is being injected from somewhere, and a corrupted index is the most likely source. In current sources, md.c will barf promptly if handed a block number more than one segment beyond the current EOF, so that sorcerer's-apprentice behavior should be fixed. The more interesting question is whether the original cause of the index corruption has been fixed. (I cleaned up some problems in the btree index code not long ago, but have no way to tell if this is related.) I don't suppose you have a way of reproducing the problem from a cold start? > Also, during the day before the dump/vacuum began to fail, the backend was > resetting itself every few minutes with the message: > Server process (pid 25155) exited with status 11 at Fri Aug 11 11:47:47 2000 > Terminating any active server processes... > I'm not sure what status 11 means. SEGV crash. There should have been a core dump from that --- is there a core file in the old database directory, and if so can you get a backtrace from it? regards, tom lane
Greetings...

Tom, I don't have a way to reproduce this from scratch, unfortunately.  The
problem occurred on a production server which had been running trouble-free
for the past two months.

Also, there are no core files lying around anywhere, which is somewhat
surprising.  Does it make any difference that I compiled with a BLCKSZ of
32768 and a NAMEDATALEN of 64?

All I do have is the intact contents of the database directory of the
problem database.  Is there any way to move this to another installation so
that I can have a look at it, and maybe get you a core dump or at least a
detailed log on another machine?  pg_dump obviously won't do the trick
here.  The tarball of the entire directory (including the thousands of
empty files) is 38.5M, so I don't know if you'd be interested in looking at
that.

Also, would upgrading to 7.0.2 help, or are the latest fixes to the btree
code and md.c in current sources only?

Is there anything else I can do to help (other than actually digging into
the code, of course)?

Thank-you,
-Philip
> Also, there are no core files lying around anywhere, which is somewhat
> surprising.

Hm.  On some platforms, processes started from system startup scripts run
with "ulimit coredumpsize" set to 0, which prevents coredumps.  You may
need to start the postmaster manually with a normal ulimit before you will
get a corefile.

> Does it make any difference that I compiled with a BLCKSZ of 32768 and
> NAMEDATALEN of 64?

In theory, no ...

> All I do have is the intact contents of the database directory of the
> problem database.  Is there any way to move this to another
> installation so that I can have a look at it, and maybe get you a core
> dump or at least a detailed log on another machine?

If you have the entire $PGDATA directory, you can copy it to another
identically-configured machine.  Or you can leave it where it is and start
up a test postmaster in it (just specify -D pointing at the data dir, and
select a port number other than the standard 5432 with -p).  If you do
that, just remember to specify the special port number when connecting with
psql (use -p, or set environment variable PGPORT).

A coredump backtrace should help, especially if you recompile with -g
first.

> Also, would upgrading to 7.0.2 help,

Possibly, but it's hard to tell until we know more.  I'd suggest leaving
your installation alone until we have more info.

			regards, tom lane
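(One way the above procedure might look in practice; the data directory
path, port number, database name, and binary paths below are placeholders,
not values taken from this thread:)

    # Sketch only -- adjust paths, port, and database name to the installation.
    ulimit -c unlimited                         # allow core dumps from this shell
    postmaster -D /home/pg/olddata -p 5433 \
        >postmaster-test.log 2>&1 &             # test postmaster on a non-default port
    psql -p 5433 old_database                   # connect to the test instance

    # If a backend crashes again, grab a backtrace from the core file left in
    # that database's subdirectory (a binary rebuilt with -g gives better symbols):
    gdb /usr/local/pgsql/bin/postgres /home/pg/olddata/base/old_database/core
    # (gdb) bt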