Thread: the un-vacuumable table
I found this error message in my log files repeatedly:

Error: failed to re-find parent key in "ledgerdetail_2008_03_idx2" for deletion target page 64767

I thought "hmm, that index looks broken. I'd better re-create it." So, I dropped the index and then tried to create a new one to replace it. Which completely locked up the backend that was running the CREATE INDEX. I ran truss against the backend in question and it didn't register anything (except signals 2 and 15 when I tried to cancel the query and kill the backend respectively). I eventually had to restart the database to get the CREATE INDEX process to go away (well, to release that big nasty lock).

I then tried to do a VACUUM FULL, but it didn't complete after more than 2 hours. I cancelled that and tried a CLUSTER in the hopes that it might work a little faster. It did the exact same thing as the CREATE INDEX command: completely locked up the backend.

So, I'm wondering what, if anything, I can do to get that table cleaned up.

Running 8.1.11 on FreeBSD 6.2.

Anyway, the current plan is to drop the table and reload it from backup. I'm posting here in case there's interest in gathering some forensic data, a clever suggestion about how I can recover this situation, or even some ideas about what's causing it.

Andrew
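One quick check from a second session is whether the stuck backend is waiting on a heavyweight lock (which would also explain other sessions piling up behind it). A minimal sketch, using the database name that appears later in the thread and 8.1-era catalog columns; everything else here is illustrative rather than what was actually run:

    # Which lock requests are not granted, and on what relation?
    psql -d adecndb -c "SELECT pid, relation::regclass, mode, granted
                          FROM pg_locks
                         WHERE NOT granted;"

    # Cross-reference with the running statements (8.1 uses procpid/current_query;
    # current_query is only populated if stats_command_string is on).
    psql -d adecndb -c "SELECT procpid, query_start, current_query
                          FROM pg_stat_activity;"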
Andrew Hammond wrote:
> I found this error message in my log files repeatedly:
>
> Error: failed to re-find parent key in "ledgerdetail_2008_03_idx2" for deletion target page 64767
>
> I thought "hmm, that index looks broken. I'd better re-create it." So, I dropped the index and then tried to create a new one to replace it. Which completely locked up the backend that was running the CREATE INDEX. I ran truss against the backend in question and it didn't register anything (except signals 2 and 15 when I tried to cancel the query and kill the backend respectively). I eventually had to restart the database to get the CREATE INDEX process to go away (well, to release that big nasty lock).

What kind of an index is it? Does "SELECT COUNT(*) from <table>" work?

> Anyway, the current plan is to drop the table and reload it from backup. I'm posting here in case there's interest in gathering some forensic data, a clever suggestion about how I can recover this situation, or even some ideas about what's causing it.

Yes, please take a filesystem-level backup right away to retain the evidence.

Could you connect to the hung backend with gdb and get a stacktrace?

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
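For reference, a minimal sketch of how that stacktrace and the filesystem-level copy might be taken on FreeBSD. The PID and the destination path are placeholders, the postgres binary path is an assumption, and PGDATA is the path that shows up later in the thread:

    # Attach gdb to the stuck backend (12345 is a placeholder PID), dump a
    # backtrace, then detach so the process keeps running.
    gdb /usr/local/pgsql/bin/postgres 12345
    (gdb) bt
    (gdb) detach
    (gdb) quit

    # Filesystem-level copy of the cluster for later forensics; stop the
    # postmaster first if an outage is acceptable, otherwise treat the copy
    # as fuzzy.
    tar czf /backup/adecndb-forensic.tgz /var/db/adecn/adecndb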
On Wed, Jun 25, 2008 at 2:58 AM, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> What kind of an index is it? Does "SELECT COUNT(*) from <table>" work?

After the restart I did a count(*) and it worked. A little under 13m rows. So, sequential scans seem to work.

> Yes, please take a filesystem-level backup right away to retain the evidence.

Well, I've already burned our downtime allowance for this month, but we do a regular PITR type backup which hopefully will be sufficient to replicate the problem.

> Could you connect to the hung backend with gdb and get a stacktrace?

The backend is no longer hung (two restarts later). I'll try to reproduce this problem on my workstation (same binary, same OS, libraries etc) using the PITR dump.

Andrew
On Wed, Jun 25, 2008 at 9:57 AM, Andrew Hammond <andrew.george.hammond@gmail.com> wrote:
> (snip the quoted exchange above)

I tried to restore the PITR backup and it failed.
Jun 27 15:54:30 qadb2 postgres[92517]: [1-1] DEBUG: postmaster: PostmasterMain: initial environ dump:
Jun 27 15:54:30 qadb2 postgres[92517]: [2-1] DEBUG: -----------------------------------------
Jun 27 15:54:30 qadb2 postgres[92517]: [3-1] DEBUG: USER=pgsql
Jun 27 15:54:30 qadb2 postgres[92517]: [4-1] DEBUG: MAIL=/var/mail/pgsql
Jun 27 15:54:30 qadb2 postgres[92517]: [5-1] DEBUG: LD_LIBRARY_PATH=:/usr/local/adecn/lib
Jun 27 15:54:30 qadb2 postgres[92517]: [6-1] DEBUG: HOME=/usr/local/pgsql
Jun 27 15:54:30 qadb2 postgres[92517]: [7-1] DEBUG: PGLIB=/usr/local/lib
Jun 27 15:54:30 qadb2 postgres[92517]: [8-1] DEBUG: PS1=[QA2] \u@\h:\w\$
Jun 27 15:54:30 qadb2 postgres[92517]: [9-1] DEBUG: BLOCKSIZE=K
Jun 27 15:54:30 qadb2 postgres[92517]: [10-1] DEBUG: TERM=xterm
Jun 27 15:54:30 qadb2 postgres[92517]: [11-1] DEBUG: PGSYSCONFDIR=/usr/local/etc/postgresql
Jun 27 15:54:30 qadb2 postgres[92517]: [12-1] DEBUG: PGLOCALEDIR=/usr/local/share/locale
Jun 27 15:54:30 qadb2 postgres[92517]: [13-1] DEBUG:
Jun 27 15:54:30 qadb2 postgres[92517]: [13-2] PATH=/usr/local/adecn/bin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/games:/usr/local/sbin:/usr/local/bin:/usr/X11R6/bin
Jun 27 15:54:30 qadb2 postgres[92517]: [13-3] :/usr/local/pgsql/bin
Jun 27 15:54:30 qadb2 postgres[92517]: [14-1] DEBUG: ADECN_HOME=/usr/local/adecn
Jun 27 15:54:30 qadb2 postgres[92517]: [15-1] DEBUG: SHELL=/bin/sh
Jun 27 15:54:30 qadb2 postgres[92517]: [16-1] DEBUG:
Jun 27 15:54:30 qadb2 postgres[92517]: [16-2] CLASSPATH=
(deleted a bunch of lines)
Jun 27 15:54:30 qadb2 postgres[92517]: [17-1] DEBUG: PYTHONPATH=/usr/local/adecn/python:/usr/local/adecn/lib/python:/usr/local/adecn/api/client/python
Jun 27 15:54:30 qadb2 postgres[92517]: [18-1] DEBUG: FTP_PASSIVE_MODE=YES
Jun 27 15:54:30 qadb2 postgres[92517]: [19-1] DEBUG: PGDATA=/var/db/adecn/adecndb
Jun 27 15:54:30 qadb2 postgres[92517]: [20-1] DEBUG: LC_COLLATE=C
Jun 27 15:54:30 qadb2 postgres[92517]: [21-1] DEBUG: LC_CTYPE=C
Jun 27 15:54:30 qadb2 postgres[92517]: [22-1] DEBUG: LC_MESSAGES=C
Jun 27 15:54:30 qadb2 postgres[92517]: [23-1] DEBUG: LC_MONETARY=C
Jun 27 15:54:30 qadb2 postgres[92517]: [24-1] DEBUG: LC_NUMERIC=C
Jun 27 15:54:30 qadb2 postgres[92517]: [25-1] DEBUG: LC_TIME=C
Jun 27 15:54:30 qadb2 postgres[92517]: [26-1] DEBUG: -----------------------------------------
Jun 27 15:54:30 qadb2 postgres[92518]: [27-1] DEBUG: invoking IpcMemoryCreate(size=92938240)
Jun 27 15:54:30 qadb2 postgres[92518]: [28-1] DEBUG: max_safe_fds = 983, usable_fds = 1000, already_open = 7
Jun 27 15:54:30 qadb2 postgres[92519]: [29-1] LOG: database system was interrupted at 2008-06-25 03:01:02 PDT
Jun 27 15:54:30 qadb2 postgres[92519]: [30-1] LOG: starting archive recovery
Jun 27 15:54:30 qadb2 postgres[92519]: [31-1] LOG: restore_command = "cp -p /usr/tmp/2008-06-25_wals/%f %p"
Jun 27 15:54:30 qadb2 postgres[92519]: [32-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001.history pg_xlog/RECOVERYHISTORY"
Jun 27 15:54:30 qadb2 postgres[92519]: [33-1] DEBUG: could not restore file "00000001.history" from archive: return code 256
Jun 27 15:54:30 qadb2 postgres[92519]: [34-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001000001D600000078.0055F0B8.backup
Jun 27 15:54:30 qadb2 postgres[92519]: [34-2] pg_xlog/RECOVERYHISTORY"
Jun 27 15:54:30 qadb2 postgres[92519]: [35-1] LOG: restored log file "00000001000001D600000078.0055F0B8.backup" from archive
Jun 27 15:54:30 qadb2 postgres[92519]: [36-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001000001D600000078 pg_xlog/RECOVERYXLOG"
Jun 27 15:54:31 qadb2 postgres[92519]: [37-1] LOG: restored log file "00000001000001D600000078" from archive
Jun 27 15:54:31 qadb2 postgres[92519]: [38-1] LOG: checkpoint record is at 1D6/7855F0B8
Jun 27 15:54:31 qadb2 postgres[92519]: [39-1] LOG: redo record is at 1D6/7855F0B8; undo record is at 0/0; shutdown FALSE
Jun 27 15:54:31 qadb2 postgres[92519]: [40-1] LOG: next transaction ID: 397171279; next OID: 679516596
Jun 27 15:54:31 qadb2 postgres[92519]: [41-1] LOG: next MultiXactId: 857318; next MultiXactOffset: 1718141
Jun 27 15:54:31 qadb2 postgres[92519]: [42-1] LOG: automatic recovery in progress
Jun 27 15:54:31 qadb2 postgres[92519]: [43-1] LOG: redo starts at 1D6/7855F108

(I thought this line was interesting)
Jun 27 15:54:31 qadb2 postgres[92519]: [44-1] PANIC: could not open relation 1663/16386/679439393: No such file or directory

Jun 27 15:54:31 qadb2 postgres[92518]: [29-1] DEBUG: forked new backend, pid=92526 socket=8
Jun 27 15:54:31 qadb2 postgres[92526]: [29-1] LOG: connection received: host=[local]
Jun 27 15:54:31 qadb2 postgres[92526]: [30-1] FATAL: the database system is starting up
Jun 27 15:54:31 qadb2 postgres[92526]: [31-1] DEBUG: proc_exit(0)
Jun 27 15:54:31 qadb2 postgres[92526]: [32-1] DEBUG: shmem_exit(0)
Jun 27 15:54:31 qadb2 postgres[92526]: [33-1] DEBUG: exit(0)
Jun 27 15:54:31 qadb2 postgres[92518]: [30-1] DEBUG: reaping dead processes
Jun 27 15:54:31 qadb2 postgres[92518]: [31-1] DEBUG: server process (PID 92526) exited with exit code 0
Jun 27 15:54:31 qadb2 postgres[92518]: [32-1] DEBUG: reaping dead processes
Jun 27 15:54:31 qadb2 postgres[92518]: [33-1] LOG: startup process (PID 92519) was terminated by signal 6
Jun 27 15:54:31 qadb2 postgres[92518]: [34-1] LOG: aborting startup due to startup process failure
Jun 27 15:54:31 qadb2 postgres[92518]: [35-1] DEBUG: proc_exit(1)
Jun 27 15:54:31 qadb2 postgres[92518]: [36-1] DEBUG: shmem_exit(1)
Jun 27 15:54:31 qadb2 postgres[92518]: [37-1] DEBUG: exit(1)

I googled to find out what the numbers 1663/16386/679439393 from the PANIC message mean, but no luck. The last looks like an OID.

I also hunted through our production log files and found the following interesting error messages.

(On Thursday night)
vacuumdb: vacuuming of database "adecndb" failed: ERROR: could not write block 209610 of relation 1663/16386/236356665: No space left on device
CONTEXT: writing block 209610 of relation 1663/16386/236356665

(Friday night, and every night until the incident on Tuesday).
vacuumdb: vacuuming of database "adecndb" failed: ERROR: failed to re-find parent key in "ledgerdetail_2008_03_idx2" for deletion target page 64767

Now, the first message is very strange since we have monitoring on the file system used by the database and it's been hovering at about 18% space used for the last month. So I can't figure out why we'd get "No space left on device", assuming the device is actually the disk (which seems reasonable given the context) and not shared memory.

I also checked our bug system for history and discovered we'd seen similar problems in the past when connecting to a SAN via a FC switch. Directly attaching the server to the SAN appeared to solve the problem. So currently I'm wondering if this might be a hardware / os / some-other-part-of-the-storage-stack issue manifesting as trashing the database in some way?

Interestingly, since I dropped the ledgerdetail_2008_03_idx2 index, the nightly vacuum has been completing successfully.
I'd also like advice on whether I need to schedule another outage and do a pg_dump / initdb / load (well, more accurately, do a slony move to the backup box and rebuild the old origin). Or perhaps I should ask, how urgently.

Andrew
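For context, a minimal sketch of the recovery.conf that drives a restore like the one in the log above. The restore_command is the one shown in the log and the PGDATA path comes from the environ dump; the commented recovery_target_time is only a hypothetical illustration of stopping replay before the suspected corruption, not something reported in the thread:

    # Written into $PGDATA before starting the postmaster for archive recovery.
    cat > /var/db/adecn/adecndb/recovery.conf <<'EOF'
    restore_command = 'cp -p /usr/tmp/2008-06-25_wals/%f %p'
    # recovery_target_time = '2008-06-24 23:00:00 PDT'   # hypothetical example
    EOF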
"Andrew Hammond" <andrew.george.hammond@gmail.com> writes: > (I thought this line was interesting) > Jun 27 15:54:31 qadb2 postgres[92519]: [44-1] PANIC: could not open > relation 1663/16386/679439393: No such file or directory > I googled to find out what the numbers 1663/16386/679439393 from the > PANIC message mean, but no luck. tablespaceOID/databaseOID/relfilenode. Looks like just some random user table. Not clear why this would be a crash, *especially* since WAL recovery is generally willing to create nonexistent files. Is this reproducible? > (On Thursday night) > vacuumdb: vacuuming of database "adecndb" failed: ERROR: could not > write block 209610 of relation 1663/16386/236356665: No space left on > device > CONTEXT: writing block 209610 of relation 1663/16386/236356665 That's pretty frickin' odd as well, because as a rule we make sure that backing store exists for each table page before we open it up for normal writing. Do you have a way to find out what relation 1663/16386/236356665 is? What filesystem is this database sitting on? > Now, the first message is very strange since we have monitoring on the > file system used by the database and it's been hovering at about 18% > space used for the last month. So I can't figure out why we'd get "No > space left on device", assuming the device is actually the disk (which > seems reasonable given the context) and not shared memory. Yeah, this is definitely a case of ENOSPC being returned by write(), If you're sure the disk wasn't out of space, maybe some per-user quota was getting in the way? regards, tom lane
On Fri, Jun 27, 2008 at 8:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Andrew Hammond" <andrew.george.hammond@gmail.com> writes:
>> (I thought this line was interesting)
>> Jun 27 15:54:31 qadb2 postgres[92519]: [44-1] PANIC: could not open relation 1663/16386/679439393: No such file or directory
>
>> I googled to find out what the numbers 1663/16386/679439393 from the PANIC message mean, but no luck.
>
> tablespaceOID/databaseOID/relfilenode. Looks like just some random user table. Not clear why this would be a crash, *especially* since WAL recovery is generally willing to create nonexistent files. Is this reproducible?

Yes, both when I just tried to restart the recovery:

Jun 30 16:04:43 qadb2 postgres[20797]: [27-1] DEBUG: invoking IpcMemoryCreate(size=92938240)
Jun 30 16:04:43 qadb2 postgres[20797]: [28-1] DEBUG: max_safe_fds = 983, usable_fds = 1000, already_open = 7
Jun 30 16:04:43 qadb2 postgres[20798]: [29-1] LOG: database system was interrupted while in recovery at 2008-06-27 15:54:31 PDT
Jun 30 16:04:43 qadb2 postgres[20798]: [29-2] HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
Jun 30 16:04:43 qadb2 postgres[20798]: [30-1] LOG: starting archive recovery
Jun 30 16:04:43 qadb2 postgres[20798]: [31-1] LOG: restore_command = "cp -p /usr/tmp/2008-06-25_wals/%f %p"
Jun 30 16:04:43 qadb2 postgres[20798]: [32-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001.history pg_xlog/RECOVERYHISTORY"
Jun 30 16:04:43 qadb2 postgres[20798]: [33-1] DEBUG: could not restore file "00000001.history" from archive: return code 256
Jun 30 16:04:43 qadb2 postgres[20798]: [34-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001000001D600000078.0055F0B8.backup
Jun 30 16:04:43 qadb2 postgres[20798]: [34-2] pg_xlog/RECOVERYHISTORY"
Jun 30 16:04:43 qadb2 postgres[20798]: [35-1] LOG: restored log file "00000001000001D600000078.0055F0B8.backup" from archive
Jun 30 16:04:43 qadb2 postgres[20798]: [36-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001000001D600000078 pg_xlog/RECOVERYXLOG"
Jun 30 16:04:44 qadb2 postgres[20797]: [29-1] DEBUG: forked new backend, pid=20805 socket=8
Jun 30 16:04:44 qadb2 postgres[20805]: [29-1] LOG: connection received: host=[local]
Jun 30 16:04:44 qadb2 postgres[20805]: [30-1] FATAL: the database system is starting up
Jun 30 16:04:44 qadb2 postgres[20805]: [31-1] DEBUG: proc_exit(0)
Jun 30 16:04:44 qadb2 postgres[20805]: [32-1] DEBUG: shmem_exit(0)
Jun 30 16:04:44 qadb2 postgres[20805]: [33-1] DEBUG: exit(0)
Jun 30 16:04:44 qadb2 postgres[20797]: [30-1] DEBUG: reaping dead processes
Jun 30 16:04:44 qadb2 postgres[20797]: [31-1] DEBUG: server process (PID 20805) exited with exit code 0
Jun 30 16:04:44 qadb2 postgres[20798]: [37-1] LOG: restored log file "00000001000001D600000078" from archive
Jun 30 16:04:44 qadb2 postgres[20798]: [38-1] LOG: checkpoint record is at 1D6/7855F0B8
Jun 30 16:04:44 qadb2 postgres[20798]: [39-1] LOG: redo record is at 1D6/7855F0B8; undo record is at 0/0; shutdown FALSE
Jun 30 16:04:44 qadb2 postgres[20798]: [40-1] LOG: next transaction ID: 397171279; next OID: 679516596
Jun 30 16:04:44 qadb2 postgres[20798]: [41-1] LOG: next MultiXactId: 857318; next MultiXactOffset: 1718141
Jun 30 16:04:44 qadb2 postgres[20798]: [42-1] LOG: automatic recovery in progress
Jun 30 16:04:44 qadb2 postgres[20798]: [43-1] LOG: redo starts at 1D6/7855F108
Jun 30 16:04:45 qadb2 postgres[20798]: [44-1] PANIC: could not open relation 1663/16386/679439393: No such file or directory
Jun 30 16:04:45 qadb2 postgres[20797]: [32-1] DEBUG: reaping dead processes
Jun 30 16:04:45 qadb2 postgres[20797]: [33-1] LOG: startup process (PID 20798) was terminated by signal 6
Jun 30 16:04:45 qadb2 postgres[20797]: [34-1] LOG: aborting startup due to startup process failure
Jun 30 16:04:45 qadb2 postgres[20797]: [35-1] DEBUG: proc_exit(1)
Jun 30 16:04:45 qadb2 postgres[20797]: [36-1] DEBUG: shmem_exit(1)
Jun 30 16:04:45 qadb2 postgres[20797]: [37-1] DEBUG: exit(1)

And also when I tried to wipe the slate clean and recover it freshly.

Jun 30 19:11:59 qadb2 postgres[23091]: [1-1] DEBUG: postmaster: PostmasterMain: initial environ dump:
Jun 30 19:11:59 qadb2 postgres[23091]: [2-1] DEBUG: -----------------------------------------
Jun 30 19:11:59 qadb2 postgres[23091]: [3-1] DEBUG: USER=pgsql
Jun 30 19:11:59 qadb2 postgres[23091]: [4-1] DEBUG: MAIL=/var/mail/pgsql
Jun 30 19:11:59 qadb2 postgres[23091]: [5-1] DEBUG: LD_LIBRARY_PATH=:/usr/local/adecn/lib
Jun 30 19:11:59 qadb2 postgres[23091]: [6-1] DEBUG: HOME=/usr/local/pgsql
Jun 30 19:11:59 qadb2 postgres[23091]: [7-1] DEBUG: PGLIB=/usr/local/lib
Jun 30 19:11:59 qadb2 postgres[23091]: [8-1] DEBUG: PS1=[QA2] \u@\h:\w\$
Jun 30 19:11:59 qadb2 postgres[23091]: [9-1] DEBUG: BLOCKSIZE=K
Jun 30 19:11:59 qadb2 postgres[23091]: [10-1] DEBUG: TERM=xterm
Jun 30 19:11:59 qadb2 postgres[23091]: [11-1] DEBUG: PGSYSCONFDIR=/usr/local/etc/postgresql
Jun 30 19:11:59 qadb2 postgres[23091]: [12-1] DEBUG: PGLOCALEDIR=/usr/local/share/locale
Jun 30 19:11:59 qadb2 postgres[23091]: [13-1] DEBUG:
Jun 30 19:11:59 qadb2 postgres[23091]: [13-2] PATH=/usr/local/adecn/bin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/games:/usr/local/sbin:/usr/local/bin:/usr/X11R6/bin
Jun 30 19:11:59 qadb2 postgres[23091]: [13-3] :/usr/local/pgsql/bin

(snip junk about environment)

Jun 30 19:11:59 qadb2 postgres[23091]: [19-1] DEBUG: PGDATA=/var/db/adecn/adecndb
Jun 30 19:11:59 qadb2 postgres[23091]: [20-1] DEBUG: LC_COLLATE=C
Jun 30 19:11:59 qadb2 postgres[23091]: [21-1] DEBUG: LC_CTYPE=C
Jun 30 19:11:59 qadb2 postgres[23091]: [22-1] DEBUG: LC_MESSAGES=C
Jun 30 19:11:59 qadb2 postgres[23091]: [23-1] DEBUG: LC_MONETARY=C
Jun 30 19:11:59 qadb2 postgres[23091]: [24-1] DEBUG: LC_NUMERIC=C
Jun 30 19:11:59 qadb2 postgres[23091]: [25-1] DEBUG: LC_TIME=C
Jun 30 19:11:59 qadb2 postgres[23091]: [26-1] DEBUG: -----------------------------------------
Jun 30 19:11:59 qadb2 postgres[23092]: [27-1] DEBUG: invoking IpcMemoryCreate(size=92938240)
Jun 30 19:11:59 qadb2 postgres[23092]: [28-1] DEBUG: max_safe_fds = 983, usable_fds = 1000, already_open = 7
Jun 30 19:11:59 qadb2 postgres[23093]: [29-1] LOG: database system was interrupted at 2008-06-25 03:01:02 PDT
Jun 30 19:11:59 qadb2 postgres[23093]: [30-1] LOG: starting archive recovery
Jun 30 19:11:59 qadb2 postgres[23093]: [31-1] LOG: restore_command = "cp -p /usr/tmp/2008-06-25_wals/%f %p"
Jun 30 19:11:59 qadb2 postgres[23093]: [32-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001.history pg_xlog/RECOVERYHISTORY"
Jun 30 19:11:59 qadb2 postgres[23093]: [33-1] DEBUG: could not restore file "00000001.history" from archive: return code 256
Jun 30 19:11:59 qadb2 postgres[23093]: [34-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001000001D600000078.0055F0B8.backup
Jun 30 19:11:59 qadb2 postgres[23093]: [34-2] pg_xlog/RECOVERYHISTORY"
Jun 30 19:11:59 qadb2 postgres[23093]: [35-1] LOG: restored log file "00000001000001D600000078.0055F0B8.backup" from archive
Jun 30 19:11:59 qadb2 postgres[23093]: [36-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001000001D600000078 pg_xlog/RECOVERYXLOG"
Jun 30 19:12:00 qadb2 postgres[23093]: [37-1] LOG: restored log file "00000001000001D600000078" from archive
Jun 30 19:12:00 qadb2 postgres[23093]: [38-1] LOG: checkpoint record is at 1D6/7855F0B8
Jun 30 19:12:00 qadb2 postgres[23093]: [39-1] LOG: redo record is at 1D6/7855F0B8; undo record is at 0/0; shutdown FALSE
Jun 30 19:12:00 qadb2 postgres[23093]: [40-1] LOG: next transaction ID: 397171279; next OID: 679516596
Jun 30 19:12:00 qadb2 postgres[23093]: [41-1] LOG: next MultiXactId: 857318; next MultiXactOffset: 1718141
Jun 30 19:12:00 qadb2 postgres[23093]: [42-1] LOG: automatic recovery in progress
Jun 30 19:12:00 qadb2 postgres[23093]: [43-1] LOG: redo starts at 1D6/7855F108
Jun 30 19:12:00 qadb2 postgres[23093]: [44-1] PANIC: could not open relation 1663/16386/679439393: No such file or directory
Jun 30 19:12:00 qadb2 postgres[23092]: [29-1] DEBUG: reaping dead processes
Jun 30 19:12:00 qadb2 postgres[23092]: [30-1] LOG: startup process (PID 23093) was terminated by signal 6
Jun 30 19:12:00 qadb2 postgres[23092]: [31-1] LOG: aborting startup due to startup process failure
Jun 30 19:12:00 qadb2 postgres[23092]: [32-1] DEBUG: proc_exit(1)
Jun 30 19:12:00 qadb2 postgres[23092]: [33-1] DEBUG: shmem_exit(1)
Jun 30 19:12:00 qadb2 postgres[23092]: [34-1] DEBUG: exit(1)

So yes, I believe this is entirely repeatable.

>> (On Thursday night)
>> vacuumdb: vacuuming of database "adecndb" failed: ERROR: could not write block 209610 of relation 1663/16386/236356665: No space left on device
>> CONTEXT: writing block 209610 of relation 1663/16386/236356665
>
> That's pretty frickin' odd as well, because as a rule we make sure that backing store exists for each table page before we open it up for normal writing. Do you have a way to find out what relation 1663/16386/236356665 is?

1663 = pg_default tablespace
16386 = adecndb, the only database (aside from postgres, template1 & 0 on this backend).

SELECT relname FROM pg_class WHERE relfilenode = 236356665; -- got nothing

But this is on the production database where a vacuum has already succeeded.

> What filesystem is this database sitting on?

FreeBSD 6.2 ufs. Here are the options from /etc/fstab.

/dev/da1 /xraid ufs rw,noatime 2 2

>> Now, the first message is very strange since we have monitoring on the file system used by the database and it's been hovering at about 18% space used for the last month. So I can't figure out why we'd get "No space left on device", assuming the device is actually the disk (which seems reasonable given the context) and not shared memory.
>
> Yeah, this is definitely a case of ENOSPC being returned by write(). If you're sure the disk wasn't out of space, maybe some per-user quota was getting in the way?

Monitoring is using df to determine space available and runs every 5 minutes. It will alarm at 70% usage and critical at 80% usage. We received no alarms or other notifications. We do not yet have tracking in place to store the history of this, so I can't say definitively that we didn't spike over that usage level. However we are only at 18% currently, so it seems unlikely. Nothing else uses this disk.

I checked /etc/rc.conf and did not find quotas turned on. Neither are they enabled in the fstab. I confirmed using

quota -v pgsql

It says "none".

Andrew
Does anyone else have any suggestions about what I can do to diagnose this? Do I need to re-initdb or can I reasonably keep running with the existing db?

A

On Mon, Jun 30, 2008 at 7:20 PM, Andrew Hammond <andrew.george.hammond@gmail.com> wrote:
> (snip full quote of the previous message)
"Andrew Hammond" <andrew.george.hammond@gmail.com> writes: > Does anyone else have any suggestions about what I can do to diagnose this? The whole thing is pretty mystifying, especially the ENOSPC write failure on what seems like it couldn't have been a full disk. > Jun 27 15:54:31 qadb2 postgres[92519]: [44-1] PANIC: could not open > relation 1663/16386/679439393: No such file or directory I don't think anyone asked before --- after the restore fails with the above, does the directory $PGDATA/base/16386/ exist? Although WAL recovery should attempt to create missing files, I think it won't try to create missing directories. regards, tom lane
On Thu, Jul 3, 2008 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Andrew Hammond" <andrew.george.hammond@gmail.com> writes:
>> Does anyone else have any suggestions about what I can do to diagnose this?
>
> The whole thing is pretty mystifying, especially the ENOSPC write failure on what seems like it couldn't have been a full disk.

Yes, I've passed along the task of explaining why PG thought the disk was full to the sysadmin responsible for the box. I'll post the answer here, when and if we have one.

>> Jun 27 15:54:31 qadb2 postgres[92519]: [44-1] PANIC: could not open relation 1663/16386/679439393: No such file or directory
>
> I don't think anyone asked before --- after the restore fails with the above, does the directory $PGDATA/base/16386/ exist? Although WAL recovery should attempt to create missing files, I think it won't try to create missing directories.

The directory exists (and the 679439393 file does not).

Andrew
"Andrew Hammond" <andrew.george.hammond@gmail.com> writes: > On Thu, Jul 3, 2008 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The whole thing is pretty mystifying, especially the ENOSPC write >> failure on what seems like it couldn't have been a full disk. > Yes, I've passed along the task of explaining why PG thought the disk > was full to the sysadmin responsible for the box. I'll post the answer > here, when and if we have one. I just noticed something even more mystifying: you said that the ENOSPC error occurred once a day during vacuuming. That doesn't make any sense, because a write error would leave the shared buffer still marked dirty, and so the next checkpoint would try to write it again. If there's a persistent write error on a particular block, you should see it being complained of at least once per checkpoint interval. If you didn't see that, it suggests that the ENOSPC was transient, which isn't unreasonable --- but why would it recur for the exact same block each night? Have you looked into the machine's kernel log to see if there is any evidence of low-level distress (hardware or filesystem level)? I'm wondering if ENOSPC is being reported because it is the closest available errno code, but the real problem is something different than the error message text suggests. Other than the errno the symptoms all look quite a bit like a bad-sector problem ... regards, tom lane
On Thu, Jul 3, 2008 at 3:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Andrew Hammond" <andrew.george.hammond@gmail.com> writes:
>> Yes, I've passed along the task of explaining why PG thought the disk was full to the sysadmin responsible for the box. I'll post the answer here, when and if we have one.
>
> I just noticed something even more mystifying: you said that the ENOSPC error occurred once a day during vacuuming.

Actually, the ENOSPC happened once. After that first error, we got

vacuumdb: vacuuming of database "adecndb" failed: ERROR: failed to re-find parent key in "ledgerdetail_2008_03_idx2" for deletion target page 64767

repeatedly.

> That doesn't make any sense, because a write error would leave the shared buffer still marked dirty, and so the next checkpoint would try to write it again. If there's a persistent write error on a particular block, you should see it being complained of at least once per checkpoint interval.
>
> If you didn't see that, it suggests that the ENOSPC was transient, which isn't unreasonable --- but why would it recur for the exact same block each night?
>
> Have you looked into the machine's kernel log to see if there is any evidence of low-level distress (hardware or filesystem level)? I'm wondering if ENOSPC is being reported because it is the closest available errno code, but the real problem is something different than the error message text suggests. Other than the errno the symptoms all look quite a bit like a bad-sector problem ...

I will pass this along to the sysadmin in charge of this box.
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Have you looked into the machine's kernel log to see if there is any > evidence of low-level distress (hardware or filesystem level)? I'm > wondering if ENOSPC is being reported because it is the closest > available errno code, but the real problem is something different than > the error message text suggests. Other than the errno the symptoms > all look quite a bit like a bad-sector problem ... Uhm, just for the record FileWrite returns error messages which get printed this way for two reasons other than write(2) returning ENOSPC: 1) if FileAccess has to reopen the file then open(2) could return an error. I don't see how open returns ENOSPC without O_CREAT (and that's cleared for reopening) 2) If write(2) returns < 0 but doesn't set errno. That also seems like a strange case that shouldn't happen, but perhaps there's some reason it can. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB'sPostgreSQL training!
On Thu, Jul 3, 2008 at 10:57 PM, Andrew Hammond <andrew.george.hammond@gmail.com> wrote:
> On Thu, Jul 3, 2008 at 3:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Have you looked into the machine's kernel log to see if there is any evidence of low-level distress (hardware or filesystem level)? I'm wondering if ENOSPC is being reported because it is the closest available errno code, but the real problem is something different than the error message text suggests. Other than the errno the symptoms all look quite a bit like a bad-sector problem ...

da1 is the storage device where the PGDATA lives.

Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929ba560:6810 timed out for ccb 0xffffff0000e20000 (req->ccb 0xffffff0000e20000)
Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929b90c0:6811 timed out for ccb 0xffffff0001081000 (req->ccb 0xffffff0001081000)
Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929b9f88:6812 timed out for ccb 0xffffff0000d93800 (req->ccb 0xffffff0000d93800)
Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929ba560:6810 function 0
Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929bcc90:6813 timed out for ccb 0xffffff03e132dc00 (req->ccb 0xffffff03e132dc00)
Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929ba560:6810
Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929ba560:0 completed
Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929b90c0:6811 function 0
Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929b90c0:6811
Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929b90c0:0 completed
Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929b9f88:6812 function 0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): WRITE(16). CDB: 8a 0 0 0 0 1 6c 99 9 c0 0 0 0 20 0 0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): CAM Status: SCSI Status Error
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): SCSI Status: Check Condition
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): UNIT ATTENTION asc:29,0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Power on, reset, or bus device reset occurred
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Retrying Command (per Sense Data)
Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929b9f88:6812
Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929b9f88:0 completed
Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929bcc90:6813 function 0
Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929bcc90:6813
Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929bcc90:0 completed
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): WRITE(16). CDB: 8a 0 0 0 0 1 65 1b 71 a0 0 0 0 20 0 0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): CAM Status: SCSI Status Error
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): SCSI Status: Check Condition
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): UNIT ATTENTION asc:29,0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Power on, reset, or bus device reset occurred
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Retrying Command (per Sense Data)
Jun 19 03:18:16 db1 kernel: mpt3: request 0xffffffff929d5900:56299 timed out for ccb 0xffffff03df7f5000 (req->ccb 0xffffff03df7f5000)

I think this is a smoking gun.

Andrew
"Andrew Hammond" <andrew.george.hammond@gmail.com> writes: >> On Thu, Jul 3, 2008 at 3:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Have you looked into the machine's kernel log to see if there is any >>> evidence of low-level distress (hardware or filesystem level)? > Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req > 0xffffffff929b9f88:6812 function 0 > Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): WRITE(16). CDB: 8a 0 0 0 > 0 1 6c 99 9 c0 0 0 0 20 0 0 > Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): CAM Status: SCSI Status Error > Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): SCSI Status: Check Condition > Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): UNIT ATTENTION asc:29,0 > Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Power on, reset, or bus > device reset occurred > [etc] > I think this is a smoking gun. Yeah, sure looks like one. Time to replace that disk drive? Also, I suggest filing a bug with your kernel distributor --- ENOSPC was a totally misleading error code here. Seems like EIO would be more appropriate. They'll probably want to see the kernel log. regards, tom lane
On Mon, Jul 7, 2008 at 12:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Andrew Hammond" <andrew.george.hammond@gmail.com> writes:
>> I think this is a smoking gun.
>
> Yeah, sure looks like one. Time to replace that disk drive?

Well, that's an Apple XRaid device on the other side of an LSI FiberChannel HBA. I'll see if it has any error messages in its logs.

> Also, I suggest filing a bug with your kernel distributor --- ENOSPC was a totally misleading error code here. Seems like EIO would be more appropriate. They'll probably want to see the kernel log.

It's FreeBSD 6.2 (yes, I know it's EoL). I'll submit a bug including this email.
On Mon, Jul 7, 2008 at 2:08 PM, Andrew Hammond <andrew.george.hammond@gmail.com> wrote:
> On Mon, Jul 7, 2008 at 12:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Also, I suggest filing a bug with your kernel distributor --- ENOSPC was a totally misleading error code here. Seems like EIO would be more appropriate. They'll probably want to see the kernel log.
>
> It's FreeBSD 6.2 (yes, I know it's EoL). I'll submit a bug including this email.

http://www.freebsd.org/cgi/query-pr.cgi?pr=125382

Andrew