Re: the un-vacuumable table - Mailing list pgsql-hackers
From | Andrew Hammond |
---|---|
Subject | Re: the un-vacuumable table |
Date | |
Msg-id | 5a0a9d6f0806271631p3e7b627obc031b30f445a1ac@mail.gmail.com |
In response to | Re: the un-vacuumable table ("Andrew Hammond" <andrew.george.hammond@gmail.com>) |
Responses | Re: the un-vacuumable table |
List | pgsql-hackers |
On Wed, Jun 25, 2008 at 9:57 AM, Andrew Hammond <andrew.george.hammond@gmail.com> wrote:
>
> On Wed, Jun 25, 2008 at 2:58 AM, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
>>
>> Andrew Hammond wrote:
>>>
>>> I found this error message in my log files repeatedly:
>>>
>>> Error: failed to re-find parent key in "ledgerdetail_2008_03_idx2" for
>>> deletion target page 64767
>>>
>>> I thought "hmm, that index looks broken. I'd better re-create it." So, I
>>> dropped the index and then tried to create a new one to replace it. Which
>>> completely locked up the backend that was running the CREATE INDEX. I ran
>>> truss against the backend in question and it didn't register anything
>>> (except signals 2 and 15 when I tried to cancel the query and kill the
>>> backend respectively). I eventually had to restart the database to get the
>>> CREATE INDEX process to go away (well, to release that big nasty lock).
>>
>> What kind of an index is it? Does "SELECT COUNT(*) from <table>" work?
>
> After the restart I did a count(*) and it worked. A little under 13m rows.
> So, sequential scans seem to work.
>
>>> Anyway, the current plan is to drop the table and reload it from backup. I'm
>>> posting here in case there's interest in gathering some forensic data or a
>>> clever suggestion about how I can recover this situation, or even some ideas
>>> about what's causing it.
>>
>> Yes, please take a filesystem-level backup right away to retain the evidence.
>
> Well, I've already burned our downtime allowance for this month, but we do a
> regular PITR-type backup which hopefully will be sufficient to replicate the
> problem.
>
>> Could you connect to the hung backend with gdb and get a stacktrace?
>
> The backend is no longer hung (two restarts later). I'll try to reproduce this
> problem on my workstation (same binary, same OS, libraries etc) using the PITR
> dump.
>
> Andrew
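For next time: if a backend wedges like that again, I'll try to grab the backtrace Heikki asked for before restarting anything. Roughly what I have in mind (just a sketch; the binary path and PID below are placeholders, and I'm assuming the stock gdb on that box):

    # find the PID of the wedged backend (the process title normally shows what it's running)
    ps auxww | grep "postgres.*CREATE INDEX"

    # attach non-interactively and dump a backtrace without killing the backend
    echo bt > /tmp/gdbcmds
    gdb -x /tmp/gdbcmds -batch /usr/local/pgsql/bin/postgres 12345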
I tried to restore the PITR backup and it failed.

Jun 27 15:54:30 qadb2 postgres[92517]: [1-1] DEBUG: postmaster: PostmasterMain: initial environ dump:
Jun 27 15:54:30 qadb2 postgres[92517]: [2-1] DEBUG: -----------------------------------------
Jun 27 15:54:30 qadb2 postgres[92517]: [3-1] DEBUG: USER=pgsql
Jun 27 15:54:30 qadb2 postgres[92517]: [4-1] DEBUG: MAIL=/var/mail/pgsql
Jun 27 15:54:30 qadb2 postgres[92517]: [5-1] DEBUG: LD_LIBRARY_PATH=:/usr/local/adecn/lib
Jun 27 15:54:30 qadb2 postgres[92517]: [6-1] DEBUG: HOME=/usr/local/pgsql
Jun 27 15:54:30 qadb2 postgres[92517]: [7-1] DEBUG: PGLIB=/usr/local/lib
Jun 27 15:54:30 qadb2 postgres[92517]: [8-1] DEBUG: PS1=[QA2] \u@\h:\w\$
Jun 27 15:54:30 qadb2 postgres[92517]: [9-1] DEBUG: BLOCKSIZE=K
Jun 27 15:54:30 qadb2 postgres[92517]: [10-1] DEBUG: TERM=xterm
Jun 27 15:54:30 qadb2 postgres[92517]: [11-1] DEBUG: PGSYSCONFDIR=/usr/local/etc/postgresql
Jun 27 15:54:30 qadb2 postgres[92517]: [12-1] DEBUG: PGLOCALEDIR=/usr/local/share/locale
Jun 27 15:54:30 qadb2 postgres[92517]: [13-1] DEBUG:
Jun 27 15:54:30 qadb2 postgres[92517]: [13-2] PATH=/usr/local/adecn/bin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/games:/usr/local/sbin:/usr/local/bin:/usr/X11R6/bin
Jun 27 15:54:30 qadb2 postgres[92517]: [13-3] :/usr/local/pgsql/bin
Jun 27 15:54:30 qadb2 postgres[92517]: [14-1] DEBUG: ADECN_HOME=/usr/local/adecn
Jun 27 15:54:30 qadb2 postgres[92517]: [15-1] DEBUG: SHELL=/bin/sh
Jun 27 15:54:30 qadb2 postgres[92517]: [16-1] DEBUG:
Jun 27 15:54:30 qadb2 postgres[92517]: [16-2] CLASSPATH=

(deleted a bunch of lines)

Jun 27 15:54:30 qadb2 postgres[92517]: [17-1] DEBUG: PYTHONPATH=/usr/local/adecn/python:/usr/local/adecn/lib/python:/usr/local/adecn/api/client/python
Jun 27 15:54:30 qadb2 postgres[92517]: [18-1] DEBUG: FTP_PASSIVE_MODE=YES
Jun 27 15:54:30 qadb2 postgres[92517]: [19-1] DEBUG: PGDATA=/var/db/adecn/adecndb
Jun 27 15:54:30 qadb2 postgres[92517]: [20-1] DEBUG: LC_COLLATE=C
Jun 27 15:54:30 qadb2 postgres[92517]: [21-1] DEBUG: LC_CTYPE=C
Jun 27 15:54:30 qadb2 postgres[92517]: [22-1] DEBUG: LC_MESSAGES=C
Jun 27 15:54:30 qadb2 postgres[92517]: [23-1] DEBUG: LC_MONETARY=C
Jun 27 15:54:30 qadb2 postgres[92517]: [24-1] DEBUG: LC_NUMERIC=C
Jun 27 15:54:30 qadb2 postgres[92517]: [25-1] DEBUG: LC_TIME=C
Jun 27 15:54:30 qadb2 postgres[92517]: [26-1] DEBUG: -----------------------------------------
Jun 27 15:54:30 qadb2 postgres[92518]: [27-1] DEBUG: invoking IpcMemoryCreate(size=92938240)
Jun 27 15:54:30 qadb2 postgres[92518]: [28-1] DEBUG: max_safe_fds = 983, usable_fds = 1000, already_open = 7
Jun 27 15:54:30 qadb2 postgres[92519]: [29-1] LOG: database system was interrupted at 2008-06-25 03:01:02 PDT
Jun 27 15:54:30 qadb2 postgres[92519]: [30-1] LOG: starting archive recovery
Jun 27 15:54:30 qadb2 postgres[92519]: [31-1] LOG: restore_command = "cp -p /usr/tmp/2008-06-25_wals/%f %p"
Jun 27 15:54:30 qadb2 postgres[92519]: [32-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001.history pg_xlog/RECOVERYHISTORY"
Jun 27 15:54:30 qadb2 postgres[92519]: [33-1] DEBUG: could not restore file "00000001.history" from archive: return code 256
Jun 27 15:54:30 qadb2 postgres[92519]: [34-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001000001D600000078.0055F0B8.backup
Jun 27 15:54:30 qadb2 postgres[92519]: [34-2] pg_xlog/RECOVERYHISTORY"
Jun 27 15:54:30 qadb2 postgres[92519]: [35-1] LOG: restored log file "00000001000001D600000078.0055F0B8.backup" from archive
Jun 27 15:54:30 qadb2 postgres[92519]: [36-1] DEBUG: executing restore command "cp -p /usr/tmp/2008-06-25_wals/00000001000001D600000078 pg_xlog/RECOVERYXLOG"
Jun 27 15:54:31 qadb2 postgres[92519]: [37-1] LOG: restored log file "00000001000001D600000078" from archive
Jun 27 15:54:31 qadb2 postgres[92519]: [38-1] LOG: checkpoint record is at 1D6/7855F0B8
Jun 27 15:54:31 qadb2 postgres[92519]: [39-1] LOG: redo record is at 1D6/7855F0B8; undo record is at 0/0; shutdown FALSE
Jun 27 15:54:31 qadb2 postgres[92519]: [40-1] LOG: next transaction ID: 397171279; next OID: 679516596
Jun 27 15:54:31 qadb2 postgres[92519]: [41-1] LOG: next MultiXactId: 857318; next MultiXactOffset: 1718141
Jun 27 15:54:31 qadb2 postgres[92519]: [42-1] LOG: automatic recovery in progress
Jun 27 15:54:31 qadb2 postgres[92519]: [43-1] LOG: redo starts at 1D6/7855F108

(I thought this line was interesting)
Jun 27 15:54:31 qadb2 postgres[92519]: [44-1] PANIC: could not open relation 1663/16386/679439393: No such file or directory

Jun 27 15:54:31 qadb2 postgres[92518]: [29-1] DEBUG: forked new backend, pid=92526 socket=8
Jun 27 15:54:31 qadb2 postgres[92526]: [29-1] LOG: connection received: host=[local]
Jun 27 15:54:31 qadb2 postgres[92526]: [30-1] FATAL: the database system is starting up
Jun 27 15:54:31 qadb2 postgres[92526]: [31-1] DEBUG: proc_exit(0)
Jun 27 15:54:31 qadb2 postgres[92526]: [32-1] DEBUG: shmem_exit(0)
Jun 27 15:54:31 qadb2 postgres[92526]: [33-1] DEBUG: exit(0)
Jun 27 15:54:31 qadb2 postgres[92518]: [30-1] DEBUG: reaping dead processes
Jun 27 15:54:31 qadb2 postgres[92518]: [31-1] DEBUG: server process (PID 92526) exited with exit code 0
Jun 27 15:54:31 qadb2 postgres[92518]: [32-1] DEBUG: reaping dead processes
Jun 27 15:54:31 qadb2 postgres[92518]: [33-1] LOG: startup process (PID 92519) was terminated by signal 6
Jun 27 15:54:31 qadb2 postgres[92518]: [34-1] LOG: aborting startup due to startup process failure
Jun 27 15:54:31 qadb2 postgres[92518]: [35-1] DEBUG: proc_exit(1)
Jun 27 15:54:31 qadb2 postgres[92518]: [36-1] DEBUG: shmem_exit(1)
Jun 27 15:54:31 qadb2 postgres[92518]: [37-1] DEBUG: exit(1)

I googled to find out what the numbers 1663/16386/679439393 from the PANIC message mean, but no luck. The last looks like an OID.

I also hunted through our production log files and found the following interesting error messages.

(On Thursday night)

vacuumdb: vacuuming of database "adecndb" failed: ERROR: could not write block 209610 of relation 1663/16386/236356665: No space left on device
CONTEXT: writing block 209610 of relation 1663/16386/236356665

(Friday night, and every night until the incident on Tuesday)

vacuumdb: vacuuming of database "adecndb" failed: ERROR: failed to re-find parent key in "ledgerdetail_2008_03_idx2" for deletion target page 64767

Now, the first message is very strange, since we have monitoring on the file system used by the database and it's been hovering at about 18% space used for the last month. So I can't figure out why we'd get "No space left on device", assuming the device is actually the disk (which seems reasonable given the context) and not shared memory.

I also checked our bug system for history and discovered we'd seen similar problems in the past when connecting to a SAN via an FC switch. Directly attaching the server to the SAN appeared to solve the problem. So currently I'm wondering if this might be a hardware / OS / some-other-part-of-the-storage-stack issue manifesting as trashing the database in some way?

Interestingly, since I dropped the ledgerdetail_2008_03_idx2 index, the nightly vacuum has been completing successfully.
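Going back to the numbers in that PANIC message: if they're the usual tablespace OID / database OID / relfilenode triple, the missing file would be $PGDATA/base/16386/679439393, and something like the following run against the live origin database should map them back to names. Just a sketch; I'm assuming the database is adecndb as in the vacuumdb output, and pg_class may have no matching row if the relation has since been dropped or rewritten:

    # 1663 should be the pg_default tablespace
    psql -d adecndb -c "SELECT spcname FROM pg_tablespace WHERE oid = 1663;"

    # which database OID 16386 belongs to
    psql -d adecndb -c "SELECT datname FROM pg_database WHERE oid = 16386;"

    # the third number is a relfilenode, not necessarily the relation's OID
    psql -d adecndb -c "SELECT relname, relkind FROM pg_class WHERE relfilenode = 679439393;"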
I'd also like advice on whether I need to schedule another outage and do a pg_dump / initdb / load (well, more accurately, do a Slony move to the backup box and rebuild the old origin). Or perhaps I should ask how urgently.

Andrew