Thread: simple update closes connection. why?
hi, this simple update closes the connection. I use the database to dynamically update titles and directory on my web pages. I show table structure first then the command then, uh, the closed connection. calling=# \d calling; Table "public.calling" Column | Type | Modifiers -------------+------------------------+----------- filename | character varying(40) | description | character varying(180) | uname | character varying(30) | Indexes: unique_filename unique btree (filename), unique_filename_idx btree (filename) calling=# select * from calling limit 5; filename | description | uname -----------------+-------------------------------------+------- c_java_python | | conf | old conf directory | mod_perl_webapp | main weba pplication, now a symlink | python | | ssh+ssl | | (5 rows) calling=# update calling set uname='pencilhead'; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !# \q [pencilhead@www postgres]$ psql -d calling Password: Welcome to psql 7.3.1, the PostgreSQL interactive terminal. Type: \copyright for distribution terms \h for help with SQL commands \? for help on internal slash commands \g or terminate with semicolon to execute query \q to quit calling=# select count(*) from calling; count ------- 1683 (1 row) calling=# \dp calling Access privileges for database "calling" Schema | Table | Access privileges --------+-------+--------------------------------- public | calling | {=,pencilhead=arwdRxt} (1 row) joe speigle
hi, I increased the debugging in postgresql.conf but not many changes calling=# update calling set uname='joe'; DEBUG: StartTransactionCommand DEBUG: ProcessQuery server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !# -- joe speigle
joseph speigle <joe.speigle@jklh.us> writes: > this simple update closes the connection. My first guess is there's some data corruption in the table or one of its indexes. It would be useful to get a stack trace showing where in the backend the core dump occurs --- can you manage that? Other things to check: can you do a "select * from calling" without crashing? What shows up in the postmaster log when the crash occurs? If you just want to fix the database ASAP, you could try a REINDEX on the table --- if the problem is in the indexes that should get rid of it. But it would also destroy the evidence of exactly what went wrong. regards, tom lane
tom, i have a few things for you. namely, i increased the debug settings for logging and i ran gdb by attaching to the runningprocess. history: i had altered the table when it was okay with the following sequence alter table calling add column uname character varying(20); alter table calling alter column uname set not null; said I couldn't do that because there were null values in it, so that's when i tried the update calling set uname='joe'; and got errors. maybe the attempt to set to not null with empty records was bad. ======================================================================================================= pg_dump -d 'calling' > dump.sql .................... pg_dump: DEBUG: StartTransactionCommand pg_dump: DEBUG: ProcessUtility pg_dump: DEBUG: CommitTransactionCommand DEBUG: StartTransactionCommand DEBUG: ProcessUtility pg_dump: DEBUG: StartTransactionCommand pg_dump: DEBUG: ProcessUtility pg_dump: dumpClasses(): SQL command failed pg_dump: Error message from server: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. pg_dump: The command was: FETCH 100 FROM _pg_dump_cursor DEBUG: reaping dead processes DEBUG: child process (pid 16468) was terminated by signal 11 LOG: server process (pid 16468) was terminated by signal 11 LOG: terminating any other active server processes LOG: all server processes terminated; reinitializing shared memory and semaphores DEBUG: shmem_exit(0) DEBUG: invoking IpcMemoryCreate(size=3620864) -bash-2.05b$ LOG: database system was interrupted at 2004-02-01 01:16:07 CST LOG: checkpoint record is at 0/344EA680 LOG: redo record is at 0/344EA680; undo record is at 0/0; shutdown TRUE LOG: next transaction id: 8800413; next oid: 614903 LOG: database system was not properly shut down; automatic recovery in progress LOG: ReadRecord: record with zero length at 0/344EA6C0 LOG: redo is not required LOG: database system is ready DEBUG: proc_exit(0) DEBUG: shmem_exit(0) DEBUG: exit(0) DEBUG: reaping dead processes ================================================================== postgres 17460 0.0 0.3 10952 3476 pts/4 S 01:52 0:00 postgres: joe calling [local] idle postgres 17461 0.0 0.0 2620 704 pts/4 R 01:52 0:00 ps -axu -bash-2.05b$ gdb attach 17460 GNU gdb Red Hat Linux (5.2.1-4) Copyright 2002 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i386-redhat-linux"...attach: No such file or directory. Attaching to process 17460 Reading symbols from /usr/local/pgsql/bin/postgres...done. Reading symbols from /usr/lib/libz.so.1...done. Loaded symbols for /usr/lib/libz.so.1 Reading symbols from /usr/lib/libreadline.so.4...done. Loaded symbols for /usr/lib/libreadline.so.4 Reading symbols from /usr/lib/libncurses.so.5...done. Loaded symbols for /usr/lib/libncurses.so.5 Reading symbols from /lib/libcrypt.so.1...done. Loaded symbols for /lib/libcrypt.so.1 Reading symbols from /lib/libresolv.so.2...done. Loaded symbols for /lib/libresolv.so.2 Reading symbols from /lib/libnsl.so.1...done. Loaded symbols for /lib/libnsl.so.1 Reading symbols from /lib/libdl.so.2...done. Loaded symbols for /lib/libdl.so.2 Reading symbols from /lib/i686/libm.so.6...done. Loaded symbols for /lib/i686/libm.so.6 Reading symbols from /lib/i686/libc.so.6...done. Loaded symbols for /lib/i686/libc.so.6 Reading symbols from /lib/ld-linux.so.2...done. Loaded symbols for /lib/ld-linux.so.2 Reading symbols from /lib/libnss_files.so.2...done. Loaded symbols for /lib/libnss_files.so.2 0x420dabc2 in recv () from /lib/i686/libc.so.6 (gdb) c Continuing. DEBUG: StartTransactionCommand DEBUG: ProcessQuery Program received signal SIGSEGV, Segmentation fault. 0x0806b286 in nocachegetattr () (gdb) where #0 0x0806b286 in nocachegetattr () #1 0x080cef8b in ExecEvalVar () #2 0x080d01e0 in ExecEvalExpr () #3 0x080d04ea in ExecTargetList () #4 0x080d0797 in ExecProject () #5 0x080d0839 in ExecScan () #6 0x080d674b in ExecSeqScan () #7 0x080ce842 in ExecProcNode () #8 0x080d736e in ExecLimit () #9 0x080ce879 in ExecProcNode () #10 0x080cd7cb in ExecutePlan () #11 0x080cce92 in ExecutorRun () #12 0x0811a645 in ProcessQuery () #13 0x08118dc1 in pg_exec_query_string () #14 0x08119c21 in PostgresMain () #15 0x08101ac2 in DoBackend () #16 0x0810163a in BackendStartup () #17 0x08100781 in ServerLoop () #18 0x08100230 in PostmasterMain () #19 0x080defe4 in main () #20 0x420158d4 in __libc_start_main () from /lib/i686/libc.so.6 (gdb) RECOMPILE RECOMPILE RECOMPILE RECOMPILE RECOMPILE RECOMPILE RECOMPILE -g (gdb) bt #0 0x0806b854 in nocachegetattr (tuple=0x82f5b78, attnum=1, tupleDesc=0x406f23a0, isnull=0xbfffe9d3 "") at heaptuple.c:349 #1 0x080f79a3 in ExecEvalVar (variable=0x82f5080, econtext=0x82f5b08, isNull=0xbfffe9d3 "") at execQual.c:377 #2 0x080f972b in ExecEvalExpr (expression=0x82f5080, econtext=0x82f5b08, isNull=0xbfffe9d3 "", isDone=0xbfffe9f0) at execQual.c:1682 #3 0x080f9c3d in ExecTargetList (targetlist=0x82f50a8, nodomains=1, targettype=0x82f5ee8, values=0x82f5fc8, econtext=0x82f5b08, isDone=0xbfffec04) at execQual.c:2058 #4 0x080fa025 in ExecProject (projInfo=0x82f5fa0, isDone=0xbfffec04) at execQual.c:2282 #5 0x080fa140 in ExecScan (node=0x82f59f0, accessMtd=0x8102498 <SeqNext>) at execScan.c:133 #6 0x0810259c in ExecSeqScan (node=0x82f59f0) at nodeSeqscan.c:133 #7 0x080f6c28 in ExecProcNode (node=0x82f59f0, parent=0x0) at execProcnode.c:291 #8 0x080f56ac in ExecutePlan (estate=0x82f5e20, plan=0x82f59f0, operation=CMD_SELECT, numberTuples=0, direction=ForwardScanDirection, destfunc=0x82f5fe0) at execMain.c:954 #9 0x080f4a87 in ExecutorRun (queryDesc=0x82f5df8, estate=0x82f5e20, direction=ForwardScanDirection, count=0) at execMain.c:195 #10 0x0815ffc9 in ProcessQuery (parsetree=0x82f33e8, plan=0x82f59f0, dest=Remote, completionTag=0xbfffedd0 "") at pquery.c:242 #11 0x0815e232 in pg_exec_query_string (query_string=0x82f3048, dest=Remote, parse_context=0x82ed1c8) at postgres.c:838 #12 0x0815f4da in PostgresMain (argc=4, argv=0xbffff070, username=0x82ac1a1 "joe") at postgres.c:2016 #13 0x0813e836 in DoBackend (port=0x82ac070) at postmaster.c:2293 #14 0x0813df3f in BackendStartup (port=0x82ac070) at postmaster.c:1915 #15 0x0813cd37 in ServerLoop () at postmaster.c:1000 #16 0x0813c810 in PostmasterMain (argc=1, argv=0x8293198) at postmaster.c:779 #17 0x0810e22f in main (argc=1, argv=0xbffff9d4) at main.c:210 #18 0x420158d4 in __libc_start_main () from /lib/i686/libc.so.6 (gdb) (gdb) info all-registers eax 0x0 0 ecx 0x924 2340 edx 0xc 12 ebx 0x1a 26 esp 0xbfffe8d0 0xbfffe8d0 ebp 0xbfffe918 0xbfffe918 esi 0x40012020 1073815584 edi 0xbffff9d4 -1073743404 eip 0x806b854 0x806b854 eflags 0x10283 66179 cs 0x23 35 ss 0x2b 43 ds 0x2b 43 es 0x2b 43 fs 0x0 0 gs 0x0 0 st0 0 (raw 0x00000000000000000000) st1 0 (raw 0x00000000000000000000) st2 0 (raw 0x00000000000000000000) st3 0 (raw 0x00000000000000000000) st4 1 (raw 0x3fff8000000000000000) st5 1 (raw 0x3fff8000000000000000) st6 0 (raw 0x00000000000000000000) st7 0 (raw 0x00000000000000000000) fctrl 0x37f 895 fstat 0x4020 16416 ftag 0xffff 65535 fiseg 0x0 0 fioff 0x0 0 foseg 0x0 0 fooff 0x0 0 fop 0x0 0 xmm0 {f = {0x0, 0x0, 0x0, 0x0}} {f = {0, 0, 0, 0}} xmm1 {f = {0x0, 0x0, 0x0, 0x0}} {f = {0, 0, 0, 0}} xmm2 {f = {0x0, 0x0, 0x0, 0x0}} {f = {0, 0, 0, 0}} ---Type <return> to continue, or q <return> to quit--- xmm3 {f = {0x0, 0x0, 0x0, 0x0}} {f = {0, 0, 0, 0}} xmm4 {f = {0x0, 0x0, 0x0, 0x0}} {f = {0, 0, 0, 0}} xmm5 {f = {0x0, 0x0, 0x0, 0x0}} {f = {0, 0, 0, 0}} xmm6 {f = {0x0, 0x0, 0x0, 0x0}} {f = {0, 0, 0, 0}} xmm7 {f = {0x0, 0x0, 0x0, 0x0}} {f = {0, 0, 0, 0}} mxcsr 0x1f80 8064 orig_eax 0xffffffff -1 (gdb) > joseph speigle <joe.speigle@jklh.us> writes: > > this simple update closes the connection. > > My first guess is there's some data corruption in the table or one of > its indexes. It would be useful to get a stack trace showing where > in the backend the core dump occurs --- can you manage that? > > Other things to check: can you do a "select * from calling" without > crashing? What shows up in the postmaster log when the crash occurs? > > If you just want to fix the database ASAP, you could try a REINDEX > on the table --- if the problem is in the indexes that should get > rid of it. But it would also destroy the evidence of exactly what > went wrong. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings -- joe speigle
i can't reindex the table. I dropped the index but it wont recreate it because there are duplicates in the table ERROR: Cannot create unique index. Table contains non-unique values joe speigle
joseph speigle <joe.speigle@jklh.us> writes: > i had altered the table when it was okay with the following sequence > alter table calling add column uname character varying(20); > alter table calling alter column uname set not null; > said I couldn't do that because there were null values in it, so that's when i tried the > update calling set uname='joe'; > and got errors. Hmm. That's interesting and possibly relevant, but it can't be the whole story --- people have been doing that for years. What else can you tell us about the history of this table? > (gdb) bt > #0 0x0806b854 in nocachegetattr (tuple=0x82f5b78, attnum=1, tupleDesc=0x406f23a0, isnull=0xbfffe9d3 "") at heaptuple.c:349 I'm beginning to think you must have corruption in the system catalogs, because there's no way that control should have reached that line for this table, seeing that all the columns are variable-width. Could we see the output of select * from pg_attribute where attnum > 0 and attrelid = 'calling'::regclass; regards, tom lane
okay. I"m making progress here. :) I want to know if I should just delete the oid that's too very large? because, I can issue a select up to row 1100. So, Select * from calling limit 1100 returns all rows, whereas select * from calling limit 1101 kills the process. Here is what i get for the select you asked for: calling=# select * from pg_attribute where attnum > 0 and attrelid = 'calling'::regclass; -[ RECORD 1 ]-+------------ attrelid | 577720 attname | filename atttypid | 1043 attstattarget | -1 attlen | -1 attnum | 1 attndims | 0 attcacheoff | -1 atttypmod | 44 attbyval | f attstorage | x attisset | f attalign | i attnotnull | f atthasdef | f attisdropped | f attislocal | t attinhcount | 0 -[ RECORD 2 ]-+------------ attrelid | 577720 attname | description atttypid | 1043 attstattarget | -1 attlen | -1 attnum | 2 attndims | 0 attcacheoff | -1 atttypmod | 184 attbyval | f attstorage | x attisset | f attalign | i attnotnull | f atthasdef | f attisdropped | f attislocal | t attinhcount | 0 -[ RECORD 3 ]-+------------ attrelid | 577720 attname | uname atttypid | 1043 attstattarget | -1 attlen | -1 attnum | 3 attndims | 0 attcacheoff | -1 atttypmod | 34 attbyval | f attstorage | x attisset | f attalign | i attnotnull | f atthasdef | f attisdropped | f attislocal | t attinhcount | 0 and here is also something intersting calling=# select oid from calling; snip ..... 583116 583117 583118 583119 583120 584227 2617245723 588066 588067 578306 578307 578310 588068 ...... snip ( so thinking I can try to get where > 584227...) calling=# select oid from calling where oid > 584227; oid ------------ 1342255253 588066 588067 588068 588069 588090 588091 588093 588094 596284 596285 596286 596287 snip .................................. (it's a moving target!) calling=# select oid from calling where oid > 584227; -[ RECORD 1 ]--- oid | 1006632986 -[ RECORD 2 ]--- oid | 588066 -[ RECORD 3 ]--- oid | 588067 -[ RECORD 4 ]--- calling=# select * from pg_class where relname='calling'; -[ RECORD 1 ]--+-------------------------------------------------- relname | calling relnamespace | 2200 reltype | 577721 relowner | 110 relam | 0 relfilenode | 577720 relpages | 159 reltuples | 1685 reltoastrelid | 0 reltoastidxid | 0 relhasindex | t relisshared | f relkind | r relnatts | 3 relchecks | 0 reltriggers | 0 relukeys | 0 relfkeys | 0 relrefs | 0 relhasoids | t relhaspkey | f relhasrules | f relhassubclass | f relacl | {=,joe=arwdRxt,perlcalling=arwdRxt,perl_user=arwdRxt} calling=# select count(*) from calling; -[ RECORD 1 ] count | 1820 calling=# On Sun, Feb 01, 2004 at 12:02:50PM -0500, Tom Lane wrote: > joseph speigle <joe.speigle@jklh.us> writes: > > i had altered the table when it was okay with the following sequence > > > alter table calling add column uname character varying(20); > > alter table calling alter column uname set not null; > > said I couldn't do that because there were null values in it, so that's when i tried the > > update calling set uname='joe'; > > and got errors. > > Hmm. That's interesting and possibly relevant, but it can't be the > whole story --- people have been doing that for years. What else can > you tell us about the history of this table? > > > (gdb) bt > > #0 0x0806b854 in nocachegetattr (tuple=0x82f5b78, attnum=1, tupleDesc=0x406f23a0, isnull=0xbfffe9d3 "") at heaptuple.c:349 > > I'm beginning to think you must have corruption in the system catalogs, > because there's no way that control should have reached that line for > this table, seeing that all the columns are variable-width. Could we > see the output of > select * from pg_attribute where attnum > 0 and attrelid = 'calling'::regclass; > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings -- joe speigle
joseph speigle <joe.speigle@jklh.us> writes: > okay. I"m making progress here. :) I want to know if I should just delete the oid that's too very large? > because, I can issue a select up to row 1100. So, Select * from calling limit 1100 returns all rows, whereas > select * from calling limit 1101 kills the process. Okay, so you have a corrupted tuple --- from the evidence so far I'd guess that the length word of the first column in that row is clobbered. The wacky OIDs in some of the other rows look like data corruption too. At this point my thoughts are heading in the direction of hardware problems. Have you tried running any memory or disk diagnostics? memtest86 and badblocks seem to be the most widely used tests. regards, tom lane
no, but perhaps I will. but since there's only this one corrupted table, it was trying to add a third column "uname" whichwas where it went bad. It might have been this command alter table calling add uname character varying(30) not null default 'joe'; Or, maybe I am confused and it didnt' like altertable calling alter column uname set not null; Thanks for helping out! All my commands were: \d calling; alter table calling add uname character varying(30) not null default 'joe'; alter table calling add uname character varying(30) not null; alter table calling add uname character varying(30); alter table calling uname character varying(30) set not null; alter table calling modify uname character varying(30) set not null; alter table calling alter column uname character varying(30) set not null; alter table calling alter column uname set not null; update calling set uname='joe'; \q update calling set uname='joe'; \q \dt \d calling; select * from calling limit 5; update calling set uname='joe'; \q update calling set uname='joe'; \q update calling set uname='joe'; \q select count(*) from calling; \d calling; \q \c calling; \dt On Sun, Feb 01, 2004 at 04:35:14PM -0500, Tom Lane wrote: > joseph speigle <joe.speigle@jklh.us> writes: > > okay. I"m making progress here. :) I want to know if I should just delete the oid that's too very large? > > because, I can issue a select up to row 1100. So, Select * from calling limit 1100 returns all rows, whereas > > select * from calling limit 1101 kills the process. > > Okay, so you have a corrupted tuple --- from the evidence so far I'd > guess that the length word of the first column in that row is clobbered. > The wacky OIDs in some of the other rows look like data corruption too. > > At this point my thoughts are heading in the direction of hardware > problems. Have you tried running any memory or disk diagnostics? > memtest86 and badblocks seem to be the most widely used tests. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org -- joe speigle
joseph speigle <joe.speigle@jklh.us> writes: > no, but perhaps I will. but since there's only this one corrupted > table, it was trying to add a third column "uname" which was where it > went bad. It might have been this command alter table calling add > uname character varying(30) not null default 'joe'; Seems unlikely. ALTER TABLE ADD COLUMN has been in Postgres since forever; I think it's real unlikely that the sequence of operations you showed would have misbehaved. If you'd done some weird sequence of column adds and drops, I'd be more prepared to credit a bug (since DROP COLUMN is new in 7.3) ... but even then, damage consisting of isolated corruption of different kinds in a few different rows doesn't sound like a very probable symptom. I suspect the damage was already there and the UPDATE was just the first operation that happened to touch the clobbered row. I'd look for hardware explanations if I were you. regards, tom lane