Thread: Hung SQL UPDATE on Red Hat Linux 4U5, Postgres 8.3.1
All,

I'm hoping for some help in figuring out what is going on with our Postgres
implementation. We have an application that uses Hibernate to persist objects
into a Postgres database. We've run into a semi-repeatable problem (every
other or every third test run) where we issue an UPDATE to change a single
row in a table and that update just hangs (we've let it go for more than 12
hours to see if it would eventually complete, and it did not). While the
update is hung the system has a negligible CPU load (postgres does not even
show up in a top listing).

We've dumped the locks and it shows that all locks have been granted, so it
appears that it is not a lock that is standing in our way. We've also gone in
via psql while the update is hung and were able to perform an update on the
offending table without issue. Finally, we have also enabled
statement_timeout and set it to five minutes; it does time out the hung
update and return to normal processing by rolling back the offending
transaction, but that's not a viable solution for us.

Anyone have any words of wisdom on how to track this down?

Thanks,
-Craig
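For reference, a minimal sketch of the kind of timeout setting and lock dump
described above; these exact statements are an assumption for illustration,
not the ones actually run in the test environment:

    -- Cap any single statement at five minutes (statement_timeout is
    -- specified in milliseconds in 8.3).
    SET statement_timeout = 300000;

    -- Dump the lock table while the UPDATE is hung.
    SELECT locktype, relation, transactionid, pid, mode, granted
    FROM pg_locks
    ORDER BY pid;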
Craig Vosburgh <craig.vosburgh@cassatt.com> writes:
> We've dumped the locks and it shows that all locks have been granted so
> it appears that it is not a lock that is standing in our way. We've
> also gone in via psql while the update is hung and were able to perform
> an update on the offending table without issue. Finally, we have also
> enabled the statement_timeout and set it to five minutes and it does
> timeout the hung update and return to normal processing by rolling back
> the offending transaction but that's not a viable solution for us.

> Anyone have any words of wisdom on how to track this down?

For starters, attach to the hung backend with gdb and get a stack trace ...

			regards, tom lane
>> Craig Vosburgh <craig(dot)vosburgh(at)cassatt(dot)com> writes:
>> We've dumped the locks and it shows that all locks have been granted so
>> it appears that it is not a lock that is standing in our way. We've
>> also gone in via psql while the update is hung and were able to perform
>> an update on the offending table without issue. Finally, we have also
>> enabled the statement_timeout and set it to five minutes and it does
>> timeout the hung update and return to normal processing by rolling back
>> the offending transaction but that's not a viable solution for us.
>>
>> Anyone have any words of wisdom on how to track this down?
>
> For starters, attach to the hung backend with gdb and get a stack trace ...
>
> 			regards, tom lane

Tom,

Got GDB installed on the errant node and did a back trace call (I'm guessing
that is what you were looking for when you said stack trace) on the process
that shows in the process table as executing the hung SQL command.

The backtrace is:

(gdb) bt
#0  0x0088b7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x0096f3ab in semop () from /lib/tls/libc.so.6
#2  0x081ba8f8 in PGSemaphoreLock ()
#3  0x081e4d9e in ProcSleep ()
#4  0x081e1db3 in GrantAwaitedLock ()
#5  0x0832f984 in twophase_recover_callbacks ()
#6  0x00000006 in ?? ()
#7  0x00000003 in ?? ()
#8  0x091c3770 in ?? ()
#9  0x091c3758 in ?? ()
#10 0xbfe63158 in ?? ()
#11 0xbfe634a0 in ?? ()
#12 0xbfe634b0 in ?? ()
#13 0x0832f984 in twophase_recover_callbacks ()
#14 0x08e8dc40 in ?? ()
#15 0x00000006 in ?? ()
#16 0xbff32c72 in ?? ()
#17 0x08e68758 in ?? ()
#18 0xb792a9f8 in ?? ()
#19 0xbfe631b8 in ?? ()
#20 0xbfe630d0 in ?? ()
#21 0x081e1d88 in GrantAwaitedLock ()
#22 0x081d6722 in MarkBufferDirty ()
(gdb)

Thanks for the help,
-Craig
Craig Vosburgh <craig.vosburgh@cassatt.com> writes:
> Got GDB installed on the errant node and did a back trace call (I'm guessing
> that is what you were looking for when you said stack trace) on the process
> that shows in the process table as executing the hung SQL command.

> The backtrace is:
> (gdb) bt
> #0  0x0088b7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> #1  0x0096f3ab in semop () from /lib/tls/libc.so.6
> #2  0x081ba8f8 in PGSemaphoreLock ()
> #3  0x081e4d9e in ProcSleep ()
> #4  0x081e1db3 in GrantAwaitedLock ()
> #5  0x0832f984 in twophase_recover_callbacks ()
> #6  0x00000006 in ?? ()

Hmm.  gdb is lying to you, because GrantAwaitedLock doesn't call
ProcSleep --- probably line 4 should refer to WaitOnLock instead.
You could try installing a non-symbol-stripped postgres executable
(or in RPM environments, install the matching debuginfo RPM) to
get a more trustworthy backtrace.

However, what I suspect you are looking at is just a run-of-the-mill
lock wait.  You *sure* there's no ungranted locks showing in pg_locks?

			regards, tom lane
Tom,

Yea, about that lock that I *said* I didn't have...

We had been joining through the pg_class and pg_tables tables to get some
additional data, and it turns out the row in pg_locks that shows the
ungranted lock doesn't have a relation, so it was filtered out by the join.
So, now that I have accurate data (just doing a select * from pg_locks where
granted = 'f') I indeed have an outstanding lock that isn't granted that is
causing the holdup.

However, now I have a couple of other questions. First, I thought I'd seen in
the past where Postgres will recognize a deadlock, log an error, and then
move on. Do I need to do something with the more recent code base to get that
behavior? I think the last time I saw that was back on an 8.1.X code base.

Second, any words of wisdom to help run to ground who's keeping me from
getting the lock on the offending row? The row in the lock table that shows
granted false does not show as belonging to a database or relation (both
null), so I can't join through to get the table info from pg_tables.

Thanks for all the help,
-Craig

On 5/12/08 12:16 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Craig Vosburgh <craig.vosburgh@cassatt.com> writes:
>> Got GDB installed on the errant node and did a back trace call (I'm guessing
>> that is what you were looking for when you said stack trace) on the process
>> that shows in the process table as executing the hung SQL command.
>
>> The backtrace is:
>> (gdb) bt
>> #0  0x0088b7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
>> #1  0x0096f3ab in semop () from /lib/tls/libc.so.6
>> #2  0x081ba8f8 in PGSemaphoreLock ()
>> #3  0x081e4d9e in ProcSleep ()
>> #4  0x081e1db3 in GrantAwaitedLock ()
>> #5  0x0832f984 in twophase_recover_callbacks ()
>> #6  0x00000006 in ?? ()
>
> Hmm.  gdb is lying to you, because GrantAwaitedLock doesn't call
> ProcSleep --- probably line 4 should refer to WaitOnLock instead.
> You could try installing a non-symbol-stripped postgres executable
> (or in RPM environments, install the matching debuginfo RPM) to
> get a more trustworthy backtrace.
>
> However, what I suspect you are looking at is just a run-of-the-mill
> lock wait.  You *sure* there's no ungranted locks showing in pg_locks?
>
> 			regards, tom lane
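A minimal sketch of a lock dump that won't silently drop entries whose
relation is null, assuming the original query used an inner join through
pg_class (the column list here is illustrative, not the query from the
thread):

    -- Show every lock, granted or not.  The LEFT JOIN keeps rows whose
    -- relation is NULL (e.g. transaction locks), which an inner join
    -- through pg_class would filter out.
    SELECT l.locktype, l.relation, c.relname, l.transactionid,
           l.pid, l.mode, l.granted
    FROM pg_locks l
    LEFT JOIN pg_class c ON c.oid = l.relation
    ORDER BY l.granted, l.pid;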
Craig Vosburgh <craig.vosburgh@cassatt.com> writes:
> Second, any words of wisdom to help run to ground who's keeping me from
> getting the lock on the offending row? The row in the lock table that shows
> granted false does not show as belonging to a database or relation (both
> null), so I can't join through to get the table info from pg_tables.

What's the locktype?  If (as I suspect) it's a transaction or
virtualtransaction lock, then which process holds that lock and what's
it doing?  For that matter, what is the stuck process doing?
pg_stat_activity output should be sufficient here.

			regards, tom lane
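A quick way to pull the pg_stat_activity output Tom asks for, using the
column names as they exist in 8.3 (this particular query is a sketch for
illustration, not one taken from the thread):

    -- What every backend is doing; the stuck UPDATE should show waiting = t.
    SELECT procpid, usename, waiting, xact_start, query_start, current_query
    FROM pg_stat_activity
    ORDER BY xact_start;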
>
> What's the locktype?
>

Yep, locktype is transaction.

> If (as I suspect) it's a transaction or
> virtualtransaction lock, then which process holds that lock and what's
> it doing?

As for which process owns that lock, I'm not sure how to find that out
(sorry, newbie). I can find the PID that is waiting for that lock and I can
find the table/row that appears to be waiting for the lock to perform the
action, but I can't figure out which process actually owns the lock that is
causing the issue.

> For that matter, what is the stuck process doing?
> pg_stat_activity output should be sufficient here.
>
> 			regards, tom lane

I've enabled stats tracking and the process that is hung has the following
data in the pg_stat_activity table:

datid         | 16384
datname       | collage
procpid       | 20938
usesysid      | 10
usename       | postgres
current_query | update ips set address=$1, usage=$2, subnet_rips_id=$3,
              |   hostname=$4, errored=$5, errorReason=$6, modinfo=$7,
              |   name=$8, description=$9 where bmo_id=$10
waiting       | t
xact_start    | 2008-05-11 17:25:04.484224-06
query_start   | 2008-05-11 17:25:04.528319-06
backend_start | 2008-05-11 17:02:00.016083-06
client_addr   | 127.0.0.1
client_port   | 49056

Everything else in the table is either IDLE or IDLE in transaction.

Thanks again,
-Craig

On 5/12/08 1:11 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
Craig Vosburgh <craig.vosburgh@cassatt.com> writes:
> As for which process owns that lock, I'm not sure how to find that out
> (sorry, newbie). I can find the PID that is waiting for that lock and I can
> find the table/row that appears to be waiting for the lock to perform the
> action, but I can't figure out which process actually owns the lock that is
> causing the issue.

It's whichever one actually holds that same lock according to pg_locks
(ie, there's an entry that matches on locktype and transactionid but has
granted = 't'; then join the pid of that entry against pg_stat_activity).

			regards, tom lane
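A minimal sketch of the join Tom describes, written against the 8.3 catalogs
(the aliases and the restriction to transaction locks are assumptions for
illustration):

    -- For each ungranted transaction lock, find the backend holding the
    -- matching granted lock and what it is currently running.
    SELECT waiter.pid        AS waiting_pid,
           holder.pid        AS holding_pid,
           act.usename       AS holding_user,
           act.current_query AS holding_query
    FROM pg_locks waiter
    JOIN pg_locks holder
      ON holder.locktype = waiter.locktype
     AND holder.transactionid = waiter.transactionid
     AND holder.granted
    JOIN pg_stat_activity act ON act.procpid = holder.pid
    WHERE NOT waiter.granted
      AND waiter.locktype = 'transaction';

Given that every other backend in the pg_stat_activity output above is IDLE
or IDLE in transaction, the holder is most likely one of the IDLE in
transaction sessions, i.e. a transaction that was opened but never committed
or rolled back.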