Thread: Killing "stuck" queries and preventing queries from getting "stuck"
I have a situation where there are dozens of daemons hitting the same postgres database. They all run different types of queries, but each daemon runs the same set of queries over and over again. Sometimes some queries get "stuck" in that they run for hours and hours. They never stop running. Killing the daemon does not stop the query from running. Once there are three or four of these "stuck" queries the database slows down drastically. Is there a way to tell postgres to stop any query that runs longer than a specified amount of time? Say an hour? Failing that, what is a good strategy for detecting stuck queries and killing them? Thanks.
Tim Uckun <timuckun@gmail.com> writes:
> Is there a way to tell postgres to stop any query that runs longer
> than a specified amount of time? Say an hour?

Setting statement_timeout would do that. You ought to figure out what's causing the performance problem, though, instead of just zapping things ...

			regards, tom lane
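For reference, a minimal sketch of how statement_timeout could be applied, either per session or per role; the one-hour value is just the figure from the original question, and the role name is a placeholder for whatever the daemons connect as:

    -- Cancel any statement in this session that runs longer than one hour
    SET statement_timeout = '1h';

    -- Or apply it only to the role the daemons use
    -- (daemon_user is a hypothetical name; substitute your own)
    ALTER ROLE daemon_user SET statement_timeout = '1h';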
On Tue, Sep 28, 2010 at 3:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tim Uckun <timuckun@gmail.com> writes:
>> Is there a way to tell postgres to stop any query that runs longer
>> than a specified amount of time? Say an hour?
>
> Setting statement_timeout would do that. You ought to figure out
> what's causing the performance problem, though, instead of just
> zapping things ...

Well, the query is pretty heavy but it gets run a lot. There is a distinct in there which seems to be the cause of most of the headaches, but it's going to take a while to redo the application to not use distinct.

The query gets run a lot and 99.99% of the time it runs successfully and the daemon goes on its merry way. Occasionally it seems to "get stuck", and killing the daemon does not unstick it. I have not been willing to kill -9 the process, and at this stage I can afford to restart postgres. Eventually I won't be able to do that, though, so I want to fix the app so it uses a more reasonable query, and to detect and stop stuck queries in case other problem queries sneak into the process during development.
On 28/09/10 11:25, Tim Uckun wrote:
> On Tue, Sep 28, 2010 at 3:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Tim Uckun <timuckun@gmail.com> writes:
>>> Is there a way to tell postgres to stop any query that runs longer
>>> than a specified amount of time? Say an hour?
>>
>> Setting statement_timeout would do that. You ought to figure out
>> what's causing the performance problem, though, instead of just
>> zapping things ...
>
> Well, the query is pretty heavy but it gets run a lot. There is a
> distinct in there which seems to be the cause of most of the headaches,
> but it's going to take a while to redo the application to not use
> distinct.
>
> The query gets run a lot and 99.99% of the time it runs successfully
> and the daemon goes on its merry way. Occasionally it seems to "get
> stuck", and killing the daemon does not unstick it.

Useful things to try when you have a "stuck" backend:

- attach strace to it and see if it's doing anything that involves system calls

- attach gdb to it and get a backtrace to see what it's up to. If it's using CPU, do this multiple times to see if it's in some kind of infinite loop, as you'll get a snapshot of different stacks if so. See:
  http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

- (on Linux; you didn't mention your OS): cat /proc/$pid/stack, where $pid is the process id of the stuck backend, to see what the backend process is up to in the kernel.

... then post the output of all those tests here, along with the contents of "select * from pg_stat_activity", "select * from pg_locks" and anything from the postgresql log files that looks possibly relevant.

--
Craig Ringer

Tech-related writing: http://soapyfrogs.blogspot.com/
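To illustrate the pg_stat_activity / pg_locks part of that advice, something along these lines may be useful; this sketch assumes the pre-9.2 catalog columns (procpid, current_query) that match the era of this thread:

    -- Long-running statements, oldest first
    SELECT procpid, usename, waiting, now() - query_start AS runtime, current_query
    FROM pg_stat_activity
    WHERE current_query <> '<IDLE>'
    ORDER BY query_start;

    -- Locks that have not yet been granted, joined back to the waiting query
    SELECT l.pid, l.locktype, l.mode, l.granted, a.current_query
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.procpid = l.pid
    WHERE NOT l.granted;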
On 28 Sep 2010, at 1:41, Tim Uckun wrote:
> Sometimes some queries get "stuck" in that they run for hours and
> hours. They never stop running. Killing the daemon does not stop the
> query from running.

You really should try to find out why they get "stuck". Killing stuck clients isn't going to solve your problem (aside from the fact that you probably shouldn't be using kill -9 on them, that's like using a jackhammer on a jammed door).

Some things to look into: Are those queries waiting on a lock held by another daemon maybe? Are some of them keeping transactions open for a long time without committing them (or rolling them back)?

I recall you were having another problem (with deleting records). This all smells like you either are waiting for locks on records or that the statistics used for query planning aren't reflecting the actual situation.

Have a look in pg_locks and check the query plans of some of your more problematic queries (see: explain analyse) to see what's going on there. Posting the results of those here would allow more eyes to look into your issues, in case it isn't obvious to you. There's some good documentation on these subjects too.

Alban Hertroys

--
If you can't see the forest for the trees, cut the trees and you'll see there is no forest.
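As a rough illustration of that last suggestion, checking the plan of the problematic query might look like the following; the table and columns here are purely hypothetical stand-ins for the daemon's actual DISTINCT query:

    -- EXPLAIN ANALYZE actually executes the statement, so for anything
    -- that writes, wrap it in a transaction you can roll back.
    EXPLAIN ANALYZE
    SELECT DISTINCT customer_id, status
    FROM events
    WHERE created_at > now() - interval '1 day';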
> You really should try to find out why they get "stuck". Killing stuck clients isn't going to solve your problem (aside from the fact that you probably shouldn't be using kill -9 on them, that's like using a jackhammer on a jammed door).

Well, I didn't use kill -9; I used the pg_cancel_backend command.

> Some things to look into: Are those queries waiting on a lock held by another daemon maybe? Are some of them keeping transactions open for a long time without committing them (or rolling them back)?

I'll take a look at that. It certainly would be simpler than attaching a gdb session to the pid and getting a stack trace.

> I recall you were having another problem (with deleting records). This all smells like you either are waiting for locks on records or that the statistics used for query planning aren't reflecting the actual situation.

I am having some performance issues with the database. I am also trying to clean a lot of records out of the system. Once all the records I want to delete are gone, perhaps the problem will go away. I am also looking at how the application can be refactored not to use this particular DISTINCT query.

> Have a look in pg_locks and check the query plans of some of your more problematic queries (see: explain analyse) to see what's going on there. Posting the results of those here would allow more eyes to look into your issues, in case it isn't obvious to you. There's some good documentation on these subjects too.

I did look at the EXPLAIN ANALYSE output, and basically postgres is saying the distinct is killing me. I remove that and the query is fine. I didn't look at the locks because the queries are read only, so I didn't think they would be affected by locks, but I will look at them and post them here.
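For the "detect and kill" fallback discussed earlier in the thread, a sketch of how pg_cancel_backend might be driven from pg_stat_activity (again assuming the pre-9.2 procpid/current_query column names):

    -- Cancel any query that has been running for more than an hour.
    -- pg_cancel_backend() sends a cancel request; pg_terminate_backend()
    -- (8.4 and later) kills the whole backend if cancelling isn't enough.
    SELECT procpid, pg_cancel_backend(procpid)
    FROM pg_stat_activity
    WHERE current_query <> '<IDLE>'
      AND now() - query_start > interval '1 hour';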