Thread: Huge spikes in number of connections doing "PARSE"
hi,
i have a server with pg 8.3.11. Every now and then, for no apparent reason
(io normal, cpu usage normal, web traffic normal), i see hundreds of
connections in "PARSE" state.

To give some perspective - in the lowest traffic of the day we're doing around
3000 transactions per second, with ~500-600 connections.

So, every now and then (a couple of times per day at most), I see hundreds
(800-900) of connections in "PARSE" state.

I did notice one thing.

we log the output of

    ps axo user,pid,ppid,pgrp,%cpu,%mem,rss,lstart,nice,nlwp,sgi_p,cputime,tty,wchan:25,args

every 15 seconds or so. And based on its output, I was able to get stats of
"wchan" of all PARSE pg processes when the problem was logged.
Results:

    805 x semtimedop
     10 x stext

Any ideas on what could be wrong? The machine was definitely not loaded most
of the times it happened.

The problem usually goes away in ~10-15 seconds.

Any hints? Suggestions?

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
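A rough sketch of the kind of pipeline that produces such a wchan summary (not
the original logging setup; the process-title match is an assumption):

    # summarize wchan for backends whose process title currently shows PARSE;
    # the "[P]ARSE" bracket keeps awk from matching its own command line
    ps axo wchan:25,args \
      | awk '/postgres:.*[P]ARSE/ { print $1 }' \
      | sort | uniq -c | sort -rn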
On Wed, Mar 09, 2011 at 09:38:07PM +0100, hubert depesz lubaczewski wrote:
> So, every now and then (a couple of times per day at most), I see hundreds
> (800-900) of connections in "PARSE" state.
>
> I did notice one thing.
>
> we log the output of ps axo user,pid,ppid,pgrp,%cpu,%mem,rss,lstart,nice,nlwp,sgi_p,cputime,tty,wchan:25,args
> every 15 seconds or so.
>
> And based on its output, I was able to get stats of "wchan" of all PARSE
> pg processes when the problem was logged.
> Results:
>
> 805 x semtimedop
> 10 x stext
>
> Any ideas on what could be wrong? The machine was definitely not loaded most
> of the times it happened.
>
> The problem usually goes away in ~10-15 seconds.

Would you have your monitoring process detect this condition and capture stack
traces, preferably from a gdb with access to debug information, of several of
these processes? That will probably make the specific contention point clear.
On Fri, Mar 11, 2011 at 03:03:55AM -0500, Noah Misch wrote:
> On Wed, Mar 09, 2011 at 09:38:07PM +0100, hubert depesz lubaczewski wrote:
> > So, every now and then (a couple of times per day at most), I see hundreds
> > (800-900) of connections in "PARSE" state.
> >
> > I did notice one thing.
> >
> > we log the output of ps axo user,pid,ppid,pgrp,%cpu,%mem,rss,lstart,nice,nlwp,sgi_p,cputime,tty,wchan:25,args
> > every 15 seconds or so.
> >
> > And based on its output, I was able to get stats of "wchan" of all PARSE
> > pg processes when the problem was logged.
> > Results:
> >
> > 805 x semtimedop
> > 10 x stext
> >
> > Any ideas on what could be wrong? The machine was definitely not loaded most
> > of the times it happened.
> >
> > The problem usually goes away in ~10-15 seconds.
>
> Would you have your monitoring process detect this condition and capture stack
> traces, preferably from a gdb with access to debug information, of several of
> these processes? That will probably make the specific contention point clear.

unfortunately debug was not enabled on this server, and changing binaries
would be rather complicated as it's a production environment.

i'm not a c programmer - can you tell me how to get a stack trace (assuming it
makes any sense without debug enabled) without damaging the process in any way?

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
On Fri, Mar 11, 2011 at 04:13:52PM +0100, hubert depesz lubaczewski wrote:
> On Fri, Mar 11, 2011 at 03:03:55AM -0500, Noah Misch wrote:
> > On Wed, Mar 09, 2011 at 09:38:07PM +0100, hubert depesz lubaczewski wrote:
> > > So, every now and then (a couple of times per day at most), I see hundreds
> > > (800-900) of connections in "PARSE" state.
> > >
> > > I did notice one thing.
> > >
> > > we log the output of ps axo user,pid,ppid,pgrp,%cpu,%mem,rss,lstart,nice,nlwp,sgi_p,cputime,tty,wchan:25,args
> > > every 15 seconds or so.
> > >
> > > And based on its output, I was able to get stats of "wchan" of all PARSE
> > > pg processes when the problem was logged.
> > > Results:
> > >
> > > 805 x semtimedop
> > > 10 x stext
> > >
> > > Any ideas on what could be wrong? The machine was definitely not loaded most
> > > of the times it happened.
> > >
> > > The problem usually goes away in ~10-15 seconds.
> >
> > Would you have your monitoring process detect this condition and capture stack
> > traces, preferably from a gdb with access to debug information, of several of
> > these processes? That will probably make the specific contention point clear.
>
> unfortunately debug was not enabled on this server, and changing binaries
> would be rather complicated as it's a production environment.

Understood. Not a critical problem, most likely.

> i'm not a c programmer - can you tell me how to get a stack trace (assuming it
> makes any sense without debug enabled) without damaging the process in any way?

    gdb -ex=bt /path/to/bin/postgres $pid </dev/null

I've used this on production systems to debug issues like this one, and I've
never observed damage. The exact effect of debugger attach/detach may be
OS/kernel-dependent, so it's hard to make categorical guarantees of safety.

nm
On Fri, Mar 11, 2011 at 11:13:43AM -0500, Noah Misch wrote:
> gdb -ex=bt /path/to/bin/postgres $pid </dev/null

thanks. added a cronjob that will get stack traces of 10 random backends in
parse state whenever we have more than 100 of them. It will take some time to
gather more evidence though.

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
On Fri, Mar 11, 2011 at 11:13 AM, Noah Misch <noah@leadboat.com> wrote:
> gdb -ex=bt /path/to/bin/postgres $pid </dev/null
>
> I've used this on production systems to debug issues like this one, and I've
> never observed damage. The exact effect of debugger attach/detach may be
> OS/kernel-dependent, so it's hard to make categorical guarantees of safety.

At least on Windows NT and derivatives (until I last checked a few years ago),
detaching a debugger kills the debugged process.

Regards,
--
On Fri, Mar 11, 2011 at 11:13:43AM -0500, Noah Misch wrote:
> gdb -ex=bt /path/to/bin/postgres $pid </dev/null

hi
so, let me remind what's what.

I wrote a script that, every 15 seconds, checks the system for Pg backends in
"PARSE" state. If there are more than 100 of them, the script randomly chooses
10 of them and runs "gdb -batch -quiet -ex=bt /usr/bin/postgres PID" on them.

Over the weekend I got 2125 such stack traces logged, but only 60 of them
happened when we had such huge unexpected spikes (this db server is quite
busy), with over 400 parsing backends.

These 60 were summarized, and the output is available here:
http://www.depesz.com/various/locks.summary.txt

As you can see, in 48 cases the backend process was in semop(), which relates
directly to my previous findings with ps/wchan.

The summary format is:

     11 0x00000031884d4665 in recv () from /lib64/libc.so.6
        #0  0x00000031884d4665 in recv () from /lib64/libc.so.6
        #1  0x00000000005366d7 in secure_read ()
        #2  0x000000000053d204 in ?? ()
        #3  0x000000000053d607 in pq_getbyte ()
        #4  0x00000000005afa6d in PostgresMain ()
        #5  0x00000000005857b4 in ?? ()
        #6  0x000000000058643a in PostmasterMain ()
        #7  0x000000000053efde in main ()

which means there were 11 stack traces exactly like the one shown next to the
number "11".

Any ideas based on the stack traces in the file? (The file itself is 20kB, so I
didn't want to put it in the email.)

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
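The sampling job described above amounts to roughly the following (a simplified
sketch, not the original script; the threshold, sample size, and paths are
assumptions):

    #!/bin/bash
    # simplified sketch of the sampling job described above (not the original
    # script); threshold, sample size, and paths are assumptions
    THRESHOLD=100
    SAMPLE=10
    PG_BIN=/usr/bin/postgres
    LOG=/var/log/pg-parse-backtraces.log

    # pids of backends whose process title currently shows PARSE
    pids=$(ps axo pid,args | awk '/postgres:.*[P]ARSE/ { print $1 }')
    count=$(echo "$pids" | grep -c '[0-9]')

    if [ "$count" -gt "$THRESHOLD" ]; then
        for pid in $(echo "$pids" | shuf -n "$SAMPLE"); do
            echo "=== $(date) pid $pid ==="
            gdb -batch -quiet -ex=bt "$PG_BIN" "$pid" </dev/null
        done >> "$LOG" 2>&1
    fi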
On Mon, Mar 14, 2011 at 07:49:46PM +0100, hubert depesz lubaczewski wrote:
> These 60 were summarized, and the output is available here:
> http://www.depesz.com/various/locks.summary.txt
>
> As you can see, in 48 cases the backend process was in semop(), which relates
> directly to my previous findings with ps/wchan.

It's unfortunate you don't have debug symbols enabled, which makes these
traces somewhat unreliable. So you get odd things like index_open calling
index_close.

The common factor seems to be lots of index locks. Do you have very many
indexes?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                                    - Charles de Gaulle
On Mon, Mar 14, 2011 at 10:21:27PM +0100, Martijn van Oosterhout wrote:
> The common factor seems to be lots of index locks. Do you have very
> many indexes?

$ select count(*) from pg_class where relkind = 'i';
 count
-------
   450
(1 row)

$ select count(*) from pg_class where relkind = 'r';
 count
-------
   201
(1 row)

not sure if that's many.

--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
On Mon, Mar 14, 2011 at 07:49:46PM +0100, hubert depesz lubaczewski wrote:
> I wrote a script that, every 15 seconds, checks the system for Pg backends in
> "PARSE" state. If there are more than 100 of them, the script randomly chooses
> 10 of them and runs "gdb -batch -quiet -ex=bt /usr/bin/postgres PID" on them.
>
> Over the weekend I got 2125 such stack traces logged, but only 60 of them
> happened when we had such huge unexpected spikes (this db server is quite
> busy), with over 400 parsing backends.
>
> These 60 were summarized, and the output is available here:
> http://www.depesz.com/various/locks.summary.txt

Thanks; that's exactly what I wanted.

> Any ideas based on the stack traces in the file? (The file itself is 20kB, so I
> didn't want to put it in the email.)

The common theme is contention over updating the shared lock table, with the
lockers targeting system catalogs. Such requests normally get satisfied from
backend-local caches, greatly reducing the amount of lock activity. That isn't
happening, indicating that either your backend lifespan is very short or your
caches are getting reset. More likely the latter.

What is a typical lifespan for a backend in this system? What sort of
connection pooling are you using, if any? Do you create temporary objects or
perform other DDL regularly?

In PostgreSQL 8.3, when any one backend-local cache fell too far out of date,
all backends would take a full cache reset. That seems like the most credible
explanation for your symptoms. PostgreSQL 8.4 brought major changes to this
area (commit fad153ec45299bd4d4f29dec8d9e04e2f1c08148), so upgrading might
eliminate the spikes you're seeing.

Regardless of PostgreSQL version, constraining the number of concurrent
backends at the connection pool level will make these events less frequent and
costly. You'll almost always be better off running 200 backends and letting
200 additional queries wait than running 400 backends.

nm
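For illustration, that kind of cap is usually set on the pgbouncer side roughly
like this (values are purely illustrative, not the configuration discussed in
this thread; with ~40 per-webserver pgbouncers, each one's share of the total
has to be small):

    ; illustrative sketch only - not the actual configuration
    [pgbouncer]
    pool_mode = session
    max_client_conn = 1000   ; clients queue at the pooler
    default_pool_size = 5    ; at most this many server connections per user/db pair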
On Mon, Mar 14, 2011 at 10:21:27PM +0100, Martijn van Oosterhout wrote:
> On Mon, Mar 14, 2011 at 07:49:46PM +0100, hubert depesz lubaczewski wrote:
> > These 60 were summarized, and the output is available here:
> > http://www.depesz.com/various/locks.summary.txt
> >
> > As you can see, in 48 cases the backend process was in semop(), which relates
> > directly to my previous findings with ps/wchan.
>
> It's unfortunate you don't have debug symbols enabled, which makes
> these traces somewhat unreliable. So you get odd things like index_open
> calling index_close.

That trace is legitimate. If index_open()'s call into the relcache needs to
consult the catalogs, it will in turn scan pg_class_oid_index.
On Tue, Mar 15, 2011 at 12:13:21AM -0400, Noah Misch wrote:
> What is a typical lifespan for a backend in this system? What sort of
> connection pooling are you using, if any?

quite long, but:

we have n (~40, i think) web servers. each webserver has its own pgbouncer
(in session pooling).

application connects to pgbouncer, issues (usually) a single query, and
disconnects.

pgbouncers are set to close older connections, if unused, after (i think)
2 minutes. (i don't actually have access there.)

generally, through the day we see from 400 to 800 connections, mostly idle,
but sometimes it goes much higher (like 1400), and then the connections are
mostly parsing.

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
hubert depesz lubaczewski <depesz@depesz.com> writes:
> On Tue, Mar 15, 2011 at 12:13:21AM -0400, Noah Misch wrote:
>> What is a typical lifespan for a backend in this system? What sort of
>> connection pooling are you using, if any?

> quite long, but:

> we have n (~40, i think) web servers. each webserver has its own
> pgbouncer (in session pooling).

> application connects to pgbouncer, issues (usually) a single query, and
> disconnects.

> pgbouncers are set to close older connections, if unused, after (i think)
> 2 minutes. (i don't actually have access there.)

> generally, through the day we see from 400 to 800 connections, mostly
> idle, but sometimes it goes much higher (like 1400), and then the
> connections are mostly parsing.

If the backends are "mostly idle" then it'd be a good idea to set a smaller
maximum limit on their number. Idle backends can be a performance risk pre-8.4
because of the cache reset problem Noah pointed to. You'd be better off having
pgbouncer delay incoming queries until there's a free backend to pass them to.

			regards, tom lane
On Tue, Mar 15, 2011 at 10:20:38AM -0400, Tom Lane wrote:
> hubert depesz lubaczewski <depesz@depesz.com> writes:
> > On Tue, Mar 15, 2011 at 12:13:21AM -0400, Noah Misch wrote:
> >> What is a typical lifespan for a backend in this system? What sort of
> >> connection pooling are you using, if any?
>
> > quite long, but:
>
> > we have n (~40, i think) web servers. each webserver has its own
> > pgbouncer (in session pooling).
>
> > application connects to pgbouncer, issues (usually) a single query, and
> > disconnects.
>
> > pgbouncers are set to close older connections, if unused, after (i think)
> > 2 minutes. (i don't actually have access there.)
>
> > generally, through the day we see from 400 to 800 connections, mostly
> > idle, but sometimes it goes much higher (like 1400), and then the
> > connections are mostly parsing.
>
> If the backends are "mostly idle" then it'd be a good idea to set a
> smaller maximum limit on their number. Idle backends can be a
> performance risk pre-8.4 because of the cache reset problem Noah pointed
> to. You'd be better off having pgbouncer delay incoming queries until
> there's a free backend to pass them to.

while I understand it, and will pass this information to the client, one thing
bugs me - why does the problem happen only sometimes, and with a relatively low
count of backends?

we saw the database running normally with 1200 connections, yet sometimes it
has 500 connections, and then something happens and they all end up stuck in
this parse mode.

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
hubert depesz lubaczewski <depesz@depesz.com> writes:
> while I understand it, and will pass this information to the client, one thing
> bugs me - why does the problem happen only sometimes, and with a relatively low
> count of backends?

> we saw the database running normally with 1200 connections, yet sometimes it
> has 500 connections, and then something happens and they all end up stuck in
> this parse mode.

Well, if they're all busy, you're not going to get this cache reset problem.
It's idle backends that have fallen behind that create the problem.

			regards, tom lane
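One quick way to see how many backends are sitting idle at a given moment
(a sketch, not something from the thread; it assumes the 8.3-style
pg_stat_activity, where idle sessions report current_query = '<IDLE>'):

    -- count idle vs. active backends (8.3 reports '<IDLE>' in current_query)
    SELECT CASE WHEN current_query = '<IDLE>' THEN 'idle' ELSE 'active' END AS state,
           count(*)
      FROM pg_stat_activity
     GROUP BY 1;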
On Tue, Mar 15, 2011 at 10:58:52AM -0400, Tom Lane wrote:
> hubert depesz lubaczewski <depesz@depesz.com> writes:
> > while I understand it, and will pass this information to the client, one thing
> > bugs me - why does the problem happen only sometimes, and with a relatively low
> > count of backends?
> > we saw the database running normally with 1200 connections, yet sometimes it
> > has 500 connections, and then something happens and they all end up stuck in
> > this parse mode.
>
> Well, if they're all busy, you're not going to get this cache reset problem.
> It's idle backends that have fallen behind that create the problem.

thanks.

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
hubert depesz lubaczewski <depesz@depesz.com> writes:
> we have n (~40, i think) web servers. each webserver has its own
> pgbouncer (in session pooling).

In some cases I have found it useful to have those webservers' pgbouncers
connect to another pgbouncer on the database host. But if your problem is tied
to real active connections, I don't see what it would solve. Still, if you're
searching for ideas…

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
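A sketch of what that layered setup might look like in each web server's
pgbouncer.ini - the hostnames, port, and database name below are assumptions,
not the actual setup:

    ; on each web server: point the local pgbouncer at a pgbouncer on the db host
    [databases]
    appdb = host=db-host.example.com port=6543 dbname=appdb

    ; ...and on the database host, that second pgbouncer forwards to postgres:
    ; appdb = host=127.0.0.1 port=5432 dbname=appdb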
On Sat, Apr 30, 2011 at 08:55:09PM +0200, Dimitri Fontaine wrote:
> hubert depesz lubaczewski <depesz@depesz.com> writes:
> > we have n (~40, i think) web servers. each webserver has its own
> > pgbouncer (in session pooling).
>
> In some cases I have found it useful to have those webservers' pgbouncers
> connect to another pgbouncer on the database host. But if your problem is tied
> to real active connections, I don't see what it would solve. Still, if you're
> searching for ideas…

thanks. we'll probably look into it, but the situation has kind of stalled now.

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/