Thread: Autovacuum seems to block database: WARNING worker took too long to start
Autovacuum seems to block database: WARNING worker took too long to start
From
Pablo Delgado Díaz-Pache
Date:
Hi all,
I'm having a strange problem with postgres & autovacuum
Everything is working fine until I start getting the following errors ... and postgres stops working shortly after (it stops accepting connections)
2010-11-13 12:34:08.599 CET|1|||7104||4cde77b0.1bc0|2010-11-13 12:34:08 CET|1/44303|0|| LOG: automatic vacuum of table "mrs.pg_catalog.pg_statistic": index scans: 1
pages: 0 removed, 189 remain
tuples: 132 removed, 4587 remain
system usage: CPU 0.00s/0.00u sec elapsed 0.04 sec
2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-13 13:26:41.254 CET|4|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
.........
infinite errors like this ...
.........
This situation obviously creates serious problems for us!! We have to restart the database to make it work again.
Some info about our system
OS: Centos 5.5
Kernel: 2.6.18-194.26.1.el5
Postgres version: 8.4.5 (installation out-of-the-box using yum)
Server memory: 28GB
Autovacuum configuration
autovacuum = on
vacuum_cost_delay = 10
log_autovacuum_min_duration = 0
autovacuum_naptime = 60
autovacuum_vacuum_threshold = 50
autovacuum_vacuum_cost_delay = 20ms
Shared memory value
shared_buffers = 917504
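(For reference, shared_buffers is counted in 8 kB pages, so 917504 comes out to roughly 7 GB. An untested sketch to double-check what the running server actually uses, via the pg_settings view:
# psql -U postgres -c "SELECT name, setting, unit FROM pg_settings WHERE name = 'shared_buffers' OR name LIKE 'autovacuum%'"
The unit column shows what each setting's number is measured in.)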
I've read that the problem seems to be autovacuum using all resources and leaving the system unusable (not allowing it to accept more connections), but I haven't found a way to fix it.
Does upgrading to postgres 9.0 help?
Any help would be appreciated!
Thanks
Pablo
Re: Autovacuum seems to block database: WARNING worker took too long to start
From
Tom Lane
Date:
Pablo Delgado Díaz-Pache <delgadop@gmail.com> writes:
> I'm having a strange problem with postgres & autovacuum
> Everything is working fine until I start getting the following errors ...
> and postgres stops working shortly after (it stops accepting connections)
> 2010-11-13 12:34:08.599 CET|1|||7104||4cde77b0.1bc0|2010-11-13 12:34:08 CET|1/44303|0|| LOG: automatic vacuum of table "mrs.pg_catalog.pg_statistic": index scans: 1
> pages: 0 removed, 189 remain
> tuples: 132 removed, 4587 remain
> system usage: CPU 0.00s/0.00u sec elapsed 0.04 sec
> 2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
> 2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
> 2010-11-13 13:26:41.254 CET|4|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled

Hm. The code comment above that warning says

 * The only problems that may cause this code to
 * fire are errors in the earlier sections of AutoVacWorkerMain,
 * before the worker removes the WorkerInfo from the
 * startingWorker pointer.

but it's hard to see what problem there could lead to an issue. (In particular, I discount the idea that AutovacuumLock could be stuck, because we had to acquire it in order to issue this message.)

But it strikes me that the code comment is wrong in one significant way: if the postmaster were failing to heed SIGUSR1 at all, you could reach the timeout here, because the fork-failed signal wouldn't get sent.

Given that you say it also stops accepting connections, I'm thinking this is a postmaster problem, not an autovacuum problem. But you've not provided any information about that end of it. Exactly what happens when you try to make a connection? Are there any entries at all in the postmaster log? What about the kernel log? Are you sure that new connections stop working *after* this happens, and not at exactly the same time?

> OS: Centos 5.5
> Kernel: 2.6.18-194.26.1.el5
> Postgres version: 8.4.5 (installation out-of-the-box using yum)

Given that this is a Linux system, I think that an OOM kill on the postmaster is a not-insignificant possibility. Or at least I would think that if there weren't a PostmasterIsAlive check in the autovac launcher loop. It's real hard to see how you could get more than one of these messages if the postmaster were gone entirely.

Could you try strace'ing the postmaster process to see what it's doing when this is happening?

regards, tom lane
Re: Autovacuum seems to block database: WARNING worker took too long to start
From
Pablo Delgado Díaz-Pache
Date:
Hi Tom, thanks for your detailed responses. I really appreciate it.
My comments below inline ...
> > I'm having a strange problem with postgres & autovacuum
> > Everything is working fine until I start getting the following errors ...
> > and postgres stops working shortly after (it stops accepting connections)
> > 2010-11-13 12:34:08.599 CET|1|||7104||4cde77b0.1bc0|2010-11-13 12:34:08 CET|1/44303|0|| LOG: automatic vacuum of table "mrs.pg_catalog.pg_statistic": index scans: 1
> > pages: 0 removed, 189 remain
> > tuples: 132 removed, 4587 remain
> > system usage: CPU 0.00s/0.00u sec elapsed 0.04 sec
> > 2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
> > 2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
> > 2010-11-13 13:26:41.254 CET|4|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
>
> Hm. The code comment above that warning says
>
>  * The only problems that may cause this code to
>  * fire are errors in the earlier sections of AutoVacWorkerMain,
>  * before the worker removes the WorkerInfo from the
>  * startingWorker pointer.
>
> but it's hard to see what problem there could lead to an issue. (In
> particular, I discount the idea that AutovacuumLock could be stuck,
> because we had to acquire it in order to issue this message.)
>
> But it strikes me that the code comment is wrong in one significant way:
> if the postmaster were failing to heed SIGUSR1 at all, you could reach
> the timeout here, because the fork-failed signal wouldn't get sent.
>
> Given that you say it also stops accepting connections, I'm thinking
> this is a postmaster problem, not an autovacuum problem. But you've
> not provided any information about that end of it. Exactly what happens
> when you try to make a connection? Are there any entries at all in the
> postmaster log? What about the kernel log? Are you sure that new
> connections stop working *after* this happens, and not at exactly the
> same time?
When I try to make a connection from the command line I get nothing. I mean, if I execute ...
# psql -U postgres -l
I have to Ctrl-C or Ctrl-Z it (followed by kill %1) because otherwise I could wait forever.
It's as if no more connections were accepted. Obviously our applications stop working immediately.
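(A sketch that keeps such a check from hanging forever, assuming libpq's PGCONNECT_TIMEOUT variable; it bounds connection establishment, which is exactly the phase that hangs here:
# PGCONNECT_TIMEOUT=5 psql -U postgres -l
With that set, psql should give up after about 5 seconds instead of blocking until it is killed.)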
Trying to stop postgres with
# /etc/init.d/postgres stop
Stopping postgresql service:
I have to stop it with Ctrl-C as it does not work (I waited up to 3 min. with no luck). As a consequence I had to kill the postmaster process manually.
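(Before resorting to a manual kill, pg_ctl's harder shutdown modes may be worth a try, though if the postmaster is ignoring signals they can hang too; a sketch assuming the data directory shown later in the thread:
# pg_ctl -D /var/lib/pgsql/data -m fast stop
# pg_ctl -D /var/lib/pgsql/data -m immediate stop
"fast" disconnects sessions and shuts down cleanly; "immediate" aborts without a clean shutdown, so the next start runs crash recovery.)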
Kernel log says nothing :(
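(To check Tom's OOM-killer theory explicitly, something like this should turn up any kills, since CentOS copies kernel messages to /var/log/messages:
# dmesg | grep -i -E 'out of memory|oom|killed process'
# grep -i -E 'out of memory|oom-killer' /var/log/messages
No matches would argue against an OOM kill of the postmaster or its children.)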
Not sure what started failing first: the connections or the autovacuum. I just went to the log and saw that sequence of events. Searching on Google I found a few references saying autovacuum and that error log "WARNING: worker took too long to start; cancelled" were related.
> > OS: Centos 5.5
> > Kernel: 2.6.18-194.26.1.el5
> > Postgres version: 8.4.5 (installation out-of-the-box using yum)
>
> Given that this is a Linux system, I think that an OOM kill on the
> postmaster is a not-insignificant possibility. Or at least I would
> think that if there weren't a PostmasterIsAlive check in the autovac
> launcher loop. It's real hard to see how you could get more than
> one of these messages if the postmaster were gone entirely.
I just noticed the error repeats once every minute, but the content is always the same ...
First time the log appeared
2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
It repeats once every minute until we fix it
2010-11-14 09:44:01.889 CET|1208|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-14 09:45:03.105 CET|1209|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
All 1,000+ lines are alike except for the date (every new line adds 1 more minute)
However, the rest of the line is always the same
2010-11-14 09:45:03.105 CET|1209|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
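(A quick, untested way to confirm the count, assuming the default pg_log location for this install; the path is a guess:
# grep 'worker took too long to start' /var/lib/pgsql/data/pg_log/*.log | wc -l
)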
According to our postgresql.conf log format ...
log_line_prefix = '%m|%l|%d|%r|%p|%i|%c|%s|%v|%x|%u| '
# %u = user name
# %d = database name
# %r = remote host and port
# %h = remote host
# %p = process ID
# %t = timestamp without milliseconds
# %m = timestamp with milliseconds
# %i = command tag
# %c = session ID
# %l = session line number
# %s = session start timestamp
# %v = virtual transaction ID
# %x = transaction ID (0 if none)
# %q = stop here in non-session processes
Splitting the log ...
(%m = timestamp with milliseconds) = 2010-11-14 09:45:03.105 CET
(%l = session line number) = 1209
(%d = database name) =
(%r = remote host and port) =
(%p = process ID) = 3300
(%i = command tag) =
(%c = session ID) = 4cdc2ae6.ce4
(%s = session start timestamp) = 2010-11-11 18:41:58 CET
(%v = virtual transaction ID) =
(%x = transaction ID (0 if none)) = 0
(%u = user name) =
Message = WARNING: worker took too long to start; cancelled
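(Splitting the prefix by hand gets tedious; a small sketch that does it for every line of a log file, with the message ending up in field 12 because of the trailing '| ' in the prefix. The file name is just a placeholder:
# awk -F'|' '{ print "pid=" $5, "session=" $7, "msg=" $12 }' postgresql.log
)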
I see the following fields are the same in the 1,200+ lines of the log
(%d = database name) =
(%r = remote host and port) =
(%c = session ID)
(%s = session start timestamp)
(%v = virtual transaction ID)
(%x = transaction ID (0 if none))
(%u = user name)
Message
Does this info help?
What does "empty database" mean? could be the autovacuum process?
I noticed the "%s" (matches the time we started postgres. Here you have the log
2010-11-11 18:41:58 CET)
2010-11-11 18:41:58.840 CET|1|||3268||4cdc2ae6.cc4|2010-11-11 18:41:58 CET||0|| LOG: database system was shut down at 2010-11-11 18:38:55 CET
2010-11-11 18:41:58.902 CET|1|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| LOG: autovacuum launcher started
2010-11-11 18:41:58.902 CET|1|||3234||4cdc2ae4.ca2|2010-11-11 18:41:56 CET||0|| LOG: database system is ready to accept connections
> Could you try strace'ing the postmaster process to see what it's doing
> when this is happening?
I definitely will.
More information in case it helps. Strangely, we had the same problem in a temporary database we used last week while migrating some data. This time there was no autovacuum entry in the log ...
2010-11-10 20:26:06.441 CET|1|cache|10.19.0.48(20840)|11239|SELECT|4cdaf1cc.2be7|2010-11-10 20:26:04 CET|114/0|0|postgres| LOG: duration: 1770.500 ms statement: select id_type,id,id_language_fk,data from cache where id_type=5 and id = '60820269' and id_language_fk = 2
2010-11-10 20:26:07.100 CET|1|xxx|10.19.0.43(14080)|9423|SELECT|4cdaef43.24cf|2010-11-10 20:15:15 CET|80/0|0|postgres| LOG: duration: 1178.423 ms statement: SELECT "reviews".id FROM "reviews" WHERE ("reviews"."booking_id" = E'10101690551') LIMIT 1
2010-11-10 20:27:26.905 CET|3|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled
2010-11-10 20:28:27.422 CET|4|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled
2010-11-10 20:28:41.641 CET|1|mrs|10.19.0.51(20106)|11272|SELECT|4cdaf1d0.2c08|2010-11-10 20:26:08 CET|6/701|0|postgres| LOG: duration: 1878.724 ms execute <unnamed>: select tp.*,t.id_token,uo.* from tokenprovider as tp inner join token as t on (tp.id_tokenprovider = t.id_tokenprovider_fk) inner join tkuo as tu on (t.id_token = tu.id_token_fk) inner join useoccupation as uo on (uo.id_useoccupation = tu.id_useoccupation_fk and uo.id_tokenprovider_fk = tp.id_tokenprovider) where tp.id_tokenprovider in (10203746,10369604,10718768,10001514,10170665,10091612,10829359,10797243,50013934,12536008,90529952,10064394,10769848,50013983,10012115,23951091,25688092,60712161,10455933,10766831,62781261,10560488,10808512,80751062,38735328,50013938,10481737,50013710,10916713,10556915,10999695,75417634,22313851,55471672,88737419,46563417,69789399,20418193,37363697,55232743,10328868,10171688,77959221,10380595,28839589,10439708,10165273,10872146,50356324,15842616,10597119,10943311,43894782,61323020) order by tp.id_tokenprovider,t.id_token,uo.sort
2010-11-10 20:29:28.349 CET|5|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled
2010-11-10 20:30:01.940 CET|1|xmas|10.19.0.43(14078)|9422|idle|4cdaef43.24ce|2010-11-10 20:15:15 CET|70/0|0|postgres| LOG: unexpected EOF on client connection
2010-11-10 20:30:01.942 CET|1|xxxx|10.19.0.43(2900)|5793|idle|4cdaebb6.16a1|2010-11-10 20:00:06 CET|60/0|0|postgres| LOG: unexpected EOF on client connection
2010-11-10 20:30:01.948 CET|1|mrs|10.19.0.43(14081)|9424|idle|4cdaef43.24d0|2010-11-10 20:15:15 CET|89/0|0|postgres| LOG: unexpected EOF on client connection
2010-11-10 20:30:01.950 CET|3|xxxx|10.19.0.43(14429)|9840|idle|4cdaefc8.2670|2010-11-10 20:17:28 CET|72/0|0|postgres| LOG: unexpected EOF on client connection
2010-11-10 20:30:01.951 CET|1|xxxx|10.19.0.43(60284)|10509|idle|4cdaf0ba.290d|2010-11-10 20:21:30 CET|83/0|0|postgres| LOG: unexpected EOF on client connection
2010-11-10 20:30:01.961 CET|1|xxxx|10.19.0.43(49864)|10986|idle|4cdaf19f.2aea|2010-11-10 20:25:19 CET|57/0|0|postgres| LOG: unexpected EOF on client connection
2010-11-10 20:30:01.969 CET|1|xxxx|10.19.0.43(50039)|11282|idle|4cdaf1d9.2c12|2010-11-10 20:26:17 CET|65/0|0|postgres| LOG: unexpected EOF on client connection
2010-11-10 20:30:28.875 CET|6|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled
2010-11-10 20:31:29.210 CET|7|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled
2010-11-10 20:32:29.642 CET|8|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled
2010-11-10 20:33:30.271 CET|9|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled
2010-11-10 20:34:30.604 CET|10|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled
2010-11-10 20:35:31.106 CET|11|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled
Well, thanks for any ideas. I can provide more logs/information if necessary.
Pablo
Re: Autovacuum seems to block database: WARNING worker took too long to start
From
Alvaro Herrera
Date:
Excerpts from Pablo Delgado Díaz-Pache's message of lun nov 15 04:52:53 -0300 2010:
> > But it strikes me that the code comment is wrong in one significant way:
> > if the postmaster were failing to heed SIGUSR1 at all, you could reach
> > the timeout here, because the fork-failed signal wouldn't get sent.

Hmm, yeah, I guess I wasn't considering the idea that postmaster was so busy with something else that it wouldn't be able to respond to the launcher's requests. As you say, if it went away entirely, autovacuum would also suicide.

> > Could you try strace'ing the postmaster process to see what it's doing
> > when this is happening?
>
> I definitely will.

Yes, please.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: Autovacuum seems to block database: WARNING worker took too long to start
From
Pablo Delgado Díaz-Pache
Date:
Not sure what you mean.
Once we start getting that error in the postgres log, there is no autovacuum entry. Only that error message once every minute (and as I wrote in my last email, the same message all the time).
The question is what is causing postgres to stop working. Autovacuum or another reason? If it's another reason ... what could it be?
thanks in advance for your help
Re: Autovacuum seems to block database: WARNING worker took too long to start
From
Alvaro Herrera
Date:
Excerpts from Pablo Delgado Díaz-Pache's message of mar nov 16 12:18:09 -0300 2010:
> Not sure what you mean.
>
> Once we start getting that error in the postgres log, there is no autovacuum
> entry. Only that error message once every minute (and as i wrote in my last
> email, the same message all the time).

Yeah. This means that the autovacuum launcher daemon is still running and trying to schedule new jobs, but it's not able to for some reason. When it tries to start a new one, it sees the old one still waiting to start running; and then it logs this message.

To start a new job, what it does is send a signal to postmaster and wait for it to create a child process. This is the part that's not working.

> The question is what is causing postgres to stop working. Autovacuum or
> another reason? If it's another reason ... what could it be?

I have no idea what's going on. I have never seen this problem. If you see it happen again, please get a stacktrace from postmaster; and/or strace it.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
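(For the stacktrace request: assuming gdb is installed and the data directory from earlier in the thread, a non-interactive grab could look like this, since the first line of postmaster.pid holds the postmaster's PID:
# gdb -batch -ex bt -p "$(head -1 /var/lib/pgsql/data/postmaster.pid)"
-batch makes gdb print the backtrace, detach and exit on its own.)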
Re: Autovacuum seems to block database: WARNING worker took too long to start
From
Alvaro Herrera
Date:
Excerpts from Pablo Delgado Díaz-Pache's message of jue nov 18 08:57:16 -0300 2010:
> Well, we had the problem again. This is what we did
>
> 1) A "ps" to check what processes were running. Many "postmaster" processes
> in defunct state. A few postgres connections still working. A few "psql" (by
> shell script) queries hanged (they are scripts to monitor state of database.
> We use big-sister to do so).
>
> The normal state of the server is 1 postmaster pid and many "postgres"
> connections in "idle" state.
>
> This the result when we were having the problem. I've excluded irrelevant
> processes.
>
> postgres 892 3889 0 09:07 ? 00:01:05 postgres: postgres international 10.19.0.51(49264) idle
> postgres 934 3889 0 12:00 ? 00:00:04 [postmaster] <defunct>
> postgres 935 3889 0 12:00 ? 00:00:04 [postmaster] <defunct>

Hmm, these are processes that were forked from postmaster (PID 3889) and died before they could become something useful -- but postmaster hasn't reclaimed them yet.

Can you see what process 3889 is doing? strace it or get a stacktrace with GDB.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
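(A sketch for keeping an eye on how many unreaped children accumulate under the postmaster, using its PID 3889 from the listing above; in ps -ef output column 3 is the parent PID:
# ps -ef | awk '$3 == 3889 && /<defunct>/ { n++ } END { print n+0, "defunct children of 3889" }'
A steadily growing number would confirm the postmaster has stopped calling wait() on its dead children.)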
Re: Autovacuum seems to block database: WARNING worker took too long to start
From
Alvaro Herrera
Date:
Excerpts from Pablo Delgado Díaz-Pache's message of jue nov 18 08:57:16 -0300 2010:
> 2) We did a strace to the postmaster pid. However we had 2 postmasters not
> dead
>
> # ps -fea |grep -i postmaster
> postgres 3889 1 0 Nov16 ? 00:01:24 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
> postgres 7601 3889 0 12:37 ? 00:00:00 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
>
> As soon as we did a "strace" to the 3889 pid everything started to work
> again.

Sorry for my previous response -- evidently I failed to scroll down enough to notice this part. It seems to me that this process was stuck in an unnatural way.

> Not sure it was a coincidence but that was the way it was.
>
> # strace -p 3889
> Process 3889 attached - interrupt to quit
> select(6, [3 4 5], NULL, NULL, {56, 930000}) = ? ERESTARTNOHAND (To be restarted)
> --- SIGUSR1 (User defined signal 1) @ 0 (0) ---
> rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0

This seems normal postmaster activity: receiving SIGUSR1, then SIGCHLD, and doing stuff accordingly. Rather than a coincidence, I would think that the act of tracing it made it come back to life. A kernel bug maybe? Have you upgraded your kernel or libc lately?

> I also straced the other postmaster pid
>
> # strace -p 7601
> Process 7601 attached - interrupt to quit
> recvfrom(8, "P\0\0\0\221\0select id_key from transla"..., 8192, 0, NULL, NULL) = 181

This one seems like a regular postmaster child that hadn't gotten around to changing its ps status yet. (Note it had PPID 3889, which is consistent with this idea.)

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
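(If it recurs, a trace captured with timestamps and children followed would show whether the postmaster was asleep or spinning; these are standard strace flags:
# strace -tt -f -o /tmp/postmaster.trace -p 3889
-tt prefixes each syscall with a microsecond timestamp, -f follows processes created by clone(), and -o writes everything to a file for later inspection.)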
Re: Autovacuum seems to block database: WARNING worker took too long to start
From
Pablo Delgado Díaz-Pache
Date:
Just noticed that the mysql database we had on the same server (we use it just for wordpress) was also stuck and had to be restarted. Checking "netstat -an" I noticed the following ...
# netstat -an | grep 3306
tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTEN
tcp 0 0 10.19.0.47:3306 10.19.0.46:33691 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33745 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33710 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33743 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33751 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33690 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33758 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33725 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33739 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33730 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33744 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33688 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33741 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33757 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33712 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33734 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33719 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33738 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33682 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33702 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33701 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33742 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33708 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33703 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33731 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33723 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33695 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33724 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33728 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33689 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33729 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33694 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33720 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33754 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33732 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33705 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33748 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33749 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33733 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33753 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33755 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33726 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33727 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33737 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33693 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33700 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33740 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33747 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33716 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33709 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33697 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33717 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33736 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33715 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33704 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33711 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33746 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33718 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33692 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33707 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33696 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33750 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33681 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33699 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33756 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33714 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33706 SYN_RECV
tcp 1 0 10.19.0.47:3306 10.19.0.46:49437 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49436 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49439 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49438 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49433 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49432 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49435 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49434 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49429 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49428 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49431 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49430 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49427 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49453 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49452 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49455 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49454 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49449 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49448 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49451 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49450 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49445 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49444 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49447 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49446 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49441 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49440 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49443 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49442 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49457 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49456 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30397 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30396 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30399 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30398 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30393 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30392 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30395 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30394 TIME_WAIT
tcp 100 0 10.19.0.47:3306 10.19.0.46:30391 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30409 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30408 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30411 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30410 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30405 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30404 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30407 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30406 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30401 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30400 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30403 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30402 TIME_WAIT
I didn't do a netstat -an for port 5432, but I guess all of postgres's max_connections slots were busy. Just a guess ...
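(A sketch to tally the states next time instead of guessing; in netstat's TCP lines column 4 is the local address and column 6 the state:
# netstat -an | awk '$4 ~ /:5432$/ { s[$6]++ } END { for (k in s) print s[k], k }'
A pile-up in SYN_RECV on 5432, like the one above on 3306, would suggest the postmaster has stopped accepting; exhausted max_connections would instead show established connections plus "sorry, too many clients already" errors rather than a silent hang.)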
All connections are made from internal servers, and there are no firewall rules on that interface. Just in case, here is the iptables -L -n -v output
# iptables -L -n -v
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 170M packets, 392G bytes)
 pkts bytes target prot opt in out source destination
Chain mirai (1 references)
 pkts bytes target prot opt in out source destination
   30  1800 ACCEPT all -- eth0 * x.x.x.x 0.0.0.0/0
1023K   61M ACCEPT all -- eth1 * 10.0.0.0/8 0.0.0.0/0
Re: Autovacuum seems to block database: WARNING worker took too long to start
From
Pablo Delgado Díaz-Pache
Date:
Well, we had the problem again. This is what we did:
1) A "ps" to check what processes were running. Many "postmaster" processes in defunct state. A few postgres connections still working. A few "psql" (run by shell script) queries hung (they are scripts that monitor the state of the database; we use big-sister to do so).
The normal state of the server is 1 postmaster pid and many "postgres" connections in "idle" state.
This is the result when we were having the problem. I've excluded irrelevant processes.
postgres 892 3889 0 09:07 ? 00:01:05 postgres: postgres international 10.19.0.51(49264) idle
postgres 934 3889 0 12:00 ? 00:00:04 [postmaster] <defunct>
postgres 935 3889 0 12:00 ? 00:00:04 [postmaster] <defunct>
postgres 936 3889 0 12:00 ? 00:00:00 [postmaster] <defunct>
postgres 938 3889 0 12:00 ? 00:00:01 [postmaster] <defunct>
postgres 1244 3889 0 12:01 ? 00:00:01 [postmaster] <defunct>
postgres 1453 3889 0 12:02 ? 00:00:02 [postmaster] <defunct>
postgres 1563 3889 0 12:03 ? 00:00:10 [postmaster] <defunct>
postgres 1862 3889 0 12:04 ? 00:00:04 [postmaster] <defunct>
postgres 2275 3889 0 12:06 ? 00:00:02 [postmaster] <defunct>
root 2287 49 0 Oct29 ? 00:00:00 [kondemand/0]
root 2288 49 0 Oct29 ? 00:00:00 [kondemand/1]
postgres 2485 3889 0 12:07 ? 00:00:05 postgres: postgres international 10.19.0.51(59066) idle
postgres 2490 3889 1 12:07 ? 00:00:26 postgres: postgres international 10.19.0.51(59070) idle
postgres 2568 3889 0 12:07 ? 00:00:00 postgres: postgres international 10.19.0.50(1249) idle
postgres 3544 3889 0 12:12 ? 00:00:07 [postmaster] <defunct>
postgres 3875 3889 0 12:14 ? 00:00:03 [postmaster] <defunct>
postgres 3889 1 0 Nov16 ? 00:01:24 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres 3896 3889 0 Nov16 ? 00:00:00 postgres: logger process
postgres 3913 3889 0 Nov16 ? 00:00:05 postgres: writer process
postgres 3914 3889 0 Nov16 ? 00:00:00 postgres: wal writer process
postgres 3915 3889 0 Nov16 ? 00:00:01 postgres: autovacuum launcher process
postgres 3916 3889 0 Nov16 ? 00:01:42 postgres: stats collector process
postgres 4169 3889 0 12:15 ? 00:00:02 [postmaster] <defunct>
postgres 4170 3889 0 12:15 ? 00:00:04 [postmaster] <defunct>
postgres 4171 3889 0 12:15 ? 00:00:07 [postmaster] <defunct>
postgres 4281 3889 0 12:15 ? 00:00:04 [postmaster] <defunct>
postgres 4672 3889 0 12:17 ? 00:00:01 [postmaster] <defunct>
postgres 5038 3889 0 12:19 ? 00:00:01 [postmaster] <defunct>
postgres 5203 3889 0 12:20 ? 00:00:00 [postmaster] <defunct>
postgres 5484 3889 0 12:21 ? 00:00:01 [postmaster] <defunct>
postgres 5694 3889 0 12:22 ? 00:00:01 postgres: postgres international 10.19.0.49(29113) idle
postgres 5769 3889 0 12:23 ? 00:00:04 [postmaster] <defunct>
postgres 5832 3889 0 12:23 ? 00:00:03 postgres: postgres international 10.19.0.49(29129) idle
postgres 5835 3889 0 12:23 ? 00:00:02 [postmaster] <defunct>
postgres 5979 3889 0 12:24 ? 00:00:01 [postmaster] <defunct>
postgres 6243 3889 0 12:25 ? 00:00:00 [postmaster] <defunct>
postgres 6424 3889 0 12:26 ? 00:00:01 postgres: postgres international 10.19.0.49(45160) idle
postgres 6425 3889 0 12:26 ? 00:00:01 [postmaster] <defunct>
postgres 6429 3889 0 12:26 ? 00:00:01 [postmaster] <defunct>
postgres 6599 3889 0 12:27 ? 00:00:02 [postmaster] <defunct>
postgres 6635 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6653 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6654 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6655 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6663 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6664 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6665 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6666 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6667 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6668 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6669 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6670 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6671 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6672 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6673 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6773 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6774 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6778 3889 0 12:28 ? 00:00:01 [postmaster] <defunct>
postgres 6779 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6785 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6787 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6788 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6796 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6797 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6837 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6838 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6839 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6840 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6841 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6842 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6843 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6844 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6852 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6859 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6860 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6862 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6872 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6873 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6874 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6910 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 6970 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 6971 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 6972 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 6973 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 6978 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 6990 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 6991 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 6992 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 6993 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7000 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7088 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7089 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7090 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7091 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7094 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7095 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7096 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7122 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7126 3889 0 12:29 ? 00:00:03 postgres: postgres mrs 10.19.0.51(7717) idle
postgres 7127 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
root 7152 7150 0 12:30 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7157 7155 0 12:30 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7168 7166 0 12:31 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7174 7172 0 12:31 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7188 7186 0 12:32 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7196 7194 0 12:32 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7258 7256 0 12:34 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7263 7261 0 12:34 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7308 7306 0 12:35 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7313 7311 0 12:35 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7325 7323 0 12:36 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7330 7328 0 12:36 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7371 7369 0 12:37 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
postgres 22959 3889 0 11:09 ? 00:00:01 postgres: postgres spain 10.19.0.45(57042) idle
postgres 22961 3889 0 11:09 ? 00:00:22 postgres: postgres mirai 10.19.0.45(57043) idle
postgres 23225 3889 0 11:10 ? 00:00:00 postgres: postgres xmas 10.19.0.45(57061) idle
postgres 25844 3889 0 11:23 ? 00:00:01 postgres: postgres spain 10.19.0.51(11291) idle
postgres 31942 3889 0 11:53 ? 00:00:00 [postmaster] <defunct>
postgres 31973 3889 0 11:53 ? 00:00:16 [postmaster] <defunct>
2) We did a strace on the postmaster pid. However, we had 2 postmaster processes that were not defunct
# ps -fea |grep -i postmaster
postgres 3889 1 0 Nov16 ? 00:01:24 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres 7601 3889 0 12:37 ? 00:00:00 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
As soon as we ran "strace" on pid 3889, everything started to work again. Not sure whether it was a coincidence, but that is the way it was.
# strace -p 3889
Process 3889 attached - interrupt to quit
select(6, [3 4 5], NULL, NULL, {56, 930000}) = ? ERESTARTNOHAND (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b4f6591d440) = 7766
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigreturn(0x2) = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 7766
wait4(-1, 0x7fffc37259a4, WNOHANG, NULL) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigreturn(0x2) = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b4f6591d440) = 7767
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigreturn(0x2) = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
I also straced the other postmaster pid
# strace -p 7601
Process 7601 attached - interrupt to quit
recvfrom(8, "P\0\0\0\221\0select id_key from transla"..., 8192, 0, NULL, NULL) = 181
brk(0x14ac4000) = 0x14ac4000
lseek(3, 0, SEEK_END) = 16130048
lseek(5, 0, SEEK_END) = 4661248
lseek(3, 0, SEEK_END) = 16130048
brk(0x14ae9000) = 0x14ae9000
brk(0x14adb000) = 0x14adb000
brk(0x14ad7000) = 0x14ad7000
sendto(7, "\2\0\0\0\230\1\0\0H\203\314\0\4\0\0\0\5\0\0\0\0\0\0\0X\203\314\0\0\0\0\0"..., 408, 0, NULL, 0) = 408
sendto(8, "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 74, 0, NULL, 0) = 74
recvfrom(8, "P\0\0\0\213\0select id_key from transla"..., 8192, 0, NULL, NULL) = 175
brk(0x14ac4000) = 0x14ac4000
lseek(3, 0, SEEK_END) = 16130048
lseek(5, 0, SEEK_END) = 4661248
lseek(3, 0, SEEK_END) = 16130048
brk(0x14ae9000) = 0x14ae9000
brk(0x14adb000) = 0x14adb000
brk(0x14ad7000) = 0x14ad7000
sendto(8, "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 73, 0, NULL, 0) = 73
recvfrom(8, <unfinished ...>
Process 7601 detached
3) Checked the logs. We could see the same error lines we had before, but they disappeared after just a few minutes.
2010-11-18 12:30:02.830 CET|1|mirai|10.19.0.43(4452)|3875|idle|4ce50a7c.f23|2010-11-18 12:14:04 CET|67/0|0|postgres| LOG: unexpected EOF on client connection
2010-11-18 12:30:02.841 CET|1|mirai|10.19.0.43(13943)|2275|idle|4ce508b4.8e3|2010-11-18 12:06:28 CET|75/0|0|postgres| LOG: unexpected EOF on client connection
2010-11-18 12:30:47.628 CET|2|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-18 12:31:48.056 CET|3|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-18 12:32:48.184 CET|4|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-18 12:33:48.614 CET|5|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-18 12:34:48.747 CET|6|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-18 12:35:49.277 CET|7|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-18 12:36:49.706 CET|8|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-18 12:37:30.462 CET|1|spain|10.19.0.48(11276)|7414|idle|4ce50ffa.1cf6|2010-11-18 12:37:30 CET|2/0|0|spainadmin| LOG: unexpected EOF on client connection
Any Ideas?
Thanks for your help and feedback!
Pablo