Thread: Autovacuum seems to block database: WARNING worker took too long to start
Autovacuum seems to block database: WARNING worker took too long to start
From
Pablo Delgado Díaz-Pache
Date:
Hi all,
I'm having a strange problem with postgres & autovacuum
Everything is working fine until I start getting the following errors ... and postgres stops working shortly after (it stops accepting connections)
2010-11-13 12:34:08.599 CET|1|||7104||4cde77b0.1bc0|2010-11-13 12:34:08 CET|1/44303|0|| LOG: automatic vacuum of table "mrs.pg_catalog.pg_statistic": index scans: 1
pages: 0 removed, 189 remain
tuples: 132 removed, 4587 remain
system usage: CPU 0.00s/0.00u sec elapsed 0.04 sec
2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-13 13:26:41.254 CET|4|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
.........
infinite errors like this ...
.........
This situation obviously creates serious problems for us!! We have to restart the database to make it work again.
Some info about our system
OS: Centos 5.5
Kernel: 2.6.18-194.26.1.el5
Postgres version: 8.4.5 (installation out-of-the-box using yum)
Server memory: 28GB
Autovacuum configuration
autovacuum = on
vacuum_cost_delay = 10
log_autovacuum_min_duration = 0
autovacuum_naptime = 60
autovacuum_vacuum_threshold = 50
autovacuum_vacuum_cost_delay = 20ms
Shared memory value
shared_buffers = 917504
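(For reference, shared_buffers is counted in 8 kB pages, so 917504 comes out to roughly 7 GB. An untested sketch to double-check what the running server actually uses, via the pg_settings view:
# psql -U postgres -c "SELECT name, setting, unit FROM pg_settings WHERE name = 'shared_buffers' OR name LIKE 'autovacuum%'"
The unit column shows what each setting's number is measured in.)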
I've read that the problem seems to be autovacuum using all resources and leaving the system unusable (not allowing it to accept more connections), but I haven't found a way to fix it.
Does upgrading to postgres 9.0 help?
Any help would be appreciated!
Thanks
Pablo
Re: Autovacuum seems to block database: WARNING worker took too long to start
From
Tom Lane
Date:
Pablo Delgado Díaz-Pache <delgadop@gmail.com> writes:
> I'm having a strange problem with postgres & autovacuum
> Everything is working fine until I start getting the following errors ...
> and postgres stops working shortly after (it stops accepting connections)
> 2010-11-13 12:34:08.599 CET|1|||7104||4cde77b0.1bc0|2010-11-13 12:34:08 CET|1/44303|0|| LOG: automatic vacuum of table "mrs.pg_catalog.pg_statistic": index scans: 1
> pages: 0 removed, 189 remain
> tuples: 132 removed, 4587 remain
> system usage: CPU 0.00s/0.00u sec elapsed 0.04 sec
> 2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
> 2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
> 2010-11-13 13:26:41.254 CET|4|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled

Hm. The code comment above that warning says

 * The only problems that may cause this code to
 * fire are errors in the earlier sections of AutoVacWorkerMain,
 * before the worker removes the WorkerInfo from the
 * startingWorker pointer.

but it's hard to see what problem there could lead to an issue. (In particular, I discount the idea that AutovacuumLock could be stuck, because we had to acquire it in order to issue this message.)

But it strikes me that the code comment is wrong in one significant way: if the postmaster were failing to heed SIGUSR1 at all, you could reach the timeout here, because the fork-failed signal wouldn't get sent.

Given that you say it also stops accepting connections, I'm thinking this is a postmaster problem, not an autovacuum problem. But you've not provided any information about that end of it. Exactly what happens when you try to make a connection? Are there any entries at all in the postmaster log? What about the kernel log? Are you sure that new connections stop working *after* this happens, and not at exactly the same time?

> OS: Centos 5.5
> Kernel: 2.6.18-194.26.1.el5
> Postgres version: 8.4.5 (installation out-of-the-box using yum)

Given that this is a Linux system, I think that an OOM kill on the postmaster is a not-insignificant possibility. Or at least I would think that if there weren't a PostmasterIsAlive check in the autovac launcher loop. It's real hard to see how you could get more than one of these messages if the postmaster were gone entirely.

Could you try strace'ing the postmaster process to see what it's doing when this is happening?

regards, tom lane
Re: Autovacuum seems to block database: WARNING worker took too long to start
From
Pablo Delgado Díaz-Pache
Date:
Hi Tom, thanks for your detailed responses. I really appreciate it.
My comments below inline ...
> > I'm having a strange problem with postgres & autovacuum
> > Everything is working fine until I start getting the following errors ...
> > and postgres stops working shortly after (it stops accepting connections)
> > 2010-11-13 12:34:08.599 CET|1|||7104||4cde77b0.1bc0|2010-11-13 12:34:08 CET|1/44303|0|| LOG: automatic vacuum of table "mrs.pg_catalog.pg_statistic": index scans: 1
> > pages: 0 removed, 189 remain
> > tuples: 132 removed, 4587 remain
> > system usage: CPU 0.00s/0.00u sec elapsed 0.04 sec
> > 2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
> > 2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
> > 2010-11-13 13:26:41.254 CET|4|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
>
> Hm. The code comment above that warning says
>
>  * The only problems that may cause this code to
>  * fire are errors in the earlier sections of AutoVacWorkerMain,
>  * before the worker removes the WorkerInfo from the
>  * startingWorker pointer.
>
> but it's hard to see what problem there could lead to an issue. (In
> particular, I discount the idea that AutovacuumLock could be stuck,
> because we had to acquire it in order to issue this message.)
>
> But it strikes me that the code comment is wrong in one significant way:
> if the postmaster were failing to heed SIGUSR1 at all, you could reach
> the timeout here, because the fork-failed signal wouldn't get sent.
>
> Given that you say it also stops accepting connections, I'm thinking
> this is a postmaster problem, not an autovacuum problem. But you've
> not provided any information about that end of it. Exactly what happens
> when you try to make a connection? Are there any entries at all in the
> postmaster log? What about the kernel log? Are you sure that new
> connections stop working *after* this happens, and not at exactly the
> same time?
When I try to make a connection from the command line I get nothing. I mean, if I execute ...
# psql -U postgres -l
I have to Ctrl-C or Ctrl-Z it (followed by kill %1) because otherwise I could wait forever.
It's as if no more connections were accepted. Obviously our applications stop working immediately.
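(A sketch that keeps such a check from hanging forever, assuming libpq's PGCONNECT_TIMEOUT variable; it bounds connection establishment, which is exactly the phase that hangs here:
# PGCONNECT_TIMEOUT=5 psql -U postgres -l
With that set, psql should give up after about 5 seconds instead of blocking until it is killed.)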
Trying to stop postgres with
# /etc/init.d/postgres stop
Stopping postgresql service:
I have to stop it with Ctrl-C as it does not work (I waited up to 3 min. with no luck). As a consequence I had to kill the postmaster process manually.
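(Before resorting to a manual kill, pg_ctl's harder shutdown modes may be worth a try, though if the postmaster is ignoring signals they can hang too; a sketch assuming the data directory shown later in the thread:
# pg_ctl -D /var/lib/pgsql/data -m fast stop
# pg_ctl -D /var/lib/pgsql/data -m immediate stop
"fast" disconnects sessions and shuts down cleanly; "immediate" aborts without a clean shutdown, so the next start runs crash recovery.)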
Kernel log says nothing :(
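(To check Tom's OOM-killer theory explicitly, something like this should turn up any kills, since CentOS copies kernel messages to /var/log/messages:
# dmesg | grep -i -E 'out of memory|oom|killed process'
# grep -i -E 'out of memory|oom-killer' /var/log/messages
No matches would argue against an OOM kill of the postmaster or its children.)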
Not sure what started failing first: the connections or the autovacuum. I just went to the log and saw that sequence of events. Searching on Google I found a few references saying autovacuum and that error log "WARNING: worker took too long to start; cancelled" were related.
> > OS: Centos 5.5
> > Kernel: 2.6.18-194.26.1.el5
> > Postgres version: 8.4.5 (installation out-of-the-box using yum)
>
> Given that this is a Linux system, I think that an OOM kill on the
> postmaster is a not-insignificant possibility. Or at least I would
> think that if there weren't a PostmasterIsAlive check in the autovac
> launcher loop. It's real hard to see how you could get more than
> one of these messages if the postmaster were gone entirely.
I just noticed the error repeats once every minute, but the content is always the same ...
First time the log appeared
2010-11-13 13:24:40.998 CET|2|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-13 13:25:41.126 CET|3|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
It repeats once every minute until we fix it
2010-11-14 09:44:01.889 CET|1208|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-14 09:45:03.105 CET|1209|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
All 1,000+ lines are alike except for the date (every new line adds 1 more minute)
However, the rest of the line is always the same
2010-11-14 09:45:03.105 CET|1209|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| WARNING: worker took too long to start; cancelled
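(A quick, untested way to confirm the count, assuming the default pg_log location for this install; the path is a guess:
# grep 'worker took too long to start' /var/lib/pgsql/data/pg_log/*.log | wc -l
)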
According to our postgresql.conf log format ...
log_line_prefix = '%m|%l|%d|%r|%p|%i|%c|%s|%v|%x|%u| '
# %u = user name
# %d = database name
# %r = remote host and port
# %h = remote host
# %p = process ID
# %t = timestamp without milliseconds
# %m = timestamp with milliseconds
# %i = command tag
# %c = session ID
# %l = session line number
# %s = session start timestamp
# %v = virtual transaction ID
# %x = transaction ID (0 if none)
# %q = stop here in non-session processes
Splitting the log ...
(%m = timestamp with milliseconds) = 2010-11-14 09:45:03.105 CET
(%l = session line number) = 1209
(%d = database name) =
(%r = remote host and port) =
(%p = process ID) = 3300
(%i = command tag) =
(%c = session ID) = 4cdc2ae6.ce4
(%s = session start timestamp) = 2010-11-11 18:41:58 CET
(%v = virtual transaction ID) =
(%x = transaction ID (0 if none)) = 0
(%u = user name) =
Message = WARNING: worker took too long to start; cancelled
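(Splitting the prefix by hand gets tedious; a small sketch that does it for every line of a log file, with the message ending up in field 12 because of the trailing '| ' in the prefix. The file name is just a placeholder:
# awk -F'|' '{ print "pid=" $5, "session=" $7, "msg=" $12 }' postgresql.log
)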
I see the following fields are the same in the 1,200+ lines of the log
(%d = database name) =
(%r = remote host and port) =
(%c = session ID)
(%s = session start timestamp)
(%v = virtual transaction ID)
(%x = transaction ID (0 if none))
(%u = user name)
Message
Does this info help?
What does "empty database" mean? could be the autovacuum process?
I noticed the "%s" (matches the time we started postgres. Here you have the log
2010-11-11 18:41:58 CET)
2010-11-11 18:41:58.840 CET|1|||3268||4cdc2ae6.cc4|2010-11-11 18:41:58 CET||0|| LOG: database system was shut down at 2010-11-11 18:38:55 CET
2010-11-11 18:41:58.902 CET|1|||3300||4cdc2ae6.ce4|2010-11-11 18:41:58 CET||0|| LOG: autovacuum launcher started
2010-11-11 18:41:58.902 CET|1|||3234||4cdc2ae4.ca2|2010-11-11 18:41:56 CET||0|| LOG: database system is ready to accept connections
> Could you try strace'ing the postmaster process to see what it's doing
> when this is happening?
I definitely will.
More information in case it helps. Strangely, we had the same problem in a temporary database we used last week while migrating some data. This time there was no autovacuum entry in the log ...
2010-11-10 20:26:06.441 CET|1|cache|10.19.0.48(20840)|11239|SELECT|4cdaf1cc.2be7|2010-11-10 20:26:04 CET|114/0|0|postgres| LOG: duration: 1770.500 ms statement: select id_type,id,id_language_fk,data from cache where id_type=5 and id = '60820269' and id_language_fk = 2
2010-11-10 20:26:07.100 CET|1|xxx|10.19.0.43(14080)|9423|SELECT|4cdaef43.24cf|2010-11-10 20:15:15 CET|80/0|0|postgres| LOG: duration: 1178.423 ms statement: SELECT "reviews".id FROM "reviews" WHERE ("reviews"."booking_id" = E'10101690551') LIMIT 1
2010-11-10 20:27:26.905 CET|3|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled
2010-11-10 20:28:27.422 CET|4|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled
2010-11-10 20:28:41.641 CET|1|mrs|10.19.0.51(20106)|11272|SELECT|4cdaf1d0.2c08|2010-11-10 20:26:08 CET|6/701|0|postgres| LOG: duration: 1878.724 ms execute <unnamed>: select tp.*,t.id_token,uo.* from tokenprovider as tp inner join token as t on (tp.id_tokenprovider = t.id_tokenprovider_fk) inner join tkuo as tu on (t.id_token = tu.id_token_fk) inner join useoccupation as uo on (uo.id_useoccupation = tu.id_useoccupation_fk and uo.id_tokenprovider_fk = tp.id_tokenprovider) where tp.id_tokenprovider in (10203746,10369604,10718768,10001514,10170665,10091612,10829359,10797243,50013934,12536008,90529952,10064394,10769848,50013983,10012115,23951091,25688092,60712161,10455933,10766831,62781261,10560488,10808512,80751062,38735328,50013938,10481737,50013710,10916713,10556915,10999695,75417634,22313851,55471672,88737419,46563417,69789399,20418193,37363697,55232743,10328868,10171688,77959221,10380595,28839589,10439708,10165273,10872146,50356324,15842616,10597119,10943311,43894782,61323020) order by tp.id_tokenprovider,t.id_token,uo.sort
2010-11-10 20:29:28.349 CET|5|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled
2010-11-10 20:30:01.940 CET|1|xmas|10.19.0.43(14078)|9422|idle|4cdaef43.24ce|2010-11-10 20:15:15 CET|70/0|0|postgres| LOG: unexpected EOF on client connection
2010-11-10 20:30:01.942 CET|1|xxxx|10.19.0.43(2900)|5793|idle|4cdaebb6.16a1|2010-11-10 20:00:06 CET|60/0|0|postgres| LOG: unexpected EOF on client connection
2010-11-10 20:30:01.948 CET|1|mrs|10.19.0.43(14081)|9424|idle|4cdaef43.24d0|2010-11-10 20:15:15 CET|89/0|0|postgres| LOG: unexpected EOF on client connection
2010-11-10 20:30:01.950 CET|3|xxxx|10.19.0.43(14429)|9840|idle|4cdaefc8.2670|2010-11-10 20:17:28 CET|72/0|0|postgres| LOG: unexpected EOF on client connection
2010-11-10 20:30:01.951 CET|1|xxxx|10.19.0.43(60284)|10509|idle|4cdaf0ba.290d|2010-11-10 20:21:30 CET|83/0|0|postgres| LOG: unexpected EOF on client connection
2010-11-10 20:30:01.961 CET|1|xxxx|10.19.0.43(49864)|10986|idle|4cdaf19f.2aea|2010-11-10 20:25:19 CET|57/0|0|postgres| LOG: unexpected EOF on client connection
2010-11-10 20:30:01.969 CET|1|xxxx|10.19.0.43(50039)|11282|idle|4cdaf1d9.2c12|2010-11-10 20:26:17 CET|65/0|0|postgres| LOG: unexpected EOF on client connection
2010-11-10 20:30:28.875 CET|6|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled
2010-11-10 20:31:29.210 CET|7|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled
2010-11-10 20:32:29.642 CET|8|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled
2010-11-10 20:33:30.271 CET|9|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled
2010-11-10 20:34:30.604 CET|10|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled
2010-11-10 20:35:31.106 CET|11|||22218||4cca5977.56ca|2010-10-29 07:19:51 CEST||0|| WARNING: worker took too long to start; cancelled
Well, thanks for any ideas. I can provide more logs/information if necessary.
Pablo
Re: Autovacuum seems to block database: WARNING worker took too long to start
From
Alvaro Herrera
Date:
Excerpts from Pablo Delgado Díaz-Pache's message of lun nov 15 04:52:53 -0300 2010:
> > But it strikes me that the code comment is wrong in one significant way:
> > if the postmaster were failing to heed SIGUSR1 at all, you could reach
> > the timeout here, because the fork-failed signal wouldn't get sent.

Hmm, yeah, I guess I wasn't considering the idea that postmaster was so busy with something else that it wouldn't be able to respond to the launcher's requests. As you say, if it went away entirely, autovacuum would also suicide.

> > Could you try strace'ing the postmaster process to see what it's doing
> > when this is happening?
>
> I definitely will.

Yes, please.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: Autovacuum seems to block database: WARNING worker took too long to start
From
Pablo Delgado Díaz-Pache
Date:
Not sure what you mean.
Once we start getting that error in the postgres log, there is no autovacuum entry. Only that error message once every minute (and as I wrote in my last email, the same message all the time).
The question is what is causing postgres to stop working. Autovacuum or another reason? If it's another reason ... what could it be?
thanks in advance for your help
Re: Autovacuum seems to block database: WARNING worker took too long to start
From
Alvaro Herrera
Date:
Excerpts from Pablo Delgado Díaz-Pache's message of mar nov 16 12:18:09 -0300 2010:
> Not sure what you mean.
>
> Once we start getting that error in the postgres log, there is no autovacuum
> entry. Only that error message once every minute (and as i wrote in my last
> email, the same message all the time).

Yeah. This means that the autovacuum launcher daemon is still running and trying to schedule new jobs, but it's not able to for some reason. When it tries to start a new one, it sees the old one still waiting to start running; and then it logs this message.

To start a new job, what it does is send a signal to postmaster and wait for it to create a child process. This is the part that's not working.

> The question is what is causing postgres to stop working. Autovacuum or
> another reason? If it's another reason ... what could it be?

I have no idea what's going on. I have never seen this problem. If you see it happen again, please get a stacktrace from postmaster; and/or strace it.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
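(For the stacktrace request: assuming gdb is installed and the data directory from earlier in the thread, a non-interactive grab could look like this, since the first line of postmaster.pid holds the postmaster's PID:
# gdb -batch -ex bt -p "$(head -1 /var/lib/pgsql/data/postmaster.pid)"
-batch makes gdb print the backtrace, detach and exit on its own.)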
Re: Autovacuum seems to block database: WARNING worker took too long to start
From
Alvaro Herrera
Date:
Excerpts from Pablo Delgado Díaz-Pache's message of jue nov 18 08:57:16 -0300 2010:
> Well, we had the problem again. This is what we did
>
> 1) A "ps" to check what processes were running. Many "postmaster" processes
> in defunct state. A few postgres connections still working. A few "psql" (by
> shell script) queries hanged (they are scripts to monitor state of database.
> We use big-sister to do so).
>
> The normal state of the server is 1 postmaster pid and many "postgres"
> connections in "idle" state.
>
> This the result when we were having the problem. I've excluded irrelevant
> processes.
>
> postgres 892 3889 0 09:07 ? 00:01:05 postgres: postgres international 10.19.0.51(49264) idle
> postgres 934 3889 0 12:00 ? 00:00:04 [postmaster] <defunct>
> postgres 935 3889 0 12:00 ? 00:00:04 [postmaster] <defunct>

Hmm, these are processes that were forked from postmaster (PID 3889) and died before they could become something useful -- but postmaster hasn't reclaimed them yet.

Can you see what process 3889 is doing? strace it or get a stacktrace with GDB.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
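(A sketch for keeping an eye on how many unreaped children accumulate under the postmaster, using its PID 3889 from the listing above; in ps -ef output column 3 is the parent PID:
# ps -ef | awk '$3 == 3889 && /<defunct>/ { n++ } END { print n+0, "defunct children of 3889" }'
A steadily growing number would confirm the postmaster has stopped calling wait() on its dead children.)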
Re: Autovacuum seems to block database: WARNING worker took too long to start
From
Alvaro Herrera
Date:
Excerpts from Pablo Delgado Díaz-Pache's message of jue nov 18 08:57:16 -0300 2010:
> 2) We did a strace to the postmaster pid. However we had 2 postmasters not
> dead
>
> # ps -fea |grep -i postmaster
> postgres 3889 1 0 Nov16 ? 00:01:24 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
> postgres 7601 3889 0 12:37 ? 00:00:00 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
>
> As soon as we did a "strace" to the 3889 pid everything started to work
> again.

Sorry for my previous response -- evidently I failed to scroll down enough to notice this part. It seems to me that this process was stuck in an unnatural way.

> Not sure it was a coincidence but that was the way it was.
>
> # strace -p 3889
> Process 3889 attached - interrupt to quit
> select(6, [3 4 5], NULL, NULL, {56, 930000}) = ? ERESTARTNOHAND (To be restarted)
> --- SIGUSR1 (User defined signal 1) @ 0 (0) ---
> rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0

This seems normal postmaster activity: receiving SIGUSR1, then SIGCHLD, and doing stuff accordingly. Rather than a coincidence, I would think that the act of tracing it made it come back to life. A kernel bug maybe? Have you upgraded your kernel or libc lately?

> I also straced the other postmaster pid
>
> # strace -p 7601
> Process 7601 attached - interrupt to quit
> recvfrom(8, "P\0\0\0\221\0select id_key from transla"..., 8192, 0, NULL, NULL) = 181

This one seems like a regular postmaster child that hadn't gotten around to changing its ps status yet. (Note it had PPID 3889, which is consistent with this idea.)

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
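(If it recurs, a trace captured with timestamps and children followed would show whether the postmaster was asleep or spinning; these are standard strace flags:
# strace -tt -f -o /tmp/postmaster.trace -p 3889
-tt prefixes each syscall with a microsecond timestamp, -f follows processes created by clone(), and -o writes everything to a file for later inspection.)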
Re: Autovacuum seems to block database: WARNING worker took too long to start
From
Pablo Delgado Díaz-Pache
Date:
Just noticed that the mysql database we had on the same server (we use it just for wordpress) was also stuck and had to be restarted. Checking "netstat -an" I noticed the following ...
# netstat -an | grep 3306
tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTEN
tcp 0 0 10.19.0.47:3306 10.19.0.46:33691 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33745 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33710 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33743 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33751 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33690 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33758 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33725 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33739 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33730 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33744 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33688 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33741 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33757 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33712 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33734 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33719 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33738 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33682 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33702 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33701 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33742 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33708 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33703 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33731 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33723 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33695 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33724 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33728 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33689 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33729 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33694 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33720 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33754 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33732 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33705 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33748 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33749 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33733 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33753 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33755 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33726 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33727 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33737 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33693 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33700 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33740 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33747 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33716 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33709 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33697 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33717 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33736 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33715 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33704 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33711 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33746 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33718 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33692 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33707 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33696 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33750 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33681 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33699 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33756 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33714 SYN_RECV
tcp 0 0 10.19.0.47:3306 10.19.0.46:33706 SYN_RECV
tcp 1 0 10.19.0.47:3306 10.19.0.46:49437 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49436 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49439 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49438 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49433 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49432 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49435 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49434 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49429 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49428 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49431 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49430 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49427 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49453 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49452 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49455 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49454 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49449 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49448 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49451 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49450 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49445 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49444 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49447 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49446 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49441 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49440 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49443 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49442 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49457 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:49456 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30397 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30396 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30399 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30398 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30393 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30392 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30395 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30394 TIME_WAIT
tcp 100 0 10.19.0.47:3306 10.19.0.46:30391 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30409 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30408 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30411 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30410 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30405 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30404 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30407 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30406 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30401 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30400 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30403 TIME_WAIT
tcp 1 0 10.19.0.47:3306 10.19.0.46:30402 TIME_WAIT
I didn't do a netstat -an for port 5432, but I guess all of postgres's max_connections slots were busy. Just a guess ...
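(A sketch to tally the states next time instead of guessing; in netstat's TCP lines column 4 is the local address and column 6 the state:
# netstat -an | awk '$4 ~ /:5432$/ { s[$6]++ } END { for (k in s) print s[k], k }'
A pile-up in SYN_RECV on 5432, like the one above on 3306, would suggest the postmaster has stopped accepting; exhausted max_connections would instead show established connections plus "sorry, too many clients already" errors rather than a silent hang.)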
All connections are made from internal servers, and there are no firewall rules on that interface. Just in case, here is the iptables -L -n -v output
# iptables -L -n -v
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 170M packets, 392G bytes)
 pkts bytes target prot opt in out source destination
Chain mirai (1 references)
 pkts bytes target prot opt in out source destination
   30  1800 ACCEPT all -- eth0 * x.x.x.x 0.0.0.0/0
1023K   61M ACCEPT all -- eth1 * 10.0.0.0/8 0.0.0.0/0
Re: Autovacuum seems to block database: WARNING worker took too long to start
From
Pablo Delgado Díaz-Pache
Date:
Well, we had the problem again. This is what we did:
1) A "ps" to check what processes were running. Many "postmaster" processes in defunct state. A few postgres connections still working. A few "psql" (run by shell script) queries hung (they are scripts that monitor the state of the database; we use big-sister to do so).
The normal state of the server is 1 postmaster pid and many "postgres" connections in "idle" state.
This is the result when we were having the problem. I've excluded irrelevant processes.
postgres 892 3889 0 09:07 ? 00:01:05 postgres: postgres international 10.19.0.51(49264) idle
postgres 934 3889 0 12:00 ? 00:00:04 [postmaster] <defunct>
postgres 935 3889 0 12:00 ? 00:00:04 [postmaster] <defunct>
postgres 936 3889 0 12:00 ? 00:00:00 [postmaster] <defunct>
postgres 938 3889 0 12:00 ? 00:00:01 [postmaster] <defunct>
postgres 1244 3889 0 12:01 ? 00:00:01 [postmaster] <defunct>
postgres 1453 3889 0 12:02 ? 00:00:02 [postmaster] <defunct>
postgres 1563 3889 0 12:03 ? 00:00:10 [postmaster] <defunct>
postgres 1862 3889 0 12:04 ? 00:00:04 [postmaster] <defunct>
postgres 2275 3889 0 12:06 ? 00:00:02 [postmaster] <defunct>
root 2287 49 0 Oct29 ? 00:00:00 [kondemand/0]
root 2288 49 0 Oct29 ? 00:00:00 [kondemand/1]
postgres 2485 3889 0 12:07 ? 00:00:05 postgres: postgres international 10.19.0.51(59066) idle
postgres 2490 3889 1 12:07 ? 00:00:26 postgres: postgres international 10.19.0.51(59070) idle
postgres 2568 3889 0 12:07 ? 00:00:00 postgres: postgres international 10.19.0.50(1249) idle
postgres 3544 3889 0 12:12 ? 00:00:07 [postmaster] <defunct>
postgres 3875 3889 0 12:14 ? 00:00:03 [postmaster] <defunct>
postgres 3889 1 0 Nov16 ? 00:01:24 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres 3896 3889 0 Nov16 ? 00:00:00 postgres: logger process
postgres 3913 3889 0 Nov16 ? 00:00:05 postgres: writer process
postgres 3914 3889 0 Nov16 ? 00:00:00 postgres: wal writer process
postgres 3915 3889 0 Nov16 ? 00:00:01 postgres: autovacuum launcher process
postgres 3916 3889 0 Nov16 ? 00:01:42 postgres: stats collector process
postgres 4169 3889 0 12:15 ? 00:00:02 [postmaster] <defunct>
postgres 4170 3889 0 12:15 ? 00:00:04 [postmaster] <defunct>
postgres 4171 3889 0 12:15 ? 00:00:07 [postmaster] <defunct>
postgres 4281 3889 0 12:15 ? 00:00:04 [postmaster] <defunct>
postgres 4672 3889 0 12:17 ? 00:00:01 [postmaster] <defunct>
postgres 5038 3889 0 12:19 ? 00:00:01 [postmaster] <defunct>
postgres 5203 3889 0 12:20 ? 00:00:00 [postmaster] <defunct>
postgres 5484 3889 0 12:21 ? 00:00:01 [postmaster] <defunct>
postgres 5694 3889 0 12:22 ? 00:00:01 postgres: postgres international 10.19.0.49(29113) idle
postgres 5769 3889 0 12:23 ? 00:00:04 [postmaster] <defunct>
postgres 5832 3889 0 12:23 ? 00:00:03 postgres: postgres international 10.19.0.49(29129) idle
postgres 5835 3889 0 12:23 ? 00:00:02 [postmaster] <defunct>
postgres 5979 3889 0 12:24 ? 00:00:01 [postmaster] <defunct>
postgres 6243 3889 0 12:25 ? 00:00:00 [postmaster] <defunct>
postgres 6424 3889 0 12:26 ? 00:00:01 postgres: postgres international 10.19.0.49(45160) idle
postgres 6425 3889 0 12:26 ? 00:00:01 [postmaster] <defunct>
postgres 6429 3889 0 12:26 ? 00:00:01 [postmaster] <defunct>
postgres 6599 3889 0 12:27 ? 00:00:02 [postmaster] <defunct>
postgres 6635 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6653 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6654 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6655 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6663 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6664 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6665 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6666 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6667 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6668 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6669 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6670 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6671 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6672 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6673 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>
postgres 6773 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6774 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6778 3889 0 12:28 ? 00:00:01 [postmaster] <defunct>
postgres 6779 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6785 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6787 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6788 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6796 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6797 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6837 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6838 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6839 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6840 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6841 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6842 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6843 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6844 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6852 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6859 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6860 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6862 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6872 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6873 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6874 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>
postgres 6910 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 6970 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 6971 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 6972 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 6973 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 6978 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 6990 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 6991 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 6992 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 6993 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7000 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7088 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7089 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7090 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7091 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7094 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7095 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7096 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7122 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
postgres 7126 3889 0 12:29 ? 00:00:03 postgres: postgres mrs 10.19.0.51(7717) idle
postgres 7127 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>
root 7152 7150 0 12:30 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7157 7155 0 12:30 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7168 7166 0 12:31 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7174 7172 0 12:31 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7188 7186 0 12:32 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7196 7194 0 12:32 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7258 7256 0 12:34 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7263 7261 0 12:34 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7308 7306 0 12:35 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7313 7311 0 12:35 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7325 7323 0 12:36 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7330 7328 0 12:36 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
root 7371 7369 0 12:37 ? 00:00:00 /usr/bin/psql -h ddbb.priv -U bigsister mrs
postgres 22959 3889 0 11:09 ? 00:00:01 postgres: postgres spain 10.19.0.45(57042) idle
postgres 22961 3889 0 11:09 ? 00:00:22 postgres: postgres mirai 10.19.0.45(57043) idle
postgres 23225 3889 0 11:10 ? 00:00:00 postgres: postgres xmas 10.19.0.45(57061) idle
postgres 25844 3889 0 11:23 ? 00:00:01 postgres: postgres spain 10.19.0.51(11291) idle
postgres 31942 3889 0 11:53 ? 00:00:00 [postmaster] <defunct>
postgres 31973 3889 0 11:53 ? 00:00:16 [postmaster] <defunct>
2) We did a strace on the postmaster pid. However, we had 2 postmaster processes that were not defunct
# ps -fea |grep -i postmaster
postgres 3889 1 0 Nov16 ? 00:01:24 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres 7601 3889 0 12:37 ? 00:00:00 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
As soon as we ran "strace" on pid 3889, everything started to work again. Not sure whether it was a coincidence, but that is the way it was.
# strace -p 3889
Process 3889 attached - interrupt to quit
select(6, [3 4 5], NULL, NULL, {56, 930000}) = ? ERESTARTNOHAND (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b4f6591d440) = 7766
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigreturn(0x2) = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 7766
wait4(-1, 0x7fffc37259a4, WNOHANG, NULL) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigreturn(0x2) = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b4f6591d440) = 7767
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigreturn(0x2) = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
I also straced the other postmaster pid
# strace -p 7601
Process 7601 attached - interrupt to quit
recvfrom(8, "P\0\0\0\221\0select id_key from transla"..., 8192, 0, NULL, NULL) = 181
brk(0x14ac4000) = 0x14ac4000
lseek(3, 0, SEEK_END) = 16130048
lseek(5, 0, SEEK_END) = 4661248
lseek(3, 0, SEEK_END) = 16130048
brk(0x14ae9000) = 0x14ae9000
brk(0x14adb000) = 0x14adb000
brk(0x14ad7000) = 0x14ad7000
sendto(7, "\2\0\0\0\230\1\0\0H\203\314\0\4\0\0\0\5\0\0\0\0\0\0\0X\203\314\0\0\0\0\0"..., 408, 0, NULL, 0) = 408
sendto(8, "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 74, 0, NULL, 0) = 74
recvfrom(8, "P\0\0\0\213\0select id_key from transla"..., 8192, 0, NULL, NULL) = 175
brk(0x14ac4000) = 0x14ac4000
lseek(3, 0, SEEK_END) = 16130048
lseek(5, 0, SEEK_END) = 4661248
lseek(3, 0, SEEK_END) = 16130048
brk(0x14ae9000) = 0x14ae9000
brk(0x14adb000) = 0x14adb000
brk(0x14ad7000) = 0x14ad7000
sendto(8, "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 73, 0, NULL, 0) = 73
recvfrom(8, <unfinished ...>
Process 7601 detached
3) Checked the logs. We could see the same error lines we had before, but they disappeared after just a few minutes.
2010-11-18 12:30:02.830 CET|1|mirai|10.19.0.43(4452)|3875|idle|4ce50a7c.f23|2010-11-18 12:14:04 CET|67/0|0|postgres| LOG: unexpected EOF on client connection
2010-11-18 12:30:02.841 CET|1|mirai|10.19.0.43(13943)|2275|idle|4ce508b4.8e3|2010-11-18 12:06:28 CET|75/0|0|postgres| LOG: unexpected EOF on client connection
2010-11-18 12:30:47.628 CET|2|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-18 12:31:48.056 CET|3|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-18 12:32:48.184 CET|4|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-18 12:33:48.614 CET|5|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-18 12:34:48.747 CET|6|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-18 12:35:49.277 CET|7|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-18 12:36:49.706 CET|8|||3915||4ce2d008.f4b|2010-11-16 19:40:08 CET||0|| WARNING: worker took too long to start; cancelled
2010-11-18 12:37:30.462 CET|1|spain|10.19.0.48(11276)|7414|idle|4ce50ffa.1cf6|2010-11-18 12:37:30 CET|2/0|0|spainadmin| LOG: unexpected EOF on client connection
Any Ideas?
Thanks for your help and feedback!
Pablo