Thread: FATAL: could not read statistics message
I am using PostgreSQL 8.1.0 on an Xserve running Mac OS X 10.3.9. I am
getting the following in the log every minute for the past couple of days.
The database is otherwise running normally, as far as I can tell:

2006-05-16 07:26:01 EDT FATAL: could not read statistics message: Resource temporarily unavailable
2006-05-16 07:27:01 EDT FATAL: could not read statistics message: Resource temporarily unavailable
2006-05-16 07:28:03 EDT FATAL: could not read statistics message: Resource temporarily unavailable

I saw a previous message in the archives, but it did not appear that any
conclusion was reached. Tom suggested that an EAGAIN error was being
returned by the system, but I'm not sure exactly what that means or why it
is happening now, as we have had the server running for months.

Any insight?

Thanks,
Sean
On 5/16/06, Sean Davis <sdavis2@mail.nih.gov> wrote:
> I am using PostgreSQL 8.1.0 on an Xserve running Mac OS X 10.3.9. I am
> getting the following in the log every minute for the past couple of days.
> The database is otherwise running normally, as far as I can tell:
>
> 2006-05-16 07:26:01 EDT FATAL: could not read statistics message:
> Resource temporarily unavailable
> 2006-05-16 07:27:01 EDT FATAL: could not read statistics message:
> Resource temporarily unavailable
> 2006-05-16 07:28:03 EDT FATAL: could not read statistics message:
> Resource temporarily unavailable
>
> I saw a previous message in the archives, but it did not appear that any
> conclusion was reached. Tom suggested that an EAGAIN error was being
> returned by the system, but I'm not sure exactly what that means or why it
> is happening now, as we have had the server running for months.
>
> Any insight?

I ran into this problem also on OS X running PostgreSQL 8.0. When you
start PostgreSQL you usually see these four processes:

/usr/local/pgsql/bin/postmaster
postgres: writer process
postgres: stats buffer process
postgres: stats collector process

When I saw the same error as you, the stats collector process was
missing. A few times we also got messages like

[KERNEL]: no space in available paging segments; swapon suggested

and then a bunch of these:

postgres[13562]: [1-1] FATAL: could not read statistics message:
Resource temporarily unavailable

We thought it was our memory tuning of OS X. Since it wasn't a
production box, we didn't pursue the problem further. What tuning have
you done to postgresql.conf and the OS X memory settings?
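(A quick way to check is a process listing, e.g. ps auxww | grep post; in
our failure case the "stats buffer" and "stats collector" lines were simply
gone.)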
On 5/16/06 3:08 PM, "Tony Wasson" <ajwasson@gmail.com> wrote:

> On 5/16/06, Sean Davis <sdavis2@mail.nih.gov> wrote:
>> I am using PostgreSQL 8.1.0 on an Xserve running Mac OS X 10.3.9. I am
>> getting the following in the log every minute for the past couple of days.
>> The database is otherwise running normally, as far as I can tell:
>>
>> 2006-05-16 07:26:01 EDT FATAL: could not read statistics message:
>> Resource temporarily unavailable
>> 2006-05-16 07:27:01 EDT FATAL: could not read statistics message:
>> Resource temporarily unavailable
>> 2006-05-16 07:28:03 EDT FATAL: could not read statistics message:
>> Resource temporarily unavailable
>>
>> I saw a previous message in the archives, but it did not appear that any
>> conclusion was reached. Tom suggested that an EAGAIN error was being
>> returned by the system, but I'm not sure exactly what that means or why it
>> is happening now, as we have had the server running for months.
>>
>> Any insight?
>
> I ran into this problem also on OS X running PostgreSQL 8.0. When you
> start PostgreSQL you usually see these four processes:
>
> /usr/local/pgsql/bin/postmaster
> postgres: writer process
> postgres: stats buffer process
> postgres: stats collector process
>
> When I saw the same error as you, the stats collector process was
> missing. A few times we also got messages like

Now that I look, I see the same thing.

> [KERNEL]: no space in available paging segments; swapon suggested

No such line in the logs.

> and then a bunch of these:
>
> postgres[13562]: [1-1] FATAL: could not read statistics message:
> Resource temporarily unavailable
>
> We thought it was our memory tuning of OS X. Since it wasn't a
> production box, we didn't pursue the problem further. What tuning have
> you done to postgresql.conf and the OS X memory settings?

I had cranked things up a bit from the standard install:

shared_buffers = 15000            # min 16 or max_connections*2, 8KB each
#temp_buffers = 1000              # min 100, 8KB each
#max_prepared_transactions = 50   # can be 0 or more
work_mem = 10000                  # min 64, size in KB
maintenance_work_mem = 128000     # min 1024, size in KB
max_stack_depth = 4096            # min 100, size in KB

Some of these may not be ideal, but they really improved performance for
our needs.

Prompted by the missing stats collector and stats buffer processes, I
restarted the server, and that appears to have fixed the issue for now. At
least I know what to watch for.

Thanks, Tony, for the reply.

Sean
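(For anyone finding this in the archives: "restarted the server" here just
means the usual pg_ctl cycle, e.g. pg_ctl -D <your data directory> restart.)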
On Tue, May 16, 2006 at 03:41:07PM -0400, Sean Davis wrote:
> I had cranked things up a bit from the standard install:
>
> shared_buffers = 15000            # min 16 or max_connections*2, 8KB each
> #temp_buffers = 1000              # min 100, 8KB each
> #max_prepared_transactions = 50   # can be 0 or more
> work_mem = 10000                  # min 64, size in KB
> maintenance_work_mem = 128000     # min 1024, size in KB
> max_stack_depth = 4096            # min 100, size in KB

You didn't say anything about how much memory you have, but you need to
be careful with *work_mem, as being too aggressive can run the machine
out of memory.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
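(The reason for the caution: work_mem is allocated per sort or hash
operation, not per connection, so a single complex query can use several
multiples of it, and the theoretical worst case scales with max_connections.)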
On 5/16/06, Sean Davis <sdavis2@mail.nih.gov> wrote:
> I had cranked things up a bit from the standard install:
>
> shared_buffers = 15000            # min 16 or max_connections*2, 8KB each
> #temp_buffers = 1000              # min 100, 8KB each
> #max_prepared_transactions = 50   # can be 0 or more
> work_mem = 10000                  # min 64, size in KB
> maintenance_work_mem = 128000     # min 1024, size in KB
> max_stack_depth = 4096            # min 100, size in KB
>
> Some of these may not be ideal, but they really improved performance for
> our needs.
>
> Prompted by the missing stats collector and stats buffer processes, I
> restarted the server, and that appears to have fixed the issue for now. At
> least I know what to watch for.

We found the same thing: restarting PostgreSQL made the problem go away.
However, we were able to duplicate the problem. Tuning our kernel memory
options and postgresql.conf settings down made it occur less frequently.
We had 8GB of RAM in the box and were running 10.3.9.

Here were the kernel options we were using:

kern.sysv.shmmax=1073741824
kern.sysv.shmmin=1
kern.sysv.shmmni=64
kern.sysv.shmseg=16
kern.sysv.shmall=262144    # <-- number of pages (a page is 4096 bytes), ceil(shmmax/4096)

and here are the lowest settings we tried in postgresql.conf:

shared_buffers = 25000          # min 16, at least max_connections*2, 8KB each
work_mem = 2048                 # min 64, size in KB
maintenance_work_mem = 1024     # min 1024, size in KB

With the above settings, we could still "kill" the stats collector.
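(Sanity check on that shmall comment: 1073741824 / 4096 = 262144 exactly,
so shmall covers the full 1GB shmmax with no slack.)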
"Tony Wasson" <ajwasson@gmail.com> writes: > When I saw the same error as you, the stats collector process was > missing. The collector, or the buffer process? The reported message would be emitted by the buffer process, after which it would immediately exit. (The collector would go away too once it noticed EOF on its input.) By and by the postmaster should start a fresh pair of processes. IIRC, the postmaster's spawning is rate-limited to once a minute, so if the new buffer were immediately dying with the same error, that would explain your observation of once-a-minute messages. This all still leaves us no closer to understanding *why* the recv() is failing, though. What it does suggest is that the problem is a hard, repeatable error when it does occur, which makes me loath to put in the quick-fix "retry on EAGAIN" that I previously suggested. If it is a hard error then that will just convert the problem into a busy-loop that'll eat all your CPU cycles ... not much of an improvement ... regards, tom lane
Jim C. Nasby wrote:
> On Tue, May 16, 2006 at 03:41:07PM -0400, Sean Davis wrote:
>> I had cranked things up a bit from the standard install:
>>
>> shared_buffers = 15000            # min 16 or max_connections*2, 8KB each
>> #temp_buffers = 1000              # min 100, 8KB each
>> #max_prepared_transactions = 50   # can be 0 or more
>> work_mem = 10000                  # min 64, size in KB
>> maintenance_work_mem = 128000     # min 1024, size in KB
>> max_stack_depth = 4096            # min 100, size in KB
>
> You didn't say anything about how much memory you have, but you need to
> be careful with *work_mem, as being too aggressive can run the machine
> out of memory.

I have 4GB of memory, and the machine is pretty much a dedicated database
server. We use it mainly for data warehousing and mining; there are
rarely more than 2 active connections and never more than 5 total, so I
have felt pretty comfortable with leaving work_mem fairly generous. I
will likely have to change that if the machine becomes more active.

Sean
On 5/16/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Tony Wasson" <ajwasson@gmail.com> writes:
>> When I saw the same error as you, the stats collector process was
>> missing.
>
> The collector, or the buffer process? The reported message would be
> emitted by the buffer process, after which it would immediately exit.
> (The collector would go away too once it noticed EOF on its input.)
> By and by the postmaster should start a fresh pair of processes.

The stats collector was dead and would not respawn. Our options seemed
limited to restarting the postmaster or ignoring the error.

Here is what the process list looked like:

kangaroo:~ twasson$ ps waux | grep post
pgsql     574  0.0  -0.0  460104    832  p0   S   Wed06AM  10:26.98 /usr/local/pgsql/bin/postmaster -D /Volumes/Vol0/pgsql-data
pgsql     578  0.0  -5.2  460356 108620  p0   S   Wed06AM  27:43.68 postgres: writer process
twasson 23844  0.0  -0.0   18172    688  std  S+  10:05AM   0:00.01 grep post

> IIRC, the postmaster's spawning is rate-limited to once a minute,
> so if the new buffer were immediately dying with the same error,
> that would explain your observation of once-a-minute messages.
>
> This all still leaves us no closer to understanding *why* the recv()
> is failing, though. What it does suggest is that the problem is a
> hard, repeatable error when it does occur, which makes me loath to
> put in the quick-fix "retry on EAGAIN" that I previously suggested.
> If it is a hard error then that will just convert the problem into
> a busy-loop that'll eat all your CPU cycles ... not much of an
> improvement ...
Tony Wasson wrote:
> On 5/16/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> "Tony Wasson" <ajwasson@gmail.com> writes:
>>> When I saw the same error as you, the stats collector process was
>>> missing.
>>
>> The collector, or the buffer process? The reported message would be
>> emitted by the buffer process, after which it would immediately exit.
>> (The collector would go away too once it noticed EOF on its input.)
>> By and by the postmaster should start a fresh pair of processes.
>
> The stats collector was dead and would not respawn. Our options seemed
> limited to restarting the postmaster or ignoring the error.
>
> Here is what the process list looked like:
>
> kangaroo:~ twasson$ ps waux | grep post
> pgsql     574  0.0  -0.0  460104    832  p0   S   Wed06AM  10:26.98 /usr/local/pgsql/bin/postmaster -D /Volumes/Vol0/pgsql-data
> pgsql     578  0.0  -5.2  460356 108620  p0   S   Wed06AM  27:43.68 postgres: writer process
> twasson 23844  0.0  -0.0   18172    688  std  S+  10:05AM   0:00.01 grep post

That is what I recall seeing, also, though I wasn't meticulous enough to
hang onto the process list.

>> IIRC, the postmaster's spawning is rate-limited to once a minute,
>> so if the new buffer were immediately dying with the same error,
>> that would explain your observation of once-a-minute messages.
>>
>> This all still leaves us no closer to understanding *why* the recv()
>> is failing, though. What it does suggest is that the problem is a
>> hard, repeatable error when it does occur, which makes me loath to
>> put in the quick-fix "retry on EAGAIN" that I previously suggested.
>> If it is a hard error then that will just convert the problem into
>> a busy-loop that'll eat all your CPU cycles ... not much of an
>> improvement ...
On Tue, May 16, 2006 at 07:24:43PM -0400, Sean Davis wrote:
> Jim C. Nasby wrote:
>> On Tue, May 16, 2006 at 03:41:07PM -0400, Sean Davis wrote:
>>> I had cranked things up a bit from the standard install:
>>>
>>> shared_buffers = 15000            # min 16 or max_connections*2, 8KB each
>>> #temp_buffers = 1000              # min 100, 8KB each
>>> #max_prepared_transactions = 50   # can be 0 or more
>>> work_mem = 10000                  # min 64, size in KB
>>> maintenance_work_mem = 128000     # min 1024, size in KB
>>> max_stack_depth = 4096            # min 100, size in KB
>>
>> You didn't say anything about how much memory you have, but you need to
>> be careful with *work_mem, as being too aggressive can run the machine
>> out of memory.
>
> I have 4GB of memory, and the machine is pretty much a dedicated database
> server. We use it mainly for data warehousing and mining; there are
> rarely more than 2 active connections and never more than 5 total, so I
> have felt pretty comfortable with leaving work_mem fairly generous. I
> will likely have to change that if the machine becomes more active.

Well, 10MB of work_mem isn't exactly generous in that case, since you'd
probably end up using no more than 100MB at once. Of course, if you had
100 connections and only 1GB, it'd be a different story...
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
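(Spelling out the arithmetic: work_mem = 10000 is roughly 10MB per sort or
hash step, so five connections each running a query with, say, two such
steps comes to about 5 x 2 x 10MB = 100MB, a small fraction of 4GB. The
two-steps-per-query figure is only illustrative.)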