Home > mailing lists

Re: BUG #12071: Stat collector went crasy (50MB/s constant writes) - Mailing list pgsql-bugs

From	Tomas Vondra
Subject	Re: BUG #12071: Stat collector went crasy (50MB/s constant writes)
Date	November 26, 2014 20:45:22
Msg-id	54763BD8.8020906@fuzzy.cz Whole thread Raw
In response to	Re: BUG #12071: Stat collector went crasy (50MB/s constant writes) (Maxim Boguk <maxim.boguk@gmail.com>)
Responses	Re: BUG #12071: Stat collector went crasy (50MB/s constant writes)
List	pgsql-bugs

Tree view

On 26.11.2014 21:07, Maxim Boguk wrote:
>
>             On one of my servers stat collector went crasy on idle system.
>             It constantly write around 50MB/s on idle database (no
>             activity at all).
>
>
>     OK there are results of some entry-level gdb kung-fu:
>
>     problem point is:
>                 /*
>                  * Write the stats file if a new request has arrived
>     that is not
>                  * satisfied by existing file.
>                  */
>                 if (last_statwrite < last_statrequest)
>                     pgstat_write_statsfile(false);
>
>     on some loop iteration difference between last_statrequest and
>     last_statwrite is around 23000-29000 (e.g. 20-30ms).
>
>     Hm may be there some clock drift between CPU cores?
>
>
> What's made this situation interesting that the fact the code never got
> inside
>         /*
>          * If there is clock skew between backends and the collector, we
> could
>          * receive a stats request time that's in the future.  If so,
> complain
>          * and reset last_statrequest.  Resetting ensures that no inquiry
>          * message can cause more than one stats file write to occur.
>          */
>         if (last_statrequest > last_statwrite)
> ...
> block.

That suggests there's no (significant) clock skew between postmaster and
the process requesting the file.

> Watchpoint on the (last_statwrite < last_statrequest) condition
> repetable fire ONLY inside:
> pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
> {
>     if (msg->inquiry_time > last_statrequest)
>         last_statrequest = msg->inquiry_time;
> }
> function.

OK, that's expected, because the request is sent only if the file is
older than X miliseconds (10ms for autovacuum, 640ms otherwise). So it
would be strange if the condition didn't fire here.

> Now just some musings/theory (I might be dead wrong of course).
> Lets suppose there are no clock skew actually and check execution flow.
>
> Main loop executed:
>             if (last_statwrite < last_statrequest)
>                 pgstat_write_statsfile(false);
> and updated last_statwrite value.
> However, it took some time because of file size.

OK. FWIW, it's updated to GetCurrentTimestamp() before starting to write
the file.

> During that period (while pgstat busy with writes) some backend issued
> PGSTAT_MTYPE_INQUIRY request which will be succesfuly accepted on the
> next code lines at
>             len = recv(pgStatSock, (char *) &msg,...
> And as a result the last_statrequest will be updated to the
> msg->inquiry_time value.
> And this value will be definitely > than last_statwrite (because this
> request issued after pgstat worker stared writing new copy of the stat
> file).

How many backends are there, requesting a fresh stats file? I thought
you mentioned the system is rather idle?

A regular backend won't request a fresh copy unless the current file is
older than 640 ms. So that can't cause ~50MB of writes (it'd require 10
writes per second). Maybe there are multiple backends sending such
requests, but I don't see how that could cause a continuous write load
(I'd understand a short spike at the beginning, and then getting
synchronized).

Autovacuum launcher / workers might cause that, because it's using 10 ms
threshold. But that'd require a lot of autovacuum processes running in
parallel, I guess.

I wonder if this could be caused by a long "queue" of such inquiries,
but that should be filtered out by the (last_statwrite <
last_statrequest) condition.

regards
Tomas

pgsql-bugs by date:

From: Jeff Janes
Date: 26 November 2014, 20:30:07
Subject: Re: BUG #12071: Stat collector went crasy (50MB/s constant writes)

From: Maxim Boguk
Date: 26 November 2014, 21:20:20
Subject: Re: BUG #12071: Stat collector went crasy (50MB/s constant writes)

Re: BUG #12071: Stat collector went crasy (50MB/s constant writes) - Mailing list pgsql-bugs

Previous

Next