Re: BUG #12071: Stat collector went crasy (50MB/s constant writes) - Mailing list pgsql-bugs
From | Tomas Vondra |
---|---|
Subject | Re: BUG #12071: Stat collector went crasy (50MB/s constant writes) |
Date | |
Msg-id | 54763BD8.8020906@fuzzy.cz Whole thread Raw |
In response to | Re: BUG #12071: Stat collector went crasy (50MB/s constant writes) (Maxim Boguk <maxim.boguk@gmail.com>) |
Responses |
Re: BUG #12071: Stat collector went crasy (50MB/s constant writes)
|
List | pgsql-bugs |
On 26.11.2014 21:07, Maxim Boguk wrote: > > On one of my servers stat collector went crasy on idle system. > It constantly write around 50MB/s on idle database (no > activity at all). > > > OK there are results of some entry-level gdb kung-fu: > > problem point is: > /* > * Write the stats file if a new request has arrived > that is not > * satisfied by existing file. > */ > if (last_statwrite < last_statrequest) > pgstat_write_statsfile(false); > > on some loop iteration difference between last_statrequest and > last_statwrite is around 23000-29000 (e.g. 20-30ms). > > Hm may be there some clock drift between CPU cores? > > > What's made this situation interesting that the fact the code never got > inside > /* > * If there is clock skew between backends and the collector, we > could > * receive a stats request time that's in the future. If so, > complain > * and reset last_statrequest. Resetting ensures that no inquiry > * message can cause more than one stats file write to occur. > */ > if (last_statrequest > last_statwrite) > ... > block. That suggests there's no (significant) clock skew between postmaster and the process requesting the file. > Watchpoint on the (last_statwrite < last_statrequest) condition > repetable fire ONLY inside: > pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len) > { > if (msg->inquiry_time > last_statrequest) > last_statrequest = msg->inquiry_time; > } > function. OK, that's expected, because the request is sent only if the file is older than X miliseconds (10ms for autovacuum, 640ms otherwise). So it would be strange if the condition didn't fire here. > Now just some musings/theory (I might be dead wrong of course). > Lets suppose there are no clock skew actually and check execution flow. > > Main loop executed: > if (last_statwrite < last_statrequest) > pgstat_write_statsfile(false); > and updated last_statwrite value. > However, it took some time because of file size. OK. FWIW, it's updated to GetCurrentTimestamp() before starting to write the file. > During that period (while pgstat busy with writes) some backend issued > PGSTAT_MTYPE_INQUIRY request which will be succesfuly accepted on the > next code lines at > len = recv(pgStatSock, (char *) &msg,... > And as a result the last_statrequest will be updated to the > msg->inquiry_time value. > And this value will be definitely > than last_statwrite (because this > request issued after pgstat worker stared writing new copy of the stat > file). How many backends are there, requesting a fresh stats file? I thought you mentioned the system is rather idle? A regular backend won't request a fresh copy unless the current file is older than 640 ms. So that can't cause ~50MB of writes (it'd require 10 writes per second). Maybe there are multiple backends sending such requests, but I don't see how that could cause a continuous write load (I'd understand a short spike at the beginning, and then getting synchronized). Autovacuum launcher / workers might cause that, because it's using 10 ms threshold. But that'd require a lot of autovacuum processes running in parallel, I guess. I wonder if this could be caused by a long "queue" of such inquiries, but that should be filtered out by the (last_statwrite < last_statrequest) condition. regards Tomas
pgsql-bugs by date: