Re: autovacuum starvation - Mailing list pgsql-hackers
From | Jim Nasby |
---|---|
Subject | Re: autovacuum starvation |
Date | |
Msg-id | 8ABA01BB-D248-4135-8F7E-346A102B0E50@decibel.org |
In response to | autovacuum starvation (Alvaro Herrera <alvherre@commandprompt.com>) |
Responses | Re: autovacuum starvation |
List | pgsql-hackers |
On May 2, 2007, at 5:39 PM, Alvaro Herrera wrote:
> The recently discovered autovacuum bug made me notice something that is
> possibly critical.  The current autovacuum code makes an effort not to
> leave workers in a "starting" state for too long, lest there be failure
> to timely tend all databases needing vacuum.
>
> This is how the launching of workers works:
> 1) the launcher puts a pointer to a WorkerInfo entry in shared memory,
>    called "the starting worker" pointer
> 2) the launcher sends a signal to the postmaster
> 3) the postmaster forks a worker
> 4) the new worker checks the starting worker pointer
> 5) the new worker resets the starting worker pointer
> 6) the new worker connects to the given database and vacuums it
>
> The problem is this: I originally added some code in the autovacuum
> launcher to check that a worker does not take "too long" to start.  This
> is autovacuum_naptime seconds.  If this happens, the launcher resets the
> starting worker pointer, which means that the newly starting worker will
> not see anything that needs to be done and exit quickly.
>
> The problem with this is that on a high load machine, for example
> lionfish during buildfarm runs, this would cause autovacuum starvation
> for the period in which the high load is sustained.  This could prove
> dangerous.
>
> The problem is that things like fork() failure cannot be communicated
> back to the launcher.  So when the postmaster tries to start a process
> and it fails for some reason (failure to fork, or out of memory) we need
> a way to re-initiate the worker that failed.
>
> The current code resets the starting worker pointer, and leave the slot
> free for another worker, maybe in another database, to start.
>
> I recently added code to resend the postmaster signal when the launcher
> sees the starting worker pointer not invalid -- step 2 above.  I think
> this is fine, but
>
> 1) we should remove the logic to remove the starting worker pointer.  It
>    is not needed, because database-local failures will be handled by
>    subsequent checks
>
> 2) we should leave the logic to resend the postmaster, but we should
>    make an effort to avoid sending it too frequently
>
> Opinions?
>
> If I haven't stated the problem clearly please let me know and I'll try
> to rephrase.

Isn't there some way to get the postmaster to signal the launcher?
Perhaps stick an error code in shared memory and send it a signal?

--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
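
To make the suggestion concrete, here is a rough toy model in Python of the six-step handshake quoted above, with the postmaster reporting a failed fork back to the launcher. It is only a sketch and not the actual autovacuum code: the names `SharedMemory`, `start_error`, `launcher_signalled`, `make_flaky_fork` and the retry limit are all made up for illustration, and the "signal" is just a boolean flag.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SharedMemory:
    """Toy stand-in for the autovacuum shared memory area (hypothetical layout)."""
    starting_worker: Optional[str] = None  # database the pending worker should vacuum
    start_error: Optional[str] = None      # hypothetical error code set by the postmaster
    launcher_signalled: bool = False       # hypothetical stand-in for a signal to the launcher
    vacuumed: List[str] = field(default_factory=list)


def launcher_request(shm: SharedMemory, database: str) -> None:
    """Step 1: publish the starting-worker entry. Step 2 (the signal) is the caller's job."""
    shm.starting_worker = database
    shm.start_error = None
    shm.launcher_signalled = False


def postmaster_start_worker(shm: SharedMemory, fork: Callable[[], bool]) -> None:
    """Step 3: try to fork. On failure, record an error code and signal the launcher."""
    if shm.starting_worker is None:
        return
    if not fork():
        shm.start_error = "fork failed"
        shm.launcher_signalled = True
        return
    # Steps 4-6: the worker reads the pointer, resets it, and vacuums the database.
    database = shm.starting_worker
    shm.starting_worker = None
    shm.vacuumed.append(database)


def launcher_handle_signal(shm: SharedMemory, fork: Callable[[], bool],
                           max_retries: int = 3) -> int:
    """Re-request the failed worker while failures are reported. Returns retries used."""
    retries = 0
    while shm.launcher_signalled and retries < max_retries:
        shm.launcher_signalled = False
        shm.start_error = None
        retries += 1
        postmaster_start_worker(shm, fork)  # the starting-worker pointer is left in place
    return retries


def make_flaky_fork(results: List[bool]) -> Callable[[], bool]:
    """Build a fake fork() that returns the given outcomes in order, then succeeds."""
    outcomes = iter(results)
    return lambda: next(outcomes, True)


if __name__ == "__main__":
    shm = SharedMemory()
    fork = make_flaky_fork([False, False, True])
    launcher_request(shm, "db1")
    postmaster_start_worker(shm, fork)
    print("error after first attempt:", shm.start_error)
    print("retries needed:", launcher_handle_signal(shm, fork))
    print("vacuumed:", shm.vacuumed)
```

The point of the model is that the launcher never has to guess from a timeout: the starting worker pointer stays set until a worker actually picks it up, and a retry happens only when the postmaster says the start failed. The retry cap is one assumed way to avoid resending too frequently.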