Re: autovacuum starvation - Mailing list pgsql-hackers

From	Jim Nasby
Subject	Re: autovacuum starvation
Date	May 6, 2007 02:28:13
Msg-id	8ABA01BB-D248-4135-8F7E-346A102B0E50@decibel.org
In response to	autovacuum starvation (Alvaro Herrera <alvherre@commandprompt.com>)
Responses	Re: autovacuum starvation
List	pgsql-hackers

On May 2, 2007, at 5:39 PM, Alvaro Herrera wrote:
> The recently discovered autovacuum bug made me notice something that is
> possibly critical.  The current autovacuum code makes an effort not to
> leave workers in a "starting" state for too long, lest there be failure
> to timely tend all databases needing vacuum.
>
> This is how the launching of workers works:
> 1) the launcher puts a pointer to a WorkerInfo entry in shared memory,
>    called "the starting worker" pointer
> 2) the launcher sends a signal to the postmaster
> 3) the postmaster forks a worker
> 4) the new worker checks the starting worker pointer
> 5) the new worker resets the starting worker pointer
> 6) the new worker connects to the given database and vacuums it
>
> The problem is this: I originally added some code in the autovacuum
> launcher to check that a worker does not take "too long" to start,
> where "too long" is autovacuum_naptime seconds.  If this happens, the
> launcher resets the starting worker pointer, which means that the newly
> starting worker will not see anything that needs to be done and will
> exit quickly.
>
> The problem with this is that on a high load machine, for example
> lionfish during buildfarm runs, this would cause autovacuum starvation
> for the period in which the high load is sustained.  This could prove
> dangerous.
>
> The problem is that things like fork() failure cannot be communicated
> back to the launcher.  So when the postmaster tries to start a process
> and it fails for some reason (failure to fork, or out of memory) we need
> a way to re-initiate the worker that failed.
>
> The current code resets the starting worker pointer, and leaves the slot
> free for another worker, maybe in another database, to start.
>
> I recently added code to resend the postmaster signal (step 2 above)
> when the launcher sees that the starting worker pointer is still set.
> I think this is fine, but:
>
> 1) we should remove the logic that resets the starting worker pointer.
> It is not needed, because database-local failures will be handled by
> subsequent checks
>
> 2) we should leave the logic to resend the postmaster signal, but we
> should make an effort to avoid sending it too frequently
>
> Opinions?
>
> If I haven't stated the problem clearly please let me know and I'll try
> to rephrase.
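
For readers following along, the two proposals quoted above (never clear
the starting worker pointer from the launcher side, and resend the
postmaster signal no more often than some interval) might look roughly
like the sketch below.  This is a self-contained simulation, not the
actual autovacuum code: LauncherState, resend_interval, signals_sent and
all three function names are hypothetical, the WorkerInfo struct is a
stand-in, and the "signal" is just a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in; not the real PostgreSQL WorkerInfo. */
typedef struct WorkerInfo
{
    int         dbid;               /* database the worker should vacuum */
} WorkerInfo;

/* Hypothetical launcher-side view of the shared state. */
typedef struct LauncherState
{
    WorkerInfo *starting_worker;    /* non-NULL while a worker is starting */
    long        last_signal_time;   /* seconds; when we last signaled */
    long        resend_interval;    /* minimum seconds between signals */
    int         signals_sent;       /* stands in for signaling the postmaster */
} LauncherState;

static void
signal_postmaster(LauncherState *st, long now)
{
    /* Only signal when there is actually a worker waiting to start. */
    assert(st->starting_worker != NULL);
    st->signals_sent++;
    st->last_signal_time = now;
}

/* Steps 1 and 2: publish the starting worker and signal the postmaster. */
bool
launcher_request_worker(LauncherState *st, WorkerInfo *w, long now)
{
    if (st->starting_worker != NULL)
        return false;               /* another worker is still starting */
    st->starting_worker = w;
    signal_postmaster(st, now);
    return true;
}

/*
 * Periodic check.  The pointer is never reset here; the signal is resent
 * only if resend_interval seconds have passed.  Returns true if resent.
 */
bool
launcher_check_starting(LauncherState *st, long now)
{
    if (st->starting_worker == NULL)
        return false;               /* nothing pending */
    if (now - st->last_signal_time < st->resend_interval)
        return false;               /* too soon to nag the postmaster */
    signal_postmaster(st, now);
    return true;
}

/* Steps 4 and 5: the new worker reads the pointer, then resets it. */
WorkerInfo *
worker_claim(LauncherState *st)
{
    WorkerInfo *w = st->starting_worker;

    st->starting_worker = NULL;
    return w;
}
```

Because only the worker clears the pointer, a slow fork under heavy load
delays the vacuum rather than cancelling it.  Real code would also need
locking around the shared pointer, which this sketch leaves out.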

Isn't there some way to get the postmaster to signal the launcher?  
Perhaps stick an error code in shared memory and send it a signal?
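
To make the suggestion concrete, here is a rough sketch of what I have
in mind.  None of this exists in the tree: AutoVacShared, StartFailure
and both function names are made up for illustration, and the boolean
launcher_signaled stands in for delivering an actual signal to the
launcher.  Locking and signal-safety are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error codes the postmaster could leave for the launcher. */
typedef enum StartFailure
{
    START_OK = 0,
    START_FORK_FAILED,
    START_OUT_OF_MEMORY
} StartFailure;

/* Hypothetical shared-memory area visible to postmaster and launcher. */
typedef struct AutoVacShared
{
    int          starting_dbid;     /* 0 means no worker is starting */
    StartFailure start_failure;     /* set by postmaster on failure */
    bool         launcher_signaled; /* stands in for a real signal */
} AutoVacShared;

/* Postmaster side: record why the worker could not be started. */
void
postmaster_report_failure(AutoVacShared *shm, StartFailure why)
{
    assert(why != START_OK);
    shm->start_failure = why;
    shm->launcher_signaled = true;  /* real code would signal the launcher */
}

/*
 * Launcher side: called when the launcher wakes up.  Returns true if a
 * failed start was reported and the pending worker should be retried;
 * the reason is stored in *why_out.
 */
bool
launcher_handle_signal(AutoVacShared *shm, StartFailure *why_out)
{
    if (!shm->launcher_signaled)
        return false;
    shm->launcher_signaled = false;

    if (shm->start_failure == START_OK)
        return false;               /* woken for some other reason */

    *why_out = shm->start_failure;
    shm->start_failure = START_OK;  /* consume the error code */

    /* Keep starting_dbid set so the same database is retried. */
    return shm->starting_dbid != 0;
}
```

With something like this the launcher would not have to guess from a
timeout whether the fork failed or is just slow.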
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
