Hi,

On 2025-10-09 11:01:16 -0500, Nathan Bossart wrote:
> On Wed, Oct 08, 2025 at 01:37:22PM -0400, Andres Freund wrote:
> > On 2025-10-08 10:18:17 -0500, Nathan Bossart wrote:
> >> The attached patch works by storing the maximum of the XID age and the MXID
> >> age in the list with the OIDs and sorting it prior to processing.
> >
> > I think it may be worth trying to avoid reliably using the same order -
> > otherwise e.g. a corrupt index on the first scheduled table can cause
> > autovacuum to reliably fail on the same relation, never allowing it to
> > progress past that point.
>
> Hm. What if we kept a short array of "failed" tables in shared memory?

I've thought about having that as part of pgstats...

> Each worker would consult this table before processing. If the table is
> there, it would remove it from the shared table and skip processing it.
> Then the next worker would try processing the table again.
>
> I also wonder how hard it would be to gracefully catch the error and let
> the worker continue with the rest of its list...

The main set of cases I've seen is workers getting hung up permanently in
corrupt indexes. There never actually is an error; the autovacuums just get
terminated as part of whatever independent reason there is to restart. The
problem with that is that vacuum never actually fails, so nothing would ever
get added to a list of "failed" tables...
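
To make that concrete, here is a rough sketch in Python of the skip-once
scheme from upthread. It is purely illustrative: run_worker, fake_vacuum,
WorkerKilled and the "failed" set are made-up stand-ins, not anything that
exists in PostgreSQL, and recording the table in an error path is my
assumption about how the list would be populated.

```python
# Illustrative model only; none of these names exist in PostgreSQL.

class WorkerKilled(BaseException):
    """Stands in for a worker being terminated from outside (e.g. a restart)."""


def fake_vacuum(table):
    """Hypothetical vacuum: one table errors out, one never finishes."""
    if table == "errors":
        raise RuntimeError("vacuum failed")
    if table == "hangs":
        # A real hang never returns; model the eventual external termination.
        raise WorkerKilled()


def run_worker(tables, vacuum, failed):
    """Process tables in a fixed order, applying the skip-once rule."""
    done = []
    for table in tables:
        if table in failed:
            # Skip it once and drop the entry so the next worker retries it.
            failed.discard(table)
            continue
        try:
            vacuum(table)
        except Exception:
            # Only reached when vacuum raises an actual error.
            failed.add(table)
            return done  # the worker exits after the error
        done.append(table)
    return done


def simulate(tables, rounds):
    """Run several workers in sequence; None marks a terminated worker."""
    failed = set()  # stand-in for the short shared-memory array
    results = []
    for _ in range(rounds):
        try:
            results.append(run_worker(tables, fake_vacuum, failed))
        except WorkerKilled:
            # Termination bypasses the error path, so nothing is recorded.
            results.append(None)
    return results, failed
```

With simulate(["errors", "a", "b"], 2) the second worker skips the bad table
and gets to "a" and "b". With simulate(["hangs", "a", "b"], 3) every worker is
terminated on the first table, "failed" stays empty, and "a" and "b" are never
reached.
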
Greetings,
Andres Freund