Thread: BUG #18359: autovacuum stopped vacuuming user tables, focusing on just pg_auth_members repeatedly

BUG #18359: autovacuum stopped vacuuming user tables, focusing on just pg_auth_members repeatedly

From
PG Bug reporting form
Date:
22 February 2024, 14:43:18
The following bug has been logged on the website:

Bug reference:      18359
Logged by:          Dmitry Astapov
Email address:      dastapov@gmail.com
PostgreSQL version: 13.6
Operating system:   Rocky Linux
Description:

Sorry, I don't have a way to reproduce the problem at will.

I am also reporting this for version 13.6, but I checked
src/backend/access/heap/vacuumlazy.c from 13.14, and it is unchanged between
those versions, so I think my bugreport will apply to 13.14 as well.

Symptoms:

- Database users complain that "queries that took seconds now take hours".
Investigation quickly leads to the conclusion that all the queries are
mis-planned due to the stale statistics.

- select max(last_autovacuum), max(last_autoanalyze) from
pg_stat_user_tables shows that last autovacuum/autonalyze was more than two
days ago, and no autovacuum workers are running, according to
pg_stat_activity / pg_stat_progress_vacuum - normally, we can see at least a
couple at any point in time.

Log_autovacuum_min_duration was set to default -1, so we don't have logs
from the beginning of the incident. However, once we determined that
autovacuums are not running, we set it to 0, at which point we saw this
message repeated in the logs approximately every 10 seconds:

2024-02-20 08:08:49.993 EST,,,758746,,65d4a461.b93da,3,,2024-02-20 08:08:49
EST,331/9765639,0,LOG,00000,"automatic vacuum to prevent wraparound of table
""postgres.pg_catalog.pg_auth_members"": index scans: 0
pages: 0 removed, 100 remain, 0 skipped due to pins, 62 skipped frozen
tuples: 0 removed, 18359 remain, 0 are dead but not yet removable, oldest
xmin: 1429906812
buffer usage: 11 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
WAL usage: 0 records, 0 full page images, 0 bytes",,,,,,,,,"","autovacuum
worker"

We have about 15 minutes worth of these log lines, but all the numbers
(pages, tuples, scans, skipped, frozen, ...) in the message were exactly the
same every time (as verified by "sort -u"), and there were no other
"autovacuum worker" messages in the log.

It was imperative to restore the normal operation of the database, and on a
whim, we tried "vacuum full pg_auth_members", which executed immediately.
Right after that we saw autovacuum workers scheduled and running as usual,
and soon after everything returned back to normal.

This old thread

https://www.postgresql.org/message-id/flat/CAE39h23X%3D09_PBvyvC3yy1MBVu6pkLuDj9KpaQUzoSmi_2%2BiKw%40mail.gmail.com#900740c5db62e5be2afa51da23a6a926
seems very similar to what we have experienced.

I've checked through the changelogs of versions 14, 15, 16, and nothing
jumps out at me as a fix for this specific problem.