From:    Tomas Vondra
Subject: autovacuum stress-testing our system
Msg-id:  1718942738eb65c8407fcd864883f4c8@fuzzy.cz
List:    pgsql-hackers
Hi,

I've been struggling with autovacuum generating a lot of I/O and CPU on some of our systems. After a night spent analyzing this behavior, I believe the current autovacuum accidentally behaves a bit like a stress-test in some corner cases (but I may be seriously wrong; after all, it was a long night).

First, our system really is not a "common" one - we have ~1000 databases of various sizes, each containing up to several thousand tables (several user-defined tables, the rest serving as caches for a reporting application - yes, it's a bit of a weird design, but that's life). This all leads to a pgstat.stat file significantly larger than 60 MB.

Now, the two main pieces of information from pgstat.c are the timer definitions

---------------------------------- pgstat.c : 80 ----------------------------------
#define PGSTAT_STAT_INTERVAL    500     /* Minimum time between stats file
                                         * updates; in milliseconds. */

#define PGSTAT_RETRY_DELAY      10      /* How long to wait between checks for
                                         * a new file; in milliseconds. */

#define PGSTAT_MAX_WAIT_TIME    10000   /* Maximum time to wait for a stats
                                         * file update; in milliseconds. */

#define PGSTAT_INQ_INTERVAL     640     /* How often to ping the collector for
                                         * a new file; in milliseconds. */

#define PGSTAT_RESTART_INTERVAL 60      /* How often to attempt to restart a
                                         * failed statistics collector; in
                                         * seconds. */

#define PGSTAT_POLL_LOOP_COUNT  (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
#define PGSTAT_INQ_LOOP_COUNT   (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-----------------------------------------------------------------------------------

and then this loop (the current HEAD does this a bit differently, but the 9.2 code is a bit more readable and suffers from the same issue):

---------------------------------- pgstat.c : 3560 --------------------------------
    /*
     * Loop until fresh enough stats file is available or we ran out of time.
     * The stats inquiry message is sent repeatedly in case collector drops
     * it; but not every single time, as that just swamps the collector.
     */
    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
    {
        TimestampTz file_ts = 0;

        CHECK_FOR_INTERRUPTS();

        if (pgstat_read_statsfile_timestamp(false, &file_ts) &&
            file_ts >= min_ts)
            break;

        /* Not there or too old, so kick the collector and wait a bit */
        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
            pgstat_send_inquiry(min_ts);

        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
    }

    if (count >= PGSTAT_POLL_LOOP_COUNT)
        elog(WARNING, "pgstat wait timeout");

    /* Autovacuum launcher wants stats about all databases */
    if (IsAutoVacuumLauncherProcess())
        pgStatDBHash = pgstat_read_statsfile(InvalidOid, false);
    else
        pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false);
-----------------------------------------------------------------------------------

What this code does is check the statfile, and if the file is not stale (i.e. the timestamp of the write start is not older than PGSTAT_RETRY_DELAY milliseconds), terminate the loop and read the file.

Now, let's suppose the write takes >10 ms, which is the PGSTAT_RETRY_DELAY value. With our current pgstat.stat file size and number of relations this is quite common - actually, the typical write time in our case is ~100 ms, even with the file moved onto tmpfs. That means that almost all the calls to backend_read_statsfile (which happen in all the pgstat_fetch_stat_*entry calls) result in a continuous stream of inquiries from the autovacuum workers and of writes/reads of the file. We're not getting 'pgstat wait timeout' though, because the file finally gets written before PGSTAT_MAX_WAIT_TIME expires.
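To make the arithmetic concrete, here is a tiny standalone sketch (mine, not PostgreSQL code) of how those constants interact; the 100 ms write time is simply the value measured on our system and is hard-coded for illustration:

------------------------------------------------------------------------------------
#include <stdio.h>

/* constants copied from pgstat.c (values in milliseconds) */
#define PGSTAT_RETRY_DELAY     10
#define PGSTAT_MAX_WAIT_TIME   10000
#define PGSTAT_INQ_INTERVAL    640

#define PGSTAT_POLL_LOOP_COUNT (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
#define PGSTAT_INQ_LOOP_COUNT  (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)

int
main(void)
{
    int write_ms = 100;     /* measured statfile write time on our system */

    /* number of 10 ms sleeps a reader burns waiting for one fresh file */
    int polls = (write_ms + PGSTAT_RETRY_DELAY - 1) / PGSTAT_RETRY_DELAY;

    printf("poll loop limit:      %d iterations\n", PGSTAT_POLL_LOOP_COUNT);
    printf("inquiry sent every:   %d iterations\n", PGSTAT_INQ_LOOP_COUNT);
    printf("polls per fresh file: %d (~%d ms spent sleeping)\n",
           polls, polls * PGSTAT_RETRY_DELAY);

    /*
     * For autovacuum the file counts as stale again PGSTAT_RETRY_DELAY ms
     * after the write started, so with write_ms >> PGSTAT_RETRY_DELAY the
     * file is effectively always stale, and every stats fetch kicks off
     * another inquiry/write/read cycle.
     */
    if (write_ms > PGSTAT_RETRY_DELAY)
        printf("write time > retry delay: file is stale as soon as written\n");

    return 0;
}
------------------------------------------------------------------------------------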
By moving the file onto tmpfs we've minimized the I/O impact, but now the collector and the autovacuum launcher consume ~75% of CPU (i.e. about one core) and do nothing but burn power, because the database is almost read-only. Not a good thing in the "green computing" era, I guess.

First, I'm interested in feedback - did I get all the details right, or am I missing something important?

Next, I'm thinking about ways to solve this:

1) Turning off autovacuum and doing regular VACUUM ANALYZE from cron - certainly an option, but it's more a workaround than a solution, and I'm not very fond of it. Moreover, it fixes only one side of the problem - triggering the statfile writes over and over. The file will still be written anyway, although not as frequently.

2) Tweaking the timer values, especially increasing PGSTAT_RETRY_DELAY and so on, so that a file several seconds old is still considered fresh enough. It would be nice to have these as GUC variables (see the sketch after this list), although we can do another private patch of our own - but more knobs are not always better.

3) Logic detecting the proper PGSTAT_RETRY_DELAY value, based mostly on the time it takes to write the file (e.g. 10x the write time or something).

4) Keeping some sort of "dirty flag" in stat entries, and then writing only info about objects that were modified enough to be eligible for vacuum/analyze (e.g. an increasing number of index scans can't trigger autovacuum, while inserting rows can). Also, I'm not worried about getting a slightly older number of index scans, so 'clean' records might be written less frequently than 'dirty' ones.

5) Splitting the single stat file into multiple pieces, e.g. per database, written separately, so that the autovacuum workers don't need to read all the data even for databases that don't need to be vacuumed. This might be combined with (4) - a rough sketch of that combination follows the list as well.

Ideas? Objections? Preferred options? I kinda like (4+5), although that'd be a pretty big patch and I'm not entirely sure it can be done without breaking other things.

regards
Tomas
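For (2), this is roughly the shape an entry in guc.c's ConfigureNamesInt[] table would take (9.2-style, with the three hook pointers); the variable name pgstat_retry_delay is hypothetical, and the derived PGSTAT_POLL_LOOP_COUNT / PGSTAT_INQ_LOOP_COUNT constants would have to be recomputed from it, so treat this as a fragment showing the idea rather than a patch:

------------------------------------------------------------------------------------
/* hypothetical: replaces the PGSTAT_RETRY_DELAY #define with a variable */
int     pgstat_retry_delay = 10;

/* entry for the ConfigureNamesInt[] table in guc.c */
{
    {"pgstat_retry_delay", PGC_SIGHUP, STATS_COLLECTOR,
        gettext_noop("Time a backend waits between checks for a fresh "
                     "statistics file."),
        NULL,
        GUC_UNIT_MS
    },
    &pgstat_retry_delay,
    10, 1, 10000,
    NULL, NULL, NULL
},
------------------------------------------------------------------------------------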
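And for (4+5), a rough standalone sketch of the direction I have in mind - the changes_since_write counter, the helper name and the per-database filename scheme are all made up for illustration, and real code would of course write through a temp file, fsync and rename, and still handle the global stats somewhere:

------------------------------------------------------------------------------------
#include <stdio.h>

typedef unsigned int Oid;

typedef struct DbStats
{
    Oid     databaseid;
    int     changes_since_write;    /* (4): bumped on insert/update/delete,
                                     * reset when the entry is written */
    /* per-table hash etc. would live here */
} DbStats;

/* (5): write one database's stats into its own file instead of pgstat.stat */
static void
write_db_statsfile(const DbStats *db)
{
    char    path[64];

    snprintf(path, sizeof(path), "pg_stat_tmp/pgstat.%u.stat", db->databaseid);
    printf("writing %s\n", path);
    /* real code: write to a temp file, fsync, rename() into place */
}

int
main(void)
{
    DbStats dbs[] = {
        {16384, 120},       /* busy database - worth rewriting its file */
        {16385, 0}          /* read-only database - nothing to do      */
    };
    int     i;

    for (i = 0; i < 2; i++)
    {
        /* (4): skip databases with nothing interesting to report */
        if (dbs[i].changes_since_write == 0)
            continue;
        write_db_statsfile(&dbs[i]);
    }
    return 0;
}
------------------------------------------------------------------------------------

This way an autovacuum worker for one database would only trigger a rewrite and re-read of that database's file, not of the whole 60 MB.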