Re: Better way of dealing with pgstat wait timeout during buildfarm runs? - Mailing list pgsql-hackers
| From | Tom Lane |
|---|---|
| Subject | Re: Better way of dealing with pgstat wait timeout during buildfarm runs? |
| Date | |
| Msg-id | 4915.1419728105@sss.pgh.pa.us |
| In response to | Re: Better way of dealing with pgstat wait timeout during buildfarm runs? (Heikki Linnakangas <hlinnakangas@vmware.com>) |
| Responses | Re: Better way of dealing with pgstat wait timeout during buildfarm runs? · Re: Better way of dealing with pgstat wait timeout during buildfarm runs? |
| List | pgsql-hackers |
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 12/27/2014 12:16 AM, Alvaro Herrera wrote:
>> Tom Lane wrote:
>>> The argument that autovac workers need fresher stats than anything else
>>> seems pretty dubious to start with.  Why shouldn't we simplify that down
>>> to "they use PGSTAT_STAT_INTERVAL like everybody else"?

>> The point of wanting fresher stats than that, eons ago, was to avoid a
>> worker vacuuming a table that some other worker vacuumed more recently
>> than PGSTAT_STAT_INTERVAL. ...
>> Nowadays we can probably disregard the whole issue, since starting a new
>> vacuum just after the prior one finished should not cause much stress to
>> the system thanks to the visibility map.

> Vacuuming is far from free, even if the visibility map says that most
> pages are visible to all: you still scan all indexes, if you remove any
> dead tuples at all.

With typical autovacuum settings, I kinda doubt that there's much value
in reducing the window for this problem from 500ms to 10ms.  As Alvaro
says, this was just a partial, kluge solution from the start --- if we're
worried about such duplicate vacuuming, we should undertake a real
solution that closes the window altogether.
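(Roughly speaking, the freshness rule at issue boils down to the sketch
below --- a simplified illustration of the cutoff a backend computes when
deciding whether the stats file it has is recent enough, not the exact
pgstat.c source:)

    TimestampTz min_ts;

    /* today: autovacuum workers insist on a much fresher stats file */
    min_ts = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
                 IsAutoVacuumWorkerProcess()
                     ? -PGSTAT_RETRY_DELAY       /* 10 msec */
                     : -PGSTAT_STAT_INTERVAL);   /* 500 msec */

    /* proposed: everybody tolerates PGSTAT_STAT_INTERVAL-old data */
    min_ts = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
                                         -PGSTAT_STAT_INTERVAL);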
In any case, timeouts occurring inside autovacuum are not directly
causing the buildfarm failures, since autovacuum's log entries don't
reflect into regression outputs.  (It's possible that autovacuum's tight
tolerance is contributing to the failures by increasing the load on the
stats collector, but I'm not sure I believe that.)

To get back to that original complaint about buildfarm runs failing,
I notice that essentially all of those failures are coming from "wait
timeout" warnings reported by manual VACUUM commands.  Now, VACUUM itself
has no need to read the stats files.  What's actually causing these
messages is failure to get a timely response in pgstat_vacuum_stat().
So let me propose a drastic solution: let's dike out this bit in vacuum.c:

    /*
     * Send info about dead objects to the statistics collector, unless we are
     * in autovacuum --- autovacuum.c does this for itself.
     */
    if ((vacstmt->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
        pgstat_vacuum_stat();

This would have the effect of transferring all responsibility for
dead-stats-entry cleanup to autovacuum.  For ordinary users, I think
that'd be just fine.  It might be less fine though for people who disable
autovacuum, if there still are any.  To analyze the effects on them,
let's break down what pgstat_vacuum_stat() actually does:

1. It causes drops of whole databases from pgstat's tables.  I claim we
don't need to do that at this level.  There's no issue unless the stats
collector missed the PgStat_MsgDropdb message when the database was
dropped; and even if it did, the negative consequences of having a dead
DB's stats still lying around are pretty minimal since we split up the
stats files.  There will be no extra I/O except for one small record in
the global stats file.  So I think we can well afford to say that with
autovac off, such missed databases only get cleaned up the next time an
autovac is forced for anti-wraparound.

2. Within the current database, it causes drops of pgstat entries for
dropped tables.  This would be a tad annoying if you have lots of
transient tables and no or minimal autovacuuming.  However, lots of
transient tables can be a pain point anyway, because we don't currently
have any mechanism other than pgstat_vacuum_stat() for cleaning up
per-table stats for dropped tables.  It seems like it might be time to
do something about that.  We could probably extend the
PgStat_TableXactStatus infrastructure to keep track of whether a table
was created or deleted in the current transaction, and solve the problem
without very much new code.  Or maybe we could move the responsibility
for reporting stale entries into the code that reads the stats files for
the stats views.

3. Within the current database, it causes drops of pgstat entries for
dropped functions, if you're tracking per-function execution stats.
Since that's not the default, maybe we could get away with saying that
we don't clean up such dead entries very often unless you're running
autovacuum.  I don't really want to propose building the infrastructure
to support dropping such entries retail.

Or, much more simply, we could conclude that it's not that important if
pgstat_vacuum_stat() is unable to get up-to-date stats, and rejigger the
code so that we don't bleat when the file-reading request comes from
that source.  This should be a back-patchable fix, whereas #2 above might
be a bit too complicated for that.  (See the postscript for a rough
sketch of what I mean by that.)

Thoughts?

			regards, tom lane
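PS: by "rejigger the code so that we don't bleat", I have in mind
something along these lines --- an untested sketch only; the
warn_on_timeout flag and the gave_up condition are stand-ins for
whatever the real code would end up looking like:

    /* let callers say whether a stale stats file is worth a warning */
    static void
    backend_read_statsfile(bool warn_on_timeout)
    {
        ...
        /* once the polling loop has given up waiting for a fresh file */
        if (gave_up && warn_on_timeout)
            elog(WARNING, "pgstat wait timeout");
        ...
    }

    void
    pgstat_vacuum_stat(void)
    {
        ...
        /* stale data is fine for garbage-collecting dead stats entries */
        backend_read_statsfile(false);
        ...
    }

    /* every other caller keeps the warning as today */
    backend_read_statsfile(true);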