Thread: [DOCS] pg_stat_statements max size clarification
The following documentation comment has been logged on the website: Page: https://www.postgresql.org/docs/9.6/static/pgstatstatements.html Description: From https://www.postgresql.org/docs/current/static/pgstatstatements.html pg_stat_statements.max (integer) pg_stat_statements.max is the maximum number of statements tracked by the module (i.e., the maximum number of rows in the pg_stat_statements view). If more distinct statements than that are observed, information about the least-executed statements is discarded. The default value is 5000. This parameter can only be set at server start. I wonder if "least-executed" is correct. I'm not an expert and haven't convinced myself of this by examining the code, but I think after N distinct queryid's have been seen, then any additional ones are ignored. But that may not be "least-executed" at all. It's "most-recent" instead. I think we need a new phrase here.
On Sat, Jul 15, 2017 at 5:31 PM, <baron@xaprb.com> wrote: > I wonder if "least-executed" is correct. I'm not an expert and haven't > convinced myself of this by examining the code, but I think after N distinct > queryid's have been seen, then any additional ones are ignored. But that may > not be "least-executed" at all. It's "most-recent" instead. I think we need > a new phrase here. It's most executed since tracking for the entry began, with a special heuristic for queries that take a long time to execute, and might therefore consistently be evicted before execution finishes and costs are tallied (see "sticky entries" stuff for full details). Most executed means the total number of calls, which may not be the best thing to evict on the basis of, but certainly isn't too bad. The way it actually works is that either 5% of all entries or 10 entries are evicted (whichever amount is greatest) once pg_stat_statements.max entries are reached. You're right that this means that the most marginal of entries cannot be usefully tracked, but I doubt that that's much of a problem in practice. It's the usual "recency versus frequency" cache eviction problem, but for query cost tracking purposes if 5,000 entries or 10,000 entries is truly insufficient, then pg_stat_statements probably isn't the right tool. When that many entries seem insufficient that tends to be because pg_stat_statements is arguably too discriminating about what constitutes a distinct query/entry, but that's another problem. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Sat, Jul 15, 2017 at 5:31 PM, <baron@xaprb.com> wrote: >> I wonder if "least-executed" is correct. I'm not an expert and haven't >> convinced myself of this by examining the code, but I think after N distinct >> queryid's have been seen, then any additional ones are ignored. But that may >> not be "least-executed" at all. It's "most-recent" instead. I think we need >> a new phrase here. > It's most executed since tracking for the entry began, with a special > heuristic for queries that take a long time to execute, and might > therefore consistently be evicted before execution finishes and costs > are tallied (see "sticky entries" stuff for full details). Most > executed means the total number of calls, which may not be the best > thing to evict on the basis of, but certainly isn't too bad. > The way it actually works is that either 5% of all entries or 10 > entries are evicted (whichever amount is greatest) once > pg_stat_statements.max entries are reached. You're right that this > means that the most marginal of entries cannot be usefully tracked, > but I doubt that that's much of a problem in practice. It's the usual > "recency versus frequency" cache eviction problem, but for query cost > tracking purposes if 5,000 entries or 10,000 entries is truly > insufficient, then pg_stat_statements probably isn't the right tool. The short answer, really, is that the algorithm is too complicated to be worth explaining in the documentation --- and it's subject to change, anyway. But "least-executed" is a reasonable short description, since frequency of use is a major factor in the decisions. Certainly "most-recent" is flat out wrong. I am not sure whether this complaint is actually meant as a bug report that the algorithm didn't seem to work well on the OP's use case. If so, we'd need a lot more details to have any hope of improving it (and the documentation comments aren't the right submission forum, either). regards, tom lane
I will mull over this and see if I can suggest a short phrase that points to the nuances without explaining them. I think the current phrasing draws attention to itself by seeming dubious, which creates a kind of cognitive dissonance for the reader. This report isn't mean to be a bug report on the functionality.
On Sat, Jul 15, 2017 at 10:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Geoghegan <pg@bowt.ie> writes:
> On Sat, Jul 15, 2017 at 5:31 PM, <baron@xaprb.com> wrote:
>> I wonder if "least-executed" is correct. I'm not an expert and haven't
>> convinced myself of this by examining the code, but I think after N distinct
>> queryid's have been seen, then any additional ones are ignored. But that may
>> not be "least-executed" at all. It's "most-recent" instead. I think we need
>> a new phrase here.
> It's most executed since tracking for the entry began, with a special
> heuristic for queries that take a long time to execute, and might
> therefore consistently be evicted before execution finishes and costs
> are tallied (see "sticky entries" stuff for full details). Most
> executed means the total number of calls, which may not be the best
> thing to evict on the basis of, but certainly isn't too bad.
> The way it actually works is that either 5% of all entries or 10
> entries are evicted (whichever amount is greatest) once
> pg_stat_statements.max entries are reached. You're right that this
> means that the most marginal of entries cannot be usefully tracked,
> but I doubt that that's much of a problem in practice. It's the usual
> "recency versus frequency" cache eviction problem, but for query cost
> tracking purposes if 5,000 entries or 10,000 entries is truly
> insufficient, then pg_stat_statements probably isn't the right tool.
The short answer, really, is that the algorithm is too complicated to be
worth explaining in the documentation --- and it's subject to change,
anyway. But "least-executed" is a reasonable short description, since
frequency of use is a major factor in the decisions. Certainly
"most-recent" is flat out wrong.
I am not sure whether this complaint is actually meant as a bug report
that the algorithm didn't seem to work well on the OP's use case. If so,
we'd need a lot more details to have any hope of improving it (and the
documentation comments aren't the right submission forum, either).
regards, tom lane