Re: cache estimates, cache access cost - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: cache estimates, cache access cost
Msg-id: BANLkTinKk5z659bd1-YX=r0iehB6iSJNYQ@mail.gmail.com
In response to: Re: cache estimates, cache access cost (Cédric Villemain <cedric.villemain.debian@gmail.com>)
List: pgsql-hackers
On Thu, May 19, 2011 at 8:19 AM, Cédric Villemain <cedric.villemain.debian@gmail.com> wrote:
> 2011/5/19 Robert Haas <robertmhaas@gmail.com>:
>> On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
>> <cedric.villemain.debian@gmail.com> wrote:
>>> The point is to get ratio in cache, not the distribution of the data
>>> in cache (pgfincore also allows you to see this information).
>>> I don't see how a stable (a server in production) system can have its
>>> ratio moving up and down so fast without known pattern.
>>
>> Really?  It doesn't seem that hard to me.  For example, your nightly
>> reports might use a different set of tables than are active during the
>> day....
>
> yes, this is known pattern, I believe we can work with it.

I guess the case where I agree that this would be relatively static is on something like a busy OLTP system. If different users access different portions of the main tables, which parts of each relation are hot might move around, but overall the percentage of that relation in cache probably won't move around a ton, except perhaps just after running a one-off reporting query, or when the system is first starting up.

But that's not everybody's workload. Imagine a system that is relatively lightly used. Every once in a while someone comes along and runs a big reporting query. Well, the contents of the buffer cache might vary considerably depending on *which* big reporting queries ran most recently.

Also, even if we knew what was going to be in cache at the start of the query, the execution of the query might change things greatly as it runs. For example, imagine a join between some table and itself. If we estimate that none of the data is in cache, we will almost certainly be wrong, because it's likely both sides of the join are going to access some of the same pages. Exactly how many depends on the details of the join condition and whether we choose to implement it by merging, sorting, or hashing. But it's likely going to be more than zero.
This problem can also arise in other contexts - for example, if a query accesses a bunch of large tables, the tables that are accessed later in the computation might be less cached than the ones accessed earlier in the computation, because the earlier accesses pushed parts of the tables accessed later out of cache. Or, if a query requires a large sort, and the value of work_mem is very high (say 1GB), the sort might evict data from cache.

Now maybe none of this matters a bit in practice, but it's something to think about. There was an interesting report on a problem along these lines from Kevin Grittner a while back. He found he needed to set seq_page_cost and random_page_cost differently for the database user that ran the nightly reports, precisely because the degree of caching was very different than it was for the daily activity, and he got bad plans otherwise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
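[For readers of the archive: the per-user workaround described above can be expressed with PostgreSQL's ALTER ROLE ... SET, which attaches per-role defaults for planner cost parameters. The role names and cost values below are illustrative only - the original report did not specify them.]

```sql
-- Hypothetical sketch: a reporting role whose queries mostly hit cold data
-- gets cost settings closer to true disk costs, while an OLTP role whose
-- working set is well cached gets much cheaper page costs.
ALTER ROLE nightly_reports SET seq_page_cost = 1.0;
ALTER ROLE nightly_reports SET random_page_cost = 4.0;

ALTER ROLE app_user SET seq_page_cost = 0.1;
ALTER ROLE app_user SET random_page_cost = 0.1;
```

These settings take effect at the start of each new session for the given role, so the same query can get different plans depending on which user runs it.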