Re: WIP: cross column correlation ... - Mailing list pgsql-hackers
From | Robert Haas
---|---
Subject | Re: WIP: cross column correlation ...
Date | 
Msg-id | AANLkTikN7hidgDjdVKWR=u2Pxy5m-Hmoo_m-pj6baLnO@mail.gmail.com
In response to | Re: WIP: cross column correlation ... (Cédric Villemain <cedric.villemain.debian@gmail.com>)
Responses | Re: WIP: cross column correlation ...
List | pgsql-hackers
2011/2/25 Cédric Villemain <cedric.villemain.debian@gmail.com>:
>> All that having been said, I think that while Josh is thinking fuzzily
>> about the mathematics of his proposal, the basic idea is pretty
>> sensible. It is not easy - likely not possible - for the system to
>> have a good idea which things will be in some kind of cache at the
>> time the query is executed; it could even change mid-query. The
>> execution of one part of the query could evict from the cache data
>> which some other part of the plan assumed would be cached. But DBAs
>> frequently have a very good idea of which stuff is in cache - they can
>> make observations over a period of time and then adjust settings and
>> then observe some more and adjust some more.
>
> I believe we can maintain a small map of the areas of a relation which
> are in the OS buffer cache (shared buffers move more), or at least a
> percentage of the relation in OS cache. Getting the autovacuum daemon
> to update those maps/counters might be enough, and it would be easy to
> do - it is very close to what auto-analyze does. My observation is
> that the percentage in cache is stable on a production workload after
> the few tens of minutes needed to warm the server.

I don't think we can assume that will be true in all workloads. Imagine a server doing batch processing. People submit large batches of work that take, say, an hour to complete. Not all batches use the same set of tables - maybe they even run in different databases. After a big batch process finishes crunching numbers in database A, very little of database B will be cached. But it's not necessarily right to assume that it will stay that way when we start running queries for a new batch in database B, although that assumption is more likely to hold for large tables (which will take a long time to become meaningfully cached, if they ever do) than for small ones. It could also lead to strange behavior where batches run much faster or slower depending on which batch immediately preceded them. If we're going to touch something a lot of times, it'd be better to bite the bullet and read it all in rather than doing extra work to avoid that; but if we're only going to touch it once, then not so much.

You might also see this issue on systems that run OLTP workloads all day and then do some batch processing at night to get ready for the next business day. Kevin Grittner has written previously about those jobs needing some different settings in his environment (I don't remember which settings at the moment).

Suppose the batch process is going to issue a query that can be planned in one of two possible ways. The first plan reads 10% of a relation; it takes 200 s to execute if that portion is not cached and 180 s if it is. The second plan reads the whole relation; it takes 300 s to execute if the relation is not cached and 100 s if it is. At the start of the batch run, the relation won't be cached, because it's used *only* by the overnight job and not by the daily OLTP traffic. Which way should we execute the query? The answer is that if the batch job needs to execute that query only *once*, we should do it the first way. But if it needs to execute it three or more times, the second way is better - provided we use the second plan every time. And if we start out with the first plan, we're always better off sticking with it *unless* we know that we're going to repeat the query at least twice more after the iteration we're currently planning.
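To make that arithmetic concrete, here's a back-of-the-envelope sketch (standalone C, with the costs hard-coded from the example above - nothing here is PostgreSQL code) that tabulates the cumulative runtime of each strategy:

#include <stdio.h>

/*
 * Cumulative cost of repeating the example query under each plan.
 * Plan 1 reads 10% of the relation: 200 s cold, 180 s once that
 * portion is cached.  Plan 2 reads the whole relation: 300 s cold,
 * 100 s once it is cached.  Only the first run is cold.
 */
int
main(void)
{
	double		plan1 = 0;
	double		plan2 = 0;
	int			run;

	for (run = 1; run <= 5; run++)
	{
		plan1 += (run == 1) ? 200 : 180;
		plan2 += (run == 1) ? 300 : 100;
		printf("after %d run(s): plan 1 = %.0f s, plan 2 = %.0f s (%s)\n",
			   run, plan1, plan2,
			   plan1 <= plan2 ? "plan 1 ahead" : "plan 2 ahead");
	}
	return 0;
}

Plan 1 wins after one run (200 s vs. 300 s) and is still ahead after two (380 s vs. 400 s), but plan 2 pulls ahead at the third execution (560 s vs. 500 s) and keeps winning from there - hence the break-even at three executions.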
To make the right decision, the query planner needs a crystal ball. Or, a little help from the DBA.

> What should really help here is to have hooks in the cost functions to
> test those ideas without the need to patch postgresql-core a lot. Will
> it be OK to have hooks, or will it add too much CPU consumption in a
> sensitive part of the code?

Depends on where you put them, I guess. Hooks are pretty cheap, but they're also pretty hard to use.
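For illustration, the usual hook convention costs one pointer test per call when no hook is installed. The sketch below applies it to cost_seqscan(); note that cost_seqscan_hook is hypothetical - no such hook exists in core, which is exactly what's being asked about here:

/*
 * HYPOTHETICAL sketch of PostgreSQL's usual hook convention applied to
 * a cost function.  There is no cost_seqscan_hook in core; this only
 * illustrates the mechanism and its per-call overhead.
 */

/* --- in core, e.g. costsize.c (hypothetical) --- */
typedef void (*cost_seqscan_hook_type) (Path *path, PlannerInfo *root,
										RelOptInfo *baserel);
cost_seqscan_hook_type cost_seqscan_hook = NULL;

void
cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel)
{
	if (cost_seqscan_hook)
	{
		/* one pointer test per call when unused; that's the whole cost */
		(*cost_seqscan_hook) (path, root, baserel);
		return;
	}
	/* ... existing cost calculation ... */
}

/* --- in an extension --- */
#include "postgres.h"
#include "fmgr.h"
#include "optimizer/cost.h"

PG_MODULE_MAGIC;

static cost_seqscan_hook_type prev_cost_seqscan_hook = NULL;

static void
my_cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel)
{
	if (prev_cost_seqscan_hook)
		(*prev_cost_seqscan_hook) (path, root, baserel);

	/* ... scale path->total_cost by an estimated cached fraction ... */
}

void
_PG_init(void)
{
	prev_cost_seqscan_hook = cost_seqscan_hook;
	cost_seqscan_hook = my_cost_seqscan;
}

The pattern - save the previous hook, chain to it, install your own in _PG_init() - is the same one core already uses for planner_hook and the executor hooks.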
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company