Re: Tuning planner cost estimates - Mailing list pgsql-performance

From:           Jim C. Nasby <decibel@decibel.org>
Subject:        Re: Tuning planner cost estimates
Msg-id:         20050520224055.GM44623@decibel.org
In response to: Re: Tuning planner cost estimates (Tom Lane <tgl@sss.pgh.pa.us>)
Responses:      Re: Tuning planner cost estimates
List:           pgsql-performance
On Fri, May 20, 2005 at 04:47:38PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <decibel@decibel.org> writes:
> > On Thu, May 19, 2005 at 09:31:47AM -0700, Josh Berkus wrote:
> >> can test our formula for accuracy and precision. However, such a formula
> >> *does* need to take into account concurrent activity, updates, etc ... that
> >> is, it needs to approximately estimate the relative cost on a live database,
> >> not a test one.
>
> > Well, that raises an interesting issue, because AFAIK none of the cost
> > estimate functions currently do that.
>
> I'm unconvinced that it'd be a good idea, either. People already
> complain that the planner's choices change when they ANALYZE; if the
> current load factor or something like that were to be taken into account
> then you'd *really* have a problem with irreproducible behavior.
>
> It might make sense to have something a bit more static, perhaps a GUC
> variable that says "plan on the assumption that there's X amount of
> concurrent activity". I'm not sure what scale to measure X on, nor
> exactly how this would factor into the estimates anyway --- but at least
> this approach would maintain reproducibility of behavior.

Alternatively, allowing the load on the machine to affect query plans
dynamically could be disabled by default, so presumably if you turn it on
it means you know what you're doing. Of course this is all academic until
we have a means to actually measure how much system load affects the
different things we estimate cost for, and I don't see that happening until
we have a system for measuring how changing different input variables
affects costs.

> > Another issue is: what state should the buffers/disk cache be in?
>
> The current cost models are all based on the assumption that every query
> starts from ground zero: nothing in cache. Which is pretty bogus in
> most real-world scenarios. We need to think about ways to tune that
> assumption, too. Maybe this is actually the same discussion, because
> certainly one of the main impacts of a concurrent environment is on what
> you can expect to find in cache.

Well, load doesn't directly affect cache efficiency; it's really a question
of the ratios of how often different things in the database are accessed.
If you wanted to get a crude idea of how likely pages from some relation
are to be in cache, you could take periodic snapshots of I/O stats and see
what percentage of the I/O done in a given time period was on the relation
you're interested in as compared to the rest of the database (a rough query
sketch is further down). But I think this is probably still a 2nd-order
effect.

In terms of a testing system, here's what I'm thinking of. For each cost
estimate there will be a number of input variables we want to vary, and we
then check how changes in them affect run time. Using an index scan as a
simple example, the 1st-order variables will likely be index and table size
(especially in relation to cache size), and correlation. So we need some
kind of test harness that can vary these variables (preferably one at a
time) and run a test case after each change. It would then need to store
the timing info, possibly along with other information (such as explain
output). If I'm the one to write this it'll end up in Perl, since that's
the only language I know that would be able to accomplish this; a rough
skeleton is below. DBT seems to be a reasonable test database to do this
testing with, especially since it would provide a ready means to provide
external load.

Does this sound like a reasonable approach?
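To make the harness idea concrete, here's very roughly the kind of skeleton
I'm picturing (untested; the database name, the cost_test table and the test
query are just placeholders):

    #!/usr/bin/perl
    # Rough harness sketch: vary one input variable (table size here),
    # run a test query after each change, and record run time plus plan.
    use strict;
    use warnings;
    use DBI;
    use Time::HiRes qw(gettimeofday tv_interval);

    my $dbh = DBI->connect('dbi:Pg:dbname=dbt', '', '',
                           { RaiseError => 1, PrintError => 0 });

    for my $rows (10_000, 100_000, 1_000_000) {
        # Rebuild the test table at the current size.
        eval { $dbh->do('DROP TABLE cost_test') };   # ignore if absent
        $dbh->do("CREATE TABLE cost_test AS
                  SELECT i AS id, random() AS val
                    FROM generate_series(1, $rows) AS g(i)");
        $dbh->do('CREATE INDEX cost_test_id ON cost_test (id)');
        $dbh->do('ANALYZE cost_test');

        my $query = 'SELECT count(*) FROM
                       (SELECT * FROM cost_test WHERE id < 1000) AS t';

        # Time the query itself; the count(*) wrapper throws the rows away.
        my $t0 = [gettimeofday];
        $dbh->selectrow_array($query);
        my $elapsed = tv_interval($t0);

        # Capture the plan for later analysis.
        my $plan = join "\n",
            map { $_->[0] } @{ $dbh->selectall_arrayref("EXPLAIN $query") };

        print "rows=$rows elapsed=${elapsed}s\n$plan\n\n";
    }

    $dbh->disconnect;

A real version would obviously loop over more variables (correlation, index
size, etc.) and store results in a table rather than printing them.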
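And to put the earlier I/O-snapshot idea in concrete terms, something like
this (also untested; io_snap is a made-up name for a home-grown snapshot
table you'd populate from pg_statio_user_tables on a schedule):

    CREATE TABLE io_snap AS
        SELECT now() AS snap_time, relname,
               heap_blks_read + coalesce(idx_blks_read, 0) AS blks_read
          FROM pg_statio_user_tables;

    -- later, append another snapshot:
    INSERT INTO io_snap
        SELECT now(), relname, heap_blks_read + coalesce(idx_blks_read, 0)
          FROM pg_statio_user_tables;

    -- fraction of block reads between the two snapshots hitting each relation
    SELECT s2.relname,
           (s2.blks_read - s1.blks_read)
             / (SELECT sum(b2.blks_read - b1.blks_read)::float
                  FROM io_snap b1 JOIN io_snap b2 USING (relname)
                 WHERE b1.snap_time = s1.snap_time
                   AND b2.snap_time = s2.snap_time) AS frac_of_reads
      FROM io_snap s1 JOIN io_snap s2 USING (relname)
     WHERE s1.snap_time = (SELECT min(snap_time) FROM io_snap)
       AND s2.snap_time = (SELECT max(snap_time) FROM io_snap);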
Also, how important do people think it is to use EXPLAIN ANALYZE output
instead of just doing SELECT count(*) FROM (query you actually want to
test)? (The SELECT count(*) wrapper is just a means to throw away the
results, since we don't really want to worry about data transfer times,
etc.) The testing I've done (http://stats.distributed.net/~decibel/base.log)
shows EXPLAIN ANALYZE to be almost 5x slower than SELECT count(*), so it
seems like a big gain if we can avoid it.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"