Re: Tuning planner cost estimates - Mailing list pgsql-performance
From: Jim C. Nasby
Subject: Re: Tuning planner cost estimates
Msg-id: 20050520202017.GK44623@decibel.org
In response to: Re: Tuning planner cost estimates (Josh Berkus <josh@agliodbs.com>)
Responses: Re: Tuning planner cost estimates; Re: Tuning planner cost estimates
List: pgsql-performance
On Thu, May 19, 2005 at 09:31:47AM -0700, Josh Berkus wrote:
> > In the case of testing index scans, we need to be able to vary
> > correlation, which so far I've been doing by ordering by different
> > columns. I suspect it will also be important to test with different
> > tuple sizes. There's also the question of whether or not the cache
> > should be flushed for each run or not.
> >
> > Does this sound like a good way to determine actual costs for index
> > scans (and hopefully other access methods in the future)? If so, what
> > would be a good way to implement this?
>
> Well, the problem is that what we need for index scans is a formula,
> rather than a graph. The usefulness of benchmarking index scan cost is
> so that we

True, but having a graphical representation of how different input
variables (such as correlation) affect runtime is a good way to derive
such a formula, or at least point you in the right direction. (There's
a rough sketch of the kind of correlation test setup I mean at the end
of this mail.)

> can test our formula for accuracy and precision. However, such a
> formula *does* need to take into account concurrent activity, updates,
> etc ... that is, it needs to approximately estimate the relative cost
> on a live database, not a test one.

Well, that raises an interesting issue, because AFAIK none of the cost
estimate functions currently do that. Heck, AFAIK even the piggyback
seqscan code doesn't take other seqscans into account.

Another issue is: what state should the buffers/disk cache be in? In
the thread that kicked all this off, Tom noted that my results were
skewed because of caching, so I changed my tests to flush the disk
cache as effectively as I could (by running a program that would
consume enough available memory to just start the box swapping), but I
don't think that's necessarily realistic. Though at least it should
preclude the need to run tests multiple times on an otherwise idle box
in order to 'pre-seed' the cache (not that that's any more realistic).
If you don't use one of these techniques you end up with results that
depend on what test was run before the current one...

> This is also going to be a moving target because Tom's
> in-memory-bitmapping changes relative cost equations.

I thought those all had separate costing functions...? In any case, if
we have a cost estimation tool it will make it much easier to derive
cost estimation functions.

> I think a first step would be, in fact, to develop a tool that allows
> us to put EXPLAIN ANALYZE results in a database table. Without that,
> there is no possibility of statistical-scale analysis.

Rather than trying to parse all possible output, ISTM it would be much
better if there was a way to access the info directly. Would it be
difficult to have an option that produces output that is a set of
different fields? I'm thinking something like the following (a rough
table definition along these lines is sketched below):

    Level (basically how far something's indented)
    Parent node (what node a child node is feeding)
    node_id (some kind of identifier for each step)
    operation
    (estimate|actual)_(startup|total|rows|width|loops)
    other (something to hold index condition, filter, etc)

But ultimately, I'm not sure if this is really required or not, because
I don't see that we need to use EXPLAIN when running queries. In fact,
it's possibly desirable that we don't, because of the overhead it
incurs. We would want to log an EXPLAIN (maybe ANALYZE) just to make
sure we knew what the optimizer was doing, but I think we shouldn't
need the info to produce cost estimates.
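To make the field list above concrete, here's a rough sketch of the
kind of table such a tool could load plan nodes into. This is just an
illustration, not an agreed-on format; the column names and types are
mine:

    -- One row per plan node; run_id ties the nodes of one test run
    -- together.  All names here are hypothetical.
    CREATE TABLE explain_node (
        run_id      int,     -- which test run the plan came from
        node_id     int,     -- identifier for each step in the plan
        parent_node int,     -- node this node feeds (NULL for the root)
        level       int,     -- how far the node is indented
        operation   text,    -- e.g. 'Index Scan', 'Sort'
        est_startup float8,  -- estimated startup cost
        est_total   float8,  -- estimated total cost
        est_rows    float8,  -- estimated row count
        est_width   int,     -- estimated row width in bytes
        act_startup float8,  -- actual startup time, in ms
        act_total   float8,  -- actual total time, in ms
        act_rows    float8,  -- actual row count
        act_loops   int,     -- number of times the node was executed
        other       text,    -- index condition, filter, etc.
        PRIMARY KEY (run_id, node_id)
    );

And for the correlation tests mentioned at the top, this is roughly the
kind of setup I have in mind: build one copy of a table in index order
and one shuffled copy, then check what the planner thinks the
correlation is. Table and column names are again just illustrative:

    -- Physically ordered copy: correlation on v should come out near 1.
    CREATE TABLE corr_high AS
        SELECT g AS v, repeat('x', 100) AS pad
        FROM generate_series(1, 1000000) AS g;

    -- Shuffled copy: correlation on v should come out near 0.
    CREATE TABLE corr_low AS
        SELECT v, pad FROM corr_high ORDER BY random();

    CREATE INDEX corr_high_v_idx ON corr_high (v);
    CREATE INDEX corr_low_v_idx ON corr_low (v);
    ANALYZE corr_high;
    ANALYZE corr_low;

    -- What the planner will actually use in its cost estimates:
    SELECT tablename, attname, correlation
    FROM pg_stats
    WHERE tablename IN ('corr_high', 'corr_low') AND attname = 'v';

Timing index scans against the two tables (with the cache flushed or
pre-seeded consistently, per the above) then gives you the two
endpoints of the correlation range to fit a formula against.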
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"