Re: Query optimizer 8.0.1 (and 8.0) - Mailing list pgsql-hackers
From | Ron Mayer
---|---
Subject | Re: Query optimizer 8.0.1 (and 8.0)
Date |
Msg-id | 4210E61A.7090503@cheapcomplexdevices.com
In response to | Re: Query optimizer 8.0.1 (and 8.0) (pgsql@mohawksoft.com)
Responses | Re: Query optimizer 8.0.1 (and 8.0)
List | pgsql-hackers
pgsql@mohawksoft.com wrote:
>
> You know, I don't think a lot of people "get" the issues I was describing,
> or maybe they don't believe it, I don't know, but I think that it would
> be a useful contrib project to create an 'analyze_special('table',
> 'column', 'method')' function that does a better job at calculating the
> stats for tables that contain multiple trend waveforms. A separate function
> will probably work well, as the trends within the data probably only apply
> to specific rows.

I've done something similar, but simpler, for the Census/TIGER data.

If you loaded each TIGER file sequentially, like I did, the data was all
grouped by county when it was loaded -- so basically all the geographical
columns (zip, county, state, census tract) are actually grouped tightly on
disk, though ANALYZE can't see this because they're not strictly ascending
or descending.

Since I merely observed that the geospatial columns were all clustered
pretty well, I simply set the correlation to the same fairly large value
in the pg_statistic rows for all the geographic columns, with a bunch of
statements like this:

  update pg_statistic set stanumbers3[1] = 0.8
   where starelid = 31412043 and staattnum = 3;

Instead of a complicated analyze function, how about just letting the user
"tell" the optimizer that a column is clustered well, with a function like:

  force_correlation_stat('schema', 'table', 'column', 'value')

That would actually work well for your data, since you know your distinct
values lie on a relatively small number of pages. If you merely did:

  force_correlation('tiger','rt1','zipl',0.8);
  force_correlation('tiger','rt1','statel',0.8);
  force_correlation('tiger','rt1','countyl',0.8);

the optimizer would then see that not many disk pages would need to be hit
for a single zip code.

> It's interesting, because I don't think it needs to calculate a perfect
> representation of the data so much as a better clue to its nature for the
> optimizer.

Indeed. Using the very arbitrary number "0.8" for the correlation on all
the geography-related columns in the TIGER data, the optimizer guessed a
good plan almost every time on my company's 200GB geographical database.

> When I get the time (or can get someone to pay me to do it) I'm going to
> try it.

I still suspect that the correct way to do it would be to use not the
single "correlation", but two stats: one for estimating how sequential or
random the accesses would be, and one for estimating the number of pages
that would be hit. I think the existing correlation does well for the
first estimate, but for many data sets it does poorly for the second.
If you want to start a contrib project that looks into additional stats
that may help, I might be interested.

   Ron
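P.S. Something like this rough PL/pgSQL sketch might be all that
force_correlation_stat() needs to be -- it just automates the pg_statistic
update shown above. It's untested, the names are made up, and it guesses
that the correlation lives in whichever stakindN slot equals 3:

  create or replace function force_correlation_stat(p_schema text, p_table text,
                                                     p_column text, p_corr real)
  returns void as $$
  declare
      v_relid  oid;
      v_attnum int2;
  begin
      -- look up the relation and the column being adjusted
      select c.oid into v_relid
        from pg_class c
        join pg_namespace n on n.oid = c.relnamespace
       where n.nspname = p_schema and c.relname = p_table;

      select a.attnum into v_attnum
        from pg_attribute a
       where a.attrelid = v_relid and a.attname = p_column;

      -- overwrite whichever slot currently holds the correlation (stakind = 3);
      -- nothing is recomputed, the value is simply forced
      update pg_statistic
         set stanumbers1 = case when stakind1 = 3 then array[p_corr] else stanumbers1 end,
             stanumbers2 = case when stakind2 = 3 then array[p_corr] else stanumbers2 end,
             stanumbers3 = case when stakind3 = 3 then array[p_corr] else stanumbers3 end,
             stanumbers4 = case when stakind4 = 3 then array[p_corr] else stanumbers4 end
       where starelid = v_relid and staattnum = v_attnum;

      return;
  end;
  $$ language plpgsql;

  -- e.g.  select force_correlation_stat('tiger', 'rt1', 'zipl', 0.8);

Like the hand-written update, it has to be run as superuser, and ANALYZE
will overwrite the forced value the next time it runs on the table.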