Re: [PATCHES] Better default_statistics_target - Mailing list pgsql-hackers
From: Christopher Browne
Subject: Re: [PATCHES] Better default_statistics_target
Date:
Msg-id: d6d6637f0801301813n64fa58eu76385cf8a621907@mail.gmail.com
In response to: Re: [PATCHES] Better default_statistics_target (Decibel! <decibel@decibel.org>)
Responses: Re: [PATCHES] Better default_statistics_target
           Re: [PATCHES] Better default_statistics_target
List: pgsql-hackers
On Jan 30, 2008 5:58 PM, Decibel! <decibel@decibel.org> wrote:
>
> On Mon, Jan 28, 2008 at 11:14:05PM +0000, Christopher Browne wrote:
> > On Dec 6, 2007 6:28 PM, Decibel! <decibel@decibel.org> wrote:
> > > FWIW, I've never seen anything but a performance increase or no change
> > > when going from 10 to 100. In most cases there's a noticeable
> > > improvement since it's common to have over 100k rows in a table, and
> > > there's just no way to capture any kind of a real picture of that with
> > > only 10 buckets.
> >
> > I'd be more inclined to try to do something that was at least somewhat
> > data aware.
> >
> > The "interesting theory" that I'd like to verify if I had a chance
> > would be to run through a by-column tuning using a set of heuristics.
> > My "first order approximation" would be:
> >
> > - If a column defines a unique key, then we know there will be no
> > clustering of values, so no need to increase the count...
> >
> > - If a column contains a datestamp, then the distribution of values is
> > likely to be temporal, so no need to increase the count...
> >
> > - If a column has a highly constricted set of values (e.g. boolean),
> > then we might *decrease* the count.
> >
> > - We might run a query that runs across the table, looking at
> > frequencies of values, and if it finds a lot of repeated values, we'd
> > increase the count.
> >
> > That's a bit "hand-wavy," but it could lead to both increases and
> > decreases in the histogram sizes. Given that, we can expect the
> > overall stat sizes to not forcibly need to grow *enormously*, because
> > we can hope for there to be cases of shrinkage.
>
> I think that before doing any of that you'd be much better off
> investigating how much performance penalty there is for maxing out
> default_statistics_target. If, as I suspect, it's essentially 0 on
> modern hardware, then I don't think it's worth any more effort.
>
> BTW, that investigation wouldn't just be academic either; if we could
> convince ourselves that there normally wasn't any cost associated with a
> high default_statistics_target, we could increase the default, which
> would reduce the amount of traffic we'd see on -performance about bad
> query plans.

There seems to be *plenty* of evidence out there that the performance
penalty would NOT be "essentially zero." Tom points out that eqjoinsel(),
for one, is O(N^2) in the number of MCV values kept.

It seems to me that there are cases where we can *REDUCE* the histogram
width, and if we do that, and then pick and choose the columns where the
width increases, the performance penalty may be "yea, verily *actually* 0."

This fits somewhat with Simon Riggs' discussion earlier in the month about
Segment Exclusion; both represent cases where it is quite likely that
there is emergent data in our tables that can help us to better optimize
our queries.
--
http://linuxfinances.info/info/linuxdistributions.html
"The definition of insanity is doing the same thing over and over and
expecting different results." -- assortedly attributed to Albert Einstein,
Benjamin Franklin, Rita Mae Brown, and Rudyard Kipling
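To make the heuristic list above slightly more concrete: per-column targets
can already be set with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS, and
a first cut at a "data aware" pass could be driven from what a previous
ANALYZE left behind in pg_stats. The query below is only a rough sketch; the
thresholds and target values are invented for illustration, and the
unique-key and datestamp heuristics would additionally need joins against
pg_index and pg_type.

    -- Rough, illustrative sketch only: emit per-column SET STATISTICS
    -- commands based on the n_distinct estimates from a prior ANALYZE.
    -- The thresholds and targets here are arbitrary placeholders.
    SELECT 'ALTER TABLE ' || quote_ident(schemaname) || '.' || quote_ident(tablename)
           || ' ALTER COLUMN ' || quote_ident(attname)
           || ' SET STATISTICS '
           || CASE
                WHEN n_distinct < -0.9 THEN 10           -- (nearly) unique: wide histogram buys little
                WHEN n_distinct BETWEEN 1 AND 4 THEN 10  -- boolean-ish: shrink the target
                WHEN n_distinct BETWEEN 5 AND 10000 THEN 100  -- heavily repeated values: widen it
                ELSE 30                                  -- otherwise stay near the default
              END
           || ';'
      FROM pg_stats
     WHERE schemaname NOT IN ('pg_catalog', 'information_schema');

Feeding the output back through psql would apply the adjusted targets before
the next ANALYZE. None of the numbers above are recommendations; the point is
only that the raw material for a data-aware, per-column pass is already being
collected.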