Re: Collect frequency statistics for arrays - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: Collect frequency statistics for arrays
Date	March 7, 2012 20:52:00
Msg-id	25046.1331167902@sss.pgh.pa.us
In response to	Re: Collect frequency statistics for arrays (Alexander Korotkov <aekorotkov@gmail.com>)
Responses	Re: Collect frequency statistics for arrays Re: Collect frequency statistics for arrays
List	pgsql-hackers

Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Mon, Mar 5, 2012 at 1:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Couldn't we reduce the histogram size when there aren't many
>> different counts?
>> 
>> It seems fairly obvious to me that we could bound the histogram
>> size with (max count - min count + 1), but maybe something even
>> tighter would work; or maybe I'm missing something and this would
>> sacrifice accuracy.

> True. If (max count - min count + 1) is small, enumerating the frequencies
> is both a more compact and a more precise representation. Conversely,
> if (max count - min count + 1) is large, we can run out of
> statistics_target with such a representation. We could use the same
> representation of the count distribution as for scalar column values: MCV
> and HISTOGRAM, but it would require an additional statkind and statistics
> slot. Perhaps you have better ideas?

I wasn't thinking of introducing two different representations,
but just trimming the histogram length when it's larger than necessary.

On reflection my idea above is wrong; for example assume that we have a
column with 900 arrays of length 1 and 100 arrays of length 2.  Going by
what I said, we'd reduce the histogram to {1,2}, which might accurately
capture the set of lengths present but fails to show that 1 is much more
common than 2.  However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
would capture the situation perfectly in one-tenth the space that the
current logic does.
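
As a rough illustration of the example above, the following Python sketch builds histograms of different sizes over 900 lengths of 1 and 100 lengths of 2. The helper names (equi_depth_histogram, estimated_fraction) and the rule of sampling evenly spaced order statistics are assumptions made for this sketch, not the logic of the patch under discussion.

```python
def equi_depth_histogram(values, n_entries):
    """Pick n_entries evenly spaced order statistics (hypothetical helper).

    Assumes n_entries >= 2 and a non-empty input.
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    # Integer arithmetic keeps the sampled positions exact.
    return [ordered[(i * last) // (n_entries - 1)] for i in range(n_entries)]


def estimated_fraction(histogram, value):
    """Share of histogram entries equal to value; a crude frequency estimate."""
    return histogram.count(value) / len(histogram)


# 900 arrays of length 1 and 100 arrays of length 2, as in the example.
lengths = [1] * 900 + [2] * 100

two_entries = equi_depth_histogram(lengths, 2)
ten_entries = equi_depth_histogram(lengths, 10)
hundred_entries = equi_depth_histogram(lengths, 100)

print(two_entries)   # only the set of lengths, no hint of skew
print(ten_entries)   # nine 1s and one 2
print(estimated_fraction(ten_entries, 2),
      estimated_fraction(hundred_entries, 2))
```

Under these assumptions the ten-entry and hundred-entry histograms imply the same share of length-2 arrays, while the two-entry histogram loses the skew entirely.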

More generally, by limiting the histogram to statistics_target entries,
we are already accepting errors of up to 1/(2*statistics_target) in the
accuracy of the bin-boundary representation.  What the above example
shows is that sometimes we could meet the same accuracy requirement with
fewer entries.  I'm not sure how this could be mechanized but it seems
worth thinking about.
        regards, tom lane
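
The message leaves open how the trimming could be mechanized. One naive possibility, shown purely as a sketch, is to try histogram sizes in increasing order and keep the first one whose implied cumulative distribution stays within 1/(2*statistics_target) of the data's. The function names, the even-spacing sampling rule, and the use of a CDF comparison as the accuracy test are all assumptions of this sketch, not anything proposed in the thread.

```python
def equi_depth_histogram(values, n_entries):
    """Pick n_entries evenly spaced order statistics (hypothetical helper)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[(i * last) // (n_entries - 1)] for i in range(n_entries)]


def cdf(sample, value):
    """Fraction of sample entries that are <= value."""
    return sum(1 for x in sample if x <= value) / len(sample)


def smallest_adequate_histogram(values, statistics_target):
    """Return the shortest histogram within the assumed error tolerance.

    Assumes statistics_target >= 2. The tolerance mirrors the
    1/(2*statistics_target) figure from the message; treating it as a
    bound on CDF error is this sketch's own simplification.
    """
    tolerance = 1.0 / (2 * statistics_target)
    distinct = sorted(set(values))
    for n_entries in range(2, statistics_target + 1):
        hist = equi_depth_histogram(values, n_entries)
        worst = max(abs(cdf(hist, v) - cdf(values, v)) for v in distinct)
        if worst <= tolerance:
            return hist
    # Nothing shorter qualified: fall back to the full-size histogram.
    return equi_depth_histogram(values, statistics_target)


lengths = [1] * 900 + [2] * 100
trimmed = smallest_adequate_histogram(lengths, 100)
print(len(trimmed), trimmed)
```

For the 900/100 example with a target of 100, this search settles on the ten-entry histogram from the message. The linear scan re-sorts the data on every attempt, so it is meant to show the idea rather than to be efficient.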
