
Re: [PERFORM] Bad n_distinct estimation; hacks suggested? - Mailing list pgsql-hackers

From	John A Meinel
Subject	Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
Date	May 4, 2005 10:30:07
Msg-id	42781B1D.7070101@arbash-meinel.com
In response to	Re: [PERFORM] Bad n_distinct estimation; hacks suggested? (Josh Berkus <josh@agliodbs.com>)
List	pgsql-hackers


Josh Berkus wrote:
> Mischa,
>
>
>>Okay, although given the track record of page-based sampling for
>>n-distinct, it's a bit like looking for your keys under the streetlight,
>>rather than in the alley where you dropped them :-)
>
>
> Bad analogy, but funny.
>
> The issue with page-based vs. pure random sampling is that to do, for example,
> 10% of rows purely randomly would actually mean loading 50% of pages.  With
> 20% of rows, you might as well scan the whole table.
>
> Unless, of course, we use indexes for sampling, which seems like a *really
> good* idea to me ....
>

But doesn't an index only sample one column at a time, whereas with
page-based sampling you can sample all of the columns at once? And not
all columns would have indexes, though it could be assumed that if a
column doesn't have an index, then it doesn't matter as much for
calculations such as n_distinct.

But if you had 5 indexed columns in your table, then doing it index-wise
means you would have to make 5 passes instead of just one.
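
To make the pass count concrete, here is a toy sketch, not anything from
the actual ANALYZE code. It assumes a table is just a list of pages, each
page a list of row dicts, and an "index" is a plain list of one column's
values. Both function names and the data layout are hypothetical.

```python
import random


def page_sample_distinct(pages, sample_pages, rng):
    """One pass: sample whole pages and count distinct values per column."""
    chosen = rng.sample(pages, sample_pages)
    distinct = {}
    for page in chosen:
        for row in page:
            # Every sampled row contributes a value for every column.
            for col, val in row.items():
                distinct.setdefault(col, set()).add(val)
    return 1, {col: len(vals) for col, vals in distinct.items()}


def index_sample_distinct(indexes, sample_entries, rng):
    """One pass per index: each index yields only its own column's values."""
    passes = 0
    result = {}
    for col, entries in indexes.items():
        passes += 1
        result[col] = len(set(rng.sample(entries, sample_entries)))
    return passes, result
```

With 5 indexed columns the second function reports 5 passes and the
first reports 1, which is the whole point. The model ignores how much
I/O each pass costs.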

Though I agree that page-based sampling is important for performance
reasons.
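
As a rough check on the "10% of rows means 50% of pages" figure quoted
above, here is a minimal sketch. It assumes rows are spread uniformly
over pages and that each row is picked independently with the same
probability. The function name and the rows-per-page values are
illustrative assumptions only.

```python
def expected_page_fraction(row_fraction: float, rows_per_page: int) -> float:
    """Expected fraction of pages holding at least one sampled row.

    A page is skipped only if none of its rows is sampled, which under
    the independence assumption happens with probability
    (1 - row_fraction) ** rows_per_page.
    """
    if not 0.0 <= row_fraction <= 1.0:
        raise ValueError("row_fraction must be between 0 and 1")
    if rows_per_page < 1:
        raise ValueError("rows_per_page must be at least 1")
    return 1.0 - (1.0 - row_fraction) ** rows_per_page


if __name__ == "__main__":
    # Hypothetical row densities, chosen only to show the trend.
    for rows_per_page in (7, 50, 100):
        frac = expected_page_fraction(0.10, rows_per_page)
        print(f"{rows_per_page:3d} rows/page: {frac:.1%} of pages touched")
```

Under these assumptions, about 7 rows per page already gives roughly
half the pages for a 10% row sample, and denser pages push the fraction
higher still.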

John
=:->

