Re: Query not using index, please explain. - Mailing list pgsql-hackers
From | Richard Poole
---|---
Subject | Re: Query not using index, please explain.
Date |
Msg-id | 20010308203833.C8034@office.vi.net
In response to | Re: Query not using index, please explain. (Matthew Hagerty <mhagerty@voyager.net>)
Responses | Re: Query not using index, please explain.
List | pgsql-hackers
On Thu, Mar 08, 2001 at 02:43:54PM -0500, Matthew Hagerty wrote:
> Richard,
>
> Thanks for the response, I guess I should have included a little more
> information. The table contains 3.5 million rows. The indexes were
> created after the data was imported into the table and I had just run
> vacuum and vacuum analyze on the database before trying the queries and
> sending this question to hackers.
>
> When I turned the seqscan variable off and ran the query with the
> '04-01-2000' date the results were literally instantaneous. Turn the
> seqscan back on and it takes right around 3 minutes. Also, the query for
> any date older than the '04-01-2000' returns zero rows. The actual number
> of rows for the '04-01-2000' select is right around 8300.

This is where you need an expert. :) But I'll have a go and someone
will correct me if I'm wrong...

The statistics which are kept aren't fine-grained enough to be right
here. All the optimiser knows are the highest and lowest values of the
attribute, the most common value (not really useful here), the number
of nulls in the column, and the "dispersion" (a sort of handwavy
measure of how bunched-together the values are). So in a case like
this, where effectively the values are all different over a certain
range, all it can do is (more or less) linearly interpolate in the
range to guess how many tuples are going to be returned. Which means
it's liable to be completely wrong if your values aren't evenly
distributed over their whole range, which it seems they aren't. It
thinks you're going to hit around 1/28 of the tuples in this table,
presumably because '04/01/2000' is about 1/28 of the way from your
minimum value to your maximum.

This sort of thing will all become much better one fine day when we
have much better statistics available, and so many of us want such
things that that fine day will surely come. Until then, I think you're
best off turning off seqscans from your client code when you know
they'll be wrong. (That's what we do here in several similar cases.)

Can someone who really knows this stuff (Tom?) step in if what I've
just said is completely wrong?

> select domain from history_entries group by domain;
>
> To me, since there is an index on domain, it seems like this should be a
> rather fast thing to do? It takes a *very* long time, no matter if I turn
> seqscan on or off.

The reason this is slow is that Postgres always has to look at heap
tuples, even when it's been sent there by indexes. This in turn is
because of the way the storage manager works (only by looking in the
heap can you tell for sure whether a tuple is valid for the current
transaction). So a "group by" always has to look at every heap tuple
(that hasn't been eliminated by a where clause). "select distinct" has
the same problem. I don't think there's a way to do what you want here
with your existing schema without a sequential scan over the table.

Richard
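
P.S. To put numbers on the interpolation guess above, using the
figures from your mail: with only min, max and dispersion to go on,
the planner's estimate for a one-sided range comes out roughly as

    selectivity ≈ ('04-01-2000' - min) / (max - min) ≈ 1/28
    estimated rows ≈ 3,500,000 / 28 ≈ 125,000

against the ~8,300 rows the query actually returns. An estimate that
size makes a sequential scan look cheaper than 125,000-odd index
fetches, which is exactly the wrong call here.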
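
For completeness, the client-side knob I mentioned is just a session
setting, so you can wrap the one misbehaving query in it. A minimal
sketch (entry_date is a stand-in, since your mail doesn't show the
real column name; substitute whatever your date column is called):

    -- Steer the planner away from sequential scans for this session,
    -- run the misestimated query, then restore the default.
    -- "entry_date" is a stand-in for your real date column.
    SET enable_seqscan TO off;
    SELECT * FROM history_entries WHERE entry_date >= '04-01-2000';
    SET enable_seqscan TO on;

Running the SELECT under EXPLAIN before and after the SET is a quick
way to confirm that the plan really switches from a Seq Scan to an
Index Scan, and to see the bad "rows=" estimate that caused all this.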
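
You can watch the group-by behaviour the same way:

    -- The plan will show every tuple being read, whatever the settings.
    EXPLAIN SELECT domain FROM history_entries GROUP BY domain;

Whichever plan you get (a Seq Scan, or an Index Scan with seqscans
disabled), it still has to fetch every heap tuple to check its
visibility, which is why flipping the setting doesn't help this one.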