Re: Querying distinct values from a large table - Mailing list pgsql-performance

From Gregory Stark
Subject Re: Querying distinct values from a large table
Date
Msg-id 87d54wmpvd.fsf@stark.xeocode.com
Whole thread Raw
In response to Re: Querying distinct values from a large table  ("Luke Lonergan" <llonergan@greenplum.com>)
Responses Re: Querying distinct values from a large table
List pgsql-performance
"Luke Lonergan" <llonergan@greenplum.com> writes:

> Chad,
>
> On 1/30/07 6:13 AM, "Chad Wagner" <chad.wagner@gmail.com> wrote:
>
> > Sounds like an opportunity to implement a "Sort Unique" (sort of like a hash,
> > I guess), there is no need to push 3M rows through a sort algorithm to only
> > shave it down to 1848 unique records.
> >
> > I am assuming this optimization just isn't implemented in PostgreSQL?
>
> Not that it helps Igor, but we've implemented single pass sort/unique,
> grouping and limit optimizations and it speeds things up to a single seqscan
> over the data, from 2-5 times faster than a typical external sort.

Fwiw I also implemented this last September when I worked on the LIMIT/SORT
case. It's blocked on the same issue that that patch is: how do we communicate
the info that only unique records are needed down the plan to the sort node?

Note that it's not just a boolean flag that we can include like the random
access flag: The Unique node could in theory have a different key than the
sort node.

Coming at this fresh now a few months later I'm thinking instead of trying to
put this intelligence into the individual nodes, an optimizer pass could run
through the tree looking for Sort nodes that can have this bit tweaked.

That doesn't quite help the LIMIT/SORT case because the limit isn't calculated
until run-time but perhaps it's possible to finesse that issue by providing
either the Limit or Sort node with pointers to other nodes.

(Incidentally I'm not sure where 2-5x comes from. It's entirely dependant on
your data distribution. It's not hard to come up with distributions where it's
1000x as fast and others where there's no speed difference.)

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

pgsql-performance by date:

Previous
From: "Luke Lonergan"
Date:
Subject: Re: Querying distinct values from a large table
Next
From: Bruno Wolff III
Date:
Subject: Re: Querying distinct values from a large table