Thread: SELECT DISTINCT

SELECT DISTINCT

From
salah jubeh
Date:
Hello Guys,

During my work, I have seen a common practice of using DISTINCT . Some will argue that developer should know the effect of using it, but keep in mind not all developers are gurus in RDBMs. Normally, developers work in a narrow domain. Using DISTINCT might lead to a huge performance  degradation because of sort and filter or hashaggregate operations. I think also the rules in determining if the distinct is requiered or not  is moderate in complexity.

Example: Please see how much extra cost we have for 119 record

EXPLAIN ANALYZE SELECT  DISTINCT * FROM pg_aggregate;

"HashAggregate  (cost=3.98..5.17 rows=119 width=28) (actual time=0.525..0.743 rows=119 loops=1)"
"  ->  Seq Scan on pg_aggregate  (cost=0.00..2.19 rows=119 width=28) (actual time=0.011..0.195 rows=119 loops=1)"
"Total runtime: 1.008 ms"

I think any query that returns a unique column (primary key, unique) which is not duplicated in some way (join) can use this optimisation technique.

EXAMPLE:

TABLE A (a1 (uinque), a2, ... , an)

SELECT DISTINCT a1, subset of (a2...an)  FROM A;   -- will return always a distinct  result.

When it comes to joins and nested queries , I do not have clear idea how this  can be implemented. But I could do some search.

Regards

Re: SELECT DISTINCT

From
"Kevin Grittner"
Date:
salah jubeh wrote:

> During my work, I have seen a common practice of using DISTINCT.
> Some will argue that developer should know the effect of using
> it, but keep in mind not all developers are gurus in RDBMs.

"SELECT DISTINCT eliminates duplicate rows from the result."
Personally, I would not want to keep a programmer who could not
grasp that concept.

> Using DISTINCT might lead to a huge performance  degradation
> because of sort and filter or hashaggregate operations.

More than that, I have often seen it added when a JOIN was
inadequately constrained and the programmer saw duplicates in the
output and added DISTINCT in response. The problem you have beyond
performance in such cases is that it is usually not showing correct
results; and worse, they are wrong but *plausible*. I would not
want to encourage that kind of sloppy thinking.

> I think any query that returns a unique column (primary key,
> unique) which is not duplicated in some way (join) can use this
> optimisation technique.

I agree that if the planner searched for that, there would be cases
where the DISTINCT keyword could be determined to be a noise word.
The problem with that is that such searching in the planner would
not be free -- doing it accurately would increase planning cost for
every query which was legitimately using the feature. The community
is generally loath to add runtime costs to properly written queries
to try to minimize the penalty paid by those who specify features
they don't need.

-Kevin


Re: SELECT DISTINCT

From
Tom Lane
Date:
"Kevin Grittner" <kgrittn@mail.com> writes:
> salah jubeh wrote:
>> I think any query that returns a unique column (primary key,
>> unique) which is not duplicated in some way (join) can use this
>> optimisation technique.

> I agree that if the planner searched for that, there would be cases
> where the DISTINCT keyword could be determined to be a noise word.
> The problem with that is that such searching in the planner would
> not be free -- doing it accurately would increase planning cost for
> every query which was legitimately using the feature.

There is actually infrastructure in the planner that could be used for
this, at least for the case where the query selects from just one base
table.  I tend to agree though that adding such a check to every
DISTINCT query would be slowing everybody down to benefit only dubious
queries.  It would be easier to swallow the overhead if there were a
solid case that reasonably-written queries might sometimes need the
optimization.

            regards, tom lane


Re: SELECT DISTINCT

From
salah jubeh
Date:
Thanks for the quick response, I would like to add a patch to postgres to do a simple check for  one table,  since this is straight forward. Unfortunatly, I am not familiar too much with postgres source code. So my question is , what are the source files that I need to check and can you please point me to the appropriate developer guide.
 
Regards



From: Tom Lane <tgl@sss.pgh.pa.us>
To: Kevin Grittner <kgrittn@mail.com>
Cc: salah jubeh <s_jubeh@yahoo.com>; pgsql <pgsql-general@postgresql.org>
Sent: Friday, January 18, 2013 2:02 AM
Subject: Re: [GENERAL] SELECT DISTINCT

"Kevin Grittner" <kgrittn@mail.com> writes:
> salah jubeh wrote:
>> I think any query that returns a unique column (primary key,
>> unique) which is not duplicated in some way (join) can use this
>> optimisation technique.

> I agree that if the planner searched for that, there would be cases
> where the DISTINCT keyword could be determined to be a noise word.
> The problem with that is that such searching in the planner would
> not be free -- doing it accurately would increase planning cost for
> every query which was legitimately using the feature.

There is actually infrastructure in the planner that could be used for
this, at least for the case where the query selects from just one base
table.  I tend to agree though that adding such a check to every
DISTINCT query would be slowing everybody down to benefit only dubious
queries.  It would be easier to swallow the overhead if there were a
solid case that reasonably-written queries might sometimes need the
optimization.

            regards, tom lane


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general