Thread: [HACKERS] Removing useless DISTINCT clauses
In [1] we made a change to process the GROUP BY clause to remove any group by items that are functionally dependent on some other GROUP BY items. This really just checks if a table's PK columns are entirely present in the GROUP BY clause and removes anything else belonging to that table.

All this seems to work well, but I totally failed to consider that the exact same thing applies to DISTINCT too. Over in [2], Rui Liu mentions that the planner could do a better job for his case.

Using Rui Liu's example:

CREATE TABLE test_tbl (k INT PRIMARY KEY, col text);
INSERT INTO test_tbl SELECT generate_series(1,10000000), 'test';

Master:

postgres=# explain analyze verbose select distinct col, k from test_tbl order by k limit 1000;
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1658556.19..1658563.69 rows=1000 width=9) (actual time=8934.962..8935.495 rows=1000 loops=1)
   Output: col, k
   ->  Unique  (cost=1658556.19..1733557.50 rows=10000175 width=9) (actual time=8934.961..8935.460 rows=1000 loops=1)
         Output: col, k
         ->  Sort  (cost=1658556.19..1683556.63 rows=10000175 width=9) (actual time=8934.959..8935.149 rows=1000 loops=1)
               Output: col, k
               Sort Key: test_tbl.k, test_tbl.col
               Sort Method: external merge  Disk: 215128kB
               ->  Seq Scan on public.test_tbl  (cost=0.00..154056.75 rows=10000175 width=9) (actual time=0.062..1901.728 rows=10000000 loops=1)
                     Output: col, k
 Planning time: 0.092 ms
 Execution time: 8958.687 ms
(12 rows)

Patched:

postgres=# explain analyze verbose select distinct col, k from test_tbl order by k limit 1000;
                                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.44..34.31 rows=1000 width=9) (actual time=0.030..0.895 rows=1000 loops=1)
   Output: col, k
   ->  Unique  (cost=0.44..338745.50 rows=10000175 width=9) (actual time=0.029..0.814 rows=1000 loops=1)
         Output: col, k
         ->  Index Scan using test_tbl_pkey on public.test_tbl  (cost=0.44..313745.06 rows=10000175 width=9) (actual time=0.026..0.452 rows=1000 loops=1)
               Output: col, k
 Planning time: 0.152 ms
 Execution time: 0.985 ms
(8 rows)

A patch to implement this is attached. I'll add it to the Jan commitfest. (I don't expect anyone to look at this before then.)

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d4c3a156cb46dcd1f9f97a8011bd94c544079bb5
[2] https://www.postgresql.org/message-id/flat/CAKJS1f9q0j3BgMUsDbtf9%3DecfVLnqvkYB44MXj0gpVuamcN8Xw%40mail.gmail.com#CAKJS1f9q0j3BgMUsDbtf9=ecfVLnqvkYB44MXj0gpVuamcN8Xw@mail.gmail.com

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment
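The removal rule described in this message can be modelled in a few lines of Python. This is an illustrative sketch only: the actual patch operates on the planner's clause lists in C, and the function and argument names below are invented for the example.

```python
def remove_useless_distinct_items(distinct_items, table_pks):
    """distinct_items: list of (table, column) pairs appearing in DISTINCT.
    table_pks: {table: set of its primary-key columns}.
    If a table's entire PK is present in the DISTINCT list, that table's
    other columns are functionally dependent on it and can be dropped."""
    present = {}
    for tbl, col in distinct_items:
        present.setdefault(tbl, set()).add(col)
    kept = []
    for tbl, col in distinct_items:
        pk = table_pks.get(tbl)
        if pk and pk <= present[tbl] and col not in pk:
            continue  # functionally dependent on the PK: redundant
        kept.append((tbl, col))
    return kept

# Rui Liu's query: DISTINCT col, k where k is test_tbl's PK, so "col" is
# redundant and only "k" needs to be made distinct.
print(remove_useless_distinct_items([("test_tbl", "col"), ("test_tbl", "k")],
                                    {"test_tbl": {"k"}}))
# -> [('test_tbl', 'k')]
```

With only the PK column left in the DISTINCT clause, an index scan on the PK can satisfy both the DISTINCT and the ORDER BY, which is exactly what the patched plan above shows.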
On Mon, Nov 6, 2017 at 1:16 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> A patch to implement this is attached.
Couldn't the Unique node be removed entirely? If k is a primary key, you can't have duplicates in need of removal.
Or would that be a subject for a different patch?
I think remove_functionally_dependant_groupclauses should have a more generic name, like remove_functionally_dependant_clauses.
Jeff
Hi Jeff,

Thanks for looking at the patch.

On 6 January 2018 at 10:34, Jeff Janes <jeff.janes@gmail.com> wrote:
> Couldn't the Unique node be removed entirely? If k is a primary key, you
> can't have duplicates in need of removal.

It probably could be, but any join could duplicate those rows, so we'd only be able to do that when there are no joins at all, and I'm not sure we'd want to special case that; it seems too naive a check to be worthwhile. We could do a lot better...

> Or would that be a subject for a different patch?

Yeah, I'd rather do that fix properly, and I do have ideas on how to, it's just a lot more work (basically testing if the joins can duplicate rows or not (unique joins++)).

> I think remove_functionally_dependant_groupclauses should have a more
> generic name, like remove_functionally_dependant_clauses.

It's been a while since I looked at this. I remember thinking something similar at the time, but I must not have changed it. I'll look again and post an updated patch.

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
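The join point above can be seen with a tiny Python model (an editor's toy illustration, not PostgreSQL code): even when k is unique on one side, an inner join can emit the same k twice, so the Unique node cannot simply be dropped in the general case.

```python
# Left rows keyed by a unique k; the right side has two rows matching k=1.
left = [(1, "a"), (2, "b")]
right = [(1, "x"), (1, "y")]

# A naive nested-loop inner join on k.
joined = [(lk, lc, rv) for (lk, lc) in left for (rk, rv) in right if lk == rk]
print(joined)  # k=1 appears twice, so DISTINCT still has duplicates to remove
```

Proving that no join in the query can duplicate rows (the "unique joins" idea mentioned above) is what would be needed before the Unique node itself could be removed.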
On 6 January 2018 at 23:08, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I think remove_functionally_dependant_groupclauses should have a more
>> generic name, like remove_functionally_dependant_clauses.
>
> It's been a while since I looked at this. I remember thinking
> something similar at the time but I must have not changed it.
>
> I'll look again and post an updated patch.

I've attached a patch that renames the function as you've suggested.

Thanks again for the review.

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes:
> [ remove_useless_distinct_clauses_v2.patch ]

This is a cute idea, but I'm troubled by a couple of points:

1. Once you don't have all the tlist items shown in DISTINCT, it really is more like DISTINCT ON, seems like. I am not sure it's a good idea to set hasDistinctOn, because that engages some planner behaviors we probably don't want, but I'm also not sure we can get away with just ignoring the difference. As an example, in allpaths.c there are assorted assumptions that having a distinctClause but !hasDistinctOn means all output columns of a subquery are listed in the distinctClause.

2. There's a comment in planner.c to the effect that

     * When we have DISTINCT ON, we must sort by the more rigorous of
     * DISTINCT and ORDER BY, else it won't have the desired behavior.
     * Also, if we do have to do an explicit sort, we might as well use
     * the more rigorous ordering to avoid a second sort later.  (Note
     * that the parser will have ensured that one clause is a prefix of
     * the other.)

Removing random elements of the distinctClause will break its correspondence with the sortClause, with probably bad results.

I do not remember for sure at the moment, but it may be that this correspondence is only important for the case of DISTINCT ON, in which case we could dodge the problem by not applying the optimization unless it's plain DISTINCT. That doesn't help us with point 1 though.

BTW, my dictionary says it's "dependent" not "dependant".

			regards, tom lane
Hi Tom,

Thanks for looking at this.

On 10 January 2018 at 09:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This is a cute idea, but I'm troubled by a couple of points:
>
> 1. Once you don't have all the tlist items shown in DISTINCT, it really is
> more like DISTINCT ON, seems like. I am not sure it's a good idea to set
> hasDistinctOn, because that engages some planner behaviors we probably
> don't want, but I'm also not sure we can get away with just ignoring the
> difference. As an example, in allpaths.c there are assorted assumptions
> that having a distinctClause but !hasDistinctOn means all output columns
> of a subquery are listed in the distinctClause.

hmm, I see:

    /*
     * If subquery has regular DISTINCT (not DISTINCT ON), we're wasting our
     * time: all its output columns must be used in the distinctClause.
     */
    if (subquery->distinctClause && !subquery->hasDistinctOn)
        return;

I think that particular one would just fail to remove columns from the subquery if there's a DISTINCT clause (not DISTINCT ON). That seems more like a missed optimisation than a future bug. Although it probably would be a good idea to get rid of the assumption that a non-NIL distinctClause && !hasDistinctOn means that all output target items are part of that distinct clause.

I don't see anything else in allpaths.c that would be affected, keeping in mind that we'll never completely remove a distinctClause, just possibly remove some items from the list. Everything else just seems to be checking distinctClause != NIL, or is only doing something if hasDistinctOn == true.

I'll do some more analysis on the places where distinctClause is being used to check what's safe.

> 2. There's a comment in planner.c to the effect that
>
>     * When we have DISTINCT ON, we must sort by the more rigorous of
>     * DISTINCT and ORDER BY, else it won't have the desired behavior.
>     * Also, if we do have to do an explicit sort, we might as well use
>     * the more rigorous ordering to avoid a second sort later.  (Note
>     * that the parser will have ensured that one clause is a prefix of
>     * the other.)
>
> Removing random elements of the distinctClause will break its
> correspondence with the sortClause, with probably bad results.
>
> I do not remember for sure at the moment, but it may be that this
> correspondence is only important for the case of DISTINCT ON, in which
> case we could dodge the problem by not applying the optimization unless
> it's plain DISTINCT. That doesn't help us with point 1 though.

Yeah, it seems better to only try to apply this for plain DISTINCT.

> BTW, my dictionary says it's "dependent" not "dependant".

Oops. Thanks.

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Hi,

On 2018-01-10 11:12:17 +1300, David Rowley wrote:
> I'll do some more analysis on places that distinctClause is being used
> to check what's safe.

This patch has been waiting on author since 2018-01-09, and the next & last CF has started. I'm inclined to mark this as returned with feedback.

Greetings,

Andres Freund
On 2 March 2018 at 09:51, Andres Freund <andres@anarazel.de> wrote:
> This patch has been waiting on author since 2018-01-09, the next & last
> CF has started. I'm inclined to mark this as returned with feedback.

I'm planning on making the required changes at the weekend.

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 10 January 2018 at 09:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 1. Once you don't have all the tlist items shown in DISTINCT, it really is
> more like DISTINCT ON, seems like. I am not sure it's a good idea to set
> hasDistinctOn, because that engages some planner behaviors we probably
> don't want, but I'm also not sure we can get away with just ignoring the
> difference. As an example, in allpaths.c there are assorted assumptions
> that having a distinctClause but !hasDistinctOn means all output columns
> of a subquery are listed in the distinctClause.

Looking through allpaths.c for usages of distinctClause, they're all either interested in distinctClause being non-NIL, which we'll have no effect on, or (in one case) only care about DISTINCT ON. Then there's:

    /*
     * If subquery has regular DISTINCT (not DISTINCT ON), we're wasting our
     * time: all its output columns must be used in the distinctClause.
     */
    if (subquery->distinctClause && !subquery->hasDistinctOn)
        return;

I don't think removing this fast-path would currently have any benefit, as we're not resetting the ressortgroupref of the targetlist items whose distinctClause items we've removed. If we did that, then you might think that the following code in remove_unused_subquery_outputs() might be able to remove the targetlist item for a useless distinct item we've removed:

    /*
     * If it has a sortgroupref number, it's used in some sort/group
     * clause so we'd better not remove it. Also, don't remove any
     * resjunk columns, since their reason for being has nothing to do
     * with anybody reading the subquery's output. (It's likely that
     * resjunk columns in a sub-SELECT would always have ressortgroupref
     * set, but even if they don't, it seems imprudent to remove them.)
     */
    if (tle->ressortgroupref || tle->resjunk)
        continue;

but it can't, because we've not removed them yet. remove_unused_subquery_outputs() is called before subquery_planner() in set_subquery_pathlist(), so remove_useless_distinct_columns() won't even have been called by the time we get to the above code. So it looks like zeroing the ressortgrouprefs of the TargetEntry items belonging to the removed distinct items would be wasted effort. I also can't imagine why we'd ever try to remove unused outputs from a subquery that's already been planned, since almost all of the benefits come from allowing the planner more flexibility to do things like join removal, so I don't think there's any danger of this code becoming broken later.

I also looked at the other places in the planner which look at distinctClause. Many just test distinctClause == NIL, which is not affected here, since we'll always leave at least 1 item in that List. Another place that looks interesting is query_is_distinct_for() in analyzejoins.c. However, the subquery is being analyzed there long before it's getting planned, so all the distinctClause items will still be fully intact at that time.

> 2. There's a comment in planner.c to the effect that
>
>     * When we have DISTINCT ON, we must sort by the more rigorous of
>     * DISTINCT and ORDER BY, else it won't have the desired behavior.
>     * Also, if we do have to do an explicit sort, we might as well use
>     * the more rigorous ordering to avoid a second sort later.  (Note
>     * that the parser will have ensured that one clause is a prefix of
>     * the other.)
>
> Removing random elements of the distinctClause will break its
> correspondence with the sortClause, with probably bad results.
>
> I do not remember for sure at the moment, but it may be that this
> correspondence is only important for the case of DISTINCT ON, in which
> case we could dodge the problem by not applying the optimization unless
> it's plain DISTINCT. That doesn't help us with point 1 though.

Good point, although we could probably just apply the same optimization to the ORDER BY to sidestep that. I've left a note in that area as something to think about for another day, although I think the refactoring done in this patch would make it a pretty easy thing to fix.

> BTW, my dictionary says it's "dependent" not "dependant".

You're right, regardless of whether that dictionary says Webster's or Oxford on the front. Fixed.

I've attached an updated patch.

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

This is a review of the patch to remove useless DISTINCT columns from the DISTINCT clause:
https://www.postgresql.org/message-id/CAKJS1f8UMJ137sRuVSnEMDDpa57Q71JuLZi4yLCFMekNYVYqaQ%40mail.gmail.com

Contents & Purpose
==================

This patch removes any additional columns in the DISTINCT clause when the table's primary key columns are entirely present in the DISTINCT clause. This optimization works because the PK columns functionally determine the other columns in the relation. The patch includes regression test cases.

Initial Run
===========

The patch applies cleanly to HEAD. All tests which are run as part of the `installcheck` target pass correctly (all existing tests pass, and all new tests pass with the patch applied and fail without it). All TAP tests pass. All tests which are run as part of the `installcheck-world` target pass except one of the Bloom contrib module tests in `contrib/bloom/sql/bloom.sql`, `CREATE INDEX bloomidx ON tst USING bloom (i, t) WITH (col1 = 3);`, which is currently also failing (crashing) for me on master, so it is unrelated to the patch. The test cases seem to cover the new behavior.

Functionality
=============

Given that this optimization is safe for the GROUP BY clause (you can remove functionally dependent attributes from the clause there as well), the logic does seem to follow for DISTINCT. It seems semantically correct to do this. As for the implementation, the author seems to have reasonably addressed the concerns expressed over the distinction between DISTINCT and DISTINCT ON. As far as I can see, this should be fine.

Code Review
===========

For a small performance hit but an improvement in readability, the length check can be moved from the individual GROUP BY and DISTINCT clause checks into the helper function:

    if (list_length(parse->distinctClause) < 2)
        return;

and

    if (list_length(parse->groupClause) < 2)
        return;

can be moved into `remove_functionally_dependent_clauses`.

Comments/Documentation
======================

The main helper function that is added, `remove_functionally_dependent_clauses`, uses one style of comment -- with the name of the function and then the rest of the description indented -- which is found in some places in the code:

    /*
     * remove_functionally_dependent_clauses
     *    Processes clauselist and removes any items which are deemed to be
     *    functionally dependent on other clauselist items.
     *
     * If any item from the list can be removed, then a new list is built which
     * does not contain the removed items. If no item can be removed then the
     * original list is returned.
     */

while other helper functions in the same file use a different style -- all lines flush to the side with a space. I'm not sure which is preferred:

    /*
     * limit_needed - do we actually need a Limit plan node?
     *
     * If we have constant-zero OFFSET and constant-null LIMIT, we can skip adding
     * a Limit node. This is worth checking for because "OFFSET 0" is a common
     * locution for an optimization fence. (Because other places in the planner
     ...

It looks like the non-indented ones are from older commits (2013 vs 2016).

The list length change is totally optional, but I'll leave it as Waiting On Author in case the author wants to make that change.

Overall, LGTM.

The new status of this patch is: Waiting on Author
On 21 March 2018 at 16:29, Melanie Plageman <melanieplageman@gmail.com> wrote:
> For a small performance hit but an improvement in readability, the length check
> can be moved from the individual group by and distinct clause checks into the
> helper function
>
>     if (list_length(parse->distinctClause) < 2)
>         return;
>
> and
>
>     if (list_length(parse->groupClause) < 2)
>         return;
>
> can be moved into `remove_functionally_dependent_clauses`

I remember thinking about this when writing the patch, and I think I ended up doing the check earlier because I thought that having 1 GROUP BY clause item would be the most common case, so it seemed like a good idea to abort as early as possible in that case. That's most likely not the case for DISTINCT. I imagine it probably does not make that much difference anyway, so I've moved it into the remove_functionally_dependent_clauses() function, as mentioned.

> The main helper function that is added `remove_functionally_dependent_clauses`
> uses one style of comment--with the name of the function and then the rest of
> the description indented which is found some places in the code,
>
> while other helper functions in the same file use a different style--all lines
> flush to the side with a space. I'm not sure which is preferred

You might have a point, but remove_useless_groupby_columns already does it this way, so I don't think changing that is a good idea.

> The new status of this patch is: Waiting on Author

Thanks for reviewing this. I've attached an updated patch. I'll set it back to waiting for review again.
--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Melanie Plageman wrote:
> Contents & Purpose
> ==================
>
> This patch removes any additional columns in the DISTINCT clause when the
> table's primary key columns are entirely present in the DISTINCT clause. This
> optimization works because the PK columns functionally determine the other
> columns in the relation. The patch includes regression test cases.

Would it be very difficult to extend that to "if any unique constraints are contained in the DISTINCT clause"?

Yours,
Laurenz Albe
On 22 March 2018 at 21:16, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> Would it be very difficult to extend that to "if any unique constraints are
> contained in the DISTINCT clause"?

Unfortunately, it's quite a bit more work to make that happen. It's not just unique constraints that are required to make this work; we also need to pay attention to NOT NULL constraints, as we're unable to prove functional dependency when the keys allow NULLs, since there may be any number of rows containing NULL values.

The problem is that in order to properly invalidate cached plans we must record the constraint OIDs which the plan depends on. We can do that for PK and UNIQUE constraints; unfortunately, we can't do it for NOT NULL constraints as, at the moment, these are just properties of pg_attribute. For this to be made to work we'd need to store those in pg_constraint too, which is more work than I'm going to do for this patch.

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
David Rowley wrote:
> > Would it be very difficult to extend that to "if any unique constraints are
> > contained in the DISTINCT clause"?
>
> Unfortunately, it's quite a bit more work to make that happen. It's
> not just unique constraints that are required to make this work. We
> also need to pay attention to NOT NULL constraints too, as we're
> unable to prove functional dependency when the keys have NULLs, as there
> may be any number of rows containing NULL values.
>
> The problem is that in order to properly invalidate cached plans we
> must record the constraint OIDs which the plan depends on. We can do
> that for PK and UNIQUE constraints, unfortunately, we can't do it for
> NOT NULL constraints as, at the moment, these are just properties of
> pg_attribute. For this to be made to work we'd need to store those in
> pg_constraint too, which is more work than I'm going to do for this
> patch.

That makes sense; thanks for explaining.

Yours,
Laurenz Albe
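The NULL caveat in the exchange above can be sketched as a small predicate. This is an editor's illustrative Python model with invented names, not planner code: a plain UNIQUE constraint only yields a usable functional dependency when every one of its columns is both present in the clause and known NOT NULL.

```python
def determined_by_unique_key(clause_cols, unique_constraints, not_null_cols):
    # A UNIQUE constraint proves the clause columns determine the whole row
    # only when every constraint column is in the clause AND is NOT NULL:
    # a nullable unique column may hold NULL in any number of rows, so
    # duplicates are still possible.
    cols, nn = set(clause_cols), set(not_null_cols)
    return any(set(u) <= cols and set(u) <= nn for u in unique_constraints)

print(determined_by_unique_key(["a", "b"], [["a"]], ["a"]))  # True
print(determined_by_unique_key(["a", "b"], [["a"]], []))     # False: "a" is nullable
```

A primary key sidesteps the second condition entirely, because PK columns are NOT NULL by definition, which is why the patch restricts itself to PKs.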
David, first of all, I'm delighted to see this change. I've been intending to give you feedback on it for a while, but I've been flat out with work, so I'm sorry. Any improvement in our DISTINCT optimization would be helpful.

Note, though, that DISTINCT optimization can include the surrounding query block context, because sometimes the DISTINCT is required, sometimes it is prohibited, and sometimes it is optional. When DISTINCT is optional, you can produce distinct rows if it is efficient and convenient to do so, as you might from an index scan having leading keys that match the DISTINCT, but you don't have to. Consider if you have a SELECT DISTINCT under a UNION that does a distinct on the same grouping keys. In that case, the UNION will do the distinct operation anyway, so the SELECT DISTINCT can become a SELECT. Please search for and read the old-but-good paper entitled "Extensible/Rule Based Query Optimization in Starburst" by Pirahesh et al.

Distinctness can also be preserved across joins, so if you have a 'snowflake query' type join, where all the joins are to a unique key, then the distinctness of the other side of the join is preserved. For example, a SELECT DISTINCT * FROM fact_table ... that joins from each column in its compound primary key to a unique key of another (dimension) table would remain distinct, and so you could drop the DISTINCT from the query. This is a relatively weak area of PostgreSQL at the moment, so IMHO we really need this.

Regarding constraint OIDs and invalidation when a NOT NULL column is modified to remove this constraint: that would be accomplished with an ALTER TABLE <t> ALTER COLUMN <c> DROP NOT NULL. Why wouldn't / can't the ALTER TABLE cause an invalidation of cached plans that depend on table t?
On Thu, Mar 22, 2018 at 5:20 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> The problem is that in order to properly invalidate cached plans we
> must record the constraint OIDs which the plan depends on. We can do
> that for PK and UNIQUE constraints, unfortunately, we can't do it for
> NOT NULL constraints as, at the moment, these are just properties of
> pg_attribute. For this to be made to work we'd need to store those in
> pg_constraint too, which is more work than I'm going to do for this patch.
I was just wondering, since get_primary_key_attnos opens pg_constraint and just skips attributes with other constraint types than primary, what would be the reason we wouldn't just also open pg_attribute and check for the non-nullness of the non-primary key unique attributes?
Couldn't we add a function which opens both pg_attribute and pg_constraint to get non-null unique attributes?
I suppose that would have a performance impact.
And, I did notice the FIXME in the comment before check_functional_grouping which seems to make the argument that there are other use cases for wanting the non-nullness represented in pg_constraint.
* Currently we only check to see if the rel has a primary key that is a
* subset of the grouping_columns. We could also use plain unique constraints
* if all their columns are known not null, but there's a problem: we need
* to be able to represent the not-null-ness as part of the constraints added
* to *constraintDeps. FIXME whenever not-null constraints get represented
* in pg_constraint.
*/
Melanie Plageman
On 22 March 2018 at 18:58, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 21 March 2018 at 16:29, Melanie Plageman <melanieplageman@gmail.com> wrote:
>> The new status of this patch is: Waiting on Author
>
> Thanks for reviewing this. I've attached an updated patch. I'll set
> back to waiting for review again.

I've attached another version of the patch. The only change is a fix for a thinko in a comment which claimed it may be possible to do something with DISTINCT ON clauses; on further thought, I don't think that's possible, so I've updated the comment to reflect my new thoughts.

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 24 March 2018 at 05:55, Melanie Plageman <melanieplageman@gmail.com> wrote:
> I was just wondering, since get_primary_key_attnos opens pg_constraint and
> just skips attributes with other constraint types than primary, what would
> be the reason we wouldn't just also open pg_attribute and check for the
> non-nullness of the non-primary key unique attributes?
> Couldn't we add a function which opens both pg_attribute and pg_constraint
> to get non-null unique attributes?
> I suppose that would have a performance impact.
> And, I did notice the FIXME in the comment before check_functional_grouping
> which seems to make the argument that there are other use cases for wanting
> the non-nullness represented in pg_constraint.

It's certainly possible to do more here. I'm uncertain what needs to be done in regards to cached plan invalidation, but we'd certainly need to ensure cached plans are invalidated whenever attnotnull is changed. There would be no need to look at the pg_attribute table directly; a syscache function would just need to be added, named something like get_attnotnull(), which would be akin to get_atttypmod().

I'm personally not planning on doing anything in that regard for this patch. We have 1 week to go before the PG11 feature freeze. My goal for this patch was to apply the same optimization to DISTINCT as I did for GROUP BY a while back, which is exactly what the patch does. I agree that more would be nice, but unfortunately, time is finite.

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 24 March 2018 at 01:42, Jim Finnerty <jfinnert@amazon.com> wrote:
> Distinctness can also be preserved across joins, so if you have a 'snowflake
> query' type join, where all the joins are to a unique key, then the
> distinctness of the other side of the join is preserved. For example, a
> SELECT DISTINCT * FROM fact_table ... that joins from each column in its
> compound primary key to a unique key of another (dimension) table would
> remain distinct, and so you could drop the DISTINCT from the query.

I'm aware. It is something I'm interested in, but it would require several orders of magnitude more work than what I've done for this patch. You may have noticed the other work I did a while back to detect whether joins cause row duplication or not, so it's certainly something I've thought about.

If Amazon would like to sponsor work in this area then please see [1]. It certainly would be great to see that happen.

[1] https://wiki.postgresql.org/wiki/How_to_sponsor_a_feature

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
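The snowflake case discussed above can be modelled in the same toy style (an editor's illustration only, not planner code): when every fact row joins to at most one dimension row, here modelled by a dict whose keys stand in for a unique join key, the fact table's PK still uniquely identifies each output row, so a DISTINCT over the join output would be redundant.

```python
fact = [(1, 10), (2, 20), (3, 10)]   # (k, dim_id); k is the fact table's PK
dim = {10: "red", 20: "blue"}        # dim_id is unique, like a dimension PK

# Each fact row finds at most one dim row, so the join duplicates nothing.
joined = [(k, dim[d]) for (k, d) in fact]
ks = [k for (k, _) in joined]
print(len(ks) == len(set(ks)))  # True: PK values remain unique after the join
```

Contrast this with a join to a non-unique key, where a single fact row can fan out into several output rows and the DISTINCT must stay.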
Hi David,
The updated patch looks good to me.
On 6 April 2018 at 03:36, Melanie Plageman <melanieplageman@gmail.com> wrote: > The updated patch looks good to me. > I've changed the status to "ready for committer" Great. Thanks for reviewing this! -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes: > It's certainly possible to do more here. I'm uncertain what needs to > be done in regards to cached plan invalidation, but we'd certainly > need to ensure cached plans are invalidated whenever the attnotnull is > changed. They already are; any change at all in a pg_class or pg_attribute row results in a relcache inval for the associated table, cf CacheInvalidateHeapTuple. As far as the invalidation machinery is concerned, changing attnotnull is indistinguishable from changing atttypid, which I'm sure you'll agree must force plan regeneration. So this line of thought leads to the conclusion that we were too conservative in not allowing unique constraints to be used along with actual pkeys in remove_useless_groupby_columns. We do have the additional requirement of checking attnotnull, but I think we can assume that a plan inval will occur if that changes. In fact, now that I think about it, I wonder why we're insisting on tracking the constraint OID either. Don't index drops force relcache invals regardless? If not, how do we ensure that an indexscan plan relying on that index gets redone? Part of this probably comes from looking at the parser's check_ungrouped_columns(), but that is operating under different constraints, since it is generating what might be a long-lived parse tree, eg something that gets stored as a view. We do need to be able to register an actual dependency on the uniqueness constraint in that situation. But plans get tossed on the basis of relcache invals, so I think they don't need it. Anyway, that's clearly a future improvement rather than something I'd insist on this patch tackling immediately. My gripe about this as it stands is mainly that it seems like it's going to do a lot of catalog-lookup work twice over, at least in cases that have both DISTINCT and GROUP BY --- or is that too small a set to worry about? regards, tom lane
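The future improvement Tom sketches here — allowing unique constraints alongside actual pkeys, provided the columns also pass an attnotnull check — can be expressed abstractly. Below is an illustrative Python sketch (sets of attnos stand in for the planner's bitmapsets; none of these names come from the actual code):

```python
def removable_attnos(clause_attnos, unique_constraints, notnull_attnos):
    """Attnos droppable from one relation's grouping/distinct columns:
    a unique constraint works like a PK here only if its columns are a
    proper subset of the clause columns AND are all marked NOT NULL
    (a unique index over nullable columns doesn't imply functional
    dependence, since NULL values compare as distinct)."""
    for ucols in unique_constraints:
        if ucols < clause_attnos and ucols <= notnull_attnos:
            return clause_attnos - ucols
    return set()

# Unique constraint on {1, 2}, both NOT NULL: column 3 becomes removable.
assert removable_attnos({1, 2, 3}, [{1, 2}], {1, 2}) == {3}
# Same constraint but column 2 is nullable: nothing is removable.
assert removable_attnos({1, 2, 3}, [{1, 2}], {1}) == set()
```

As Tom notes, the remaining question is not the check itself but ensuring plan invalidation when attnotnull changes — which relcache invalidation already provides.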
I wrote: > My gripe about > this as it stands is mainly that it seems like it's going to do > a lot of catalog-lookup work twice over, at least in cases that > have both DISTINCT and GROUP BY --- or is that too small a set to > worry about? I convinced myself that that was a silly objection, and started looking through the patch with intent to commit. However, re-reading the thread re-awakened my concern about whether deleting items from the distinctClause is actually safe. You're right that remove_unused_subquery_outputs isn't really in trouble with this; it might fail to remove some things it could remove, but that seems OK, and at best deserving of a comment. However, that doesn't mean we're out of the woods. See also preprocess_groupclause, particularly this comment: * Note: we need no comparable processing of the distinctClause because * the parser already enforced that that matches ORDER BY. as well as the interesting assumptions in create_distinct_paths about what it means to compare the lengths of the pathkey lists for these clauses. Once I noticed that, I became convinced that this patch actually breaks planning of sort/unique-based DISTINCT: if we have a pkey a,b, but the ORDER BY list is a,c,b, then we will sort by a,c,b but the Unique node will be told it only needs to unique-ify on the first two sort columns. That led me to try this test case, and it blew up even better than I expected: regression=# create table t1 (a int,b int, c int, d int, primary key(a,b)); CREATE TABLE regression=# explain verbose select distinct * from t1 order by a,c,b; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The reason is we're hitting this assert, a bit further down in create_distinct_paths: /* Assert checks that parser didn't mess up... */ Assert(pathkeys_contained_in(root->distinct_pathkeys, needed_pathkeys)); I don't think that that makes this patch unsalvageable. 
The way forward probably involves remembering that we removed some distinctClause items (so that we know the correspondence to the sortClause no longer holds) and then working harder in create_distinct_paths when that's the case. However, that's not something that's going to happen on the last day of the commitfest. So I'm going to mark this Returned With Feedback and encourage you to return to the matter in v12. In the meantime, attached is the version of the patch that I was about to commit before getting cold feet. It has some cosmetic improvements over yours, notably comment additions in allpaths.c. regards, tom lane diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index 65a34a2..345ed17 100644 *** a/src/backend/optimizer/path/allpaths.c --- b/src/backend/optimizer/path/allpaths.c *************** remove_unused_subquery_outputs(Query *su *** 3306,3311 **** --- 3306,3315 ---- /* * If subquery has regular DISTINCT (not DISTINCT ON), we're wasting our * time: all its output columns must be used in the distinctClause. + * (Note: the latter is not necessarily true anymore, because planner.c + * might have found some of the DISTINCT columns to be redundant and + * dropped them. But they'd still have sortgroupref markings, so unless + * we improve the heuristic below, we would not recognize them as unused.) */ if (subquery->distinctClause && !subquery->hasDistinctOn) return; *************** remove_unused_subquery_outputs(Query *su *** 3348,3358 **** /* * If it has a sortgroupref number, it's used in some sort/group ! * clause so we'd better not remove it. Also, don't remove any ! * resjunk columns, since their reason for being has nothing to do ! * with anybody reading the subquery's output. (It's likely that ! * resjunk columns in a sub-SELECT would always have ressortgroupref ! * set, but even if they don't, it seems imprudent to remove them.) 
*/ if (tle->ressortgroupref || tle->resjunk) continue; --- 3352,3365 ---- /* * If it has a sortgroupref number, it's used in some sort/group ! * clause so we'd better not remove it. (This is a conservative ! * heuristic, since it might not actually be used by any surviving ! * sort/group clause; but we don't bother to expend the cycles needed ! * for a more accurate test.) Also, don't remove any resjunk columns, ! * since their reason for being has nothing to do with anybody reading ! * the subquery's output. (It's likely that resjunk columns in a ! * sub-SELECT would always have ressortgroupref set, but even if they ! * don't, it seems imprudent to remove them.) */ if (tle->ressortgroupref || tle->resjunk) continue; diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 008492b..a9ccfed 100644 *** a/src/backend/optimizer/plan/planner.c --- b/src/backend/optimizer/plan/planner.c *************** static double preprocess_limit(PlannerIn *** 125,130 **** --- 125,133 ---- int64 *offset_est, int64 *count_est); static bool limit_needed(Query *parse); static void remove_useless_groupby_columns(PlannerInfo *root); + static void remove_useless_distinct_columns(PlannerInfo *root); + static List *remove_dependent_grouping_clauses(PlannerInfo *root, + List *clauselist); static List *preprocess_groupclause(PlannerInfo *root, List *force); static List *extract_rollup_sets(List *groupingSets); static List *reorder_grouping_sets(List *groupingSets, List *sortclause); *************** subquery_planner(PlannerGlobal *glob, Qu *** 965,970 **** --- 968,976 ---- /* Remove any redundant GROUP BY columns */ remove_useless_groupby_columns(root); + /* Likewise for redundant DISTINCT columns */ + remove_useless_distinct_columns(root); + /* * If we have any outer joins, try to reduce them to plain inner joins. 
* This step is most easily done after we've done expression *************** static void *** 2941,2967 **** remove_useless_groupby_columns(PlannerInfo *root) { Query *parse = root->parse; - Bitmapset **groupbyattnos; - Bitmapset **surplusvars; - ListCell *lc; - int relid; - - /* No chance to do anything if there are less than two GROUP BY items */ - if (list_length(parse->groupClause) < 2) - return; /* Don't fiddle with the GROUP BY clause if the query has grouping sets */ if (parse->groupingSets) return; /* ! * Scan the GROUP BY clause to find GROUP BY items that are simple Vars. ! * Fill groupbyattnos[k] with a bitmapset of the column attnos of RTE k ! * that are GROUP BY items. */ ! groupbyattnos = (Bitmapset **) palloc0(sizeof(Bitmapset *) * ! (list_length(parse->rtable) + 1)); ! foreach(lc, parse->groupClause) { SortGroupClause *sgc = lfirst_node(SortGroupClause, lc); TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList); --- 2947,3022 ---- remove_useless_groupby_columns(PlannerInfo *root) { Query *parse = root->parse; /* Don't fiddle with the GROUP BY clause if the query has grouping sets */ if (parse->groupingSets) return; + parse->groupClause = + remove_dependent_grouping_clauses(root, parse->groupClause); + } + + /* + * remove_useless_distinct_columns + * Like remove_useless_groupby_columns, but for the DISTINCT clause + * + * If we have both a multi-member GROUP BY clause and a multi-member DISTINCT + * clause, this will do a lot of the same catalog lookup work that + * remove_useless_groupby_columns already did. That seems like an unlikely + * case, so for now we don't worry about it, but eventually it might be good + * to refactor to avoid that. + */ + static void + remove_useless_distinct_columns(PlannerInfo *root) + { + Query *parse = root->parse; + /* ! * Don't try to remove anything from a DISTINCT ON clause. For this case, ! * the distinctClauses are closely entwined with the ORDER BY clause, so ! * we'd better not meddle with them. 
You might think that we can just ! * apply the same optimizations to the ORDER BY too, but we can't since ! * removing an item there could affect the order of the query results. */ ! if (parse->hasDistinctOn) ! return; ! ! parse->distinctClause = ! remove_dependent_grouping_clauses(root, parse->distinctClause); ! } ! ! /* ! * remove_dependent_grouping_clauses ! * Process clauselist (a list of SortGroupClause) and remove any items ! * that can be proven to be functionally dependent on other items. ! * We will have the same grouping semantics without them. ! * ! * If any item from the list can be removed, then a new list is built which ! * does not contain the removed items. If nothing can be removed then the ! * original list is returned. ! */ ! static List * ! remove_dependent_grouping_clauses(PlannerInfo *root, ! List *clauselist) ! { ! Query *parse = root->parse; ! Bitmapset **clauseattnos; ! Bitmapset **surplusvars; ! ListCell *lc; ! int relid; ! ! /* No chance of removing anything if there are fewer than two items */ ! if (list_length(clauselist) < 2) ! return clauselist; ! ! /* ! * Scan the clauselist to find items that are simple Vars. Fill ! * clauseattnos[k] with a bitmapset of the column attnos of RTE k that ! * appear in the clauselist. ! */ ! clauseattnos = (Bitmapset **) palloc0(sizeof(Bitmapset *) * ! (list_length(parse->rtable) + 1)); ! foreach(lc, clauselist) { SortGroupClause *sgc = lfirst_node(SortGroupClause, lc); TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList); *************** remove_useless_groupby_columns(PlannerIn *** 2971,2979 **** * Ignore non-Vars and Vars from other query levels. * * XXX in principle, stable expressions containing Vars could also be ! * removed, if all the Vars are functionally dependent on other GROUP ! * BY items. But it's not clear that such cases occur often enough to ! * be worth troubling over. 
*/ if (!IsA(var, Var) || var->varlevelsup > 0) --- 3026,3034 ---- * Ignore non-Vars and Vars from other query levels. * * XXX in principle, stable expressions containing Vars could also be ! * removed, if all the Vars are functionally dependent on other items ! * in the clauselist. But it's not clear that such cases occur often ! * enough to be worth troubling over. */ if (!IsA(var, Var) || var->varlevelsup > 0) *************** remove_useless_groupby_columns(PlannerIn *** 2982,2996 **** /* OK, remember we have this Var */ relid = var->varno; Assert(relid <= list_length(parse->rtable)); ! groupbyattnos[relid] = bms_add_member(groupbyattnos[relid], ! var->varattno - FirstLowInvalidHeapAttributeNumber); } /* * Consider each relation and see if it is possible to remove some of its ! * Vars from GROUP BY. For simplicity and speed, we do the actual removal ! * in a separate pass. Here, we just fill surplusvars[k] with a bitmapset ! * of the column attnos of RTE k that are removable GROUP BY items. */ surplusvars = NULL; /* don't allocate array unless required */ relid = 0; --- 3037,3055 ---- /* OK, remember we have this Var */ relid = var->varno; Assert(relid <= list_length(parse->rtable)); ! clauseattnos[relid] = bms_add_member(clauseattnos[relid], ! var->varattno - FirstLowInvalidHeapAttributeNumber); } /* * Consider each relation and see if it is possible to remove some of its ! * Vars from the clauselist. We can do so if they are functionally ! * dependent on other Vars from the same relation that are also in the ! * clauselist (independently of any other relations that are mentioned). ! * ! * For simplicity and speed, we do the actual removal in a separate pass. ! * Here, we just fill surplusvars[k] with a bitmapset of the column attnos ! * of RTE k that are removable clauselist items. 
*/ surplusvars = NULL; /* don't allocate array unless required */ relid = 0; *************** remove_useless_groupby_columns(PlannerIn *** 3007,3014 **** if (rte->rtekind != RTE_RELATION) continue; ! /* Nothing to do unless this rel has multiple Vars in GROUP BY */ ! relattnos = groupbyattnos[relid]; if (bms_membership(relattnos) != BMS_MULTIPLE) continue; --- 3066,3073 ---- if (rte->rtekind != RTE_RELATION) continue; ! /* Nothing to do unless this rel has multiple Vars in clauselist */ ! relattnos = clauseattnos[relid]; if (bms_membership(relattnos) != BMS_MULTIPLE) continue; *************** remove_useless_groupby_columns(PlannerIn *** 3022,3028 **** /* * If the primary key is a proper subset of relattnos then we have ! * some items in the GROUP BY that can be removed. */ if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1) { --- 3081,3087 ---- /* * If the primary key is a proper subset of relattnos then we have ! * some items in the clauselist that can be removed. */ if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1) { *************** remove_useless_groupby_columns(PlannerIn *** 3044,3058 **** } /* ! * If we found any surplus Vars, build a new GROUP BY clause without them. * (Note: this may leave some TLEs with unreferenced ressortgroupref * markings, but that's harmless.) */ if (surplusvars != NULL) { ! List *new_groupby = NIL; ! foreach(lc, parse->groupClause) { SortGroupClause *sgc = lfirst_node(SortGroupClause, lc); TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList); --- 3103,3117 ---- } /* ! * If we found any surplus Vars, build a new clause list without them. * (Note: this may leave some TLEs with unreferenced ressortgroupref * markings, but that's harmless.) */ if (surplusvars != NULL) { ! List *new_clauselist = NIL; ! 
foreach(lc, clauselist) { SortGroupClause *sgc = lfirst_node(SortGroupClause, lc); TargetEntry *tle = get_sortgroupclause_tle(sgc, parse->targetList); *************** remove_useless_groupby_columns(PlannerIn *** 3066,3076 **** var->varlevelsup > 0 || !bms_is_member(var->varattno - FirstLowInvalidHeapAttributeNumber, surplusvars[var->varno])) ! new_groupby = lappend(new_groupby, sgc); } ! parse->groupClause = new_groupby; } } /* --- 3125,3138 ---- var->varlevelsup > 0 || !bms_is_member(var->varattno - FirstLowInvalidHeapAttributeNumber, surplusvars[var->varno])) ! new_clauselist = lappend(new_clauselist, sgc); } ! return new_clauselist; } + + /* nothing to change, just return the old list */ + return clauselist; } /* diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out index f85e913..2976c4b 100644 *** a/src/test/regress/expected/aggregates.out --- b/src/test/regress/expected/aggregates.out *************** explain (costs off) select * from t3 gro *** 1017,1022 **** --- 1017,1081 ---- -> Seq Scan on t3 (3 rows) + -- + -- Test removal of redundant DISTINCT columns + -- + -- Non-primary-key columns can be removed from DISTINCT clause + explain (costs off) select distinct a,b,c,d from t1; + QUERY PLAN + ---------------------- + HashAggregate + Group Key: a, b + -> Seq Scan on t1 + (3 rows) + + -- No removal can happen if the complete PK is not present in DISTINCT clause + explain (costs off) select distinct a,c,d from t1; + QUERY PLAN + ---------------------- + HashAggregate + Group Key: a, c, d + -> Seq Scan on t1 + (3 rows) + + -- Test removal across multiple relations + explain (costs off) select distinct t1.a,t1.b,t1.c,t1.d,t2.x,t2.y,t2.z + from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y; + QUERY PLAN + ------------------------------------------------------ + HashAggregate + Group Key: t1.a, t1.b, t2.x, t2.y + -> Hash Join + Hash Cond: ((t2.x = t1.a) AND (t2.y = t1.b)) + -> Seq Scan on t2 + -> Hash + -> Seq Scan on t1 
+ (7 rows) + + -- Test case where t1 can be optimized but not t2 + explain (costs off) select distinct t1.a,t1.b,t1.c,t1.d,t2.x,t2.z + from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y; + QUERY PLAN + ------------------------------------------------------ + HashAggregate + Group Key: t1.a, t1.b, t2.x, t2.z + -> Hash Join + Hash Cond: ((t2.x = t1.a) AND (t2.y = t1.b)) + -> Seq Scan on t2 + -> Hash + -> Seq Scan on t1 + (7 rows) + + -- Ensure we don't remove DISTINCT ON items + explain (costs off) select distinct on (a,b,c) d from t1 order by a,b,c,d; + QUERY PLAN + ------------------------------ + Unique + -> Sort + Sort Key: a, b, c, d + -> Seq Scan on t1 + (4 rows) + drop table t1; drop table t2; drop table t3; diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out index 84c6e9b..fd03559 100644 *** a/src/test/regress/expected/join.out --- b/src/test/regress/expected/join.out *************** select d.* from d left join (select * fr *** 4126,4147 **** -> Seq Scan on d (8 rows) ! -- similarly, but keying off a DISTINCT clause explain (costs off) select d.* from d left join (select distinct * from b) s on d.a = s.id; ! QUERY PLAN ! -------------------------------------- ! Merge Right Join ! Merge Cond: (b.id = d.a) ! -> Unique ! -> Sort ! Sort Key: b.id, b.c_id ! -> Seq Scan on b ! -> Sort ! Sort Key: d.a ! -> Seq Scan on d ! (9 rows) -- check join removal works when uniqueness of the join condition is enforced -- by a UNION --- 4126,4147 ---- -> Seq Scan on d (8 rows) ! -- similarly, but keying off a DISTINCT clause (again, removal of b.c_id ! -- from the DISTINCT step happens too late for join removal) explain (costs off) select d.* from d left join (select distinct * from b) s on d.a = s.id; ! QUERY PLAN ! --------------------------------------- ! Hash Left Join ! Hash Cond: (d.a = s.id) ! -> Seq Scan on d ! -> Hash ! -> Subquery Scan on s ! -> HashAggregate ! Group Key: b.id ! -> Seq Scan on b ! 
(8 rows) -- check join removal works when uniqueness of the join condition is enforced -- by a UNION diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql index 506d044..1ee26ca 100644 *** a/src/test/regress/sql/aggregates.sql --- b/src/test/regress/sql/aggregates.sql *************** group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z; *** 362,367 **** --- 362,387 ---- -- Cannot optimize when PK is deferrable explain (costs off) select * from t3 group by a,b,c; + -- + -- Test removal of redundant DISTINCT columns + -- + -- Non-primary-key columns can be removed from DISTINCT clause + explain (costs off) select distinct a,b,c,d from t1; + + -- No removal can happen if the complete PK is not present in DISTINCT clause + explain (costs off) select distinct a,c,d from t1; + + -- Test removal across multiple relations + explain (costs off) select distinct t1.a,t1.b,t1.c,t1.d,t2.x,t2.y,t2.z + from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y; + + -- Test case where t1 can be optimized but not t2 + explain (costs off) select distinct t1.a,t1.b,t1.c,t1.d,t2.x,t2.z + from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y; + + -- Ensure we don't remove DISTINCT ON items + explain (costs off) select distinct on (a,b,c) d from t1 order by a,b,c,d; + drop table t1; drop table t2; drop table t3; diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql index b1e05a3..31fd397 100644 *** a/src/test/regress/sql/join.sql --- b/src/test/regress/sql/join.sql *************** explain (costs off) *** 1360,1366 **** select d.* from d left join (select * from b group by b.id, b.c_id) s on d.a = s.id; ! -- similarly, but keying off a DISTINCT clause explain (costs off) select d.* from d left join (select distinct * from b) s on d.a = s.id; --- 1360,1367 ---- select d.* from d left join (select * from b group by b.id, b.c_id) s on d.a = s.id; ! -- similarly, but keying off a DISTINCT clause (again, removal of b.c_id ! 
-- from the DISTINCT step happens too late for join removal) explain (costs off) select d.* from d left join (select distinct * from b) s on d.a = s.id;
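For readers skimming the diff, the heart of remove_dependent_grouping_clauses can be paraphrased with plain Python sets. This is a deliberate simplification — the real code works on per-RTE bitmapsets offset by FirstLowInvalidHeapAttributeNumber, skips non-Vars and upper-level Vars, and gets the PK attnos from the catalog — and the relids and attnos below are made up:

```python
# Each clause item is modeled as (relid, attno) for a simple Var;
# pk[relid] is the set of primary-key attnos for that relation.
def remove_dependent_items(clauselist, pk):
    if len(clauselist) < 2:
        return clauselist                    # nothing can be removed

    # Collect, per relation, the attnos that appear in the clause list.
    clauseattnos = {}
    for relid, attno in clauselist:
        clauseattnos.setdefault(relid, set()).add(attno)

    # A relation's non-PK columns are surplus when its whole PK is a
    # *proper* subset of that relation's columns in the clause list.
    surplus = {}
    for relid, attnos in clauseattnos.items():
        pkattnos = pk.get(relid)
        if pkattnos and pkattnos < attnos:
            surplus[relid] = attnos - pkattnos

    if not surplus:
        return clauselist                    # return the original list untouched
    return [(r, a) for (r, a) in clauselist
            if a not in surplus.get(r, set())]

# t1 has PK (a=1, b=2): DISTINCT a, b, c, d collapses to DISTINCT a, b.
pk = {1: {1, 2}}
assert remove_dependent_items([(1, 1), (1, 2), (1, 3), (1, 4)], pk) == [(1, 1), (1, 2)]
```

The proper-subset test mirrors the BMS_SUBSET1 check in the patch: if the PK equals the clause columns exactly, there is nothing left over to remove.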
On 8 April 2018 at 08:55, Tom Lane <tgl@sss.pgh.pa.us> wrote: > regression=# create table t1 (a int,b int, c int, d int, primary key(a,b)); > CREATE TABLE > regression=# explain verbose select distinct * from t1 order by a,c,b; > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. Ouch! > The reason is we're hitting this assert, a bit further down in > create_distinct_paths: > > /* Assert checks that parser didn't mess up... */ > Assert(pathkeys_contained_in(root->distinct_pathkeys, > needed_pathkeys)); > > > I don't think that that makes this patch unsalvageable. The way forward > probably involves remembering that we removed some distinctClause items > (so that we know the correspondence to the sortClause no longer holds) > and then working harder in create_distinct_paths when that's the case. > > However, that's not something that's going to happen on the last day > of the commitfest. So I'm going to mark this Returned With Feedback > and encourage you to return to the matter in v12. > > In the meantime, attached is the version of the patch that I was about to > commit before getting cold feet. It has some cosmetic improvements > over yours, notably comment additions in allpaths.c. Thanks a lot for spending time on this. I'll no doubt have some time over this coming winter to see if it can be fixed and re-submitted in the PG12 cycle. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
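The failure mode Tom found is easy to reproduce with ordinary data, independent of PostgreSQL. A small Python sketch (hypothetical rows for a table t1(a, b, c) with primary key (a, b)) shows why unique-ifying on only the leading sort columns goes wrong once b has been removed from the DISTINCT clause but the ORDER BY is a, c, b:

```python
# All rows are distinct (they differ on the PK (a, b)), so DISTINCT *
# must return every row.
rows = [(1, 1, 5), (1, 2, 5), (2, 1, 7)]     # (a, b, c)

sort_cols = [0, 2, 1]                        # ORDER BY a, c, b (tuple indices)
ordered = sorted(rows, key=lambda r: tuple(r[i] for i in sort_cols))

def unique_on_leading_sort_cols(sorted_rows, sort_cols, n):
    """Mimic a Unique node told to compare only the first n sort columns."""
    out, last_key = [], None
    for r in sorted_rows:
        key = tuple(r[i] for i in sort_cols[:n])
        if key != last_key:
            out.append(r)
            last_key = key
    return out

# Buggy plan: unique-ify on just (a, c), because b was dropped from DISTINCT.
wrong = unique_on_leading_sort_cols(ordered, sort_cols, 2)
assert wrong == [(1, 1, 5), (2, 1, 7)]   # (1, 2, 5) wrongly merged away
assert len(set(rows)) == 3               # correct DISTINCT keeps all three rows
```

This is the mismatch behind the Assert in create_distinct_paths: after removal, the distinct pathkeys are no longer guaranteed to be a prefix of the sort pathkeys.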
distinct_optimization_v6.patch <http://www.postgresql-archive.org/file/t348990/distinct_optimization_v6.patch>

Here is an update to this thread, for potential inclusion in v12. I couldn't get the most recent 'v5' patch to apply cleanly, so I recreated a v6 patch on PG10.5 by hand, and made a few changes and improvements:

- (a) If there is only one relation and the PK is present in the SELECT list, then the distinctClause can be removed. Thus, a query of the form SELECT DISTINCT pk, * FROM t; recognizes that the body of the SELECT is already distinct on pk, and doesn't do an unnecessary DISTINCT operation.
- (b) If there is only one relation, the PK is present, and there are no aggregates, then the groupByClause can similarly be removed.
- (c) If DISTINCT ON is specified, but no ORDER BY is specified, then it acts like a regular DISTINCT.
- (d) If the distinct clause is modified, this fact is recorded in a new bool in the Query struct, so that create_distinct_paths no longer Asserts that the needed pathkeys are contained in the distinct_pathkeys.
- (e) Since the new bool is in the Query struct, copyfuncs/equalfuncs/outfuncs/readfuncs support is also provided.

This will cause a problem unless the pg_rewrite catalog is regenerated. The pg_rewrite catalog contains a serialized representation of the Query node in its ev_action column. If there is a way to recreate the contents of the pg_rewrite relation without bumping the catversion, can someone please explain how? If not, then this change is incomplete and would require a new catalog version (catversion.h) too.

Additional work on this patch would be desirable. It should check for unique + not null, in addition to just the pk constraint. The DISTINCT could be eliminated in cases with multiple relations if all the joins are 1:1, although that would arguably belong in a different patch.

/Jim

p.s.
the v6 patch works for the problem case that Tom Lane reported with the v5 patch -- Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
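Improvement (a) above is a whole-clause removal test rather than per-column pruning. A hedged Python sketch of the decision (the function and its inputs are illustrative, not taken from the v6 patch, and the naive single-relation test shown here is the part that gets questioned later in the thread):

```python
def distinct_is_noop(n_base_rels, select_attnos, pkattnos):
    """True when a plain DISTINCT over the select list cannot remove rows:
    with a single base relation whose entire primary key appears among the
    selected columns, the rows are already distinct.  (Counting base
    relations correctly is the hard part in the real planner.)"""
    return n_base_rels == 1 and bool(pkattnos) and pkattnos <= select_attnos

# SELECT DISTINCT pk, * FROM t;  -- PK columns {1, 2} all in the select list
assert distinct_is_noop(1, {1, 2, 3, 4}, {1, 2})
# PK column 2 missing from the select list: DISTINCT may still remove rows.
assert not distinct_is_noop(1, {1, 3, 4}, {1, 2})
```

Note that here the subset need not be proper: if the select list is exactly the PK, the DISTINCT is still a no-op, unlike the per-column case where only leftover columns can be pruned.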
On 23 August 2018 at 08:11, Jim Finnerty <jfinnert@amazon.com> wrote: > Here is an update to this thread, for potential inclusion in v12. I > couldn't get the most recent 'v5' patch to apply cleanly, so I recreated a > v6 patch on PG10.5 by hand, and made a few changes and improvements: Well, the patch I wrote was never intended for v10. Why did you go to the effort of doing that when you're proposing the patch for v12? It seems my v5 patch and Tom's v6 version apply fine to today's master without any rejected hunks. > The pg_rewrite catalog contains a serialized representation of the Query > node in its ev_action column. If there is a way to recreate the contents of > the pg_rewrite relation without bumping the catversion, can someone please > explain how? If not, then this change is incomplete and would require a new > catalog version (catversion.h) too. If you're proposing for v12, why do you care about that? catversion changes are commonplace during major version development. Or are you proposing this gets committed to PostgreSQL 10? ... That's not going to happen... Plus, it does not seem well aligned with the intent of this thread to discuss ways to do that either. > Additional work on this patch would be desirable. It should check for > unique + not null, in addition to just the pk constraint. The DISTINCT > could be eliminated in cases with multiple relations if all the joins are > 1:1, although that would arguably belong in a different patch. Certainly. Patches are welcome, but I think I've mentioned to you previously that I'm not planning that for this patch. > p.s. the v6 patch works for the problem case that Tom Lane reported with the > v5 patch That would be less ambiguous if there weren't now two v6 versions of the patch. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
I was thinking about this last night and I realized that putting the new hasModifiedDistinct flag on the PlannerInfo struct eliminates the need to deal with the serialization issues, and makes it simpler. Here's a new patch (v7) that puts the bool on the PlannerInfo struct, and adds a couple of tests. re: why did you apply the patch on v10? I'm developing on a v10.5 codebase at the moment, though this will change soon. If the v7 patch doesn't apply cleanly on later versions, please let me know and I'll fix it. re: if you're proposing the patch for v12, why do you care about catversion? Only because it would be a problem to test the patch on 10.5 with a catversion change that wouldn't come until v12. With the v7 patch this issue is moot because it no longer requires a catversion change. cheers, /Jim distinct_opt_v7.patch <http://www.postgresql-archive.org/file/t348990/distinct_opt_v7.patch> -- Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
On 2018-Aug-23, Jim Finnerty wrote: > distinct_opt_v7.patch > <http://www.postgresql-archive.org/file/t348990/distinct_opt_v7.patch> Are you posting this via postgresql-archive.org? Please don't do that. These posts of yours are mangled by postgresql-archive and do not contain the file you're attaching. Use direct posting by email to pgsql-hackers@lists.postgresql.org instead. Thanks -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you Álvaro. Here's the distinct_opt_v7 patch.
Attachment
On 24 August 2018 at 03:29, Finnerty, Jim <jfinnert@amazon.com> wrote: > Thank you Álvaro. Here's the distinct_opt_v7 patch. Determining if there's just 1 base relation by checking the length of the rtable is not a good way to do it. If you want to see why, try creating a primary key on a partitioned table. My personal opinion of only being able to completely remove the DISTINCT when there's a single item in the rtable (or a single base table) is that it's just too poor to bother with. I think such a solution is best left to another patch, and it should be much more complete and be able to track the unique properties through joins. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Greetings, * David Rowley (david.rowley@2ndquadrant.com) wrote: > My personal opinion of only being able to completely remove the > DISTINCT when there's a single item in the rtable (or a single base > table) is that it's just too poor to bother with. I think such a > solution is best left to another patch and it should be much more > complete and be able to track the unique properties through joins. Doesn't that move the goalposts for this pretty far off? Is there some reason that doing this for the single-item case will make it difficult to implement the additional improvements you're suggesting later? Given the cost of a DISTINCT and that we know pretty easily if one exists or not, trying to eliminate it, even in the single-item case, seems pretty worthwhile to me. If it makes it difficult to improve on this later then I could see pushing back on it, but this patch doesn't seem like its doing that. Thanks! Stephen
Attachment
On 24 August 2018 at 11:34, Stephen Frost <sfrost@snowman.net> wrote: > Greetings, > > * David Rowley (david.rowley@2ndquadrant.com) wrote: >> My personal opinion of only being able to completely remove the >> DISTINCT when there's a single item in the rtable (or a single base >> table) is that it's just too poor to bother with. I think such a >> solution is best left to another patch and it should be much more >> complete and be able to track the unique properties through joins. > > Doesn't that move the goalposts for this pretty far off? Is there > some reason that doing this for the single-item case will make it > difficult to implement the additional improvements you're suggesting > later? The DISTINCT clause removal would likely need to wait until around where the distinct paths are created and all the join processing has been done. If the optimisation is deemed worthwhile, then maybe that would be the place to put this code along with a comment that explains how it can be improved to support the removal with multiple relations. > Given the cost of a DISTINCT and that we know pretty easily if one > exists or not, trying to eliminate it, even in the single-item case, > seems pretty worthwhile to me. If it makes it difficult to improve on > this later then I could see pushing back on it, but this patch doesn't > seem like its doing that. If people truly think it's worthwhile, then maybe it is, but if it's being done because it can be done then perhaps it's best not to bother. I was just thinking of the bug reports we'd get when people see it does not work with multiple relations. There's already no shortage of bug reports that are not bugs. I also wanted to keep this patch just doing the simple case that we already apply to GROUP BY, but control over the patch may well be out of my hands now. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Greetings,

* Jim Finnerty (jfinnert@amazon.com) wrote:
> I was thinking about this last night and I realized that putting the new hasModifiedDistinct flag on the PlannerInfo struct eliminates the need to deal with the serialization issues, and makes it simpler.
>
> Here's a new patch (v7) that puts the bool on the PlannerInfo struct, and adds a couple of tests.
>
> re: why did you apply the patch on v10?
>
> I'm developing on a v10.5 codebase at the moment, though this will change soon. If the v7 patch doesn't apply cleanly on later versions, please let me know and I'll fix it.

Just going back through this thread to check that I didn't miss anything, I saw this and felt it deserved a comment: please don't post feature patches like this against existing released versions and then ask everyone else to see if they work against current HEAD or not. We're all for having new contributors of features, but those contributions should be against the current HEAD.

> re: if you're proposing the patch for v12, why do you care about catversion?
>
> Only because it would be a problem to test the patch on 10.5 with a catversion change that wouldn't come until v12. With the v7 patch this issue is moot because it no longer requires a catversion change.

This would be an example of concerns that really shouldn't be getting raised on this list for new features. The project has a very clear policy regarding features vs. bug-fixes and I really don't think it's necessary or desirable to have patches being floated with subsequent threads here which go directly against that policy.

Thanks!

Stephen
Greetings,

* David Rowley (david.rowley@2ndquadrant.com) wrote:
> On 24 August 2018 at 11:34, Stephen Frost <sfrost@snowman.net> wrote:
> > * David Rowley (david.rowley@2ndquadrant.com) wrote:
> >> My personal opinion of only being able to completely remove the DISTINCT when there's a single item in the rtable (or a single base table) is that it's just too poor to bother with. I think such a solution is best left to another patch and it should be much more complete and be able to track the unique properties through joins.
> >
> > Doesn't that move the goalposts for this pretty far off? Is there some reason that doing this for the single-item case will make it difficult to implement the additional improvements you're suggesting later?
>
> The DISTINCT clause removal would likely need to wait until around where the distinct paths are created and all the join processing has been done.
>
> If the optimisation is deemed worthwhile, then maybe that would be the place to put this code along with a comment that explains how it can be improved to support the removal with multiple relations.

Hm, so you're suggesting that this isn't the right place for this optimization to be implemented, even now, with the single-relation caveat? I do agree that including comments about how it could/should be improved would be good for future developers.

> > Given the cost of a DISTINCT and that we know pretty easily if one exists or not, trying to eliminate it, even in the single-item case, seems pretty worthwhile to me. If it makes it difficult to improve on this later then I could see pushing back on it, but this patch doesn't seem like it's doing that.
>
> If people truly think it's worthwhile, then maybe it is, but if it's being done because it can be done then perhaps it's best not to bother. I was just thinking of the bug reports we'd get when people see it does not work with multiple relations.
> There's already no shortage of bug reports that are not bugs. I also wanted to keep this patch just doing the simple case that we already apply to GROUP BY, but control over the patch may well be out of my hands now.

I'm not really sure I'm following along this part of the discussion. As long as it isn't making the code particularly ugly, violating the separation of various components, or making later hacking in this area particularly worse, then the question primarily comes down to whether the code is of quality and whether the performance optimization is worth the cycles spent checking for the opportunity, relative to the cost of the operation itself, all of which can be pretty objectively assessed and provide the answer to whether it's worthwhile or not.

Given the cost of a Unique node, and that we will only check for this optimization when we are going to have to introduce one, it certainly seems like it's worthwhile (again, proviso on code quality, et al, above, but I don't think that's what you were arguing a concern about here).

In short, I don't buy the concern about having to deal with "bug reports that aren't bug reports" or that they should be a barrier to feature work. We already see complaints pretty regularly about optimizations we don't have, and this would at least be a step in the right direction for dealing with these kinds of queries, and that does seem better than doing nothing, imv.

Thanks!

Stephen
Stephen Frost <sfrost@snowman.net> writes:
> * David Rowley (david.rowley@2ndquadrant.com) wrote:
>> On 24 August 2018 at 11:34, Stephen Frost <sfrost@snowman.net> wrote:
>>> * David Rowley (david.rowley@2ndquadrant.com) wrote:
>>>> My personal opinion of only being able to completely remove the DISTINCT when there's a single item in the rtable (or a single base table) is that it's just too poor to bother with.

> Hm, so you're suggesting that this isn't the right place for this optimization to be implemented, even now, with the single-relation caveat?

There is no case where planner optimizations should depend on the length of the rtable. Full stop.

It could make sense to optimize if there is just one baserel in the join tree --- although even that is best checked only after join removal.

As an example of the difference, such an optimization should be able to optimize "select * from view" if the view contains just one base table. The rtable will list both the view and the base table, but the view is only hanging around for permissions-checking purposes; it should not affect the planner's behavior.

I've not read the patch, but David's reaction makes it sound like its processing is done too early. There are right places and wrong places to do most everything in the planner, and I do not wish to accept a patch that does something in the wrong place.

regards, tom lane
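[Editor's note: to make the rtable-vs-baserel distinction concrete, here is a hypothetical sketch of the view case; the names `t` and `v` are invented for illustration and do not appear in the patch.]

```sql
-- A view over a single base table. For a query against the view, the
-- planner's range table (rtable) lists two entries (the view and the
-- base table), but only one baserel appears in the join tree once the
-- view has been expanded; the view entry remains solely for
-- permissions checking.
CREATE TABLE t (k int PRIMARY KEY, col text);
CREATE VIEW v AS SELECT k, col FROM t;

-- A test on the length of the rtable would wrongly skip this query,
-- even though only one base relation is involved and the DISTINCT
-- covers its entire primary key:
SELECT DISTINCT k, col FROM v ORDER BY k;
```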
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * David Rowley (david.rowley@2ndquadrant.com) wrote:
> >> On 24 August 2018 at 11:34, Stephen Frost <sfrost@snowman.net> wrote:
> >>> * David Rowley (david.rowley@2ndquadrant.com) wrote:
> >>>> My personal opinion of only being able to completely remove the DISTINCT when there's a single item in the rtable (or a single base table) is that it's just too poor to bother with.
>
> > Hm, so you're suggesting that this isn't the right place for this optimization to be implemented, even now, with the single-relation caveat?
>
> There is no case where planner optimizations should depend on the length of the rtable. Full stop.
>
> It could make sense to optimize if there is just one baserel in the join tree --- although even that is best checked only after join removal.

Hm, that's certainly a fair point.

> As an example of the difference, such an optimization should be able to optimize "select * from view" if the view contains just one base table. The rtable will list both the view and the base table, but the view is only hanging around for permissions-checking purposes; it should not affect the planner's behavior.

This is happening at the same time as some optimizations around GROUP BY, so either there's something different about what's happening there and I didn't appreciate it, or does that optimization suffer from a similar issue?

> I've not read the patch, but David's reaction makes it sound like its processing is done too early. There are right places and wrong places to do most everything in the planner, and I do not wish to accept a patch that does something in the wrong place.

Right, I definitely agree with you there. This seemed like a reasonable place given the similar optimization (at least in appearance to me) being done there for the GROUP BY case.
I'm happy to admit that I haven't looked at it in very much depth (hence my question to David) and I'm not an expert in this area, but I did want to bring up that the general idea and the relative trade-offs at least sounded reasonable.

I'll also note that I didn't see these concerns raised earlier in the thread when I re-read your remarks on it, so I'm a bit concerned that perhaps either this isn't an actual concern to be realized or perhaps it was missed previously.

Thanks!

Stephen
On 24 August 2018 at 14:12, Stephen Frost <sfrost@snowman.net> wrote:
> This is happening at the same time as some optimizations around GROUP BY, so either there's something different about what's happening there and I didn't appreciate it, or does that optimization suffer from a similar issue?

There are two separate optimisations in Jim's version of the patch:

1) Attempting to remove useless DISTINCT clause items in cases where a true subset of the clauses makes up the entire table's primary key. This can be applied no matter how many joins are in the query. If a table's PK is a,b then the result is not going to be any more distinct if we distinctify on a,b,c. Distinctifying on a,b is the most we'll ever need. We already apply this optimisation to the GROUP BY clause.

2) Attempting to remove the entire DISTINCT clause if the query is on a single base table and the DISTINCT clause contains the entire table's PK. Here there's no need to uniquify the results.

I started this thread for #1 and Jim would like to add #2 as an additional optimisation. I feel that #2 is different enough from #1 that they should be considered independently of each other. The area where #2 should be implemented is far away at the other end of planning from where #1 is being done. I believe that two separate patches are the best way forward here, as it does not really make sense to reject #1 because there are concerns with #2, or vice versa.

If Jim would like to propose #2, then I think a separate thread is a better option than tagging along on this one. This might also reduce confusion. I'd also rather not turn this thread into people discussing/reviewing Jim's work that he plans to implement in some fork taken from Postgres 10.5. I don't really think even this mailing list is the right place for that, let alone this thread.

> > I've not read the patch, but David's reaction makes it sound like its processing is done too early.
> > There are right places and wrong places to do most everything in the planner, and I do not wish to accept a patch that does something in the wrong place.
>
> Right, I definitely agree with you there. This seemed like a reasonable place given the similar optimization (at least in appearance to me) being done there for the GROUP BY case. I'm happy to admit that I haven't looked at it in very much depth (hence my question to David) and I'm not an expert in this area, but I did want to bring up that the general idea and the relative trade-offs at least sounded reasonable.

You might be confusing #1 and #2. My concern is with #2. The existing GROUP BY clause optimisation is almost identical to #1. I just wanted to also apply it to the DISTINCT clause.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
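[Editor's note: the two optimisations David distinguishes can be sketched with a small example; the table `ab` and its columns are invented for illustration.]

```sql
CREATE TABLE ab (a int, b int, c text, PRIMARY KEY (a, b));

-- #1: c is functionally dependent on the primary key (a, b), so
-- "DISTINCT a, b, c" can be reduced to "DISTINCT a, b". This holds no
-- matter how many other relations the query joins to.
SELECT DISTINCT a, b, c FROM ab;

-- #2: with a single base table and the entire primary key in the
-- DISTINCT list, every row is already unique, so the DISTINCT (and
-- the Sort/Unique or HashAggregate step it would otherwise cost)
-- could be removed outright.
SELECT DISTINCT a, b FROM ab;
```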
I feel strongly that eliminating the entire DISTINCT or GROUP BY clause (when there are no aggs) is an important optimization, especially when the incremental cost to test for it is so tiny. I'm happy to submit that as a separate thread. My goal here was to move the original proposal along and contribute a little something back to the community in the process.

DISTINCT optimization is currently quite poor compared to the leading commercial RDBMS alternatives, and doing unnecessary DISTINCT in the single-table case is an example of that. There are other missing DISTINCT optimizations.

I'll explore a proper way to test that it's in the single-relation case, and will post a separate thread for the 'remove unnecessary DISTINCT' optimization.

Cheers,

/Jim

On 8/23/18, 11:12 PM, "David Rowley" <david.rowley@2ndquadrant.com> wrote:

    You might be confusing #1 and #2. My concern is with #2. The existing GROUP BY clause optimisation is almost identical to #1. I just wanted to also apply it to the DISTINCT clause.

    --
    David Rowley http://www.2ndQuadrant.com/
    PostgreSQL Development, 24x7 Support, Training & Services
On 25 August 2018 at 04:05, Finnerty, Jim <jfinnert@amazon.com> wrote: > I'll explore a proper way to test that it's in the single-relation case, and will post a separate thread for the 'removeunnecessary DISTINCT' optimization. See: has_multiple_baserels() in allpaths.c -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services