Home > mailing lists

Re: [PATCH] Erase the distinctClause if the result is unique by definition - Mailing list pgsql-hackers

From	David Rowley
Subject	Re: [PATCH] Erase the distinctClause if the result is unique by definition
Date	March 9, 2020 22:21:27
Msg-id	CAApHDvq7i0=O97r4Y1pv68+tprVczKsXRsV28rM9H-rVPOfeNQ@mail.gmail.com Whole thread Raw
In response to	Re: [PATCH] Erase the distinctClause if the result is unique by definition (Andy Fan <zhihui.fan1213@gmail.com>)
Responses	Re: [PATCH] Erase the distinctClause if the result is unique by definition Re: [PATCH] Erase the distinctClause if the result is unique by definition
List	pgsql-hackers

Tree view

On Sat, 7 Mar 2020 at 00:47, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> Upload the newest patch so that the cfbot can pass.  The last patch failed
> because some explain without the (cost off).

I've only really glanced at this patch, but I think we need to do this
in a completely different way.

I've been mentioning UniqueKeys around this mailing list for quite a
while now [1]. To summarise the idea:

1. Add a new List field to RelOptInfo named unique_keys
2. During get_relation_info() process the base relation's unique
indexes and add to the RelOptInfo's unique_keys list the indexed
expressions from each unique index (this may need to be delayed until
check_index_predicates() since predOK is only set there)
3. Perhaps in add_paths_to_joinrel(), or maybe when creating the join
rel itself (I've not looked for the best location in detail),
determine if the join can cause rows to be duplicated. If it can't,
then add the UniqueKeys from that rel.  For example: SELECT * FROM t1
INNER JOIN t2 ON t1.unique = t2.not_unique; would have the joinrel for
{t1,t2} only take the unique keys from t2 (t1 can't duplicate t2 rows
since it's an eqijoin and t1.unique has a unique index). If the
condition was t1.unique = t2.unique then we could take the unique keys
from both sides of the join, and with t1.non_unique = t2.non_unique,
we can take neither.
4. When creating the GROUP BY paths (when there are no aggregates),
don't bother doing anything if the input rel's unique keys are a
subset of the GROUP BY clause. Otherwise, create the group by paths
and tag the new unique keys onto the GROUP BY rel.
5. When creating the DISTINCT paths, don't bother if the input rel has
unique keys are a subset of the distinct clause.

4 and 5 will mean that: SELECT DISTINCT non_unique FROM t1 GROUP BY
non_unique will just uniquify once for the GROUP BY and not for the
distinct.  SELECT DISTINCT unique FROM t1 GROUP BY unique; won't do
anything to uniquify the results.

Because both 4 and 5 require that the uniquekeys are a subset of the
distinct/group by clause, an empty uniquekey set would mean that the
RelOptInfo returns no more than 1 row.  That would allow your:

SELECT DISTINCT max(non_unique) FROM t1; to skip doing the DISTINCT part.

There's a separate effort in
https://commitfest.postgresql.org/27/1741/ to implement some parts of
the uniquekeys idea.  However the implementation currently only covers
adding the unique keys to Paths, not to RelOptInfos.

I also believe that the existing code in analyzejoins.c should be
edited to make use of unique keys rather than how it looks at unique
indexes and group by / distinct clauses.

[1] https://www.postgresql.org/search/?m=1&ln=pgsql-hackers&q=uniquekeys

pgsql-hackers by date:

From: Andres Freund
Date: 09 March 2020, 22:15:58
Subject: Re: Add an optional timeout clause to isolationtester step.

From: Thomas Munro
Date: 09 March 2020, 22:28:23
Subject: Re: effective_io_concurrency's steampunk spindle maths

Re: [PATCH] Erase the distinctClause if the result is unique by definition - Mailing list pgsql-hackers

Previous

Next