Re: [PATCH] Erase the distinctClause if the result is unique by definition - Mailing list pgsql-hackers
From | Andy Fan |
---|---|
Subject | Re: [PATCH] Erase the distinctClause if the result is unique by definition |
Date | |
Msg-id | CAKU4AWqOORqW900O-+L4L2+0xknsEqpfcs9FF7SeiO9TmpeZOg@mail.gmail.com Whole thread Raw |
In response to | Re: [PATCH] Erase the distinctClause if the result is unique by definition (David Rowley <dgrowleyml@gmail.com>) |
Responses |
Re: [PATCH] Erase the distinctClause if the result is unique by definition
|
List | pgsql-hackers |
Hi David:
On Wed, Mar 18, 2020 at 9:56 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Mon, 16 Mar 2020 at 06:01, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> Hi All:
>
> I have re-implemented the patch based on David's suggestion/code, Looks it
> works well. The updated patch mainly includes:
>
> 1. Maintain the not_null_colno in RelOptInfo, which includes the not null from
> catalog and the not null from vars.
What about non-nullability that we can derive from means other than
NOT NULL constraints. Where will you track that now that you've
removed the UniqueKeySet type?
I tracked it in 'deconstruct_recurse', just before the distribute_qual_to_rels call.
+ ListCell *lc;
+ foreach(lc, find_nonnullable_vars(qual))
+ {
+ Var *var = lfirst_node(Var, lc);
+ RelOptInfo *rel = root->simple_rel_array[var->varno];
+ if (var->varattno > InvalidAttrNumber)
+ rel->not_null_cols = bms_add_member(rel->not_null_cols, var->varattno);
+ }
+ foreach(lc, find_nonnullable_vars(qual))
+ {
+ Var *var = lfirst_node(Var, lc);
+ RelOptInfo *rel = root->simple_rel_array[var->varno];
+ if (var->varattno > InvalidAttrNumber)
+ rel->not_null_cols = bms_add_member(rel->not_null_cols, var->varattno);
+ }
Traditionally we use attno or attnum rather than colno for variable
names containing attribute numbers
Currently I use a list of Var for a UnqiueKey, I guess it is ok?
> 3. postpone the propagate_unique_keys_to_joinrel call to populate_joinrel_with_paths
> since we know jointype at that time. so we can handle the semi/anti join specially.
ok, but the join type was known already where I was calling the
function from. It just wasn't passed to the function.
> 4. Add the rule I suggested above, if both of the 2 relation yields the a unique result,
> the join result will be unique as well. the UK can be ( (rel1_uk1, rel1_uk2).. )
I see. So basically you're saying that the joinrel's uniquekeys should
be the cartesian product of the unique rels from either side of the
join. I wonder if that's a special case we need to worry about too
much. Surely it only applies for clauseless joins
where m1.b = m2.b;
The cartesian product of the unique rels will make the unqiue keys too long, so I maintain
the UnqiueKeyContext to make it short. The idea is if (UK1) is unique already, no bother
to add another UK as (UK1, UK2) which is just a superset of it.
> 5. If the unique key is impossible to be referenced by others, we can safely ignore
> it in order to keep the (join)rel->unqiuekeys short.
You could probably have an equivalent of has_useful_pathkeys() and
pathkeys_useful_for_ordering()
Thanks for suggestion, I will do so in the v5-xx.patch.
> 6. I only consider the not null check/opfamily check for the uniquekey which comes
> from UniqueIndex. I think that should be correct.
> 7. I defined each uniquekey as List of Expr, so I didn't introduce new node type.
Where will you store the collation Oid? I left comments to mention
that needed to be checked but just didn't wire it up.
This is too embarrassed, I am not sure if it is safe to ignore it. I removed it due to
the following reasons (sorry for that I didn't explain it carefully for the last email).
1. When we choose if an UK is usable, we have chance to compare the collation info
for restrictinfo (where uk = 1) or target list (select uk from t) with the indexinfo's collation,
the targetlist one has more trouble since we need to figure out the default collation for it.
However relation_has_unique_index_for has the same situation as us, it ignores it as well.
See comment /* XXX at some point we may need to check collations here too. */. It think
if there are some reasons we can ignore that.
2. What we expect from UK is:
a). Where m1.uniquekey = m2.b m2.uk will not be duplicated by this joinclause. Here
if m1.uk has different collation, it will raise runtime error.
b). Where m1.uniquekey collate 'xxxx' = m2.b. We may can't depends on
the run-time error this time. But if we are sure that *if uk is uk at default collation is unique,
then (uk collcate 'other-colation') is unique as well**. if so we may safe ignore it as well.
c). select uniquekey from t / select uniquekey collate 'xxxx' from t. This have the same
requirement as item b).
3). Looks maintain the collation for our case totally is a big effort, and user rarely use it, If
my expectation for 2.b is not true, I prefer to detect such case (user use a different collation),
we can just ignore the UK for that.
But After all, I think this should be an open question for now.
---
At last, I am so grateful for your suggestion/feedback, that's really heuristic and constructive.
And so thanks Tom's for the quick review and suggest to add a new fields for RelOptInfo,
without it I don't think I can add a new field to a so important struct. And also thanks Bapat who
explains the thing more detailed. I'm now writing the code for partition index stuff, which
is a bit of boring, since every partition may have different unique index. I am expecting that
I can finish it in the following 2 days, and hope you can have another round of review again.
Thanks for your feedback!
Best Regards
Andy Fan
pgsql-hackers by date: