Home > mailing lists

Re: [PATCH] Erase the distinctClause if the result is unique by definition - Mailing list pgsql-hackers

From	Andy Fan
Subject	Re: [PATCH] Erase the distinctClause if the result is unique by definition
Date	March 18, 2020 02:56:58
Msg-id	CAKU4AWqOORqW900O-+L4L2+0xknsEqpfcs9FF7SeiO9TmpeZOg@mail.gmail.com Whole thread
In response to	Re: [PATCH] Erase the distinctClause if the result is unique by definition (David Rowley <dgrowleyml@gmail.com>)
Responses	Re: [PATCH] Erase the distinctClause if the result is unique by definition
List	pgsql-hackers

Tree view

Hi David:

Thanks for your time.

On Wed, Mar 18, 2020 at 9:56 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Mon, 16 Mar 2020 at 06:01, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> Hi All:
>
> I have re-implemented the patch based on David's suggestion/code, Looks it
> works well. The updated patch mainly includes:
>
> 1. Maintain the not_null_colno in RelOptInfo, which includes the not null from
> catalog and the not null from vars.

What about non-nullability that we can derive from means other than
NOT NULL constraints. Where will you track that now that you've
removed the UniqueKeySet type?

I tracked it in 'deconstruct_recurse', just before the distribute_qual_to_rels call.

+ ListCell *lc;
+ foreach(lc, find_nonnullable_vars(qual))
+ {
+ Var *var = lfirst_node(Var, lc);
+ RelOptInfo *rel = root->simple_rel_array[var->varno];
+ if (var->varattno > InvalidAttrNumber)
+ rel->not_null_cols = bms_add_member(rel->not_null_cols, var->varattno);
+ }

Traditionally we use attno or attnum rather than colno for variable
names containing attribute numbers

Currently I use a list of Var for a UnqiueKey, I guess it is ok?

> 3. postpone the propagate_unique_keys_to_joinrel call to populate_joinrel_with_paths
> since we know jointype at that time. so we can handle the semi/anti join specially.

ok, but the join type was known already where I was calling the
function from. It just wasn't passed to the function.

> 4. Add the rule I suggested above, if both of the 2 relation yields the a unique result,
> the join result will be unique as well. the UK can be ( (rel1_uk1, rel1_uk2).. )

I see. So basically you're saying that the joinrel's uniquekeys should
be the cartesian product of the unique rels from either side of the
join. I wonder if that's a special case we need to worry about too
much. Surely it only applies for clauseless joins

Some other cases we may need this as well:). like select m1.pk, m2.pk from m1, m2

where m1.b = m2.b;

The cartesian product of the unique rels will make the unqiue keys too long, so I maintain

the UnqiueKeyContext to make it short. The idea is if (UK1) is unique already, no bother

to add another UK as (UK1, UK2) which is just a superset of it.

> 5. If the unique key is impossible to be referenced by others, we can safely ignore
> it in order to keep the (join)rel->unqiuekeys short.

You could probably have an equivalent of has_useful_pathkeys() and
pathkeys_useful_for_ordering()

Thanks for suggestion, I will do so in the v5-xx.patch.

> 6. I only consider the not null check/opfamily check for the uniquekey which comes
> from UniqueIndex. I think that should be correct.
> 7. I defined each uniquekey as List of Expr, so I didn't introduce new node type.

Where will you store the collation Oid? I left comments to mention
that needed to be checked but just didn't wire it up.

This is too embarrassed, I am not sure if it is safe to ignore it. I removed it due to

the following reasons (sorry for that I didn't explain it carefully for the last email).

1. When we choose if an UK is usable, we have chance to compare the collation info

for restrictinfo (where uk = 1) or target list (select uk from t) with the indexinfo's collation,

the targetlist one has more trouble since we need to figure out the default collation for it.

However relation_has_unique_index_for has the same situation as us, it ignores it as well.

See comment /* XXX at some point we may need to check collations here too. */. It think

if there are some reasons we can ignore that.

2. What we expect from UK is:

a). Where m1.uniquekey = m2.b m2.uk will not be duplicated by this joinclause. Here

if m1.uk has different collation, it will raise runtime error.

b). Where m1.uniquekey collate 'xxxx' = m2.b. We may can't depends on

the run-time error this time. But if we are sure that *if uk is uk at default collation is unique,

then (uk collcate 'other-colation') is unique as well**. if so we may safe ignore it as well.

c). select uniquekey from t / select uniquekey collate 'xxxx' from t. This have the same

requirement as item b).

3). Looks maintain the collation for our case totally is a big effort, and user rarely use it, If

my expectation for 2.b is not true, I prefer to detect such case (user use a different collation),

we can just ignore the UK for that.

But After all, I think this should be an open question for now.

---

At last, I am so grateful for your suggestion/feedback, that's really heuristic and constructive.

And so thanks Tom's for the quick review and suggest to add a new fields for RelOptInfo,

without it I don't think I can add a new field to a so important struct. And also thanks Bapat who

explains the thing more detailed. I'm now writing the code for partition index stuff, which

is a bit of boring, since every partition may have different unique index. I am expecting that

I can finish it in the following 2 days, and hope you can have another round of review again.

Thanks for your feedback!

Best Regards

Andy Fan

pgsql-hackers by date:

From: Michael Paquier
Date: 18 March 2020, 02:48:37
Subject: Re: ALTER tbl rewrite loses CLUSTER ON index

From: Tom Lane
Date: 18 March 2020, 02:58:54
Subject: Re: Small docs bugfix: make it clear what can be used in UPDATE FROM and DELETE USING

Re: [PATCH] Erase the distinctClause if the result is unique by definition - Mailing list pgsql-hackers

Previous

Next