Re: Use merge-based matching for MCVs in eqjoinsel - Mailing list pgsql-hackers

From: David Geier
Subject: Re: Use merge-based matching for MCVs in eqjoinsel
Date:
Msg-id: 412007b6-a324-44f5-91ed-c869004854ec@gmail.com
In response to: Re: Use merge-based matching for MCVs in eqjoinsel (Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>)
Responses: Re: Use merge-based matching for MCVs in eqjoinsel
List: pgsql-hackers
Hi Ilia!

> I have read all the previous messages - and yes, you are right. I don’t
> know why I didn’t consider using a hash table approach initially. Your
> idea makes a lot of sense.

Your merge-based solution would still be beneficial on top of that, for
cases where the data type is not hashable. But I think that's overkill
for a v1; I would start with the hash-based version.
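
To make the idea concrete, here is a rough sketch of the hash-based
matching (not the actual patch: hash_mcv_match is a made-up helper name,
memory-context and edge-case handling are omitted, and the caller is
assumed to have the same state eqjoinsel_inner() has today, i.e. the two
AttStatsSlot MCV slots, the equality function's FmgrInfo and the
collation):

#include "postgres.h"

#include "fmgr.h"
#include "utils/lsyscache.h"

/*
 * Match the two MCV lists via a chained hash table built over the inner
 * list.  Returns the accumulated match frequency product, or -1 if the
 * operator is not hashable and the caller must fall back to the existing
 * nested loop.
 */
static double
hash_mcv_match(Oid opno, Oid collation, FmgrInfo *eqproc,
               AttStatsSlot *sslot1, AttStatsSlot *sslot2,
               bool *hasmatch1, bool *hasmatch2)
{
    RegProcedure lhs_hashproc;
    RegProcedure rhs_hashproc;
    FmgrInfo    lhs_hash;
    FmgrInfo    rhs_hash;
    int         nbuckets;
    int        *bucket;
    int        *chain;
    double      matchprodfreq = 0.0;

    /*
     * Hash opfamilies guarantee that the lhs/rhs hash functions produce
     * compatible hash codes, even for cross-type operators.
     */
    if (!get_op_hash_functions(opno, &lhs_hashproc, &rhs_hashproc))
        return -1.0;            /* not hashable */
    fmgr_info(lhs_hashproc, &lhs_hash);
    fmgr_info(rhs_hashproc, &rhs_hash);

    /* Crude bucket sizing; good enough for a sketch. */
    nbuckets = Max(sslot2->nvalues, 1);
    bucket = (int *) palloc(nbuckets * sizeof(int));
    chain = (int *) palloc(sslot2->nvalues * sizeof(int));
    for (int b = 0; b < nbuckets; b++)
        bucket[b] = -1;

    /* Build phase: chain every inner-side MCV into its bucket. */
    for (int j = 0; j < sslot2->nvalues; j++)
    {
        uint32 h = DatumGetUInt32(FunctionCall1Coll(&rhs_hash, collation,
                                                    sslot2->values[j]));
        int    b = h % nbuckets;

        chain[j] = bucket[b];
        bucket[b] = j;
    }

    /*
     * Probe phase: hash each outer-side MCV and confirm candidates with
     * the real equality operator, since distinct values can share a
     * bucket.  At most one match exists because MCVs are distinct.
     */
    for (int i = 0; i < sslot1->nvalues; i++)
    {
        uint32 h = DatumGetUInt32(FunctionCall1Coll(&lhs_hash, collation,
                                                    sslot1->values[i]));

        for (int j = bucket[h % nbuckets]; j >= 0; j = chain[j])
        {
            if (DatumGetBool(FunctionCall2Coll(eqproc, collation,
                                               sslot1->values[i],
                                               sslot2->values[j])))
            {
                hasmatch1[i] = hasmatch2[j] = true;
                matchprodfreq += sslot1->numbers[i] * sslot2->numbers[j];
                break;
            }
        }
    }

    pfree(bucket);
    pfree(chain);
    return matchprodfreq;
}

That turns the O(N * M) equality-operator calls of today's nested loop
into O(N + M) hash calls plus one equality call per colliding pair.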

> To evaluate it, I ran benchmarks on JOB with three variants:
> 
> $ ./benchmark.sh master
> $ ./benchmark.sh merge
> $ ./benchmark.sh hash
> 
> I compared total planning time across all 113 queries.

Was this built with optimizations enabled? And how did you extract the planning time?

> 
> $ python3 planning_time.py master hash
> default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
> --------------------------+---------------------+---------------------+-------------------
> 100                       | 1.00                | 1892.627            | 1886.969
> 1000                      | 1.09                | 2286.922            | 2100.099
> 2500                      | 1.94                | 4647.167            | 2400.711
> 5000                      | 6.15                | 17964.779           | 2919.914
> 7500                      | 10.58               | 38622.443           | 3650.375
> 10000                     | 16.33               | 69538.085           | 4257.864
> 
> $ python3 planning_time.py master merge
> default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
> --------------------------+---------------------+---------------------+-------------------
> 100                       | 1.00                | 1892.627            | 1898.622
> 1000                      | 1.12                | 2286.922            | 2033.553
> 2500                      | 1.92                | 4647.167            | 2423.552
> 5000                      | 5.94                | 17964.779           | 3025.739
> 7500                      | 10.48               | 38622.443           | 3684.262
> 10000                     | 16.72               | 69538.085           | 4159.418

I would have expected the delta between the "merge" and "hash" variants
to be bigger, especially for default_statistics_target=10000; my own
small test did show a bigger difference. Any idea why that's not showing
up in your results?

> Based on these results, I’d prefer the hash lookup implementation, so I
> think it makes sense to improve your patch further and bring it into
> good shape. Shall I take care of that, or would you prefer to do it
> yourself?

I think, process-wise, it's best if you review my code and I make the changes.

Could you, as part of your review, test tables with just a few MCVs to
make sure we're not regressing "small" cases? For now I only bail out if
one of the two MCV lists has just a single value. I expect the gains
from fine-tuning this threshold to be unmeasurable, but let's
double-check.
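
For illustration, such a bail-out could sit right before building the
hash table (hypothetical variable names; the threshold of 1 is what the
current patch uses, per the above):

/*
 * Hypothetical bail-out: skip building the hash table when either MCV
 * list is trivially small, since the plain nested loop is then at least
 * as fast.  Raising this threshold is what's worth benchmarking.
 */
if (sslot1->nvalues <= 1 || sslot2->nvalues <= 1)
    use_hash_matching = false;  /* fall back to the O(N * M) loop */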

--
David Geier


