Re: [HACKERS] advanced partition matching algorithm for partition-wise join - Mailing list pgsql-hackers
From | Dmitry Dolgov |
---|---|
Subject | Re: [HACKERS] advanced partition matching algorithm for partition-wise join |
Date | |
Msg-id | CA+q6zcVizVudWJmy=XZ4uLA_5p9Rrr-W98K_SghtVQNBpertUA@mail.gmail.com |
In response to | Re: [HACKERS] advanced partition matching algorithm for partition-wise join (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>) |
Responses | Re: [HACKERS] advanced partition matching algorithm for partition-wise join |
List | pgsql-hackers |
> On Tue, 17 Jul 2018 at 11:58, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> > On Sun, Jul 15, 2018 at 11:13 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> >> On Thu, 28 Jun 2018 at 07:54, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >>
> >> On 2018/06/27 22:21, Ashutosh Bapat wrote:
> >> > On Wed, Jun 27, 2018 at 12:28 PM, Amit Langote
> >> >> Ah, okay. I thought of reporting this because I felt the errors may have
> >> >> to do with changes to the related code in HEAD between May 14, when you
> >> >> last posted the patches, and today that you may need to account for in your
> >> >> patches. For instance, there are many diffs like this:
> >> >>
> >> >> Looks like the Result node on top of Append is no longer there after
> >> >> applying your patch.
> >> >
> >> > Yes. They are coming because of a commit which removed the Result node on
> >> > top of an Append node. I don't remember exactly which.
> >> >
> >> > I wouldn't worry about those diffs at this time. As I have mentioned
> >> > in earlier mails, the expected output from 0006 is quite large and is
> >> > not supposed to be committed. So, I don't see much value in fixing the
> >> > plans in that output.
> >> >
> >> > Do you see that as a hindrance in reviewing the code changes and tests in 0005?
> >>
> >> I think not. I'll ignore 0006 for now and focus on the other patches.
> >
> > Hi,
> >
> > Sorry for my irregular reviews. Unfortunately, the patch needs to be rebased
> > again.
>
> I rebased the patches without any problem on top of commit

Thanks!

> > In the meantime I have a few more comments about range_bounds_merge:
> >
> > * From what I see, partition_range_bound_cmp is executed on the same bounds a
> >   few times to find whether they are overlapping and during the range merging;
> >   is that necessary? It would probably be enough to compare the current ranges
> >   once per iteration.
>
> The bounds that are compared in those cases are different. Any two
> bounds are compared only once per iteration. Remember that each range has
> two bounds, so there are four comparisons possible in total. In case of
> overlap, we do all four comparisons.

Yep, indeed, now I see.

> > * Just to clarify - is iterating through all the partitions the best way of
> >   finding matching ranges? Correct me if I'm wrong, but from what I see in
> >   the comments for PartitionBoundInfoData, all the ranges are arranged in
> >   increasing order - maybe it's then possible to use binary search to find
> >   a matching range?
>
> The loop in partition_range_bounds_merge() runs as many times as the
> maximum of the number of datums from the given partition bounds. So the
> complexity is O(n), where n is the maximum number of datums. With
> binary search the complexity would increase to O(n log n). I might be
> missing something here.

Now I'm confused even more. Correct me if I'm wrong, but here is what I see
right now:

* We're trying to solve the standard problem of finding overlapping intervals
  from two sets of intervals.

* The current implementation implicitly compares every range from one side of
  a join with every range from the other side, which is O(n^2). A rough sketch
  of the merge-style alternative I have in mind follows below.
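For illustration only, here is a minimal standalone sketch of that merge-style
sweep. It assumes plain int bounds and made-up names (Range, report_overlaps);
it is not the actual PartitionBoundInfoData machinery. Because each side is
sorted by lower bound and its ranges do not overlap each other, any two ranges
are compared at most once and the sweep is O(n + m):

#include <stdio.h>

typedef struct Range { int lb; int ub; } Range;    /* half-open [lb, ub) */

static void
report_overlaps(const Range *a, int na, const Range *b, int nb)
{
    int     i = 0,
            j = 0;

    while (i < na && j < nb)
    {
        /* Two ranges overlap iff each one starts before the other ends. */
        if (a[i].lb < b[j].ub && b[j].lb < a[i].ub)
            printf("a[%d] (%d, %d) overlaps b[%d] (%d, %d)\n",
                   i, a[i].lb, a[i].ub, j, b[j].lb, b[j].ub);

        /*
         * Advance whichever side ends first; since the other side's later
         * ranges start no earlier than its current one ends, the advanced
         * range cannot overlap anything further.
         */
        if (a[i].ub <= b[j].ub)
            i++;
        else
            j++;
    }
}

int
main(void)
{
    /* Bounds of test1_overlap / test2_overlap from the example below. */
    Range   a[] = {{200, 210}, {220, 230}, {240, 250}};
    Range   b[] = {{0, 10}, {20, 30}, {40, 50}, {60, 70}, {80, 90},
                   {100, 110}, {120, 130}, {140, 150}, {160, 170},
                   {180, 190}, {200, 210}};

    report_overlaps(a, 3, b, 11);   /* prints the single pair (200, 210) */
    return 0;
}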
Let's imagine we have two tables:

=# \d+ test1_overlap
                         Table "public.test1_overlap"
 Column |  Type   | Collation | Nullable | Default | Storage  |
--------+---------+-----------+----------+---------+----------+
 id     | integer |           |          |         | plain    |
 data   | jsonb   |           |          |         | extended |
Partition key: RANGE (id)
Partitions: test11_overlap FOR VALUES FROM (200) TO (210),
            test12_overlap FOR VALUES FROM (220) TO (230),
            test13_overlap FOR VALUES FROM (240) TO (250)

=# \d+ test2_overlap
                         Table "public.test2_overlap"
 Column |  Type   | Collation | Nullable | Default | Storage  |
--------+---------+-----------+----------+---------+----------+
 id     | integer |           |          |         | plain    |
 data   | jsonb   |           |          |         | extended |
Partition key: RANGE (id)
Partitions: test210_overlap FOR VALUES FROM (160) TO (170),
            test211_overlap FOR VALUES FROM (180) TO (190),
            test21_overlap FOR VALUES FROM (0) TO (10),
            test22_overlap FOR VALUES FROM (20) TO (30),
            test23_overlap FOR VALUES FROM (200) TO (210),
            test24_overlap FOR VALUES FROM (40) TO (50),
            test25_overlap FOR VALUES FROM (60) TO (70),
            test26_overlap FOR VALUES FROM (80) TO (90),
            test27_overlap FOR VALUES FROM (100) TO (110),
            test28_overlap FOR VALUES FROM (120) TO (130),
            test29_overlap FOR VALUES FROM (140) TO (150)

And the join:

select * from test1_overlap inner join test2_overlap using (id);

Partition-wise join will work fine, but what happens (I followed the code
under gdb) is that inside the function partition_range_bounds_merge we start
from these two partitions:

test11_overlap (200, 210) and test21_overlap (0, 10)

In the comparison loop we figure out that there is no overlap and go to the
ub_cmpval > 0 branch, because test11_overlap is higher than test21_overlap.
Inside this branch we think that we need to handle a missing-partition case
(apparently by mistake, but it works - if it's not actually a missing
partition, there are no records to join with from a default one anyway).
Since in this case there isn't any default partition, the result is
merged = true. After that, partition_range_get_next_lb_index moves us to the
next partition pair:

test11_overlap (200, 210) and test22_overlap (20, 30)

And so on and so forth, until we reach the partition test23_overlap (200, 210),
which is actually what we're looking for. So, basically, as I said above, we
iterate over all the partitions, and we could skip some of them using binary
search.

> >
> > * I'm not sure why partition-wise join is not applied in this case? Maybe I'm
> >   wrong, but I was under the impression that it should work here:
> >
> > =# \d+ test1;
> >                             Table "public.test1"
> >  Column |  Type   | Collation | Nullable | Default | Storage  |
> > --------+---------+-----------+----------+---------+----------+
> >  data   | jsonb   |           |          |         | extended |
> >  id     | integer |           |          |         | plain    |
> > Partition key: RANGE (id)
> > Indexes:
> >     "test1_idx" btree (id)
> > Partitions: test11 FOR VALUES FROM (0) TO (100),
> >             test12 FOR VALUES FROM (100) TO (200),
> >             test13 FOR VALUES FROM (200) TO (300)
> >
> > =# \d+ test2;
> >                             Table "public.test2"
> >  Column |  Type   | Collation | Nullable | Default | Storage  |
> > --------+---------+-----------+----------+---------+----------+
> >  data   | jsonb   |           |          |         | extended |
> >  id     | integer |           |          |         | plain    |
> > Partition key: RANGE (id)
> > Indexes:
> >     "test2_idx" btree (id)
> > Partitions: test21 FOR VALUES FROM (10) TO (110),
> >             test22 FOR VALUES FROM (110) TO (210),
> >             test23 FOR VALUES FROM (210) TO (310)
> >
> > I'm confused, since there is only one-to-one mapping of partitions.
>
> In this case, test21 overlaps test11 (10-100) and test12 (100-110),
> test22 overlaps test12 (110-200) and test13 (200-210). So, there is no
> one-to-one mapping.

Yep, thanks for the explanation.
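As a postscript, here is a minimal standalone sketch of the "skip partitions
using binary search" idea from the walkthrough above, again with plain int
bounds and made-up names (Range, first_candidate) rather than the actual
partition_range_bounds_merge code. Probing test1's bounds with test21 (10,
110) also makes the point about the mapping concrete: it reports two
overlapping partitions, test11 and test12.

#include <stdio.h>

typedef struct Range { int lb; int ub; } Range;    /* half-open [lb, ub) */

/*
 * In an array of non-overlapping ranges sorted by lower bound, find the
 * index of the first range whose upper bound is greater than probe->lb;
 * earlier ranges end before the probe starts and can be skipped entirely.
 */
static int
first_candidate(const Range *parts, int n, const Range *probe)
{
    int     lo = 0,
            hi = n;

    while (lo < hi)
    {
        int     mid = (lo + hi) / 2;

        if (parts[mid].ub <= probe->lb)
            lo = mid + 1;       /* ends before the probe starts: skip */
        else
            hi = mid;
    }
    return lo;
}

int
main(void)
{
    /* Partition bounds of test1: test11, test12, test13. */
    Range   test1[] = {{0, 100}, {100, 200}, {200, 300}};
    Range   test21 = {10, 110}; /* first partition of test2 */
    int     i = first_candidate(test1, 3, &test21);

    /* Scan forward while the probe still overlaps the next partition. */
    for (; i < 3 && test1[i].lb < test21.ub; i++)
        printf("test21 overlaps test1 partition %d (%d, %d)\n",
               i, test1[i].lb, test1[i].ub);

    return 0;
}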