Re: [HACKERS] advanced partition matching algorithm for partition-wise join - Mailing list pgsql-hackers
From | Dmitry Dolgov |
---|---|
Subject | Re: [HACKERS] advanced partition matching algorithm for partition-wise join |
Date | |
Msg-id | CA+q6zcVizVudWJmy=XZ4uLA_5p9Rrr-W98K_SghtVQNBpertUA@mail.gmail.com |
In response to | Re: [HACKERS] advanced partition matching algorithm for partition-wise join (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>) |
Responses | Re: [HACKERS] advanced partition matching algorithm for partition-wise join |
List | pgsql-hackers |
> On Tue, 17 Jul 2018 at 11:58, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> > On Sun, Jul 15, 2018 at 11:13 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> >> On Thu, 28 Jun 2018 at 07:54, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >>
> >> On 2018/06/27 22:21, Ashutosh Bapat wrote:
> >> > On Wed, Jun 27, 2018 at 12:28 PM, Amit Langote
> >> >> Ah, okay. I thought of reporting this because I felt the errors may have
> >> >> to do with changes to the related code in HEAD between May 14, when you
> >> >> last posted the patches, and today that you may need to account for in your
> >> >> patches. For instance, there are many diffs like this:
> >> >>
> >> >> Looks like the Result node on top of Append is no longer there after
> >> >> applying your patch.
> >> >
> >> > Yes. They are coming because of a commit which removed the Result node on
> >> > top of an Append node. I don't remember exactly which.
> >> >
> >> > I wouldn't worry about those diffs at this time. As I have mentioned
> >> > in earlier mails, the expected output from 0006 is quite large and is
> >> > not supposed to be committed. So, I don't see much value in fixing the
> >> > plans in that output.
> >> >
> >> > Do you see that as a hindrance in reviewing the code changes and tests in 0005?
> >>
> >> I think not. I'll ignore 0006 for now and focus on the other patches.
> >
> > Hi,
> >
> > Sorry for my irregular reviews. Unfortunately, the patch needs to be rebased
> > again.
>
> I rebased the patches without any problem on top of commit

Thanks!

> > In the meantime I have a few more comments about range_bounds_merge:
> >
> > * From what I see, partition_range_bound_cmp is executed on the same bounds a
> >   few times to find whether they are overlapping and during the range merging;
> >   is that necessary? It would probably be enough to compare the current ranges
> >   once per iteration.
>
> The bounds that are compared in those cases are different. Any two
> bounds are compared only once per iteration. Remember that each range has
> two bounds, so there are four comparisons possible in total. In case of
> overlap, we do all four comparisons.

Yep, indeed, now I see.

> > * Just to clarify - is iterating through all the partitions the best way of
> >   finding matching ranges? Correct me if I'm wrong, but from what I see in
> >   the comments for PartitionBoundInfoData, all the ranges are arranged in
> >   increasing order - maybe it's then possible to use binary search to find
> >   a matching range?
>
> The loop in partition_range_bounds_merge() runs as many times as the
> maximum of the number of datums from the given partition bounds. So the
> complexity is O(n), where n is the maximum number of datums. With
> binary search the complexity would increase to O(n log n). I might be
> missing something here.

Now I'm confused even more. Correct me if I'm wrong, but here is what I see
right now:

* We're trying to solve the standard problem of finding overlapping intervals
  from two sets of intervals.

* The current implementation implicitly compares every range from one side of
  a join with every range from the other side, which is O(n^2). A rough sketch
  of the merge-style alternative I have in mind follows below.
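For illustration only, here is a minimal standalone sketch of that merge-style
sweep. It assumes plain int bounds and made-up names (Range, report_overlaps);
it is not the actual PartitionBoundInfoData machinery. Because each side is
sorted by lower bound and its ranges do not overlap each other, any two ranges
are compared at most once and the sweep is O(n + m):

#include <stdio.h>

typedef struct Range { int lb; int ub; } Range;    /* half-open [lb, ub) */

static void
report_overlaps(const Range *a, int na, const Range *b, int nb)
{
    int     i = 0,
            j = 0;

    while (i < na && j < nb)
    {
        /* Two ranges overlap iff each one starts before the other ends. */
        if (a[i].lb < b[j].ub && b[j].lb < a[i].ub)
            printf("a[%d] (%d, %d) overlaps b[%d] (%d, %d)\n",
                   i, a[i].lb, a[i].ub, j, b[j].lb, b[j].ub);

        /*
         * Advance whichever side ends first; since the other side's later
         * ranges start no earlier than its current one ends, the advanced
         * range cannot overlap anything further.
         */
        if (a[i].ub <= b[j].ub)
            i++;
        else
            j++;
    }
}

int
main(void)
{
    /* Bounds of test1_overlap / test2_overlap from the example below. */
    Range   a[] = {{200, 210}, {220, 230}, {240, 250}};
    Range   b[] = {{0, 10}, {20, 30}, {40, 50}, {60, 70}, {80, 90},
                   {100, 110}, {120, 130}, {140, 150}, {160, 170},
                   {180, 190}, {200, 210}};

    report_overlaps(a, 3, b, 11);   /* prints the single pair (200, 210) */
    return 0;
}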
Let's imagine we have two tables:

=# \d+ test1_overlap
                         Table "public.test1_overlap"
 Column |  Type   | Collation | Nullable | Default | Storage  |
--------+---------+-----------+----------+---------+----------+
 id     | integer |           |          |         | plain    |
 data   | jsonb   |           |          |         | extended |
Partition key: RANGE (id)
Partitions: test11_overlap FOR VALUES FROM (200) TO (210),
            test12_overlap FOR VALUES FROM (220) TO (230),
            test13_overlap FOR VALUES FROM (240) TO (250)

=# \d+ test2_overlap
                         Table "public.test2_overlap"
 Column |  Type   | Collation | Nullable | Default | Storage  |
--------+---------+-----------+----------+---------+----------+
 id     | integer |           |          |         | plain    |
 data   | jsonb   |           |          |         | extended |
Partition key: RANGE (id)
Partitions: test210_overlap FOR VALUES FROM (160) TO (170),
            test211_overlap FOR VALUES FROM (180) TO (190),
            test21_overlap FOR VALUES FROM (0) TO (10),
            test22_overlap FOR VALUES FROM (20) TO (30),
            test23_overlap FOR VALUES FROM (200) TO (210),
            test24_overlap FOR VALUES FROM (40) TO (50),
            test25_overlap FOR VALUES FROM (60) TO (70),
            test26_overlap FOR VALUES FROM (80) TO (90),
            test27_overlap FOR VALUES FROM (100) TO (110),
            test28_overlap FOR VALUES FROM (120) TO (130),
            test29_overlap FOR VALUES FROM (140) TO (150)

And the join:

select * from test1_overlap inner join test2_overlap using (id);

Partition-wise join will work fine, but what happens (I followed the code
under gdb) is that inside the function partition_range_bounds_merge we start
from these two partitions:

test11_overlap (200, 210) and test21_overlap (0, 10)

In the comparison loop we figure out that there is no overlap and go to the
ub_cmpval > 0 branch, because test11_overlap is higher than test21_overlap.
Inside this branch we think that we need to handle a missing-partition case
(apparently by mistake, but it works - if it's not actually a missing
partition, there are no records to join with from a default one anyway).
Since in this case there isn't any default partition, the result is
merged = true. After that, partition_range_get_next_lb_index moves us to the
next partition pair:

test11_overlap (200, 210) and test22_overlap (20, 30)

And so on and so forth, until we reach the partition test23_overlap (200, 210),
which is actually what we're looking for. So, basically, as I said above, we
iterate over all the partitions, and we could skip some of them using binary
search.

> >
> > * I'm not sure why partition-wise join is not applied in this case? Maybe I'm
> >   wrong, but I was under the impression that it should work here:
> >
> > =# \d+ test1;
> >                             Table "public.test1"
> >  Column |  Type   | Collation | Nullable | Default | Storage  |
> > --------+---------+-----------+----------+---------+----------+
> >  data   | jsonb   |           |          |         | extended |
> >  id     | integer |           |          |         | plain    |
> > Partition key: RANGE (id)
> > Indexes:
> >     "test1_idx" btree (id)
> > Partitions: test11 FOR VALUES FROM (0) TO (100),
> >             test12 FOR VALUES FROM (100) TO (200),
> >             test13 FOR VALUES FROM (200) TO (300)
> >
> > =# \d+ test2;
> >                             Table "public.test2"
> >  Column |  Type   | Collation | Nullable | Default | Storage  |
> > --------+---------+-----------+----------+---------+----------+
> >  data   | jsonb   |           |          |         | extended |
> >  id     | integer |           |          |         | plain    |
> > Partition key: RANGE (id)
> > Indexes:
> >     "test2_idx" btree (id)
> > Partitions: test21 FOR VALUES FROM (10) TO (110),
> >             test22 FOR VALUES FROM (110) TO (210),
> >             test23 FOR VALUES FROM (210) TO (310)
> >
> > I'm confused, since there is only one-to-one mapping of partitions.
>
> In this case, test21 overlaps test11 (10-100) and test12 (100-110),
> test22 overlaps test12 (110-200) and test13 (200-210). So, there is no
> one-to-one mapping.

Yep, thanks for the explanation.
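As a postscript, here is a minimal standalone sketch of the "skip partitions
using binary search" idea from the walkthrough above, again with plain int
bounds and made-up names (Range, first_candidate) rather than the actual
partition_range_bounds_merge code. Probing test1's bounds with test21 (10,
110) also makes the point about the mapping concrete: it reports two
overlapping partitions, test11 and test12.

#include <stdio.h>

typedef struct Range { int lb; int ub; } Range;    /* half-open [lb, ub) */

/*
 * In an array of non-overlapping ranges sorted by lower bound, find the
 * index of the first range whose upper bound is greater than probe->lb;
 * earlier ranges end before the probe starts and can be skipped entirely.
 */
static int
first_candidate(const Range *parts, int n, const Range *probe)
{
    int     lo = 0,
            hi = n;

    while (lo < hi)
    {
        int     mid = (lo + hi) / 2;

        if (parts[mid].ub <= probe->lb)
            lo = mid + 1;       /* ends before the probe starts: skip */
        else
            hi = mid;
    }
    return lo;
}

int
main(void)
{
    /* Partition bounds of test1: test11, test12, test13. */
    Range   test1[] = {{0, 100}, {100, 200}, {200, 300}};
    Range   test21 = {10, 110}; /* first partition of test2 */
    int     i = first_candidate(test1, 3, &test21);

    /* Scan forward while the probe still overlaps the next partition. */
    for (; i < 3 && test1[i].lb < test21.ub; i++)
        printf("test21 overlaps test1 partition %d (%d, %d)\n",
               i, test1[i].lb, test1[i].ub);

    return 0;
}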