Re: Declarative partitioning - Mailing list pgsql-hackers
From | Ashutosh Bapat |
---|---|
Subject | Re: Declarative partitioning |
Date | |
Msg-id | CAFjFpRdvAPAoH+wUF+f4jfui3pKGM6i78DHBT3GTfoUO4Aj=+g@mail.gmail.com Whole thread Raw |
In response to | Re: Declarative partitioning (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>) |
Responses |
Re: Declarative partitioning
|
List | pgsql-hackers |
<div dir="ltr">What strikes me odd about these patches is RelOptInfo has remained unmodified. For a base partitioned table,I would expect it to be marked as partitioned may be indicating the partitioning scheme. Instead of that, I see thatthe code directly deals with PartitionDesc, PartitionKey which are part of Relation. It uses other structures like PartitionKeyExecInfo.I don't have any specific observation as to why we need such information in RelOptInfo, but lack ofit makes me uncomfortable. It could be because inheritance code does not require any mark in RelOptInfo and your patchre-uses inheritance code. I am not sure.<br /></div><div class="gmail_extra"><br /><div class="gmail_quote">On Tue,Aug 9, 2016 at 9:17 AM, Amit Langote <span dir="ltr"><<a href="mailto:Langote_Amit_f8@lab.ntt.co.jp" target="_blank">Langote_Amit_f8@lab.ntt.co.jp</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 00 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 2016/08/09 6:02, Robert Haas wrote:<br /> > OnMon, Aug 8, 2016 at 1:40 AM, Amit Langote<br /> > <<a href="mailto:Langote_Amit_f8@lab.ntt.co.jp">Langote_Amit_f8@lab.ntt.co.jp</a><wbr/>> wrote:<br /> >>> +1, ifwe could do it. It will need a change in the way Amit's patch stores<br /> >>> partitioning scheme in PartitionDesc.<br/> >><br /> >> Okay, I will try to implement this in the next version of the patch.<br /> >><br/> >> One thing that comes to mind is what if a user wants to apply hash<br /> >> operator class equalityto list partitioned key by specifying a hash<br /> >> operator class for the corresponding column. In thatcase, we would not<br /> >> have the ordering procedure with an hash operator class, hence any<br /> >> orderingbased optimization becomes impossible to implement. The current<br /> >> patch rejects a column for partitionkey if its type does not have a btree<br /> >> operator class for both range and list methods, so this issuedoesn't<br /> >> exist, however it could be seen as a limitation.<br /> ><br /> > Yes, I think you shouldexpect careful scrutiny of that issue. It<br /> > seems clear to me that range partitioning requires a btree opclass,<br/> > that hash partitioning requires a hash opclass, and that list<br /> > partitioning requires at leastone of those things. It would probably<br /> > be reasonable to pick one or the other and insist that list<br />> partitioning always requires exactly that, but I can't see how it's<br /> > reasonable to insist that you musthave both types of opclass for any<br /> > type of partitioning.<br /><br /></span>So because we intend to implementoptimizations for list partition<br /> metadata that presuppose existence of corresponding btree operator class,<br/> we should just always require user to specify one (or error out if user<br /> doesn't specify and a default onedoesn't exist). That way, we explicitly<br /> do not support specify hash equality operator for list partitioning.<br/><span class=""><br /> >> So, we have 3 choices for the internal representation of list partitions:<br/> >><br /> >> Choice 1 (the current approach): Load them in the same order as they are<br />>> found in the partition catalog:<br /> >><br /> >> Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a','e'}<br /> >> Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}<br /> >><br /> >> In this case,mismatch on the first list would make the two tables<br /> >> incompatibly partitioned, whereas they really aren'tincompatible.<br /> ><br /> > Such a limitation seems clearly unacceptable. We absolutely must be<br /> >able to match up compatible partitioning schemes without getting<br /> > confused by little details like the orderof the partitions.<br /><br /></span>Agreed. Will change my patch to adopt the below method.<br /><span class=""><br/> >> Choice 2: Representation with 2 arrays:<br /> >><br /> >> Table 1: ['a', 'b', 'c', 'd','e', 'f'], [3, 1, 2, 2, 3, 1]<br /> >> Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3]<br /> >><br/> >> It still doesn't help the case of pairwise joins because it's hard to tell<br /> >> which valuebelongs to which partition (the 2nd array carries the original<br /> >> partition numbers). Although it mightstill work for tuple-routing.<br /> ><br /> > It's very good for tuple routing. It can also be used to matchup<br /> > partitions for pairwise joins. Compare the first arrays. If they are<br /> > unequal, stop. Else,compare the second arrays, incrementally<br /> > building a mapping between them and returning false if the mapping<br/> > turns out to be non-bijective. For example, in this case, we look at<br /> > index 0 and decide that3 -> 2. We look at index 1 and decide 1 -> 3.<br /> > We look at index 2 and decide 2 -> 1. We look atindex 4 and find<br /> > that we already have a mapping for 2, but it's compatible because we<br /> > need 2 ->1 and that's what is already there. Similarly for the<br /> > remaining entries. This is really a pretty easy loopto write and it<br /> > should run very quickly.<br /><br /></span>I see, it does make sense to try to implement thisway.<br /><span class=""><br /> >> Choice 3: Order all lists' elements for each list individually and then<br />>> order the lists themselves on their first values:<br /> >><br /> >> Table 1: p3 {'a', 'e'}, p2 {'b','f'}, p1 {'c', 'd'}<br /> >> Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}<br /> >><br /> >>This representation makes pairing partitions for pairwise joining<br /> >> convenient but for tuple-routingwe still need to visit each partition in<br /> >> the worst case.<br /> ><br /> > I think this isclearly not good enough for tuple routing. If the<br /> > algorithm I proposed above turns out to be too slow for matching<br/> > partitions, then we could keep both this representation and the<br /> > previous one. We are not limitedto just one. But I don't think<br /> > that's likely to be the case.<br /><br /></span>I agree. Let's see howthe option 2 turns out.<br /><span class=""><br /> > Also, note that all of this presupposes we're doing range<br />> partitioning, or perhaps list partitioning with a btree opclass. For<br /> > partitioning based on a hash opclass,you'd organize the data based on<br /> > the hash values rather than range comparisons.<br /><br /></span>Yes,the current patch does not implement hash partitioning, although I<br /> have to think about how to supportthe hash case when designing the<br /> internal data structures.<br /><br /><br /> By the way, I am planning to starta new thread with the latest set of<br /> patches which I will post in a day or two. I have tried to implement all<br/> the bug fixes and improvements that have been suggested on this thread so<br /> far. Thanks to all those who reviewedand gave their comments. Please<br /> check this page to get a link to the new thread:<br /><a href="https://commitfest.postgresql.org/10/611/"rel="noreferrer" target="_blank">https://commitfest.postgresql.<wbr />org/10/611/</a><br/><br /> Thanks,<br /> Amit<br /><br /><br /></blockquote></div><br /><br clear="all" /><br />-- <br/><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">Best Wishes,<br />Ashutosh Bapat<br />EnterpriseDBCorporation<br />The Postgres Database Company<br /></div></div></div>
pgsql-hackers by date: