Re: [HACKERS] [POC] hash partitioning - Mailing list pgsql-hackers
From | amul sul |
---|---|
Subject | Re: [HACKERS] [POC] hash partitioning |
Date | |
Msg-id | CAAJ_b97mTb=dG2pv6+1ougxEVZFVnZJajW+0QHj46mEE7WsoOQ@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] [POC] hash partitioning (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: [HACKERS] [POC] hash partitioning
Re: [HACKERS] [POC] hash partitioning |
List | pgsql-hackers |
Hi, Please find the following updated patches attached: 0001-Cleanup.patch : Does some cleanup and code refactoring required for hash partition patch. Otherwise, there will be unnecessary diff in 0002 patch 0002-hash-partitioning_another_design-v3.patch: Addressed review comments given by Ashutosh and Robert. On Wed, May 10, 2017 at 11:39 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, May 3, 2017 at 9:09 AM, amul sul <sulamul@gmail.com> wrote: >> Fixed in the attached version. > > +[ PARTITION BY { HASH | RANGE | LIST } ( { <replaceable > class="parameter">column_name</replaceable> | ( <replaceable > class="parameter">expression</replaceable> ) } [ COLLATE <replaceable > > In the department of severe nitpicking, I would have expected this to > either use alphabetical order (HASH | LIST | RANGE) or to add the new > method at the end on the theory that we probably did the important > ones first (RANGE | LIST | HASH). > Fixed in the attached version. > + WITH ( MODULUS <replaceable class="PARAMETER">value</replaceable>, > REMAINDER <replaceable class="PARAMETER">value</replaceable> ) } > > Maybe value -> modulus and value -> remainder? > Fixed in the attached version. > <para> > + When creating a hash partition, <literal>MODULUS</literal> should be > + greater than zero and <literal>REMAINDER</literal> should be greater than > + or equal to zero. Every <literal>MODULUS</literal> must be a factor of > + the next larger modulus. > [ ... and it goes on from there ... ] > > This paragraph is fairly terrible, because it's a design spec that I > wrote, not an explanation intended for users. Here's an attempt to > improve it: > > === > When creating a hash partition, a modulus and remainder must be > specified. The modulus must be a positive integer, and the remainder > must a non-negative integer less than the modulus. Typically, when > initially setting up a hash-partitioned table, you should choose a > modulus equal to the number of partitions and assign every table the > same modulus and a different remainder (see examples, below). > However, it is not required that every partition have the same > modulus, only that every modulus which occurs among the children of a > hash-partitioned table is a factor of the next larger modulus. This > allows the number of partitions to be increased incrementally without > needing to move all the data at once. For example, suppose you have a > hash-partitioned table with 8 children, each of which has modulus 8, > but find it necessary to increase the number of partitions to 16. You > can detach one of the modulus-8 partitions, create two new modulus-16 > partitions covering the same portion of the key space (one with a > remainder equal to the remainder of the detached partition, and the > other with a remainder equal to that value plus 8), and repopulate > them with data. You can then repeat this -- perhaps at a later time > -- for each modulus-8 partition until none remain. While this may > still involve a large amount of data movement at each step, it is > still better than having to create a whole new table and move all the > data at once. > === > Thanks a lot, added in attached version. > +CREATE TABLE postal_code ( > + code int not null, > + city_id bigint not null, > + address text > +) PARTITION BY HASH (code); > > It would be fairly silly to hash-partition the postal_code table, > because there aren't enough postal codes to justify it. Maybe make > this a lineitem or order table, and partition on the order number. > Also, extend the example to show creating 4 partitions with modulus 4. > Understood, added order table example. > + if (spec->strategy != PARTITION_STRATEGY_HASH) > + elog(ERROR, "invalid strategy in partition bound spec"); > > I think this should be an ereport() if it can happen or an Assert() if > it's supposed to be prevented by the grammar. > Used Assert() in the attach version patch, also changed same for RANGE and LIST in 0001- cleanup patch. > + if (!(datumIsEqual(b1->datums[i][0], b2->datums[i][0], > + true, sizeof(int)) && > > It doesn't seem necessary to use datumIsEqual() here. You know the > datums are pass-by-value, so why not just use == ? I'd include a > comment but I don't think using datumIsEqual() adds anything here > except unnecessary complexity. More broadly, I wonder why we're > cramming this into the datums arrays instead of just adding another > field to PartitionBoundInfoData that is only used by hash > partitioning. > Fixed in the attached version. > /* > + * Check rule that every modulus must be a factor of the > + * next larger modulus. For example, if you have a bunch > + * of partitions that all have modulus 5, you can add a new > + * new partition with modulus 10 or a new partition with > + * modulus 15, but you cannot add both a partition with > + * modulus 10 and a partition with modulus 15, because 10 > + * is not a factor of 15. However, you could > simultaneously > + * use modulus 4, modulus 8, modulus 16, and modulus 32 if > + * you wished, because each modulus is a factor of the next > + * larger one. You could also use modulus 10, modulus 20, > + * and modulus 60. But you could not use modulus 10, > + * modulus 15, and modulus 60 for the same reason. > + */ > > I think just the first sentence is fine here; I'd nuke the rest of this. > Fixed in the attached version. > The block that follows could be merged into the surrounding block. > There's no need to increase the indentation level here, so let's not. > I also suspect that the code itself is wrong. There are two ways a > modulus can be invalid: it can either fail to be a multiple of the > next lower-modulus, or it can fail to be a factor of the next-higher > modulus. I think your code only checks the latter. So for example, > if the current modulus list is (4, 36), your code would correctly > disallow 3 because it's not a factor of 4 and would correctly disallow > 23 because it's not a factor of 36, but it looks to me like it would > allow 9 because that's a factor of 36. However, then the list would be > (4, 9, 36), and 4 is not a factor of 9. > This case is already handled in previous patch and similar regression test does exists in create_table.sql, see this in v2 patch. +-- check partition bound syntax for the hash partition +CREATE TABLE hash_parted ( + a int +) PARTITION BY HASH (a); +CREATE TABLE hpart_1 PARTITION OF hash_parted FOR VALUES WITH (modulus 10, remainder 1); +CREATE TABLE hpart_2 PARTITION OF hash_parted FOR VALUES WITH (modulus 50, remainder 0); +-- modulus 25 is factor of modulus of 50 but 10 is not factor of 25. +CREATE TABLE fail_part PARTITION OF hash_parted FOR VALUES WITH (modulus 25, remainder 2); > + greatest_modulus = DatumGetInt32(datums[ndatums - 1][0]); > > Here, insert: /* Normally, the lowest remainder that could conflict > with the new partition is equal to the remainder specified for the new > partition, but when the new partition has a modulus higher than any > used so far, we need to adjust. */ > > + place = spec->remainder; > + if (place >= greatest_modulus) > + place = place % greatest_modulus; > Fixed in the attached version. > Here, insert: /* Check every potentially-conflicting remainder. */ > > + do > + { > + if (boundinfo->indexes[place] != -1) > + { > + overlap = true; > + with = boundinfo->indexes[place]; > + break; > + } > + place = place + spec->modulus; > > Maybe use += ? > Fixed. > + } while (place < greatest_modulus); > > + * Used when sorting hash bounds across all hash modulus > + * for hash partitioning > > This is not a very descriptive comment. Maybe /* We sort hash bounds > by modulus, then by remainder. */ > Fixed. > +cal_hash_value(FmgrInfo *partsupfunc, int nkeys, Datum *values, bool *isnull) > > I agree with Ashutosh's critique of this name. > Fixed. > + /* > + * Cache hash function information, similar to how record_eq() caches > + * equality operator information. (Perhaps no SQL syntax could cause > + * PG_NARGS()/nkeys to change between calls through the same FmgrInfo. > + * Checking nkeys here is just defensiveness.) > + */ > > Unless I'm missing something, this comment does not actually describe > what the code does. Each call to the function repeats the same > TypeCacheEntry lookups. I'm not actually sure whether caching here > can actually help - is there any situation in which the same FmgrInfo > will get used repeatedly here? But if it is possible then this code > fails to achieve its intended objective. > This code is no longer exists in new satisfies_hash_partition() code. > Another problem with this code is that, unless I'm missing something, > it completely ignores the opclass the user specified and just looks up > the default hash opclass. I think you should create a non-default > hash opclass for some data type -- maybe create one for int4 that just > returns the input value unchanged -- and test that the specifying > default hash opclass routes tuples according to hash_uint32(val) % > modulus while specifying your customer opclass routes tuples according > to val % modulus. > > Unless I'm severely misunderstanding the situation this code is > seriously undertested. > You are correct, I've missed to opclass handling. Fixed in the attached version, and added same case regression test. > + * Identify a btree opclass to use. Currently, we use only btree > + * operators, which seems enough for list and range partitioning. > > This comment is false, right? > Not really, this has been re-added due to indentation change. > + appendStringInfoString(buf, "FOR VALUES"); > + appendStringInfo(buf, " WITH (modulus %d, > remainder %d)", > + spec->modulus, spec->remainder); > > You could combine these. > I am not sure about this, I've used same code style exist in get_rule_expr() for range and list. Do you want me to change this for other partitioning as well? > +ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH > (modulus 0, remainder 1); > +ERROR: invalid bound specification for a hash partition > +HINT: modulus must be greater than zero > +ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH > (modulus 8, remainder 8); > +ERROR: invalid bound specification for a hash partition > +HINT: modulus must be greater than remainder > +ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH > (modulus 3, remainder 2); > +ERROR: invalid bound specification for a hash partition > +HINT: every modulus must be factor of next largest modulus > > It seems like you could merge the hint back into the error: > > ERROR: hash partition modulus must be greater than 0 > ERROR: hash partition remainder must be less than modulus > ERROR: every hash partition modulus must be a factor of the next larger modulus > Added same in the attached version. Thanks again. > +DETAIL: Partition key of the failing row contains (HASHa, b) = (c, 5). > > That's obviously garbled somehow. > Oops. Fixed in the attached version. > +hash_partbound_elem: > + NonReservedWord Iconst > + { > + $$ = makeDefElem($1, (Node *)makeInteger($2), @1); > + } > + ; > + > +hash_partbound: > + hash_partbound_elem ',' hash_partbound_elem > + { > + $$ = list_make2($1, $3); > + } > + ; > > I don't think that it's the grammar's job to enforce that exactly two > options are present. It should allow any number of options, and some > later code, probably during parse analysis, should check that the ones > you need are present and that there are no invalid ones. See the code > for EXPLAIN, VACUUM, etc. > Tried to fixed in the attached version. > Regarding the test cases, I think that you've got a lot of tests for > failure scenarios (which is good) but not enough for success > scenarios. For example, you test that inserting a row into the wrong > hash partition fails, but not (unless I missed it) that tuple routing > succeeds. I think it would be good to have a test where you insert > 1000 or so rows into a hash partitioned table just to see it all work. > I am quite unsure about this test, now sure how can we verify correct tuple routing? > Also, you haven't done anything about the fact that constraint > exclusion doesn't work for hash partitioned tables, a point I raised > in http://postgr.es/m/CA+Tgmob7RsN5A=ehgYbLPx--c5CmptrK-dB=Y-v--o+TKyfteA@mail.gmail.com > and which I still think is quite important. I think that to have a > committable patch for this feature that would have to be addressed. > Do you mean, we should come up with special handling(pre-pruning) for hash partitioning or modify constraints exclusion so that it will handle hash partition expression and cases that you have discussed in thread[1] as well? I was under the impression that we might going to have this as a separate feature proposal. 1]. https://www.postgresql.org/message-id/CA%2BTgmoaE9NZ_RiqZQLp2aJXPO4E78QxkQYL-FR2zCDop96Ahdg%40mail.gmail.com Regards, Amul Sul -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
pgsql-hackers by date: