Re: [HACKERS] [POC] hash partitioning - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: [HACKERS] [POC] hash partitioning |
Date | |
Msg-id | CA+TgmoZd1RSk+j1NsOoza_EqdjzvyqeozcxGpBFj948aONkixQ@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] [POC] hash partitioning (amul sul <sulamul@gmail.com>) |
Responses |
Re: [HACKERS] [POC] hash partitioning
Re: [HACKERS] [POC] hash partitioning |
List | pgsql-hackers |
On Wed, May 3, 2017 at 9:09 AM, amul sul <sulamul@gmail.com> wrote: > Fixed in the attached version. +[ PARTITION BY { HASH | RANGE | LIST } ( { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) } [ COLLATE <replaceable In the department of severe nitpicking, I would have expected this to either use alphabetical order (HASH | LIST | RANGE) or to add the new method at the end on the theory that we probably did the important ones first (RANGE | LIST | HASH). + WITH ( MODULUS <replaceable class="PARAMETER">value</replaceable>, REMAINDER <replaceable class="PARAMETER">value</replaceable> ) } Maybe value -> modulus and value -> remainder? <para> + When creating a hash partition, <literal>MODULUS</literal> should be + greater than zero and <literal>REMAINDER</literal> should be greater than + or equal to zero. Every <literal>MODULUS</literal> must be a factor of + the next larger modulus. [ ... and it goes on from there ... ] This paragraph is fairly terrible, because it's a design spec that I wrote, not an explanation intended for users. Here's an attempt to improve it: === When creating a hash partition, a modulus and remainder must be specified. The modulus must be a positive integer, and the remainder must a non-negative integer less than the modulus. Typically, when initially setting up a hash-partitioned table, you should choose a modulus equal to the number of partitions and assign every table the same modulus and a different remainder (see examples, below). However, it is not required that every partition have the same modulus, only that every modulus which occurs among the children of a hash-partitioned table is a factor of the next larger modulus. This allows the number of partitions to be increased incrementally without needing to move all the data at once. For example, suppose you have a hash-partitioned table with 8 children, each of which has modulus 8, but find it necessary to increase the number of partitions to 16. You can detach one of the modulus-8 partitions, create two new modulus-16 partitions covering the same portion of the key space (one with a remainder equal to the remainder of the detached partition, and the other with a remainder equal to that value plus 8), and repopulate them with data. You can then repeat this -- perhaps at a later time -- for each modulus-8 partition until none remain. While this may still involve a large amount of data movement at each step, it is still better than having to create a whole new table and move all the data at once. === +CREATE TABLE postal_code ( + code int not null, + city_id bigint not null, + address text +) PARTITION BY HASH (code); It would be fairly silly to hash-partition the postal_code table, because there aren't enough postal codes to justify it. Maybe make this a lineitem or order table, and partition on the order number. Also, extend the example to show creating 4 partitions with modulus 4. + if (spec->strategy != PARTITION_STRATEGY_HASH) + elog(ERROR, "invalid strategy in partition bound spec"); I think this should be an ereport() if it can happen or an Assert() if it's supposed to be prevented by the grammar. + if (!(datumIsEqual(b1->datums[i][0], b2->datums[i][0], + true, sizeof(int)) && It doesn't seem necessary to use datumIsEqual() here. You know the datums are pass-by-value, so why not just use == ? I'd include a comment but I don't think using datumIsEqual() adds anything here except unnecessary complexity. More broadly, I wonder why we're cramming this into the datums arrays instead of just adding another field to PartitionBoundInfoData that is only used by hash partitioning. /* + * Check rule that every modulus must be a factor of the + * next larger modulus. For example, if you have a bunch + * of partitions that all have modulus 5, you can add a new + * new partition with modulus 10 or a new partition with + * modulus 15, but you cannot add both a partition with + * modulus 10 and a partition with modulus 15, because 10 + * is not a factor of 15. However, you could simultaneously + * use modulus 4, modulus 8, modulus 16, and modulus 32 if + * you wished, because each modulus is a factor of the next + * larger one. You could also use modulus 10, modulus 20, + * and modulus 60. But you could not use modulus 10, + * modulus 15, and modulus 60 for the same reason. + */ I think just the first sentence is fine here; I'd nuke the rest of this. The block that follows could be merged into the surrounding block. There's no need to increase the indentation level here, so let's not. I also suspect that the code itself is wrong. There are two ways a modulus can be invalid: it can either fail to be a multiple of the next lower-modulus, or it can fail to be a factor of the next-higher modulus. I think your code only checks the latter. So for example, if the current modulus list is (4, 36), your code would correctly disallow 3 because it's not a factor of 4 and would correctly disallow 23 because it's not a factor of 36, but it looks to me like it would allow 9 because that's a factor of 36. However, then the list would be (4, 9, 36), and 4 is not a factor of 9. + greatest_modulus = DatumGetInt32(datums[ndatums - 1][0]); Here, insert: /* Normally, the lowest remainder that could conflict with the new partition is equal to the remainder specified for the new partition, but when the new partition has a modulus higher than any used so far, we need to adjust. */ + place = spec->remainder; + if (place >= greatest_modulus) + place = place % greatest_modulus; Here, insert: /* Check every potentially-conflicting remainder. */ + do + { + if (boundinfo->indexes[place] != -1) + { + overlap = true; + with = boundinfo->indexes[place]; + break; + } + place = place + spec->modulus; Maybe use += ? + } while (place < greatest_modulus); + * Used when sorting hash bounds across all hash modulus + * for hash partitioning This is not a very descriptive comment. Maybe /* We sort hash bounds by modulus, then by remainder. */ +cal_hash_value(FmgrInfo *partsupfunc, int nkeys, Datum *values, bool *isnull) I agree with Ashutosh's critique of this name. + /* + * Cache hash function information, similar to how record_eq() caches + * equality operator information. (Perhaps no SQL syntax could cause + * PG_NARGS()/nkeys to change between calls through the same FmgrInfo. + * Checking nkeys here is just defensiveness.) + */ Unless I'm missing something, this comment does not actually describe what the code does. Each call to the function repeats the same TypeCacheEntry lookups. I'm not actually sure whether caching here can actually help - is there any situation in which the same FmgrInfo will get used repeatedly here? But if it is possible then this code fails to achieve its intended objective. Another problem with this code is that, unless I'm missing something, it completely ignores the opclass the user specified and just looks up the default hash opclass. I think you should create a non-default hash opclass for some data type -- maybe create one for int4 that just returns the input value unchanged -- and test that the specifying default hash opclass routes tuples according to hash_uint32(val) % modulus while specifying your customer opclass routes tuples according to val % modulus. Unless I'm severely misunderstanding the situation this code is seriously undertested. + * Identify a btree opclass to use. Currently, we use only btree + * operators, which seems enough for list and range partitioning. This comment is false, right? + appendStringInfoString(buf, "FOR VALUES"); + appendStringInfo(buf, " WITH (modulus %d, remainder %d)", + spec->modulus, spec->remainder); You could combine these. +ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH (modulus 0, remainder 1); +ERROR: invalid bound specification for a hash partition +HINT: modulus must be greater than zero +ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH (modulus 8, remainder 8); +ERROR: invalid bound specification for a hash partition +HINT: modulus must be greater than remainder +ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH (modulus 3, remainder 2); +ERROR: invalid bound specification for a hash partition +HINT: every modulus must be factor of next largest modulus It seems like you could merge the hint back into the error: ERROR: hash partition modulus must be greater than 0 ERROR: hash partition remainder must be less than modulus ERROR: every hash partition modulus must be a factor of the next larger modulus +DETAIL: Partition key of the failing row contains (HASHa, b) = (c, 5). That's obviously garbled somehow. +hash_partbound_elem: + NonReservedWord Iconst + { + $$ = makeDefElem($1, (Node *)makeInteger($2), @1); + } + ; + +hash_partbound: + hash_partbound_elem ',' hash_partbound_elem + { + $$ = list_make2($1, $3); + } + ; I don't think that it's the grammar's job to enforce that exactly two options are present. It should allow any number of options, and some later code, probably during parse analysis, should check that the ones you need are present and that there are no invalid ones. See the code for EXPLAIN, VACUUM, etc. Regarding the test cases, I think that you've got a lot of tests for failure scenarios (which is good) but not enough for success scenarios. For example, you test that inserting a row into the wrong hash partition fails, but not (unless I missed it) that tuple routing succeeds. I think it would be good to have a test where you insert 1000 or so rows into a hash partitioned table just to see it all work. Also, you haven't done anything about the fact that constraint exclusion doesn't work for hash partitioned tables, a point I raised in http://postgr.es/m/CA+Tgmob7RsN5A=ehgYbLPx--c5CmptrK-dB=Y-v--o+TKyfteA@mail.gmail.com and which I still think is quite important. I think that to have a committable patch for this feature that would have to be addressed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: