Home > mailing lists
Re: [HACKERS] [POC] hash partitioning - Mailing list pgsql-hackers

From	Robert Haas
Subject	Re: [HACKERS] [POC] hash partitioning
Date	May 10, 2017 21:09:04
Msg-id	CA+TgmoZd1RSk+j1NsOoza_EqdjzvyqeozcxGpBFj948aONkixQ@mail.gmail.com Whole thread Raw
In response to	Re: [HACKERS] [POC] hash partitioning (amul sul <sulamul@gmail.com>)
Responses	Re: [HACKERS] [POC] hash partitioning Re: [HACKERS] [POC] hash partitioning
List	pgsql-hackers
Tree view
On Wed, May 3, 2017 at 9:09 AM, amul sul <sulamul@gmail.com> wrote:
> Fixed in the attached version.

+[ PARTITION BY { HASH | RANGE | LIST } ( { <replaceable
class="parameter">column_name</replaceable> | ( <replaceable
class="parameter">expression</replaceable> ) } [ COLLATE <replaceable

In the department of severe nitpicking, I would have expected this to
either use alphabetical order (HASH | LIST | RANGE) or to add the new
method at the end on the theory that we probably did the important
ones first (RANGE | LIST | HASH).

+  WITH ( MODULUS <replaceable class="PARAMETER">value</replaceable>,
REMAINDER <replaceable class="PARAMETER">value</replaceable> ) }

Maybe value -> modulus and value -> remainder?
     <para>
+      When creating a hash partition, <literal>MODULUS</literal> should be
+      greater than zero and <literal>REMAINDER</literal> should be greater than
+      or equal to zero.  Every <literal>MODULUS</literal> must be a factor of
+      the next larger modulus.
[ ... and it goes on from there ... ]

This paragraph is fairly terrible, because it's a design spec that I
wrote, not an explanation intended for users.  Here's an attempt to
improve it:

===
When creating a hash partition, a modulus and remainder must be
specified.  The modulus must be a positive integer, and the remainder
must a non-negative integer less than the modulus.  Typically, when
initially setting up a hash-partitioned table, you should choose a
modulus equal to the number of partitions and assign every table the
same modulus and a different remainder (see examples, below).
However, it is not required that every partition have the same
modulus, only that every modulus which occurs among the children of a
hash-partitioned table is a factor of the next larger modulus.  This
allows the number of partitions to be increased incrementally without
needing to move all the data at once.  For example, suppose you have a
hash-partitioned table with 8 children, each of which has modulus 8,
but find it necessary to increase the number of partitions to 16.  You
can detach one of the modulus-8 partitions, create two new modulus-16
partitions covering the same portion of the key space (one with a
remainder equal to the remainder of the detached partition, and the
other with a remainder equal to that value plus 8), and repopulate
them with data.  You can then repeat this -- perhaps at a later time
-- for each modulus-8 partition until none remain.  While this may
still involve a large amount of data movement at each step, it is
still better than having to create a whole new table and move all the
data at once.
===

+CREATE TABLE postal_code (
+    code         int not null,
+    city_id      bigint not null,
+    address      text
+) PARTITION BY HASH (code);

It would be fairly silly to hash-partition the postal_code table,
because there aren't enough postal codes to justify it.  Maybe make
this a lineitem or order table, and partition on the order number.
Also, extend the example to show creating 4 partitions with modulus 4.

+                if (spec->strategy != PARTITION_STRATEGY_HASH)
+                    elog(ERROR, "invalid strategy in partition bound spec");

I think this should be an ereport() if it can happen or an Assert() if
it's supposed to be prevented by the grammar.

+            if (!(datumIsEqual(b1->datums[i][0], b2->datums[i][0],
+                               true, sizeof(int)) &&

It doesn't seem necessary to use datumIsEqual() here.  You know the
datums are pass-by-value, so why not just use == ?  I'd include a
comment but I don't think using datumIsEqual() adds anything here
except unnecessary complexity.  More broadly, I wonder why we're
cramming this into the datums arrays instead of just adding another
field to PartitionBoundInfoData that is only used by hash
partitioning.
                   /*
+                     * Check rule that every modulus must be a factor of the
+                     * next larger modulus.  For example, if you have a bunch
+                     * of partitions that all have modulus 5, you can add a new
+                     * new partition with modulus 10 or a new partition with
+                     * modulus 15, but you cannot add both a partition with
+                     * modulus 10 and a partition with modulus 15, because 10
+                     * is not a factor of 15.  However, you could
simultaneously
+                     * use modulus 4, modulus 8, modulus 16, and modulus 32 if
+                     * you wished, because each modulus is a factor of the next
+                     * larger one.  You could also use modulus 10, modulus 20,
+                     * and modulus 60. But you could not use modulus 10,
+                     * modulus 15, and modulus 60 for the same reason.
+                     */

I think just the first sentence is fine here; I'd nuke the rest of this.

The block that follows could be merged into the surrounding block.
There's no need to increase the indentation level here, so let's not.
I also suspect that the code itself is wrong.  There are two ways a
modulus can be invalid: it can either fail to be a multiple of the
next lower-modulus, or it can fail to be a factor of the next-higher
modulus.  I think your code only checks the latter.  So for example,
if the current modulus list is (4, 36), your code would correctly
disallow 3 because it's not a factor of 4 and would correctly disallow
23 because it's not a factor of 36, but it looks to me like it would
allow 9 because that's a factor of 36. However, then the list would be
(4, 9, 36), and 4 is not a factor of 9.

+                    greatest_modulus = DatumGetInt32(datums[ndatums - 1][0]);

Here, insert: /* Normally, the lowest remainder that could conflict
with the new partition is equal to the remainder specified for the new
partition, but when the new partition has a modulus higher than any
used so far, we need to adjust. */

+                    place = spec->remainder;
+                    if (place >= greatest_modulus)
+                        place = place % greatest_modulus;

Here, insert: /* Check every potentially-conflicting remainder. */

+                    do
+                    {
+                        if (boundinfo->indexes[place] != -1)
+                        {
+                            overlap = true;
+                            with = boundinfo->indexes[place];
+                            break;
+                        }
+                        place = place + spec->modulus;

Maybe use += ?

+                    } while (place < greatest_modulus);

+ * Used when sorting hash bounds across all hash modulus
+ * for hash partitioning

This is not a very descriptive comment.  Maybe /* We sort hash bounds
by modulus, then by remainder. */

+cal_hash_value(FmgrInfo *partsupfunc, int nkeys, Datum *values, bool *isnull)

I agree with Ashutosh's critique of this name.

+    /*
+     * Cache hash function information, similar to how record_eq() caches
+     * equality operator information.  (Perhaps no SQL syntax could cause
+     * PG_NARGS()/nkeys to change between calls through the same FmgrInfo.
+     * Checking nkeys here is just defensiveness.)
+     */

Unless I'm missing something, this comment does not actually describe
what the code does.  Each call to the function repeats the same
TypeCacheEntry lookups.  I'm not actually sure whether caching here
can actually help - is there any situation in which the same FmgrInfo
will get used repeatedly here?  But if it is possible then this code
fails to achieve its intended objective.

Another problem with this code is that, unless I'm missing something,
it completely ignores the opclass the user specified and just looks up
the default hash opclass.  I think you should create a non-default
hash opclass for some data type -- maybe create one for int4 that just
returns the input value unchanged -- and test that the specifying
default hash opclass routes tuples according to hash_uint32(val) %
modulus while specifying your customer opclass routes tuples according
to val % modulus.

Unless I'm severely misunderstanding the situation this code is
seriously undertested.

+             * Identify a btree opclass to use. Currently, we use only btree
+             * operators, which seems enough for list and range partitioning.

This comment is false, right?

+                        appendStringInfoString(buf, "FOR VALUES");
+                        appendStringInfo(buf, " WITH (modulus %d,
remainder %d)",
+                                         spec->modulus, spec->remainder);

You could combine these.

+ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH
(modulus 0, remainder 1);
+ERROR:  invalid bound specification for a hash partition
+HINT:  modulus must be greater than zero
+ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH
(modulus 8, remainder 8);
+ERROR:  invalid bound specification for a hash partition
+HINT:  modulus must be greater than remainder
+ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH
(modulus 3, remainder 2);
+ERROR:  invalid bound specification for a hash partition
+HINT:  every modulus must be factor of next largest modulus

It seems like you could merge the hint back into the error:

ERROR: hash partition modulus must be greater than 0
ERROR: hash partition remainder must be less than modulus
ERROR: every hash partition modulus must be a factor of the next larger modulus

+DETAIL:  Partition key of the failing row contains (HASHa, b) = (c, 5).

That's obviously garbled somehow.

+hash_partbound_elem:
+        NonReservedWord Iconst
+            {
+                $$ = makeDefElem($1, (Node *)makeInteger($2), @1);
+            }
+        ;
+
+hash_partbound:
+        hash_partbound_elem ',' hash_partbound_elem
+            {
+                $$ = list_make2($1, $3);
+            }
+        ;

I don't think that it's the grammar's job to enforce that exactly two
options are present.  It should allow any number of options, and some
later code, probably during parse analysis, should check that the ones
you need are present and that there are no invalid ones.  See the code
for EXPLAIN, VACUUM, etc.

Regarding the test cases, I think that you've got a lot of tests for
failure scenarios (which is good) but not enough for success
scenarios.  For example, you test that inserting a row into the wrong
hash partition fails, but not (unless I missed it) that tuple routing
succeeds.  I think it would be good to have a test where you insert
1000 or so rows into a hash partitioned table just to see it all work.

Also, you haven't done anything about the fact that constraint
exclusion doesn't work for hash partitioned tables, a point I raised
in http://postgr.es/m/CA+Tgmob7RsN5A=ehgYbLPx--c5CmptrK-dB=Y-v--o+TKyfteA@mail.gmail.com
and which I still think is quite important.  I think that to have a
committable patch for this feature that would have to be addressed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
pgsql-hackers by date:
From: Tom Lane
Date: 10 May 2017, 20:28:39
Subject: Re: [HACKERS] [BUGS] Concurrent ALTER SEQUENCE RESTART Regression
From: Robert Haas
Date: 10 May 2017, 21:10:16
Subject: Re: [HACKERS] Should pg_current_wal_location() become pg_current_wal_lsn()
Re: [HACKERS] [POC] hash partitioning - Mailing list pgsql-hackers

Previous

Next