From e788b293f3770c7d89bc2156658f4bde3aba1303 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com"
Date: Fri, 9 Nov 2018 10:20:14 +1300
Subject: [PATCH v16 2/2] Delay locking of partitions during INSERT and UPDATE

During INSERT, even if we were inserting a single row into a partitioned
table, we would obtain a lock on every partition which was a direct or an
indirect partition of the insert target table.  This was done in order to
provide a consistent order to the locking of the partitions, which happens
to be the same order that partitions are locked during planning.

The problem with locking all these partitions was that if a partitioned
table had many partitions and the INSERT inserted one, or just a few rows,
the overhead of the locking was significantly more than that of inserting
the actual rows.  This commit changes the locking so that we only lock a
partition the first time we route a tuple to it, so if you insert one row,
only one leaf partition is locked, plus any sub-partitioned tables that we
search through before we find the correct home for the tuple.

This does mean that the locking order of partitions during INSERT becomes
less well defined.  Previously, operations such as CREATE INDEX and
TRUNCATE, when performed on leaf partitions, could defend against
deadlocking with concurrent INSERT by performing the operation in table
oid order.  However, to deadlock, such DDL would have had to be performed
inside a transaction and not in table oid order.  With this commit it's
now possible to get deadlocks even if the DDL is performed in table oid
order.  If required, such transactions can defend against these deadlocks
by performing a LOCK TABLE on the partitioned table before performing the
DDL.

Currently, only INSERTs are affected by this change, as UPDATEs to a
partitioned table still obtain locks on all partitions either during
planning or during AcquireExecutorLocks; however, there are upcoming
patches which may change this too.
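As a sketch of the workaround mentioned above (table names are
hypothetical, and the chosen lock mode is an assumption -- any mode that
conflicts with INSERT's RowExclusiveLock would do), a transaction doing
DDL on several leaf partitions could first lock the partitioned root:

```sql
BEGIN;
-- Lock the partitioned root in a mode that conflicts with the
-- RowExclusiveLock taken by INSERT, so no concurrent INSERT can begin
-- routing tuples (and taking per-partition locks) while we hold it.
LOCK TABLE parted_tab IN SHARE ROW EXCLUSIVE MODE;
TRUNCATE parted_tab_p1;
TRUNCATE parted_tab_p2;
COMMIT;
```

Because every INSERT must lock the root before any leaf, blocking at the
root serializes the two lock acquisition orders and removes the deadlock
window that delayed partition locking otherwise opens.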
---
 src/backend/executor/execPartition.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 962db6d7f0..f37371f561 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -167,9 +167,6 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * tuple routing for partitioned tables, encapsulates it in
  * PartitionTupleRouting, and returns it.
  *
- * Note that all the relations in the partition tree are locked using the
- * RowExclusiveLock mode upon return from this function.
- *
  * Callers must use the returned PartitionTupleRouting during calls to
  * ExecFindPartition().  The actual ResultRelInfo for a partition is only
  * allocated when the first tuple is routed there.
@@ -180,9 +177,6 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	PartitionTupleRouting *proute;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/* Lock all the partitions. */
-	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-
 	/*
 	 * Here we attempt to expend as little effort as possible in setting up
 	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built on
@@ -535,11 +529,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 	bool		found_whole_row;
 	int			part_result_rel_index;
 
-	/*
-	 * We locked all the partitions in ExecSetupPartitionTupleRouting
-	 * including the leaf partitions.
-	 */
-	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], RowExclusiveLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -987,7 +977,7 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
 	int			dispatchidx;
 
 	if (partoid != RelationGetRelid(proute->partition_root))
-		rel = heap_open(partoid, NoLock);
+		rel = heap_open(partoid, RowExclusiveLock);
 	else
 		rel = proute->partition_root;
 	partdesc = RelationGetPartitionDesc(rel);
-- 
2.16.2.windows.1