Thread: Fundamental scheduling bug in parallel restore of partitioned tables

Fundamental scheduling bug in parallel restore of partitioned tables

From

Tom Lane

Date:

05 April, 00:12:23

Looking into the complaint at [1], I find that pg_restore really gets
it quite wrong when trying to do a parallel restore of partitioned
tables with foreign key constraints.  The historical way that we
got parallel restore to work correctly with FK constraints is:

1. The dump file shows dependencies of the FK constraint object on
the two tables named in the constraint.  (This matches what it
says in pg_depend.)

2. If it's a parallel restore, repoint_table_dependencies() replaces
the dependencies on the tables with dependencies on their TABLE DATA
objects.

3. Now restore of the FK constraint will not be started until all
the relevant data is loaded.

However, if we're talking about partitioned tables, there is no
TABLE DATA object for a partitioned table; only for its leaf
partitions.  So repoint_table_dependencies() does nothing for the
FK object's dependencies, meaning that it's a candidate to be
launched as soon as we begin the parallel restore phase.

This is disastrous for assorted reasons.  The ALTER ADD CONSTRAINT
command might fail outright if we've loaded data for the referencing
table but not the referenced table.  It could deadlock against other
parallel restore jobs, as reported in [1] (and which I find not
too terribly hard to reproduce here).  Even if it doesn't fail, if
it completes before we load data for the referencing table, we'll
have to do retail FK checks, greatly slowing that data load.

I think that the most intellectually rigorous solution is to
generate dummy TABLE DATA objects for partitioned tables, which
don't actually contain data but merely carry dependencies on
each of the child tables' TABLE DATA objects.  (In a multilevel
partitioned structure, this'd result in the top TABLE DATA having
indirect dependencies on all the leaf partition TABLE DATAs.)
Then repoint_table_dependencies() does the right thing with a
dependency on a partitioned table, and the dependency-driven
scheduler will take care of the rest.

There are two places we could make that happen.  The easiest
to code, likely, is to get pg_dump to create such objects and
include them in the dump.  We could possibly get pg_restore
to fake up such objects from the data it has available, but
I expect that that will be harder and slower than having
pg_dump do it.  So I'm leaning towards the first way.
The disadvantage is that existing dump files would still be
hazardous to restore in parallel.  But given that this has
been broken for some time and nobody's reported it till now,
I feel maybe that's okay.  (I don't think there would be
a backwards compatibility problem in introducing such new
objects into dumps, because AFAICS pg_restore would not need
any explicit knowledge of them.)

Thoughts?

            regards, tom lane

PS: attached is a text dump file for a trivial DB containing
two partitioned tables with an FK.  If I load this,
dump it into an -Fc-format dump file, and run something like
    pg_restore -j10 src.dump -d target
then I can reproduce the reported deadlock failure --- not
entirely reliably, but it does happen.

[1] https://www.postgresql.org/message-id/67469c1c-38bc-7d94-918a-67033f5dd731%40gmx.net

--
-- PostgreSQL database dump
--

-- Dumped from database version 18devel
-- Dumped by pg_dump version 18devel

SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET transaction_timeout = 0;
SET client_encoding = 'SQL_ASCII';
SET standard_conforming_strings = on;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET xmloption = content;
SET client_min_messages = warning;
SET row_security = off;

SET default_tablespace = '';

--
-- Name: parent1; Type: TABLE; Schema: public; Owner: postgres
--

CREATE TABLE public.parent1 (
    id integer NOT NULL,
    b text
)
PARTITION BY LIST (id);


ALTER TABLE public.parent1 OWNER TO postgres;

SET default_table_access_method = heap;

--
-- Name: c11; Type: TABLE; Schema: public; Owner: postgres
--

CREATE TABLE public.c11 (
    id integer CONSTRAINT parent1_id_not_null NOT NULL,
    b text
);


ALTER TABLE public.c11 OWNER TO postgres;

--
-- Name: c12; Type: TABLE; Schema: public; Owner: postgres
--

CREATE TABLE public.c12 (
    id integer CONSTRAINT parent1_id_not_null NOT NULL,
    b text
);


ALTER TABLE public.c12 OWNER TO postgres;

--
-- Name: parent2; Type: TABLE; Schema: public; Owner: postgres
--

CREATE TABLE public.parent2 (
    id integer NOT NULL,
    ref integer,
    b text
)
PARTITION BY LIST (id);


ALTER TABLE public.parent2 OWNER TO postgres;

--
-- Name: c21; Type: TABLE; Schema: public; Owner: postgres
--

CREATE TABLE public.c21 (
    id integer CONSTRAINT parent2_id_not_null NOT NULL,
    ref integer,
    b text
);


ALTER TABLE public.c21 OWNER TO postgres;

--
-- Name: c22; Type: TABLE; Schema: public; Owner: postgres
--

CREATE TABLE public.c22 (
    id integer CONSTRAINT parent2_id_not_null NOT NULL,
    ref integer,
    b text
);


ALTER TABLE public.c22 OWNER TO postgres;

--
-- Name: c11; Type: TABLE ATTACH; Schema: public; Owner: postgres
--

ALTER TABLE ONLY public.parent1 ATTACH PARTITION public.c11 FOR VALUES IN (1);


--
-- Name: c12; Type: TABLE ATTACH; Schema: public; Owner: postgres
--

ALTER TABLE ONLY public.parent1 ATTACH PARTITION public.c12 FOR VALUES IN (2);


--
-- Name: c21; Type: TABLE ATTACH; Schema: public; Owner: postgres
--

ALTER TABLE ONLY public.parent2 ATTACH PARTITION public.c21 FOR VALUES IN (1);


--
-- Name: c22; Type: TABLE ATTACH; Schema: public; Owner: postgres
--

ALTER TABLE ONLY public.parent2 ATTACH PARTITION public.c22 FOR VALUES IN (2);


--
-- Data for Name: c11; Type: TABLE DATA; Schema: public; Owner: postgres
--

COPY public.c11 (id, b) FROM stdin;
1    foo
\.


--
-- Data for Name: c12; Type: TABLE DATA; Schema: public; Owner: postgres
--

COPY public.c12 (id, b) FROM stdin;
\.


--
-- Data for Name: c21; Type: TABLE DATA; Schema: public; Owner: postgres
--

COPY public.c21 (id, ref, b) FROM stdin;
\.


--
-- Data for Name: c22; Type: TABLE DATA; Schema: public; Owner: postgres
--

COPY public.c22 (id, ref, b) FROM stdin;
2    1    bar
\.


--
-- Statistics for Name: c11; Type: STATISTICS DATA; Schema: public; Owner: -
--

SELECT * FROM pg_catalog.pg_restore_relation_stats(
    'version', '180000'::integer,
    'schemaname', 'public',
    'relname', 'c11',
    'relpages', '1'::integer,
    'reltuples', '1'::real,
    'relallvisible', '0'::integer,
    'relallfrozen', '0'::integer
);


--
-- Statistics for Name: c12; Type: STATISTICS DATA; Schema: public; Owner: -
--

SELECT * FROM pg_catalog.pg_restore_relation_stats(
    'version', '180000'::integer,
    'schemaname', 'public',
    'relname', 'c12',
    'relpages', '0'::integer,
    'reltuples', '-1'::real,
    'relallvisible', '0'::integer,
    'relallfrozen', '0'::integer
);


--
-- Statistics for Name: c21; Type: STATISTICS DATA; Schema: public; Owner: -
--

SELECT * FROM pg_catalog.pg_restore_relation_stats(
    'version', '180000'::integer,
    'schemaname', 'public',
    'relname', 'c21',
    'relpages', '0'::integer,
    'reltuples', '-1'::real,
    'relallvisible', '0'::integer,
    'relallfrozen', '0'::integer
);


--
-- Statistics for Name: c22; Type: STATISTICS DATA; Schema: public; Owner: -
--

SELECT * FROM pg_catalog.pg_restore_relation_stats(
    'version', '180000'::integer,
    'schemaname', 'public',
    'relname', 'c22',
    'relpages', '1'::integer,
    'reltuples', '1'::real,
    'relallvisible', '0'::integer,
    'relallfrozen', '0'::integer
);


--
-- Statistics for Name: parent1; Type: STATISTICS DATA; Schema: public; Owner: -
--

SELECT * FROM pg_catalog.pg_restore_relation_stats(
    'version', '180000'::integer,
    'schemaname', 'public',
    'relname', 'parent1',
    'relpages', '0'::integer,
    'reltuples', '-1'::real,
    'relallvisible', '0'::integer,
    'relallfrozen', '0'::integer
);


--
-- Statistics for Name: parent2; Type: STATISTICS DATA; Schema: public; Owner: -
--

SELECT * FROM pg_catalog.pg_restore_relation_stats(
    'version', '180000'::integer,
    'schemaname', 'public',
    'relname', 'parent2',
    'relpages', '0'::integer,
    'reltuples', '-1'::real,
    'relallvisible', '0'::integer,
    'relallfrozen', '0'::integer
);


--
-- Name: parent1 parent1_pkey; Type: CONSTRAINT; Schema: public; Owner: postgres
--

ALTER TABLE ONLY public.parent1
    ADD CONSTRAINT parent1_pkey PRIMARY KEY (id);


--
-- Name: c11 c11_pkey; Type: CONSTRAINT; Schema: public; Owner: postgres
--

ALTER TABLE ONLY public.c11
    ADD CONSTRAINT c11_pkey PRIMARY KEY (id);


--
-- Statistics for Name: c11_pkey; Type: STATISTICS DATA; Schema: public; Owner: -
--

SELECT * FROM pg_catalog.pg_restore_relation_stats(
    'version', '180000'::integer,
    'schemaname', 'public',
    'relname', 'c11_pkey',
    'relpages', '1'::integer,
    'reltuples', '0'::real,
    'relallvisible', '0'::integer,
    'relallfrozen', '0'::integer
);


--
-- Name: c12 c12_pkey; Type: CONSTRAINT; Schema: public; Owner: postgres
--

ALTER TABLE ONLY public.c12
    ADD CONSTRAINT c12_pkey PRIMARY KEY (id);


--
-- Statistics for Name: c12_pkey; Type: STATISTICS DATA; Schema: public; Owner: -
--

SELECT * FROM pg_catalog.pg_restore_relation_stats(
    'version', '180000'::integer,
    'schemaname', 'public',
    'relname', 'c12_pkey',
    'relpages', '1'::integer,
    'reltuples', '0'::real,
    'relallvisible', '0'::integer,
    'relallfrozen', '0'::integer
);


--
-- Name: parent2 parent2_pkey; Type: CONSTRAINT; Schema: public; Owner: postgres
--

ALTER TABLE ONLY public.parent2
    ADD CONSTRAINT parent2_pkey PRIMARY KEY (id);


--
-- Name: c21 c21_pkey; Type: CONSTRAINT; Schema: public; Owner: postgres
--

ALTER TABLE ONLY public.c21
    ADD CONSTRAINT c21_pkey PRIMARY KEY (id);


--
-- Statistics for Name: c21_pkey; Type: STATISTICS DATA; Schema: public; Owner: -
--

SELECT * FROM pg_catalog.pg_restore_relation_stats(
    'version', '180000'::integer,
    'schemaname', 'public',
    'relname', 'c21_pkey',
    'relpages', '1'::integer,
    'reltuples', '0'::real,
    'relallvisible', '0'::integer,
    'relallfrozen', '0'::integer
);


--
-- Name: c22 c22_pkey; Type: CONSTRAINT; Schema: public; Owner: postgres
--

ALTER TABLE ONLY public.c22
    ADD CONSTRAINT c22_pkey PRIMARY KEY (id);


--
-- Statistics for Name: c22_pkey; Type: STATISTICS DATA; Schema: public; Owner: -
--

SELECT * FROM pg_catalog.pg_restore_relation_stats(
    'version', '180000'::integer,
    'schemaname', 'public',
    'relname', 'c22_pkey',
    'relpages', '1'::integer,
    'reltuples', '0'::real,
    'relallvisible', '0'::integer,
    'relallfrozen', '0'::integer
);


--
-- Statistics for Name: parent1_pkey; Type: STATISTICS DATA; Schema: public; Owner: -
--

SELECT * FROM pg_catalog.pg_restore_relation_stats(
    'version', '180000'::integer,
    'schemaname', 'public',
    'relname', 'parent1_pkey',
    'relpages', '0'::integer,
    'reltuples', '0'::real,
    'relallvisible', '0'::integer,
    'relallfrozen', '0'::integer
);


--
-- Statistics for Name: parent2_pkey; Type: STATISTICS DATA; Schema: public; Owner: -
--

SELECT * FROM pg_catalog.pg_restore_relation_stats(
    'version', '180000'::integer,
    'schemaname', 'public',
    'relname', 'parent2_pkey',
    'relpages', '0'::integer,
    'reltuples', '0'::real,
    'relallvisible', '0'::integer,
    'relallfrozen', '0'::integer
);


--
-- Name: c11_pkey; Type: INDEX ATTACH; Schema: public; Owner: postgres
--

ALTER INDEX public.parent1_pkey ATTACH PARTITION public.c11_pkey;


--
-- Name: c12_pkey; Type: INDEX ATTACH; Schema: public; Owner: postgres
--

ALTER INDEX public.parent1_pkey ATTACH PARTITION public.c12_pkey;


--
-- Name: c21_pkey; Type: INDEX ATTACH; Schema: public; Owner: postgres
--

ALTER INDEX public.parent2_pkey ATTACH PARTITION public.c21_pkey;


--
-- Name: c22_pkey; Type: INDEX ATTACH; Schema: public; Owner: postgres
--

ALTER INDEX public.parent2_pkey ATTACH PARTITION public.c22_pkey;


--
-- Name: parent2 parent2_ref_fkey; Type: FK CONSTRAINT; Schema: public; Owner: postgres
--

ALTER TABLE public.parent2
    ADD CONSTRAINT parent2_ref_fkey FOREIGN KEY (ref) REFERENCES public.parent1(id);


--
-- PostgreSQL database dump complete
--

Re: Fundamental scheduling bug in parallel restore of partitioned tables

From

Tom Lane

Date:

11 April, 03:08:22

I wrote:
> I think that the most intellectually rigorous solution is to
> generate dummy TABLE DATA objects for partitioned tables, which
> don't actually contain data but merely carry dependencies on
> each of the child tables' TABLE DATA objects.

Here's a draft patch for this.  It seems to fix the problem in
light testing.  Some notes:

* Quite a lot of the patch is concerned with making various places
treat the new PARTITIONED DATA TOC entry type the same as TABLE DATA.
I considered removing that distinction and representing a partitioned
table's data object as TABLE DATA with no dataDumper, but it seems to
me this way is clearer.  Maybe others will think differently though;
it'd make for a smaller patch.

* It's annoying that we have to touch _tocEntryRequired's "Special
Case" logic for deciding whether an entry is schema or data, because
that means that old copies of pg_restore will think these entries are
schema and thus ignore them in a data-only restore.  But I think it
doesn't matter too much, because in a data-only restore we'd not be
creating indexes or foreign keys, so the scheduling bug isn't really
problematic.

* I'm not quite certain whether identify_locking_dependencies() needs
to treat PARTITIONED DATA dependencies as lockable.  I assumed here
that it does, but maybe we don't take out exclusive locks on
partitioned tables during restore?

* I noticed that a --data-only dump of the regression database now
complains:

$ pg_dump --data-only regression >r.dump
pg_dump: warning: there are circular foreign-key constraints on this table:
pg_dump: detail: parted_self_fk
pg_dump: hint: You might not be able to restore the dump without using --disable-triggers or temporarily dropping the
constraints.
pg_dump: hint: Consider using a full dump instead of a --data-only dump to avoid this problem.

The existing code does not produce this warning, but I think doing so
is correct.  The reason we missed the issue before is that
getTableDataFKConstraints ignores tables without a dataObj, so before
this patch it ignored partitioned tables altogether.

Comments?

            regards, tom lane

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index f961162f365..a05ff716fe6 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -1988,12 +1988,14 @@ buildTocEntryArrays(ArchiveHandle *AH)
         AH->tocsByDumpId[te->dumpId] = te;

         /*
-         * tableDataId provides the TABLE DATA item's dump ID for each TABLE
-         * TOC entry that has a DATA item.  We compute this by reversing the
-         * TABLE DATA item's dependency, knowing that a TABLE DATA item has
-         * just one dependency and it is the TABLE item.
+         * tableDataId provides the DATA item's dump ID for each TABLE TOC
+         * entry that has a TABLE DATA or PARTITIONED DATA item.  We compute
+         * this by reversing the DATA item's dependency, knowing that its
+         * first dependency is the TABLE item.
          */
-        if (strcmp(te->desc, "TABLE DATA") == 0 && te->nDeps > 0)
+        if (te->nDeps > 0 &&
+            (strcmp(te->desc, "TABLE DATA") == 0 ||
+             strcmp(te->desc, "PARTITIONED DATA") == 0))
         {
             DumpId        tableId = te->dependencies[0];

@@ -2003,7 +2005,7 @@ buildTocEntryArrays(ArchiveHandle *AH)
              * item's dump ID, so there should be a place for it in the array.
              */
             if (tableId <= 0 || tableId > maxDumpId)
-                pg_fatal("bad table dumpId for TABLE DATA item");
+                pg_fatal("bad table dumpId for %s item", te->desc);

             AH->tableDataId[tableId] = te->dumpId;
         }
@@ -3140,6 +3142,7 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
         {
             if (strcmp(te->desc, "TABLE") == 0 ||
                 strcmp(te->desc, "TABLE DATA") == 0 ||
+                strcmp(te->desc, "PARTITIONED DATA") == 0 ||
                 strcmp(te->desc, "VIEW") == 0 ||
                 strcmp(te->desc, "FOREIGN TABLE") == 0 ||
                 strcmp(te->desc, "MATERIALIZED VIEW") == 0 ||
@@ -3194,13 +3197,14 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
     if (!te->hadDumper)
     {
         /*
-         * Special Case: If 'SEQUENCE SET' or anything to do with LOs, then it
-         * is considered a data entry.  We don't need to check for BLOBS or
-         * old-style BLOB COMMENTS entries, because they will have hadDumper =
-         * true ... but we do need to check new-style BLOB ACLs, comments,
-         * etc.
+         * Special Case: If 'PARTITIONED DATA', 'SEQUENCE SET' or anything to
+         * do with LOs, then it is considered a data entry.  We don't need to
+         * check for BLOBS or old-style BLOB COMMENTS entries, because they
+         * will have hadDumper = true ... but we do need to check new-style
+         * BLOB ACLs, comments, etc.
          */
-        if (strcmp(te->desc, "SEQUENCE SET") == 0 ||
+        if (strcmp(te->desc, "PARTITIONED DATA") == 0 ||
+            strcmp(te->desc, "SEQUENCE SET") == 0 ||
             strcmp(te->desc, "BLOB") == 0 ||
             strcmp(te->desc, "BLOB METADATA") == 0 ||
             (strcmp(te->desc, "ACL") == 0 &&
@@ -4996,14 +5000,14 @@ identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te)
         return;

     /*
-     * We assume the entry requires exclusive lock on each TABLE or TABLE DATA
-     * item listed among its dependencies.  Originally all of these would have
-     * been TABLE items, but repoint_table_dependencies would have repointed
-     * them to the TABLE DATA items if those are present (which they might not
-     * be, eg in a schema-only dump).  Note that all of the entries we are
-     * processing here are POST_DATA; otherwise there might be a significant
-     * difference between a dependency on a table and a dependency on its
-     * data, so that closer analysis would be needed here.
+     * We assume the entry requires exclusive lock on each TABLE, TABLE DATA,
+     * or PARTITIONED DATA item listed among its dependencies.  Originally all
+     * of these would have been TABLE items, but repoint_table_dependencies
+     * would have repointed them to the DATA items if those are present (which
+     * they might not be, eg in a schema-only dump).  Note that all of the
+     * entries we are processing here are POST_DATA; otherwise there might be
+     * a significant difference between a dependency on a table and a
+     * dependency on its data, so that closer analysis would be needed here.
      */
     lockids = (DumpId *) pg_malloc(te->nDeps * sizeof(DumpId));
     nlockids = 0;
@@ -5012,8 +5016,9 @@ identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te)
         DumpId        depid = te->dependencies[i];

         if (depid <= AH->maxDumpId && AH->tocsByDumpId[depid] != NULL &&
-            ((strcmp(AH->tocsByDumpId[depid]->desc, "TABLE DATA") == 0) ||
-             strcmp(AH->tocsByDumpId[depid]->desc, "TABLE") == 0))
+            (strcmp(AH->tocsByDumpId[depid]->desc, "TABLE") == 0 ||
+             strcmp(AH->tocsByDumpId[depid]->desc, "TABLE DATA") == 0 ||
+             strcmp(AH->tocsByDumpId[depid]->desc, "PARTITIONED DATA") == 0))
             lockids[nlockids++] = depid;
     }

diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index 365073b3eae..5e2eaf2d4c6 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -310,7 +310,8 @@ struct _archiveHandle

     /* arrays created after the TOC list is complete: */
     struct _tocEntry **tocsByDumpId;    /* TOCs indexed by dumpId */
-    DumpId       *tableDataId;    /* TABLE DATA ids, indexed by table dumpId */
+    DumpId       *tableDataId;    /* TABLE DATA and PARTITIONED DATA dumpIds,
+                                 * indexed by table dumpId */

     struct _tocEntry *currToc;    /* Used when dumping data */
     pg_compress_specification compression_spec; /* Requested specification for
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index f7c3af56304..42dda2c4220 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -25,6 +25,8 @@
  */
 #include "postgres_fe.h"

+#include <limits.h>
+
 #include "common/file_utils.h"
 #include "compress_io.h"
 #include "pg_backup_utils.h"
@@ -819,10 +821,10 @@ _ReopenArchive(ArchiveHandle *AH)
 /*
  * Prepare for parallel restore.
  *
- * The main thing that needs to happen here is to fill in TABLE DATA and BLOBS
- * TOC entries' dataLength fields with appropriate values to guide the
- * ordering of restore jobs.  The source of said data is format-dependent,
- * as is the exact meaning of the values.
+ * The main thing that needs to happen here is to fill in TABLE DATA,
+ * PARTITIONED_DATA, and BLOBS TOC entries' dataLength fields with appropriate
+ * values to guide the ordering of restore jobs.  The source of said data is
+ * format-dependent, as is the exact meaning of the values.
  *
  * A format module might also choose to do other setup here.
  */
@@ -830,6 +832,7 @@ static void
 _PrepParallelRestore(ArchiveHandle *AH)
 {
     lclContext *ctx = (lclContext *) AH->formatData;
+    bool        have_partitioned_data = false;
     TocEntry   *prev_te = NULL;
     lclTocEntry *prev_tctx = NULL;
     TocEntry   *te;
@@ -843,6 +846,10 @@ _PrepParallelRestore(ArchiveHandle *AH)
     {
         lclTocEntry *tctx = (lclTocEntry *) te->formatData;

+        /* Track whether there are any PARTITIONED_DATA items */
+        if (!have_partitioned_data && strcmp(te->desc, "PARTITIONED DATA") == 0)
+            have_partitioned_data = true;
+
         /*
          * Ignore entries without a known data offset; if we were unable to
          * seek to rewrite the TOC when creating the archive, this'll be all
@@ -873,6 +880,38 @@ _PrepParallelRestore(ArchiveHandle *AH)
         if (endpos > prev_tctx->dataPos)
             prev_te->dataLength = endpos - prev_tctx->dataPos;
     }
+
+    /*
+     * For PARTITIONED DATA items, add up the sizes of their child objects.
+     * (We couldn't do this earlier, since when we encounter a PARTITIONED
+     * DATA item in the first loop we typically don't know the dataLength of
+     * its last child yet.)
+     */
+    if (have_partitioned_data)
+    {
+        for (te = AH->toc->next; te != AH->toc; te = te->next)
+        {
+            if (strcmp(te->desc, "PARTITIONED DATA") != 0)
+                continue;
+            for (int i = 0; i < te->nDeps; i++)
+            {
+                DumpId        depid = te->dependencies[i];
+
+                if (depid <= AH->maxDumpId && AH->tocsByDumpId[depid] != NULL)
+                {
+                    pgoff_t        childLength = AH->tocsByDumpId[depid]->dataLength;
+
+                    te->dataLength += childLength;
+                    /* handle overflow -- unlikely except with 32-bit pgoff_t */
+                    if (unlikely(te->dataLength < 0))
+                    {
+                        te->dataLength = INT_MAX;
+                        break;
+                    }
+                }
+            }
+        }
+    }
 }

 /*
diff --git a/src/bin/pg_dump/pg_backup_directory.c b/src/bin/pg_dump/pg_backup_directory.c
index 21b00792a8a..dcb7dfc2ee7 100644
--- a/src/bin/pg_dump/pg_backup_directory.c
+++ b/src/bin/pg_dump/pg_backup_directory.c
@@ -37,6 +37,7 @@
 #include "postgres_fe.h"

 #include <dirent.h>
+#include <limits.h>
 #include <sys/stat.h>

 #include "common/file_utils.h"
@@ -722,16 +723,17 @@ setFilePath(ArchiveHandle *AH, char *buf, const char *relativeFilename)
 /*
  * Prepare for parallel restore.
  *
- * The main thing that needs to happen here is to fill in TABLE DATA and BLOBS
- * TOC entries' dataLength fields with appropriate values to guide the
- * ordering of restore jobs.  The source of said data is format-dependent,
- * as is the exact meaning of the values.
+ * The main thing that needs to happen here is to fill in TABLE DATA,
+ * PARTITIONED_DATA, and BLOBS TOC entries' dataLength fields with appropriate
+ * values to guide the ordering of restore jobs.  The source of said data is
+ * format-dependent, as is the exact meaning of the values.
  *
  * A format module might also choose to do other setup here.
  */
 static void
 _PrepParallelRestore(ArchiveHandle *AH)
 {
+    bool        have_partitioned_data = false;
     TocEntry   *te;

     for (te = AH->toc->next; te != AH->toc; te = te->next)
@@ -740,6 +742,10 @@ _PrepParallelRestore(ArchiveHandle *AH)
         char        fname[MAXPGPATH];
         struct stat st;

+        /* Track whether there are any PARTITIONED_DATA items */
+        if (!have_partitioned_data && strcmp(te->desc, "PARTITIONED DATA") == 0)
+            have_partitioned_data = true;
+
         /*
          * A dumpable object has set tctx->filename, any other object has not.
          * (see _ArchiveEntry).
@@ -784,6 +790,38 @@ _PrepParallelRestore(ArchiveHandle *AH)
         if (strcmp(te->desc, "BLOBS") == 0)
             te->dataLength *= 1024;
     }
+
+    /*
+     * For PARTITIONED DATA items, add up the sizes of their child objects.
+     * (Unlike pg_backup_custom.c, we could theoretically do this within the
+     * previous loop, but it seems best to keep the logic looking the same in
+     * both functions.)
+     */
+    if (have_partitioned_data)
+    {
+        for (te = AH->toc->next; te != AH->toc; te = te->next)
+        {
+            if (strcmp(te->desc, "PARTITIONED DATA") != 0)
+                continue;
+            for (int i = 0; i < te->nDeps; i++)
+            {
+                DumpId        depid = te->dependencies[i];
+
+                if (depid <= AH->maxDumpId && AH->tocsByDumpId[depid] != NULL)
+                {
+                    pgoff_t        childLength = AH->tocsByDumpId[depid]->dataLength;
+
+                    te->dataLength += childLength;
+                    /* handle overflow -- unlikely except with 32-bit pgoff_t */
+                    if (unlikely(te->dataLength < 0))
+                    {
+                        te->dataLength = INT_MAX;
+                        break;
+                    }
+                }
+            }
+        }
+    }
 }

 /*
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index c6e6d3b2b86..3c174924770 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -258,6 +258,7 @@ static void prohibit_crossdb_refs(PGconn *conn, const char *dbname,

 static NamespaceInfo *findNamespace(Oid nsoid);
 static void dumpTableData(Archive *fout, const TableDataInfo *tdinfo);
+static void dumpPartitionedData(Archive *fout, const TableDataInfo *tdinfo);
 static void refreshMatViewData(Archive *fout, const TableDataInfo *tdinfo);
 static const char *getRoleName(const char *roleoid_str);
 static void collectRoleNames(Archive *fout);
@@ -347,6 +348,7 @@ static void getDomainConstraints(Archive *fout, TypeInfo *tyinfo);
 static void getTableData(DumpOptions *dopt, TableInfo *tblinfo, int numTables, char relkind);
 static void makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo);
 static void buildMatViewRefreshDependencies(Archive *fout);
+static void buildPartitionedDataDependencies(Archive *fout);
 static void getTableDataFKConstraints(void);
 static void determineNotNullFlags(Archive *fout, PGresult *res, int r,
                                   TableInfo *tbinfo, int j,
@@ -1076,6 +1078,7 @@ main(int argc, char **argv)
     {
         getTableData(&dopt, tblinfo, numTables, 0);
         buildMatViewRefreshDependencies(fout);
+        buildPartitionedDataDependencies(fout);
         if (!dopt.dumpSchema)
             getTableDataFKConstraints();
     }
@@ -2883,6 +2886,33 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
     destroyPQExpBuffer(clistBuf);
 }

+/*
+ * dumpPartitionedData -
+ *      dump the contents of a partitioned table
+ *
+ * Actually, this just makes an ArchiveEntry for the table contents.
+ * Furthermore, the ArchiveEntry doesn't really carry any data itself.
+ * What it carries is dependencies on the table data objects for the
+ * partitioned table's immediate children.  In that way, a dependency
+ * on the PARTITIONED DATA TOC entry represents a dependency on all
+ * data within the partition hierarchy.
+ */
+static void
+dumpPartitionedData(Archive *fout, const TableDataInfo *tdinfo)
+{
+    TableInfo  *tbinfo = tdinfo->tdtable;
+
+    if (tdinfo->dobj.dump & DUMP_COMPONENT_DATA)
+        ArchiveEntry(fout,
+                     tdinfo->dobj.catId,    /* catalog ID */
+                     tdinfo->dobj.dumpId,    /* dump ID */
+                     ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+                                  .namespace = tbinfo->dobj.namespace->dobj.name,
+                                  .owner = tbinfo->rolname,
+                                  .description = "PARTITIONED DATA",
+                                  .section = SECTION_DATA));
+}
+
 /*
  * refreshMatViewData -
  *      load or refresh the contents of a single materialized view
@@ -2965,9 +2995,6 @@ makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo)
          !simple_oid_list_member(&foreign_servers_include_oids,
                                  tbinfo->foreign_server)))
         return;
-    /* Skip partitioned tables (data in partitions) */
-    if (tbinfo->relkind == RELKIND_PARTITIONED_TABLE)
-        return;

     /* Don't dump data in unlogged tables, if so requested */
     if (tbinfo->relpersistence == RELPERSISTENCE_UNLOGGED &&
@@ -2986,6 +3013,8 @@ makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo)
         tdinfo->dobj.objType = DO_REFRESH_MATVIEW;
     else if (tbinfo->relkind == RELKIND_SEQUENCE)
         tdinfo->dobj.objType = DO_SEQUENCE_SET;
+    else if (tbinfo->relkind == RELKIND_PARTITIONED_TABLE)
+        tdinfo->dobj.objType = DO_PARTITIONED_DATA;
     else
         tdinfo->dobj.objType = DO_TABLE_DATA;

@@ -3000,6 +3029,12 @@ makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo)
     tdinfo->dobj.namespace = tbinfo->dobj.namespace;
     tdinfo->tdtable = tbinfo;
     tdinfo->filtercond = NULL;    /* might get set later */
+
+    /*
+     * The first dependency of any of these objTypes must be their table.  We
+     * may add more dependencies later, for example between PARTITIONED DATA
+     * objects and their children, or to enforce foreign key dump order.
+     */
     addObjectDependency(&tdinfo->dobj, tbinfo->dobj.dumpId);

     /* A TableDataInfo contains data, of course */
@@ -3134,6 +3169,52 @@ buildMatViewRefreshDependencies(Archive *fout)
     destroyPQExpBuffer(query);
 }

+/*
+ * buildPartitionedDataDependencies -
+ *      add dump-order dependencies for partitioned tables' data
+ *
+ * We make all PARTITIONED_DATA objects depend on the TABLE_DATA or
+ * PARTITIONED_DATA objects of the immediate children of the partitioned
+ * table.  This might seem like the wrong direction for the dependencies
+ * to run, but it's what we want for controlling restore order.  The
+ * PARTITIONED_DATA object will not be considered restorable until after
+ * all the child data objects are restored, and thus a dependency on it
+ * from another object such as an FK constraint will block that object from
+ * being restored until all the data in the partition hierarchy is loaded.
+ */
+static void
+buildPartitionedDataDependencies(Archive *fout)
+{
+    DumpableObject **dobjs;
+    int            numObjs;
+    int            i;
+
+    /* Search through all the dumpable objects for TableAttachInfos */
+    getDumpableObjects(&dobjs, &numObjs);
+    for (i = 0; i < numObjs; i++)
+    {
+        if (dobjs[i]->objType == DO_TABLE_ATTACH)
+        {
+            TableAttachInfo *attachinfo = (TableAttachInfo *) dobjs[i];
+            TableInfo  *parentTbl = attachinfo->parentTbl;
+            TableInfo  *partitionTbl = attachinfo->partitionTbl;
+
+            /* Not interesting unless both tables are to be dumped */
+            if (parentTbl->dataObj == NULL ||
+                partitionTbl->dataObj == NULL)
+                continue;
+
+            /*
+             * Okay, make parent table's PARTITIONED_DATA object depend on the
+             * child table's TABLE_DATA or PARTITIONED_DATA object.
+             */
+            addObjectDependency(&parentTbl->dataObj->dobj,
+                                partitionTbl->dataObj->dobj.dumpId);
+        }
+    }
+    free(dobjs);
+}
+
 /*
  * getTableDataFKConstraints -
  *      add dump-order dependencies reflecting foreign key constraints
@@ -3173,7 +3254,8 @@ getTableDataFKConstraints(void)

             /*
              * Okay, make referencing table's TABLE_DATA object depend on the
-             * referenced table's TABLE_DATA object.
+             * referenced table's TABLE_DATA object.  (Either one could be a
+             * PARTITIONED_DATA object, too.)
              */
             addObjectDependency(&cinfo->contable->dataObj->dobj,
                                 ftable->dataObj->dobj.dumpId);
@@ -11448,6 +11530,9 @@ dumpDumpableObject(Archive *fout, DumpableObject *dobj)
         case DO_TABLE_DATA:
             dumpTableData(fout, (const TableDataInfo *) dobj);
             break;
+        case DO_PARTITIONED_DATA:
+            dumpPartitionedData(fout, (const TableDataInfo *) dobj);
+            break;
         case DO_DUMMY_TYPE:
             /* table rowtypes and array types are never dumped separately */
             break;
@@ -19723,6 +19808,7 @@ addBoundaryDependencies(DumpableObject **dobjs, int numObjs,
                 addObjectDependency(preDataBound, dobj->dumpId);
                 break;
             case DO_TABLE_DATA:
+            case DO_PARTITIONED_DATA:
             case DO_SEQUENCE_SET:
             case DO_LARGE_OBJECT:
             case DO_LARGE_OBJECT_DATA:
@@ -19792,8 +19878,8 @@ addBoundaryDependencies(DumpableObject **dobjs, int numObjs,
  *
  * Just to make things more complicated, there are also "special" dependencies
  * such as the dependency of a TABLE DATA item on its TABLE, which we must
- * not rearrange because pg_restore knows that TABLE DATA only depends on
- * its table.  In these cases we must leave the dependencies strictly as-is
+ * not rearrange because pg_restore knows that TABLE DATA's first dependency
+ * is its table.  In these cases we must leave the dependencies strictly as-is
  * even if they refer to not-to-be-dumped objects.
  *
  * To handle this, the convention is that "special" dependencies are created
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index b426b5e4736..49c0e489ccf 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -63,6 +63,7 @@ typedef enum
     DO_PROCLANG,
     DO_CAST,
     DO_TABLE_DATA,
+    DO_PARTITIONED_DATA,
     DO_SEQUENCE_SET,
     DO_DUMMY_TYPE,
     DO_TSPARSER,
@@ -399,6 +400,10 @@ typedef struct _attrDefInfo
     bool        separate;        /* true if must dump as separate item */
 } AttrDefInfo;

+/*
+ * TableDataInfo is used for DO_TABLE_DATA, DO_PARTITIONED_DATA,
+ * DO_REFRESH_MATVIEW, and DO_SEQUENCE_SET objTypes.
+ */
 typedef struct _tableDataInfo
 {
     DumpableObject dobj;
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 0b0977788f1..927cf7f7daa 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -129,6 +129,7 @@ static const int dbObjectTypePriority[] =
     [DO_PROCLANG] = PRIO_PROCLANG,
     [DO_CAST] = PRIO_CAST,
     [DO_TABLE_DATA] = PRIO_TABLE_DATA,
+    [DO_PARTITIONED_DATA] = PRIO_TABLE_DATA,
     [DO_SEQUENCE_SET] = PRIO_SEQUENCE_SET,
     [DO_DUMMY_TYPE] = PRIO_DUMMY_TYPE,
     [DO_TSPARSER] = PRIO_TSPARSER,
@@ -1233,13 +1234,15 @@ repairDependencyLoop(DumpableObject **loop,
     }

     /*
-     * If all the objects are TABLE_DATA items, what we must have is a
-     * circular set of foreign key constraints (or a single self-referential
-     * table).  Print an appropriate complaint and break the loop arbitrarily.
+     * If all the objects are TABLE_DATA or PARTITIONED_DATA items, what we
+     * must have is a circular set of foreign key constraints (or a single
+     * self-referential table).  Print an appropriate complaint and break the
+     * loop arbitrarily.
      */
     for (i = 0; i < nLoop; i++)
     {
-        if (loop[i]->objType != DO_TABLE_DATA)
+        if (loop[i]->objType != DO_TABLE_DATA &&
+            loop[i]->objType != DO_PARTITIONED_DATA)
             break;
     }
     if (i >= nLoop)
@@ -1433,6 +1436,11 @@ describeDumpableObject(DumpableObject *obj, char *buf, int bufsize)
                      "TABLE DATA %s  (ID %d OID %u)",
                      obj->name, obj->dumpId, obj->catId.oid);
             return;
+        case DO_PARTITIONED_DATA:
+            snprintf(buf, bufsize,
+                     "PARTITIONED DATA %s  (ID %d OID %u)",
+                     obj->name, obj->dumpId, obj->catId.oid);
+            return;
         case DO_SEQUENCE_SET:
             snprintf(buf, bufsize,
                      "SEQUENCE SET %s  (ID %d OID %u)",

Re: Fundamental scheduling bug in parallel restore of partitioned tables

From

Tom Lane

Date:

14 April, 20:44:06

I wrote:
> Here's a draft patch for this.  It seems to fix the problem in
> light testing.

I realized that the "repro" I had for this isn't testing the same
thing that Dimitrios is seeing; what it is exposing looks more like
a bug or at least a behavioral change due to the v18 work to record
not-null constraints in pg_constraint [1].  So my patch may fix his
problem or it may not.  It would be good to have a reproducer that
fails (not necessarily every time) in v17 or earlier.

In addition to that uncertainty, pushing the patch now would get in
the way of identifying what's really going on at [1].  So I'm going
to sit on it for now, and maybe it's going to turn into v19 material.

            regards, tom lane

[1] https://www.postgresql.org/message-id/1280408.1744650810%40sss.pgh.pa.us

Re: Fundamental scheduling bug in parallel restore of partitioned tables

From

Ashutosh Bapat

Date:

15 April, 13:38:56

On Mon, Apr 14, 2025 at 11:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > Here's a draft patch for this.  It seems to fix the problem in
> > light testing.
>
> I realized that the "repro" I had for this isn't testing the same
> thing that Dimitrios is seeing; what it is exposing looks more like
> a bug or at least a behavioral change due to the v18 work to record
> not-null constraints in pg_constraint [1].  So my patch may fix his
> problem or it may not.  It would be good to have a reproducer that
> fails (not necessarily every time) in v17 or earlier.
>

I tried to reproduce the problem using your script on v17, but could't
get either deadlock or constraint violation error.

>
> This is disastrous for assorted reasons.  The ALTER ADD CONSTRAINT
> command might fail outright if we've loaded data for the referencing
> table but not the referenced table.

There's a comment in getConstraints()
/*
* Restoring an FK that points to a partitioned table requires that
* all partition indexes have been attached beforehand. Ensure that
* happens by making the constraint depend on each index partition
* attach object.
*/
FK constraint addition will wait for the child indexes in referenced
partitioned table to be attached, which in turn wait for data to be
loaded in the child tables. So it doesn't look like we will see ADD
constraint failing. I may be missing something though.

>  It could deadlock against other
> parallel restore jobs, as reported in [1] (and which I find not
> too terribly hard to reproduce here).

> Even if it doesn't fail, if
> it completes before we load data for the referencing table, we'll
> have to do retail FK checks, greatly slowing that data load.

FWIW, Executing pg_restore -j2 -v, I think, I see evidence that the FK
constraint is created before data is loaded into the referencing
table.

pg_restore: processing data for table "public.c22"
... snip
pg_restore: launching item 2477 FK CONSTRAINT parent2 parent2_ref_fkey
pg_restore: creating FK CONSTRAINT "public.parent2 parent2_ref_fkey"
pg_restore: finished item 2626 TABLE DATA c22
... snip
pg_restore: launching item 2625 TABLE DATA c21
pg_restore: processing data for table "public.c21"
pg_restore: finished item 2477 FK CONSTRAINT parent2 parent2_ref_fkey
... snip
pg_restore: finished item 2625 TABLE DATA c21

I tried applying your patch on v17 to see whether it causes the FK
creation to wait, but the patch doesn't apply cleanly.

--
Best Wishes,
Ashutosh Bapat

Re: Fundamental scheduling bug in parallel restore of partitioned tables

From

Dimitrios Apostolou

Date:

15 April, 19:02:40

On Mon, 14 Apr 2025, Tom Lane wrote:

> I wrote:
>> Here's a draft patch for this.  It seems to fix the problem in
>> light testing.
>
> I realized that the "repro" I had for this isn't testing the same
> thing that Dimitrios is seeing; what it is exposing looks more like
> a bug or at least a behavioral change due to the v18 work to record
> not-null constraints in pg_constraint [1].  So my patch may fix his
> problem or it may not.  It would be good to have a reproducer that
> fails (not necessarily every time) in v17 or earlier.

Thank you for your work on it.

I only got the "ERROR: deadlock detected" message once, with pg_restore
compiled from master branch. My dump is too large to test it many times on
v17 so I can't tell if it occurs there.

In general I believe that dependency resolution is not optimal, either
there is a deadlock bug or not. It can definitely be improved as work
(mostly post-data) is not parallelized as much as it can.

Anyway if I get the deadlock on v17 I'll update the initial thread.

Thanks,
Dimitris

Re: Fundamental scheduling bug in parallel restore of partitioned tables

From

Tom Lane

Date:

15 April, 19:18:41

Dimitrios Apostolou <jimis@gmx.net> writes:
> I only got the "ERROR: deadlock detected" message once, with pg_restore 
> compiled from master branch.

Oh, hmm ... do you recall just when on the master branch?  I'm
wondering if it was before or after 14e87ffa5 (2024-11-08).
If it was after, then it might have been subject to the same
issue with not-null constraints that I ran into.

            regards, tom lane

Re: Fundamental scheduling bug in parallel restore of partitioned tables

From

Dimitrios Apostolou

Date:

15 April, 19:35:48

On 15 April 2025 18:18:41 CEST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Dimitrios Apostolou <jimis@gmx.net> writes:
>> I only got the "ERROR: deadlock detected" message once, with pg_restore
>> compiled from master branch.
>
>Oh, hmm ... do you recall just when on the master branch?  I'm
>wondering if it was before or after 14e87ffa5 (2024-11-08).
>If it was after, then it might have been subject to the same
>issue with not-null constraints that I ran into.
>
>            regards, tom lane

Definitely after.

Re: Fundamental scheduling bug in parallel restore of partitioned tables

From

Tom Lane

Date:

15 April, 21:49:22

Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
> On Mon, Apr 14, 2025 at 11:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This is disastrous for assorted reasons.  The ALTER ADD CONSTRAINT
>> command might fail outright if we've loaded data for the referencing
>> table but not the referenced table.

> There's a comment in getConstraints()
> /*
> * Restoring an FK that points to a partitioned table requires that
> * all partition indexes have been attached beforehand. Ensure that
> * happens by making the constraint depend on each index partition
> * attach object.
> */

Ah, that is an excellent point which I missed.  And the INDEX ATTACH
objects have dependencies on the leaf tables, which *will* get
repointed to their TABLE DATA objects by repoint_table_dependencies.
So by the time we are ready to restore the FK CONSTRAINT object,
we are certain to have loaded all the data of the referenced table.
But there's nothing delaying the constraint till after the referencing
table's data is loaded.

>> Even if it doesn't fail, if
>> it completes before we load data for the referencing table, we'll
>> have to do retail FK checks, greatly slowing that data load.

> FWIW, Executing pg_restore -j2 -v, I think, I see evidence that the FK
> constraint is created before data is loaded into the referencing
> table.

Yes, I reproduced that as well.  That squares with the above
analysis.

So at this point we have:

#1: ADD CONSTRAINT failure because of missing referenced data:
not possible after all.

#2: Deadlock between parallel restore jobs: possible in HEAD, but
it seems likely to be a bug introduced by the not-null-constraint
work rather than being pg_restore's fault.  We have no evidence
that such a deadlock can happen in released branches, and the lack
of field reports suggests that it can't.

#3: Restoring the FK constraint before referencing data is loaded:
this seems to be possible, and it's a performance problem, but
no more than that.

So now I withdraw the suggestion that this patch needs to be
back-patched.  We may not even need it in v18, if another fix
for #2 is found.  Fixing #3 would be a desirable thing to do
in v19, but if that's the only thing at stake then it's not
something to break feature freeze for.

For the moment I'll mark this CF entry as meant for v19.
We can resurrect consideration of it for v18 if there's not
a better way to fix the deadlock problem.

            regards, tom lane

Re: Fundamental scheduling bug in parallel restore of partitioned tables

From

Ashutosh Bapat

Date:

16 April, 12:47:54

On Wed, Apr 16, 2025 at 12:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
> > On Mon, Apr 14, 2025 at 11:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> This is disastrous for assorted reasons.  The ALTER ADD CONSTRAINT
> >> command might fail outright if we've loaded data for the referencing
> >> table but not the referenced table.
>
> > There's a comment in getConstraints()
> > /*
> > * Restoring an FK that points to a partitioned table requires that
> > * all partition indexes have been attached beforehand. Ensure that
> > * happens by making the constraint depend on each index partition
> > * attach object.
> > */
>
> Ah, that is an excellent point which I missed.  And the INDEX ATTACH
> objects have dependencies on the leaf tables, which *will* get
> repointed to their TABLE DATA objects by repoint_table_dependencies.
> So by the time we are ready to restore the FK CONSTRAINT object,
> we are certain to have loaded all the data of the referenced table.
> But there's nothing delaying the constraint till after the referencing
> table's data is loaded.

Yes.

>
> So at this point we have:
>
> #1: ADD CONSTRAINT failure because of missing referenced data:
> not possible after all.
>

check

> #2: Deadlock between parallel restore jobs: possible in HEAD, but
> it seems likely to be a bug introduced by the not-null-constraint
> work rather than being pg_restore's fault.  We have no evidence
> that such a deadlock can happen in released branches, and the lack
> of field reports suggests that it can't.
>

I agree. I ran that repro several times against v17 and never saw a
deadlock, even after changing the number of partitions, sizes of
partitions, adding more tables etc.

Creating an FK constraint either happens before or after the loading
data in one of the partitions. That might point towards either
constraint creation or data load waiting for the other. But I have not
seen an evidence that a full deadlock chain is possible.

> #3: Restoring the FK constraint before referencing data is loaded:
> this seems to be possible, and it's a performance problem, but
> no more than that.
>

Yes. And once we fix this, there won't be waiting between constraint
creation and data load, so the remote possibility of a deadlock also
vanishes.

> So now I withdraw the suggestion that this patch needs to be
> back-patched.  We may not even need it in v18, if another fix
> for #2 is found.  Fixing #3 would be a desirable thing to do
> in v19, but if that's the only thing at stake then it's not
> something to break feature freeze for.
>
> For the moment I'll mark this CF entry as meant for v19.
> We can resurrect consideration of it for v18 if there's not
> a better way to fix the deadlock problem.

+1.


--
Best Wishes,
Ashutosh Bapat