Re: Parallel Inserts in CREATE TABLE AS - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Parallel Inserts in CREATE TABLE AS
Date:
Msg-id: 20200924024128.kyk3r5g7dnu3fxxx@alap3.anarazel.de
In response to: Parallel Inserts in CREATE TABLE AS (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses: Re: Parallel Inserts in CREATE TABLE AS
List: pgsql-hackers
Hi,

On 2020-09-23 17:20:20 +0530, Bharath Rupireddy wrote:
> The idea of this patch is to allow the leader and each worker to insert the
> tuples in parallel if the SELECT part of the CTAS is parallelizable.

Cool!

> The design:

I think it'd be good if you could explain a bit more why you think this
is safe to do in the way you have done it. E.g. from a quick scroll
through the patch, there's not even a comment explaining that the only
reason there doesn't need to be code dealing with xid assignment is that
we already did the catalog changes to create the table.

But how does that work for SELECT INTO? Are you prohibiting that? ...

> Pass the into clause, object id, command id from the leader to
> workers, so that each worker can create its own CTAS dest
> receiver. The leader inserts its share of tuples if instructed to do so,
> and so do the workers. Each worker atomically writes its number of
> inserted tuples into a shared memory variable; the leader combines this
> with its own number of inserted tuples and reports the total to the
> client.
>
> Below things are still pending. Thoughts are most welcome:
> 1. How best can we lift the "cannot insert tuples in a parallel worker"
> restriction from heap_prepare_insert() for only CTAS cases, or for that
> matter parallel copy? How about having a variable in one of the worker
> global contexts and using that? Of course, we can remove this restriction
> entirely in case we fully allow parallelism for INSERT INTO SELECT, CTAS,
> and COPY.

I have mentioned before that I think it'd be good if we changed the
insert APIs to have a more 'scan'-like structure. I am thinking of
something like

    TableInsertScan *table_begin_insert(Relation);
    table_tuple_insert(TableInsertScan *is, other, args);
    table_multi_insert(TableInsertScan *is, other, args);
    table_end_insert(TableInsertScan *);

That'd then replace the BulkInsertStateData logic we have right now.
But more importantly it'd allow an AM to optimize operations across
multiple inserts, which is important for column stores.

And for the purpose of your question, we could then have a

    table_insert_allow_parallel(TableInsertScan *);

or an additional argument to table_begin_insert().

> 3. Need to restrict parallel inserts if CTAS tries to create temp/global
> tables, as the workers will not have access to those tables. Need to
> analyze whether to allow parallelism if CTAS has prepared statements or
> WITH NO DATA.

In which case does CTAS not create a table?

You definitely need to ensure that the table is created before your
workers are started, and it needs to be in a different CommandId.

Greetings,

Andres Freund