Re: Parallel Inserts in CREATE TABLE AS - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Parallel Inserts in CREATE TABLE AS
Date:
Msg-id: 20200924024128.kyk3r5g7dnu3fxxx@alap3.anarazel.de
In response to: Parallel Inserts in CREATE TABLE AS (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses: Re: Parallel Inserts in CREATE TABLE AS
List: pgsql-hackers
Hi,

On 2020-09-23 17:20:20 +0530, Bharath Rupireddy wrote:
> The idea of this patch is to allow the leader and each worker to insert the
> tuples in parallel if the SELECT part of the CTAS is parallelizable.

Cool!

> The design:

I think it'd be good if you could explain a bit more why you think this
is safe to do in the way you have done it. E.g. from a quick scroll
through the patch, there's not even a comment explaining that the only
reason there doesn't need to be code dealing with xid assignment is that
we already did the catalog changes to create the table.

But how does that work for SELECT INTO? Are you prohibiting that? ...

> Pass the into clause, object id, command id from the leader to
> workers, so that each worker can create its own CTAS dest
> receiver. The leader inserts its share of tuples if instructed to do so,
> and so do the workers. Each worker atomically writes its number of
> inserted tuples into a shared memory variable; the leader combines this
> with its own number of inserted tuples and reports the total to the
> client.
>
> Below things are still pending. Thoughts are most welcome:
> 1. How best can we lift the "cannot insert tuples in a parallel worker"
> restriction from heap_prepare_insert() for only CTAS cases, or for that
> matter parallel copy? How about having a variable in one of the worker
> global contexts and using that? Of course, we can remove this restriction
> entirely in case we fully allow parallelism for INSERT INTO SELECT, CTAS,
> and COPY.

I have mentioned before that I think it'd be good if we changed the
insert APIs to have a more 'scan'-like structure. I am thinking of
something like

    TableInsertScan *table_begin_insert(Relation);
    table_tuple_insert(TableInsertScan *is, other, args);
    table_multi_insert(TableInsertScan *is, other, args);
    table_end_insert(TableInsertScan *);

That'd then replace the BulkInsertStateData logic we have right now.
But more importantly it'd allow an AM to optimize operations across
multiple inserts, which is important for column stores.

And for the purpose of your question, we could then have a

    table_insert_allow_parallel(TableInsertScan *);

or an additional argument to table_begin_insert().

> 3. Need to restrict parallel inserts if CTAS tries to create temp/global
> tables, as the workers will not have access to those tables. Need to
> analyze whether to allow parallelism if CTAS has prepared statements or
> WITH NO DATA.

In which case does CTAS not create a table?

You definitely need to ensure that the table is created before your
workers are started, and it needs to be in a different CommandId.

Greetings,

Andres Freund