Re: Parallel INSERT (INTO ... SELECT ...) - Mailing list pgsql-hackers
From            | Andres Freund
Subject         | Re: Parallel INSERT (INTO ... SELECT ...)
Date            |
Msg-id          | 20200924022117.x22cdky5fy5cwljh@alap3.anarazel.de
In response to  | Parallel INSERT (INTO ... SELECT ...)  (Greg Nancarrow <gregn4422@gmail.com>)
Responses       | Re: Parallel INSERT (INTO ... SELECT ...)
                | Re: Parallel INSERT (INTO ... SELECT ...)
List            | pgsql-hackers
Hi,

On 2020-09-22 14:55:21 +1000, Greg Nancarrow wrote:
> Following on from Dilip Kumar's POC patch for allowing parallelism of
> the SELECT part of "INSERT INTO ... SELECT ...", I have attached a POC
> patch for allowing parallelism of both the INSERT and SELECT parts,
> where it can be allowed.

Cool! I think it'd be good if you outlined what your approach is to
make this safe.

> For cases where it can't be allowed (e.g. INSERT into a table with
> foreign keys, or INSERT INTO ... SELECT ... ON CONFLICT ... DO UPDATE
> ...") it at least allows parallelism of the SELECT part.

I think it'd be good to do this part separately and first, independent
of whether the insert part can be parallelized.

> Obviously I've had to update the planner and executor and
> parallel-worker code to make this happen, hopefully not breaking too
> many things along the way.

Hm, it looks like you've removed a fair number of checks; it's not
clear to me why that's safe in each instance.

> - Currently only for "INSERT INTO ... SELECT ...". To support "INSERT
> INTO ... VALUES ..." would need additional Table AM functions for
> dividing up the INSERT work amongst the workers (currently only exists
> for scans).

Hm, not entirely following. What precisely are you thinking of here?

I doubt it's really worth adding parallelism support for INSERT ...
VALUES; the cost of spawning workers will almost always be higher than
the benefit.

> @@ -116,7 +117,7 @@ toast_save_datum(Relation rel, Datum value,
>  	TupleDesc	toasttupDesc;
>  	Datum		t_values[3];
>  	bool		t_isnull[3];
> -	CommandId	mycid = GetCurrentCommandId(true);
> +	CommandId	mycid = GetCurrentCommandId(!IsParallelWorker());
>  	struct varlena *result;
>  	struct varatt_external toast_pointer;
>  	union

Hm? Why do we need this in the various places you have made this
change?
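The constraint behind the `GetCurrentCommandId(!IsParallelWorker())` change can be sketched as a toy model (this is illustrative C, not PostgreSQL source; all names are invented): the leader serializes its current command id when it launches workers, so a worker may read that value but must never mark it "used" or advance it, since the leader and the other workers would not see the bump.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for PostgreSQL's transaction state. */
typedef unsigned int CommandId;

typedef struct TxnState
{
	CommandId	currentCommandId;	/* snapshot synchronized from leader */
	bool		isParallelWorker;
} TxnState;

/*
 * Mirrors the shape of GetCurrentCommandId(used): requesting a "used"
 * command id from inside a worker is the invalid case that the quoted
 * hunk sidesteps by passing !IsParallelWorker().  Instead of raising
 * an ERROR, this toy version just reports whether the call is legal.
 */
static bool
get_command_id(TxnState *s, bool used, CommandId *cid)
{
	if (used && s->isParallelWorker)
		return false;	/* would have to mark the cid used: forbidden */
	*cid = s->currentCommandId;
	return true;
}
```

Under this model, a read-only lookup succeeds in both leader and worker, while a "used" request only succeeds in the leader — which is exactly the asymmetry the patch encodes at each call site.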
> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> index 1585861..94c8507 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -2049,11 +2049,6 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
>  	 * inserts in general except for the cases where inserts generate a new
>  	 * CommandId (eg. inserts into a table having a foreign key column).
>  	 */
> -	if (IsParallelWorker())
> -		ereport(ERROR,
> -				(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
> -				 errmsg("cannot insert tuples in a parallel worker")));
> -

I'm afraid that this weakens our checks more than I'd like. What if
this ends up being invoked from inside C code?

> @@ -822,19 +822,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
>  				isdead = false;
>  				break;
>  			case HEAPTUPLE_INSERT_IN_PROGRESS:
> -
>  				/*
>  				 * Since we hold exclusive lock on the relation, normally the
>  				 * only way to see this is if it was inserted earlier in our
>  				 * own transaction. However, it can happen in system
>  				 * catalogs, since we tend to release write lock before commit
> -				 * there. Give a warning if neither case applies; but in any
> -				 * case we had better copy it.
> +				 * there. In any case we had better copy it.
>  				 */
> -				if (!is_system_catalog &&
> -					!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
> -					elog(WARNING, "concurrent insert in progress within table \"%s\"",
> -						 RelationGetRelationName(OldHeap));
> +
>  				/* treat as live */
>  				isdead = false;
>  				break;
> @@ -1434,16 +1429,11 @@ heapam_index_build_range_scan(Relation heapRelation,
>  					 * the only way to see this is if it was inserted earlier
>  					 * in our own transaction. However, it can happen in
>  					 * system catalogs, since we tend to release write lock
> -					 * before commit there. Give a warning if neither case
> -					 * applies.
> +					 * before commit there.
>  					 */
>  					xwait = HeapTupleHeaderGetXmin(heapTuple->t_data);
>  					if (!TransactionIdIsCurrentTransactionId(xwait))
>  					{
> -						if (!is_system_catalog)
> -							elog(WARNING, "concurrent insert in progress within table \"%s\"",
> -								 RelationGetRelationName(heapRelation));
> -
>  						/*
>  						 * If we are performing uniqueness checks, indexing
>  						 * such a tuple could lead to a bogus uniqueness

Huh, I don't think this should be necessary?

> diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
> index a4944fa..9d3f100 100644
> --- a/src/backend/access/transam/varsup.c
> +++ b/src/backend/access/transam/varsup.c
> @@ -53,13 +53,6 @@ GetNewTransactionId(bool isSubXact)
>  	TransactionId xid;
>
>  	/*
> -	 * Workers synchronize transaction state at the beginning of each parallel
> -	 * operation, so we can't account for new XIDs after that point.
> -	 */
> -	if (IsInParallelMode())
> -		elog(ERROR, "cannot assign TransactionIds during a parallel operation");
> -
> -	/*
>  	 * During bootstrap initialization, we return the special bootstrap
>  	 * transaction id.
>  	 */

Same thing, this code cannot just be allowed to be reachable. What
prevents you from assigning two different xids from different workers,
etc.?

> @@ -577,13 +608,6 @@ AssignTransactionId(TransactionState s)
>  	Assert(s->state == TRANS_INPROGRESS);
>
>  	/*
> -	 * Workers synchronize transaction state at the beginning of each parallel
> -	 * operation, so we can't account for new XIDs at this point.
> -	 */
> -	if (IsInParallelMode() || IsParallelWorker())
> -		elog(ERROR, "cannot assign XIDs during a parallel operation");
> -
> -	/*
>  	 * Ensure parent(s) have XIDs, so that a child always has an XID later
>  	 * than its parent. Mustn't recurse here, or we might get a stack
>  	 * overflow if we're at the bottom of a huge stack of subtransactions

Ditto.

Greetings,

Andres Freund
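The "two different xids from different workers" hazard raised above can be made concrete with a toy model (illustrative C only, not PostgreSQL source; all names are invented): if the guard in `AssignTransactionId` is simply deleted, each worker that lazily assigns an xid bumps the shared counter independently, so one logical transaction ends up owning two distinct xids.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; nextXid models the shared xid counter. */
typedef unsigned int TransactionId;

static TransactionId nextXid = 100;

typedef struct Worker
{
	bool		inParallelMode;
	TransactionId assignedXid;	/* 0 = no xid assigned yet */
} Worker;

/*
 * With the guard enabled this refuses xid assignment inside a parallel
 * operation (modeling the deleted elog(ERROR, ...)); with it disabled,
 * it blindly takes the next value from the shared counter.
 */
static bool
assign_xid(Worker *w, bool guard_enabled)
{
	if (guard_enabled && w->inParallelMode)
		return false;
	w->assignedXid = nextXid++;
	return true;
}
```

With the guard off, two parallel workers both succeed and receive different xids for what should be one transaction; with the guard on, the assignment is rejected before any damage is done, which is why the review insists the check must stay reachable.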