Re: [HACKERS] GSOC'17 project introduction: Parallel COPY executionwith errors handling - Mailing list pgsql-hackers
From | Alexey Kondratov |
---|---|
Subject | Re: [HACKERS] GSOC'17 project introduction: Parallel COPY executionwith errors handling |
Date | |
Msg-id | 09534CC6-D594-4070-97BC-5AA93F107477@gmail.com Whole thread Raw |
In response to | Re: [HACKERS] GSOC'17 project introduction: Parallel COPY executionwith errors handling (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY executionwith errors handling
|
List | pgsql-hackers |
Thank you for your comments Peter, there are some points that I did not think about before.
On 9 Jun 2017, at 01:09, Peter Geoghegan <pg@bowt.ie> wrote:Adding a full support of ON CONFLICT DO NOTHING/UPDATE to COPY seems
to be a large separated task and is out of the current project scope, but
maybe there is
a relatively simple way to somehow perform internally tuples insert with
ON CONFLICT DO NOTHING? I have added Peter Geoghegan to cc, as
I understand he is the major contributor of UPSERT in PostgreSQL. It would
be great
if he will answer this question.
I think that there is a way of making COPY use "speculative
insertion", so that it behaves the same as ON CONFLICT DO NOTHING with
no inference specification. Whether or not this is useful depends on a
lot of things.
I am not going to start with "speculative insertion" right now, but it would be very
useful, if you give me a point, where to start. Maybe I will at least try to evaluate
the complexity of the problem.
I think that you need to more formally identify what errors your new
COPY error handling will need to swallow.
...
My advice right now is: see if you can figure out a way of doing what
you want without subtransactions at all, possibly by cutting some
scope. For example, maybe it would be satisfactory to have the
implementation just ignore constraint violations, but still raise
errors for invalid input for types.
Initially I was thinking only about malformed rows, e.g. less or extra columns.
Honestly, I did not know that there are so many levels and ways where error
can occur. So currently (and especially after your comments) I prefer to focus
only on the following list of errors:
1) File format issues
a. Less columns than needed
b. Extra columns
2) I am doubt about type mismatch. It is possible to imagine a situation when,
e.g. some integers are exported as int, and some as "int", but I am not sure
that is is a common situation.
3) Some constraint violations, e.g. unique index.
First appeared to be easy achievable without subtransactions. I have created a
proof of concept version of copy, where the errors handling is turned on by default.
Please, see small patch attached (applicable to 76b11e8a43eca4612dfccfe7f3ebd293fb8a46ec)
or GUI version on GitHub https://github.com/ololobus/postgres/pull/1/files.
It throws warnings instead of errors for malformed lines with less/extra columns
and reports line number.
Second is probably achievable without subtransactions via the PG_TRY/PG_CATCH
around heap_form_tuple, since it is not yet inserted into the heap.
But third is questionable without subtransactions, since even if we check
constraints once, there maybe various before/after triggers which can modify
tuple, so it will not satisfy them. Corresponding comment inside copy.c states:
"Note that a BR trigger might modify tuple such that the partition constraint is
no satisfied, so we need to check in that case." Thus, there are maybe different
situations here, as I understand. However, it a point where "speculative insertion"
is able to help.
These three cases should cover most real-life scenarios.
Is there really much value in ignoring errors due to invalid encoding?
Now, I have some doubts about it too. If there is an encoding problem,
it is probably about the whole file, not only a few rows.
Alexey
Attachment
pgsql-hackers by date: