Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row - Mailing list pgsql-hackers
From | torikoshia |
---|---|
Subject | Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row |
Date | |
Msg-id | 1462d79784b2475f1c714c65a6f25652@oss.nttdata.com Whole thread Raw |
In response to | Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row (Yugo NAGATA <nagata@sraoss.co.jp>) |
List | pgsql-hackers |
On 2024-11-13 22:02, Yugo NAGATA wrote: > On Tue, 12 Nov 2024 17:38:25 +0900 > torikoshia <torikoshia@oss.nttdata.com> wrote: > >> On 2024-11-12 14:17, Yugo Nagata wrote: >> > On Tue, 12 Nov 2024 14:03:50 +0900 >> > Yugo Nagata <nagata@sraoss.co.jp> wrote: >> > >> >> On Tue, 12 Nov 2024 01:27:53 +0500 >> >> Kirill Reshke <reshkekirill@gmail.com> wrote: >> >> >> >> > On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote: >> >> > > >> >> > > On 2024-11-09 21:55, Kirill Reshke wrote: >> >> > > >> >> > > Thanks for working on this! >> >> > >> >> > Thanks for reviewing the v7 patch series! >> >> > >> >> > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com> >> >> > > > wrote: >> >> > > >> >> >> > > >> >> >> > > >> >> >> > > >> On 2024/10/26 6:03, Kirill Reshke wrote: >> >> > > >> > when the REJECT LIMIT is set to some non-zero number and the number of >> >> > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because >> >> > > >> > there WAS errors, and we should not tolerate more than $limit errors . >> >> > > >> > I do find this behavior to be consistent. >> >> > > >> >> >> > > >> +1 >> >> > > >> >> >> > > >> >> >> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all >> >> > > >> > replacements, as if REJECT LIMIT is inf. >> >> > > >> >> >> > > >> +1 >> >> > > > >> >> > > > After thinking for a while, I'm now more opposed to this approach. I >> >> > > > think we should count rows with erroneous data as errors only if >> >> > > > null substitution for these rows failed, not the total number of rows >> >> > > > which were modified. >> >> > > > Then, to respect the REJECT LIMIT option, we compare this number with >> >> > > > the limit. This is actually simpler approach IMHO. What do You think? >> >> > > >> >> > > IMHO I prefer the previous interpretation. >> >> > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT >> >> > > is used to specify how many malformed rows are acceptable in the >> >> > > "original" data source. >> >> >> >> I also prefer the previous version. >> >> >> >> > I do like the first version of interpretation, but I have a struggle >> >> > with it. According to this interpretation, we will fail COPY command >> >> > if the number >> >> > of malformed data rows exceeds the limit, not the number of rejected >> >> > rows (some percentage of malformed rows are accepted with nulnot-null constraintl >> >> > substitution) >> >> I feel your concern is valid. >> Currently 'reject' can occur only when converting a column's input >> value >> to its data type, but if we introduce set_to_null option 'reject' also >> occurs when inserting null, i.e. not null constraint. > > I can suppose "reject" means failure of COPY command here, that is, a > reject > of executing the command, not an error of data input. If so, we can > interpret > REJECT_LIMIT as "the number of malformed rows allowed before the COPY > command > is REJECTed" (not the number of rejected rows). In this case, I think > we don't > have to rename the option name. > > Of course, if there is more proper name that makes it easy for users to > understand the behaviour of the option, renaming should be nice. > >> >> The documentation says that REJECT_LIMIT "Specifies the maximum number >> >> of errors", >> >> and there are no wording "reject" in the description, so I wonder it >> >> is unclear >> >> what means in "REJECT" in REJECT_LIMIT. It may be proper to use >> >> ERROR_LIMIT >> >> since it is supposed to be used with ON_ERROR. >> >> >> >> Alternatively, if we emphasize that errors are handled other than >> >> terminating >> >> the command,perhaps MALFORMED_LIMIT as proposed above or >> >> TOLERANCE_LIMIT may be >> >> good, for example. >> > >> > I might misunderstand the meaning of the name. If REJECT_LIMIT means "a >> > limit on >> > the number of rows with any malformed value allowed before the COPY >> > command is >> > rejected", we would not have to rename it. >> >> The meaning of REJECT_LIMIT is what you described, and I think Kirill >> worries about cases when malformed rows are accepted(=not REJECTed) >> with >> null substitution. REJECT_LIMIT counts this case as REJECTed. > > I am a bit confused. Me too:) Let me explain my understanding. I believe there are now two candidates that count as REJECT_LIMIT number: (1) error converting a column's input value into its data type(soft error) (2) NULL substitution failure(this comes from the proposed patch) And I understood Kirill's idea to be the following: 1st idea: REJECT_LIMIT counts (1) 2nd idea: REJECT_LIMIT counts (2) And I've agreed with the 1st one. > You mean "REJECT" is raising a soft error of data > input here instead of terminating COPY? Yes. -- Regards, -- Atsushi Torikoshi Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
pgsql-hackers by date: