Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row - Mailing list pgsql-hackers

From torikoshia
Subject Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row
Date
Msg-id 1462d79784b2475f1c714c65a6f25652@oss.nttdata.com
Whole thread Raw
In response to Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row  (Yugo NAGATA <nagata@sraoss.co.jp>)
List pgsql-hackers
On 2024-11-13 22:02, Yugo NAGATA wrote:
> On Tue, 12 Nov 2024 17:38:25 +0900
> torikoshia <torikoshia@oss.nttdata.com> wrote:
> 
>> On 2024-11-12 14:17, Yugo Nagata wrote:
>> > On Tue, 12 Nov 2024 14:03:50 +0900
>> > Yugo Nagata <nagata@sraoss.co.jp> wrote:
>> >
>> >> On Tue, 12 Nov 2024 01:27:53 +0500
>> >> Kirill Reshke <reshkekirill@gmail.com> wrote:
>> >>
>> >> > On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia@oss.nttdata.com> wrote:
>> >> > >
>> >> > > On 2024-11-09 21:55, Kirill Reshke wrote:
>> >> > >
>> >> > > Thanks for working on this!
>> >> >
>> >> > Thanks for reviewing the v7 patch series!
>> >> >
>> >> > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao.fujii@oss.nttdata.com>
>> >> > > > wrote:
>> >> > > >>
>> >> > > >>
>> >> > > >>
>> >> > > >> On 2024/10/26 6:03, Kirill Reshke wrote:
>> >> > > >> > when the REJECT LIMIT is set to some non-zero number and the number of
>> >> > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because
>> >> > > >> > there WAS errors, and we should not tolerate more than $limit errors .
>> >> > > >> > I do find this behavior to be consistent.
>> >> > > >>
>> >> > > >> +1
>> >> > > >>
>> >> > > >>
>> >> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all
>> >> > > >> > replacements, as if REJECT LIMIT is inf.
>> >> > > >>
>> >> > > >> +1
>> >> > > >
>> >> > > > After thinking for a while, I'm now more opposed to this approach. I
>> >> > > > think we should count rows with erroneous data as errors only if
>> >> > > > null substitution for these rows failed, not the total number of rows
>> >> > > > which were modified.
>> >> > > > Then, to respect the REJECT LIMIT option, we compare this number with
>> >> > > > the limit. This is actually simpler approach IMHO. What do You think?
>> >> > >
>> >> > > IMHO I prefer the previous interpretation.
>> >> > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT
>> >> > > is used to specify how many malformed rows are acceptable in the
>> >> > > "original" data source.
>> >>
>> >> I also prefer the previous version.
>> >>
>> >> > I do like the first version of interpretation, but I have a struggle
>> >> > with it. According to this interpretation, we will fail COPY command
>> >> > if the number
>> >> > of malformed data rows exceeds the limit, not the number of rejected
>> >> > rows (some percentage of malformed rows are accepted with nulnot-null constraintl
>> >> > substitution)
>> 
>> I feel your concern is valid.
>> Currently 'reject' can occur only when converting a column's input 
>> value
>> to its data type, but if we introduce set_to_null option 'reject' also
>> occurs when inserting null, i.e. not null constraint.
> 
> I can suppose "reject" means failure of COPY command here, that is, a 
> reject
> of executing the command, not an error of data input. If so, we can 
> interpret
> REJECT_LIMIT as "the number of malformed rows allowed before the COPY 
> command
> is REJECTed" (not the number of rejected rows). In this case, I think 
> we don't
> have to rename the option name.
> 
> Of course, if there is more proper name that makes it easy for users to
> understand the behaviour of the option, renaming should be nice.
> 
>> >> The documentation says that REJECT_LIMIT "Specifies the maximum number
>> >> of errors",
>> >> and there are no wording "reject" in the description, so I wonder it
>> >> is unclear
>> >> what means in "REJECT" in REJECT_LIMIT. It may be proper to use
>> >> ERROR_LIMIT
>> >> since it is supposed to be used with ON_ERROR.
>> >>
>> >> Alternatively, if we emphasize that errors are handled other than
>> >> terminating
>> >> the command,perhaps MALFORMED_LIMIT as proposed above or
>> >> TOLERANCE_LIMIT may be
>> >> good, for example.
>> >
>> > I might misunderstand the meaning of the name. If REJECT_LIMIT means "a
>> > limit on
>> > the number of rows with any malformed value allowed before the COPY
>> > command is
>> > rejected", we would not have to rename it.
>> 
>> The meaning of REJECT_LIMIT is what you described, and I think Kirill
>> worries about cases when malformed rows are accepted(=not REJECTed) 
>> with
>> null substitution. REJECT_LIMIT counts this case as REJECTed.
> 
> I am a bit confused.

Me too:)
Let me explain my understanding.

I believe there are now two candidates that count as REJECT_LIMIT 
number:

   (1) error converting a column's input value into its data type(soft 
error)
   (2) NULL substitution failure(this comes from the proposed patch)

And I understood Kirill's idea to be the following:

   1st idea: REJECT_LIMIT counts (1)
   2nd idea: REJECT_LIMIT counts (2)

And I've agreed with the 1st one.

> You mean "REJECT" is raising a soft error of data
> input here instead of terminating COPY?

Yes.


-- 
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.



pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: UUID v7
Next
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: Improve the error message for logical replication of regular column to generated column.