Thread: Add new COPY option REJECT_LIMIT

Add new COPY option REJECT_LIMIT

From
torikoshia
Date:
Hi,

9e2d870 enabled the COPY command to skip soft error, and I think we can 
add another option which specifies the maximum tolerable number of soft 
errors.

I remember this was discussed in [1], and feel it would be useful when 
loading 'dirty' data but there is a limit to how dirty it can be.

Attached a patch for this.

What do you think?


[1] 
https://www.postgresql.org/message-id/752672.1699474336%40sss.pgh.pa.us


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachment

Re: Add new COPY option REJECT_LIMIT

From
"David G. Johnston"
Date:
On Fri, Jan 26, 2024 at 2:49 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
Hi,

9e2d870 enabled the COPY command to skip soft error, and I think we can
add another option which specifies the maximum tolerable number of soft
errors.

I remember this was discussed in [1], and feel it would be useful when
loading 'dirty' data but there is a limit to how dirty it can be.

Attached a patch for this.

What do you think?


I'm opposed to adding this particular feature.

When implementing this kind of business rule I'd need the option to specify a percentage, not just an absolute value.

I would focus on trying to put the data required to make this kind of determination into a place where applications implementing such business rules and monitoring can readily get at it.  The "ERRORS TO" and maybe a corresponding "STATS TO" option where a table can be specified for the system to place the problematic data and stats about the copy itself.

David J.

Re: Add new COPY option REJECT_LIMIT

From
torikoshia
Date:
On 2024-01-27 00:20, David G. Johnston wrote:

Thanks for your comments!

> On Fri, Jan 26, 2024 at 2:49 AM torikoshia
> <torikoshia@oss.nttdata.com> wrote:
> 
>> Hi,
>> 
>> 9e2d870 enabled the COPY command to skip soft error, and I think we
>> can
>> add another option which specifies the maximum tolerable number of
>> soft
>> errors.
>> 
>> I remember this was discussed in [1], and feel it would be useful
>> when
>> loading 'dirty' data but there is a limit to how dirty it can be.
>> 
>> Attached a patch for this.
>> 
>> What do you think?
> 
> I'm opposed to adding this particular feature.
> 
> When implementing this kind of business rule I'd need the option to
> specify a percentage, not just an absolute value.

Yeah, it seems useful for some cases.
Actually, Greenplum enables to specify not only the max number of bad 
rows but also its percentage[1].

I may be wrong, but considering some dataloaders support something like 
reject_limit(Redshift supports MAXERROR[2], pg_bulkload supports 
PARSE_ERRORS[3]), specifying the "number" of the bad row might also be 
useful.

I think we can implement reject_limit specified by percentage simply 
calculating the ratio of skipped and processed at the end of CopyFrom() 
like this:

   if (cstate->opts.reject_limit > 0 && (double) skipped / (processed + 
skipped) > cstate->opts.reject_limit_percent)
                   ereport(ERROR,
                                   
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
                                    errmsg("exceeded the ratio specified 
by ..


> I would focus on trying to put the data required to make this kind of
> determination into a place where applications implementing such
> business rules and monitoring can readily get at it.  The "ERRORS TO"
> and maybe a corresponding "STATS TO" option where a table can be
> specified for the system to place the problematic data and stats about
> the copy itself.

It'd be nice to have such informative tables, but I believe the benefit 
of reject_limit is it fails the entire loading when the threshold is 
exceeded.
I imagine when we just have error and stats information tables for COPY, 
users have to delete the rows when they confirmed too many errors in 
these tables.



[1]https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/admin_guide-load-topics-g-handling-load-errors.html
[2]https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-load.html
[3]https://ossc-db.github.io/pg_bulkload/pg_bulkload.html

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: Add new COPY option REJECT_LIMIT

From
Fujii Masao
Date:

On 2024/01/26 18:49, torikoshia wrote:
> Hi,
> 
> 9e2d870 enabled the COPY command to skip soft error, and I think we can add another option which specifies the
maximumtolerable number of soft errors.
 
> 
> I remember this was discussed in [1], and feel it would be useful when loading 'dirty' data but there is a limit to
howdirty it can be.
 
> 
> Attached a patch for this.
> 
> What do you think?

The patch no longer applies cleanly to HEAD. Could you update it?

I think the REJECT_LIMIT feature is useful. Allowing it to be set as either the absolute number of skipped rows or a
percentageof the total input rows is a good idea.
 

However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR option is still necessary. REJECT_LIMIT seems to
coverthe same cases. For instance, REJECT_LIMIT=infinity can act like ON_ERROR=ignore, and REJECT_LIMIT=0 can act like
ON_ERROR=stop.

Therefore, having both ON_ERROR and REJECT_LIMIT might be confusing.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Add new COPY option REJECT_LIMIT

From
torikoshia
Date:
On 2024-07-03 02:07, Fujii Masao wrote:
Thanks for your comments!

> On 2024/01/26 18:49, torikoshia wrote:
>> Hi,
>> 
>> 9e2d870 enabled the COPY command to skip soft error, and I think we 
>> can add another option which specifies the maximum tolerable number of 
>> soft errors.
>> 
>> I remember this was discussed in [1], and feel it would be useful when 
>> loading 'dirty' data but there is a limit to how dirty it can be.
>> 
>> Attached a patch for this.
>> 
>> What do you think?
> 
> The patch no longer applies cleanly to HEAD. Could you update it?

I'm going to update it after discussing the option format as described 
below.

> 
> I think the REJECT_LIMIT feature is useful. Allowing it to be set as
> either the absolute number of skipped rows or a percentage of the
> total input rows is a good idea.
> 
> However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR
> option is still necessary. REJECT_LIMIT seems to cover the same cases.
> For instance, REJECT_LIMIT=infinity can act like ON_ERROR=ignore, and
> REJECT_LIMIT=0 can act like ON_ERROR=stop.

I agree that it's possible to use only REJECT_LIMIT without ON_ERROR.
I also think it's easy to understand that REJECT_LIMIT=0 is 
ON_ERROR=stop.
However, expressing REJECT_LIMIT='infinity' needs some definition like 
"setting REJECT_LIMIT to -1 means 'infinity'", doesn't it? If so, I 
think this might not so intuitive.

Also, since it seems Snowflake and Redshift have both options equivalent 
to REJECT_LIMIT and ON_ERROR, having both of them in PostgreSQL COPY 
might not be surprising:
- Snowflake's ON_ERROR accepts "CONTINUE | SKIP_FILE | SKIP_FILE_num | 
'SKIP_FILE_num%' | ABORT_STATEMENT"[1]
- Redshift has MAXERROR and IGNOREALLERRORS options[2]

BTW after seeing Snowflake makes SKIP_FILE_num one of the options of 
ON_ERROR,  I'm a bit wondering whether REJECT_LIMIT also should be the 
same.


[1] 
https://docs.snowflake.com/en/sql-reference/sql/copy-into-table#copy-options-copyoptions
[2] 
https://docs.aws.amazon.com/en_en/redshift/latest/dg/copy-parameters-data-load.html

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: Add new COPY option REJECT_LIMIT

From
Fujii Masao
Date:

On 2024/07/04 12:05, torikoshia wrote:
> I'm going to update it after discussing the option format as described below.

Thanks!

> I agree that it's possible to use only REJECT_LIMIT without ON_ERROR.
> I also think it's easy to understand that REJECT_LIMIT=0 is ON_ERROR=stop.
> However, expressing REJECT_LIMIT='infinity' needs some definition like "setting REJECT_LIMIT to -1 means 'infinity'",
doesn'tit? If so, I think this might not so intuitive.
 

How about allowing REJECT_LIMIT to accept the keywords "infinity", "unlimited",
or "all" in addition to a number? This way, users can specify one of these
keywords instead of -1 to ignore all errors. The server code would then
internally set the REJECT_LIMIT to -1 or another appropriate value when
these keywords are used, but users wouldn't need to worry about this detail.

If we choose "all" as the keyword, renaming the option to IGNORE_ERRORS
might be more intuitive and easier to understand than REJECT_LIMIT.


> Also, since it seems Snowflake and Redshift have both options equivalent to REJECT_LIMIT and ON_ERROR, having both of
themin PostgreSQL COPY might not be surprising:
 
> - Snowflake's ON_ERROR accepts "CONTINUE | SKIP_FILE | SKIP_FILE_num | 'SKIP_FILE_num%' | ABORT_STATEMENT"[1]
> - Redshift has MAXERROR and IGNOREALLERRORS options[2]

Ok, so here's a summary of the options and their behaviors:

To ignore all errors and continue to the end:

- Snowflake: ON_ERROR=CONTINUE
- Redshift: IGNOREALLERRORS
- Postgres (with your patch): ON_ERROR=ignore
- Postgres (with my idea): IGNORE_ERRORS=all

To fail as soon as an error is found:

- Snowflake: ON_ERROR=ABORT_STATEMENT (default) / SKIP_FILE
- Redshift: MAXERROR=0 (default)
- Postgres (with your patch): ON_ERROR=stop (default)
- Postgres (with my idea): IGNORE_ERRORS=0 (default)

To fail when NNN or more errors are found:

- Snowflake: ON_ERROR=SKIP_FILE_NNN
- Redshift: MAXERROR=NNN
- Postgres (with your patch): REJECT_LIMIT=NNN-1 and ON_ERROR=ignore
- Postgres (with my idea): IGNORE_ERRORS=NNN

This makes me think it might be better to treat REJECT_LIMIT as
an additional option for ON_ERROR=stop instead of ON_ERROR=ignore
if we adopt your patch. Since ON_ERROR=stop is the default,
users could set the maximum number of allowed errors by specifying
only REJECT_LIMIT. Otherwise, they would need to specify both
ON_ERROR=ignore and REJECT_LIMIT.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Add new COPY option REJECT_LIMIT

From
torikoshia
Date:
On 2024-07-05 12:59, Fujii Masao wrote:
> On 2024/07/04 12:05, torikoshia wrote:
>> I'm going to update it after discussing the option format as described 
>> below.
> 
> Thanks!
> 
>> I agree that it's possible to use only REJECT_LIMIT without ON_ERROR.
>> I also think it's easy to understand that REJECT_LIMIT=0 is 
>> ON_ERROR=stop.
>> However, expressing REJECT_LIMIT='infinity' needs some definition like 
>> "setting REJECT_LIMIT to -1 means 'infinity'", doesn't it? If so, I 
>> think this might not so intuitive.
> 
> How about allowing REJECT_LIMIT to accept the keywords "infinity", 
> "unlimited",
> or "all" in addition to a number? This way, users can specify one of 
> these
> keywords instead of -1 to ignore all errors. The server code would then
> internally set the REJECT_LIMIT to -1 or another appropriate value when
> these keywords are used, but users wouldn't need to worry about this 
> detail.

Agreed.

> If we choose "all" as the keyword, renaming the option to IGNORE_ERRORS
> might be more intuitive and easier to understand than REJECT_LIMIT.

I feel that 'infinite' and 'unlimited' are unfamiliar values for 
PostgreSQL parameters, so 'all' might be better and IGNORE_ERRORS would 
be a better parameter name as your suggestion.

>> Also, since it seems Snowflake and Redshift have both options 
>> equivalent to REJECT_LIMIT and ON_ERROR, having both of them in 
>> PostgreSQL COPY might not be surprising:
>> - Snowflake's ON_ERROR accepts "CONTINUE | SKIP_FILE | SKIP_FILE_num | 
>> 'SKIP_FILE_num%' | ABORT_STATEMENT"[1]
>> - Redshift has MAXERROR and IGNOREALLERRORS options[2]
> 
> Ok, so here's a summary of the options and their behaviors:
> 
> To ignore all errors and continue to the end:
> 
> - Snowflake: ON_ERROR=CONTINUE
> - Redshift: IGNOREALLERRORS
> - Postgres (with your patch): ON_ERROR=ignore
> - Postgres (with my idea): IGNORE_ERRORS=all
> 
> To fail as soon as an error is found:
> 
> - Snowflake: ON_ERROR=ABORT_STATEMENT (default) / SKIP_FILE
> - Redshift: MAXERROR=0 (default)
> - Postgres (with your patch): ON_ERROR=stop (default)
> - Postgres (with my idea): IGNORE_ERRORS=0 (default)
> 
> To fail when NNN or more errors are found:
> 
> - Snowflake: ON_ERROR=SKIP_FILE_NNN
> - Redshift: MAXERROR=NNN
> - Postgres (with your patch): REJECT_LIMIT=NNN-1 and ON_ERROR=ignore
> - Postgres (with my idea): IGNORE_ERRORS=NNN

Thanks for the summary.

> This makes me think it might be better to treat REJECT_LIMIT as
> an additional option for ON_ERROR=stop instead of ON_ERROR=ignore
> if we adopt your patch. Since ON_ERROR=stop is the default,
> users could set the maximum number of allowed errors by specifying
> only REJECT_LIMIT. Otherwise, they would need to specify both
> ON_ERROR=ignore and REJECT_LIMIT.

That makes sense.

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: Add new COPY option REJECT_LIMIT

From
torikoshia
Date:
On 2024-07-03 02:07, Fujii Masao wrote:
> However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR 
> option is still necessary.

I remembered another reason for the necessity of ON_ERROR.

ON_ERROR defines how to behave when encountering an error and it just 
accepts 'ignore' and 'stop' currently, but is expected to support other 
options such as saving details of errors to a table[1].
ON_ERROR=stop is a synonym for REJECT_LIMIT=infinity, but I imagine 
REJECT_LIMIT would not replace future options of ON_ERROR.

Considering this and the option we want to add this time is to specify 
an upper limit on the number or ratio of errors, the name of this option 
like "reject_limit" seems better than "ignore_errors".

On Fri, Jul 5, 2024 at 4:13 PM torikoshia <torikoshia@oss.nttdata.com> 
wrote:
> On 2024-07-05 12:59, Fujii Masao wrote:
>> On 2024/07/04 12:05, torikoshia wrote:
>>> I'm going to update it after discussing the option format as 
>>> described
>>> below.

Updated the patch.
0001 sets limit by the absolute number of error rows and 0002 sets limit 
by ratio of the error.

>> If we choose "all" as the keyword, renaming the option to 
>> IGNORE_ERRORS
>> might be more intuitive and easier to understand than REJECT_LIMIT.

> I feel that 'infinite' and 'unlimited' are unfamiliar values for
> PostgreSQL parameters, so 'all' might be better and IGNORE_ERRORS would
> be a better parameter name as your suggestion.

As described above, attached patch adopts REJECT_LIMIT, so it uses 
"infinity".

>> This makes me think it might be better to treat REJECT_LIMIT as
>> an additional option for ON_ERROR=stop instead of ON_ERROR=ignore
>> if we adopt your patch. Since ON_ERROR=stop is the default,
>> users could set the maximum number of allowed errors by specifying
>> only REJECT_LIMIT. Otherwise, they would need to specify both
>> ON_ERROR=ignore and REJECT_LIMIT.

> That makes sense.

On my second thought, whatever value ON_ERROR is specified(e.g. ignore, 
stop, table), it seems fine to use REJECT_LIMIT.
I feel REJECT_LIMIT has both "ignore" and "stop" characteristics, 
meaning it ignores errors until it reaches REJECT_LIMIT and stops when 
it exceeds the REJECT_LIMIT.
And REJECT_LIMIT seems orthogonal to 'table',  which specifies where to 
save error details.

Attached patch allows using REJECT_LIMIT regardless of the ON_ERROR 
option value.


[1] 
https://www.postgresql.org/message-id/flat/CACJufxH_OJpVra=0c4ow8fbxHj7heMcVaTNEPa5vAurSeNA-6Q@mail.gmail.com

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachment

Re: Add new COPY option REJECT_LIMIT

From
Fujii Masao
Date:

On 2024/07/17 22:21, torikoshia wrote:
> On 2024-07-03 02:07, Fujii Masao wrote:
>> However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR option is still necessary.
> 
> I remembered another reason for the necessity of ON_ERROR.
> 
> ON_ERROR defines how to behave when encountering an error and it just accepts 'ignore' and 'stop' currently, but is
expectedto support other options such as saving details of errors to a table[1].
 

Wouldn't it be better to separate the option specifying where
error details are output from the ON_ERROR option
(which determines behavior when encountering errors)?
"table" seems valid for both ON_ERROR=ignore and ON_ERROR=stop.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Add new COPY option REJECT_LIMIT

From
Junwang Zhao
Date:
Hi Torikoshia,

On Wed, Jul 17, 2024 at 9:21 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> On 2024-07-03 02:07, Fujii Masao wrote:
> > However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR
> > option is still necessary.
>
> I remembered another reason for the necessity of ON_ERROR.
>
> ON_ERROR defines how to behave when encountering an error and it just
> accepts 'ignore' and 'stop' currently, but is expected to support other
> options such as saving details of errors to a table[1].
> ON_ERROR=stop is a synonym for REJECT_LIMIT=infinity, but I imagine
> REJECT_LIMIT would not replace future options of ON_ERROR.
>
> Considering this and the option we want to add this time is to specify
> an upper limit on the number or ratio of errors, the name of this option
> like "reject_limit" seems better than "ignore_errors".
>
> On Fri, Jul 5, 2024 at 4:13 PM torikoshia <torikoshia@oss.nttdata.com>
> wrote:
> > On 2024-07-05 12:59, Fujii Masao wrote:
> >> On 2024/07/04 12:05, torikoshia wrote:
> >>> I'm going to update it after discussing the option format as
> >>> described
> >>> below.
>
> Updated the patch.
> 0001 sets limit by the absolute number of error rows and 0002 sets limit
> by ratio of the error.

In patch 0002, the ratio is calculated by the already skipped/processed
rows, but what if a user wants to copy 1000 rows, and he/she can tolerate
10 error rows, so he/she might set *reject_limit 0.01*, but one bad row in the
first 100 rows will fail the entire command, this might surprise the user.

This case can be resolved by 0001 *reject_limit 10*, so I think the *by ratio*
is less useful.

>
> >> If we choose "all" as the keyword, renaming the option to
> >> IGNORE_ERRORS
> >> might be more intuitive and easier to understand than REJECT_LIMIT.
>
> > I feel that 'infinite' and 'unlimited' are unfamiliar values for
> > PostgreSQL parameters, so 'all' might be better and IGNORE_ERRORS would
> > be a better parameter name as your suggestion.
>
> As described above, attached patch adopts REJECT_LIMIT, so it uses
> "infinity".
>
> >> This makes me think it might be better to treat REJECT_LIMIT as
> >> an additional option for ON_ERROR=stop instead of ON_ERROR=ignore
> >> if we adopt your patch. Since ON_ERROR=stop is the default,
> >> users could set the maximum number of allowed errors by specifying
> >> only REJECT_LIMIT. Otherwise, they would need to specify both
> >> ON_ERROR=ignore and REJECT_LIMIT.
>
> > That makes sense.
>
> On my second thought, whatever value ON_ERROR is specified(e.g. ignore,
> stop, table), it seems fine to use REJECT_LIMIT.
> I feel REJECT_LIMIT has both "ignore" and "stop" characteristics,
> meaning it ignores errors until it reaches REJECT_LIMIT and stops when
> it exceeds the REJECT_LIMIT.
> And REJECT_LIMIT seems orthogonal to 'table',  which specifies where to
> save error details.
>
> Attached patch allows using REJECT_LIMIT regardless of the ON_ERROR
> option value.
>
>
> [1]
> https://www.postgresql.org/message-id/flat/CACJufxH_OJpVra=0c4ow8fbxHj7heMcVaTNEPa5vAurSeNA-6Q@mail.gmail.com
>
> --
> Regards,
>
> --
> Atsushi Torikoshi
> NTT DATA Group Corporation



--
Regards
Junwang Zhao



Re: Add new COPY option REJECT_LIMIT

From
Fujii Masao
Date:

On 2024/07/19 22:03, Fujii Masao wrote:
> 
> 
> On 2024/07/17 22:21, torikoshia wrote:
>> On 2024-07-03 02:07, Fujii Masao wrote:
>>> However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR option is still necessary.
>>
>> I remembered another reason for the necessity of ON_ERROR.
>>
>> ON_ERROR defines how to behave when encountering an error and it just accepts 'ignore' and 'stop' currently, but is
expectedto support other options such as saving details of errors to a table[1].
 
> 
> Wouldn't it be better to separate the option specifying where
> error details are output from the ON_ERROR option
> (which determines behavior when encountering errors)?
> "table" seems valid for both ON_ERROR=ignore and ON_ERROR=stop.

I still find it odd to accept "table" as a value for ON_ERROR. However,
"set_to_null" or "replace-column" proposed in [1] seem valid for
ON_ERROR. So, I'm okay with keeping the ON_ERROR option.


> On my second thought, whatever value ON_ERROR is specified(e.g. ignore, stop, table), it seems fine to use
REJECT_LIMIT.
> I feel REJECT_LIMIT has both "ignore" and "stop" characteristics, meaning it ignores errors until it reaches
REJECT_LIMITand stops when it exceeds the REJECT_LIMIT.
 

ON_ERROR specifies how to handle errors, and "stop" means to fail
the command. So, if ON_ERROR=stop, REJECT_LIMIT should have no effect,
and the command should fail immediately upon encountering an error.

As in your original proposal, I now think REJECT_LIMIT should only
apply when ON_ERROR=ignore. The command would ignore errors and
continue processing, but if the number of errors exceeds REJECT_LIMIT,
the command should fail. Thought?

BTW if "set_to_null" is supported someday, REJECT_LIMIT can also
apply. The command would cinsert NULL into the target table upon
encountering errors and continue, but fail if the number of errors
exceed REJECT_LIMIT.

Regards,

[1] https://www.postgresql.org/message-id/flat/CAKFQuwawy1e6YR4S=j+y7pXqg_Dw1WBVrgvf=BP3d1_aSfe_+Q@mail.gmail.com

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Add new COPY option REJECT_LIMIT

From
torikoshia
Date:
On Fri, Jul 19, 2024 at 11:48 PM Junwang Zhao <zhjwpku@gmail.com> wrote:
Thanks for the comment.

> In patch 0002, the ratio is calculated by the already skipped/processed
> rows, but what if a user wants to copy 1000 rows, and he/she can 
> tolerate
> 10 error rows, so he/she might set *reject_limit 0.01*, but one bad row 
> in the
> first 100 rows will fail the entire command, this might surprise the 
> user.

Since the ratio is calculated after all data is processed, the case "one 
bad row in the first 100 rows will fail the entire command" doesn't 
happen:


   =# \! wc -l 1000rows-with-10err.data
       1000 1000rows-with-10err.data

   =# COPY t1 from '1000rows-with-10err.data'  with (log_verbosity 
verbose, reject_limit 0.01);
   NOTICE:  skipping row due to data type incompatibility at line 10 for 
column i: "a"
   NOTICE:  skipping row due to data type incompatibility at line 11 for 
column i: "a"
   NOTICE:  skipping row due to data type incompatibility at line 12 for 
column i: "a"
   NOTICE:  skipping row due to data type incompatibility at line 13 for 
column i: "a"
   NOTICE:  skipping row due to data type incompatibility at line 14 for 
column i: "a"
   NOTICE:  skipping row due to data type incompatibility at line 15 for 
column i: "a"
   NOTICE:  skipping row due to data type incompatibility at line 16 for 
column i: "a"
   NOTICE:  skipping row due to data type incompatibility at line 17 for 
column i: "a"
   NOTICE:  skipping row due to data type incompatibility at line 18 for 
column i: "a"
   NOTICE:  skipping row due to data type incompatibility at line 19 for 
column i: "a"
   NOTICE:  10 rows were skipped due to data type incompatibility
   COPY 990


On 2024-07-20 02:08, Fujii Masao wrote:
> On 2024/07/19 22:03, Fujii Masao wrote:
>> 
>> 
>> On 2024/07/17 22:21, torikoshia wrote:
>>> On 2024-07-03 02:07, Fujii Masao wrote:
>>>> However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR 
>>>> option is still necessary.
>>> 
>>> I remembered another reason for the necessity of ON_ERROR.
>>> 
>>> ON_ERROR defines how to behave when encountering an error and it just 
>>> accepts 'ignore' and 'stop' currently, but is expected to support 
>>> other options such as saving details of errors to a table[1].
>> 
>> Wouldn't it be better to separate the option specifying where
>> error details are output from the ON_ERROR option
>> (which determines behavior when encountering errors)?
>> "table" seems valid for both ON_ERROR=ignore and ON_ERROR=stop.
> 
> I still find it odd to accept "table" as a value for ON_ERROR. However,
> "set_to_null" or "replace-column" proposed in [1] seem valid for
> ON_ERROR. So, I'm okay with keeping the ON_ERROR option.

Agreed.

>> On my second thought, whatever value ON_ERROR is specified(e.g. 
>> ignore, stop, table), it seems fine to use REJECT_LIMIT.
>> I feel REJECT_LIMIT has both "ignore" and "stop" characteristics, 
>> meaning it ignores errors until it reaches REJECT_LIMIT and stops when 
>> it exceeds the REJECT_LIMIT.
> 
> ON_ERROR specifies how to handle errors, and "stop" means to fail
> the command. So, if ON_ERROR=stop, REJECT_LIMIT should have no effect,
> and the command should fail immediately upon encountering an error.
> 
> As in your original proposal, I now think REJECT_LIMIT should only
> apply when ON_ERROR=ignore. The command would ignore errors and
> continue processing, but if the number of errors exceeds REJECT_LIMIT,
> the command should fail. Thought?

Makes sense.
Updated the patch.

> BTW if "set_to_null" is supported someday, REJECT_LIMIT can also
> apply. The command would cinsert NULL into the target table upon
> encountering errors and continue, but fail if the number of errors
> exceed REJECT_LIMIT.

Agreed.


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachment

Re: Add new COPY option REJECT_LIMIT

From
Kirill Reshke
Date:
Hi! Nice feature.

Few comments:

> +      When a positive integer value is specified, <command>COPY</command> limits
> +      the maximum tolerable number of errors while converting a column's input
> +      value into its data type.

If nothing is specified, then the maximum tolerable number of errors
is one, right? Should we state  this explicitly in the documentation?

> +COPY x from stdin with (on_error ignore, reject_limit 0);
How about a test where reject_limit is a string, but not
(case-intensively) 'infinity'?

> + CopyRejectLimits reject_limits; /* thresholds of reject_limit */

Why are there multiple thresholds? Can we have only one?

Other places look good to me.



Re: Add new COPY option REJECT_LIMIT

From
Fujii Masao
Date:

On 2024/07/22 21:37, torikoshia wrote:
> On Fri, Jul 19, 2024 at 11:48 PM Junwang Zhao <zhjwpku@gmail.com> wrote:
> Thanks for the comment.
> 
>> In patch 0002, the ratio is calculated by the already skipped/processed
>> rows, but what if a user wants to copy 1000 rows, and he/she can tolerate
>> 10 error rows, so he/she might set *reject_limit 0.01*, but one bad row in the
>> first 100 rows will fail the entire command, this might surprise the user.
> 
> Since the ratio is calculated after all data is processed, the case "one bad row in the first 100 rows will fail the
entirecommand" doesn't happen:
 

Yes, but is this the desired behavior when using the ratio threshold?
I was thinking that COPY should fail once the error ratio (errors vs.
total rows in the input file) reaches the threshold. Since it's
difficult to know the total number of rows in the input file,
implementing this seems not easy, though.

> Updated the patch.

Thanks for updating the patch!


+      This option must be used with <literal>ON_ERROR</literal> to be set to
+      other than <literal>stop</literal>.

Regarding the ON_ERROR option, now it only has two values.
Instead of saying "other than stop," we should simply use "ignore"
for clarity and intuition?


+      When specified <literal>INFINITY</literal>, <command>COPY</command> ignores all
+      the errors. This is a synonym for <literal>ON_ERROR</literal> <literal>ignore</literal>.

For the INFINITY value, the description is a bit unclear.
As I understand it, INFINITY is the default for REJECT_LIMIT.
So, if ON_ERROR=ignore is set without specifying REJECT_LIMIT,
COPY will ignore all errors, similar to when REJECT_LIMIT=INFINITY is used.


In line with my previous suggestion, if we support only REJECT_LIMIT
without ON_ERROR, having INFINITY makes sense. However,
with ON_ERROR supported, REJECT_LIMIT=INFINITY seems redundant.
Users can just set ON_ERROR=ignore to ignore all errors,
making INFINITY unnecessary for REJECT_LIMIT. This is open for
discussion, but I believe it's better to remove INFINITY from
the REJECT_LIMIT options.


+        else if (strcmp(defel->defname, "reject_limit") == 0)
+        {
+            if (reject_limit_specified)
+                errorConflictingDefElem(defel, pstate);
+            if (!opts_out->on_error)
+                ereport(ERROR,
+                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                         errmsg("REJECT_LIMIT requires ON_ERROR to be set to other than stop")));

Using "ignore" instead of "other than stop" in the error message
is clearer and more intuitive.


Checking if ON_ERROR and REJECT_LIMIT are specified should be
done after the foreach() processing of the options. Otherwise,
if REJECT_LIMIT is specified before ON_ERROR, unexpected errors can occur.

---------------
=# copy test from '...' WITH (REJECT_LIMIT 7, ON_ERROR 'ignore');
ERROR:  REJECT_LIMIT requires ON_ERROR to be set to other than stop
---------------


+                ereport(ERROR,
+                        (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+                         errmsg("exceeded the number specified by REJECT_LIMIT \"%lld\"",
+                                (long long) cstate->opts.reject_limits.num_err)));

The error message isn't clear about what exceeded REJECT_LIMIT.
How about saying "skipped more than REJECT_LIMIT rows" or something instead?


+/*
+ * A struct to hold reject_limit options, in a parsed form.
+ * More values to be added in another patch.
+ */

The latter comment doesn't seem necessary or helpful.


Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Add new COPY option REJECT_LIMIT

From
torikoshia
Date:
On Tue, Jul 23, 2024 at 1:35 PM Fujii Masao 
<masao.fujii@oss.nttdata.com> wrote:

Thanks for your review.

> On 2024/07/22 21:37, torikoshia wrote:
>> On Fri, Jul 19, 2024 at 11:48 PM Junwang Zhao <zhjwpku@gmail.com> 
>> wrote:
>> Thanks for the comment.
>> 
>>> In patch 0002, the ratio is calculated by the already 
>>> skipped/processed
>>> rows, but what if a user wants to copy 1000 rows, and he/she can 
>>> tolerate
>>> 10 error rows, so he/she might set *reject_limit 0.01*, but one bad 
>>> row in the
>>> first 100 rows will fail the entire command, this might surprise the 
>>> user.
>> 
>> Since the ratio is calculated after all data is processed, the case 
>> "one bad row in the first 100 rows will fail the entire command" 
>> doesn't happen:
> 
> Yes, but is this the desired behavior when using the ratio threshold?
> I was thinking that COPY should fail once the error ratio (errors vs.
> total rows in the input file) reaches the threshold.

Users might expect like you.
However, implementing it would need a row number counting feature as you 
pointed out, and it seems an overkill.

Describing the difference between ratio and number in the manual might 
help, but
it might be better to make REJECT_LIMIT support only the number of 
errors and leave it to the user to calculate the number from the ratio.

I'd like to hear if anyone has an opinion on the need for supporting 
ratio.
I remember David prefers ratio[1].

> +      This option must be used with <literal>ON_ERROR</literal> to be 
> set to
> +      other than <literal>stop</literal>.
> 
> Regarding the ON_ERROR option, now it only has two values.
> Instead of saying "other than stop," we should simply use "ignore"
> for clarity and intuition?

I'll Modify it.
The reason for the roundabout wording was the expectation that on_error 
would support values other than these in the future, but as you point 
out, there are currently only two.


> +      When specified <literal>INFINITY</literal>, 
> <command>COPY</command> ignores all
> +      the errors. This is a synonym for <literal>ON_ERROR</literal> 
> <literal>ignore</literal>.
> 
> For the INFINITY value, the description is a bit unclear.
> As I understand it, INFINITY is the default for REJECT_LIMIT.
> So, if ON_ERROR=ignore is set without specifying REJECT_LIMIT,
> COPY will ignore all errors, similar to when REJECT_LIMIT=INFINITY is 
> used.
> 
> 
> In line with my previous suggestion, if we support only REJECT_LIMIT
> without ON_ERROR, having INFINITY makes sense. However,
> with ON_ERROR supported, REJECT_LIMIT=INFINITY seems redundant.
> Users can just set ON_ERROR=ignore to ignore all errors,
> making INFINITY unnecessary for REJECT_LIMIT. This is open for
> discussion, but I believe it's better to remove INFINITY from
> the REJECT_LIMIT options.

Agreed.
Unless there are opposing opinions, I'm going to remove 'INFINITY' in 
the next patch.

> +               else if (strcmp(defel->defname, "reject_limit") == 0)
> +               {
> +                       if (reject_limit_specified)
> +                               errorConflictingDefElem(defel, pstate);
> +                       if (!opts_out->on_error)
> +                               ereport(ERROR,
> +                                               
> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                                                errmsg("REJECT_LIMIT 
> requires ON_ERROR to be set to other than stop")));
> 
> Using "ignore" instead of "other than stop" in the error message
> is clearer and more intuitive.

Agreed.


> Checking if ON_ERROR and REJECT_LIMIT are specified should be
> done after the foreach() processing of the options. Otherwise,
> if REJECT_LIMIT is specified before ON_ERROR, unexpected errors can 
> occur.
> 
> ---------------
> =# copy test from '...' WITH (REJECT_LIMIT 7, ON_ERROR 'ignore');
> ERROR:  REJECT_LIMIT requires ON_ERROR to be set to other than stop
> ---------------

Ugh, I'll modify it.

> +                               ereport(ERROR,
> +                                               
> (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> +                                                errmsg("exceeded the 
> number specified by REJECT_LIMIT \"%lld\"",
> +                                                               (long 
> long) cstate->opts.reject_limits.num_err)));
> 
> The error message isn't clear about what exceeded REJECT_LIMIT.
> How about saying "skipped more than REJECT_LIMIT rows" or something 
> instead?

Agreed.

> +/*
> + * A struct to hold reject_limit options, in a parsed form.
> + * More values to be added in another patch.
> + */
> 
> The latter comment doesn't seem necessary or helpful.

Agreed.

[1] 
https://www.postgresql.org/message-id/CAKFQuwYP91_G6tktYFTZq_CmkZ_%3DzuWjkz1%2B25Nd8bpsrDkx5Q%40mail.gmail.com

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: Add new COPY option REJECT_LIMIT

From
torikoshia
Date:
On 2024-07-23 02:06, Kirill Reshke wrote:

Thanks for your review.

> Few comments:
> 
>> +      When a positive integer value is specified, 
>> <command>COPY</command> limits
>> +      the maximum tolerable number of errors while converting a 
>> column's input
>> +      value into its data type.
> 
> If nothing is specified, then the maximum tolerable number of errors
> is one, right? Should we state  this explicitly in the documentation?

REJECT_LIMIT now can be used wonly when on_error=ignore, I think the 
default(when only on_error=ignore is specified) is unlimited.
Anyway, I'm going to add a description about the default.

>> +COPY x from stdin with (on_error ignore, reject_limit 0);
> How about a test where reject_limit is a string, but not
> (case-intensively) 'infinity'?

Considering the discussion in[1], I'm now going to remove 'infinity'.

>> + CopyRejectLimits reject_limits; /* thresholds of reject_limit */
> 
> Why are there multiple thresholds? Can we have only one?

This is because I thought it'd be more convenient to support both the 
number and the ratio of error, but I'm also beginning to think that it 
might be better to support only the number of cases, as discussed in 
[1].

[1]https://www.postgresql.org/message-id/5f807fcf3a36df7ba41464ab40b5c37d%40oss.nttdata.com

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: Add new COPY option REJECT_LIMIT

From
Fujii Masao
Date:

On 2024/09/24 14:25, torikoshia wrote:
> Updated the patch.

Thanks for updating the patch!

+    REJECT_LIMIT { <replaceable class="parameter">integer</replaceable> }

The curly braces {} seem unnecessary here.

+      When a positive integer value is specified, <command>COPY</command> limits
+      the maximum tolerable number of errors while converting a column's input
+      value into its data type.
+      If input data caused more errors than the specified value, entire
+      <command>COPY</command> fails.
+      Otherwise, <command>COPY</command> discards the input row and continues
+      with the next one.
+      This option must be used with <literal>ON_ERROR</literal> to be set to
+      <literal>ignore</literal>.
+      Just setting <literal>ON_ERROR</literal> to <literal>ignore</literal>
+      tolerates unlimited number of errors.

Regarding the description of REJECT_LIMIT, how about clarifying what the option specifies up front, like this:

----------------
Specifies the maximum number of errors tolerated while converting a column's
input value to its data type, when on_error is set to "ignore." If the input
causes more errors than the specified value, the COPY command fails,
even with on_error set to "ignore." This value must be positive and used with
on_error="ignore". If not specified, on_error="ignore" allows an unlimited
number of errors, meaning COPY will skip all erroneous data.
----------------

+            ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("number for REJECT_LIMIT must be greater than zero")));
+    }
+    else
+        ereport(ERROR,
+            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+             errmsg("value for REJECT_LIMIT must be positive integer")));

The phrases "number for" and "value for" seem unnecessary in the error messages.
Also, "positive number" should be "a positive number." Would it be better to display
the actual specified value in the error message, like: errmsg("REJECT_LIMIT (%s) must
be a positive number", value)?

Lastly, during my testing, if nothing is specified for REJECT_LIMIT
(e.g., COPY ... WITH (..., REJECT_LIMIT)), the COPY command caused a segmentation fault.


>> I'd like to hear if anyone has an opinion on the need for supporting ratio.
> 
> Since there was no opinion about it, attached a patch only for REJECT_LIMIT specifying number.

+1

> As there are no opposing opinions, removed 'INFINITY' as well.

+1

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Add new COPY option REJECT_LIMIT

From
jian he
Date:
+/*
+ * Extract REJECT_LIMIT value from a DefElem.
+ */
+static int64
+defGetCopyRejectLimitOptions(DefElem *def)
+{
+ int64 reject_limit;
+
+ if (def->arg == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("REJECT_LIMIT requires a positive integer")));
+
+ if (nodeTag(def->arg) == T_Integer)
+ {
+ reject_limit = defGetInt64(def);
+ if (reject_limit <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("REJECT_LIMIT (%lld) must be greater than zero",
+ (long long) reject_limit)));
+ }
+ else
+ {
+ char   *sval = defGetString(def);
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("REJECT_LIMIT (%s) must be a positive integer",
+ sval)));
+ }
+
+ return reject_limit;
+}
there will be an integer overflow issue?

Can you try the following?

static int64
defGetCopyRejectLimitOptions(DefElem *def)
{
    int64                    reject_limit;
    reject_limit = defGetInt64(def);
    if (reject_limit <= 0)
        ereport(ERROR,
            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                errmsg("REJECT_LIMIT (%lld) must be greater than zero",
                    (long long) reject_limit)));
}




REJECT_LIMIT <replaceable class="parameter">integer</replaceable>
i think, you want REJECT_LIMIT be bigint?
so here it should be
REJECT_LIMIT <replaceable class="parameter">bigint</replaceable>\
?



Re: Add new COPY option REJECT_LIMIT

From
jian he
Date:
> I'm wondering if we can use the wording maxerror as in the attached
> patch.
>

+   <varlistentry>
+    <term><literal>REJECT_LIMIT</literal></term>
+    <listitem>
+     <para>
+      Specifies the maximum number of errors tolerated while converting a
+      column's input value to its data type, when
<literal>ON_ERROR</literal> is
+      set to <literal>ignore</literal>.
+      If the input causes more errors than the specified value, the
<command>COPY</command>
+      command fails, even with <literal>ON_ERROR</literal> set to
<literal>ignore</literal>.
+      This clause must be used with
<literal>ON_ERROR</literal>=<literal>ignore</literal>
+      and <replaceable class="parameter">maxerror</replaceable> must
be positive.
+      If not specified, <literal>ON_ERROR</literal>=<literal>ignore</literal>
+      allows an unlimited number of errors, meaning
<command>COPY</command> will
+      skip all erroneous data.
+     </para>
+    </listitem>
+   </varlistentry>

mentioning <replaceable class="parameter">maxerror</replaceable> is a
bigint type
or explicitly mentioning the maximum allowed value of "maxerror" would be great.

other than that, it looks good to me.



Re: Add new COPY option REJECT_LIMIT

From
Fujii Masao
Date:

On 2024/09/30 21:08, torikoshia wrote:
> Since defGetInt64() checks not only whether the input value exceeds the range of bigint but also input value is
specified,attached v6 patch checks them by directly calling defGetInt64().
 

Thanks for updating the patch! But the patch no longer applied cleanly
to the master branch. Could you rebase it?


+    if (opts_out->reject_limit && !opts_out->on_error)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+        /*- translator: first and second %s are the names of COPY
+         * option, e.g. ON_ERROR, third is the value of the COPY option,
+         * e.g. IGNORE */
+                 errmsg("COPY %s requires %s to be set to %s",
+                         "REJECT_LIMIT", "ON_ERROR", "IGNORE")));

This check might be better moved to other place, for example at
the bottom of ProcessCopyOptions(). I noticed a comment in the code:
"Check for incompatible options (must do these two before inserting defaults)."
You added your condition after this comment and before inserting the defaults,
but it's not related to that process. So, moving this check to other place
seems more appropriate.

While reviewing, I also noticed that the check for
"opts_out->binary && opts_out->on_error != COPY_ON_ERROR_STOP"
is similarly placed before setting the defaults, which might not
be correct. This check should probably be moved as well.
Additionally, the comment mentioning "must do these two" should be
updated to "must do these three." These changes should be handled
in a separate patch.


+            if (cstate->opts.reject_limit &&

Wouldn't it be clearer to use cstate->opts.reject_limit > 0 for better readability?


+                cstate->num_errors > cstate->opts.reject_limit)
+                ereport(ERROR,
+                        (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+                         errmsg("skipped more than REJECT_LIMIT rows: \"%lld\",",
+                                (long long) cstate->opts.reject_limit)));

To make the error message clearer, similar to the NOTICE about
skipped rows, how about rewording it to:
"skipped more than REJECT_LIMIT (%lld) rows due to data type incompatibility"?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Add new COPY option REJECT_LIMIT

From
Fujii Masao
Date:

On 2024/10/07 21:51, torikoshia wrote:
>> While reviewing, I also noticed that the check for
>> "opts_out->binary && opts_out->on_error != COPY_ON_ERROR_STOP"
>> is similarly placed before setting the defaults, which might not
>> be correct. This check should probably be moved as well.
>> Additionally, the comment mentioning "must do these two" should be
>> updated to "must do these three." These changes should be handled
>> in a separate patch.
> 
> Agreed and attached 0002 patch.

Thanks for updating the 0001 patch and creating the 0002 patch! I've pushed both.


> Also considering when REJECT_LIMIT is specified to 1, attached patch uses errmsg_plural() instead of errmsg.

I don't think errmsg_plural() is needed here since, when 1 is specified,
"rows" should follow "more than REJECT_LIMIT (1)". No?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Add new COPY option REJECT_LIMIT

From
torikoshia
Date:
On 2024-10-08 18:39, Fujii Masao wrote:
> On 2024/10/07 21:51, torikoshia wrote:
>>> While reviewing, I also noticed that the check for
>>> "opts_out->binary && opts_out->on_error != COPY_ON_ERROR_STOP"
>>> is similarly placed before setting the defaults, which might not
>>> be correct. This check should probably be moved as well.
>>> Additionally, the comment mentioning "must do these two" should be
>>> updated to "must do these three." These changes should be handled
>>> in a separate patch.
>> 
>> Agreed and attached 0002 patch.
> 
> Thanks for updating the 0001 patch and creating the 0002 patch! I've
> pushed both.

Thanks a lot!

> 
> 
>> Also considering when REJECT_LIMIT is specified to 1, attached patch 
>> uses errmsg_plural() instead of errmsg.
> 
> I don't think errmsg_plural() is needed here since, when 1 is 
> specified,
> "rows" should follow "more than REJECT_LIMIT (1)". No?

You are right.

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation