Thread: Add new COPY option REJECT_LIMIT
Hi, 9e2d870 enabled the COPY command to skip soft error, and I think we can add another option which specifies the maximum tolerable number of soft errors. I remember this was discussed in [1], and feel it would be useful when loading 'dirty' data but there is a limit to how dirty it can be. Attached a patch for this. What do you think? [1] https://www.postgresql.org/message-id/752672.1699474336%40sss.pgh.pa.us -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Attachment
On Fri, Jan 26, 2024 at 2:49 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
Hi,
9e2d870 enabled the COPY command to skip soft error, and I think we can
add another option which specifies the maximum tolerable number of soft
errors.
I remember this was discussed in [1], and feel it would be useful when
loading 'dirty' data but there is a limit to how dirty it can be.
Attached a patch for this.
What do you think?
I'm opposed to adding this particular feature.
When implementing this kind of business rule I'd need the option to specify a percentage, not just an absolute value.
I would focus on trying to put the data required to make this kind of determination into a place where applications implementing such business rules and monitoring can readily get at it. The "ERRORS TO" and maybe a corresponding "STATS TO" option where a table can be specified for the system to place the problematic data and stats about the copy itself.
David J.
On 2024-01-27 00:20, David G. Johnston wrote: Thanks for your comments! > On Fri, Jan 26, 2024 at 2:49 AM torikoshia > <torikoshia@oss.nttdata.com> wrote: > >> Hi, >> >> 9e2d870 enabled the COPY command to skip soft error, and I think we >> can >> add another option which specifies the maximum tolerable number of >> soft >> errors. >> >> I remember this was discussed in [1], and feel it would be useful >> when >> loading 'dirty' data but there is a limit to how dirty it can be. >> >> Attached a patch for this. >> >> What do you think? > > I'm opposed to adding this particular feature. > > When implementing this kind of business rule I'd need the option to > specify a percentage, not just an absolute value. Yeah, it seems useful for some cases. Actually, Greenplum enables to specify not only the max number of bad rows but also its percentage[1]. I may be wrong, but considering some dataloaders support something like reject_limit(Redshift supports MAXERROR[2], pg_bulkload supports PARSE_ERRORS[3]), specifying the "number" of the bad row might also be useful. I think we can implement reject_limit specified by percentage simply calculating the ratio of skipped and processed at the end of CopyFrom() like this: if (cstate->opts.reject_limit > 0 && (double) skipped / (processed + skipped) > cstate->opts.reject_limit_percent) ereport(ERROR, (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), errmsg("exceeded the ratio specified by .. > I would focus on trying to put the data required to make this kind of > determination into a place where applications implementing such > business rules and monitoring can readily get at it. The "ERRORS TO" > and maybe a corresponding "STATS TO" option where a table can be > specified for the system to place the problematic data and stats about > the copy itself. It'd be nice to have such informative tables, but I believe the benefit of reject_limit is it fails the entire loading when the threshold is exceeded. I imagine when we just have error and stats information tables for COPY, users have to delete the rows when they confirmed too many errors in these tables. [1]https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/admin_guide-load-topics-g-handling-load-errors.html [2]https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-load.html [3]https://ossc-db.github.io/pg_bulkload/pg_bulkload.html -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
On 2024/01/26 18:49, torikoshia wrote: > Hi, > > 9e2d870 enabled the COPY command to skip soft error, and I think we can add another option which specifies the maximumtolerable number of soft errors. > > I remember this was discussed in [1], and feel it would be useful when loading 'dirty' data but there is a limit to howdirty it can be. > > Attached a patch for this. > > What do you think? The patch no longer applies cleanly to HEAD. Could you update it? I think the REJECT_LIMIT feature is useful. Allowing it to be set as either the absolute number of skipped rows or a percentageof the total input rows is a good idea. However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR option is still necessary. REJECT_LIMIT seems to coverthe same cases. For instance, REJECT_LIMIT=infinity can act like ON_ERROR=ignore, and REJECT_LIMIT=0 can act like ON_ERROR=stop. Therefore, having both ON_ERROR and REJECT_LIMIT might be confusing. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2024-07-03 02:07, Fujii Masao wrote: Thanks for your comments! > On 2024/01/26 18:49, torikoshia wrote: >> Hi, >> >> 9e2d870 enabled the COPY command to skip soft error, and I think we >> can add another option which specifies the maximum tolerable number of >> soft errors. >> >> I remember this was discussed in [1], and feel it would be useful when >> loading 'dirty' data but there is a limit to how dirty it can be. >> >> Attached a patch for this. >> >> What do you think? > > The patch no longer applies cleanly to HEAD. Could you update it? I'm going to update it after discussing the option format as described below. > > I think the REJECT_LIMIT feature is useful. Allowing it to be set as > either the absolute number of skipped rows or a percentage of the > total input rows is a good idea. > > However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR > option is still necessary. REJECT_LIMIT seems to cover the same cases. > For instance, REJECT_LIMIT=infinity can act like ON_ERROR=ignore, and > REJECT_LIMIT=0 can act like ON_ERROR=stop. I agree that it's possible to use only REJECT_LIMIT without ON_ERROR. I also think it's easy to understand that REJECT_LIMIT=0 is ON_ERROR=stop. However, expressing REJECT_LIMIT='infinity' needs some definition like "setting REJECT_LIMIT to -1 means 'infinity'", doesn't it? If so, I think this might not so intuitive. Also, since it seems Snowflake and Redshift have both options equivalent to REJECT_LIMIT and ON_ERROR, having both of them in PostgreSQL COPY might not be surprising: - Snowflake's ON_ERROR accepts "CONTINUE | SKIP_FILE | SKIP_FILE_num | 'SKIP_FILE_num%' | ABORT_STATEMENT"[1] - Redshift has MAXERROR and IGNOREALLERRORS options[2] BTW after seeing Snowflake makes SKIP_FILE_num one of the options of ON_ERROR, I'm a bit wondering whether REJECT_LIMIT also should be the same. [1] https://docs.snowflake.com/en/sql-reference/sql/copy-into-table#copy-options-copyoptions [2] https://docs.aws.amazon.com/en_en/redshift/latest/dg/copy-parameters-data-load.html -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
On 2024/07/04 12:05, torikoshia wrote: > I'm going to update it after discussing the option format as described below. Thanks! > I agree that it's possible to use only REJECT_LIMIT without ON_ERROR. > I also think it's easy to understand that REJECT_LIMIT=0 is ON_ERROR=stop. > However, expressing REJECT_LIMIT='infinity' needs some definition like "setting REJECT_LIMIT to -1 means 'infinity'", doesn'tit? If so, I think this might not so intuitive. How about allowing REJECT_LIMIT to accept the keywords "infinity", "unlimited", or "all" in addition to a number? This way, users can specify one of these keywords instead of -1 to ignore all errors. The server code would then internally set the REJECT_LIMIT to -1 or another appropriate value when these keywords are used, but users wouldn't need to worry about this detail. If we choose "all" as the keyword, renaming the option to IGNORE_ERRORS might be more intuitive and easier to understand than REJECT_LIMIT. > Also, since it seems Snowflake and Redshift have both options equivalent to REJECT_LIMIT and ON_ERROR, having both of themin PostgreSQL COPY might not be surprising: > - Snowflake's ON_ERROR accepts "CONTINUE | SKIP_FILE | SKIP_FILE_num | 'SKIP_FILE_num%' | ABORT_STATEMENT"[1] > - Redshift has MAXERROR and IGNOREALLERRORS options[2] Ok, so here's a summary of the options and their behaviors: To ignore all errors and continue to the end: - Snowflake: ON_ERROR=CONTINUE - Redshift: IGNOREALLERRORS - Postgres (with your patch): ON_ERROR=ignore - Postgres (with my idea): IGNORE_ERRORS=all To fail as soon as an error is found: - Snowflake: ON_ERROR=ABORT_STATEMENT (default) / SKIP_FILE - Redshift: MAXERROR=0 (default) - Postgres (with your patch): ON_ERROR=stop (default) - Postgres (with my idea): IGNORE_ERRORS=0 (default) To fail when NNN or more errors are found: - Snowflake: ON_ERROR=SKIP_FILE_NNN - Redshift: MAXERROR=NNN - Postgres (with your patch): REJECT_LIMIT=NNN-1 and ON_ERROR=ignore - Postgres (with my idea): IGNORE_ERRORS=NNN This makes me think it might be better to treat REJECT_LIMIT as an additional option for ON_ERROR=stop instead of ON_ERROR=ignore if we adopt your patch. Since ON_ERROR=stop is the default, users could set the maximum number of allowed errors by specifying only REJECT_LIMIT. Otherwise, they would need to specify both ON_ERROR=ignore and REJECT_LIMIT. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2024-07-05 12:59, Fujii Masao wrote: > On 2024/07/04 12:05, torikoshia wrote: >> I'm going to update it after discussing the option format as described >> below. > > Thanks! > >> I agree that it's possible to use only REJECT_LIMIT without ON_ERROR. >> I also think it's easy to understand that REJECT_LIMIT=0 is >> ON_ERROR=stop. >> However, expressing REJECT_LIMIT='infinity' needs some definition like >> "setting REJECT_LIMIT to -1 means 'infinity'", doesn't it? If so, I >> think this might not so intuitive. > > How about allowing REJECT_LIMIT to accept the keywords "infinity", > "unlimited", > or "all" in addition to a number? This way, users can specify one of > these > keywords instead of -1 to ignore all errors. The server code would then > internally set the REJECT_LIMIT to -1 or another appropriate value when > these keywords are used, but users wouldn't need to worry about this > detail. Agreed. > If we choose "all" as the keyword, renaming the option to IGNORE_ERRORS > might be more intuitive and easier to understand than REJECT_LIMIT. I feel that 'infinite' and 'unlimited' are unfamiliar values for PostgreSQL parameters, so 'all' might be better and IGNORE_ERRORS would be a better parameter name as your suggestion. >> Also, since it seems Snowflake and Redshift have both options >> equivalent to REJECT_LIMIT and ON_ERROR, having both of them in >> PostgreSQL COPY might not be surprising: >> - Snowflake's ON_ERROR accepts "CONTINUE | SKIP_FILE | SKIP_FILE_num | >> 'SKIP_FILE_num%' | ABORT_STATEMENT"[1] >> - Redshift has MAXERROR and IGNOREALLERRORS options[2] > > Ok, so here's a summary of the options and their behaviors: > > To ignore all errors and continue to the end: > > - Snowflake: ON_ERROR=CONTINUE > - Redshift: IGNOREALLERRORS > - Postgres (with your patch): ON_ERROR=ignore > - Postgres (with my idea): IGNORE_ERRORS=all > > To fail as soon as an error is found: > > - Snowflake: ON_ERROR=ABORT_STATEMENT (default) / SKIP_FILE > - Redshift: MAXERROR=0 (default) > - Postgres (with your patch): ON_ERROR=stop (default) > - Postgres (with my idea): IGNORE_ERRORS=0 (default) > > To fail when NNN or more errors are found: > > - Snowflake: ON_ERROR=SKIP_FILE_NNN > - Redshift: MAXERROR=NNN > - Postgres (with your patch): REJECT_LIMIT=NNN-1 and ON_ERROR=ignore > - Postgres (with my idea): IGNORE_ERRORS=NNN Thanks for the summary. > This makes me think it might be better to treat REJECT_LIMIT as > an additional option for ON_ERROR=stop instead of ON_ERROR=ignore > if we adopt your patch. Since ON_ERROR=stop is the default, > users could set the maximum number of allowed errors by specifying > only REJECT_LIMIT. Otherwise, they would need to specify both > ON_ERROR=ignore and REJECT_LIMIT. That makes sense. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
On 2024-07-03 02:07, Fujii Masao wrote: > However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR > option is still necessary. I remembered another reason for the necessity of ON_ERROR. ON_ERROR defines how to behave when encountering an error and it just accepts 'ignore' and 'stop' currently, but is expected to support other options such as saving details of errors to a table[1]. ON_ERROR=stop is a synonym for REJECT_LIMIT=infinity, but I imagine REJECT_LIMIT would not replace future options of ON_ERROR. Considering this and the option we want to add this time is to specify an upper limit on the number or ratio of errors, the name of this option like "reject_limit" seems better than "ignore_errors". On Fri, Jul 5, 2024 at 4:13 PM torikoshia <torikoshia@oss.nttdata.com> wrote: > On 2024-07-05 12:59, Fujii Masao wrote: >> On 2024/07/04 12:05, torikoshia wrote: >>> I'm going to update it after discussing the option format as >>> described >>> below. Updated the patch. 0001 sets limit by the absolute number of error rows and 0002 sets limit by ratio of the error. >> If we choose "all" as the keyword, renaming the option to >> IGNORE_ERRORS >> might be more intuitive and easier to understand than REJECT_LIMIT. > I feel that 'infinite' and 'unlimited' are unfamiliar values for > PostgreSQL parameters, so 'all' might be better and IGNORE_ERRORS would > be a better parameter name as your suggestion. As described above, attached patch adopts REJECT_LIMIT, so it uses "infinity". >> This makes me think it might be better to treat REJECT_LIMIT as >> an additional option for ON_ERROR=stop instead of ON_ERROR=ignore >> if we adopt your patch. Since ON_ERROR=stop is the default, >> users could set the maximum number of allowed errors by specifying >> only REJECT_LIMIT. Otherwise, they would need to specify both >> ON_ERROR=ignore and REJECT_LIMIT. > That makes sense. On my second thought, whatever value ON_ERROR is specified(e.g. ignore, stop, table), it seems fine to use REJECT_LIMIT. I feel REJECT_LIMIT has both "ignore" and "stop" characteristics, meaning it ignores errors until it reaches REJECT_LIMIT and stops when it exceeds the REJECT_LIMIT. And REJECT_LIMIT seems orthogonal to 'table', which specifies where to save error details. Attached patch allows using REJECT_LIMIT regardless of the ON_ERROR option value. [1] https://www.postgresql.org/message-id/flat/CACJufxH_OJpVra=0c4ow8fbxHj7heMcVaTNEPa5vAurSeNA-6Q@mail.gmail.com -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Attachment
On 2024/07/17 22:21, torikoshia wrote: > On 2024-07-03 02:07, Fujii Masao wrote: >> However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR option is still necessary. > > I remembered another reason for the necessity of ON_ERROR. > > ON_ERROR defines how to behave when encountering an error and it just accepts 'ignore' and 'stop' currently, but is expectedto support other options such as saving details of errors to a table[1]. Wouldn't it be better to separate the option specifying where error details are output from the ON_ERROR option (which determines behavior when encountering errors)? "table" seems valid for both ON_ERROR=ignore and ON_ERROR=stop. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Hi Torikoshia, On Wed, Jul 17, 2024 at 9:21 PM torikoshia <torikoshia@oss.nttdata.com> wrote: > > On 2024-07-03 02:07, Fujii Masao wrote: > > However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR > > option is still necessary. > > I remembered another reason for the necessity of ON_ERROR. > > ON_ERROR defines how to behave when encountering an error and it just > accepts 'ignore' and 'stop' currently, but is expected to support other > options such as saving details of errors to a table[1]. > ON_ERROR=stop is a synonym for REJECT_LIMIT=infinity, but I imagine > REJECT_LIMIT would not replace future options of ON_ERROR. > > Considering this and the option we want to add this time is to specify > an upper limit on the number or ratio of errors, the name of this option > like "reject_limit" seems better than "ignore_errors". > > On Fri, Jul 5, 2024 at 4:13 PM torikoshia <torikoshia@oss.nttdata.com> > wrote: > > On 2024-07-05 12:59, Fujii Masao wrote: > >> On 2024/07/04 12:05, torikoshia wrote: > >>> I'm going to update it after discussing the option format as > >>> described > >>> below. > > Updated the patch. > 0001 sets limit by the absolute number of error rows and 0002 sets limit > by ratio of the error. In patch 0002, the ratio is calculated by the already skipped/processed rows, but what if a user wants to copy 1000 rows, and he/she can tolerate 10 error rows, so he/she might set *reject_limit 0.01*, but one bad row in the first 100 rows will fail the entire command, this might surprise the user. This case can be resolved by 0001 *reject_limit 10*, so I think the *by ratio* is less useful. > > >> If we choose "all" as the keyword, renaming the option to > >> IGNORE_ERRORS > >> might be more intuitive and easier to understand than REJECT_LIMIT. > > > I feel that 'infinite' and 'unlimited' are unfamiliar values for > > PostgreSQL parameters, so 'all' might be better and IGNORE_ERRORS would > > be a better parameter name as your suggestion. > > As described above, attached patch adopts REJECT_LIMIT, so it uses > "infinity". > > >> This makes me think it might be better to treat REJECT_LIMIT as > >> an additional option for ON_ERROR=stop instead of ON_ERROR=ignore > >> if we adopt your patch. Since ON_ERROR=stop is the default, > >> users could set the maximum number of allowed errors by specifying > >> only REJECT_LIMIT. Otherwise, they would need to specify both > >> ON_ERROR=ignore and REJECT_LIMIT. > > > That makes sense. > > On my second thought, whatever value ON_ERROR is specified(e.g. ignore, > stop, table), it seems fine to use REJECT_LIMIT. > I feel REJECT_LIMIT has both "ignore" and "stop" characteristics, > meaning it ignores errors until it reaches REJECT_LIMIT and stops when > it exceeds the REJECT_LIMIT. > And REJECT_LIMIT seems orthogonal to 'table', which specifies where to > save error details. > > Attached patch allows using REJECT_LIMIT regardless of the ON_ERROR > option value. > > > [1] > https://www.postgresql.org/message-id/flat/CACJufxH_OJpVra=0c4ow8fbxHj7heMcVaTNEPa5vAurSeNA-6Q@mail.gmail.com > > -- > Regards, > > -- > Atsushi Torikoshi > NTT DATA Group Corporation -- Regards Junwang Zhao
On 2024/07/19 22:03, Fujii Masao wrote: > > > On 2024/07/17 22:21, torikoshia wrote: >> On 2024-07-03 02:07, Fujii Masao wrote: >>> However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR option is still necessary. >> >> I remembered another reason for the necessity of ON_ERROR. >> >> ON_ERROR defines how to behave when encountering an error and it just accepts 'ignore' and 'stop' currently, but is expectedto support other options such as saving details of errors to a table[1]. > > Wouldn't it be better to separate the option specifying where > error details are output from the ON_ERROR option > (which determines behavior when encountering errors)? > "table" seems valid for both ON_ERROR=ignore and ON_ERROR=stop. I still find it odd to accept "table" as a value for ON_ERROR. However, "set_to_null" or "replace-column" proposed in [1] seem valid for ON_ERROR. So, I'm okay with keeping the ON_ERROR option. > On my second thought, whatever value ON_ERROR is specified(e.g. ignore, stop, table), it seems fine to use REJECT_LIMIT. > I feel REJECT_LIMIT has both "ignore" and "stop" characteristics, meaning it ignores errors until it reaches REJECT_LIMITand stops when it exceeds the REJECT_LIMIT. ON_ERROR specifies how to handle errors, and "stop" means to fail the command. So, if ON_ERROR=stop, REJECT_LIMIT should have no effect, and the command should fail immediately upon encountering an error. As in your original proposal, I now think REJECT_LIMIT should only apply when ON_ERROR=ignore. The command would ignore errors and continue processing, but if the number of errors exceeds REJECT_LIMIT, the command should fail. Thought? BTW if "set_to_null" is supported someday, REJECT_LIMIT can also apply. The command would cinsert NULL into the target table upon encountering errors and continue, but fail if the number of errors exceed REJECT_LIMIT. Regards, [1] https://www.postgresql.org/message-id/flat/CAKFQuwawy1e6YR4S=j+y7pXqg_Dw1WBVrgvf=BP3d1_aSfe_+Q@mail.gmail.com -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Fri, Jul 19, 2024 at 11:48 PM Junwang Zhao <zhjwpku@gmail.com> wrote: Thanks for the comment. > In patch 0002, the ratio is calculated by the already skipped/processed > rows, but what if a user wants to copy 1000 rows, and he/she can > tolerate > 10 error rows, so he/she might set *reject_limit 0.01*, but one bad row > in the > first 100 rows will fail the entire command, this might surprise the > user. Since the ratio is calculated after all data is processed, the case "one bad row in the first 100 rows will fail the entire command" doesn't happen: =# \! wc -l 1000rows-with-10err.data 1000 1000rows-with-10err.data =# COPY t1 from '1000rows-with-10err.data' with (log_verbosity verbose, reject_limit 0.01); NOTICE: skipping row due to data type incompatibility at line 10 for column i: "a" NOTICE: skipping row due to data type incompatibility at line 11 for column i: "a" NOTICE: skipping row due to data type incompatibility at line 12 for column i: "a" NOTICE: skipping row due to data type incompatibility at line 13 for column i: "a" NOTICE: skipping row due to data type incompatibility at line 14 for column i: "a" NOTICE: skipping row due to data type incompatibility at line 15 for column i: "a" NOTICE: skipping row due to data type incompatibility at line 16 for column i: "a" NOTICE: skipping row due to data type incompatibility at line 17 for column i: "a" NOTICE: skipping row due to data type incompatibility at line 18 for column i: "a" NOTICE: skipping row due to data type incompatibility at line 19 for column i: "a" NOTICE: 10 rows were skipped due to data type incompatibility COPY 990 On 2024-07-20 02:08, Fujii Masao wrote: > On 2024/07/19 22:03, Fujii Masao wrote: >> >> >> On 2024/07/17 22:21, torikoshia wrote: >>> On 2024-07-03 02:07, Fujii Masao wrote: >>>> However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR >>>> option is still necessary. >>> >>> I remembered another reason for the necessity of ON_ERROR. >>> >>> ON_ERROR defines how to behave when encountering an error and it just >>> accepts 'ignore' and 'stop' currently, but is expected to support >>> other options such as saving details of errors to a table[1]. >> >> Wouldn't it be better to separate the option specifying where >> error details are output from the ON_ERROR option >> (which determines behavior when encountering errors)? >> "table" seems valid for both ON_ERROR=ignore and ON_ERROR=stop. > > I still find it odd to accept "table" as a value for ON_ERROR. However, > "set_to_null" or "replace-column" proposed in [1] seem valid for > ON_ERROR. So, I'm okay with keeping the ON_ERROR option. Agreed. >> On my second thought, whatever value ON_ERROR is specified(e.g. >> ignore, stop, table), it seems fine to use REJECT_LIMIT. >> I feel REJECT_LIMIT has both "ignore" and "stop" characteristics, >> meaning it ignores errors until it reaches REJECT_LIMIT and stops when >> it exceeds the REJECT_LIMIT. > > ON_ERROR specifies how to handle errors, and "stop" means to fail > the command. So, if ON_ERROR=stop, REJECT_LIMIT should have no effect, > and the command should fail immediately upon encountering an error. > > As in your original proposal, I now think REJECT_LIMIT should only > apply when ON_ERROR=ignore. The command would ignore errors and > continue processing, but if the number of errors exceeds REJECT_LIMIT, > the command should fail. Thought? Makes sense. Updated the patch. > BTW if "set_to_null" is supported someday, REJECT_LIMIT can also > apply. The command would cinsert NULL into the target table upon > encountering errors and continue, but fail if the number of errors > exceed REJECT_LIMIT. Agreed. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Attachment
Hi! Nice feature. Few comments: > + When a positive integer value is specified, <command>COPY</command> limits > + the maximum tolerable number of errors while converting a column's input > + value into its data type. If nothing is specified, then the maximum tolerable number of errors is one, right? Should we state this explicitly in the documentation? > +COPY x from stdin with (on_error ignore, reject_limit 0); How about a test where reject_limit is a string, but not (case-intensively) 'infinity'? > + CopyRejectLimits reject_limits; /* thresholds of reject_limit */ Why are there multiple thresholds? Can we have only one? Other places look good to me.
On 2024/07/22 21:37, torikoshia wrote: > On Fri, Jul 19, 2024 at 11:48 PM Junwang Zhao <zhjwpku@gmail.com> wrote: > Thanks for the comment. > >> In patch 0002, the ratio is calculated by the already skipped/processed >> rows, but what if a user wants to copy 1000 rows, and he/she can tolerate >> 10 error rows, so he/she might set *reject_limit 0.01*, but one bad row in the >> first 100 rows will fail the entire command, this might surprise the user. > > Since the ratio is calculated after all data is processed, the case "one bad row in the first 100 rows will fail the entirecommand" doesn't happen: Yes, but is this the desired behavior when using the ratio threshold? I was thinking that COPY should fail once the error ratio (errors vs. total rows in the input file) reaches the threshold. Since it's difficult to know the total number of rows in the input file, implementing this seems not easy, though. > Updated the patch. Thanks for updating the patch! + This option must be used with <literal>ON_ERROR</literal> to be set to + other than <literal>stop</literal>. Regarding the ON_ERROR option, now it only has two values. Instead of saying "other than stop," we should simply use "ignore" for clarity and intuition? + When specified <literal>INFINITY</literal>, <command>COPY</command> ignores all + the errors. This is a synonym for <literal>ON_ERROR</literal> <literal>ignore</literal>. For the INFINITY value, the description is a bit unclear. As I understand it, INFINITY is the default for REJECT_LIMIT. So, if ON_ERROR=ignore is set without specifying REJECT_LIMIT, COPY will ignore all errors, similar to when REJECT_LIMIT=INFINITY is used. In line with my previous suggestion, if we support only REJECT_LIMIT without ON_ERROR, having INFINITY makes sense. However, with ON_ERROR supported, REJECT_LIMIT=INFINITY seems redundant. Users can just set ON_ERROR=ignore to ignore all errors, making INFINITY unnecessary for REJECT_LIMIT. This is open for discussion, but I believe it's better to remove INFINITY from the REJECT_LIMIT options. + else if (strcmp(defel->defname, "reject_limit") == 0) + { + if (reject_limit_specified) + errorConflictingDefElem(defel, pstate); + if (!opts_out->on_error) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("REJECT_LIMIT requires ON_ERROR to be set to other than stop"))); Using "ignore" instead of "other than stop" in the error message is clearer and more intuitive. Checking if ON_ERROR and REJECT_LIMIT are specified should be done after the foreach() processing of the options. Otherwise, if REJECT_LIMIT is specified before ON_ERROR, unexpected errors can occur. --------------- =# copy test from '...' WITH (REJECT_LIMIT 7, ON_ERROR 'ignore'); ERROR: REJECT_LIMIT requires ON_ERROR to be set to other than stop --------------- + ereport(ERROR, + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), + errmsg("exceeded the number specified by REJECT_LIMIT \"%lld\"", + (long long) cstate->opts.reject_limits.num_err))); The error message isn't clear about what exceeded REJECT_LIMIT. How about saying "skipped more than REJECT_LIMIT rows" or something instead? +/* + * A struct to hold reject_limit options, in a parsed form. + * More values to be added in another patch. + */ The latter comment doesn't seem necessary or helpful. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Tue, Jul 23, 2024 at 1:35 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: Thanks for your review. > On 2024/07/22 21:37, torikoshia wrote: >> On Fri, Jul 19, 2024 at 11:48 PM Junwang Zhao <zhjwpku@gmail.com> >> wrote: >> Thanks for the comment. >> >>> In patch 0002, the ratio is calculated by the already >>> skipped/processed >>> rows, but what if a user wants to copy 1000 rows, and he/she can >>> tolerate >>> 10 error rows, so he/she might set *reject_limit 0.01*, but one bad >>> row in the >>> first 100 rows will fail the entire command, this might surprise the >>> user. >> >> Since the ratio is calculated after all data is processed, the case >> "one bad row in the first 100 rows will fail the entire command" >> doesn't happen: > > Yes, but is this the desired behavior when using the ratio threshold? > I was thinking that COPY should fail once the error ratio (errors vs. > total rows in the input file) reaches the threshold. Users might expect like you. However, implementing it would need a row number counting feature as you pointed out, and it seems an overkill. Describing the difference between ratio and number in the manual might help, but it might be better to make REJECT_LIMIT support only the number of errors and leave it to the user to calculate the number from the ratio. I'd like to hear if anyone has an opinion on the need for supporting ratio. I remember David prefers ratio[1]. > + This option must be used with <literal>ON_ERROR</literal> to be > set to > + other than <literal>stop</literal>. > > Regarding the ON_ERROR option, now it only has two values. > Instead of saying "other than stop," we should simply use "ignore" > for clarity and intuition? I'll Modify it. The reason for the roundabout wording was the expectation that on_error would support values other than these in the future, but as you point out, there are currently only two. > + When specified <literal>INFINITY</literal>, > <command>COPY</command> ignores all > + the errors. This is a synonym for <literal>ON_ERROR</literal> > <literal>ignore</literal>. > > For the INFINITY value, the description is a bit unclear. > As I understand it, INFINITY is the default for REJECT_LIMIT. > So, if ON_ERROR=ignore is set without specifying REJECT_LIMIT, > COPY will ignore all errors, similar to when REJECT_LIMIT=INFINITY is > used. > > > In line with my previous suggestion, if we support only REJECT_LIMIT > without ON_ERROR, having INFINITY makes sense. However, > with ON_ERROR supported, REJECT_LIMIT=INFINITY seems redundant. > Users can just set ON_ERROR=ignore to ignore all errors, > making INFINITY unnecessary for REJECT_LIMIT. This is open for > discussion, but I believe it's better to remove INFINITY from > the REJECT_LIMIT options. Agreed. Unless there are opposing opinions, I'm going to remove 'INFINITY' in the next patch. > + else if (strcmp(defel->defname, "reject_limit") == 0) > + { > + if (reject_limit_specified) > + errorConflictingDefElem(defel, pstate); > + if (!opts_out->on_error) > + ereport(ERROR, > + > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("REJECT_LIMIT > requires ON_ERROR to be set to other than stop"))); > > Using "ignore" instead of "other than stop" in the error message > is clearer and more intuitive. Agreed. > Checking if ON_ERROR and REJECT_LIMIT are specified should be > done after the foreach() processing of the options. Otherwise, > if REJECT_LIMIT is specified before ON_ERROR, unexpected errors can > occur. > > --------------- > =# copy test from '...' WITH (REJECT_LIMIT 7, ON_ERROR 'ignore'); > ERROR: REJECT_LIMIT requires ON_ERROR to be set to other than stop > --------------- Ugh, I'll modify it. > + ereport(ERROR, > + > (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), > + errmsg("exceeded the > number specified by REJECT_LIMIT \"%lld\"", > + (long > long) cstate->opts.reject_limits.num_err))); > > The error message isn't clear about what exceeded REJECT_LIMIT. > How about saying "skipped more than REJECT_LIMIT rows" or something > instead? Agreed. > +/* > + * A struct to hold reject_limit options, in a parsed form. > + * More values to be added in another patch. > + */ > > The latter comment doesn't seem necessary or helpful. Agreed. [1] https://www.postgresql.org/message-id/CAKFQuwYP91_G6tktYFTZq_CmkZ_%3DzuWjkz1%2B25Nd8bpsrDkx5Q%40mail.gmail.com -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
On 2024-07-23 02:06, Kirill Reshke wrote: Thanks for your review. > Few comments: > >> + When a positive integer value is specified, >> <command>COPY</command> limits >> + the maximum tolerable number of errors while converting a >> column's input >> + value into its data type. > > If nothing is specified, then the maximum tolerable number of errors > is one, right? Should we state this explicitly in the documentation? REJECT_LIMIT now can be used wonly when on_error=ignore, I think the default(when only on_error=ignore is specified) is unlimited. Anyway, I'm going to add a description about the default. >> +COPY x from stdin with (on_error ignore, reject_limit 0); > How about a test where reject_limit is a string, but not > (case-intensively) 'infinity'? Considering the discussion in[1], I'm now going to remove 'infinity'. >> + CopyRejectLimits reject_limits; /* thresholds of reject_limit */ > > Why are there multiple thresholds? Can we have only one? This is because I thought it'd be more convenient to support both the number and the ratio of error, but I'm also beginning to think that it might be better to support only the number of cases, as discussed in [1]. [1]https://www.postgresql.org/message-id/5f807fcf3a36df7ba41464ab40b5c37d%40oss.nttdata.com -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
On 2024/09/24 14:25, torikoshia wrote: > Updated the patch. Thanks for updating the patch! + REJECT_LIMIT { <replaceable class="parameter">integer</replaceable> } The curly braces {} seem unnecessary here. + When a positive integer value is specified, <command>COPY</command> limits + the maximum tolerable number of errors while converting a column's input + value into its data type. + If input data caused more errors than the specified value, entire + <command>COPY</command> fails. + Otherwise, <command>COPY</command> discards the input row and continues + with the next one. + This option must be used with <literal>ON_ERROR</literal> to be set to + <literal>ignore</literal>. + Just setting <literal>ON_ERROR</literal> to <literal>ignore</literal> + tolerates unlimited number of errors. Regarding the description of REJECT_LIMIT, how about clarifying what the option specifies up front, like this: ---------------- Specifies the maximum number of errors tolerated while converting a column's input value to its data type, when on_error is set to "ignore." If the input causes more errors than the specified value, the COPY command fails, even with on_error set to "ignore." This value must be positive and used with on_error="ignore". If not specified, on_error="ignore" allows an unlimited number of errors, meaning COPY will skip all erroneous data. ---------------- + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("number for REJECT_LIMIT must be greater than zero"))); + } + else + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("value for REJECT_LIMIT must be positive integer"))); The phrases "number for" and "value for" seem unnecessary in the error messages. Also, "positive number" should be "a positive number." Would it be better to display the actual specified value in the error message, like: errmsg("REJECT_LIMIT (%s) must be a positive number", value)? Lastly, during my testing, if nothing is specified for REJECT_LIMIT (e.g., COPY ... WITH (..., REJECT_LIMIT)), the COPY command caused a segmentation fault. >> I'd like to hear if anyone has an opinion on the need for supporting ratio. > > Since there was no opinion about it, attached a patch only for REJECT_LIMIT specifying number. +1 > As there are no opposing opinions, removed 'INFINITY' as well. +1 Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
+/* + * Extract REJECT_LIMIT value from a DefElem. + */ +static int64 +defGetCopyRejectLimitOptions(DefElem *def) +{ + int64 reject_limit; + + if (def->arg == NULL) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("REJECT_LIMIT requires a positive integer"))); + + if (nodeTag(def->arg) == T_Integer) + { + reject_limit = defGetInt64(def); + if (reject_limit <= 0) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("REJECT_LIMIT (%lld) must be greater than zero", + (long long) reject_limit))); + } + else + { + char *sval = defGetString(def); + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("REJECT_LIMIT (%s) must be a positive integer", + sval))); + } + + return reject_limit; +} there will be an integer overflow issue? Can you try the following? static int64 defGetCopyRejectLimitOptions(DefElem *def) { int64 reject_limit; reject_limit = defGetInt64(def); if (reject_limit <= 0) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("REJECT_LIMIT (%lld) must be greater than zero", (long long) reject_limit))); } REJECT_LIMIT <replaceable class="parameter">integer</replaceable> i think, you want REJECT_LIMIT be bigint? so here it should be REJECT_LIMIT <replaceable class="parameter">bigint</replaceable>\ ?
> I'm wondering if we can use the wording maxerror as in the attached > patch. > + <varlistentry> + <term><literal>REJECT_LIMIT</literal></term> + <listitem> + <para> + Specifies the maximum number of errors tolerated while converting a + column's input value to its data type, when <literal>ON_ERROR</literal> is + set to <literal>ignore</literal>. + If the input causes more errors than the specified value, the <command>COPY</command> + command fails, even with <literal>ON_ERROR</literal> set to <literal>ignore</literal>. + This clause must be used with <literal>ON_ERROR</literal>=<literal>ignore</literal> + and <replaceable class="parameter">maxerror</replaceable> must be positive. + If not specified, <literal>ON_ERROR</literal>=<literal>ignore</literal> + allows an unlimited number of errors, meaning <command>COPY</command> will + skip all erroneous data. + </para> + </listitem> + </varlistentry> mentioning <replaceable class="parameter">maxerror</replaceable> is a bigint type or explicitly mentioning the maximum allowed value of "maxerror" would be great. other than that, it looks good to me.
On 2024/09/30 21:08, torikoshia wrote: > Since defGetInt64() checks not only whether the input value exceeds the range of bigint but also input value is specified,attached v6 patch checks them by directly calling defGetInt64(). Thanks for updating the patch! But the patch no longer applied cleanly to the master branch. Could you rebase it? + if (opts_out->reject_limit && !opts_out->on_error) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + /*- translator: first and second %s are the names of COPY + * option, e.g. ON_ERROR, third is the value of the COPY option, + * e.g. IGNORE */ + errmsg("COPY %s requires %s to be set to %s", + "REJECT_LIMIT", "ON_ERROR", "IGNORE"))); This check might be better moved to other place, for example at the bottom of ProcessCopyOptions(). I noticed a comment in the code: "Check for incompatible options (must do these two before inserting defaults)." You added your condition after this comment and before inserting the defaults, but it's not related to that process. So, moving this check to other place seems more appropriate. While reviewing, I also noticed that the check for "opts_out->binary && opts_out->on_error != COPY_ON_ERROR_STOP" is similarly placed before setting the defaults, which might not be correct. This check should probably be moved as well. Additionally, the comment mentioning "must do these two" should be updated to "must do these three." These changes should be handled in a separate patch. + if (cstate->opts.reject_limit && Wouldn't it be clearer to use cstate->opts.reject_limit > 0 for better readability? + cstate->num_errors > cstate->opts.reject_limit) + ereport(ERROR, + (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + errmsg("skipped more than REJECT_LIMIT rows: \"%lld\",", + (long long) cstate->opts.reject_limit))); To make the error message clearer, similar to the NOTICE about skipped rows, how about rewording it to: "skipped more than REJECT_LIMIT (%lld) rows due to data type incompatibility"? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2024/10/07 21:51, torikoshia wrote: >> While reviewing, I also noticed that the check for >> "opts_out->binary && opts_out->on_error != COPY_ON_ERROR_STOP" >> is similarly placed before setting the defaults, which might not >> be correct. This check should probably be moved as well. >> Additionally, the comment mentioning "must do these two" should be >> updated to "must do these three." These changes should be handled >> in a separate patch. > > Agreed and attached 0002 patch. Thanks for updating the 0001 patch and creating the 0002 patch! I've pushed both. > Also considering when REJECT_LIMIT is specified to 1, attached patch uses errmsg_plural() instead of errmsg. I don't think errmsg_plural() is needed here since, when 1 is specified, "rows" should follow "more than REJECT_LIMIT (1)". No? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2024-10-08 18:39, Fujii Masao wrote: > On 2024/10/07 21:51, torikoshia wrote: >>> While reviewing, I also noticed that the check for >>> "opts_out->binary && opts_out->on_error != COPY_ON_ERROR_STOP" >>> is similarly placed before setting the defaults, which might not >>> be correct. This check should probably be moved as well. >>> Additionally, the comment mentioning "must do these two" should be >>> updated to "must do these three." These changes should be handled >>> in a separate patch. >> >> Agreed and attached 0002 patch. > > Thanks for updating the 0001 patch and creating the 0002 patch! I've > pushed both. Thanks a lot! > > >> Also considering when REJECT_LIMIT is specified to 1, attached patch >> uses errmsg_plural() instead of errmsg. > > I don't think errmsg_plural() is needed here since, when 1 is > specified, > "rows" should follow "more than REJECT_LIMIT (1)". No? You are right. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation