Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features) - Mailing list pgsql-hackers
From: jian he
Subject: Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Date:
Msg-id: CACJufxHi53OpGYPAe6SdCb4m=-+H8L+7LDbUWvTiJp=V4YYEqA@mail.gmail.com
In response to: Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features) (Masahiko Sawada <sawada.mshk@gmail.com>)
List: pgsql-hackers
On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> I've read this thread and the latest patch. IIUC with SAVE_ERROR
> option, COPY FROM creates an error table for the target table and
> writes error information there.
>
> While I agree that the final shape of this feature would be something
> like that design, I'm concerned some features are missing in order to
> make this feature useful in practice. For instance, error logs are
> inserted to error tables without bounds, meaning that users who want
> to tolerate errors during COPY FROM will have to truncate or drop the
> error tables periodically, or the database will grow with error logs
> without limit. Ideally such maintenance work should be done by the
> database. There might be some users who want to log such conversion
> errors in server logs to avoid such maintenance work. I think we
> should provide an option for where to write, at least. Also, since the
> error tables are normal user tables internally, error logs are also
> replicated to subscribers if there is a publication FOR ALL TABLES,
> unlike system catalogs. I think some users would not like such
> behavior.

Saving the error metadata to system catalogs would be more expensive; please see the explanation below.

I have no knowledge of publications, but I feel there is a feature request hiding here: publication FOR ALL TABLES with an exclude regex_pattern. Anyway, that would be another topic.

> Looking at SAVE_ERROR feature closely, I think it consists of two
> separate features. That is, it enables COPY FROM to load data while
> (1) tolerating errors and (2) logging errors to somewhere (i.e., an
> error table). If we implement only (1), it would be like COPY FROM
> tolerate errors infinitely and log errors to /dev/null. The user
> cannot see the error details but I guess it could still help some
> cases as Andres mentioned[1] (it might be a good idea to send the
> number of rows successfully loaded in a NOTICE message if some rows
> could not be loaded). Then with (2), COPY FROM can log error
> information to somewhere such as tables and server logs and the user
> can select it. So I'm thinking we may be able to implement this
> feature incrementally. The first step would be something like an
> option to ignore all errors or an option to specify the maximum number
> of errors to tolerate before raising an ERROR. The second step would
> be to support logging destinations such as server logs and tables.
>
> Regards,
>
> [1] https://www.postgresql.org/message-id/20231109002600.fuihn34bjqqgmbjm%40awork3.anarazel.de
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com
>

> feature incrementally. The first step would be something like an
> option to ignore all errors or an option to specify the maximum number
> of errors to tolerate before raising an ERROR. The second step would

I don't think "specify the maximum number of errors to tolerate before raising an ERROR" is very useful.

QUOTE from [1]:
MAXERROR [AS] error_count
If the load returns the error_count number of errors or greater, the load fails. If the load returns fewer errors, it continues and returns an INFO message that states the number of rows that could not be loaded. Use this parameter to allow loads to continue when certain rows fail to load into the table because of formatting errors or other inconsistencies in the data. Set this value to 0 or 1 if you want the load to fail as soon as the first error occurs. The AS keyword is optional. The MAXERROR default value is 0 and the limit is 100000. The actual number of errors reported might be greater than the specified MAXERROR because of the parallel nature of Amazon Redshift. If any node in the Amazon Redshift cluster detects that MAXERROR has been exceeded, each node reports all of the errors it has encountered.
END OF QUOTE
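For illustration only, the Redshift option quoted above is used roughly like this (table name, S3 path, and IAM role are made-up placeholders, not from this thread):

    -- hypothetical Redshift example: tolerate up to 10 bad rows, then fail
    COPY orders
    FROM 's3://my-bucket/orders.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-load-role'
    CSV
    MAXERROR 10;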
With the MAXERROR error_count option, IIUC, the load fails while validating line error_count + 1; otherwise it raises a notice telling you how many rows have errors.

* Case: error_count is small and the copy fails. It only tells you that at least error_count lines contain malformed data; if the actual number of malformed rows is much larger, this failure message is not very helpful.
* Case: error_count is very big and the copy does not fail. The actual number of malformed rows can still be very large (just below error_count), but there is no error report, so you don't know which lines have errors.

Either way, if the file has a large portion of malformed rows, the MAXERROR option does not make much sense, so maybe we don't need a threshold for tolerating errors. However, we could have an option that does not actually copy into the table but only validates, similar to NOLOAD in [1].

Why we save the errors:
* If only a small portion of the rows is malformed, saving the error metadata is cheap.
* If a large portion of the rows is malformed, the copy will be slow, but we have saved the error metadata, and you can fix the data based on it.

I think saving errors to a regular table or a text file seems sane, but not to a catalog table:
* For a text file with M rows and N fields, a contrived corner case would be (M-2) * N errors, where the last 2 rows have duplicate keys and violate the primary key constraint. In this case we would first insert (M-2) * N rows into the catalog table and then, because of the error, undo them all. I think that would be expensive.
* The error meta info is not as important as the other pg_catalog tables.

The log format is quite verbose, so save_error to the server log does not seem so good, I guess.

I suppose we could specify an ERRORFILE directory, similar to the implementation in [2] (demo in [3]). It generates 2 files: one shows the malformed line content as is, the other shows the error info.

Let's assume we save the error info to a table. The previous thread said that having one copy operation create one error table is not a good idea, and looking back, I agree. Similar to [4], I came up with the following logic/ideas:

* The save_error table name would be COPY_ERRORS, in the same schema as the COPY FROM destination table.
* One COPY_ERRORS table saves the error metadata generated by all COPY FROM operations.
* If save_error is specified, then before doing the COPY FROM, first check whether the table COPY_ERRORS exists; if not, create one? Or raise an error saying that COPY_ERRORS does not exist, so save_error cannot be used?
* The COPY_ERRORS table owner would be the current database owner?
* Only the table owner is allowed to INSERT/DELETE/UPDATE; others are not. When an error happens while doing the copy, record the userid, then switch to the COPY_ERRORS owner to execute the insert command.
* The user who is doing the COPY FROM operation is allowed only to view (SELECT) the errored rows they generated (see the example after the column list below).

The COPY_ERRORS table would be:

userid oid                /* the user who is doing this operation */
error_time timestamptz    /* when this error happened; not 100% sure this column is needed */
filename text             /* the copy from source */
table_name text           /* the copy from destination */
lineno bigint             /* the error line number */
line text                 /* the whole raw line content */
colname text              /* the field with the error */
raw_field_value text      /* the value of the field that leads to the error */
err_message text          /* same as ErrorData->message */
err_detail text           /* same as ErrorData->detail */
errorcode text            /* transformed errcode, for example "22P02" */
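Purely as an illustration of that last point (the table and names are hypothetical, not from any patch), a user could inspect the error rows produced by their own COPY FROM runs with something like:

    -- assumes the COPY_ERRORS layout listed above and a destination
    -- table called my_target_table
    SELECT lineno, colname, raw_field_value, err_message, errorcode
    FROM copy_errors
    WHERE userid = (SELECT oid FROM pg_roles WHERE rolname = current_user)
      AND table_name = 'my_target_table'
    ORDER BY lineno;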
[1] https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-load.html
[2] https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-ver16
[3] https://www.sqlshack.com/working-with-line-numbers-and-errors-using-bulk-insert/
[4] https://docs.aws.amazon.com/redshift/latest/dg/r_STL_LOAD_ERRORS.html