Re: New "raw" COPY format - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: New "raw" COPY format |
Date | |
Msg-id | CAD21AoA1P1JNrHxHRmzJE=GDaUEm79pYYDvG9nPzQaeJLXiwsw@mail.gmail.com Whole thread Raw |
In response to | Re: New "raw" COPY format (Tatsuo Ishii <ishii@postgresql.org>) |
List | pgsql-hackers |
On Mon, Oct 28, 2024 at 3:21 AM Joel Jacobson <joel@compiler.org> wrote: > > On Mon, Oct 28, 2024, at 10:30, Joel Jacobson wrote: > > On Mon, Oct 28, 2024, at 08:56, jian he wrote: > >> /* Check force_quote */ > >> - if (!opts_out->csv_mode && (opts_out->force_quote || > >> opts_out->force_quote_all)) > >> + if (opts_out->format != COPY_FORMAT_CSV && (opts_out->force_quote || > >> + opts_out->force_quote_all)) > >> ereport(ERROR, > >> (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > >> /*- translator: %s is the name of a COPY option, e.g. ON_ERROR */ > >> > >> maybe this has a code indentation issue. > >> since "if" and "opts_out" in the same column position. > > > > Thanks for review. > > > > I've fixed the indentation issues. > > I've now installed pgindent, and will use it from hereon, to avoid this class of problems. > > New version where all three patches are now indented using pgindent. Thank you for updating the patch. Here are review comments on the v15 0002 patch: When testing the patch with an empty delimiter, I got the following failure: postgres(1:903898)=# copy hoge from '/tmp/tmp.raw' with (format 'raw', delimiter ''); TRAP: failed Assert("delim_len > 0"), File: "copyfromparse.c", Line: 1173, PID: 903898 --- - else + else if (cstate->opts.format == COPY_FORMAT_TEXT) fldct = CopyReadAttributesText(cstate); + else + { + elog(ERROR, "unexpected COPY format: %d", cstate->opts.format); + pg_unreachable(); + } Since we already check the incompatible options with COPY_FORMAT_RAW and default_print, I think it's better to add an assertion to make sure the format is either COPY_FORMAT_CSV or COPY_FORMAT_TEXT, instead of using elog(ERROR). --- +/* + * CopyReadLineRawText - inner loop of CopyReadLine for raw text mode + */ +static bool +CopyReadLineRawText(CopyFromState cstate) This function has a lot of duplication with CopyReadLineText(). I think it's better to modify CopyReadLineText() to support 'raw' format, rather than adding a separate function. --- + bool read_entire_file = (cstate->opts.delim == NULL); + int delim_len = cstate->opts.delim ? strlen(cstate->opts.delim) : 0; I think we can use 'delim_len == 0' instead of read_entire_file. --- + if (read_entire_file) + { + /* Continue until EOF if reading entire file */ + input_buf_ptr++; + continue; + } In the case where we're reading the entire file as a single tuple, we don't need to advance the input_buf_ptr one by one. Instead, input_buf_ptr can jump to copy_buf_len, which is faster. --- + /* Check for delimiter, possibly multi-byte */ + IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(delim_len - 1); + if (strncmp(©_input_buf[input_buf_ptr], cstate->opts.delim, + delim_len) == 0) + { + cstate->eol_type = EOL_CUSTOM; + input_buf_ptr += delim_len; + break; + } + input_buf_ptr++; Similar to the above comment, I think we don't need to check the char one by one. I guess that it would be faster if we locate the delimiter string in the intput_buf (e.g. using strstr()), and then move input_buf_ptr to the detected position. --- + /* Copy the entire line into attribute_buf */ + memcpy(cstate->attribute_buf.data, cstate->line_buf.data, + cstate->line_buf.len); + cstate->attribute_buf.data[cstate->line_buf.len] = '\0'; + cstate->attribute_buf.len = cstate->line_buf.len; The CopyReadAttributesRaw() just copies line_buf data to attirbute_buf, which seems to be a waste. I think we can have attribute_buf point to the line_buf. That way, we can skip the whole step 4 that is described in the comment on top o f copyfromparse.c: * [data source] --> raw_buf --> input_buf --> line_buf --> attribute_buf * 1. 2. 3. 4. --- +static int +CopyReadAttributesRaw(CopyFromState cstate) +{ + /* Enforce single column requirement */ + if (cstate->max_fields != 1) + { + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("COPY with format 'raw' must specify exactly one column"))); + } This check should have already been done in BeginCopyFrom(). Is there any case where max_fields gets to != 1 during reading the input? --- It's a bit odd to me to use the delimiter as a EOL marker in raw format, but probably it's okay. --- - if (cstate->opts.format != COPY_FORMAT_BINARY) + if (cstate->opts.format == COPY_FORMAT_RAW && + cstate->opts.delim != NULL) + { + /* Output the user-specified delimiter between rows */ + CopySendString(cstate, cstate->opts.delim); + } + else if (cstate->opts.format == COPY_FORMAT_TEXT || + cstate->opts.format == COPY_FORMAT_CSV) Since it sends the delimiter as a string, even if we specify the delimiter to '\n', it doesn't send the new line (i.e. ASCII LF, 10). For example, postgres(1:904427)=# copy (select '{"j" : 1}'::jsonb) to stdout with (format 'raw', delimiter '\n'); {"j": 1}\npostgres(1:904427)=# I think there is a similar problem in COPY FROM; if we set a delimiter to '\n' when doing COPY FROM in raw format, it expects the string '\n' as a line termination but not ASCII LF(10). I think that input data normally doesn't use the string '\n' as a line termination. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: