Re: New "raw" COPY format - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: New "raw" COPY format
Date
Msg-id CAD21AoA1P1JNrHxHRmzJE=GDaUEm79pYYDvG9nPzQaeJLXiwsw@mail.gmail.com
Whole thread Raw
In response to Re: New "raw" COPY format  (Tatsuo Ishii <ishii@postgresql.org>)
List pgsql-hackers
On Mon, Oct 28, 2024 at 3:21 AM Joel Jacobson <joel@compiler.org> wrote:
>
> On Mon, Oct 28, 2024, at 10:30, Joel Jacobson wrote:
> > On Mon, Oct 28, 2024, at 08:56, jian he wrote:
> >>   /* Check force_quote */
> >> - if (!opts_out->csv_mode && (opts_out->force_quote ||
> >> opts_out->force_quote_all))
> >> + if (opts_out->format != COPY_FORMAT_CSV && (opts_out->force_quote ||
> >> + opts_out->force_quote_all))
> >>   ereport(ERROR,
> >>   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> >>   /*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
> >>
> >> maybe this has a code indentation issue.
> >> since "if" and "opts_out" in the same column position.
> >
> > Thanks for review.
> >
> > I've fixed the indentation issues.
>
> I've now installed pgindent, and will use it from hereon, to avoid this class of problems.
>
> New version where all three patches are now indented using pgindent.

Thank you for updating the patch. Here are review comments on the v15
0002 patch:

When testing the patch with an empty delimiter, I got the following failure:

postgres(1:903898)=# copy hoge from '/tmp/tmp.raw' with (format 'raw',
delimiter '');
TRAP: failed Assert("delim_len > 0"), File: "copyfromparse.c", Line:
1173, PID: 903898

---
-           else
+           else if (cstate->opts.format == COPY_FORMAT_TEXT)
                fldct = CopyReadAttributesText(cstate);
+           else
+           {
+               elog(ERROR, "unexpected COPY format: %d", cstate->opts.format);
+               pg_unreachable();
+           }

Since we already check the incompatible options with COPY_FORMAT_RAW
and default_print, I think it's better to add an assertion to make
sure the format is either COPY_FORMAT_CSV or COPY_FORMAT_TEXT, instead
of using elog(ERROR).

---
+/*
+ * CopyReadLineRawText - inner loop of CopyReadLine for raw text mode
+ */
+static bool
+CopyReadLineRawText(CopyFromState cstate)

This function has a lot of duplication with CopyReadLineText(). I
think it's better to modify CopyReadLineText() to support 'raw'
format, rather than adding a separate function.

---
+   bool        read_entire_file = (cstate->opts.delim == NULL);
+   int         delim_len = cstate->opts.delim ? strlen(cstate->opts.delim) : 0;

I think we can use 'delim_len == 0' instead of read_entire_file.

---
+       if (read_entire_file)
+       {
+           /* Continue until EOF if reading entire file */
+           input_buf_ptr++;
+           continue;
+       }

In the case where we're reading the entire file as a single tuple, we
don't need to advance the input_buf_ptr one by one. Instead,
input_buf_ptr can jump to copy_buf_len, which is faster.

---
+           /* Check for delimiter, possibly multi-byte */
+           IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(delim_len - 1);
+           if (strncmp(©_input_buf[input_buf_ptr], cstate->opts.delim,
+                       delim_len) == 0)
+           {
+               cstate->eol_type = EOL_CUSTOM;
+               input_buf_ptr += delim_len;
+               break;
+           }
+           input_buf_ptr++;

Similar to the above comment, I think we don't need to check the char
one by one. I guess that it would be faster if we locate the delimiter
string in the intput_buf (e.g. using strstr()), and then move
input_buf_ptr to the detected position.

---
+   /* Copy the entire line into attribute_buf */
+   memcpy(cstate->attribute_buf.data, cstate->line_buf.data,
+          cstate->line_buf.len);
+   cstate->attribute_buf.data[cstate->line_buf.len] = '\0';
+   cstate->attribute_buf.len = cstate->line_buf.len;

The CopyReadAttributesRaw() just copies line_buf data to
attirbute_buf, which seems to be a waste. I think we can have
attribute_buf point to the line_buf. That way, we can skip the whole
step 4 that is described in the comment on top o f copyfromparse.c:

* [data source] --> raw_buf --> input_buf --> line_buf --> attribute_buf
*                1.          2.            3.           4.

---
+static int
+CopyReadAttributesRaw(CopyFromState cstate)
+{
+   /* Enforce single column requirement */
+   if (cstate->max_fields != 1)
+   {
+       ereport(ERROR,
+               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                errmsg("COPY with format 'raw' must specify exactly
one column")));
+   }

This check should have already been done in BeginCopyFrom(). Is there
any case where max_fields gets to != 1 during reading the input?

---
It's a bit odd to me to use the delimiter as a EOL marker in raw
format, but probably it's okay.

---
-           if (cstate->opts.format != COPY_FORMAT_BINARY)
+           if (cstate->opts.format == COPY_FORMAT_RAW &&
+               cstate->opts.delim != NULL)
+           {
+               /* Output the user-specified delimiter between rows */
+               CopySendString(cstate, cstate->opts.delim);
+           }
+           else if (cstate->opts.format == COPY_FORMAT_TEXT ||
+                    cstate->opts.format == COPY_FORMAT_CSV)

Since it sends the delimiter as a string, even if we specify the
delimiter to '\n', it doesn't send the new line (i.e. ASCII LF, 10).
For example,

postgres(1:904427)=# copy (select '{"j" : 1}'::jsonb) to stdout with
(format 'raw', delimiter '\n');
{"j": 1}\npostgres(1:904427)=#

I think there is a similar problem in COPY FROM; if we set a delimiter
to '\n' when doing COPY FROM in raw format, it expects the string '\n'
as a line termination but not ASCII LF(10). I think that input data
normally doesn't use the string '\n' as a line termination.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Bernd Helmle
Date:
Subject: Re: [PATCH] Add sortsupport for range types and btree_gist
Next
From: Alexander Lakhin
Date:
Subject: Re: Better error reporting from extension scripts (Was: Extend ALTER OPERATOR)