Thread: Fixing backslash dot for COPY FROM...CSV
Hi, PFA a patch that attempts to fix the bug that \. on a line by itself is handled incorrectly by COPY FROM ... CSV. This issue has been discussed several times previously, for instance in [1] and [2], and mentioned in the doc for \copy in commit 42d3125. There's one case that works today: when the line is part of a multi-line quoted section, and the data is read from a file, not from the client. In other situations, an error is raised or the data is cut at the point of \. without an error. The patch addresses that issue in the server and in psql, except for the case of inlined data, where \. cannot be both valid data and an EOF marker at the same time, so it keeps treating it as an EOF marker. [1] https://www.postgresql.org/message-id/10e3eff6-eb04-4b3f-aeb4-b920192b977a@manitou-mail.org [2] https://www.postgresql.org/message-id/8aeab305-5e94-4fa5-82bf-6da6baee6e05%40app.fastmail.com Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Attachment
On Tue, 19 Dec 2023 at 02:06, Daniel Verite <daniel@manitou-mail.org> wrote: > > Hi, > > PFA a patch that attempts to fix the bug that \. on a line > by itself is handled incorrectly by COPY FROM ... CSV. > This issue has been discussed several times previously, > for instance in [1] and [2], and mentioned in the > doc for \copy in commit 42d3125. > > There's one case that works today: when > the line is part of a multi-line quoted section, > and the data is read from a file, not from the client. > In other situations, an error is raised or the data is cut at > the point of \. without an error. > > The patch addresses that issue in the server and in psql, > except for the case of inlined data, where \. cannot be > both valid data and an EOF marker at the same time, so > it keeps treating it as an EOF marker. I noticed that these tests are passing without applying patch too: +++ b/src/test/regress/sql/copy.sql @@ -38,6 +38,17 @@ copy copytest2 from :'filename' csv quote '''' escape E'\\'; select * from copytest except select * from copytest2; +--- test unquoted .\ as data inside CSV + +truncate copytest2; + +insert into copytest2(test) values('line1'), ('\.'), ('line2'); +copy (select test from copytest2 order by test collate "C") to :'filename' csv; +-- get the data back in with copy +truncate copytest2; +copy copytest2(test) from :'filename' csv; +select test from copytest2 order by test collate "C"; I was not sure if this was intentional. Can we add a test which fails in HEAD and passes with the patch applied. Regards, Vignesh
vignesh C wrote: > I noticed that these tests are passing without applying patch too: > +insert into copytest2(test) values('line1'), ('\.'), ('line2'); > +copy (select test from copytest2 order by test collate "C") to :'filename' > csv; > +-- get the data back in with copy > +truncate copytest2; > +copy copytest2(test) from :'filename' csv; > +select test from copytest2 order by test collate "C"; > > I was not sure if this was intentional. Can we add a test which fails > in HEAD and passes with the patch applied. Thanks for checking this out. Indeed, that was not intentional. I've been using static files in my tests and forgot that if the data was produced with COPY OUT, it would quote backslash-dot so that COPY IN could reload it without problem. PFA an updated version that uses \qecho to produce the data instead of COPY OUT. This test on unpatched HEAD shows that copytest2 is missing 2 rows after COPY IN. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Attachment
On Tue, 19 Dec 2023 at 16:57, Daniel Verite <daniel@manitou-mail.org> wrote: > > vignesh C wrote: > > > I noticed that these tests are passing without applying patch too: > > > +insert into copytest2(test) values('line1'), ('\.'), ('line2'); > > +copy (select test from copytest2 order by test collate "C") to :'filename' > > csv; > > +-- get the data back in with copy > > +truncate copytest2; > > +copy copytest2(test) from :'filename' csv; > > +select test from copytest2 order by test collate "C"; > > > > I was not sure if this was intentional. Can we add a test which fails > > in HEAD and passes with the patch applied. > > Thanks for checking this out. > Indeed, that was not intentional. I've been using static files > in my tests and forgot that if the data was produced with > COPY OUT, it would quote backslash-dot so that COPY IN could > reload it without problem. > > PFA an updated version that uses \qecho to produce the > data instead of COPY OUT. This test on unpatched HEAD > shows that copytest2 is missing 2 rows after COPY IN. Thanks for the updated patch, any reason why this is handled only in csv. postgres=# copy test1 from '/home/vignesh/postgres/inst/bin/copy1.out'; COPY 1 postgres=# select * from test1; c1 ------- line1 (1 row) postgres=# copy test1 from '/home/vignesh/postgres/inst/bin/copy1.out' csv; COPY 1 postgres=# select * from test1; c1 ------- line1 \. line2 (3 rows) As the documentation at [1] says: An end-of-data marker is not necessary when reading from a file, since the end of file serves perfectly well; it is needed only when copying data to or from client applications using pre-3.0 client protocol. [1] - https://www.postgresql.org/docs/devel/sql-copy.html Regards, Vignesh
vignesh C wrote: > Thanks for the updated patch, any reason why this is handled only in csv. > postgres=# copy test1 from '/home/vignesh/postgres/inst/bin/copy1.out'; > COPY 1 > postgres=# select * from test1; > c1 > ------- > line1 > (1 row) I believe it's safer to not change anything to the normal "non-csv" text mode. The current doc says that \. will not be taken as data in this format. From https://www.postgresql.org/docs/current/sql-copy.html : Any other backslashed character that is not mentioned in the above table will be taken to represent itself. However, beware of adding backslashes unnecessarily, since that might accidentally produce a string matching the end-of-data marker (\.) or the null string (\N by default). These strings will be recognized before any other backslash processing is done. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On Fri, 22 Dec 2023 at 01:17, Daniel Verite <daniel@manitou-mail.org> wrote: > > vignesh C wrote: > > > Thanks for the updated patch, any reason why this is handled only in csv. > > postgres=# copy test1 from '/home/vignesh/postgres/inst/bin/copy1.out'; > > COPY 1 > > postgres=# select * from test1; > > c1 > > ------- > > line1 > > (1 row) > > I believe it's safer to not change anything to the normal "non-csv" > text mode. > The current doc says that \. will not be taken as data in this format. > From https://www.postgresql.org/docs/current/sql-copy.html : > > Any other backslashed character that is not mentioned in the above > table will be taken to represent itself. However, beware of adding > backslashes unnecessarily, since that might accidentally produce a > string matching the end-of-data marker (\.) or the null string (\N > by default). These strings will be recognized before any other > backslash processing is done. Thanks for the clarification. Then let's keep it as you have implemented. Regards, Vignesh
Hi, The CI patch tester fails on this patch, because it has a label at the end of a C block, which I'm learning is a C23 feature that happens to be supported by gcc 11 [1], but is not portable. PFA an update fixing this, plus removing an obsolete chunk in the COPY documentation that v2 left out. [1] https://gcc.gnu.org/onlinedocs/gcc/Mixed-Labels-and-Declarations.html Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Attachment
On Mon, Dec 18, 2023 at 3:36 PM Daniel Verite <daniel@manitou-mail.org> wrote: > PFA a patch that attempts to fix the bug that \. on a line > by itself is handled incorrectly by COPY FROM ... CSV. > This issue has been discussed several times previously, > for instance in [1] and [2], and mentioned in the > doc for \copy in commit 42d3125. Those links unfortunately seem not to be entirely specific to this issue. Other, related things seem to be discussed there, and it's not obvious that everyone agrees on what to do, or really that anyone agrees on what to do. The best link that I found for this exact issue is https://www.postgresql.org/message-id/E1TdNVQ-0001ju-GO@wrigleys.postgresql.org but the thread isn't very conclusive and is so old that any conclusions reached then might no longer be considered valid today. And I guess the reason I mention is that, supposing your patch were technically perfect in every way, would everyone agree that it ought to be committed? If Alice is a user with a CSV file that might contain \. on a line by itself within a quoted CSV field, then Alice is currently sad because she can't necessarily load all of her CSV files. The patch would fix that, and make her happy. On the other hand, if Bob is a user with a CSV-ish file that definitely doesn't contain \. on a line by itself within a quoted CSV field but might have been truncated in the middle of a quoted field, maybe Bob will be sad if this patch gets committed, because he will no longer be able to append \n\.\n to whatever junk he's got in the file and let the server sort out whether to throw an error. I have to admit that it seems more likely that there are people in the world with Alice's problem rather than people in the world with Bob's problem. We'd probably make more people happy with the change than sad. But it is a definitional change, I think, and that's a little scary, and maybe somebody will think that's a reason why we should change nothing here. Part of my hesitancy, I suppose, is that I don't understand why we even have this strange convention of making \. terminate the input in the first place -- I mean, why wouldn't that be done in some kind of out-of-band way, rather than including a special marker in the data? I can't help feeling a bit nervous about this first documentation hunk: --- a/doc/src/sgml/ref/copy.sgml +++ b/doc/src/sgml/ref/copy.sgml @@ -761,11 +761,7 @@ COPY <replaceable class="parameter">count</replaceable> format, <literal>\.</literal>, the end-of-data marker, could also appear as a data value. To avoid any misinterpretation, a <literal>\.</literal> data value appearing as a lone entry on a line is automatically - quoted on output, and on input, if quoted, is not interpreted as the - end-of-data marker. If you are loading a file created by another - application that has a single unquoted column and might have a - value of <literal>\.</literal>, you might need to quote that value in the - input file. + quoted on output. </para> <note> It doesn't feel right to me to just replace all of this text with nothing. That leaves us documenting only the behavior on output, whereas the previous text documents both the output behavior (we quote it) and the input behavior (it has to be quoted to avoid being taken as the EOF marker). /* - * In CSV mode, we only recognize \. alone on a line. This is because - * \. is a valid CSV data value. + * In CSV mode, backslash is a normal character. */ - if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line)) + if (c == '\\' && !cstate->opts.csv_mode) I don't think that the update comment accurately describes the behavior. If I understand correctly what you're trying to fix, I'd expect the new behavior to be that we only recognize \. alone on a line and even then only if we're not inside a quoting string, but that's not what the revised comment says. Instead, it claims that backslash is just a normal character, but if that were true, the whole if-statement wouldn't exist at all, since its purpose is to provide special-handling for backslashes -- and indeed the patch does not change that, since the if-statement is still looking for a backslash and doing something special if one is found. Hmm. Looking at the rest of the patch, it seems like you're removing the logic that prevents us from interpreting \. lksdghksdhgjskdghjs as an end-of-file while in CSV mode. But I would have thought based on what problem you're trying to fix that you would have wanted to keep that logic and further restrict it so that it only applies when not within a quoted string. Maybe I'm misunderstanding what bug you're trying to fix? -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas wrote: > Part of my hesitancy, I suppose, is that I don't > understand why we even have this strange convention of making \. > terminate the input in the first place -- I mean, why wouldn't that be > done in some kind of out-of-band way, rather than including a special > marker in the data? The v3 protocol added the out-of-band method, but the v2 protocol did not have it, and as far as I understand, this is the reason why CopyReadLineText() must interpret \. as an end-of-data marker. The v2 protocol was removed in pg14 https://www.postgresql.org/docs/release/14.0/ <quote> Remove server and libpq support for the version 2 wire protocol (Heikki Linnakangas) This was last used as the default in PostgreSQL 7.3 (released in 2002). </quote> Also I hadnt' noticed this before, but the current doc has this mention that is relevant to this patch: https://www.postgresql.org/docs/current/protocol-changes.html "Summary of Changes since Protocol 2.0" <quote> COPY data is now encapsulated into CopyData and CopyDone messages. There is a well-defined way to recover from errors during COPY. The special “\.” last line is not needed anymore, and is not sent during COPY OUT. (It is still recognized as a terminator during COPY IN, but its use is deprecated and will eventually be removed.) </quote> What the present patch does is essentially, for the server-side part, stop recognizing "\." as as terminator, like this paragraph says, but it does that for CSV only, not for TEXT. > Hmm. Looking at the rest of the patch, it seems like you're removing > the logic that prevents us from interpreting > > \. lksdghksdhgjskdghjs > > as an end-of-file while in CSV mode. But I would have thought based on > what problem you're trying to fix that you would have wanted to keep > that logic and further restrict it so that it only applies when not > within a quoted string. > > Maybe I'm misunderstanding what bug you're trying to fix? The fix is that \. is no longer recognized as special in CSV, whether alone on a line or not, and whether in a quoted section or not. It's always interpreted as data, like it would have been in the first place, I imagine, if the v2 protocol could have handled it. This is why the patch consists mostly of removing code and simplifying comments. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Robert Haas wrote: > Those links unfortunately seem not to be entirely specific to this > issue. Other, related things seem to be discussed there, and it's not > obvious that everyone agrees on what to do, or really that anyone > agrees on what to do. The best link that I found for this exact issue > is > https://www.postgresql.org/message-id/E1TdNVQ-0001ju-GO@wrigleys.postgresql.org > but the thread isn't very conclusive and is so old that any > conclusions reached then might no longer be considered valid today. To refresh the problem statement, 4 cases that need fixing as of HEAD can be distinguished: #1. copy csv from file, single column, no quoting involved. COPY will stop at \. and ignore the rest of the file without any error or warning. $ cat >/tmp/file.csv <<EOF line1 \. line2 EOF $ psql <<EOF create table contents(t text); copy contents from '/tmp/file.csv' (format csv); table contents; EOF Results in t ------- line1 (1 row) The bug is that a single row is imported instead of the 3 rows of the file. #2. Same as the previous case, but with file_fdw $ psql <<EOF CREATE EXTENSION file_fdw; CREATE FOREIGN TABLE csv_data(line text) SERVER myserver OPTIONS (filename '/tmp/file.csv', format 'csv'); TABLE csv_data; EOF Results in: line ------- line1 (1 row) The bug is that rows 2 and 3 are missing, as in case #1. #3. \copy csv from file with \. inside a quoted multi-line section This is the case that the above linked report mentioned, resulting in a failure to import. In addition to being legal CSV, these contents can be produced by Postgres itself exporting in CSV. $ cat >/tmp/file-quoted.csv <<EOF line1 " \. " line2 EOF $ psql <<EOF create table contents(t text); \copy contents from '/tmp/file-quoted.csv' csv; EOF Results in an error: ERROR: unterminated CSV quoted field CONTEXT: COPY contents, line 4: "" \. " The expected result is that it imports 3 rows without error. #4. \copy csv from file, single column, no quoting involved This is the same case as #1 except it uses the client-server protocol. $ cat >/tmp/file.csv <<EOF line1 \. line2 EOF $ psql <<EOF create table contents(t text); \copy contents from '/tmp/file.csv' (format csv); TABLE contents; EOF Results in t ------- line1 (1 row) As in case #1, a single row is imported instead of 3 rows. The proposed patch addresses these cases by making the sequence \. non-special in CSV (in fact backslash itself is a normal character in CSV). It does not address the cases when the data is embedded after the COPY command or typed interactively in psql, since these cases require an explicit end-of-data marker, and CSV does not have the concept of an end-of-data marker. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
"Daniel Verite" <daniel@manitou-mail.org> writes: > Robert Haas wrote: >> Those links unfortunately seem not to be entirely specific to this >> issue. Other, related things seem to be discussed there, and it's not >> obvious that everyone agrees on what to do, or really that anyone >> agrees on what to do. > The proposed patch addresses these cases by making the sequence > \. non-special in CSV (in fact backslash itself is a normal character in > CSV). > It does not address the cases when the data is embedded after > the COPY command or typed interactively in psql, since these cases > require an explicit end-of-data marker, and CSV does not have > the concept of an end-of-data marker. I've looked over this patch and I generally agree that this is a reasonable solution. While it's barely possible that somebody out there is depending on the current behavior of \. in CSV mode, it seems considerably more likely that people who run into it would consider it a bug. In any case, the patch isn't proposed for back-patch, and as cross-version incompatibilities go this seems like a pretty small one. I concur with Robert's doubts about some of the doc changes though. In particular, since we're not changing the behavior for non-CSV mode, we shouldn't remove any statements that apply to non-CSV mode. I'm also wondering why the patch adds a test for "PQprotocolVersion(conn) >= 3" in handleCopyIn. As already noted upthread, we ripped out all support for protocol 2.0 some time ago, so that sure looks like dead code. regards, tom lane
Tom Lane wrote: > I've looked over this patch and I generally agree that this is a > reasonable solution. Thanks for reviewing this! > I'm also wondering why the patch adds a test for > "PQprotocolVersion(conn) >= 3" in handleCopyIn. I've removed this in the attached update. > I concur with Robert's doubts about some of the doc changes though. > In particular, since we're not changing the behavior for non-CSV > mode, we shouldn't remove any statements that apply to non-CSV mode. I don't quite understand the issues with the doc changes. Let me detail the changes. The first change is under <refsect2> <title>CSV Format</title> so it does no concern non-CSV modes. --- a/doc/src/sgml/ref/copy.sgml +++ b/doc/src/sgml/ref/copy.sgml @@ -761,11 +761,7 @@ COPY <replaceable class="parameter">count</replaceable> format, <literal>\.</literal>, the end-of-data marker, could also appear as a data value. To avoid any misinterpretation, a <literal>\.</literal> data value appearing as a lone entry on a line is automatically - quoted on output, and on input, if quoted, is not interpreted as the - end-of-data marker. If you are loading a file created by another - application that has a single unquoted column and might have a - value of <literal>\.</literal>, you might need to quote that value in the - input file. + quoted on output. </para> The part about quoting output is kept because the code still does that. About this bit: "and on input, if quoted, is not interpreted as the end-of-data marker." So the patch removes that part. The reason is \. is now not interpreted as EOD in both cases, quoted or unquoted, conforming to spec. Previously, what the reader should have understood by "if quoted, ..." is that it implies "if not quoted, then .\ will be interpreted as EOD even though that behavior does not conform to the CSV spec". If we documented the new behavior, it would be something like: when quoted, it works as expected, and when unquoted, it works as expected too. Isn't it simpler not to say anything? About the next sentence: "If you are loading a file created by another application that has a single unquoted column and might have a value of \., you might need to quote that value in the input file." It's removed as well because it's not longer necessary to do that. The other hunk is in psql doc: --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 'second value' \g destination, because all data must pass through the client/server connection. For large amounts of data the <acronym>SQL</acronym> command might be preferable. - Also, because of this pass-through method, <literal>\copy - ... from</literal> in <acronym>CSV</acronym> mode will erroneously - treat a <literal>\.</literal> data value alone on a line as an - end-of-input marker. That behavior no longer happens, so this gets removed as well. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Attachment
"Daniel Verite" <daniel@manitou-mail.org> writes: > Tom Lane wrote: >> I've looked over this patch and I generally agree that this is a >> reasonable solution. > Thanks for reviewing this! While testing this, I tried running the tests with an updated server and non-updated psql, and not only did the new test case fail, but so did a bunch of existing ones. That's because existing psql will send the trailing "\." of inlined data to the server, and the updated server will now think that's data if it's in CSV mode. So this means that the patch introduces a rather serious cross-version compatibility problem. I doubt we can consider inlined CSV data to be a niche case that few people use; but it will fail every time if your psql is older than your server. Not sure what to do here. One idea is to install just the psql-side fix, which should break nothing now that version-2 protocol is dead, and then wait a few years before introducing the server-side change. That seems kind of sad though. An argument for not waiting is that psql may not be the only client that needs this behavioral adjustment, and if so there's going to be breakage anyway when we change the server; so we might as well get it over with. regards, tom lane
I wrote: > So this means that the patch introduces a rather serious cross-version > compatibility problem. I doubt we can consider inlined CSV data to be > a niche case that few people use; but it will fail every time if your > psql is older than your server. On third thought, that may not be as bad as I was thinking. We don't blink at the idea that an older psql's \d commands may malfunction with a newer server, and I think most users have internalized the idea that they want psql >= server. If the patch created an incompatibility with that direction, it'd be a problem, but I don't think it does. regards, tom lane
Tom Lane wrote: > Not sure what to do here. One idea is to install just the psql-side > fix, which should break nothing now that version-2 protocol is dead, > and then wait a few years before introducing the server-side change. > That seems kind of sad though. Wouldn't backpatching solve this? Then only the users who don't apply the minor updates would have non-matching server and psql. Initially I though that obsoleting the v2 protocol was a recent move, but reading older messages from the list I've got the impression that it was more or less in the pipeline since way before version 10. Also one of the cases the patch fixes, the one when imported data are silently truncated at the point of \., is quite nasty IMO. I can imagine an app where user-supplied data would be appended row-by-row into a CSV file, and would be processed periodically by batch. Under some conditions, in particular if newlines in the first column are allowed, a malevolent user could submit a \. sequence to cause the batch to miss the rest of the data without any error being raised. [1] https://www.postgresql.org/message-id/11648.1403147417%40sss.pgh.pa.us Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
"Daniel Verite" <daniel@manitou-mail.org> writes: > Tom Lane wrote: >> Not sure what to do here. One idea is to install just the psql-side >> fix, which should break nothing now that version-2 protocol is dead, >> and then wait a few years before introducing the server-side change. >> That seems kind of sad though. > Wouldn't backpatching solve this? No, it'd just reduce the surface area a bit. People on less-than- the-latest-minor-release would still have the issue. In any case back-patching further than v14 would be a nonstarter, because we didn't remove protocol v2 support till then. However, the analogy to "\d commands might fail against a newer server" reduces my level of concern quite a lot: it's hard to draw much of a line separating that kind of issue from "inline COPY CSV will fail against a newer server". It's not like such failures won't be obvious and fairly easy to diagnose. regards, tom lane
After some more poking at this topic, I realize that there is already very strange and undocumented behavior for backslash-dot even in non-CSV mode. Create a file like this: $ cat eofdata foobar foobaz\. more \. yet more and try importing it with COPY: regression=# create table eofdata(f1 text); CREATE TABLE regression=# copy eofdata from '/home/tgl/pgsql/eofdata'; COPY 2 regression=# table eofdata; f1 -------- foobar foobaz (2 rows) That's what you get in 9.0 and earlier versions, and it's already not-as-documented, because we claim that only \. alone on a line is an EOF marker; we certainly don't suggest that what's in front of it will be taken as valid data. However, somebody broke it some more in 9.1, because 9.1 up to HEAD produce this result: regression=# create table eofdata(f1 text); CREATE TABLE regression=# copy eofdata from '/home/tgl/pgsql/eofdata'; COPY 3 regression=# table eofdata; f1 -------- foobar foobaz more (3 rows) So the current behavior is that \. that is on the end of a line, but is not the whole line, is silently discarded and we keep going. All versions throw "end-of-copy marker corrupt" if there is something after \. on the same line. This is sufficiently weird that I'm starting to come around to Daniel's original proposal that we just drop the server's recognition of \. altogether (which would allow removal of some dozens of lines of complicated and now known-buggy code). Alternatively, we could fix it so that \. at the end of a line draws "end-of-copy marker corrupt", which would at least make things consistent, but I'm not sure that has any great advantage. I surely don't want to document the current behavioral details as being the right thing that we're going to keep doing. Thoughts? regards, tom lane
I wrote: > So the current behavior is that \. that is on the end of a line, > but is not the whole line, is silently discarded and we keep going. > All versions throw "end-of-copy marker corrupt" if there is > something after \. on the same line. > This is sufficiently weird that I'm starting to come around to > Daniel's original proposal that we just drop the server's recognition > of \. altogether (which would allow removal of some dozens of lines of > complicated and now known-buggy code). I experimented with that and soon ran into a nasty roadblock: it breaks dump/restore, because pg_dump includes a "\." line after COPY data whether or not it really needs one. Worse, that's implemented by including the "\." line into the archive format, so that existing dump files contain it. Getting rid of it would require an archive format version bump, plus some hackery to allow removal of the line when reading old dump files. While that's surely doable with enough effort, it's not the kind of thing to be undertaking with less than 2 days to feature freeze. Not to mention that I'm not sure we have consensus to do it at all. More fun stuff: PQgetline actually invents a "\." line when it sees server end-of-copy, and we tell users of that function to check for that not an out-of-band return value to detect EOF. It looks like we have no callers of that in the core distro, but do we want to deprecate it completely? So I feel like we need to put this patch on the shelf for the moment and come back to it early in v18. Although it seems reasonably clear what to do on the CSV side of things, it's very much less clear what to do about text-format handling of EOD markers, and I don't want to change one of those things in v17 and the other in v18. Also it seems like there are more dependencies on "\." than we realized. There could be an argument for applying just the psql change now, to remove its unnecessary sending of "\.". That won't break anything and it would give us at least one year's leg up on compatibility issues. regards, tom lane
Tom Lane wrote: > This is sufficiently weird that I'm starting to come around to > Daniel's original proposal that we just drop the server's recognition > of \. altogether (which would allow removal of some dozens of lines of > complicated and now known-buggy code) FWIW my plan was to not change anything in the TEXT mode, but I wasn't aware it had this issue that you found when \. is not in a line by itself. > Alternatively, we could fix it so that \. at the end of a line draws > "end-of-copy marker corrupt" > which would at least make things consistent, but I'm not sure that has > any great advantage. I surely don't want to document the current > behavioral details as being the right thing that we're going to keep > doing. Agreed we don't want to document that, but also why doesn't \. in the contents represent just a dot (as opposed to being an error), just like \a is a? I mean if eofdata contains foobar\a foobaz\aother then we get after import: f1 -------------- foobara foobazaother (2 rows) Reading the current doc on the text format, I can't see why importing: foobar\. foobar\.other is not supposed to produce f1 -------------- foobar. foobaz.other (2 rows) I see these rules in [1] about backslash: #1. "End of data can be represented by a single line containing just backslash-period (\.)." foobar\. and foobar\.other do not match that so #1 does not describe how they're interpreted. #2. "Backslash characters (\) can be used in the COPY data to quote data characters that might otherwise be taken as row or column delimiters." Dot is not a column delimiter (it's forbidden anyway), so #2 does not apply. #3. "In particular, the following characters must be preceded by a backslash if they appear as part of a column value: backslash itself, newline, carriage return, and the current delimiter character" Dot is not in that list so #3 does not apply. #4. "The following special backslash sequences are recognized by COPY FROM:" (followed by the table with \b \f, ...,) Dot is not mentioned. #5. "Any other backslashed character that is not mentioned in the above table will be taken to represent itself" Here we say that backslash dot represents a dot (unless other rules apply) foobar\. => foobar. foobar\.other => foobar.other #6. "However, beware of adding backslashes unnecessarily, since that might accidentally produce a string matching the end-of-data marker (\.) or the null string (\N by default)." So we *recommend* not to use \. but as I understand it, the warning with the EOD marker is about accidentally creating a line that matches #1, that is, \. alone on a line. #7 "These strings will be recognized before any other backslash processing is done." TBH I don't understand what #7 implies. The order in backslash processing looks like an implementation detail that should not matter in understanding the format? Considering this, it seems to me that #5 says that backslash-dot represents a dot unless #1 applies, and the other #2 #3 #4 #6 #7 rules do not state anything that would contradict that. [1] https://www.postgresql.org/docs/current/sql-copy.html Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On Sun, Apr 7, 2024 at 12:00:25AM +0200, Daniel Verite wrote: > Tom Lane wrote: > > > This is sufficiently weird that I'm starting to come around to > > Daniel's original proposal that we just drop the server's recognition > > of \. altogether (which would allow removal of some dozens of lines of > > complicated and now known-buggy code) > > FWIW my plan was to not change anything in the TEXT mode, > but I wasn't aware it had this issue that you found when > \. is not in a line by itself. > > > Alternatively, we could fix it so that \. at the end of a line draws > > "end-of-copy marker corrupt" > > which would at least make things consistent, but I'm not sure that has > > any great advantage. I surely don't want to document the current > > behavioral details as being the right thing that we're going to keep > > doing. > > Agreed we don't want to document that, but also why doesn't \. in the > contents represent just a dot (as opposed to being an error), > just like \a is a? I looked into this and started to realize that \. is the only copy backslash command where we define the behavior only alone at the beginning of a line, and not in other circumstances. The \a example above suggests \. should be period in all other cases, as suggested, but I don't know the ramifications if that. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Bruce Momjian <bruce@momjian.us> writes: > On Sun, Apr 7, 2024 at 12:00:25AM +0200, Daniel Verite wrote: >> Agreed we don't want to document that, but also why doesn't \. in the >> contents represent just a dot (as opposed to being an error), >> just like \a is a? > I looked into this and started to realize that \. is the only copy > backslash command where we define the behavior only alone at the > beginning of a line, and not in other circumstances. The \a example > above suggests \. should be period in all other cases, as suggested, but > I don't know the ramifications if that. Here's the problem: if some client-side code thinks it's okay to quote "." as "\.", what is likely to happen when it's presented a data value consisting only of "."? It could very easily fall into the trap of producing an end-of-data marker. If we get rid of the whole concept of end-of-data markers, then it'd be totally reasonable to accept "\." as ".". But as long as we still have end-of-data markers, I think it's unwise to allow "\." to appear as anything but an end-of-data marker. It'd just add camouflage to the booby trap. regards, tom lane
Hi, I'm reviewing patches in Commitfest 2024-07 from top to bottom: https://commitfest.postgresql.org/48/ This is the 4th patch: https://commitfest.postgresql.org/48/4710/ FYI: https://commitfest.postgresql.org/48/4681/ is my patch. In <2726138.1712462833@sss.pgh.pa.us> "Re: Fixing backslash dot for COPY FROM...CSV" on Sun, 07 Apr 2024 00:07:13 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Here's the problem: if some client-side code thinks it's okay to > quote "." as "\.", what is likely to happen when it's presented > a data value consisting only of "."? It could very easily fall > into the trap of producing an end-of-data marker. > > If we get rid of the whole concept of end-of-data markers, then > it'd be totally reasonable to accept "\." as ".". But as long > as we still have end-of-data markers, I think it's unwise to allow > "\." to appear as anything but an end-of-data marker. It'd just > add camouflage to the booby trap. I read through this thread. It seems that this thread discuss 2 things: 1. \. in CSV mode 2. \. in non-CSV mode Recent messages discussed mainly 2. but how about create a separated thread for 2.? Because the original mail focused on 1. and it seems that we can handle them separately. Here are comments for the latest v4 patch: ---- diff --git a/src/bin/psql/copy.c b/src/bin/psql/copy.c index 961ae32694..a39818b193 100644 --- a/src/bin/psql/copy.c +++ b/src/bin/psql/copy.c @@ -620,20 +620,32 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res) ... + if (copystream == pset.cur_cmd_source) + { + *fgresult='\0'; + buflen -= linelen; ---- Spaces around "=" are missing for the fgresult line. BTW, here is a diff after pgindent: ---- diff --git a/src/bin/psql/copy.c b/src/bin/psql/copy.c index a39818b1933..4ee8481998a 100644 --- a/src/bin/psql/copy.c +++ b/src/bin/psql/copy.c @@ -621,13 +621,13 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res) if (buf[buflen - 1] == '\n') { /* - * When at the beginning of the line, check for EOF marker. - * If the marker is found and the data is inlined, + * When at the beginning of the line, check for EOF + * marker. If the marker is found and the data is inlined, * we must stop at this point. If not, the \. line can be - * sent to the server, and we let it decide whether - * it's an EOF or not depending on the format: in - * basic TEXT, \. is going to be interpreted as an EOF, in - * CSV, it will not. + * sent to the server, and we let it decide whether it's + * an EOF or not depending on the format: in basic TEXT, + * \. is going to be interpreted as an EOF, in CSV, it + * will not. */ if (at_line_begin && copystream == pset.cur_cmd_source) { @@ -635,15 +635,16 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res) (linelen == 4 && memcmp(fgresult, "\\.\r\n", 4) == 0)) { copydone = true; + /* - * Remove the EOF marker from the data sent. - * In the case of CSV, the EOF marker must be + * Remove the EOF marker from the data sent. In + * the case of CSV, the EOF marker must be * removed, otherwise it would be interpreted by * the server as valid data. */ if (copystream == pset.cur_cmd_source) { - *fgresult='\0'; + *fgresult = '\0'; buflen -= linelen; } } ---- I also confirmed that the updated server and non-updated psql compatibility problem (the end-of-data "\." is inserted). It seems that it's difficult to solve without introducing incompatibility. How about introducing a new COPY option that controls whether "\." is ignored or not instead of this approach? Here is a migration idea: 1. Add the new COPY option to v18 2. psql detects whether a server has the new COPY option 2.1. If it's available, psql uses the option and doesn't send "\." 2.2. If it's not available, psql doesn't use the option and sends "\." 3. psql always doesn't sends "\." after v17 or earlier reach EOL Thanks, -- kou
Sutou Kouhei <kou@clear-code.com> writes: > I read through this thread. It seems that this thread > discuss 2 things: > 1. \. in CSV mode > 2. \. in non-CSV mode > Recent messages discussed mainly 2. but how about create a > separated thread for 2.? Because the original mail focused > on 1. and it seems that we can handle them separately. Well, we don't want to paint ourselves into a corner by considering only part of the problem. What I'm currently thinking is that we can't remove the special treatment of \. in text mode. It's not arguably a bug, because it's been part of the specification since day one; and there are too many compatibility risks in pg_dump and elsewhere. I think we should fix it so that \. that's not alone on a line throws an error, but I wouldn't go further than that. > How about introducing a new COPY option that controls > whether "\." is ignored or not instead of this approach? No thanks. Once we create such an option we'll never be able to get rid of it. regards, tom lane
Sutou Kouhei wrote: > BTW, here is a diff after pgindent: PFA a v5 with the cosmetic changes applied. > I also confirmed that the updated server and non-updated > psql compatibility problem (the end-of-data "\." is > inserted). It seems that it's difficult to solve without > introducing incompatibility. To clarify the compatibility issue, the attached bash script compares pre-patch and post-patch client/server combinations with different cases, submitted with different copy variants. case A: quoted backslash-dot sequence in CSV case B: unquoted backslash-dot sequence in CSV case C: CSV without backslash-dot case D: text with backslash-dot at the end case E: text without backslash-dot at the end The different ways to submit the data: copy from file \copy from file \copy from pstdin copy from stdin with embedded data after the command Also attaching the tables of results with the patch as it stands. "Failed" is when psql reports an error and "Data mismatch" is when it succeeds but with copied data differing from what was expected. Does your test has an equivalent in these results or is it a different case? Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite =======================
Attachment
Hi, In <4ffb7317-7e92-4cbd-a542-1e478af6ad5d@manitou-mail.org> "Re: Fixing backslash dot for COPY FROM...CSV" on Wed, 31 Jul 2024 15:42:41 +0200, "Daniel Verite" <daniel@manitou-mail.org> wrote: >> BTW, here is a diff after pgindent: > > PFA a v5 with the cosmetic changes applied. Thanks for updating the patch. >> I also confirmed that the updated server and non-updated >> psql compatibility problem (the end-of-data "\." is >> inserted). It seems that it's difficult to solve without >> introducing incompatibility. > > To clarify the compatibility issue, the attached bash script > compares pre-patch and post-patch client/server combinations with > different cases, submitted with different copy variants. > > case A: quoted backslash-dot sequence in CSV > case B: unquoted backslash-dot sequence in CSV > case C: CSV without backslash-dot > case D: text with backslash-dot at the end > case E: text without backslash-dot at the end > > The different ways to submit the data: > copy from file > \copy from file > \copy from pstdin > copy from stdin with embedded data after the command > > Also attaching the tables of results with the patch as it stands. > "Failed" is when psql reports an error and "Data mismatch" > is when it succeeds but with copied data differing from what was > expected. > > Does your test has an equivalent in these results or is it a different > case? Sorry for not clarify my try. My try was: Case C +----------+----------------+ | method | patched-server+| | | unpatched-psql | +----------+----------------+ | embedded | Data mismatch | +----------+----------------+ I confirmed that this case is "Data mismatch". Thanks, -- kou
On Wed, 31 Jul 2024 at 15:42, Daniel Verite <daniel@manitou-mail.org> wrote: > > Sutou Kouhei wrote: > > > BTW, here is a diff after pgindent: > > PFA a v5 with the cosmetic changes applied. Thank you Daniel for working on it. I've tested the patch and it seems it works as expected. I have a couple of minor comments. It seems it isn't necessary to handle "\." within "CopyAttributeOutCSV()" (file "src/backend/commands/copyto.c") anymore. /* * Because '\.' can be a data value, quote it if it appears alone on a * line so it is not interpreted as the end-of-data marker. */ if (single_attr && strcmp(ptr, "\\.") == 0) use_quote = true; You might see the difference in the test "cooy2.sql". Without changes the output is: =# COPY testeoc TO stdout CSV; a\. \.b c\.d "\." Another thing is that the comparison "copystream == pset.cur_cmd_source" happens twice within "handleCopyIn()". TBH it is a bit confusing to me what is the real purpose of that check, but one of the comparisons looks unnecessary. -- Kind regards, Artur Supabase
"Daniel Verite" <daniel@manitou-mail.org> writes: > To clarify the compatibility issue, the attached bash script > compares pre-patch and post-patch client/server combinations with > different cases, submitted with different copy variants. > ... > Also attaching the tables of results with the patch as it stands. > "Failed" is when psql reports an error and "Data mismatch" > is when it succeeds but with copied data differing from what was > expected. Thanks for doing that --- that does indeed clarify matters. Obviously, the cases we need to worry about most are the "Data mismatch" ones, since those might lead to silent misbehavior. I modified your script to show me the exact nature of the mismatches. The results are: * For case B, unpatched both sides, the "mismatch" is that the copy stops at the \. line. We should really call this the "old expected behavior", since it's not impossible that somebody is relying on it. We're okay with breaking that expectation, but the question is whether there might be other surprises. * For case B, unpatched-server+patched-psql, the mismatches are exactly the same, i.e. the user sees the old behavior. That's OK. * For case B, patched-server+unpatched-psql, the mismatches are different: the copy absorbs the \. as a data line and then stops. So that's surprising and bad, as it's neither the old expected behavior nor the new intended behavior. However, there are two things that make me less worried about this than I might be. First, it's pretty likely that the server will not find "\." alone on a line to be valid input data for the COPY, so that the result will be an error not silently wrong data. Second, the case of patched-server+unpatched-psql is fairly uncommon, and people know to expect that an old psql might not behave perfectly against a newer server. The other direction of version mismatch is of far more concern: people do expect a new psql to work with an old server. * For case C, patched-server+unpatched-psql, the mismatch for embedded data is again that the copy absorbs the \. as a data line and then stops. Again, this isn't great but the mitigating factors are the same as above. In all other cases, the results are the same or strictly better than before. So on the whole I think we are good on the compatibility front and I'm ready to press ahead with the v5 behavior. However, I've not looked at Artur's coding suggestions downthread. Do you want to review that and see if you want to adopt any of those changes? regards, tom lane
"Daniel Verite" <daniel@manitou-mail.org> writes: > [ v6-0001-Support-backslash-dot-on-a-line-by-itself-as-vali.patch ] I did some more work on the docs and comments, and pushed that. Returning to my upthread thought that >>> I think we should fix it so that \. that's not alone on a line >>> throws an error, but I wouldn't go further than that. here's a quick follow-on patch to make that happen. It could probably do with a test case to demonstrate the error, but I didn't bother yet pending approval that we want to do this. (This passes check-world as it stands, indicating we have no existing test that expects this case to work.) Also, I used the same error message "end-of-copy marker corrupt" that we have for the case of junk following the marker, but I can't say that I think much of that phrasing. What do people think of "end-of-copy marker is not alone on its line", instead? regards, tom lane diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c index a280efe23f..2ea9679b3c 100644 --- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c @@ -1426,13 +1426,17 @@ CopyReadLineText(CopyFromState cstate) } /* - * Transfer only the data before the \. into line_buf, then - * discard the data and the \. sequence. + * If there is any data on this line before the \., complain. + */ + if (cstate->line_buf.len > 0 || + prev_raw_ptr > cstate->input_buf_index) + ereport(ERROR, + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), + errmsg("end-of-copy marker corrupt"))); + + /* + * Discard the \. and newline, then report EOF. */ - if (prev_raw_ptr > cstate->input_buf_index) - appendBinaryStringInfo(&cstate->line_buf, - cstate->input_buf + cstate->input_buf_index, - prev_raw_ptr - cstate->input_buf_index); cstate->input_buf_index = input_buf_ptr; result = true; /* report EOF */ break;
Tom Lane wrote: > > [ v6-0001-Support-backslash-dot-on-a-line-by-itself-as-vali.patch ] > > I did some more work on the docs and comments, and pushed that. Thanks! > Returning to my upthread thought that > > >>> I think we should fix it so that \. that's not alone on a line > >>> throws an error, but I wouldn't go further than that. > > here's a quick follow-on patch to make that happen. It could > probably do with a test case to demonstrate the error, but > I didn't bother yet pending approval that we want to do this. +1 for fixing. In particular, the behavior that silently drops data after a misplaced marker seems hard to argue for: postgres=# \copy foo(c1,c2) from stdin Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself, or an EOF signal. >> a b >> c \. >> d e >> \. COPY 2 postgres=# table foo; c1 | c2 ----+---- a | b c | (2 rows) > Also, I used the same error message "end-of-copy marker corrupt" > that we have for the case of junk following the marker, but > I can't say that I think much of that phrasing. What do people > think of "end-of-copy marker is not alone on its line", instead? Yes, it's more precise and more helpful. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
"Daniel Verite" <daniel@manitou-mail.org> writes: > Tom Lane wrote: >> Returning to my upthread thought that >>> I think we should fix it so that \. that's not alone on a line >>> throws an error, but I wouldn't go further than that. >> here's a quick follow-on patch to make that happen. It could >> probably do with a test case to demonstrate the error, but >> I didn't bother yet pending approval that we want to do this. > +1 for fixing. It's been on the thread for awhile now without objections, so I'll go ahead and make that happen. regards, tom lane