Thread: Should CSV parsing be stricter about mid-field quotes?
Hi hackers,

I've come across an unexpected behavior in our CSV parser that I'd like to
bring up for discussion.

% cat example.csv
id,rating,review
1,5,"Great product, will buy again."
2,3,"I bought this for my 6" laptop but it didn't fit my 8" tablet"

% psql
CREATE TABLE reviews (id int, rating int, review text);
\COPY reviews FROM example.csv WITH CSV HEADER;
SELECT * FROM reviews;

This gives:

 id | rating |                            review
----+--------+--------------------------------------------------------------
  1 |      5 | Great product, will buy again.
  2 |      3 | I bought this for my 6 laptop but it didn't fit my 8 tablet
(2 rows)

The parser currently accepts quoting within an unquoted field. This can lead to
data misinterpretation when the quote is part of the field data (e.g.,
for inches, like in the example).

Our CSV output rules quote an entire field or not at all. But the import of
fields with mid-field quotes might lead to surprising and undetected outcomes.

I think we should throw a parsing error for unescaped mid-field quotes,
and add a COPY option like ALLOW_MIDFIELD_QUOTES for cases where mid-field
quotes are necessary. The error message could suggest this option when it
encounters an unescaped mid-field quote.

I think the convenience of not having to use an extra option doesn't outweigh
the risk of undetected data integrity issues.

Thoughts?
/Joel
On Thu, 11 May 2023 at 10:04, Joel Jacobson <joel@compiler.org> wrote:
>
> The parser currently accepts quoting within an unquoted field. This can lead to
> data misinterpretation when the quote is part of the field data (e.g.,
> for inches, like in the example).

I think you're thinking about it differently than the parser. I think the
parser is treating this the way, say, the shell treats quotes. That is, it
sees a quoted "I bought this for my 6" followed by an unquoted " laptop but
it didn't fit my 8" followed by a quoted " tablet".

So for example, in that world you might only quote commas and newlines, so
you might print something like

1,2,I bought this for my "6"" laptop " but it "didn't" fit my "8""" laptop

The actual CSV spec https://datatracker.ietf.org/doc/html/rfc4180 only allows
fully quoted or fully unquoted fields, and there can only be escaped
double-doublequote characters in quoted fields and no doublequote characters
in unquoted fields. But it also says

   Due to lack of a single specification, there are considerable
   differences among implementations.  Implementors should "be
   conservative in what you do, be liberal in what you accept from
   others" (RFC 793 [8]) when processing CSV files.  An attempt at a
   common definition can be found in Section 2.

So the real question is: are there tools out there that generate entries like
this, and what are their intentions?

> I think we should throw a parsing error for unescaped mid-field quotes,
> and add a COPY option like ALLOW_MIDFIELD_QUOTES for cases where mid-field
> quotes are necessary. The error message could suggest this option when it
> encounters an unescaped mid-field quote.
>
> I think the convenience of not having to use an extra option doesn't outweigh
> the risk of undetected data integrity issues.

It's also a pretty annoying experience to get a message saying "error, turn
this option on to not get an error". I get what you're saying too; whether
that's more of a risk depends on whether turning off the error is really the
right thing most of the time, or is just causing data to be read incorrectly.

--
greg
Maybe this is unexpected by you, but it's not by me. What other sane interpretation of that data could there be? And what CSV producer outputs such horrible content? As you've noted, ours certainly does not. Our rules are clear: quotes within quotes must be escaped (default escape is by doubling the quote char). Allowing partial fields to be quoted was a deliberate decision when CSV parsing was implemented, because examples have been seen in the wild.
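For reference, with those defaults the problem line from the example would
have to be written like this to round-trip correctly:

2,3,"I bought this for my 6"" laptop but it didn't fit my 8"" tablet"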
So I don't think our behaviour is broken or needs fixing. As mentioned by Greg, this is an example of the adage about being liberal in what you accept.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
I'm pretty reluctant to change something that's been working as designed for almost 20 years, and about which we have hitherto had zero complaints that I recall.
I could see an argument for a STRICT mode which would disallow partially quoted fields, although I'd like some evidence that we're dealing with a real problem here. Is there really a CSV producer that produces output like what you showed in your example? And if so has anyone objected to them about the insanity of that?
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> I could see an argument for a STRICT mode which would disallow partially
> quoted fields, although I'd like some evidence that we're dealing with a
> real problem here. Is there really a CSV producer that produces output
> like what you showed in your example? And if so has anyone objected to
> them about the insanity of that?

I think you'd want not just "some evidence" but "compelling evidence".
Any such option is going to add cycles into the low-level input parser
for COPY, which we know is a hot spot and we've expended plenty of
sweat on. Adding a speed penalty that will be paid by the 99.99%
of users who don't have an issue here is going to be a hard sell.

			regards, tom lane
On Sat, 13 May 2023 at 09:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andrew Dunstan <andrew@dunslane.net> writes:
> > I could see an argument for a STRICT mode which would disallow partially
> > quoted fields, although I'd like some evidence that we're dealing with a
> > real problem here. Is there really a CSV producer that produces output
> > like what you showed in your example? And if so has anyone objected to
> > them about the insanity of that?
>
> I think you'd want not just "some evidence" but "compelling evidence".
> Any such option is going to add cycles into the low-level input parser
> for COPY, which we know is a hot spot and we've expended plenty of
> sweat on. Adding a speed penalty that will be paid by the 99.99%
> of users who don't have an issue here is going to be a hard sell.

Well I'm not sure that follows. Joel specifically claimed that an
implementation that didn't accept inputs like this would actually be simpler,
and that might mean it would actually be faster. And I don't think you have
to look very hard for inputs like this -- plenty of people generate CSV files
from simple templates or script outputs that don't understand escaping
quotation marks at all. Outputs like that will be fine as long as there are
no doublequotes in the inputs, but then one day someone will enter a
doublequote in a form somewhere and blammo.

So I guess the real question is whether accepting inputs with unescaped
quotes and interpreting them the way we do is really the best interpretation.
Is the user best served by a) assuming they intended to quote part of the
field and not quote part of it, b) assuming they failed to escape the
quotation mark, or c) assuming something's gone wrong and the input is
entirely untrustworthy.

--
greg
> On Sat, 13 May 2023 at 09:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> I could see an argument for a STRICT mode which would disallow partially
>>> quoted fields, although I'd like some evidence that we're dealing with a
>>> real problem here. Is there really a CSV producer that produces output
>>> like what you showed in your example? And if so has anyone objected to
>>> them about the insanity of that?
>>
>> I think you'd want not just "some evidence" but "compelling evidence".
>> Any such option is going to add cycles into the low-level input parser
>> for COPY, which we know is a hot spot and we've expended plenty of
>> sweat on. Adding a speed penalty that will be paid by the 99.99%
>> of users who don't have an issue here is going to be a hard sell.
>
> Well I'm not sure that follows. Joel specifically claimed that an
> implementation that didn't accept inputs like this would actually be
> simpler, and that might mean it would actually be faster. And I don't
> think you have to look very hard for inputs like this -- plenty of people
> generate CSV files from simple templates or script outputs that don't
> understand escaping quotation marks at all. Outputs like that will be
> fine as long as there are no doublequotes in the inputs, but then one day
> someone will enter a doublequote in a form somewhere and blammo.
The procedure described is plain wrong, and I don't have too much sympathy for people who implement it. Parsing CSV files might be a mild PITN, but constructing them is pretty darn simple. Something like this perl fragment should do it:
do {
    s/"/""/g;                      # escape embedded quote chars by doubling them
    $_ = qq{"$_"} if /[",\r\n]/;   # quote the field if it contains a quote, comma or newline
} foreach @fields;
print join(',', @fields), "\n";
And if people do follow the method you describe then their input with unescaped quotes will be rejected 999 times out of 1000. It's only cases where the field happens to have an even number of embedded quotes, like Joel's somewhat contrived example, that the input will be accepted.
> So I guess the real question is whether accepting inputs with unescaped
> quotes and interpreting them the way we do is really the best
> interpretation. Is the user best served by a) assuming they intended to
> quote part of the field and not quote part of it, b) assuming they failed
> to escape the quotation mark, or c) assuming something's gone wrong and
> the input is entirely untrustworthy.
As I said earlier, I'm quite reluctant to break things that might have been working happily for people for many years, in order to accommodate people who can't do the minimum required to produce correct CSVs. I have no idea how many are relying on it, but I would be slightly surprised if the number were zero.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
You can use CSV mode pretty reliably for TSV files. The trick is to use a quoting char that shouldn't appear, such as E'\x01' as well as setting the delimiter to E'\t'. Yes, it's far from obvious.
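Concretely, the trick looks something like this (a sketch; the table and
file names are made up):

-- Tab as delimiter, and a quote character chosen so it can never
-- appear in the data (here the control character \x01).
\copy t FROM 'data.tsv' WITH (FORMAT csv, DELIMITER E'\t', QUOTE E'\x01')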
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Wed, May 17, 2023, at 19:42, Andrew Dunstan wrote:
> You can use CSV mode pretty reliably for TSV files. The trick is to use a
> quoting char that shouldn't appear, such as E'\x01' as well as setting the
> delimiter to E'\t'. Yes, it's far from obvious.

I've been using that trick myself many times in the past, but thanks to this
deep-dive into this topic, it looks to me like TEXT would be a better format
fit when dealing with unquoted TSV files, or?

OTOH, one would then need to inspect that the TSV file doesn't contain \. on
an empty line...

I was about to suggest we perhaps should consider adding a TSV format, that
is like TEXT excluding the PostgreSQL-specific things like \. and \N,
but then I tested exporting TSV from Numbers on Mac and Google Sheets,
and I can see there are incompatible differences. Numbers quotes fields
that contain double-quote marks, while Google Sheets doesn't.

None of them (unsurprisingly) uses mid-field quoting though.

Anyone using Excel who could try exporting the following example as CSV/TSV?

CREATE TABLE t (a text, b text, c text, d text);
INSERT INTO t (a, b, c, d)
VALUES ('unquoted', 'a "quoted" string', 'field, with a comma', E'field\t with a tab');
Here you go. Not horrible handling. (I use DataGrip so I saved it from there
directly as TSV, just for an extra datapoint).

FWIW, if you copy/paste in windows, the data, the field with the tab gets
split into another column in Excel. But saving it as a file, and opening it.
Saving it as XLSX, and then having Excel save it as a TSV (versus opening a
text file, and saving it back)
Kirk...
On Thu, May 18, 2023, at 00:18, Kirk Wolak wrote:
> Here you go. Not horrible handling. (I use DataGrip so I saved it from there
> directly as TSV, just for an extra datapoint).
>
> FWIW, if you copy/paste in windows, the data, the field with the tab gets
> split into another column in Excel. But saving it as a file, and opening it.
> Saving it as XLSX, and then having Excel save it as a TSV (versus opening a
> text file, and saving it back)

Very useful, thanks.

Interesting: DataGrip, contrary to Excel, doesn't quote fields with commas in
TSV.

All the DataGrip/Excel TSV variants use quoting when necessary, contrary to
Google Sheets's TSV format, which doesn't quote fields at all.
DataGrip/Excel also terminate the last record with a newline, while Google
Sheets omits the newline for the last record (which is bad, since then a
streaming reader wouldn't know if the last record is complete or not).

This makes me think we probably shouldn't add a new TSV format, since there
is no consistency between vendors. It's impossible to deduce with certainty
if a TSV field that begins with a double quotation mark is quoted or
unquoted.

Two alternative ideas:

1. How about adding a `WITHOUT QUOTE` or `QUOTE NONE` option in conjunction
with `COPY ... WITH CSV`?

Internally, it would just set `quotec = '\0';` so it wouldn't affect
performance at all.

2. How about adding a note on the complexities of dealing with TSV files in
the COPY documentation?

/Joel
QUOTE NONE and DELIMITER NONE should work fine. NONE is already a keyword, so the disturbance should be minimal.
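For illustration, usage might then look something like this (a sketch of the
proposed syntax only; COPY does not accept NONE for these options today, and
the table and file names are made up):

-- Proposed: read each raw line into a single text column, with no quote,
-- escape, or delimiter processing at all.
CREATE TABLE raw_log (line text);
\copy raw_log FROM 'app.log' WITH (FORMAT csv, DELIMITER NONE, QUOTE NONE)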
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
	Joel Jacobson wrote:

> I've been using that trick myself many times in the past, but thanks to this
> deep-dive into this topic, it looks to me like TEXT would be a better format
> fit when dealing with unquoted TSV files, or?
>
> OTOH, one would then need to inspect that the TSV file doesn't contain \. on
> an empty line...

Note that this is the case for valid CSV contents, since backslash-dot
on a line by itself is both an end-of-data marker for COPY FROM and a
valid CSV line.
Having this line in the data results in either an error or having the
rest of the data silently discarded, depending on the context. There
is some previous discussion about this in [1].

Since the TEXT format doesn't have this kind of problem, one solution
is to filter the data through PROGRAM with an [untrusted CSV]->TEXT
filter. This is to be preferred over direct CSV loading when
strictness or robustness are more important than convenience.

[1] https://www.postgresql.org/message-id/10e3eff6-eb04-4b3f-aeb4-b920192b977a@manitou-mail.org

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
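P.S. A minimal sketch of that approach, assuming a hypothetical csv2text
filter installed on the server that rewrites untrusted CSV into COPY's TEXT
escaping:

-- COPY ... FROM PROGRAM runs the command on the server and reads its stdout;
-- csv2text is a placeholder for any CSV -> TEXT-format converter.
COPY reviews FROM PROGRAM 'csv2text /data/example.csv' WITH (FORMAT text);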
On Thu, May 18, 2023, at 18:48, Daniel Verite wrote:
> Joel Jacobson wrote:
>> OTOH, one would then need to inspect that the TSV file doesn't contain \. on
>> an empty line...
>
> Note that this is the case for valid CSV contents, since backslash-dot
> on a line by itself is both an end-of-data marker for COPY FROM and a
> valid CSV line.
> Having this line in the data results in either an error or having the
> rest of the data silently discarded, depending on the context. There
> is some previous discussion about this in [1].
> Since the TEXT format doesn't have this kind of problem, one solution
> is to filter the data through PROGRAM with an [untrusted CSV]->TEXT
> filter. This is to be preferred over direct CSV loading when
> strictness or robustness are more important than convenience.
>
> [1] https://www.postgresql.org/message-id/10e3eff6-eb04-4b3f-aeb4-b920192b977a@manitou-mail.org

Thanks for sharing the old thread, very useful.

I see I've failed miserably to understand all the details of the COPY
command. Upon reading the thread, I'm still puzzled about one thing:

Why does \. need to have a special meaning when using COPY FROM with files?

I understand its necessity for STDIN, given that the end of input needs to be
explicitly defined. However, for files, we have a known file size and the
end-of-file can be detected without the need for special markers.

Also, is the difference in how server-side COPY CSV is capable of dealing
with \. but apparently not the client-side \COPY CSV documented somewhere?

CREATE TABLE t (c text);
INSERT INTO t (c) VALUES ('foo'), (E'\n\\.\n'), ('bar');

-- Works OK:
COPY t TO '/tmp/t.csv' WITH CSV;
TRUNCATE t;
COPY t FROM '/tmp/t.csv' WITH CSV;

-- Doesn't work:
\COPY t TO '/tmp/t.csv' WITH CSV;
TRUNCATE t;
\COPY t FROM '/tmp/t.csv' WITH CSV;
ERROR:  unterminated CSV quoted field
CONTEXT:  COPY t, line 4: ""
\.
"

/Joel
	Joel Jacobson wrote:

> I understand its necessity for STDIN, given that the end of input needs to
> be explicitly defined.
> However, for files, we have a known file size and the end-of-file can be
> detected without the need for special markers.
>
> Also, is the difference in how server-side COPY CSV is capable of dealing
> with \. but apparently not the client-side \COPY CSV documented somewhere?

psql implements the client-side "\copy table from file..." with
COPY table FROM STDIN ...

COPY FROM file CSV somewhat differs as your example shows, but it still
mishandles \. when unquoted. For instance, consider this file to load with
COPY t FROM '/tmp/t.csv' WITH CSV

$ cat /tmp/t.csv
line 1
\.
line 3
line 4

It results in having only "line 1" being imported.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
On Fri, May 19, 2023, at 18:06, Daniel Verite wrote:
> COPY FROM file CSV somewhat differs as your example shows,
> but it still mishandles \. when unquoted. For instance, consider this
> file to load with COPY t FROM '/tmp/t.csv' WITH CSV
>
> $ cat /tmp/t.csv
> line 1
> \.
> line 3
> line 4
>
> It results in having only "line 1" being imported.

Hmm, this is a problem for one of the new use-cases I brought up that would
be possible with DELIMITER NONE QUOTE NONE, i.e. to import unstructured log
files, where each raw line should be imported "as is" into a single text
column.

Is there a valid reason why \. is needed for COPY FROM filename?

It seems to me it would only be necessary for the COPY FROM STDIN case,
since files have a natural end-of-file and a known file size.

/Joel
	Joel Jacobson wrote:

> Is there a valid reason why \. is needed for COPY FROM filename?
> It seems to me it would only be necessary for the COPY FROM STDIN case,
> since files have a natural end-of-file and a known file size.

Looking at CopyReadLineText() over at [1], I don't see a reason why
the unquoted \. could not be handled with COPY FROM file.

Even COPY FROM STDIN looks like it could benefit, so that
\copy from file csv would hopefully not choke or truncate the data.

There's still the case when the CSV data is embedded in a psql script
(psql is unable to know where it ends), but for that, "don't do that"
might be a reasonable answer.

[1] https://doxygen.postgresql.org/copyfromparse_8c.html#a90201f711221dd82d0c08deedd91e1b3

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
	Kirk Wolak wrote:

> We do NOT do "CSV", we mimic pg_dump.

pg_dump uses the text format (as opposed to csv), where \. on a line by
itself cannot appear in the data, so there's no problem.
The problem is limited to the csv format.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
On Sat, May 20, 2023 at 09:16:30AM +0200, Joel Jacobson wrote:
> On Fri, May 19, 2023, at 18:06, Daniel Verite wrote:
> > COPY FROM file CSV somewhat differs as your example shows,
> > but it still mishandles \. when unquoted. For instance, consider this
> > file to load with COPY t FROM '/tmp/t.csv' WITH CSV
> > $ cat /tmp/t.csv
> > line 1
> > \.
> > line 3
> > line 4
> >
> > It results in having only "line 1" being imported.
>
> Hmm, this is a problem for one of the new use-cases I brought up that would be
> possible with DELIMITER NONE QUOTE NONE, i.e. to import unstructured log files,
> where each raw line should be imported "as is" into a single text column.
>
> Is there a valid reason why \. is needed for COPY FROM filename?

No.

> It seems to me it would only be necessary for the COPY FROM STDIN case,
> since files have a natural end-of-file and a known file size.

Right. Even for COPY FROM STDIN, it's not needed anymore since the removal
of protocol v2. psql would still use it to find the end of inline COPY data,
though. Here's another relevant thread:
https://postgr.es/m/flat/bfcd57e4-8f23-4c3e-a5db-2571d09208e2%40beta.fastmail.com
On Sun, Jul 2, 2023, at 07:45, Noah Misch wrote:
> On Sat, May 20, 2023 at 09:16:30AM +0200, Joel Jacobson wrote:
>> On Fri, May 19, 2023, at 18:06, Daniel Verite wrote:
>> > COPY FROM file CSV somewhat differs as your example shows,
>> > but it still mishandles \. when unquoted. For instance, consider this
>> > file to load with COPY t FROM '/tmp/t.csv' WITH CSV
>> > $ cat /tmp/t.csv
>> > line 1
>> > \.
>> > line 3
>> > line 4
>> >
>> > It results in having only "line 1" being imported.
>>
>> Hmm, this is a problem for one of the new use-cases I brought up that would be
>> possible with DELIMITER NONE QUOTE NONE, i.e. to import unstructured log files,
>> where each raw line should be imported "as is" into a single text column.
>>
>> Is there a valid reason why \. is needed for COPY FROM filename?
>
> No.
>
>> It seems to me it would only be necessary for the COPY FROM STDIN case,
>> since files have a natural end-of-file and a known file size.
>
> Right. Even for COPY FROM STDIN, it's not needed anymore since the removal
> of protocol v2. psql would still use it to find the end of inline COPY
> data, though. Here's another relevant thread:
> https://postgr.es/m/flat/bfcd57e4-8f23-4c3e-a5db-2571d09208e2%40beta.fastmail.com

I was very pleased to see commit 7702337:
Do not treat \. as an EOF marker in CSV mode for COPY IN.

Great job!

Thanks to this fix, maybe there is now interest to resume the discussion on
the ideas discussed in this thread?

Recap of ideas:

1. Stricter parsing, reject mid-field quotes

The idea is to prevent balanced mid-field quotes from being silently removed.

Example:

% cat example.csv
id,post
1,<p>Hello there!</p>
2,<a href="http://example.com">Click me!</a>

% psql
# \copy posts from example.csv with csv header;
COPY 2
# SELECT * FROM posts;
 id |                   post
----+-------------------------------------------
  1 | <p>Hello there!</p>
  2 | <a href=http://example.com>Click me!</a>
(2 rows)

Note how the quotes around the URL disappeared.

2. Avoid needing hacks like using E'\x01' as quoting char.

Introduce QUOTE NONE and DELIMITER NONE,
to allow raw lines to be imported "as is" into a single text column.

Best regards,

Joel
On 2024-10-04 Fr 12:19 PM, Joel Jacobson wrote:
> 2. Avoid needing hacks like using E'\x01' as quoting char.
>
> Introduce QUOTE NONE and DELIMITER NONE,
> to allow raw lines to be imported "as is" into a single text column.

As I think I previously indicated, I'm perfectly happy about 2, because it
replaces a far from obvious hack, but I am at best dubious about 1.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Sun, Oct 6, 2024, at 15:12, Andrew Dunstan wrote:
> On 2024-10-04 Fr 12:19 PM, Joel Jacobson wrote:
>> 2. Avoid needing hacks like using E'\x01' as quoting char.
>>
>> Introduce QUOTE NONE and DELIMITER NONE,
>> to allow raw lines to be imported "as is" into a single text column.
>
> As I think I previously indicated, I'm perfectly happy about 2, because
> it replaces a far from obvious hack, but I am at best dubious about 1.

I've looked at how to implement this, and there is quite a lot of complexity
having to do with quoting and escaping.

Need guidance on what you think would be best to do:

2a) Should we aim to support all NONE combinations, at the cost of increasing
the complexity of all code having to do with quoting, escaping and
delimiters?

2b) Should we aim to only support the QUOTE NONE DELIMITER NONE ESCAPE NONE
case, useful in the real-life scenario we've identified, that is, importing
raw log lines into a single column, which could then be handled by a much
simpler and probably faster version of CopyReadAttributesCSV(), e.g. named
CopyReadAttributesUnquotedUnDelimited() or maybe CopyReadAttributesRaw()?

(We also need to modify CopyReadLineText(), but it seems we only need a
quote_none bool to skip over the quoting code there, so I don't think a
separate function is warranted there.)

I think ESCAPE NONE should be implied by QUOTE NONE, since the default escape
character is the same as the quote character, so if there isn't any quote
character, then I think that would imply no escape character either.

Can we think of any other valid, useful, realistic, and safe combinations of
QUOTE NONE, DELIMITER NONE and ESCAPE NONE, that would be interesting to
support?

If not, then I think 2b looks more interesting, to reduce the risk of
accidental misuse, for the simpler implementation, and since it also should
allow importing raw log files faster, thanks to the reduced complexity.

Best regards,

Joel
On 2024-10-08 Tu 3:25 AM, Joel Jacobson wrote:
> On Sun, Oct 6, 2024, at 15:12, Andrew Dunstan wrote:
>> On 2024-10-04 Fr 12:19 PM, Joel Jacobson wrote:
>>> 2. Avoid needing hacks like using E'\x01' as quoting char.
>>>
>>> Introduce QUOTE NONE and DELIMITER NONE,
>>> to allow raw lines to be imported "as is" into a single text column.
>> As I think I previously indicated, I'm perfectly happy about 2, because
>> it replaces a far from obvious hack, but I am at best dubious about 1.
> I've looked at how to implement this, and there is quite a lot of
> complexity having to do with quoting and escaping.
>
> Need guidance on what you think would be best to do:
>
> 2a) Should we aim to support all NONE combinations, at the cost of
> increasing the complexity of all code having to do with quoting, escaping
> and delimiters?
>
> 2b) Should we aim to only support the QUOTE NONE DELIMITER NONE ESCAPE NONE
> case, useful in the real-life scenario we've identified, that is, importing
> raw log lines into a single column, which could then be handled by a much
> simpler and probably faster version of CopyReadAttributesCSV(),
> e.g. named CopyReadAttributesUnquotedUnDelimited() or
> maybe CopyReadAttributesRaw()?
> (We also need to modify CopyReadLineText(), but it seems we only need a
> quote_none bool to skip over the quoting code there, so I don't think a
> separate function is warranted there.)
>
> I think ESCAPE NONE should be implied by QUOTE NONE, since the default
> escape character is the same as the quote character, so if there isn't any
> quote character, then I think that would imply no escape character either.
>
> Can we think of any other valid, useful, realistic, and safe combinations
> of QUOTE NONE, DELIMITER NONE and ESCAPE NONE, that would be interesting
> to support?
>
> If not, then I think 2b looks more interesting, to reduce the risk of
> accidental misuse, for the simpler implementation, and since it also should
> allow importing raw log files faster, thanks to the reduced complexity.

Off hand I can't think of a case other than 2b that would apply in the real
world, although others might like to chime in here. If we're going to do
that, let's find a shorter way to spell it. In fact, we should do that even
if we go with 2a.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 2024-10-09 We 8:00 AM, Andrew Dunstan wrote:
>
> On 2024-10-08 Tu 3:25 AM, Joel Jacobson wrote:
>> On Sun, Oct 6, 2024, at 15:12, Andrew Dunstan wrote:
>>> On 2024-10-04 Fr 12:19 PM, Joel Jacobson wrote:
>>>> 2. Avoid needing hacks like using E'\x01' as quoting char.
>>>>
>>>> Introduce QUOTE NONE and DELIMITER NONE,
>>>> to allow raw lines to be imported "as is" into a single text column.
>>> As I think I previously indicated, I'm perfectly happy about 2, because
>>> it replaces a far from obvious hack, but I am at best dubious about 1.
>> I've looked at how to implement this, and there is quite a lot of
>> complexity having to do with quoting and escaping.
>>
>> Need guidance on what you think would be best to do:
>>
>> 2a) Should we aim to support all NONE combinations, at the cost of
>> increasing the complexity of all code having to do with quoting,
>> escaping and delimiters?
>>
>> 2b) Should we aim to only support the QUOTE NONE DELIMITER NONE
>> ESCAPE NONE case, useful in the real-life scenario we've identified,
>> that is, importing raw log lines into a single column, which could then
>> be handled by a much simpler and probably faster version of
>> CopyReadAttributesCSV(), e.g. named
>> CopyReadAttributesUnquotedUnDelimited() or maybe CopyReadAttributesRaw()?
>> (We also need to modify CopyReadLineText(), but it seems we only need a
>> quote_none bool to skip over the quoting code there, so I don't think a
>> separate function is warranted there.)
>>
>> I think ESCAPE NONE should be implied by QUOTE NONE, since the default
>> escape character is the same as the quote character, so if there isn't
>> any quote character, then I think that would imply no escape character
>> either.
>>
>> Can we think of any other valid, useful, realistic, and safe
>> combinations of QUOTE NONE, DELIMITER NONE and ESCAPE NONE, that would
>> be interesting to support?
>>
>> If not, then I think 2b looks more interesting, to reduce the risk of
>> accidental misuse, for the simpler implementation, and since it also
>> should allow importing raw log files faster, thanks to the reduced
>> complexity.
>
> Off hand I can't think of a case other than 2b that would apply in the
> real world, although others might like to chime in here. If we're going
> to do that, let's find a shorter way to spell it. In fact, we should do
> that even if we go with 2a.

At the very least you should not need to say ESCAPE NONE, since the default
is to have ESCAPE the same as QUOTE, so QUOTE NONE should imply ESCAPE NONE.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Wed, Oct 9, 2024, at 14:45, Andrew Dunstan wrote:
> On 2024-10-09 We 8:00 AM, Andrew Dunstan wrote:
>> On 2024-10-08 Tu 3:25 AM, Joel Jacobson wrote:
>>> 2b) Should we aim to only support the QUOTE NONE DELIMITER NONE
>>> ESCAPE NONE case, useful in the real-life scenario we've identified,
>>> that is, importing raw log lines into a single column, ...
>> Off hand I can't think of a case other than 2b that would apply in the
>> real world, although others might like to chime in here. If we're
>> going to do that, let's find a shorter way to spell it. In fact, we
>> should do that even if we go with 2a.

I agree a shorter way to spell it would be nice.

But yet another new option would risk option creep IMO, especially since we
already have quite a lot of options that can only be used for some formats
and/or not together with other options.

If there were an easy way to just hack this into the CSV mode, in a clean and
understandable way that wouldn't risk confusing users who are actually
importing real CSV files, that would seem OK to me, but it seems difficult.

However, I feel it might be too much of a shoehorn to add this to the CSV
mode. Also, if there is no delimiter, that by definition means there cannot
be any "separated" values, which is what the "S" in "CSV" stands for.

I think it would be nicest to introduce a new "raw" FORMAT, that clearly gets
things right already at the top level, instead of trying to customize any of
the existing formats.

Arbitrary unstructured text files, such as some log files, seem like a big
enough group of files that users are probably already importing into
PostgreSQL, using the various hacks we've discussed. So providing such users
with a new FORMAT that precisely matches their use-case seems like the
simplest and most intuitive solution.

With a new format, we could clearly define its purpose in the docs:

Raw Format

The "raw" format is used for importing and exporting files containing
unstructured text, where each line is treated as a single field. This format
is ideal when dealing with data that doesn't conform to a structured, tabular
format and lacks delimiters.

Key Characteristics:

- No Field Delimiters: Each line is considered a complete value without any
  field separation.
- Single Column Requirement: The COPY command must specify exactly one column
  when using the raw format. Specifying multiple columns will result in an
  error.
- Literal Data Interpretation: All characters are taken literally. There is
  no special handling for quotes, backslashes, or escape sequences.
- No NULL Distinction: Empty lines are imported as empty strings, not as NULL
  values.
- No Headers or Metadata: The format does not support header rows or
  end-of-data markers; every line is treated as data.

Notes:

- Error Handling: An error will occur if you use the raw format without
  specifying exactly one column, or if the table has multiple columns and no
  column list is provided.
- Data Preservation: All characters, including whitespace and special
  characters, are preserved exactly as they appear in the file.

/Joel
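P.S. To make the proposal concrete, usage could look something like this
(proposed syntax only; nothing here is implemented, and the table and file
names are illustrative):

-- Proposed: import each line of an unstructured log file into one text column.
CREATE TABLE app_log (line text);
COPY app_log (line) FROM '/var/log/app.log' WITH (FORMAT raw);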
"Joel Jacobson" <joel@compiler.org> writes: > I think it would be nicest to introduce a new "raw" FORMAT, > that clearly get things right already at the top-level, > instead of trying to customize any of the existing formats. FWIW, I like this idea. It makes it much clearer how to have a fast parsing path. regards, tom lane
On 2024-10-09 We 11:58 AM, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>> I think it would be nicest to introduce a new "raw" FORMAT, that clearly
>> gets things right already at the top level, instead of trying to
>> customize any of the existing formats.
> FWIW, I like this idea. It makes it much clearer how to have a fast
> parsing path.

WFM, so something like FORMAT {SIMPLE, RAW, FAST, SINGLE}? We don't seem to
have an existing keyword that we could sanely reuse here.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
	Joel Jacobson wrote:

> - No Headers or Metadata:

It's not clear why it's necessary to disable the HEADER option for this
format?

> The format does not support header rows or end-of-data markers;
> every line is treated as data.

With COPY FROM STDIN followed by inline data in a script, an end-of-data
marker is required. That's also a problem for CSV except it's mitigated by
the possibility of quoting (using "\." instead of \.)

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
On Fri, Oct 11, 2024, at 15:04, Joel Jacobson wrote:
> On Thu, Oct 10, 2024, at 10:37, Daniel Verite wrote:
>> Joel Jacobson wrote:
>>
>>> - No Headers or Metadata:
>>
>> It's not clear why it's necessary to disable the HEADER option
>> for this format?
>
> It's not necessary, no, just couldn't see a use-case,
> since I only thought about the COPY FROM case
> where one would be dealing with unstructured undelimited
> text files, such as log files coming from some other system,
> that I've never seen have header rows.
>
> However, thanks to your question, I see how a user
> might want to use the raw format to export a text
> column "as is" using COPY TO, in which case it would
> be useful to use HEADER and then HEADER MATCH
> for COPY FROM.
>
> I therefore think the HEADER option should be supported
> for the new raw format.
>
>>> The format does not support header rows or end-of-data markers;
>>> every line is treated as data.
>>
>> With COPY FROM STDIN followed by inline data in a script,
>> an end-of-data marker is required. That's also a problem
>> for CSV except it's mitigated by the possibility of quoting
>> (using "\." instead of \.)
>
> Right. As long as \. won't have any special meaning for the raw format
> except in the STDIN case, that seems fine.
>
> I haven't looked at that part of the code in detail yet though.
>
> As a preparatory step, I think we should replace the two
> "binary" and "csv_mode" bool fields in CopyFormatOptions
> with a new "format" field of a new CopyFormat enum type.
>
> If instead introducing another bool field, I think the code would
> be too cluttered.

I'm starting a new thread for this with a more suitable subject.

/Joel