Thread: Nasty, propagating POLA violation in COPY CSV HEADER
Folks, A co-worker filed a bug against file_fdw where the columns in a FOREIGN TABLE were scrambled on SELECT. It turned out that this comes from the (yes, it's documented, but since it's documented in a place not obviously linked to the bug, it's pretty useless) "feature" of COPY CSV HEADER whereby the header line is totally ignored in COPY OUT. Rather than being totally ignored in the COPY OUT (CSV HEADER) case, the header line in should be parsed to establish which columns are where and rearranging the output if needed. I'm proposing to make the code change here: http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/copy.c;h=98bcb2fcf3370c72b0f0a7c0df76ebe4512e9ab0;hb=refs/heads/master#l2436 and a suitable doc change that talks about reading the header only for the purpose of matching column names to columns, and throwing away the output as before. What say? Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
David Fetter <david@fetter.org> writes: > Rather than being totally ignored in the COPY OUT (CSV HEADER) case, > the header line in should be parsed to establish which columns are > where and rearranging the output if needed. This is not "fix a POLA violation". This is a non-backwards-compatible new feature, which would have to have a switch to turn it on. regards, tom lane
On 06/20/2012 11:02 AM, David Fetter wrote: > Folks, > > A co-worker filed a bug against file_fdw where the columns in a > FOREIGN TABLE were scrambled on SELECT. It turned out that this comes > from the (yes, it's documented, but since it's documented in a place > not obviously linked to the bug, it's pretty useless) "feature" of > COPY CSV HEADER whereby the header line is totally ignored in COPY > OUT. > > Rather than being totally ignored in the COPY OUT (CSV HEADER) case, > the header line in should be parsed to establish which columns are > where and rearranging the output if needed. > > I'm proposing to make the code change here: > > http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/copy.c;h=98bcb2fcf3370c72b0f0a7c0df76ebe4512e9ab0;hb=refs/heads/master#l2436 > > and a suitable doc change that talks about reading the header only for > the purpose of matching column names to columns, and throwing away the > output as before. > > What say? > First you are talking about COPY IN, not COPY OUT, surely. This is not a bug, it is documented in exactly the place that all other COPY options are documented. The file_fdw page refers the reader to the COPY docs for details. Unless you want us to duplicate the entire COPY docs in the file_fdw page this seems entirely reasonable. The current behaviour was discussed at some length back when we implemented the HEADER feature, IIRC, and is quite intentional. I don't think we should alter the current behaviour, as plenty of people rely on it, some to my certain knowledge. I do see a reasonable case for adding a new behaviour which takes notice of the header line, although it's likely to have plenty of wrinkles. Reordering columns like you suggest might well have a significant impact on COPY performance, BTW. Also note that I created the file_text_array FDW precisely for people who want to be able to cherry pick and reorder columns. See <https://github.com/adunstan/file_text_array_fdw> cheers andrew
On Wed, Jun 20, 2012 at 11:47:14AM -0400, Tom Lane wrote: > David Fetter <david@fetter.org> writes: > > Rather than being totally ignored in the COPY OUT (CSV HEADER) > > case, the header line in should be parsed to establish which > > columns are where and rearranging the output if needed. > > This is not "fix a POLA violation". This is a > non-backwards-compatible new feature, which would have to have a > switch to turn it on. OK, new proposal: COPY FROM (Thanks, Andrew! Must not post while asleep...) should have an option which requires that HEADER be enabled and mandates that the column names in the header match the columns coming in. Has someone got a better name for this option than KEEP_HEADER_COLUMN_NAMES? Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
David Fetter <david@fetter.org> writes: > OK, new proposal: > COPY FROM (Thanks, Andrew! Must not post while asleep...) should have > an option which requires that HEADER be enabled and mandates that the > column names in the header match the columns coming in. > Has someone got a better name for this option than > KEEP_HEADER_COLUMN_NAMES? Well, if it's just checking that the list matches, then maybe CHECK_HEADER would do. In your original proposal, I was rather wondering what should happen if the incoming file didn't have the same set of columns called out in the COPY command's column list. (That is, while *rearranging* the columns might be thought non-astonishing, I'm less convinced about a copy operation that inserts or defaults columns differently from what the command said should happen.) If we're just checking for a match, that question goes away. regards, tom lane
On Wed, Jun 20, 2012 at 12:18:45PM -0400, Tom Lane wrote: > David Fetter <david@fetter.org> writes: > > OK, new proposal: > > > COPY FROM (Thanks, Andrew! Must not post while asleep...) should have > > an option which requires that HEADER be enabled and mandates that the > > column names in the header match the columns coming in. > > > Has someone got a better name for this option than > > KEEP_HEADER_COLUMN_NAMES? > > Well, if it's just checking that the list matches, then maybe > CHECK_HEADER would do. OK > In your original proposal, I was rather wondering what should happen > if the incoming file didn't have the same set of columns called out > in the COPY command's column list. (That is, while *rearranging* > the columns might be thought non-astonishing, I'm less convinced > about a copy operation that inserts or defaults columns differently > from what the command said should happen.) If we're just checking > for a match, that question goes away. Let's imagine we have a table foo(id serial, t text, n numeric) and a CSV file foo.csv with headers (n, t). Just to emphasize, the column ordering is different in the places where they match. Would COPY foo (t,n) FROM '/path/to/foo.csv' WITH (CSV, HEADER, CHECK_HEADER) now "just work" (possibly with some performance penalty) and COPY foo (t,n) FROM '/path/to/foo.csv' WITH (CSV, HEADER) only "work," i.e. import gobbledygook, in the case where every t entry in foo.csv happened to be castable to NUMERIC? Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Excerpts from David Fetter's message of mié jun 20 12:43:31 -0400 2012: > On Wed, Jun 20, 2012 at 12:18:45PM -0400, Tom Lane wrote: > > In your original proposal, I was rather wondering what should happen > > if the incoming file didn't have the same set of columns called out > > in the COPY command's column list. (That is, while *rearranging* > > the columns might be thought non-astonishing, I'm less convinced > > about a copy operation that inserts or defaults columns differently > > from what the command said should happen.) If we're just checking > > for a match, that question goes away. > > Let's imagine we have a table foo(id serial, t text, n numeric) and a > CSV file foo.csv with headers (n, t). Just to emphasize, the column > ordering is different in the places where they match. For the record, IIRC we had one person trying to do this in the spanish list not long ago. Another related case: you have a file with headers and columns (n, t, x, y, z) but your table only has n and t. How would you tell COPY to discard the junk columns? Currently it just complains that they are there. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On 06/20/2012 12:18 PM, Tom Lane wrote: > David Fetter<david@fetter.org> writes: >> OK, new proposal: >> COPY FROM (Thanks, Andrew! Must not post while asleep...) should have >> an option which requires that HEADER be enabled and mandates that the >> column names in the header match the columns coming in. >> Has someone got a better name for this option than >> KEEP_HEADER_COLUMN_NAMES? > Well, if it's just checking that the list matches, then maybe > CHECK_HEADER would do. > > In your original proposal, I was rather wondering what should happen if > the incoming file didn't have the same set of columns called out in the > COPY command's column list. (That is, while *rearranging* the columns > might be thought non-astonishing, I'm less convinced about a copy > operation that inserts or defaults columns differently from what the > command said should happen.) If we're just checking for a match, that > question goes away. > > In the past people have asked me to have copy use the CSV header line in place of supplying a column list in the COPY command. I can certainly see some utility in that, and I think it might achieve what David wants. Using that scenario it would be an error to supply an explicit column list and also use the header line. But then I don't think CHECK_HEADER would be the right name for the option. In any case, specifying a name before we settle on an exact behaviour seems wrong :-) cheers andrew
On 06/20/2012 12:50 PM, Alvaro Herrera wrote: > Another related case: you have a file with headers and columns (n, t, > x, y, z) but your table only has n and t. How would you tell COPY to > discard the junk columns? Currently it just complains that they are > there. That's one of the main use cases for file_text_array_fdw. cheers andrew
> In the past people have asked me to have copy use the CSV header line in > place of supplying a column list in the COPY command. I can certainly > see some utility in that, and I think it might achieve what David wants. > Using that scenario it would be an error to supply an explicit column > list and also use the header line. But then I don't think CHECK_HEADER > would be the right name for the option. In any case, specifying a name > before we settle on an exact behaviour seems wrong :-) Actually, I can see three valid and valuable-to-users behaviors here: 1) current behavior, header line is ignored. 2) CHECK HEADER: a column list is supplied, but a "check header" flag is set. If the column list and header list don't match *exactly*, you get an error. 3) USE HEADER: no column list is supplied, but a "use header" flag is set. A column list is created to match the columns from the CSV header.Of necessity, this will consist of all or some ofthe columns in the table. If columns are supplied which are not in the table, then you get an error (as well as if columns are missing which are NOT NULL, as you get at present). (2) is more valuable to people who want to check data integrity rigorously, and test for unexpected API changes. (3) is more valuable for regular users who want CSV import to "just work". (1) is valuable for backwards compatibility, and for cases where the CSV header was generated by another program (Excel?) so the column names don't match. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
<p><font size="2">> -----Ursprüngliche Nachricht-----<br /> > Von: pgsql-hackers-owner@postgresql.org im Auftragvon Josh Berkus<br /> > Gesendet: Mi 6/20/2012 7:06<br /> > An: pgsql-hackers@postgresql.org<br /> > Betreff:Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER<br /><br /> ...<br /><br /> > (1) is valuable<br/> > for backwards compatibility, and for cases where the CSV header was<br /> > generated by anotherprogram (Excel?) so the column names don't match.<br /><br /><br /> 4) MAP_HEADER ('column 1'-> 'col_1', 'Date'-> 'input_date' ...)<br /><br /> to cover the case when column names do not match.<br /><br /> my 2 pences,<br /><br/> Marc Mamin<br /></font>
> > 4) MAP_HEADER ('column 1'-> 'col_1', 'Date' -> 'input_date' ...) > > to cover the case when column names do not match. Personally, I think that's going way beyond what we want to do with COPY. At that point, check out the CSV-array FDW. Of course, if someone writes a WIP patch which implements the above, we can evaluate it then. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Excerpts from Andrew Dunstan's message of mié jun 20 12:56:52 -0400 2012: > > On 06/20/2012 12:50 PM, Alvaro Herrera wrote: > > Another related case: you have a file with headers and columns (n, t, > > x, y, z) but your table only has n and t. How would you tell COPY to > > discard the junk columns? Currently it just complains that they are > > there. > > That's one of the main use cases for file_text_array_fdw. Ah, great, thanks. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Josh Berkus <josh@agliodbs.com> writes: >> >> 4) MAP_HEADER ('column 1'-> 'col_1', 'Date' -> 'input_date' ...) >> >> to cover the case when column names do not match. > Personally, I think that's going way beyond what we want to do with > COPY. I agree --- if you know that much about the incoming data, you might as well just specify the correct column list in the COPY command. This would only have use if you know somebody used weird column names, but not the order they put the columns in; which seems like a thin use-case. The three options Josh enumerated seem reasonably sane to me. regards, tom lane