Thread: Nasty, propagating POLA violation in COPY CSV HEADER

Nasty, propagating POLA violation in COPY CSV HEADER

From

David Fetter

Date:

20 June 2012, 12:03:13

Folks,

A co-worker filed a bug against file_fdw where the columns in a
FOREIGN TABLE were scrambled on SELECT.  It turned out that this comes
from the (yes, it's documented, but since it's documented in a place
not obviously linked to the bug, it's pretty useless) "feature" of
COPY CSV HEADER whereby the header line is totally ignored in COPY
OUT.

Rather than being totally ignored in the COPY OUT (CSV HEADER) case,
the header line in should be parsed to establish which columns are
where and rearranging the output if needed.

I'm proposing to make the code change here:


http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/copy.c;h=98bcb2fcf3370c72b0f0a7c0df76ebe4512e9ab0;hb=refs/heads/master#l2436

and a suitable doc change that talks about reading the header only for
the purpose of matching column names to columns, and throwing away the
output as before.

What say?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: Nasty, propagating POLA violation in COPY CSV HEADER

From

Tom Lane

Date:

20 June 2012, 12:47:29

David Fetter <david@fetter.org> writes:
> Rather than being totally ignored in the COPY OUT (CSV HEADER) case,
> the header line in should be parsed to establish which columns are
> where and rearranging the output if needed.

This is not "fix a POLA violation".  This is a non-backwards-compatible
new feature, which would have to have a switch to turn it on.
        regards, tom lane

Re: Nasty, propagating POLA violation in COPY CSV HEADER

From

Andrew Dunstan

Date:

20 June 2012, 12:48:56

On 06/20/2012 11:02 AM, David Fetter wrote:
> Folks,
>
> A co-worker filed a bug against file_fdw where the columns in a
> FOREIGN TABLE were scrambled on SELECT.  It turned out that this comes
> from the (yes, it's documented, but since it's documented in a place
> not obviously linked to the bug, it's pretty useless) "feature" of
> COPY CSV HEADER whereby the header line is totally ignored in COPY
> OUT.
>
> Rather than being totally ignored in the COPY OUT (CSV HEADER) case,
> the header line in should be parsed to establish which columns are
> where and rearranging the output if needed.
>
> I'm proposing to make the code change here:
>
>
http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/copy.c;h=98bcb2fcf3370c72b0f0a7c0df76ebe4512e9ab0;hb=refs/heads/master#l2436
>
> and a suitable doc change that talks about reading the header only for
> the purpose of matching column names to columns, and throwing away the
> output as before.
>
> What say?
>

First you are talking about COPY IN, not COPY OUT, surely.

This is not a bug, it is documented in exactly the place that all other 
COPY options are documented. The file_fdw page refers the reader to the 
COPY docs for details. Unless you want us to duplicate the entire COPY 
docs in the file_fdw page this seems entirely reasonable.

The current behaviour was discussed at some length back when we 
implemented the HEADER feature, IIRC, and is quite intentional. I don't 
think we should alter the current behaviour, as plenty of people rely on 
it, some to my certain knowledge. I do see a reasonable case for adding 
a new behaviour which takes notice of the header line, although it's 
likely to have plenty of wrinkles.

Reordering columns like you suggest might well have a significant impact 
on COPY performance, BTW. Also note that I created the file_text_array 
FDW precisely for people who want to be able to cherry pick and reorder 
columns. See <https://github.com/adunstan/file_text_array_fdw>

cheers

andrew

Re: Nasty, propagating POLA violation in COPY CSV HEADER

From

David Fetter

Date:

20 June 2012, 13:09:36

On Wed, Jun 20, 2012 at 11:47:14AM -0400, Tom Lane wrote:
> David Fetter <david@fetter.org> writes:
> > Rather than being totally ignored in the COPY OUT (CSV HEADER)
> > case, the header line in should be parsed to establish which
> > columns are where and rearranging the output if needed.
> 
> This is not "fix a POLA violation".  This is a
> non-backwards-compatible new feature, which would have to have a
> switch to turn it on.

OK, new proposal:

COPY FROM (Thanks, Andrew!  Must not post while asleep...) should have
an option which requires that HEADER be enabled and mandates that the
column names in the header match the columns coming in.

Has someone got a better name for this option than
KEEP_HEADER_COLUMN_NAMES?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: Nasty, propagating POLA violation in COPY CSV HEADER

From

Tom Lane

Date:

20 June 2012, 13:19:03

David Fetter <david@fetter.org> writes:
> OK, new proposal:

> COPY FROM (Thanks, Andrew!  Must not post while asleep...) should have
> an option which requires that HEADER be enabled and mandates that the
> column names in the header match the columns coming in.

> Has someone got a better name for this option than
> KEEP_HEADER_COLUMN_NAMES?

Well, if it's just checking that the list matches, then maybe
CHECK_HEADER would do.

In your original proposal, I was rather wondering what should happen if
the incoming file didn't have the same set of columns called out in the
COPY command's column list.  (That is, while *rearranging* the columns
might be thought non-astonishing, I'm less convinced about a copy
operation that inserts or defaults columns differently from what the
command said should happen.)  If we're just checking for a match, that
question goes away.
        regards, tom lane

Re: Nasty, propagating POLA violation in COPY CSV HEADER

From

David Fetter

Date:

20 June 2012, 13:43:53

On Wed, Jun 20, 2012 at 12:18:45PM -0400, Tom Lane wrote:
> David Fetter <david@fetter.org> writes:
> > OK, new proposal:
> 
> > COPY FROM (Thanks, Andrew!  Must not post while asleep...) should have
> > an option which requires that HEADER be enabled and mandates that the
> > column names in the header match the columns coming in.
> 
> > Has someone got a better name for this option than
> > KEEP_HEADER_COLUMN_NAMES?
> 
> Well, if it's just checking that the list matches, then maybe
> CHECK_HEADER would do.

OK

> In your original proposal, I was rather wondering what should happen
> if the incoming file didn't have the same set of columns called out
> in the COPY command's column list.  (That is, while *rearranging*
> the columns might be thought non-astonishing, I'm less convinced
> about a copy operation that inserts or defaults columns differently
> from what the command said should happen.)  If we're just checking
> for a match, that question goes away.

Let's imagine we have a table foo(id serial, t text, n numeric) and a
CSV file foo.csv with headers (n, t).  Just to emphasize, the column
ordering is different in the places where they match.

Would
   COPY foo (t,n) FROM '/path/to/foo.csv' WITH (CSV, HEADER, CHECK_HEADER)

now "just work" (possibly with some performance penalty) and 
   COPY foo (t,n) FROM '/path/to/foo.csv' WITH (CSV, HEADER)

only "work," i.e. import gobbledygook, in the case where every t entry
in foo.csv happened to be castable to NUMERIC?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: Nasty, propagating POLA violation in COPY CSV HEADER

From

Alvaro Herrera

Date:

20 June 2012, 13:51:10

Excerpts from David Fetter's message of mié jun 20 12:43:31 -0400 2012:
> On Wed, Jun 20, 2012 at 12:18:45PM -0400, Tom Lane wrote:

> > In your original proposal, I was rather wondering what should happen
> > if the incoming file didn't have the same set of columns called out
> > in the COPY command's column list.  (That is, while *rearranging*
> > the columns might be thought non-astonishing, I'm less convinced
> > about a copy operation that inserts or defaults columns differently
> > from what the command said should happen.)  If we're just checking
> > for a match, that question goes away.
>
> Let's imagine we have a table foo(id serial, t text, n numeric) and a
> CSV file foo.csv with headers (n, t).  Just to emphasize, the column
> ordering is different in the places where they match.

For the record, IIRC we had one person trying to do this in the spanish
list not long ago.

Another related case: you have a file with headers and columns (n, t, x, y, z)
but your table only has n and t.  How would you tell COPY to discard the
junk columns?  Currently it just complains that they are there.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Nasty, propagating POLA violation in COPY CSV HEADER

From

Andrew Dunstan

Date:

20 June 2012, 13:54:52

On 06/20/2012 12:18 PM, Tom Lane wrote:
> David Fetter<david@fetter.org>  writes:
>> OK, new proposal:
>> COPY FROM (Thanks, Andrew!  Must not post while asleep...) should have
>> an option which requires that HEADER be enabled and mandates that the
>> column names in the header match the columns coming in.
>> Has someone got a better name for this option than
>> KEEP_HEADER_COLUMN_NAMES?
> Well, if it's just checking that the list matches, then maybe
> CHECK_HEADER would do.
>
> In your original proposal, I was rather wondering what should happen if
> the incoming file didn't have the same set of columns called out in the
> COPY command's column list.  (That is, while *rearranging* the columns
> might be thought non-astonishing, I'm less convinced about a copy
> operation that inserts or defaults columns differently from what the
> command said should happen.)  If we're just checking for a match, that
> question goes away.
>
>             

In the past people have asked me to have copy use the CSV header line in 
place of supplying a column list in the COPY command. I can certainly 
see some utility in that, and I think it might achieve what David wants. 
Using that scenario it would be an error to supply an explicit column 
list and also use the header line. But then I don't think CHECK_HEADER 
would be the right name for the option. In any case, specifying a name 
before we settle on an exact behaviour seems wrong :-)

cheers

andrew

Re: Nasty, propagating POLA violation in COPY CSV HEADER

From

Andrew Dunstan

Date:

20 June 2012, 13:57:20


On 06/20/2012 12:50 PM, Alvaro Herrera wrote:
> Another related case: you have a file with headers and columns (n, t, 
> x, y, z) but your table only has n and t. How would you tell COPY to 
> discard the junk columns? Currently it just complains that they are 
> there. 


That's one of the main use cases for file_text_array_fdw.

cheers

andrew

Re: Nasty, propagating POLA violation in COPY CSV HEADER

From

Josh Berkus

Date:

20 June 2012, 14:06:31

> In the past people have asked me to have copy use the CSV header line in
> place of supplying a column list in the COPY command. I can certainly
> see some utility in that, and I think it might achieve what David wants.
> Using that scenario it would be an error to supply an explicit column
> list and also use the header line. But then I don't think CHECK_HEADER
> would be the right name for the option. In any case, specifying a name
> before we settle on an exact behaviour seems wrong :-)

Actually, I can see three valid and valuable-to-users behaviors here:

1) current behavior, header line is ignored.

2) CHECK HEADER: a column list is supplied, but a "check header" flag is
set.  If the column list and header list don't match *exactly*, you get
an error.

3) USE HEADER: no column list is supplied, but a "use header" flag is
set.  A column list is created to match the columns from the CSV header.Of necessity, this will consist of all or some
ofthe columns in the
 
table.  If columns are supplied which are not in the table, then you get
an error (as well as if columns are missing which are NOT NULL, as you
get at present).

(2) is more valuable to people who want to check data integrity
rigorously, and test for unexpected API changes.  (3) is more valuable
for regular users who want CSV import to "just work".  (1) is valuable
for backwards compatibility, and for cases where the CSV header was
generated by another program (Excel?) so the column names don't match.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Nasty, propagating POLA violation in COPY CSV HEADER

From

"Marc Mamin"

Date:

20 June 2012, 14:37:42

<p><font size="2">>  -----Ursprüngliche Nachricht-----<br /> >  Von: pgsql-hackers-owner@postgresql.org im
Auftragvon Josh Berkus<br /> >  Gesendet: Mi 6/20/2012 7:06<br /> >  An: pgsql-hackers@postgresql.org<br /> > 
Betreff:Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER<br /><br /> ...<br /><br /> >   (1) is
valuable<br/> >  for backwards compatibility, and for cases where the CSV header was<br /> >  generated by
anotherprogram (Excel?) so the column names don't match.<br /><br /><br /> 4) MAP_HEADER ('column 1'-> 'col_1',
'Date'-> 'input_date' ...)<br /><br /> to cover the case when column names do not match.<br /><br /> my 2 pences,<br
/><br/> Marc Mamin<br /></font>

Re: Nasty, propagating POLA violation in COPY CSV HEADER

From

Josh Berkus

Date:

20 June 2012, 14:40:30

> 
> 4) MAP_HEADER ('column 1'-> 'col_1', 'Date' -> 'input_date' ...)
> 
> to cover the case when column names do not match.

Personally, I think that's going way beyond what we want to do with
COPY.  At that point, check out the CSV-array FDW.

Of course, if someone writes a WIP patch which implements the above, we
can evaluate it then.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Nasty, propagating POLA violation in COPY CSV HEADER

From

Alvaro Herrera

Date:

20 June 2012, 15:20:44

Excerpts from Andrew Dunstan's message of mié jun 20 12:56:52 -0400 2012:
>
> On 06/20/2012 12:50 PM, Alvaro Herrera wrote:
> > Another related case: you have a file with headers and columns (n, t,
> > x, y, z) but your table only has n and t. How would you tell COPY to
> > discard the junk columns? Currently it just complains that they are
> > there.
>
> That's one of the main use cases for file_text_array_fdw.

Ah, great, thanks.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Nasty, propagating POLA violation in COPY CSV HEADER

From

Tom Lane

Date:

20 June 2012, 16:23:26

Josh Berkus <josh@agliodbs.com> writes:
>> 
>> 4) MAP_HEADER ('column 1'-> 'col_1', 'Date' -> 'input_date' ...)
>> 
>> to cover the case when column names do not match.

> Personally, I think that's going way beyond what we want to do with
> COPY.

I agree --- if you know that much about the incoming data, you might as
well just specify the correct column list in the COPY command.  This
would only have use if you know somebody used weird column names, but
not the order they put the columns in; which seems like a thin use-case.

The three options Josh enumerated seem reasonably sane to me.
        regards, tom lane