Thread: request for help with COPY syntax

request for help with COPY syntax

From

"Chuck D."

Date:

23 October 2007, 13:25:27

Greetings everyone,

I'm having some trouble with COPY syntax.

I'm importing the cities data from MaxMind, but I run into errors when the 
data adds a double quote inside a field.

The data is CSV, comma delimited, no quotes around fields, ISO-8859-1.  I'm 
using COPY with the defaults and setting client encoding to LATIN1.

The temporary table for importing looks like this:

         Table "geo.orig_city_maxmind"  Column    |         Type          | Modifiers
-------------+-----------------------+-----------cc1         | character(2)          |city        | text
 |accent_city | text                  |region      | character(3)          |latitude    | character varying(18)
|longitude  | character varying(18) |
 

The COPY command is:

COPY geo.orig_city_maxmind FROM '/home/www/geo/DATA/MAXMIND.com/cities_no_header.txt'
CSV;


Here is one error I get:

ERROR:  value too long for type character(3)
CONTEXT:  COPY orig_city_maxmind, line 281430, column region: "52.1438889"

Looking at line 281430 we see:

by,kruhavyetskalini"na,KruhavyetsKalini"na,02,52.1438889,31.6925

There are a couple " where I would expect to see ' instead.  I see other lines 
in the data that use both in a field.

I tried this with the earth-info.nga.mil data and I have a similar problem but 
they are using newlines within a field and I can't figure out how to allow 
them.

Anyone known how I can rewrite the COPY command to allow those " or ' within 
the data?  After a couple days I wasn't able to find any examples to help.

Re: request for help with COPY syntax

From

Andrew Sullivan

Date:

23 October 2007, 13:47:04

On Tue, Oct 23, 2007 at 10:19:07AM -0600, Chuck D. wrote:
> by,kruhavyetskalini"na,KruhavyetsKalini"na,02,52.1438889,31.6925
> 
> There are a couple " where I would expect to see ' instead.  I see other lines 
> in the data that use both in a field.

Ugh.  I think I would normalise the data before COPYing, myself. This
is a generally good practice for importing data: too much
intelligence in the import stage itself can cause unexpected side
effects and debugging pain.  Better to put an extra step in that
ensures the data is all marked up consistently on the way into the
import step.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
"The year's penultimate month" is not in truth a good way of saying
November.    --H.W. Fowler

Re: request for help with COPY syntax

From

"Fernando Hevia"

Date:

23 October 2007, 13:58:35

> -----Mensaje original-----
> De: pgsql-sql-owner@postgresql.org [mailto:pgsql-sql-owner@postgresql.org]
> En nombre de Chuck D.
> 

> Anyone known how I can rewrite the COPY command to allow those " or '
> within
> the data?  After a couple days I wasn't able to find any examples to help.
> 

Hi Chuck,
Do you need those characters in your table? If not I think you will be
better off preprocessing the data before running copy.

Replacing those " for ' or directly removing them is quite simple if you are
working in Unix, actually it should be quite simple in any operating system.

Regards,
Fernando

Re: request for help with COPY syntax

From

Adrian Klaver

Date:

23 October 2007, 23:51:09

On Tuesday 23 October 2007 9:19 am, Chuck D. wrote:
> Greetings everyone,
>
> I'm having some trouble with COPY syntax.
>
> I'm importing the cities data from MaxMind, but I run into errors when the
> data adds a double quote inside a field.
>
> The data is CSV, comma delimited, no quotes around fields, ISO-8859-1.  I'm
> using COPY with the defaults and setting client encoding to LATIN1.
>
> The temporary table for importing looks like this:
>
>
>           Table "geo.orig_city_maxmind"
>    Column    |         Type          | Modifiers
> -------------+-----------------------+-----------
>  cc1         | character(2)          |
>  city        | text                  |
>  accent_city | text                  |
>  region      | character(3)          |
>  latitude    | character varying(18) |
>  longitude   | character varying(18) |
>
> The COPY command is:
>
> COPY geo.orig_city_maxmind
>   FROM '/home/www/geo/DATA/MAXMIND.com/cities_no_header.txt'
> CSV;
>
>
> Here is one error I get:
>
> ERROR:  value too long for type character(3)
> CONTEXT:  COPY orig_city_maxmind, line 281430, column region: "52.1438889"
>
> Looking at line 281430 we see:
>
> by,kruhavyetskalini"na,KruhavyetsKalini"na,02,52.1438889,31.6925
>
> There are a couple " where I would expect to see ' instead.  I see other
> lines in the data that use both in a field.
>
> I tried this with the earth-info.nga.mil data and I have a similar problem
> but they are using newlines within a field and I can't figure out how to
> allow them.
>
> Anyone known how I can rewrite the COPY command to allow those " or '
> within the data?  After a couple days I wasn't able to find any examples to
> help.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org

I got it to work with your sample data by using the COPY command as follows:
COPY geo.orig_city_maxmindFROM '/home/www/geo/DATA/MAXMIND.com/cities_no_header.txt'CSV quote as '''';

-- 
Adrian Klaver
aklaver@comcast.net

Re: request for help with COPY syntax

From

"Chuck D."

Date:

24 October 2007, 03:59:27

On October 23, 2007 08:51:18 pm you wrote:
>
> I got it to work with your sample data by using the COPY command as
> follows: COPY geo.orig_city_maxmind
>     FROM '/home/www/geo/DATA/MAXMIND.com/cities_no_header.txt'
>     CSV quote as '''';

I see what you are after and you solved the syntax problem I was having, but 
now I get the same error on a different line:

ad,l'aldosa,L'Aldosa,02,42.5833333,1.6333333

I guess it considers that line quoted now.  In other words, some lines may 
have both single and double quotes involved, like this:

kz,otdeleniye imeni dvadtsat' vtorogo parts"yezda,Otdeleniye Imeni Dvadtsat' 
Vtorogo Parts"yezda,10,41.47,69.1280556

Is there any way to tell Postgresql that a CSV file has no quotes around each 
field, and that each field may have single or double quotes or both?

Re: request for help with COPY syntax

From

"Chuck D."

Date:

24 October 2007, 04:02:44

On October 23, 2007 10:44:51 am you wrote:
> Hi Chuck,
> Do you need those characters in your table? If not I think you will be
> better off preprocessing the data before running copy.
>
> Replacing those " for ' or directly removing them is quite simple if you
> are working in Unix, actually it should be quite simple in any operating
> system.
>
> Regards,
> Fernando

Greetings,

I'm not sure if they are needed because I've never seen a double quote in a 
place name before.  I don't believe they are errors though because there are 
more records that contain them.  As well, some records have single and double 
quotes allowed within a record and this really messes things up.

Any ideas?  Should I consider removing them in favor of a single quote?  If 
so, do you know how to do this with sed or similar?

Re: request for help with COPY syntax

From

Paul Lambert

Date:

24 October 2007, 04:13:18

Chuck D. wrote:
> Greetings everyone,
> 
> I'm having some trouble with COPY syntax.
> 
> I'm importing the cities data from MaxMind, but I run into errors when the 
> data adds a double quote inside a field.
> 
> The data is CSV, comma delimited, no quotes around fields, ISO-8859-1.  I'm 
> using COPY with the defaults and setting client encoding to LATIN1.
> 
> The temporary table for importing looks like this:
> 
> 
>           Table "geo.orig_city_maxmind"
>    Column    |         Type          | Modifiers
> -------------+-----------------------+-----------
>  cc1         | character(2)          |
>  city        | text                  |
>  accent_city | text                  |
>  region      | character(3)          |
>  latitude    | character varying(18) |
>  longitude   | character varying(18) |
> 
> The COPY command is:
> 
> COPY geo.orig_city_maxmind
>   FROM '/home/www/geo/DATA/MAXMIND.com/cities_no_header.txt'
> CSV;
> 
> 
> Here is one error I get:
> 
> ERROR:  value too long for type character(3)
> CONTEXT:  COPY orig_city_maxmind, line 281430, column region: "52.1438889"
> 
> Looking at line 281430 we see:
> 
> by,kruhavyetskalini"na,KruhavyetsKalini"na,02,52.1438889,31.6925
> 
> There are a couple " where I would expect to see ' instead.  I see other lines 
> in the data that use both in a field.
> 
> I tried this with the earth-info.nga.mil data and I have a similar problem but 
> they are using newlines within a field and I can't figure out how to allow 
> them.
> 
> Anyone known how I can rewrite the COPY command to allow those " or ' within 
> the data?  After a couple days I wasn't able to find any examples to help.
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
> 
>                http://archives.postgresql.org
> 
> 

I get around this problem with my data loads by specifying some other 
arbitrary character that I know won't appear in the data as the quote 
character.

Eg QUOTE E'\f' will specify form feed as the quote character, ergo any 
data with double or single quotes will be loaded with those quote 
characters in the string.

Something similar may help with your case.

-- 
Paul Lambert
Database Administrator
AutoLedgers

Re: request for help with COPY syntax

From

Harald Fuchs

Date:

24 October 2007, 06:19:56

In article <200710240059.04348.pgsql-list@nullmx.com>,
"Chuck D." <pgsql-list@nullmx.com> writes:

> On October 23, 2007 08:51:18 pm you wrote:
>> 
>> I got it to work with your sample data by using the COPY command as
>> follows: COPY geo.orig_city_maxmind
>> FROM '/home/www/geo/DATA/MAXMIND.com/cities_no_header.txt'
>> CSV quote as '''';


> I see what you are after and you solved the syntax problem I was having, but 
> now I get the same error on a different line:

> ad,l'aldosa,L'Aldosa,02,42.5833333,1.6333333

> I guess it considers that line quoted now.  In other words, some lines may 
> have both single and double quotes involved, like this:

> kz,otdeleniye imeni dvadtsat' vtorogo parts"yezda,Otdeleniye Imeni Dvadtsat' 
> Vtorogo Parts"yezda,10,41.47,69.1280556

> Is there any way to tell Postgresql that a CSV file has no quotes around each 
> field, and that each field may have single or double quotes or both?

Can't you just say
 COPY geo.orig_city_maxmind FROM '/home/www/geo/DATA/MAXMIND.com/cities_no_header.txt' DELIMITER ','

?

Re: request for help with COPY syntax

From

"Fernando Hevia"

Date:

24 October 2007, 10:56:12


> De: Chuck D.
> 
> I'm not sure if they are needed because I've never seen a double quote in
> a
> place name before.  I don't believe they are errors though because there
> are
> more records that contain them.  As well, some records have single and
> double
> quotes allowed within a record and this really messes things up.
> 
> Any ideas?  Should I consider removing them in favor of a single quote?
> If
> so, do you know how to do this with sed or similar?
> 

Well, hard to say what to do with those quotes without knowing if your query
conditions for places will include them.
I probably would replace them for an underscore or any other uniquely
identifiable character in order to succeed with the copy, and keep the
possibility to later decide if that underscore becomes again a quote or gets
removed all together. 

If you would like to just remove single or double quotes you should do:  sed "s/[\'\"]//g" file_with_quotes.txt >
file_without_quotes.txt

Say you want to replace quotes with a space, then:  sed "s/[\'\"]/ /g" file_with_quotes.txt > file_without_quotes.txt

Insert whatever you want to replace the quotes between the 2nd and 3rd bar
(/).

Regards,
Fernando.

Re: request for help with COPY syntax

From

"Chuck D."

Date:

25 October 2007, 12:41:03

On October 24, 2007 01:10:59 am Paul Lambert wrote:
>
> I get around this problem with my data loads by specifying some other
> arbitrary character that I know won't appear in the data as the quote
> character.
>
> Eg QUOTE E'\f' will specify form feed as the quote character, ergo any
> data with double or single quotes will be loaded with those quote
> characters in the string.
>
> Something similar may help with your case.

This was the solution.  I specified a quote character that was not in the data 
and the data imported perfectly.  Without specifying any delimiter postgres 
defaults to one of the quotes (I forget which).

Unfortunately, the data I imported wasn't good.  MaxMind, like the 
Geonames.org derivatives, uses FIPS code for a state identifier in the cities 
table for all countries EXCEPT USA in which case they use the iso code.  Both 
these data sets mix types within one column and I find that absolutely 
unacceptable.

Back to my original problem, which was trying to COPY in some of the 
earth-info.nga.mil world city data.  This data is tab delimited, no quotes 
around fields, newline line terminated and UTF-8 encoded.

Using a similar COPY statement with the defaults, it fails with this:

COPY geo.orig_city FROM '/home/www/geo/DATA/nga.mil/geonames_no_header.txt';

ERROR:  literal carriage return found in data
HINT:  Use "\r" to represent carriage return.
CONTEXT:  COPY orig_city, line 1071850

And of course, at that line we find a field that has several lines which 
appear (using cat -A) to be terminated with a new line ($).  I originally 
deleted this line but there are others like it.  And the file is 2 Gigs in 
size so it isn't acceptable to comb through it.

I believe this is a new problem because I have a vintage file dated early 2007 
that didn't have this problem.  Does anyone know how to solve this COPY 
issue?

Re: request for help with COPY syntax

From

"Chuck D."

Date:

25 October 2007, 13:16:35

On October 25, 2007 09:35:23 am Chuck D. wrote:
> On October 24, 2007 01:10:59 am Paul Lambert wrote:
> > I get around this problem with my data loads by specifying some other
> > arbitrary character that I know won't appear in the data as the quote
> > character.
> >
> > Eg QUOTE E'\f' will specify form feed as the quote character, ergo any
> > data with double or single quotes will be loaded with those quote
> > characters in the string.
> >
> > Something similar may help with your case.
>
> This was the solution.  I specified a quote character that was not in the
> data and the data imported perfectly.  Without specifying any delimiter
> postgres defaults to one of the quotes (I forget which).
>
> Unfortunately, the data I imported wasn't good.  MaxMind, like the
> Geonames.org derivatives, uses FIPS code for a state identifier in the
> cities table for all countries EXCEPT USA in which case they use the iso
> code.  Both these data sets mix types within one column and I find that
> absolutely unacceptable.
>
> Back to my original problem, which was trying to COPY in some of the
> earth-info.nga.mil world city data.  This data is tab delimited, no quotes
> around fields, newline line terminated and UTF-8 encoded.
>
> Using a similar COPY statement with the defaults, it fails with this:
>
> COPY geo.orig_city FROM
> '/home/www/geo/DATA/nga.mil/geonames_no_header.txt';
>
> ERROR:  literal carriage return found in data
> HINT:  Use "\r" to represent carriage return.
> CONTEXT:  COPY orig_city, line 1071850
>
> And of course, at that line we find a field that has several lines which
> appear (using cat -A) to be terminated with a new line ($).  I originally
> deleted this line but there are others like it.  And the file is 2 Gigs in
> size so it isn't acceptable to comb through it.
>
> I believe this is a new problem because I have a vintage file dated early
> 2007 that didn't have this problem.  Does anyone know how to solve this
> COPY issue?
>


Pardon me on this, the cat -A report for the failed line (and subsequent 
lines) shows ^M$ within the field, not just $.

I assume that is probably a \r\n and postgres wants \r for field data and \n 
to end a line.

I've tried working this over with sed but can't get the syntax right.  I also 
have iconv installed if that would help any.  Are there any good tools that 
will tell me what this really is instead of just ^M$ ?

Re: request for help with COPY syntax

From

Bricklen Anderson

Date:

25 October 2007, 13:58:15

Chuck D. wrote:

> Pardon me on this, the cat -A report for the failed line (and subsequent 
> lines) shows ^M$ within the field, not just $.
> 
> I assume that is probably a \r\n and postgres wants \r for field data and \n 
> to end a line.
> 
> I've tried working this over with sed but can't get the syntax right.  I also 
> have iconv installed if that would help any.  Are there any good tools that 
> will tell me what this really is instead of just ^M$ ?

If all you just want to do is strip out the ^M, you can run dos2unix on 
it, assuming that you are running a *nix distro.

Re: request for help with COPY syntax

From

"Chuck D."

Date:

25 October 2007, 16:13:59

On October 25, 2007 10:57:49 am you wrote:
>
> If all you just want to do is strip out the ^M, you can run dos2unix on
> it, assuming that you are running a *nix distro.

Well, I guess I could strip the ^M but I'm still left with a $ in the middle 
of a field which in the same as the line terminator, so COPY thinks it is at 
the end of a line when it is really in the middle of the field.  I really 
wish they would have quoted these fields, but I'm at a loss how to import 
these.

Re: request for help with COPY syntax

From

"Fernando Hevia"

Date:

25 October 2007, 18:16:47

> On October 25, 2007 10:57:49 am you wrote:
> >
> > If all you just want to do is strip out the ^M, you can run dos2unix on
> > it, assuming that you are running a *nix distro.
> 
> Well, I guess I could strip the ^M but I'm still left with a $ in the
> middle
> of a field which in the same as the line terminator, so COPY thinks it is
> at
> the end of a line when it is really in the middle of the field.  I really
> wish they would have quoted these fields, but I'm at a loss how to import
> these.
> 

As I understand it when a line starts with $ you would like to merge it with
the previous line.

I suppose you have a file like this:

--- test.txt ---
this is 
$field1, and this is 
$field2

I'll create the test file:

$ printf "this is \n\$field1, and this is \n\$field2\n" > test.txt

(I assume ^M have already been replaced so \n are used instead)

A short C program should do it:

/*------ code listing -----*/
#include <stdio.h>
#include <stdlib.h>

#define NL '\n'
#define FILTER '$'

int main(int argc, char *argv[]) {   FILE *fp;   char c;
   if (argc < 2) fp=stdin;   else {       fp=fopen(argv[1], "r");       if (!fp) {           perror(argv[1]);
exit(1);      }   }
 
   c=fgetc(fp);   while(!feof(fp)) {       if(c==NL) {           c=fgetc(fp);           if(feof(fp)) {
putchar(NL);              break;           }       }       if(c!=FILTER) putchar(c);       c=fgetc(fp);   }   exit
(0);
}
/*------------------*/

compile as:
$ gcc -o test test.c

Execute as:
$ test test.txt
this is  field1, and this is field2


Could this be of help?

Regards,
Fernando.

Re: request for help with COPY syntax

From

"Chuck D."

Date:

25 October 2007, 23:16:29

On October 25, 2007 03:16:59 pm Fernando Hevia wrote:
>
> As I understand it when a line starts with $ you would like to merge it
> with the previous line.
>

No, it appears the data file I am attempting to COPY has some records with 
fields that contain a CR/LF in the data of that field.  Postgres COPY fails 
like this:

ERROR:  literal carriage return found in data
HINT:  Use "\r" to represent carriage return.
CONTEXT:  COPY orig_city_world, line 1071850

I tried this, which I found on the web from Tom Lane:

sed 's/^M/\\r/g' geonames.txt > geonames_fixed.txt

But still get the same error.  I used ctrl-v ctrl-m to reproduce the ^M.  Not 
sure why it is kicking out those lines still.

Re: request for help with COPY syntax

From

Tom Lane

Date:

26 October 2007, 00:22:31

"Chuck D." <pgsql-list@nullmx.com> writes:
> I tried this, which I found on the web from Tom Lane:

> sed 's/^M/\\r/g' geonames.txt > geonames_fixed.txt

> But still get the same error.  I used ctrl-v ctrl-m to reproduce the
> ^M.  Not sure why it is kicking out those lines still.

Did the sed actually do anything?  (Hint: the file size of
geonames_fixed.txt would be larger than geonames.txt if it did.
Or you could diff the two files to confirm that something sensible
happened.)

I suspect that your shell may be fouling things up here.  You may need
to prepare this command as a one-line shell script, using an editor that
doesn't barf on bare carriage returns ...
        regards, tom lane

Re: request for help with COPY syntax

From

"Chuck D."

Date:

26 October 2007, 01:26:32

On October 25, 2007 09:22:10 pm you wrote:
> Did the sed actually do anything?  (Hint: the file size of
> geonames_fixed.txt would be larger than geonames.txt if it did.
> Or you could diff the two files to confirm that something sensible
> happened.)
>
> I suspect that your shell may be fouling things up here.  You may need
> to prepare this command as a one-line shell script, using an editor that
> doesn't barf on bare carriage returns ...
>
>             regards, tom lane


OK, here's a bit more info after testing the sed on a bash line and in 
a /bin/sh script edited with vi.

One of the troubling rows from the original file looks like this.  This is a 
couple fields, not the whole row.

cat -A short.txt

^IJishishan Bonanzu Dongxiangzu Salarzu Zizhixian^M$
Jishishan Bonanzu Dongxiangzu Salarzu Zizhixian^M$
Jishishan Bonanzu Dongxiangzu Salarzu Zizhixian^M$
^I2007-07-06$

Here's the shell script:

/bin/sed 's/^M/\\r/' < short.txt > short.out

And the result:

^IJishishan Bonanzu Dongxiangzu Salarzu Zizhixian\r$
Jishishan Bonanzu Dongxiangzu Salarzu Zizhixian\r$
Jishishan Bonanzu Dongxiangzu Salarzu Zizhixian\r$
^I2007-07-06$


Of course it sees the \r$ as the end of line and fails with:

ERROR:  missing data for column "full_name"

Does that help diagnose?

Re: request for help with COPY syntax

From

"Chuck D."

Date:

26 October 2007, 13:30:32

On October 25, 2007 09:22:10 pm Tom Lane wrote:
>
> Did the sed actually do anything?  (Hint: the file size of
> geonames_fixed.txt would be larger than geonames.txt if it did.
> Or you could diff the two files to confirm that something sensible
> happened.)
>
> I suspect that your shell may be fouling things up here.  You may need
> to prepare this command as a one-line shell script, using an editor that
> doesn't barf on bare carriage returns ...

Can I ask this, what is the objective here?

I have a field with ^M$ in it (shows as CR/LF in text editor).  The objective 
should be to replace the CR/LF with just a CR shouldn't it?

Or is the objective to replace the CR/LF with a literal \r?  I'm a bit 
confused by the output of COPY that says to use "\r".  Looking at other data 
in my database (user message) I'd have to say it needs to be replaced with a 
literal \r correct?

If that is the case, the sed script doesn't work because it is only replacing 
the CR and still leaves the LF which terminates a line.  I found a little 
perl to do the job of replacing the CR/LF with a literal \r

perl -p -e 's/\r\n/\\r/g' < cities.txt > cities_fixed.txt