Thread: request for help with COPY syntax
Greetings everyone, I'm having some trouble with COPY syntax. I'm importing the cities data from MaxMind, but I run into errors when the data adds a double quote inside a field. The data is CSV, comma delimited, no quotes around fields, ISO-8859-1. I'm using COPY with the defaults and setting client encoding to LATIN1. The temporary table for importing looks like this: Table "geo.orig_city_maxmind" Column | Type | Modifiers -------------+-----------------------+-----------cc1 | character(2) |city | text |accent_city | text |region | character(3) |latitude | character varying(18) |longitude | character varying(18) | The COPY command is: COPY geo.orig_city_maxmind FROM '/home/www/geo/DATA/MAXMIND.com/cities_no_header.txt' CSV; Here is one error I get: ERROR: value too long for type character(3) CONTEXT: COPY orig_city_maxmind, line 281430, column region: "52.1438889" Looking at line 281430 we see: by,kruhavyetskalini"na,KruhavyetsKalini"na,02,52.1438889,31.6925 There are a couple " where I would expect to see ' instead. I see other lines in the data that use both in a field. I tried this with the earth-info.nga.mil data and I have a similar problem but they are using newlines within a field and I can't figure out how to allow them. Anyone known how I can rewrite the COPY command to allow those " or ' within the data? After a couple days I wasn't able to find any examples to help.
On Tue, Oct 23, 2007 at 10:19:07AM -0600, Chuck D. wrote: > by,kruhavyetskalini"na,KruhavyetsKalini"na,02,52.1438889,31.6925 > > There are a couple " where I would expect to see ' instead. I see other lines > in the data that use both in a field. Ugh. I think I would normalise the data before COPYing, myself. This is a generally good practice for importing data: too much intelligence in the import stage itself can cause unexpected side effects and debugging pain. Better to put an extra step in that ensures the data is all marked up consistently on the way into the import step. A -- Andrew Sullivan | ajs@crankycanuck.ca "The year's penultimate month" is not in truth a good way of saying November. --H.W. Fowler
> -----Mensaje original----- > De: pgsql-sql-owner@postgresql.org [mailto:pgsql-sql-owner@postgresql.org] > En nombre de Chuck D. > > Anyone known how I can rewrite the COPY command to allow those " or ' > within > the data? After a couple days I wasn't able to find any examples to help. > Hi Chuck, Do you need those characters in your table? If not I think you will be better off preprocessing the data before running copy. Replacing those " for ' or directly removing them is quite simple if you are working in Unix, actually it should be quite simple in any operating system. Regards, Fernando
On Tuesday 23 October 2007 9:19 am, Chuck D. wrote: > Greetings everyone, > > I'm having some trouble with COPY syntax. > > I'm importing the cities data from MaxMind, but I run into errors when the > data adds a double quote inside a field. > > The data is CSV, comma delimited, no quotes around fields, ISO-8859-1. I'm > using COPY with the defaults and setting client encoding to LATIN1. > > The temporary table for importing looks like this: > > > Table "geo.orig_city_maxmind" > Column | Type | Modifiers > -------------+-----------------------+----------- > cc1 | character(2) | > city | text | > accent_city | text | > region | character(3) | > latitude | character varying(18) | > longitude | character varying(18) | > > The COPY command is: > > COPY geo.orig_city_maxmind > FROM '/home/www/geo/DATA/MAXMIND.com/cities_no_header.txt' > CSV; > > > Here is one error I get: > > ERROR: value too long for type character(3) > CONTEXT: COPY orig_city_maxmind, line 281430, column region: "52.1438889" > > Looking at line 281430 we see: > > by,kruhavyetskalini"na,KruhavyetsKalini"na,02,52.1438889,31.6925 > > There are a couple " where I would expect to see ' instead. I see other > lines in the data that use both in a field. > > I tried this with the earth-info.nga.mil data and I have a similar problem > but they are using newlines within a field and I can't figure out how to > allow them. > > Anyone known how I can rewrite the COPY command to allow those " or ' > within the data? After a couple days I wasn't able to find any examples to > help. > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org I got it to work with your sample data by using the COPY command as follows: COPY geo.orig_city_maxmindFROM '/home/www/geo/DATA/MAXMIND.com/cities_no_header.txt'CSV quote as ''''; -- Adrian Klaver aklaver@comcast.net
On October 23, 2007 08:51:18 pm you wrote: > > I got it to work with your sample data by using the COPY command as > follows: COPY geo.orig_city_maxmind > FROM '/home/www/geo/DATA/MAXMIND.com/cities_no_header.txt' > CSV quote as ''''; I see what you are after and you solved the syntax problem I was having, but now I get the same error on a different line: ad,l'aldosa,L'Aldosa,02,42.5833333,1.6333333 I guess it considers that line quoted now. In other words, some lines may have both single and double quotes involved, like this: kz,otdeleniye imeni dvadtsat' vtorogo parts"yezda,Otdeleniye Imeni Dvadtsat' Vtorogo Parts"yezda,10,41.47,69.1280556 Is there any way to tell Postgresql that a CSV file has no quotes around each field, and that each field may have single or double quotes or both?
On October 23, 2007 10:44:51 am you wrote: > Hi Chuck, > Do you need those characters in your table? If not I think you will be > better off preprocessing the data before running copy. > > Replacing those " for ' or directly removing them is quite simple if you > are working in Unix, actually it should be quite simple in any operating > system. > > Regards, > Fernando Greetings, I'm not sure if they are needed because I've never seen a double quote in a place name before. I don't believe they are errors though because there are more records that contain them. As well, some records have single and double quotes allowed within a record and this really messes things up. Any ideas? Should I consider removing them in favor of a single quote? If so, do you know how to do this with sed or similar?
Chuck D. wrote: > Greetings everyone, > > I'm having some trouble with COPY syntax. > > I'm importing the cities data from MaxMind, but I run into errors when the > data adds a double quote inside a field. > > The data is CSV, comma delimited, no quotes around fields, ISO-8859-1. I'm > using COPY with the defaults and setting client encoding to LATIN1. > > The temporary table for importing looks like this: > > > Table "geo.orig_city_maxmind" > Column | Type | Modifiers > -------------+-----------------------+----------- > cc1 | character(2) | > city | text | > accent_city | text | > region | character(3) | > latitude | character varying(18) | > longitude | character varying(18) | > > The COPY command is: > > COPY geo.orig_city_maxmind > FROM '/home/www/geo/DATA/MAXMIND.com/cities_no_header.txt' > CSV; > > > Here is one error I get: > > ERROR: value too long for type character(3) > CONTEXT: COPY orig_city_maxmind, line 281430, column region: "52.1438889" > > Looking at line 281430 we see: > > by,kruhavyetskalini"na,KruhavyetsKalini"na,02,52.1438889,31.6925 > > There are a couple " where I would expect to see ' instead. I see other lines > in the data that use both in a field. > > I tried this with the earth-info.nga.mil data and I have a similar problem but > they are using newlines within a field and I can't figure out how to allow > them. > > Anyone known how I can rewrite the COPY command to allow those " or ' within > the data? After a couple days I wasn't able to find any examples to help. > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org > > I get around this problem with my data loads by specifying some other arbitrary character that I know won't appear in the data as the quote character. Eg QUOTE E'\f' will specify form feed as the quote character, ergo any data with double or single quotes will be loaded with those quote characters in the string. Something similar may help with your case. -- Paul Lambert Database Administrator AutoLedgers
In article <200710240059.04348.pgsql-list@nullmx.com>, "Chuck D." <pgsql-list@nullmx.com> writes: > On October 23, 2007 08:51:18 pm you wrote: >> >> I got it to work with your sample data by using the COPY command as >> follows: COPY geo.orig_city_maxmind >> FROM '/home/www/geo/DATA/MAXMIND.com/cities_no_header.txt' >> CSV quote as ''''; > I see what you are after and you solved the syntax problem I was having, but > now I get the same error on a different line: > ad,l'aldosa,L'Aldosa,02,42.5833333,1.6333333 > I guess it considers that line quoted now. In other words, some lines may > have both single and double quotes involved, like this: > kz,otdeleniye imeni dvadtsat' vtorogo parts"yezda,Otdeleniye Imeni Dvadtsat' > Vtorogo Parts"yezda,10,41.47,69.1280556 > Is there any way to tell Postgresql that a CSV file has no quotes around each > field, and that each field may have single or double quotes or both? Can't you just say COPY geo.orig_city_maxmind FROM '/home/www/geo/DATA/MAXMIND.com/cities_no_header.txt' DELIMITER ',' ?
> De: Chuck D. > > I'm not sure if they are needed because I've never seen a double quote in > a > place name before. I don't believe they are errors though because there > are > more records that contain them. As well, some records have single and > double > quotes allowed within a record and this really messes things up. > > Any ideas? Should I consider removing them in favor of a single quote? > If > so, do you know how to do this with sed or similar? > Well, hard to say what to do with those quotes without knowing if your query conditions for places will include them. I probably would replace them for an underscore or any other uniquely identifiable character in order to succeed with the copy, and keep the possibility to later decide if that underscore becomes again a quote or gets removed all together. If you would like to just remove single or double quotes you should do: sed "s/[\'\"]//g" file_with_quotes.txt > file_without_quotes.txt Say you want to replace quotes with a space, then: sed "s/[\'\"]/ /g" file_with_quotes.txt > file_without_quotes.txt Insert whatever you want to replace the quotes between the 2nd and 3rd bar (/). Regards, Fernando.
On October 24, 2007 01:10:59 am Paul Lambert wrote: > > I get around this problem with my data loads by specifying some other > arbitrary character that I know won't appear in the data as the quote > character. > > Eg QUOTE E'\f' will specify form feed as the quote character, ergo any > data with double or single quotes will be loaded with those quote > characters in the string. > > Something similar may help with your case. This was the solution. I specified a quote character that was not in the data and the data imported perfectly. Without specifying any delimiter postgres defaults to one of the quotes (I forget which). Unfortunately, the data I imported wasn't good. MaxMind, like the Geonames.org derivatives, uses FIPS code for a state identifier in the cities table for all countries EXCEPT USA in which case they use the iso code. Both these data sets mix types within one column and I find that absolutely unacceptable. Back to my original problem, which was trying to COPY in some of the earth-info.nga.mil world city data. This data is tab delimited, no quotes around fields, newline line terminated and UTF-8 encoded. Using a similar COPY statement with the defaults, it fails with this: COPY geo.orig_city FROM '/home/www/geo/DATA/nga.mil/geonames_no_header.txt'; ERROR: literal carriage return found in data HINT: Use "\r" to represent carriage return. CONTEXT: COPY orig_city, line 1071850 And of course, at that line we find a field that has several lines which appear (using cat -A) to be terminated with a new line ($). I originally deleted this line but there are others like it. And the file is 2 Gigs in size so it isn't acceptable to comb through it. I believe this is a new problem because I have a vintage file dated early 2007 that didn't have this problem. Does anyone know how to solve this COPY issue?
On October 25, 2007 09:35:23 am Chuck D. wrote: > On October 24, 2007 01:10:59 am Paul Lambert wrote: > > I get around this problem with my data loads by specifying some other > > arbitrary character that I know won't appear in the data as the quote > > character. > > > > Eg QUOTE E'\f' will specify form feed as the quote character, ergo any > > data with double or single quotes will be loaded with those quote > > characters in the string. > > > > Something similar may help with your case. > > This was the solution. I specified a quote character that was not in the > data and the data imported perfectly. Without specifying any delimiter > postgres defaults to one of the quotes (I forget which). > > Unfortunately, the data I imported wasn't good. MaxMind, like the > Geonames.org derivatives, uses FIPS code for a state identifier in the > cities table for all countries EXCEPT USA in which case they use the iso > code. Both these data sets mix types within one column and I find that > absolutely unacceptable. > > Back to my original problem, which was trying to COPY in some of the > earth-info.nga.mil world city data. This data is tab delimited, no quotes > around fields, newline line terminated and UTF-8 encoded. > > Using a similar COPY statement with the defaults, it fails with this: > > COPY geo.orig_city FROM > '/home/www/geo/DATA/nga.mil/geonames_no_header.txt'; > > ERROR: literal carriage return found in data > HINT: Use "\r" to represent carriage return. > CONTEXT: COPY orig_city, line 1071850 > > And of course, at that line we find a field that has several lines which > appear (using cat -A) to be terminated with a new line ($). I originally > deleted this line but there are others like it. And the file is 2 Gigs in > size so it isn't acceptable to comb through it. > > I believe this is a new problem because I have a vintage file dated early > 2007 that didn't have this problem. Does anyone know how to solve this > COPY issue? > Pardon me on this, the cat -A report for the failed line (and subsequent lines) shows ^M$ within the field, not just $. I assume that is probably a \r\n and postgres wants \r for field data and \n to end a line. I've tried working this over with sed but can't get the syntax right. I also have iconv installed if that would help any. Are there any good tools that will tell me what this really is instead of just ^M$ ?
Chuck D. wrote: > Pardon me on this, the cat -A report for the failed line (and subsequent > lines) shows ^M$ within the field, not just $. > > I assume that is probably a \r\n and postgres wants \r for field data and \n > to end a line. > > I've tried working this over with sed but can't get the syntax right. I also > have iconv installed if that would help any. Are there any good tools that > will tell me what this really is instead of just ^M$ ? If all you just want to do is strip out the ^M, you can run dos2unix on it, assuming that you are running a *nix distro.
On October 25, 2007 10:57:49 am you wrote: > > If all you just want to do is strip out the ^M, you can run dos2unix on > it, assuming that you are running a *nix distro. Well, I guess I could strip the ^M but I'm still left with a $ in the middle of a field which in the same as the line terminator, so COPY thinks it is at the end of a line when it is really in the middle of the field. I really wish they would have quoted these fields, but I'm at a loss how to import these.
> On October 25, 2007 10:57:49 am you wrote: > > > > If all you just want to do is strip out the ^M, you can run dos2unix on > > it, assuming that you are running a *nix distro. > > Well, I guess I could strip the ^M but I'm still left with a $ in the > middle > of a field which in the same as the line terminator, so COPY thinks it is > at > the end of a line when it is really in the middle of the field. I really > wish they would have quoted these fields, but I'm at a loss how to import > these. > As I understand it when a line starts with $ you would like to merge it with the previous line. I suppose you have a file like this: --- test.txt --- this is $field1, and this is $field2 I'll create the test file: $ printf "this is \n\$field1, and this is \n\$field2\n" > test.txt (I assume ^M have already been replaced so \n are used instead) A short C program should do it: /*------ code listing -----*/ #include <stdio.h> #include <stdlib.h> #define NL '\n' #define FILTER '$' int main(int argc, char *argv[]) { FILE *fp; char c; if (argc < 2) fp=stdin; else { fp=fopen(argv[1], "r"); if (!fp) { perror(argv[1]); exit(1); } } c=fgetc(fp); while(!feof(fp)) { if(c==NL) { c=fgetc(fp); if(feof(fp)) { putchar(NL); break; } } if(c!=FILTER) putchar(c); c=fgetc(fp); } exit (0); } /*------------------*/ compile as: $ gcc -o test test.c Execute as: $ test test.txt this is field1, and this is field2 Could this be of help? Regards, Fernando.
On October 25, 2007 03:16:59 pm Fernando Hevia wrote: > > As I understand it when a line starts with $ you would like to merge it > with the previous line. > No, it appears the data file I am attempting to COPY has some records with fields that contain a CR/LF in the data of that field. Postgres COPY fails like this: ERROR: literal carriage return found in data HINT: Use "\r" to represent carriage return. CONTEXT: COPY orig_city_world, line 1071850 I tried this, which I found on the web from Tom Lane: sed 's/^M/\\r/g' geonames.txt > geonames_fixed.txt But still get the same error. I used ctrl-v ctrl-m to reproduce the ^M. Not sure why it is kicking out those lines still.
"Chuck D." <pgsql-list@nullmx.com> writes: > I tried this, which I found on the web from Tom Lane: > sed 's/^M/\\r/g' geonames.txt > geonames_fixed.txt > But still get the same error. I used ctrl-v ctrl-m to reproduce the > ^M. Not sure why it is kicking out those lines still. Did the sed actually do anything? (Hint: the file size of geonames_fixed.txt would be larger than geonames.txt if it did. Or you could diff the two files to confirm that something sensible happened.) I suspect that your shell may be fouling things up here. You may need to prepare this command as a one-line shell script, using an editor that doesn't barf on bare carriage returns ... regards, tom lane
On October 25, 2007 09:22:10 pm you wrote: > Did the sed actually do anything? (Hint: the file size of > geonames_fixed.txt would be larger than geonames.txt if it did. > Or you could diff the two files to confirm that something sensible > happened.) > > I suspect that your shell may be fouling things up here. You may need > to prepare this command as a one-line shell script, using an editor that > doesn't barf on bare carriage returns ... > > regards, tom lane OK, here's a bit more info after testing the sed on a bash line and in a /bin/sh script edited with vi. One of the troubling rows from the original file looks like this. This is a couple fields, not the whole row. cat -A short.txt ^IJishishan Bonanzu Dongxiangzu Salarzu Zizhixian^M$ Jishishan Bonanzu Dongxiangzu Salarzu Zizhixian^M$ Jishishan Bonanzu Dongxiangzu Salarzu Zizhixian^M$ ^I2007-07-06$ Here's the shell script: /bin/sed 's/^M/\\r/' < short.txt > short.out And the result: ^IJishishan Bonanzu Dongxiangzu Salarzu Zizhixian\r$ Jishishan Bonanzu Dongxiangzu Salarzu Zizhixian\r$ Jishishan Bonanzu Dongxiangzu Salarzu Zizhixian\r$ ^I2007-07-06$ Of course it sees the \r$ as the end of line and fails with: ERROR: missing data for column "full_name" Does that help diagnose?
On October 25, 2007 09:22:10 pm Tom Lane wrote: > > Did the sed actually do anything? (Hint: the file size of > geonames_fixed.txt would be larger than geonames.txt if it did. > Or you could diff the two files to confirm that something sensible > happened.) > > I suspect that your shell may be fouling things up here. You may need > to prepare this command as a one-line shell script, using an editor that > doesn't barf on bare carriage returns ... Can I ask this, what is the objective here? I have a field with ^M$ in it (shows as CR/LF in text editor). The objective should be to replace the CR/LF with just a CR shouldn't it? Or is the objective to replace the CR/LF with a literal \r? I'm a bit confused by the output of COPY that says to use "\r". Looking at other data in my database (user message) I'd have to say it needs to be replaced with a literal \r correct? If that is the case, the sed script doesn't work because it is only replacing the CR and still leaves the LF which terminates a line. I found a little perl to do the job of replacing the CR/LF with a literal \r perl -p -e 's/\r\n/\\r/g' < cities.txt > cities_fixed.txt