Thread: Invalid Character Data Problem
I am having some trouble pulling some data out from the database. I have tried against both 7.4.5, 8.0b4 and 8.0b5 and get the same result. I have used both the v308 and CVS HEAD of the JDBC driver. When I SELECT from a certain table, I see this JDBC exception: "Invalid character data was found. This is most likely caused by stored data containing characters that are invalid for the character set the database was created in. The most common example of this is storing 8bit data in a SQL_ASCII database." The database is indeed of type SQL_ASCII. The table stores mailing list data and has about 400,000 rows. Looking at the data via psql, I see that some of the rows have strange characters in them, such as question marks where I would not expect them, etc... What are my options? Is there a way to identify the 'bad' records, or the ones causing trouble? Thanks, Hunter
On Fri, 26 Nov 2004, Hunter Hillegas wrote: > When I SELECT from a certain table, I see this JDBC exception: > > "Invalid character data was found. This is most likely caused by stored > data containing characters that are invalid for the character set the > database was created in. The most common example of this is storing 8bit > data in a SQL_ASCII database." > > The database is indeed of type SQL_ASCII. The table stores mailing list data > and has about 400,000 rows. > > Looking at the data via psql, I see that some of the rows have strange > characters in them, such as question marks where I would not expect them, > etc... > > What are my options? Is there a way to identify the 'bad' records, or the > ones causing trouble? > To really solve this problem you need to have a correctly encoded database. This will involve a dump and restore process and possibly recoding your data. This is straightforward if you know what encoding your data is, although it will cause some downtime. To detect the bad data you can try various SELECTs with the JDBC driver and see what errors out. The function below will determine if a particular field has data with the high bit set which is something the database really doesn't know what to do with. SELECT pkcolumn, hashighbit(columna), hashighbit(columnb) FROM mytable; Kris Jurka CREATE OR REPLACE FUNCTION hashighbit(text) RETURNS boolean AS ' DECLARE i int; BEGIN i := LENGTH($1); WHILE i > 0 LOOP IF ascii(substring($1, i, 1)) >= 128 THEN RETURN true; END IF; i := i-1; END LOOP; RETURN false; END; ' LANGUAGE 'plpgsql';
I got this going with a dump/reload. Beware however, 8.0b5 and the JDBC driver don't seem to work with UNICODE encoding for the database. You have to use b4 or HEAD. Hunter > From: Kris Jurka <books@ejurka.com> > Date: Tue, 30 Nov 2004 01:34:46 -0500 (EST) > To: Hunter Hillegas <lists@lastonepicked.com> > Cc: PostgreSQL <pgsql-general@postgresql.org>, Postgre JDBC List > <pgsql-jdbc@postgresql.org> > Subject: Re: [JDBC] Invalid Character Data Problem > > > > On Fri, 26 Nov 2004, Hunter Hillegas wrote: > >> When I SELECT from a certain table, I see this JDBC exception: >> >> "Invalid character data was found. This is most likely caused by stored >> data containing characters that are invalid for the character set the >> database was created in. The most common example of this is storing 8bit >> data in a SQL_ASCII database." >> >> The database is indeed of type SQL_ASCII. The table stores mailing list data >> and has about 400,000 rows. >> >> Looking at the data via psql, I see that some of the rows have strange >> characters in them, such as question marks where I would not expect them, >> etc... >> >> What are my options? Is there a way to identify the 'bad' records, or the >> ones causing trouble? >> > > To really solve this problem you need to have a correctly encoded > database. This will involve a dump and restore process and possibly > recoding your data. This is straightforward if you know what > encoding your data is, although it will cause some downtime. > > To detect the bad data you can try various SELECTs with the JDBC driver > and see what errors out. The function below will determine if a > particular field has data with the high bit set which is something the > database really doesn't know what to do with. > > SELECT pkcolumn, hashighbit(columna), hashighbit(columnb) FROM mytable; > > Kris Jurka > > CREATE OR REPLACE FUNCTION hashighbit(text) RETURNS boolean AS ' > DECLARE > i int; > BEGIN > i := LENGTH($1); > WHILE i > 0 LOOP > IF ascii(substring($1, i, 1)) >= 128 THEN > RETURN true; > END IF; > i := i-1; > END LOOP; > RETURN false; > END; > ' LANGUAGE 'plpgsql'; > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org