Thread: invalid byte sequence for encoding "UTF8"
Hi, I am currently trying to set up our new database server; we have upgraded to PostgreSQL 8.1.8. When I try to restore the backup (which is stored as a set of SQL statements that my restore script feeds into psql to execute), it returns the following error:

psql:/mnt/tmp/app/application_data.sql:97425: ERROR: invalid byte sequence for encoding "UTF8": 0xff
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

along with other byte sequences, e.g. 0xa1 and 0xac. The two remaining schemas are roughly 22GB and 66GB in size and are read into Postgres from flat COBOL datafiles. Our data has progressed as shown below:

PostgreSQL 7.?.? - stored in SQL_ASCII (old configuration)
PostgreSQL 8.1.3 - stored in UTF8 (current configuration)
PostgreSQL 8.1.8 - stored in UTF8 (our future configuration)

The encoding on the server was changed from SQL_ASCII to UTF8 after we moved to version 8.1.3, for purposes of globalisation. I've searched the forums and found people with similar problems, but not much on a way to remedy it. I did try using iconv, which was suggested in a thread, but it returned an error saying that even the 22GB file was too large to work on.

Any help would be gratefully appreciated.

Many Thanks
David P
On Wednesday 21 March 2007 04:17, "Fuzzygoth" <dav.phillips@ntlworld.com> wrote:
> I've searched the forums and found people with similar problems but not
> much on a way to remedy it. I did try using iconv which was suggested
> in a thread but it returned an error saying even the 22GB file was too
> large to work on.

iconv needs to read the whole file into RAM. What you can do is use the UNIX split utility to split the dump file into smaller segments, run iconv on each segment, and then cat all the converted segments back together into a new dump file. iconv is, I think, your best option for converting the dump to a valid encoding.

--
"None are more hopelessly enslaved than those who falsely believe they are free." -- Johann W. von Goethe
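[Editor's note: the segment-at-a-time conversion described above can also be done in one small script rather than a split/iconv/cat pipeline. This is a minimal sketch, assuming the old SQL_ASCII dump actually contains LATIN1 (ISO-8859-1) bytes; `convert_dump`, the chunk size, and the source encoding are illustrative, not anything from the thread. Adjust SOURCE_ENCODING to whatever the pre-UTF8 data really was.]

```python
# Chunked re-encoding of a dump file too large for an all-in-RAM iconv.
# Assumption: the problem bytes (0xff, 0xa1, 0xac) are LATIN1 characters.
# LATIN1 is a single-byte encoding, so a chunk boundary can never split
# a character and every byte decodes to some code point (no errors raised).

SOURCE_ENCODING = "latin-1"        # assumed encoding of the old dump
CHUNK_SIZE = 64 * 1024 * 1024      # 64 MiB segments keep memory bounded

def convert_dump(src_path, dst_path):
    """Rewrite src_path (SOURCE_ENCODING bytes) as UTF-8 in dst_path."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:
                break
            dst.write(chunk.decode(SOURCE_ENCODING).encode("utf-8"))
```

If the original data was not LATIN1 (e.g. WIN1252), only SOURCE_ENCODING needs to change; multi-byte source encodings would additionally need an incremental decoder to handle characters split across chunk boundaries.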
On Wed, Mar 21, 2007 at 09:54:41AM -0700, Alan Hodgson wrote:
> iconv needs to read the whole file into RAM. What you can do is use the
> UNIX split utility to split the dump file into smaller segments, use
> iconv on each segment, and then cat all the converted segments back
> together into a new dump file. iconv is I think your best option for
> converting the dump to a valid encoding.

The guys at openstreetmap have written a UTF-8 cleaner that doesn't read the whole file into memory:

http://trac.openstreetmap.org/browser/utils/planet.osm/C

Definitely more convenient for large files.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
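[Editor's note: for readers who cannot build the C tool linked above, a streaming cleaner in the same spirit can be sketched in a few lines of Python. This is not the OSM tool; `clean_utf8` and its parameters are illustrative. It reads fixed-size chunks and silently drops bytes that are not valid UTF-8, so memory use stays constant regardless of file size.]

```python
import codecs

def clean_utf8(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, dropping any invalid UTF-8 byte sequences."""
    # errors="ignore" discards invalid bytes instead of raising;
    # the incremental decoder buffers a partial multi-byte sequence at a
    # chunk boundary, so valid characters are never split and lost.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(decoder.decode(chunk).encode("utf-8"))
        # flush any trailing buffered (necessarily incomplete) sequence
        dst.write(decoder.decode(b"", final=True).encode("utf-8"))
```

Note that, unlike re-encoding from LATIN1, this approach deletes the offending characters rather than converting them, so it is only appropriate when the invalid bytes carry no data worth keeping.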