Home > mailing lists

invalid byte sequence on restore - Mailing list pgsql-general

From	Markus Wollny
Subject	invalid byte sequence on restore
Date	November 14, 2008 08:58:13
Msg-id	28011CD60FB1724DBA4442E38277F6260AFCD334@hermes.computec.de Whole thread Raw
Responses	Re: invalid byte sequence on restore
List	pgsql-general

Tree view

Hi!

I am currently struggling with a couple oif tainted bytes in one of our PostgreSQL 8.2 databases which I plan to move
to8.3 soon - so I need to dump & restore.

I think this problem bit me almost every single time during a major upgrade in the last couple of years, so I must say
thatI have become somewhat used to the procedure. Before now I have always used the plain text format dump, which I fed
throughiconv in order to correct encoding errors. This time I also had to convert from the 8.2 contrib-tsearch2 to 8.3
core-tsearch2,using the helpful instructions found at
http://sql-info.de/postgresql/notes/converting-tsearch2-to-8.3.html

Unfortunately this involves using the custom dump format and pg_restore. Using iconv on a custom dump is most probably
notsuch a good idea :)

On restoring, I received errors like the following:

pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence for encoding "UTF8": 0x80
pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence for encoding "UTF8": 0xcd09

Fortunately I'm restoring to a test server, so the original DB is still available to me - and so I decided to correct
theencoding errors before dumping. I just had to find out what to correct - and that's the tricky bit. pg_restore will
errorout on the first occurrence of an invalid byte sequence for a table and so the table remains empty on restore and
what'smore: Even when you would find this one spot per chance from the little information you've got now and you'd
correctit before dumping again, you'd never know if a similar issue wouldn't bite you a few lines further down in the
dumpon the next attempt. So it's better to sieve through the complete contents of the affected tables before attempting
anotherrestore.

Here's a possible path of actions to resolve the issue:

1. Take a look in the PostgreSQL logfile of the server you restore to in order to determine the tables where the error
occurs.For each affected table, you'll find a couple of lines like the following:
somedb>ERROR: invalid byte sequence for encoding "UTF8": 0x80
somedb>HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which
iscontrolled by "client_encoding".
somedb>CONTEXT: COPY topsearchterms, line 8998
somedb>STATEMENT: COPY topsearchterms (searchterm, usercount) FROM stdin;
In this case, a table named topsearchterms is affected. Unfortunately you don't get to know which schema, but that
doesn'tmatter right now.

2. Now get a list of all the objects in your custom dump:
# pg_restore --disable-triggers -U postgres -Fc somedb-data.pg -l>somedb_objects.txt

3. In somedb_objects.txt, comment out everything but the lines for the tables where the errors occur; be sure to keep
alltables with matching names and table column definitions in there, no matter which schema.

4. Restore those tables' contents into a file:
# pg_restore --disable-triggers -U postgres -Fc somedb-data.pg -L somedb_objects.txt>broken_tables.txt

5. Now filter the tables' contents through iconv
# cat broken_tables.txt | ./iconv-chunks - -c -f utf8 -t utf8 | fixed_tables.txt
As my databases are quite big, I always use this helpful script here:
http://maurice.aubrey.googlepages.com/iconv-chunks.txt- this is feeding the input in chunks to iconv, thus avoiding
memoryexhaustion.

6. Now you can simply use diff to find the affected tuples:
# diff broken_tables.txt fixed_tables.txt

7. Even when diff output is not enough in itself, it will give you the line numbers, where the error occurs. So fire up
yourfavorite editor and examine these lines in broken_tables.txt.

8. Update your affected tables in your original database.

9. Dump & reload again - this time it'll hoepfully run smoothly :)

I hope that this may help somebody facing the same problem. I'd also welcome any suggestions on how to improve on this
procedure.

Kind regards

Markus

Computec Media AG
Sitz der Gesellschaft und Registergericht: Fürth (HRB 8818)
Vorstandsmitglieder: Johannes S. Gözalan (Vorsitzender) und Rainer Rosenbusch
Vorsitzender des Aufsichtsrates: Jürg Marquard
Umsatzsteuer-Identifikationsnummer: DE 812 575 276

pgsql-general by date:

From: Christiaan Willemsen
Date: 14 November 2008, 08:57:01
Subject: FreeBSD 7 needing to allocate lots of shared memory

From: Martijn van Oosterhout
Date: 14 November 2008, 09:03:32
Subject: Re: invalid byte sequence on restore

invalid byte sequence on restore - Mailing list pgsql-general

Previous

Next