invalid byte sequence on restore - Mailing list pgsql-general
From | Markus Wollny |
---|---|
Subject | invalid byte sequence on restore |
Date | |
Msg-id | 28011CD60FB1724DBA4442E38277F6260AFCD334@hermes.computec.de Whole thread Raw |
Responses |
Re: invalid byte sequence on restore
|
List | pgsql-general |
Hi! I am currently struggling with a couple oif tainted bytes in one of our PostgreSQL 8.2 databases which I plan to move to8.3 soon - so I need to dump & restore. I think this problem bit me almost every single time during a major upgrade in the last couple of years, so I must say thatI have become somewhat used to the procedure. Before now I have always used the plain text format dump, which I fed throughiconv in order to correct encoding errors. This time I also had to convert from the 8.2 contrib-tsearch2 to 8.3 core-tsearch2,using the helpful instructions found at http://sql-info.de/postgresql/notes/converting-tsearch2-to-8.3.html Unfortunately this involves using the custom dump format and pg_restore. Using iconv on a custom dump is most probably notsuch a good idea :) On restoring, I received errors like the following: pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence for encoding "UTF8": 0x80 pg_restore: [archiver (db)] COPY failed: ERROR: invalid byte sequence for encoding "UTF8": 0xcd09 Fortunately I'm restoring to a test server, so the original DB is still available to me - and so I decided to correct theencoding errors before dumping. I just had to find out what to correct - and that's the tricky bit. pg_restore will errorout on the first occurrence of an invalid byte sequence for a table and so the table remains empty on restore and what'smore: Even when you would find this one spot per chance from the little information you've got now and you'd correctit before dumping again, you'd never know if a similar issue wouldn't bite you a few lines further down in the dumpon the next attempt. So it's better to sieve through the complete contents of the affected tables before attempting anotherrestore. Here's a possible path of actions to resolve the issue: 1. Take a look in the PostgreSQL logfile of the server you restore to in order to determine the tables where the error occurs.For each affected table, you'll find a couple of lines like the following: somedb>ERROR: invalid byte sequence for encoding "UTF8": 0x80 somedb>HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which iscontrolled by "client_encoding". somedb>CONTEXT: COPY topsearchterms, line 8998 somedb>STATEMENT: COPY topsearchterms (searchterm, usercount) FROM stdin; In this case, a table named topsearchterms is affected. Unfortunately you don't get to know which schema, but that doesn'tmatter right now. 2. Now get a list of all the objects in your custom dump: # pg_restore --disable-triggers -U postgres -Fc somedb-data.pg -l>somedb_objects.txt 3. In somedb_objects.txt, comment out everything but the lines for the tables where the errors occur; be sure to keep alltables with matching names and table column definitions in there, no matter which schema. 4. Restore those tables' contents into a file: # pg_restore --disable-triggers -U postgres -Fc somedb-data.pg -L somedb_objects.txt>broken_tables.txt 5. Now filter the tables' contents through iconv # cat broken_tables.txt | ./iconv-chunks - -c -f utf8 -t utf8 | fixed_tables.txt As my databases are quite big, I always use this helpful script here: http://maurice.aubrey.googlepages.com/iconv-chunks.txt- this is feeding the input in chunks to iconv, thus avoiding memoryexhaustion. 6. Now you can simply use diff to find the affected tuples: # diff broken_tables.txt fixed_tables.txt 7. Even when diff output is not enough in itself, it will give you the line numbers, where the error occurs. So fire up yourfavorite editor and examine these lines in broken_tables.txt. 8. Update your affected tables in your original database. 9. Dump & reload again - this time it'll hoepfully run smoothly :) I hope that this may help somebody facing the same problem. I'd also welcome any suggestions on how to improve on this procedure. Kind regards Markus Computec Media AG Sitz der Gesellschaft und Registergericht: Fürth (HRB 8818) Vorstandsmitglieder: Johannes S. Gözalan (Vorsitzender) und Rainer Rosenbusch Vorsitzender des Aufsichtsrates: Jürg Marquard Umsatzsteuer-Identifikationsnummer: DE 812 575 276
pgsql-general by date: