BUG #13785: Postgresql encoding screw-up - Mailing list pgsql-bugs
From: ntpt@seznam.cz
Subject: BUG #13785: Postgresql encoding screw-up
Date:
Msg-id: 20151126121558.2760.44255@wrigleys.postgresql.org
Responses: Re: BUG #13785: Postgresql encoding screw-up
           Re: BUG #13785: Postgresql encoding screw-up
List: pgsql-bugs
The following bug has been logged on the website:

Bug reference:      13785
Logged by:          ntpt
Email address:      ntpt@seznam.cz
PostgreSQL version: 9.4.5
Operating system:   linux
Description:

Hi all,

I have used PostgreSQL for several years, but recently, after a hardware upgrade, I ran into a problem:

pg_exec(): Query failed: ERROR: character with byte sequence 0xc2 0x96 in encoding "UTF8" has no equivalent in encoding "WIN1250"

This is strange. The sequence of events was:

1. There was a database with LATIN2 encoding.
2. An application connected to this database with "set client_encoding to win1250" and manipulated data.
3. The database was dumped with pg_dump -E UTF8.
4. The dump was restored with pg_restore on another cluster, into a database with UTF8 encoding.
5. The application connected to the new database with "set client_encoding to win1250" - and the query failed.

How could invalid characters reach the database in this scenario? See the whole thread here:
http://postgresql.nabble.com/Query-failed-ERROR-character-with-byte-sequence-0xc2-0x96-in-encoding-quot-UTF8-quot-has-no-equivale-td5875048.html

I have examined this situation and I am afraid there is a major design flaw, or a bug, that can screw up a lot of databases. Please look at https://cs.wikipedia.org/wiki/Windows-1250 (in Czech only); the comparison table "Srovnání s ISO 8859-2" (comparison with ISO 8859-2) is self-explanatory.

In a situation where the database has LATIN2 encoding but the client sets client_encoding to win1250 and manipulates data: because some characters in cp1250 (win1250) are not present in the LATIN2 encoding, Postgres stores the values of those characters "as is", i.e. without charset translation. For example, 0x96 (EN DASH in win1250) is stored as 0x96 (undefined in LATIN2) - at least in version 9.0.

If you read from that database with client_encoding set to win1250, these "as is" characters are returned "as is", and from the point of view of the client application everything works as expected.

But if you need to migrate and transcode the database to UTF8 in the recommended way (pg_dump -E utf8), things go wrong. Because there is no character 0x96 in LATIN2, the transcoder to UTF8 does not know how to treat this character and leaves it "as is", producing the character U+0096 in the output. But if you look at the table "Mapování do Unikódu" (mapping to Unicode) on the wiki page mentioned above, the proper Unicode code point for this character in WIN1250 is U+2013. (And Postgres cannot know that, because it does not know which code page these "as is" characters came from.)

So if you restore the dump into a new UTF8 database, some characters end up as "unicodized" raw values of characters that were not representable in the original database encoding, but were present in the original client encoding. The result is the error described in my post to the mailing list.

So everybody who needs to migrate to UTF8, and whose original database encoding has no representation for some characters used by the client encodings, ends up like this: with a working, but corrupted and non-transferable database.

I think the safe practice would be: pg_dump with -E set to the encoding used by the client application, and then restore into a newly created UTF8 database. This should at least be mentioned in the documentation as the safe way.

But in an environment where clients use multiple encodings that contain more characters than the database encoding, the database is broken forever - it will keep working, but it cannot be repaired and migrated to another encoding.

Correct me if I am wrong.

PS: Excuse my bad English.
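
To illustrate the byte-level mismatch described above, here is a small Python sketch (Python is only used for illustration; it is not part of the PostgreSQL setup). It shows why the "as is" byte 0x96 turns into the U+0096 from the error message instead of an EN DASH:

    raw = b"\x96"                         # the byte the win1250 client actually sent

    as_win1250 = raw.decode("cp1250")     # what the client meant
    as_latin2  = raw.decode("iso8859-2")  # what a LATIN2 -> UTF8 transcode assumes

    print(hex(ord(as_win1250)))           # 0x2013  (EN DASH)
    print(hex(ord(as_latin2)))            # 0x96    (C1 control character U+0096)
    print(as_latin2.encode("utf-8"))      # b'\xc2\x96' - the byte sequence from the error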
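
If a database has already been migrated this way, the damage looks mechanical enough to repair after the fact: every stray character sits in the C1 range U+0080..U+009F, and re-interpreting its code point as a WIN1250 byte gives the intended character. A rough repair sketch, again just Python for illustration, assuming WIN1250 was the only client encoding involved and all stray characters fall in that C1 range:

    def fix_win1250_leakage(text: str) -> str:
        out = []
        for ch in text:
            cp = ord(ch)
            if 0x80 <= cp <= 0x9F:                     # stray C1 control char from the bad dump
                try:
                    ch = bytes([cp]).decode("cp1250")  # e.g. U+0096 -> U+2013 (EN DASH)
                except UnicodeDecodeError:
                    pass                               # byte not defined in WIN1250, leave it alone
            out.append(ch)
        return "".join(out)

    print(fix_win1250_leakage("foo\u0096bar"))         # "foo–bar"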