Re: Trouble with UTF-8 data - Mailing list pgsql-general

From Tom Lane
Subject Re: Trouble with UTF-8 data
Date
Msg-id 16915.1200613130@sss.pgh.pa.us
In response to Trouble with UTF-8 data  (Janine Sisk <janine@furfly.net>)
Responses Re: Trouble with UTF-8 data
List pgsql-general
Janine Sisk <janine@furfly.net> writes:
> But I'm still getting this error when loading the data into the new
> database:

> ERROR:  invalid byte sequence for encoding "UTF8": 0xeda7a1

The reason PG doesn't like this sequence is that it decodes to U+D9E1,
a Unicode surrogate code point (one half of a "surrogate pair"), which is
not supposed to ever appear in UTF-8 representation --- surrogate pairs are
a kluge for UTF-16 to deal with Unicode code points of more than 16 bits.  See

http://en.wikipedia.org/wiki/UTF-16
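
To see why the three bytes in the error message are rejected, the sequence
can be decoded by hand.  The following is only a rough sketch in Python 3;
the helper name decode_three_byte is made up for illustration and is not
anything PostgreSQL or iconv provides.

```python
# Bytes from the error message: 0xeda7a1
raw = bytes([0xED, 0xA7, 0xA1])


def decode_three_byte(seq: bytes) -> int:
    """Decode a 3-byte UTF-8-style sequence (1110xxxx 10xxxxxx 10xxxxxx)
    into a code point, without checking whether the result is legal."""
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = decode_three_byte(raw)

# Surrogates occupy U+D800..U+DFFF; the high (leading) half is U+D800..U+DBFF.
is_surrogate = 0xD800 <= code_point <= 0xDFFF
is_high_surrogate = 0xD800 <= code_point <= 0xDBFF

print(f"U+{code_point:04X} surrogate={is_surrogate} high={is_high_surrogate}")

# A strict UTF-8 decoder refuses the same bytes.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```

The bit pattern is a well-formed 3-byte sequence, but the value it carries
falls in the range reserved for UTF-16 surrogates, so a strict decoder
treats it as invalid.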

I think you need a version of iconv that knows how to fold surrogate
pairs into proper UTF-8 form.  It might also be that the data is
outright broken --- if this sequence isn't followed by another
surrogate-pair sequence then it isn't valid Unicode by anybody's
interpretation.
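
If no suitable iconv is at hand, the folding step might be sketched in
Python 3 along these lines.  This is an assumption-laden illustration, not
a tested migration tool: the function name fold_surrogates is hypothetical,
and it assumes the input is otherwise valid UTF-8 in which supplementary
characters were written as two 3-byte surrogate sequences.

```python
def fold_surrogates(data: bytes) -> bytes:
    """Rewrite surrogate pairs encoded as two 3-byte sequences into
    proper 4-byte UTF-8.  Raises UnicodeDecodeError if a surrogate is
    not part of a valid pair (i.e. the data is outright broken)."""
    # "surrogatepass" lets the decoder accept the encoded surrogates
    # instead of rejecting them as a strict UTF-8 decoder would.
    text = data.decode("utf-8", "surrogatepass")
    # Round-trip through UTF-16: the strict UTF-16 decoder combines each
    # high/low pair into one code point and rejects unpaired surrogates.
    utf16 = text.encode("utf-16-le", "surrogatepass")
    combined = utf16.decode("utf-16-le")
    return combined.encode("utf-8")


# U+1F600 written as a surrogate pair (D83D DE00), each half in 3 bytes.
paired = b"\xed\xa0\xbd\xed\xb8\x80"
folded = fold_surrogates(paired)

# The sequence from the error message on its own: a lone high surrogate.
try:
    fold_surrogates(b"\xed\xa7\xa1")
    lone_ok = True
except UnicodeDecodeError:
    lone_ok = False
```

A lone surrogate such as the one in the error message makes the function
raise rather than guess, which matches the second case above: without a
following low surrogate there is nothing to fold it into.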

PostgreSQL 7.2.x unfortunately didn't check Unicode data carefully, and
would have let this data pass without comment ...

            regards, tom lane
