Thread: Re: [GENERAL] tsearch2 in postgresql 8.3.1 - invalid byte sequence for encoding "UTF8": 0xc3
Re: [GENERAL] tsearch2 in postgresql 8.3.1 - invalid byte sequence for encoding "UTF8": 0xc3
From
Tom Lane
Date:
Richard Huxton <dev@archonet.com> writes:
> Missed the mailing list on the last reply
>> patrick wrote:
>>> thoses queries are not working, same message:
>>> ERROR: invalid byte sequence for encoding "UTF8": 0xc3
>>>
>>> what i found is in postgresql.conf if i change:
>>> default_text_search_config from pg_catalog.french to
>>> pg_catalog.english then the query is working fine.

I am just about convinced the problem is with french.stop.

There is more to that error message than meets the eye: 0xc3 is a valid
first byte for a two-byte UTF8 character, so the only way that the
message would look just like that is if 0xc3 is the last byte in the
presented string. Looking at french.stop, the only plausible place for
this to happen is the line

à

(that's \303\240 or 0xc3 0xa0). I am thinking that something decided
the \240 was junk and removed it.

I wonder whether the dictionaries ought not be reading their data files
in binary mode. They appear to all be using AllocateFile(filename, "r")
which means that we're at the mercy of whatever text-mode conversion
Windows feels like doing.

regards, tom lane
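The byte-level reasoning above can be reproduced outside PostgreSQL. The following is only a minimal Python sketch, not the server's own validation code; the helper name try_decode is made up for illustration. It decodes the complete two-byte sequence for "à" and then the same sequence with the trailing 0xa0 removed.

```python
# Illustrative only: shows why a lone 0xc3 byte is rejected as UTF-8.
# try_decode is a hypothetical helper, not part of PostgreSQL.
A_GRAVE = b"\xc3\xa0"  # the two bytes of "à" (octal \303\240)


def try_decode(raw: bytes) -> str:
    """Return the decoded text, or a short description of the failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"invalid at byte offset {exc.start}: 0x{raw[exc.start]:02x}"


complete = try_decode(A_GRAVE)       # both bytes present
truncated = try_decode(A_GRAVE[:1])  # 0xa0 lost, only 0xc3 remains

print(complete)
print(truncated)
```

With both bytes present the string decodes to "à"; with only the lead byte left, the decoder reports 0xc3 as the offending byte, which matches the shape of the error in the original report.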
Re: [GENERAL] tsearch2 in postgresql 8.3.1 - invalid byte sequence for encoding "UTF8": 0xc3
From
Martijn van Oosterhout
Date:
On Wed, Mar 19, 2008 at 07:55:40PM -0400, Tom Lane wrote:
> (that's \303\240 or 0xc3 0xa0). I am thinking that something decided
> the \240 was junk and removed it.

Hmm, it is coincidently the space character +0x80, which is defined as
a non-breaking space in many Latin encodings. Perhaps ctype decided it
was a space, or sscanf didn't read it...

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
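The "ctype decided it was a space" theory can be simulated as follows. This is a hedged Python sketch, not the actual PostgreSQL or Windows C runtime code path: it assumes a whitespace-trimming step that interprets the bytes as Latin-1, where 0xa0 is the non-breaking space. The function name strip_as_latin1 is hypothetical.

```python
# Simulation of the hypothesis only; the real culprit was not confirmed
# in this thread. strip_as_latin1 is a made-up name for illustration.
raw_line = b"\xc3\xa0\n"  # the "à" line of a UTF-8 stop-word file


def strip_as_latin1(line: bytes) -> bytes:
    """Trim whitespace as a Latin-1-minded routine might.

    Decoded as Latin-1, 0xa0 becomes U+00A0 (non-breaking space), which
    Python's str.strip() treats as whitespace and removes.
    """
    text = line.decode("latin-1")
    return text.strip().encode("latin-1")


ascii_only = raw_line.strip()        # bytes.strip() removes ASCII whitespace only
stripped = strip_as_latin1(raw_line)  # also drops the 0xa0

print(ascii_only)
print(stripped)
```

Trimming only ASCII whitespace keeps both bytes, while the Latin-1-aware trim leaves a lone 0xc3, which is no longer valid UTF-8.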
Re: [GENERAL] tsearch2 in postgresql 8.3.1 - invalid byte sequence for encoding "UTF8": 0xc3
From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Wed, Mar 19, 2008 at 07:55:40PM -0400, Tom Lane wrote:
>> (that's \303\240 or 0xc3 0xa0). I am thinking that something decided
>> the \240 was junk and removed it.

> Hmm, it is coincidently the space character +0x80, which is defined as
> a non-breaking space in many Latin encodings.

Yeah, that's what I'm thinking about. I poked around in Microsoft's
documentation and couldn't find any suggestion that fgets() would remove
such a character, however.

Another possible theory is that the french.stop file got edited using
something that had the wrong idea about the file's encoding, and
proceeded to throw away the nbsp.

regards, tom lane
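One way to test the "file got edited" theory is to inspect the installed stop-word file byte for byte. The following Python sketch is an assumption-laden illustration, not a PostgreSQL tool: the function name find_invalid_utf8_lines is made up, and the path to french.stop depends on the installation. It opens the file in binary mode, so no text-mode conversion is applied, and reports every line that is not valid UTF-8.

```python
# Hypothetical diagnostic helper; not part of PostgreSQL.
def find_invalid_utf8_lines(path):
    """Return (line_number, raw_bytes) for each line that is not valid UTF-8."""
    bad = []
    # "rb" avoids any platform text-mode translation of the bytes.
    with open(path, "rb") as fh:
        for lineno, raw in enumerate(fh, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad.append((lineno, raw.rstrip(b"\r\n")))
    return bad
```

If the file on disk were damaged, a call such as find_invalid_utf8_lines("french.stop") would list the "à" line with only the 0xc3 byte left; an empty result would instead point back at how the file is read.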