Thread: Unicode normalization

Unicode normalization

From

Andreas Kalsch

Date:

16 September 2009, 14:52:14

Has somebody integrated Unicode normalization into Postgres? if not, I
would have to implement my own function by using this CPAN module:
http://search.cpan.org/~sadahiro/Unicode-Normalize-1.03/ .

I need a function which removes all diacritics (1) and transforms some
characters to a more compatible form (2) to get a better index on strings.

Best,

Andi


1) à,ä, ... => a
2) ø => o, ƒ => f, ª => a

Re: Unicode normalization

From

David Fetter

Date:

16 September 2009, 16:02:02

On Wed, Sep 16, 2009 at 07:20:21PM +0200, Andreas Kalsch wrote:
> Has somebody integrated Unicode normalization into Postgres? if not, I
> would have to implement my own function by using this CPAN module:
> http://search.cpan.org/~sadahiro/Unicode-Normalize-1.03/ .
>
> I need a function which removes all diacritics (1) and transforms some
> characters to a more compatible form (2) to get a better index on
> strings.
>
> Best,
>
> Andi
>
>
> 1) à,ä, ... => a
> 2) ø => o, ƒ => f, ª => a

You mean something like this?

http://wiki.postgresql.org/wiki/Strip_accents_from_strings%2C_and_output_in_lowercase

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: Unicode normalization

From

Andreas Kalsch

Date:

16 September 2009, 16:35:13

No,

I need a solution which is as generic as possible. I use UTF-8 encoded
unicode strings on all levels. This is what I have done so far:


1) Writing a separate Python command line script for testing - works as
expected:

#!/usr/bin/python

import sys
import unicodedata

str = sys.argv[1].decode('UTF-8')
str = unicodedata.normalize('NFKD', str)
str = ''.join(c for c in str if unicodedata.combining(c) == 0)
print str


2) Transfering this to PL/Python:

CREATE OR REPLACE FUNCTION test (str text)
  RETURNS text
AS $$
    import unicodedata
    return unicodedata.normalize('NFKD', str.decode('UTF-8'))
$$ LANGUAGE plpythonu;

Problem: plpython throws an error, where my commandline script did it
correctly:

# select test('aÄÖÜ');

ERROR:  plpython: function "test" could not create return value
DETAIL:  <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't
encode character u'\u0308' in position 2: ordinal not in range(128)



I use PG 8.3 and Python 2.5.2. How can I make plpython behaving like in
a normal python environment?


In the end it should look like this:

CREATE TABLE t (
...
ts ts_vector NOT NULL
);

INSERT INTO t (ts) VALUES(to_tsvector(normalize(?)));

Andi


David Fetter schrieb:
> On Wed, Sep 16, 2009 at 07:20:21PM +0200, Andreas Kalsch wrote:
>
>> Has somebody integrated Unicode normalization into Postgres? if not, I
>> would have to implement my own function by using this CPAN module:
>> http://search.cpan.org/~sadahiro/Unicode-Normalize-1.03/ .
>>
>> I need a function which removes all diacritics (1) and transforms some
>> characters to a more compatible form (2) to get a better index on
>> strings.
>>
>> Best,
>>
>> Andi
>>
>>
>> 1) à,ä, ... => a
>> 2) ø => o, ƒ => f, ª => a
>>
>
> You mean something like this?
>
> http://wiki.postgresql.org/wiki/Strip_accents_from_strings%2C_and_output_in_lowercase
>
> Cheers,
> David.
>

Re: Unicode normalization

From

Andreas Kalsch

Date:

16 September 2009, 17:37:04

Update: The error is of course: The function tries to return "str"
instead of unicode. It is not str.decode('UTF-8') which causes the error.

Andreas Kalsch schrieb:
> No,
>
> I need a solution which is as generic as possible. I use UTF-8 encoded
> unicode strings on all levels. This is what I have done so far:
>
>
> 1) Writing a separate Python command line script for testing - works
> as expected:
>
> #!/usr/bin/python
>
> import sys
> import unicodedata
>
> str = sys.argv[1].decode('UTF-8')
> str = unicodedata.normalize('NFKD', str)
> str = ''.join(c for c in str if unicodedata.combining(c) == 0)
> print str
>
>
> 2) Transfering this to PL/Python:
>
> CREATE OR REPLACE FUNCTION test (str text)
>  RETURNS text
> AS $$
>    import unicodedata
>    return unicodedata.normalize('NFKD', str.decode('UTF-8'))
> $$ LANGUAGE plpythonu;
>
> Problem: plpython throws an error, where my commandline script did it
> correctly:
>
> # select test('aÄÖÜ');
>
> ERROR:  plpython: function "test" could not create return value
> DETAIL:  <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't
> encode character u'\u0308' in position 2: ordinal not in range(128)
>
>
>
> I use PG 8.3 and Python 2.5.2. How can I make plpython behaving like
> in a normal python environment?
>
>
> In the end it should look like this:
>
> CREATE TABLE t (
> ...
> ts ts_vector NOT NULL
> );
>
> INSERT INTO t (ts) VALUES(to_tsvector(normalize(?)));
>
> Andi
>
>
> David Fetter schrieb:
>> On Wed, Sep 16, 2009 at 07:20:21PM +0200, Andreas Kalsch wrote:
>>
>>> Has somebody integrated Unicode normalization into Postgres? if not,
>>> I  would have to implement my own function by using this CPAN
>>> module:  http://search.cpan.org/~sadahiro/Unicode-Normalize-1.03/ .
>>>
>>> I need a function which removes all diacritics (1) and transforms
>>> some  characters to a more compatible form (2) to get a better index
>>> on strings.
>>>
>>> Best,
>>>
>>> Andi
>>>
>>>
>>> 1) à,ä, ... => a
>>> 2) ø => o, ƒ => f, ª => a
>>>
>>
>> You mean something like this?
>>
>> http://wiki.postgresql.org/wiki/Strip_accents_from_strings%2C_and_output_in_lowercase
>>
>>
>> Cheers,
>> David.
>>
>
>

Re: Unicode normalization

From

Sam Mason

Date:

16 September 2009, 19:43:09

On Wed, Sep 16, 2009 at 09:35:02PM +0200, Andreas Kalsch wrote:
> CREATE OR REPLACE FUNCTION test (str text)
>  RETURNS text
> AS $$
>    import unicodedata
>    return unicodedata.normalize('NFKD', str.decode('UTF-8'))
> $$ LANGUAGE plpythonu;

I'd guess you want that to be:

  return unicodedata.normalize('NFKD', str.decode('UTF-8')).encode('UTF-8');

If you're converting from a utf8 encoding, you probably need to go
back again!  This could certainly be made easier though, PG knows what
encoding its strings are stored in, why doesn't it work with unicode
strings by default?

--
  Sam  http://samason.me.uk/

Re: Unicode normalization

From

Scott Marlowe

Date:

16 September 2009, 20:01:51

On Wed, Sep 16, 2009 at 4:42 PM, Sam Mason <sam@samason.me.uk> wrote:
> On Wed, Sep 16, 2009 at 09:35:02PM +0200, Andreas Kalsch wrote:
>> CREATE OR REPLACE FUNCTION test (str text)
>>  RETURNS text
>> AS $$
>>    import unicodedata
>>    return unicodedata.normalize('NFKD', str.decode('UTF-8'))
>> $$ LANGUAGE plpythonu;
>
> I'd guess you want that to be:
>
>  return unicodedata.normalize('NFKD', str.decode('UTF-8')).encode('UTF-8');
>
> If you're converting from a utf8 encoding, you probably need to go
> back again!  This could certainly be made easier though, PG knows what
> encoding its strings are stored in, why doesn't it work with unicode
> strings by default?

Isn't it python that's making the mistake here, not pg?

How to simplify unicode strings

From

Andreas Kalsch

Date:

16 September 2009, 20:38:02

Thank you Sam,

this leaded to the correct solution:


CREATE OR REPLACE FUNCTION simplify (str text)
RETURNS text
AS $$
import unicodedata

s = unicodedata.normalize('NFKD', str.decode('UTF-8'))
s = ''.join(c for c in s if unicodedata.combining(c) == 0)
return s.encode('UTF-8')
$$ LANGUAGE plpythonu;


test=# select simplify('Français va à Paris, () {} [] µ @ º Ångstrøm
Phiat-im hû-hō sī phiat tī 1-ê ki-chhó· jī-bó bīn-téng ê hû-hō. Siōng
phó·-phiàn ê kong-lêng sī kái-piàn ki-chhó· jī-bó ê hoat-im.');
simplify

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Francais va a Paris, () {} [] μ @ o Angstrøm Phiat-im hu-ho si phiat ti
1-e ki-chho· ji-bo bin-teng e hu-ho. Siong pho·-phian e kong-leng si
kai-pian ki-chho· ji-bo e hoat-im.
(1 row)


One question remains: How is the performance of PL/Python?
When there are syntax errors in the Python code, they are not reported
on CREATE, because the function seems be recompiled on every call.

This leads to the next question: When will the unicode stuff included in
the main distribution?

Andi



Sam Mason schrieb:
> On Wed, Sep 16, 2009 at 09:35:02PM +0200, Andreas Kalsch wrote:
>
>> CREATE OR REPLACE FUNCTION test (str text)
>>  RETURNS text
>> AS $$
>>    import unicodedata
>>    return unicodedata.normalize('NFKD', str.decode('UTF-8'))
>> $$ LANGUAGE plpythonu;
>>
>
> I'd guess you want that to be:
>
>   return unicodedata.normalize('NFKD', str.decode('UTF-8')).encode('UTF-8');
>
> If you're converting from a utf8 encoding, you probably need to go
> back again!  This could certainly be made easier though, PG knows what
> encoding its strings are stored in, why doesn't it work with unicode
> strings by default?
>
>

Re: Unicode normalization

From

Alvaro Herrera

Date:

17 September 2009, 01:02:20

Andreas Kalsch wrote:

> 2) Transfering this to PL/Python:
>
> CREATE OR REPLACE FUNCTION test (str text)
>  RETURNS text
> AS $$
>    import unicodedata
>    return unicodedata.normalize('NFKD', str.decode('UTF-8'))
> $$ LANGUAGE plpythonu;

This is wrong, which is why we published a correct version here:

http://wiki.postgresql.org/wiki/Strip_accents_from_strings

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Unicode normalization

From

Sam Mason

Date:

17 September 2009, 07:14:34

On Thu, Sep 17, 2009 at 12:01:57AM -0400, Alvaro Herrera wrote:
> http://wiki.postgresql.org/wiki/Strip_accents_from_strings

I'm still confused as to why plpython doesn't know the server's encoding
already; seems as though all text operations are predicated on knowing
this and hence all but the most trivial code has to go out of its way to
be correct with respect to this.

--
  Sam  http://samason.me.uk/

Re: Unicode normalization

From

Andreas Kalsch

Date:

17 September 2009, 09:58:54

My standard encoding is UTF-8 on all levels so I don't need this
high-cost call:

plpy.execute("select setting from pg_settings where name =
'server_encoding'");

Additionally I want to get the original cases.

For this purpose my solution is still fitting to my need. But it is not
the one you have cited below, but:

CREATE OR REPLACE FUNCTION simplify (str text)
RETURNS text
AS $$
import unicodedata

s = unicodedata.normalize('NFKD', str.decode('UTF-8'))
s = ''.join(c for c in s if unicodedata.combining(c) == 0)
return s.encode('UTF-8')
$$ LANGUAGE plpythonu;

Andi

>> 2) Transfering this to PL/Python:
>>
>> CREATE OR REPLACE FUNCTION test (str text)
>>  RETURNS text
>> AS $$
>>    import unicodedata
>>    return unicodedata.normalize('NFKD', str.decode('UTF-8'))
>> $$ LANGUAGE plpythonu;