Thread: Unicode normalization
Has somebody integrated Unicode normalization into Postgres? if not, I would have to implement my own function by using this CPAN module: http://search.cpan.org/~sadahiro/Unicode-Normalize-1.03/ . I need a function which removes all diacritics (1) and transforms some characters to a more compatible form (2) to get a better index on strings. Best, Andi 1) à,ä, ... => a 2) ø => o, ƒ => f, ª => a
On Wed, Sep 16, 2009 at 07:20:21PM +0200, Andreas Kalsch wrote: > Has somebody integrated Unicode normalization into Postgres? if not, I > would have to implement my own function by using this CPAN module: > http://search.cpan.org/~sadahiro/Unicode-Normalize-1.03/ . > > I need a function which removes all diacritics (1) and transforms some > characters to a more compatible form (2) to get a better index on > strings. > > Best, > > Andi > > > 1) à,ä, ... => a > 2) ø => o, ƒ => f, ª => a You mean something like this? http://wiki.postgresql.org/wiki/Strip_accents_from_strings%2C_and_output_in_lowercase Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
No, I need a solution which is as generic as possible. I use UTF-8 encoded unicode strings on all levels. This is what I have done so far: 1) Writing a separate Python command line script for testing - works as expected: #!/usr/bin/python import sys import unicodedata str = sys.argv[1].decode('UTF-8') str = unicodedata.normalize('NFKD', str) str = ''.join(c for c in str if unicodedata.combining(c) == 0) print str 2) Transfering this to PL/Python: CREATE OR REPLACE FUNCTION test (str text) RETURNS text AS $$ import unicodedata return unicodedata.normalize('NFKD', str.decode('UTF-8')) $$ LANGUAGE plpythonu; Problem: plpython throws an error, where my commandline script did it correctly: # select test('aÄÖÜ'); ERROR: plpython: function "test" could not create return value DETAIL: <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\u0308' in position 2: ordinal not in range(128) I use PG 8.3 and Python 2.5.2. How can I make plpython behaving like in a normal python environment? In the end it should look like this: CREATE TABLE t ( ... ts ts_vector NOT NULL ); INSERT INTO t (ts) VALUES(to_tsvector(normalize(?))); Andi David Fetter schrieb: > On Wed, Sep 16, 2009 at 07:20:21PM +0200, Andreas Kalsch wrote: > >> Has somebody integrated Unicode normalization into Postgres? if not, I >> would have to implement my own function by using this CPAN module: >> http://search.cpan.org/~sadahiro/Unicode-Normalize-1.03/ . >> >> I need a function which removes all diacritics (1) and transforms some >> characters to a more compatible form (2) to get a better index on >> strings. >> >> Best, >> >> Andi >> >> >> 1) à,ä, ... => a >> 2) ø => o, ƒ => f, ª => a >> > > You mean something like this? > > http://wiki.postgresql.org/wiki/Strip_accents_from_strings%2C_and_output_in_lowercase > > Cheers, > David. >
Update: The error is of course: The function tries to return "str" instead of unicode. It is not str.decode('UTF-8') which causes the error. Andreas Kalsch schrieb: > No, > > I need a solution which is as generic as possible. I use UTF-8 encoded > unicode strings on all levels. This is what I have done so far: > > > 1) Writing a separate Python command line script for testing - works > as expected: > > #!/usr/bin/python > > import sys > import unicodedata > > str = sys.argv[1].decode('UTF-8') > str = unicodedata.normalize('NFKD', str) > str = ''.join(c for c in str if unicodedata.combining(c) == 0) > print str > > > 2) Transfering this to PL/Python: > > CREATE OR REPLACE FUNCTION test (str text) > RETURNS text > AS $$ > import unicodedata > return unicodedata.normalize('NFKD', str.decode('UTF-8')) > $$ LANGUAGE plpythonu; > > Problem: plpython throws an error, where my commandline script did it > correctly: > > # select test('aÄÖÜ'); > > ERROR: plpython: function "test" could not create return value > DETAIL: <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't > encode character u'\u0308' in position 2: ordinal not in range(128) > > > > I use PG 8.3 and Python 2.5.2. How can I make plpython behaving like > in a normal python environment? > > > In the end it should look like this: > > CREATE TABLE t ( > ... > ts ts_vector NOT NULL > ); > > INSERT INTO t (ts) VALUES(to_tsvector(normalize(?))); > > Andi > > > David Fetter schrieb: >> On Wed, Sep 16, 2009 at 07:20:21PM +0200, Andreas Kalsch wrote: >> >>> Has somebody integrated Unicode normalization into Postgres? if not, >>> I would have to implement my own function by using this CPAN >>> module: http://search.cpan.org/~sadahiro/Unicode-Normalize-1.03/ . >>> >>> I need a function which removes all diacritics (1) and transforms >>> some characters to a more compatible form (2) to get a better index >>> on strings. >>> >>> Best, >>> >>> Andi >>> >>> >>> 1) à,ä, ... => a >>> 2) ø => o, ƒ => f, ª => a >>> >> >> You mean something like this? >> >> http://wiki.postgresql.org/wiki/Strip_accents_from_strings%2C_and_output_in_lowercase >> >> >> Cheers, >> David. >> > >
On Wed, Sep 16, 2009 at 09:35:02PM +0200, Andreas Kalsch wrote: > CREATE OR REPLACE FUNCTION test (str text) > RETURNS text > AS $$ > import unicodedata > return unicodedata.normalize('NFKD', str.decode('UTF-8')) > $$ LANGUAGE plpythonu; I'd guess you want that to be: return unicodedata.normalize('NFKD', str.decode('UTF-8')).encode('UTF-8'); If you're converting from a utf8 encoding, you probably need to go back again! This could certainly be made easier though, PG knows what encoding its strings are stored in, why doesn't it work with unicode strings by default? -- Sam http://samason.me.uk/
On Wed, Sep 16, 2009 at 4:42 PM, Sam Mason <sam@samason.me.uk> wrote: > On Wed, Sep 16, 2009 at 09:35:02PM +0200, Andreas Kalsch wrote: >> CREATE OR REPLACE FUNCTION test (str text) >> RETURNS text >> AS $$ >> import unicodedata >> return unicodedata.normalize('NFKD', str.decode('UTF-8')) >> $$ LANGUAGE plpythonu; > > I'd guess you want that to be: > > return unicodedata.normalize('NFKD', str.decode('UTF-8')).encode('UTF-8'); > > If you're converting from a utf8 encoding, you probably need to go > back again! This could certainly be made easier though, PG knows what > encoding its strings are stored in, why doesn't it work with unicode > strings by default? Isn't it python that's making the mistake here, not pg?
Thank you Sam, this leaded to the correct solution: CREATE OR REPLACE FUNCTION simplify (str text) RETURNS text AS $$ import unicodedata s = unicodedata.normalize('NFKD', str.decode('UTF-8')) s = ''.join(c for c in s if unicodedata.combining(c) == 0) return s.encode('UTF-8') $$ LANGUAGE plpythonu; test=# select simplify('Français va à Paris, () {} [] µ @ º Ångstrøm Phiat-im hû-hō sī phiat tī 1-ê ki-chhó· jī-bó bīn-téng ê hû-hō. Siōng phó·-phiàn ê kong-lêng sī kái-piàn ki-chhó· jī-bó ê hoat-im.'); simplify --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Francais va a Paris, () {} [] μ @ o Angstrøm Phiat-im hu-ho si phiat ti 1-e ki-chho· ji-bo bin-teng e hu-ho. Siong pho·-phian e kong-leng si kai-pian ki-chho· ji-bo e hoat-im. (1 row) One question remains: How is the performance of PL/Python? When there are syntax errors in the Python code, they are not reported on CREATE, because the function seems be recompiled on every call. This leads to the next question: When will the unicode stuff included in the main distribution? Andi Sam Mason schrieb: > On Wed, Sep 16, 2009 at 09:35:02PM +0200, Andreas Kalsch wrote: > >> CREATE OR REPLACE FUNCTION test (str text) >> RETURNS text >> AS $$ >> import unicodedata >> return unicodedata.normalize('NFKD', str.decode('UTF-8')) >> $$ LANGUAGE plpythonu; >> > > I'd guess you want that to be: > > return unicodedata.normalize('NFKD', str.decode('UTF-8')).encode('UTF-8'); > > If you're converting from a utf8 encoding, you probably need to go > back again! This could certainly be made easier though, PG knows what > encoding its strings are stored in, why doesn't it work with unicode > strings by default? > >
Andreas Kalsch wrote: > 2) Transfering this to PL/Python: > > CREATE OR REPLACE FUNCTION test (str text) > RETURNS text > AS $$ > import unicodedata > return unicodedata.normalize('NFKD', str.decode('UTF-8')) > $$ LANGUAGE plpythonu; This is wrong, which is why we published a correct version here: http://wiki.postgresql.org/wiki/Strip_accents_from_strings -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Thu, Sep 17, 2009 at 12:01:57AM -0400, Alvaro Herrera wrote: > http://wiki.postgresql.org/wiki/Strip_accents_from_strings I'm still confused as to why plpython doesn't know the server's encoding already; seems as though all text operations are predicated on knowing this and hence all but the most trivial code has to go out of its way to be correct with respect to this. -- Sam http://samason.me.uk/
My standard encoding is UTF-8 on all levels so I don't need this high-cost call: plpy.execute("select setting from pg_settings where name = 'server_encoding'"); Additionally I want to get the original cases. For this purpose my solution is still fitting to my need. But it is not the one you have cited below, but: CREATE OR REPLACE FUNCTION simplify (str text) RETURNS text AS $$ import unicodedata s = unicodedata.normalize('NFKD', str.decode('UTF-8')) s = ''.join(c for c in s if unicodedata.combining(c) == 0) return s.encode('UTF-8') $$ LANGUAGE plpythonu; Andi >> 2) Transfering this to PL/Python: >> >> CREATE OR REPLACE FUNCTION test (str text) >> RETURNS text >> AS $$ >> import unicodedata >> return unicodedata.normalize('NFKD', str.decode('UTF-8')) >> $$ LANGUAGE plpythonu;