Re: [SQL] Internationalisation: SELECT str (ignoring Umlauts/Accents) - Mailing list pgsql-sql
From | Patrice Hédé |
---|---|
Subject | Re: [SQL] Internationalisation: SELECT str (ignoring Umlauts/Accents) |
Date | |
Msg-id | Pine.LNX.3.96.980617124207.2514F-100000@paris.ivo.fr Whole thread Raw |
In response to | Internationalisation: SELECT str (ignoring Umlauts/Accents) (Benedikt Eric Heinen <beh@icemark.ch>) |
Responses |
Re: [SQL] Internationalisation: SELECT str (ignoring Umlauts/Accents)
|
List | pgsql-sql |
On Wed, 17 Jun 1998, Benedikt Eric Heinen wrote: > has anyone done a good implementation of a search algorithm, that would > allow searching for a string, with automatic use of alternate forms for > Umlauts ('ae' in a search string matches the a-Umlaut ), or allowing to > search without knowing the correct accents ( 'a' in a search string also > matches 'à', 'e' also matches 'é', ... )? > > This should - of course - work one-way only, i.e. if a string in the > database contains "ae" and the search string contains the German a-Umlaut, > then the string containing ae should not be matched, while the other way > around it should. I don't know what you exactly looking for : a specific solution, or a general one. If this is the second case, you have to take care that different languages have different ways for dealing with crippled texts ( = without accents...). For example, in French, you just forget the marks for all characters (i.e. é <-> e, ç <-> c, but also ü <-> u and not ue, and for foreign words ß <-> s, ø <-> o, etc. ), and in other languages, it could be different (Icelandic ð <-> d *and* dh, æ <-> ae, Norwegian å -> aa), etc. Maybe this is a point which should be considered in the discussion about i18n and l10n and collation sequences that is floating around. For a specific solution, you can try to use regex, with Perl substitutions first to have a good pattern. For example, if you're looking for Schädel (word taken randomly in the dictionnary :) ) : select * from text where wort ~* 'sch(Ä|ä|ae)del'; if you want an insensitive search... I think using [] would be nicer but you can't do that, because 'ae' is two characters... Moreover, you have to specify both Ä and ä, because they're not considered upper- and lowercase of the same letter (which I consider a bug... as with most languages, most of the accented characters pairs should be matched as upper/lower in an insensitive search (the only exception is ß/ÿ (0xdf, 0xff) and maybe ×/÷ (0xd7, 0xf7) ;) ). Something strange : ~* '^sch(Ä|ä|ae)del' doesn't return anything, though I had "schädel", "schaedel" as words.... Bug ! Bug ! Anyway, re-reading your mail, I see that you considered only the case where the database has the right spelling, but someone may enter something misspelled (as far as forgetting accents/umlauts is called misspelling).... then, what you want is that someone writing "schaedel" should match both "schaedel" and "schädel", and writing "idee" should match "idée"... then you have to work on the original string... if you want to do that inside a postgres query, you have to do your own function (does PL/Perl exists somewhere ??), which rewrite the string according to your needs... I can think about it further if you need... (I've done it in Perl, for French words, but in your case, you have to consider two characters matching one : (ae|ä) -> ä ae only -> ae a -> (a|à|á|â|ä) (but not ae -> (a|à|á|â|ä)(e|é|è|ê|ë) ?? ) All these rules means you really can't do successive global substitutions, which is what I did since in French, it's always a one to one mapping. It's not trivial, and should be dealt language by language, but maybe with a common base. Actually, I would really like that to be dealt in the i18n discussion, maybe as an extension to regexes, or as a function which is aware of the language of the field queried (which means that if national chars are implanted, functions should be able to know what language is in use for the particular field). As I will have to do multilingual "wordlists" tables in some time, I am very interested in this subject (though I don't know what it would imply with non-iso-8859-1 character sets). I realize also that I don't know much about the specificities of Postgres regexes as compared to perl regexes for example, I will have to look about that (are they fully described in the manuals ?). Patrice Hédé -- Patrice HÉDÉ --------------------------------- patrice@idf.net ----- ... Ásólfr hljóp upp á skip Hrúts ok varð fjögurra manna bani, áðr Hrútr varð varr við. Sneri hann þá í móti honum. En er þeir fundust, lagði Ásólfr í skjöld Hrúts ok í gegnum, en Hrútr hjó til Ásólfs, ok varð þat banahögg. --- Njáls Saga ----- http://www.idf.net/patrice/ ----------------------------------