Re: BUG #13440: unaccent does not remove all diacritics - Mailing list pgsql-bugs
From | Thomas Munro |
---|---|
Subject | Re: BUG #13440: unaccent does not remove all diacritics |
Date | |
Msg-id | CAEepm=1KRVinFtuDao4L+qSBh4T4k3z996EwD5-zgytu4Qa5Fw@mail.gmail.com |
In response to | Re: BUG #13440: unaccent does not remove all diacritics (Thomas Munro <thomas.munro@enterprisedb.com>) |
Responses |
Re: BUG #13440: unaccent does not remove all diacritics
|
List | pgsql-bugs |
On Fri, Jun 19, 2015 at 2:00 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Fri, Jun 19, 2015 at 7:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>>> On Wed, Jun 17, 2015 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> I'm really dubious that we should be translating those ligatures at
>>>> all (since the standard file is only advertised to do "unaccenting"),
>>>> and if we do translate them, shouldn't they convert to AE, ae, etc?
>>
>>> Perhaps these conversions are intended only for comparisons, full text
>>> indexing etc but not showing the converted text to a user, in which
>>> case it doesn't matter too much if the conversions are a bit weird
>>> (œuf and oeuf are interchangeable in French, but euf is nonsense).
>>> But can we actually change them? That could cause difficulty for
>>> users with existing unaccented data stored/indexed... But I suppose
>>> even adding new mappings could cause problems.
>>
>> Yeah, if we do anything other than adding new mappings, I suspect that
>> part could not be back-patched. Maybe adding new mappings shouldn't
>> be back-patched either, though it seems relatively safe to me.
>>
>>> Right, that does seem a little bit weak. Instead of making
>>> assumptions about the format of those names, we could make use of the
>>> precomposed -> composed character mappings in the file. We could look
>>> for characters in the "letters" category where there is decomposition
>>> information (ie combining characters for the individual accents) and
>>> the base character is [a-zA-Z]. See attached. This produces 411
>>> mappings (including the 14 extras). I didn't spend the time to figure
>>> out which 300 odd characters were dropped but I noticed that our
>>> Romanian characters of interest are definitely in.
>>
>> I took a quick look at this list and it seems fairly sane as far as the
>> automatically-generated items go, except that I see it hits a few
>> LIGATURE cases (including the existing ij cases, but also fi fl and ffl).
>> I'm still quite dubious that that is appropriate; at least, if we do it
>> I think we should be expanding out to the equivalent multi-letter form,
>> not simply taking one of the letters and dropping the rest. Anybody else
>> have an opinion on how to handle ligatures?
>
> Here is a version that optionally expands ligatures if asked to with
> --expand-ligatures.

I looked at this again and noticed a few problems. I've attached a new
version. Here is a summary of the changes compared to what is in
master:

* 6 existing ligatures expanded fully: Æ, æ, IJ, ij, Œ, œ

* 18 new ligatures added: DŽ, Dž, dž, LJ, Lj, lj, NJ, Nj, nj, DZ, Dz, dz, ff, fi, fl, ffi, ffl, st

* ß expanded to ss instead of S

* ʼn expanded to 'n instead of n

* 5 existing characters that involve neither diacritic marks[1] nor
ligatures dropped: ĸ, Ŀ, ŀ, Ŋ, ŋ

* 213 new characters with diacritics added: Ơ, ơ, Ư, ư, Ǎ, ǎ, Ǐ, ǐ, Ǒ, ǒ, Ǔ, ǔ, Ǧ, ǧ, Ǩ, ǩ, Ǫ, ǫ, ǰ, Ǵ, ǵ, Ǹ, ǹ, Ȁ, ȁ, Ȃ, ȃ, Ȅ, ȅ, Ȇ, ȇ, Ȉ, ȉ, Ȋ, ȋ, Ȍ, ȍ, Ȏ, ȏ, Ȑ, ȑ, Ȓ, ȓ, Ȕ, ȕ, Ȗ, ȗ, Ș, ș, Ț, ț, Ȟ, ȟ, Ȧ, ȧ, Ȩ, ȩ, Ȯ, ȯ, Ȳ, ȳ, Ḁ, ḁ, Ḃ, ḃ, Ḅ, ḅ, Ḇ, ḇ, Ḋ, ḋ, Ḍ, ḍ, Ḏ, ḏ, Ḑ, ḑ, Ḓ, ḓ, Ḙ, ḙ, Ḛ, ḛ, Ḟ, ḟ, Ḡ, ḡ, Ḣ, ḣ, Ḥ, ḥ, Ḧ, ḧ, Ḩ, ḩ, Ḫ, ḫ, Ḭ, ḭ, Ḱ, ḱ, Ḳ, ḳ, Ḵ, ḵ, Ḷ, ḷ, Ḻ, ḻ, Ḽ, ḽ, Ḿ, ḿ, Ṁ, ṁ, Ṃ, ṃ, Ṅ, ṅ, Ṇ, ṇ, Ṉ, ṉ, Ṋ, ṋ, Ṕ, ṕ, Ṗ, ṗ, Ṙ, ṙ, Ṛ, ṛ, Ṟ, ṟ, Ṡ, ṡ, Ṣ, ṣ, Ṫ, ṫ, Ṭ, ṭ, Ṯ, ṯ, Ṱ, ṱ, Ṳ, ṳ, Ṵ, ṵ, Ṷ, ṷ, Ṽ, ṽ, Ṿ, ṿ, Ẁ, ẁ, Ẃ, ẃ, Ẅ, ẅ, Ẇ, ẇ, Ẉ, ẉ, Ẋ, ẋ, Ẍ, ẍ, Ẏ, ẏ, Ẑ, ẑ, Ẓ, ẓ, Ẕ, ẕ, ẖ, ẗ, ẘ, ẙ, Ạ, ạ, Ả, ả, Ẹ, ẹ, Ẻ, ẻ, Ẽ, ẽ, Ỉ, ỉ, Ị, ị, Ọ, ọ, Ỏ, ỏ, Ụ, ụ, Ủ, ủ, Ỳ, ỳ, Ỵ, ỵ, Ỷ, ỷ, Ỹ, ỹ

In the previous version I'd missed the LATIN ... WITH STROKE
characters like ø and ł because they aren't treated as diacritics or
ligatures in the Unicode decomposition data (they're just separate
letters, but they have an obvious unadorned ASCII replacement letter
and we already handle these). There may be a case for replacing ø
with oe[2] but that's not what we do now. Can any Danish or Norwegian
speakers comment on this? There are actually 36 characters with names
matching /LATIN (CAPITAL|SMALL) LETTER [A-Z] WITH STROKE/, but I added
only the ones that we already had, namely O, D, H, L and lower case
equivalents. Many of the rest seem to be obscure specialised
characters not used in real languages.

I don't see why we would take out that Cyrillic character: it seems
like a totally legitimate case[3]. Even though it doesn't fit in with
the idea that some might have of unaccent as the
"make-this-into-plain-ASCII" function, there doesn't seem to be any
reason why we shouldn't be able to handle Latin, Cyrillic and (if
someone with the knowledge wants to add them) Greek characters in the
same rule file -- they are non-overlapping, and all have diacritic
marks which can be stripped to give a basic character set. That seems
pretty useful for text search type applications, which is what this
feature is for AFAIK.

[1] That L is combining with punctuation, not a mark, according to
Unicode, and generally doesn't seem to be used in any language (unlike
ʼn/'n which is a common word in Afrikaans)

[2] https://en.wikipedia.org/wiki/%C3%98 'In other languages that do
not have the letter as part of the regular alphabet, or in limited
character sets such as ASCII, ø is frequently replaced with the
two-letter combination "oe".'

[3] https://en.wiktionary.org/wiki/%D1%91 'This letter invariably
bears the word stress. However, the diaeresis is usually not used
outside of dictionaries and children’s books, where the letter is
usually written simply as е.'

--
Thomas Munro
http://www.enterprisedb.com
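
For readers without the attachment, the selection rule discussed in this message (letters whose Unicode decomposition is an [a-zA-Z] base plus combining marks, with ligatures expanded to their full multi-letter form) can be illustrated roughly as follows. This is only a sketch and not the attached script: it assumes Python's standard `unicodedata` module rather than parsing the Unicode data file directly, and the names `strip_marks` and `expand_ligature` are hypothetical.

```python
import unicodedata

ASCII_LETTERS = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")


def strip_marks(ch):
    """Return the ASCII base letter of a precomposed letter, else None.

    Hypothetical helper: accepts only characters in a "letter" category
    whose canonical decomposition is an [a-zA-Z] base followed solely by
    combining marks.
    """
    if not unicodedata.category(ch).startswith("L"):
        return None
    decomposed = unicodedata.normalize("NFD", ch)
    if len(decomposed) < 2:
        return None  # no canonical decomposition (e.g. ø, ł, plain ASCII)
    base, marks = decomposed[0], decomposed[1:]
    if base in ASCII_LETTERS and all(
        unicodedata.category(m).startswith("M") for m in marks
    ):
        return base
    return None


def expand_ligature(ch):
    """Return the multi-letter ASCII expansion of a ligature, else None.

    Hypothetical helper: uses the compatibility decomposition and drops
    combining marks, so U+01C6 (d + z with caron) becomes "dz".
    """
    if not unicodedata.category(ch).startswith("L"):
        return None
    decomposed = unicodedata.normalize("NFKD", ch)
    letters = "".join(
        c for c in decomposed if not unicodedata.category(c).startswith("M")
    )
    if len(letters) > 1 and all(c in ASCII_LETTERS for c in letters):
        return letters
    return None


if __name__ == "__main__":
    for sample in ["\u0219", "\u00e9", "\u00f8", "\ufb01", "\u01c6", "\u00e6"]:
        print(repr(sample), strip_marks(sample), expand_ligature(sample))
```

Characters such as ø, ł, Æ, Œ and ß have no decomposition in the Unicode data, so a rule like this returns nothing for them and they would still need explicit entries. Cyrillic ё is also rejected here only because the base set is restricted to ASCII; widening the set of accepted base letters would be one way to cover it.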