Re: BUG #13440: unaccent does not remove all diacritics - Mailing list pgsql-bugs
From | Tom Lane |
---|---|
Subject | Re: BUG #13440: unaccent does not remove all diacritics |
Date | |
Msg-id | 1790.1434492074@sss.pgh.pa.us |
In response to | Re: BUG #13440: unaccent does not remove all diacritics (Thomas Munro <thomas.munro@enterprisedb.com>) |
Responses |
Re: BUG #13440: unaccent does not remove all diacritics
|
List | pgsql-bugs |
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Tue, Jun 16, 2015 at 8:07 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> It looks like Romanian also has s with comma.  Perhaps we should have
>> all these characters:
>>
>> $ curl -s http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt | egrep
>> ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' | wc -l
>> 702

> Here is an unaccent.rules file that maps those 702 characters from
> Unicode 7.0 with names like "LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH
> ..." to their base letter, plus 14 extra cases to match the existing
> unaccent.rules file.  If you sort and diff this and the existing file,
> you can see that this file only adds new lines.  Also, here is the
> script I used to build it from UnicodeData.txt.

Hm.  The "extra cases" are pretty disturbing, because some of them sure
look like bugs; which makes me wonder how closely the unaccent.rules file
was vetted to begin with.  For those following along at home, here are
Thomas' extra cases, annotated by me with the Unicode file's description
of each source character:

    print_record(0x00c6, "A") # LATIN CAPITAL LETTER AE
    print_record(0x00df, "S") # LATIN SMALL LETTER SHARP S
    print_record(0x00e6, "a") # LATIN SMALL LETTER AE
    print_record(0x0131, "i") # LATIN SMALL LETTER DOTLESS I
    print_record(0x0132, "I") # LATIN CAPITAL LIGATURE IJ
    print_record(0x0133, "i") # LATIN SMALL LIGATURE IJ
    print_record(0x0138, "k") # LATIN SMALL LETTER KRA
    print_record(0x0149, "n") # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
    print_record(0x014a, "N") # LATIN CAPITAL LETTER ENG
    print_record(0x014b, "n") # LATIN SMALL LETTER ENG
    print_record(0x0152, "E") # LATIN CAPITAL LIGATURE OE
    print_record(0x0153, "e") # LATIN SMALL LIGATURE OE
    print_record(0x0401, u"\u0415") # CYRILLIC CAPITAL LETTER IO
    print_record(0x0451, u"\u0435") # CYRILLIC SMALL LETTER IO

I'm really dubious that we should be translating those ligatures at all
(since the standard file is only advertised to do "unaccenting"), and if
we do translate them, shouldn't they convert to AE, ae, etc?

Also unclear why we're dealing with KRA and ENG but not any of the other
marginal letters that Unicode labels as LATIN (what the heck is an
"AFRICAN D", for instance?)

Also, while my German is nearly nonexistent, I had the idea that sharp-S
to "S" would be considered a case-folding transformation not an accent
removal.  Comments from German speakers welcome of course.

Likewise dubious about those Cyrillic entries, although I suppose Teodor
probably had good reasons for including them.

On the other side of the coin, I think Thomas' regex might have swept up
a bit too much.  I did this to see what sort of decorations were
described:

$ egrep ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' UnicodeData.txt | sed 's/.* WITH //' | sed 's/;.*//' | sort | uniq -c
     34 ACUTE
      2 ACUTE AND DOT ABOVE
      4 BAR
      2 BELT
     12 BREVE
      2 BREVE AND ACUTE
      2 BREVE AND DOT BELOW
      2 BREVE AND GRAVE
      2 BREVE AND HOOK ABOVE
      2 BREVE AND TILDE
      2 BREVE BELOW
     34 CARON
      2 CARON AND DOT ABOVE
     22 CEDILLA
      2 CEDILLA AND ACUTE
      2 CEDILLA AND BREVE
     26 CIRCUMFLEX
      6 CIRCUMFLEX AND ACUTE
      6 CIRCUMFLEX AND DOT BELOW
      6 CIRCUMFLEX AND GRAVE
      6 CIRCUMFLEX AND HOOK ABOVE
      6 CIRCUMFLEX AND TILDE
     12 CIRCUMFLEX BELOW
      4 COMMA BELOW
      4 CROSSED-TAIL
      7 CURL
      8 DESCENDER
     19 DIAERESIS
      4 DIAERESIS AND ACUTE
      2 DIAERESIS AND CARON
      2 DIAERESIS AND GRAVE
      6 DIAERESIS AND MACRON
      2 DIAERESIS BELOW
      8 DIAGONAL STROKE
     39 DOT ABOVE
      4 DOT ABOVE AND MACRON
     38 DOT BELOW
      2 DOT BELOW AND DOT ABOVE
      4 DOT BELOW AND MACRON
      4 DOUBLE ACUTE
      2 DOUBLE BAR
     12 DOUBLE GRAVE
      1 DOUBLE MIDDLE TILDE
      1 FISHHOOK
      1 FISHHOOK AND MIDDLE TILDE
      5 FLOURISH
     16 GRAVE
      2 HIGH STROKE
     30 HOOK
     12 HOOK ABOVE
      1 HOOK AND TAIL
      1 HOOK TAIL
      4 HORN
      4 HORN AND ACUTE
      4 HORN AND DOT BELOW
      4 HORN AND GRAVE
      4 HORN AND HOOK ABOVE
      4 HORN AND TILDE
     12 INVERTED BREVE
      1 INVERTED LAZY S
      3 LEFT HOOK
     17 LINE BELOW
      1 LONG LEFT LEG
      1 LONG LEFT LEG AND LOW RIGHT RING
      1 LONG LEG
      2 LONG RIGHT LEG
      2 LONG STROKE OVERLAY
      4 LOOP
      1 LOW RIGHT RING
      1 LOW RING INSIDE
     14 MACRON
      4 MACRON AND ACUTE
      2 MACRON AND DIAERESIS
      4 MACRON AND GRAVE
      2 MIDDLE DOT
      1 MIDDLE RING
     13 MIDDLE TILDE
      1 NOTCH
     10 OBLIQUE STROKE
     10 OGONEK
      2 OGONEK AND MACRON
     17 PALATAL HOOK
      9 RETROFLEX HOOK
      1 RETROFLEX HOOK AND BELT
      1 RIGHT HALF RING
      1 RIGHT HOOK
      6 RING ABOVE
      2 RING ABOVE AND ACUTE
      2 RING BELOW
      1 SERIF
      2 SHORT RIGHT LEG
      2 SMALL LETTER J
      1 SMALL LETTER Z
      2 SQUIRREL TAIL
     36 STROKE
      2 STROKE AND ACUTE
      2 STROKE AND DIAGONAL STROKE
      4 STROKE THROUGH DESCENDER
      4 SWASH TAIL
      3 TAIL
     16 TILDE
      4 TILDE AND ACUTE
      2 TILDE AND DIAERESIS
      2 TILDE AND MACRON
      6 TILDE BELOW
      4 TOPBAR

Do we really need to expand the rule list fivefold to get rid of things
like FISHHOOK and SQUIRREL TAIL?  Is removing those sorts of things even
legitimately "unaccenting"?  I dunno, but I think it would be good to
have some consensus about what we want this file to do.  I'm not sure
that we should be basing the transformation on minor phrasing details
in the Unicode data file.

			regards, tom lane
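
Editorial note, not part of the original message: the decoration tally in the message can be approximated without downloading UnicodeData.txt. What follows is a minimal Python sketch under stated assumptions. It reads character names from Python's bundled unicodedata module, which normally tracks a Unicode version newer than 7.0, so the counts it prints may be larger than the ones quoted in the message. The names decoration and tally_decorations are hypothetical helpers introduced only for illustration.

```python
import re
import unicodedata
from collections import Counter

# Same name pattern as the egrep in the message, applied to the whole name.
NAME_RE = re.compile(r"LATIN (?:SMALL|CAPITAL) LETTER [A-Z] WITH .+")


def decoration(name):
    """Return the text after the last ' WITH ' for a matching name, else None.

    Taking the text after the *last* ' WITH ' mirrors the greedy
    sed 's/.* WITH //' step in the shell pipeline.
    """
    if NAME_RE.fullmatch(name) is None:
        return None
    return name.rsplit(" WITH ", 1)[1]


def tally_decorations():
    """Count decoration phrases over every named code point."""
    counts = Counter()
    for cp in range(0x110000):
        # Unnamed code points (including surrogates) yield the default "".
        d = decoration(unicodedata.name(chr(cp), ""))
        if d is not None:
            counts[d] += 1
    return counts


if __name__ == "__main__":
    # Print in the same "count phrase" layout as sort | uniq -c.
    for phrase, n in sorted(tally_decorations().items()):
        print(f"{n:7d} {phrase}")
```

Running the script prints one line per decoration phrase, which makes it easier to decide which phrases (ACUTE, CEDILLA, ...) a rules generator should treat as accents and which (FISHHOOK, SQUIRREL TAIL, ...) it should leave alone.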