Re: BUG #13440: unaccent does not remove all diacritics - Mailing list pgsql-bugs
| From | Thomas Munro |
|---|---|
| Subject | Re: BUG #13440: unaccent does not remove all diacritics |
| Date | |
| Msg-id | CAEepm=2yw0so0ke8ZRy-qWOCrPRC2Ts0cs_6O2Zudkg=R+sR9Q@mail.gmail.com |
| In response to | Re: BUG #13440: unaccent does not remove all diacritics (Tom Lane <tgl@sss.pgh.pa.us>) |
| Responses | Re: BUG #13440: unaccent does not remove all diacritics |
| List | pgsql-bugs |
On Wed, Jun 17, 2015 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> Here is an unaccent.rules file that maps those 702 characters from
>> Unicode 7.0 with names like "LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH
>> ..." to their base letter, plus 14 extra cases to match the existing
>> unaccent.rules file. If you sort and diff this and the existing file,
>> you can see that this file only adds new lines. Also, here is the
>> script I used to build it from UnicodeData.txt.
>
> Hm. The "extra cases" are pretty disturbing, because some of them sure
> look like bugs; which makes me wonder how closely the unaccent.rules
> file was vetted to begin with. For those following along at home,
> here are Thomas' extra cases, annotated by me with the Unicode file's
> description of each source character:
>
> print_record(0x00c6, "A") # LATIN CAPITAL LETTER AE
> print_record(0x00df, "S") # LATIN SMALL LETTER SHARP S
> print_record(0x00e6, "a") # LATIN SMALL LETTER AE
> print_record(0x0131, "i") # LATIN SMALL LETTER DOTLESS I
> print_record(0x0132, "I") # LATIN CAPITAL LIGATURE IJ
> print_record(0x0133, "i") # LATIN SMALL LIGATURE IJ
> print_record(0x0138, "k") # LATIN SMALL LETTER KRA
> print_record(0x0149, "n") # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
> print_record(0x014a, "N") # LATIN CAPITAL LETTER ENG
> print_record(0x014b, "n") # LATIN SMALL LETTER ENG
> print_record(0x0152, "E") # LATIN CAPITAL LIGATURE OE
> print_record(0x0153, "e") # LATIN SMALL LIGATURE OE
> print_record(0x0401, u"\u0415") # CYRILLIC CAPITAL LETTER IO
> print_record(0x0451, u"\u0435") # CYRILLIC SMALL LETTER IO
>
> I'm really dubious that we should be translating those ligatures at
> all (since the standard file is only advertised to do "unaccenting"),
> and if we do translate them, shouldn't they convert to AE, ae, etc?

Perhaps these conversions are intended only for comparisons, full text
indexing etc, but not for showing the converted text to a user, in which
case it doesn't matter too much if the conversions are a bit weird
(œuf and oeuf are interchangeable in French, but euf is nonsense).
But can we actually change them? That could cause difficulty for users
with existing unaccented data stored/indexed... But I suppose even
adding new mappings could cause problems.

> Also unclear why we're dealing with KRA and ENG but not any of the
> other marginal letters that Unicode labels as LATIN (what the heck
> is an "AFRICAN D", for instance?)
>
> Also, while my German is nearly nonexistent, I had the idea that sharp-S
> to "S" would be considered a case-folding transformation not an accent
> removal. Comments from German speakers welcome of course.
>
> Likewise dubious about those Cyrillic entries, although I suppose
> Teodor probably had good reasons for including them.
>
> On the other side of the coin, I think Thomas' regex might have swept up a
> bit too much. I did this to see what sort of decorations were described:
>
> $ egrep ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' UnicodeData.txt | sed 's/.* WITH //' | sed 's/;.*//' | sort | uniq -c
> 34 ACUTE
> ...snip...
> 4 TOPBAR
>
> Do we really need to expand the rule list fivefold to get rid of things
> like FISHHOOK and SQUIRREL TAIL? Is removing those sorts of things even
> legitimately "unaccenting"? I dunno, but I think it would be good to
> have some consensus about what we want this file to do. I'm not sure
> that we should be basing the transformation on minor phrasing details
> in the Unicode data file.
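For anyone following along who wants to reproduce the name-based scan quoted above, here is a minimal Python sketch. It assumes a local copy of UnicodeData.txt; the function name rules_by_name and the tab-separated output are illustrative and are not taken from the attached script.

```python
# Sketch of the name-based approach: map every character whose name looks like
# "LATIN (SMALL|CAPITAL) LETTER X WITH ..." to its base letter. UnicodeData.txt
# is semicolon-separated, with the code point in field 1 and the name in field 2.
import re

NAME_PATTERN = re.compile(r'^LATIN (SMALL|CAPITAL) LETTER ([A-Z]) WITH ')

def rules_by_name(path="UnicodeData.txt"):
    """Yield (precomposed character, base letter) pairs by matching names."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split(";")
            codepoint, name = int(fields[0], 16), fields[1]
            m = NAME_PATTERN.match(name)
            if m:
                case, base = m.groups()
                yield chr(codepoint), base if case == "CAPITAL" else base.lower()

if __name__ == "__main__":
    for src, dst in rules_by_name():
        print(f"{src}\t{dst}")
```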
Right, that does seem a little bit weak. Instead of making assumptions
about the format of those names, we could make use of the precomposed ->
decomposed character mappings in the file. We could look for characters
in the "letters" category where there is decomposition information
(i.e. combining characters for the individual accents) and the base
character is [a-zA-Z]. See attached. This produces 411 mappings
(including the 14 extras). I didn't spend the time to figure out which
300-odd characters were dropped, but I noticed that our Romanian
characters of interest are definitely in. (There is a separate can of
worms here about whether to deal with decomposed text...)

--
Thomas Munro
http://www.enterprisedb.com
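A minimal Python sketch of the decomposition-based selection described above, again assuming a local UnicodeData.txt; the function name rules_by_decomposition and the output format are illustrative only, not the attached script.

```python
# Sketch of the decomposition-based approach: keep characters in the letter
# categories (Lu/Ll) whose canonical decomposition starts with a plain ASCII
# letter, and map them to that base letter. In UnicodeData.txt, field 3 is the
# general category and field 6 is the decomposition.
import string

def rules_by_decomposition(path="UnicodeData.txt"):
    """Yield (precomposed character, ASCII base letter) pairs."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split(";")
            category, decomposition = fields[2], fields[5]
            if category not in ("Lu", "Ll") or not decomposition:
                continue
            if decomposition.startswith("<"):
                continue  # skip compatibility decompositions such as <compat>
            base = chr(int(decomposition.split()[0], 16))
            # Non-recursive: if the decomposition starts with another precomposed
            # letter rather than a plain ASCII one, the character is dropped,
            # which is one reason this yields fewer mappings than the name scan.
            if base in string.ascii_letters:
                yield chr(int(fields[0], 16)), base

if __name__ == "__main__":
    for src, dst in rules_by_decomposition():
        print(f"{src}\t{dst}")
```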