Re: BUG #10589: hungarian.stop file spelling error - Mailing list pgsql-bugs
From | Gavin Flower |
---|---|
Subject | Re: BUG #10589: hungarian.stop file spelling error |
Date | |
Msg-id | 5397CBD7.2070606@archidevsys.co.nz |
In response to | Re: BUG #10589: hungarian.stop file spelling error (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: BUG #10589: hungarian.stop file spelling error |
List | pgsql-bugs |
On 11/06/14 15:09, Tom Lane wrote:
> I wrote:
>>> [ we seem to have gotten a misencoded version of hungarian.stop ]
>> Actually, it looks like things are even worse than that: the Hungarian
>> stemmer code seems to be confused about this too. In the first place,
>> we've got a LATIN1 version of that stemmer, which I would imagine is
>> entirely useless; and in the second place, the UTF8 version has no
>> reference to any non-LATIN1 characters.
>> Again, I'm suspecting this problem goes further than Hungarian,
>> because the set of stem_ISO_8859_1_foo.c files in
>> src/backend/snowball/libstemmer/ covers a lot more languages than
>> I think LATIN1 is meant to cope with. I'm not sure how much of this
>> is broken in the original Snowball code and how much is our error
>> while importing the code.
> After further analysis, it appears that:
>
> 1. The cause of the immediately complained-of problem is that we took
> the stopword file we got from the Snowball website to be in LATIN1,
> whereas it evidently was meant to be in LATIN2. The problematic
> characters were code 0xF5 in the file, which we translated to U+00F5,
> but the correct translation is U+0151. (There is another discrepancy
> between LATIN1 and LATIN2 at code point 0xFB, but by chance there are
> none of those in the stopword file.)
>
> 2. The Snowball people were just as confused as we were about the
> appropriate encoding to use for Hungarian: their code claims that the
> Hungarian stemmer can run in LATIN1, and contains this table of non-ASCII
> character codes used in it:
>
>     /* special characters (in ISO Latin I) */
>
>     stringdef a'   hex 'E1'  //a-acute
>     stringdef e'   hex 'E9'  //e-acute
>     stringdef i'   hex 'ED'  //i-acute
>     stringdef o'   hex 'F3'  //o-acute
>     stringdef o"   hex 'F6'  //o-umlaut
>     stringdef oq   hex 'F5'  //o-double acute
>     stringdef u'   hex 'FA'  //u-acute
>     stringdef u"   hex 'FC'  //u-umlaut
>     stringdef uq   hex 'FB'  //u-double acute
>
> Most of these codes are the same in LATIN1 and LATIN2, but o-double-acute
> and u-double-acute don't appear in LATIN1 at all, and the codes shown here
> are really for LATIN2.
>
> I've reported this issue upstream and there are fixes pending.
>
> 3. While I was concerned that there might be similar bugs in the other
> Snowball stemmers, it appears after a bit of research that LATIN1 is
> commonly used as an encoding for all the other languages the Snowball
> code claims it can be used for, even though in a few cases there are
> seldom-used characters that LATIN1 can't represent. So there's not a
> clear reason to think there are any other undetected problems (and
> I would certainly not be the man to find them if they exist).
>
>
> I've gone ahead and committed the encoding fix for hungarian.stop in all
> active branches. I'm going to wait for Snowball upstream to accept the
> proposed patches before I think about incorporating the code changes.
>
> I'm not real sure whether we should consider back-patching those changes.
> Right now, the Hungarian stemmer is applying rules meant for
> o-double-acute to o-tilde, which probably means that those stemming rules
> don't fire at all on actual Hungarian text. If we fix that then the
> stemmer will behave differently, which might not be all that desirable to
> change in a minor release. Perhaps we should only make the code changes
> in HEAD and 9.4?
>
> regards, tom lane
>
>

Not saying there is any problem, but you might like to check how the EUR
currency symbol is handled (it is in LATIN2, but not in LATIN1):

https://en.wikipedia.org/wiki/Euro_sign

U+20AC € euro sign (HTML: &#8364; &euro;)

Cheers,
Gavin
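
The byte-level discrepancy described in point 1 can be reproduced with a short Python sketch. It is only an illustration and is not taken from the PostgreSQL or Snowball sources: the helper names `describe` and `can_encode` are hypothetical, and it assumes Python's bundled `iso8859_1`, `iso8859_2` and `iso8859_15` codecs match the corresponding ISO 8859 tables.

```python
import unicodedata


def describe(byte_value: int, encoding: str) -> str:
    """Decode one byte in the given encoding and return 'U+XXXX NAME'.

    Hypothetical helper, used only to show how the same byte maps to
    different characters under LATIN1 and LATIN2.
    """
    ch = bytes([byte_value]).decode(encoding)
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def can_encode(ch: str, encoding: str) -> bool:
    """Return True if the character has a code in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # 0xF5 and 0xFB are the two codes from the stemmer table that differ
    # between LATIN1 (ISO 8859-1) and LATIN2 (ISO 8859-2).
    for byte_value in (0xF5, 0xFB):
        for encoding in ("iso8859_1", "iso8859_2"):
            print(f"0x{byte_value:02X} as {encoding}: "
                  f"{describe(byte_value, encoding)}")

    # The same kind of check for the euro sign, U+20AC.
    for encoding in ("iso8859_1", "iso8859_2", "iso8859_15"):
        print(f"euro sign in {encoding}: {can_encode(chr(0x20AC), encoding)}")
```

Read as LATIN1, byte 0xF5 decodes to U+00F5 (o with tilde); read as LATIN2, it decodes to U+0151 (o with double acute), which is the mistranslation found in hungarian.stop. The `can_encode` helper offers a quick way to test the euro-sign question: with Python's codec tables, U+20AC is encodable in ISO 8859-15 (LATIN9) but not in ISO 8859-1 or ISO 8859-2.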