Re: BUG #10589: hungarian.stop file spelling error - Mailing list pgsql-bugs
From | Tom Lane |
---|---|
Subject | Re: BUG #10589: hungarian.stop file spelling error |
Date | |
Msg-id | 6135.1402456162@sss.pgh.pa.us |
In response to | Re: BUG #10589: hungarian.stop file spelling error (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: BUG #10589: hungarian.stop file spelling error
Re: BUG #10589: hungarian.stop file spelling error |
List | pgsql-bugs |
I wrote:
>> [ we seem to have gotten a misencoded version of hungarian.stop ]

> Actually, it looks like things are even worse than that: the Hungarian
> stemmer code seems to be confused about this too.  In the first place,
> we've got a LATIN1 version of that stemmer, which I would imagine is
> entirely useless; and in the second place, the UTF8 version has no
> reference to any non-LATIN1 characters.

> Again, I'm suspecting this problem goes further than Hungarian,
> because the set of stem_ISO_8859_1_foo.c files in
> src/backend/snowball/libstemmer/ covers a lot more languages than
> I think LATIN1 is meant to cope with.  I'm not sure how much of this
> is broken in the original Snowball code and how much is our error
> while importing the code.

After further analysis, it appears that:

1. The cause of the immediately complained-of problem is that we took the
stopword file we got from the Snowball website to be in LATIN1, whereas it
evidently was meant to be in LATIN2.  The problematic characters were code
0xF5 in the file, which we translated to U+00F5, but the correct
translation is U+0151.  (There is another discrepancy between LATIN1 and
LATIN2 at code point 0xFB, but by chance there are none of those in the
stopword file.)

2. The Snowball people were just as confused as we were about the
appropriate encoding to use for Hungarian: their code claims that the
Hungarian stemmer can run in LATIN1, and contains this table of non-ASCII
character codes used in it:

    /* special characters (in ISO Latin I) */

    stringdef a'   hex 'E1'  //a-acute
    stringdef e'   hex 'E9'  //e-acute
    stringdef i'   hex 'ED'  //i-acute
    stringdef o'   hex 'F3'  //o-acute
    stringdef o"   hex 'F6'  //o-umlaut
    stringdef oq   hex 'F5'  //o-double acute
    stringdef u'   hex 'FA'  //u-acute
    stringdef u"   hex 'FC'  //u-umlaut
    stringdef uq   hex 'FB'  //u-double acute

Most of these codes are the same in LATIN1 and LATIN2, but o-double-acute
and u-double-acute don't appear in LATIN1 at all, and the codes shown here
are really for LATIN2.  I've reported this issue upstream and there are
fixes pending.

3. While I was concerned that there might be similar bugs in the other
Snowball stemmers, it appears after a bit of research that LATIN1 is
commonly used as an encoding for all the other languages the Snowball code
claims it can be used for, even though in a few cases there are
seldom-used characters that LATIN1 can't represent.  So there's not a clear
reason to think there are any other undetected problems (and I would
certainly not be the man to find them if they exist).

I've gone ahead and committed the encoding fix for hungarian.stop in all
active branches.  I'm going to wait for Snowball upstream to accept the
proposed patches before I think about incorporating the code changes.

I'm not real sure whether we should consider back-patching those changes.
Right now, the Hungarian stemmer is applying rules meant for
o-double-acute to o-tilde, which probably means that those stemming rules
don't fire at all on actual Hungarian text.  If we fix that then the
stemmer will behave differently, which might not be all that desirable to
change in a minor release.  Perhaps we should only make the code changes
in HEAD and 9.4?

			regards, tom lane
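
The mix-up described in points 1 and 2 can be sketched in a few lines of Python. This is only an illustration, assuming Python's standard iso8859_1 and iso8859_2 codecs stand in for LATIN1 and LATIN2; the function names and the sample word are hypothetical and do not come from the PostgreSQL or Snowball sources, and this is not how the committed fix was produced.

```python
# Byte codes copied from the stringdef table quoted in the message.
HUNGARIAN_CODES = {
    "a-acute": 0xE1,
    "e-acute": 0xE9,
    "i-acute": 0xED,
    "o-acute": 0xF3,
    "o-umlaut": 0xF6,
    "o-double-acute": 0xF5,
    "u-acute": 0xFA,
    "u-umlaut": 0xFC,
    "u-double-acute": 0xFB,
}


def decode_both(code):
    """Decode one byte as LATIN1 and as LATIN2; return both characters."""
    raw = bytes([code])
    return raw.decode("iso8859_1"), raw.decode("iso8859_2")


def mismatches(codes):
    """Return the entries whose byte means different characters in the two encodings."""
    result = {}
    for name, code in codes.items():
        latin1, latin2 = decode_both(code)
        if latin1 != latin2:
            result[name] = (latin1, latin2)
    return result


def repair(text):
    """Undo a LATIN1 mis-decoding of LATIN2 data (hypothetical helper).

    Re-encoding as LATIN1 recovers the original bytes, which are then
    decoded as LATIN2, the encoding the data was actually written in.
    """
    return text.encode("iso8859_1").decode("iso8859_2")


if __name__ == "__main__":
    for name, (latin1, latin2) in mismatches(HUNGARIAN_CODES).items():
        print(f"{name}: LATIN1 U+{ord(latin1):04X}, LATIN2 U+{ord(latin2):04X}")
    # Sample word chosen for illustration, with o-tilde where
    # o-double-acute belongs.
    print(repair("el\u00f5tt"))
```

Only the two double-acute entries differ between the encodings, which matches the observation that the rest of the table is identical in LATIN1 and LATIN2. The round trip in `repair` works because LATIN1 maps every byte value to the code point of the same number, so no information is lost by the wrong decoding.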