Re: BUG #10589: hungarian.stop file spelling error - Mailing list pgsql-bugs

From	Gavin Flower
Subject	Re: BUG #10589: hungarian.stop file spelling error
Date	June 11, 2014 03:24:29
Msg-id	5397CBD7.2070606@archidevsys.co.nz
In response to	Re: BUG #10589: hungarian.stop file spelling error (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: BUG #10589: hungarian.stop file spelling error
List	pgsql-bugs

On 11/06/14 15:09, Tom Lane wrote:
> I wrote:
>>> [ we seem to have gotten a misencoded version of hungarian.stop ]
>> Actually, it looks like things are even worse than that: the Hungarian
>> stemmer code seems to be confused about this too.  In the first place,
>> we've got a LATIN1 version of that stemmer, which I would imagine is
>> entirely useless; and in the second place, the UTF8 version has no
>> reference to any non-LATIN1 characters.
>> Again, I'm suspecting this problem goes further than Hungarian,
>> because the set of stem_ISO_8859_1_foo.c files in
>> src/backend/snowball/libstemmer/ covers a lot more languages than
>> I think LATIN1 is meant to cope with.  I'm not sure how much of this
>> is broken in the original Snowball code and how much is our error
>> while importing the code.
> After further analysis, it appears that:
>
> 1. The cause of the immediately complained-of problem is that we took
> the stopword file we got from the Snowball website to be in LATIN1,
> whereas it evidently was meant to be in LATIN2.  The problematic
> characters were code 0xF5 in the file, which we translated to U+00F5,
> but the correct translation is U+0151.  (There is another discrepancy
> between LATIN1 and LATIN2 at code point 0xFB, but by chance there are
> none of those in the stopword file.)
>
> 2. The Snowball people were just as confused as we were about the
> appropriate encoding to use for Hungarian: their code claims that the
> Hungarian stemmer can run in LATIN1, and contains this table of non-ASCII
> character codes used in it:
>
> /* special characters (in ISO Latin I) */
>
> stringdef a'  hex 'E1'  //a-acute
> stringdef e'  hex 'E9'  //e-acute
> stringdef i'  hex 'ED'  //i-acute
> stringdef o'  hex 'F3'  //o-acute
> stringdef o"  hex 'F6'  //o-umlaut
> stringdef oq  hex 'F5'  //o-double acute
> stringdef u'  hex 'FA'  //u-acute
> stringdef u"  hex 'FC'  //u-umlaut
> stringdef uq  hex 'FB'  //u-double acute
>
> Most of these codes are the same in LATIN1 and LATIN2, but o-double-acute
> and u-double-acute don't appear in LATIN1 at all, and the codes shown here
> are really for LATIN2.
>
> I've reported this issue upstream and there are fixes pending.
>
> 3. While I was concerned that there might be similar bugs in the other
> Snowball stemmers, it appears after a bit of research that LATIN1 is
> commonly used as an encoding for all the other languages the Snowball
> code claims it can be used for, even though in a few cases there are
> seldom-used characters that LATIN1 can't represent.  So there's not a
> clear reason to think there are any other undetected problems (and
> I would certainly not be the man to find them if they exist).
>
>
> I've gone ahead and committed the encoding fix for hungarian.stop in all
> active branches.  I'm going to wait for Snowball upstream to accept the
> proposed patches before I think about incorporating the code changes.
>
> I'm not real sure whether we should consider back-patching those changes.
> Right now, the Hungarian stemmer is applying rules meant for
> o-double-acute to o-tilde, which probably means that those stemming rules
> don't fire at all on actual Hungarian text.  If we fix that then the
> stemmer will behave differently, which might not be all that desirable to
> change in a minor release.  Perhaps we should only make the code changes
> in HEAD and 9.4?
>
>             regards, tom lane
>
>
Not saying there is any problem, but you might like to check how the euro
currency symbol is handled (it is in LATIN9, but not in LATIN1 or LATIN2):

https://en.wikipedia.org/wiki/Euro_sign
U+20AC € euro sign
(HTML: &#8364; &euro;)
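
As an aside, the LATIN1/LATIN2 discrepancy described in points 1 and 2 of the quoted analysis can be checked with a short Python sketch. This is only an illustration, not anything taken from the PostgreSQL or Snowball sources: the names SNOWBALL_CODES, decode_both and differing are made up here, and it assumes Python's standard "iso8859_1" and "iso8859_2" codecs as stand-ins for LATIN1 and LATIN2.

```python
# Byte codes copied from the Snowball table quoted above.
# The dictionary keys are just descriptive labels (hypothetical names).
SNOWBALL_CODES = {
    "a-acute": 0xE1,
    "e-acute": 0xE9,
    "i-acute": 0xED,
    "o-acute": 0xF3,
    "o-umlaut": 0xF6,
    "o-double-acute": 0xF5,
    "u-acute": 0xFA,
    "u-umlaut": 0xFC,
    "u-double-acute": 0xFB,
}


def decode_both(code):
    """Return (LATIN1 code point, LATIN2 code point) for a single byte."""
    raw = bytes([code])
    return ord(raw.decode("iso8859_1")), ord(raw.decode("iso8859_2"))


def differing(codes):
    """Return the labels whose byte decodes differently in the two encodings."""
    return sorted(
        name for name, code in codes.items()
        if len(set(decode_both(code))) == 2
    )


if __name__ == "__main__":
    for name, code in SNOWBALL_CODES.items():
        latin1, latin2 = decode_both(code)
        print(f"0x{code:02X} {name:16} LATIN1 U+{latin1:04X}  LATIN2 U+{latin2:04X}")
    print("differ:", differing(SNOWBALL_CODES))
```

Only the two double-acute entries should come out different, which matches the 0xF5 (U+00F5 versus U+0151) and 0xFB cases mentioned in the quoted message.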


Cheers,
Gavin
