Re: BUG #10589: hungarian.stop file spelling error - Mailing list pgsql-bugs

From	Tom Lane
Subject	Re: BUG #10589: hungarian.stop file spelling error
Date	June 11, 2014 03:09:28
Msg-id	6135.1402456162@sss.pgh.pa.us
In response to	Re: BUG #10589: hungarian.stop file spelling error (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: BUG #10589: hungarian.stop file spelling error Re: BUG #10589: hungarian.stop file spelling error
List	pgsql-bugs

I wrote:
>> [ we seem to have gotten a misencoded version of hungarian.stop ]

> Actually, it looks like things are even worse than that: the Hungarian
> stemmer code seems to be confused about this too.  In the first place,
> we've got a LATIN1 version of that stemmer, which I would imagine is
> entirely useless; and in the second place, the UTF8 version has no
> reference to any non-LATIN1 characters.

> Again, I'm suspecting this problem goes further than Hungarian,
> because the set of stem_ISO_8859_1_foo.c files in
> src/backend/snowball/libstemmer/ covers a lot more languages than
> I think LATIN1 is meant to cope with.  I'm not sure how much of this
> is broken in the original Snowball code and how much is our error
> while importing the code.

After further analysis, it appears that:

1. The cause of the immediately complained-of problem is that we took
the stopword file we got from the Snowball website to be in LATIN1,
whereas it evidently was meant to be in LATIN2.  The problematic
characters were code 0xF5 in the file, which we translated to U+00F5,
but the correct translation is U+0151.  (There is another discrepancy
between LATIN1 and LATIN2 at code point 0xFB, but by chance there are
none of those in the stopword file.)

2. The Snowball people were just as confused as we were about the
appropriate encoding to use for Hungarian: their code claims that the
Hungarian stemmer can run in LATIN1, and contains this table of non-ASCII
character codes used in it:

/* special characters (in ISO Latin I) */

stringdef a'  hex 'E1'  //a-acute
stringdef e'  hex 'E9'  //e-acute
stringdef i'  hex 'ED'  //i-acute
stringdef o'  hex 'F3'  //o-acute
stringdef o"  hex 'F6'  //o-umlaut
stringdef oq  hex 'F5'  //o-double acute
stringdef u'  hex 'FA'  //u-acute
stringdef u"  hex 'FC'  //u-umlaut
stringdef uq  hex 'FB'  //u-double acute

Most of these codes are the same in LATIN1 and LATIN2, but o-double-acute
and u-double-acute don't appear in LATIN1 at all, and the codes shown here
are really for LATIN2.

I've reported this issue upstream and there are fixes pending.

3. While I was concerned that there might be similar bugs in the other
Snowball stemmers, it appears after a bit of research that LATIN1 is
commonly used as an encoding for all the other languages the Snowball
code claims it can be used for, even though in a few cases there are
seldom-used characters that LATIN1 can't represent.  So there's not a
clear reason to think there are any other undetected problems (and
I would certainly not be the man to find them if they exist).
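
To make points 1 and 2 concrete, here is a minimal Python sketch (not part of the PostgreSQL or Snowball sources). It decodes the nine byte values from the stringdef table under both encodings and reports which ones disagree. The dictionary keys, the helper names, and the sample word are assumptions made for illustration; only the byte values come from the table above.

```python
# Byte values are copied from the stringdef table in the message;
# the names and helper functions are illustrative assumptions.
HUNGARIAN_CODES = {
    "a-acute": 0xE1,
    "e-acute": 0xE9,
    "i-acute": 0xED,
    "o-acute": 0xF3,
    "o-umlaut": 0xF6,
    "o-double-acute": 0xF5,
    "u-acute": 0xFA,
    "u-umlaut": 0xFC,
    "u-double-acute": 0xFB,
}


def decode_both(code):
    """Return the character one byte maps to under LATIN1 and under LATIN2."""
    raw = bytes([code])
    return raw.decode("iso-8859-1"), raw.decode("iso-8859-2")


def differing_codes(codes):
    """Return the names whose byte decodes differently in the two encodings."""
    return [name for name, code in codes.items()
            if len(set(decode_both(code))) > 1]


def repair(text):
    """Undo a LATIN1 misdecode of text whose bytes were really LATIN2."""
    return text.encode("iso-8859-1").decode("iso-8859-2")


for name, code in HUNGARIAN_CODES.items():
    latin1_char, latin2_char = decode_both(code)
    print(f"{name:15} 0x{code:02X}  "
          f"LATIN1 U+{ord(latin1_char):04X}  LATIN2 U+{ord(latin2_char):04X}")

print("differ:", differing_codes(HUNGARIAN_CODES))

# A sample word containing o-double-acute, as it would look after the
# wrong (LATIN1) decode; not claimed to be taken from hungarian.stop.
print(repair("el\u00f5tt"))
```

Only the two double-acute entries come out different, which matches the analysis above. Note that repair() is only safe on text known to have been misdecoded this way; it raises UnicodeEncodeError on input containing characters outside LATIN1.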

I've gone ahead and committed the encoding fix for hungarian.stop in all
active branches.  I'm going to wait for Snowball upstream to accept the
proposed patches before I think about incorporating the code changes.

I'm not real sure whether we should consider back-patching those changes.
Right now, the Hungarian stemmer is applying rules meant for
o-double-acute to o-tilde, which probably means that those stemming rules
don't fire at all on actual Hungarian text.  If we fix that then the
stemmer will behave differently, which might not be all that desirable to
change in a minor release.  Perhaps we should only make the code changes
in HEAD and 9.4?
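
As a toy illustration of why such rules would not fire (this is not the Snowball algorithm, just a bare suffix check under assumed inputs): in UTF8, a suffix spelled with o-tilde has different bytes from the same suffix spelled with o-double-acute, so it never matches real Hungarian text. The suffix, the sample word, and the helper name below are all hypothetical.

```python
def strip_suffix(word: bytes, suffix: bytes) -> bytes:
    """Remove suffix from word if present; otherwise return word unchanged."""
    return word[:-len(suffix)] if word.endswith(suffix) else word


# Hypothetical rule suffix, spelled two ways and encoded as UTF8.
intended_suffix = "t\u0151l".encode("utf-8")   # with o-double-acute (U+0151)
miscoded_suffix = "t\u00f5l".encode("utf-8")   # with o-tilde (U+00F5)

# Sample word ending in the intended suffix (illustrative only).
word = "kertt\u0151l".encode("utf-8")

print(strip_suffix(word, intended_suffix).decode("utf-8"))  # suffix removed
print(strip_suffix(word, miscoded_suffix).decode("utf-8"))  # word unchanged
```

Fixing the character codes would move real input from the second case to the first, which is the behavior change in question.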

            regards, tom lane
