Closed Bug 254814 Opened 20 years ago Closed 15 years ago

Spellchecker returns false positives because of suffix errors [affix file, spelling checker]

Categories

(Core :: Spelling checker, defect)

x86
Windows XP
defect
Not set
normal

RESOLVED FIXED

People

(Reporter: guanxi_i, Unassigned)

Moz 1.7 final on Win XP Pro

The spellchecker, due to an error in the way it adds suffixes to the word
"canoe", accepts "canoing" (which is wrong) and flags "canoeing" (which is
correct).


TO DUPLICATE

1.  Create an e-mail, type "canoing" and "canoeing"

2.  Spell check it

3.  Note it erroneously accepts "canoing", but attempts to correct "canoeing".



CAUSE

The dictionary file, ..\components\myspell\en-US.dic, shows the following (the
characters after the slash reference rules in the affix file about how suffixes
are added to this word):
   canoe/DSGM

The affix file, ..\components\myspell\en-US.aff, shows the following for rule G.
I'm not sure what the last line means, but clearly Spellchecker is cutting the
"e" before adding the suffix.
   SFX G Y 2
   SFX G   e     ing        e
   SFX G   0     ing        [^e] 



AFFIX FILE

You can find rules for interpreting the affix file here.  Note that we use the
Open Office format, not the Ispell format, for affix files.
   http://lingucomponent.openoffice.org/affix.readme

Here's a relevant section:

----------------------------------------------------
SFX D Y 4
SFX D   0     d          e
SFX D   y     ied        [^aeiou]y
SFX D   0     ed         [^ey]
SFX D   0     ed         [aeiou]y

This file is space delimited and case sensitive.
So this information can be interpreted as follows:

The first line has 4 fields:

Field
-----
1     SFX - indicates this is a suffix
2     D   - is the name of the character which represents this suffix
3     Y   - indicates it can be combined with prefixes (cross product)
4     4   - indicates that sequence of 4 affix entries are needed to
               properly store the affix information

The remaining lines describe the unique information for the 4 affix
entries that make up this affix.  Each line can be interpreted
as follows: (note fields 1 and 2 are used as a check against line 1 info)

Field
-----
1     SFX         - indicates this is a suffix
2     D           - is the name of the character which represents this affix
3     y           - the string of chars to strip off before adding affix
                         (a 0 here indicates the NULL string)
4     ied         - the string of affix characters to add
                         (a 0 here indicates the NULL string)
5     [^aeiou]y   - the conditions which must be met before the affix
                    can be applied

Field 5 is interesting.  Since this is a suffix, field 5 tells us that
there are 2 conditions that must be met.  The first condition is that 
the next to the last character in the word must *NOT* be any of the 
following "a", "e", "i", "o" or "u".  The second condition is that
the last character of the word must end in "y".
----------------------------------------------------
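
The field layout described in the readme excerpt above can be sketched in Python. This is only an illustration, not MySpell's code: the names SuffixRule, parse_suffix_block, and apply_suffix are hypothetical, and treating field 5 as a Python regular expression anchored at the end of the word is an assumption that holds for the simple character classes shown here.

```python
import re
from dataclasses import dataclass

# The "SFX D" block quoted in the readme excerpt, copied verbatim.
D_BLOCK = """
SFX D Y 4
SFX D   0     d          e
SFX D   y     ied        [^aeiou]y
SFX D   0     ed         [^ey]
SFX D   0     ed         [aeiou]y
"""


@dataclass
class SuffixRule:
    flag: str       # field 2: the character naming this affix
    strip: str      # field 3: chars removed from the end ("" when the file says 0)
    add: str        # field 4: chars appended ("" when the file says 0)
    condition: str  # field 5: pattern the end of the word must match


def parse_suffix_block(text):
    """Parse one SFX header line plus its entries (hypothetical helper)."""
    rows = [line.split() for line in text.strip().splitlines()]
    kind, flag, _cross_product, count = rows[0]
    entries = rows[1:]
    if kind != "SFX" or len(entries) != int(count):
        raise ValueError("header does not match the entries that follow")
    rules = []
    for entry_kind, entry_flag, strip, add, condition in entries:
        # Fields 1 and 2 of each entry are checked against the header.
        if (entry_kind, entry_flag) != (kind, flag):
            raise ValueError("entry does not belong to this block")
        rules.append(SuffixRule(flag,
                                "" if strip == "0" else strip,
                                "" if add == "0" else add,
                                condition))
    return rules


def apply_suffix(word, rules):
    """Return every form produced by a rule whose condition matches word."""
    forms = []
    for rule in rules:
        # Assumption: field 5 behaves like a regex anchored at the word end.
        if re.search(f"(?:{rule.condition})$", word):
            stem = word[: len(word) - len(rule.strip)]
            forms.append(stem + rule.add)
    return forms


D_RULES = parse_suffix_block(D_BLOCK)
for word in ("create", "try", "walk", "play"):
    print(word, "->", apply_suffix(word, D_RULES))
```

With the D block above, each of the four sample words matches exactly one entry, giving "created", "tried", "walked", and "played".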



SOLUTIONS

Does the latest Open Office dictionary & affix contain this error?  If not,  can
we use it legally?  Technically?

If not, do we report this to Open Office?

If not, I suppose we must fix the affix file ourselves, or simply add "canoeing"
to en-US.dic or persdict.dat.

Now I understand the G rule in the affix files:

SFX G Y 2                      ;Header

SFX G   e     ing        e     ;If the word ends in "e", strip the "e" and add
                               ;"ing"

SFX G   0     ing        [^e]  ;If the word does NOT end in "e", strip nothing
                               ;and add "ing"


In other words, Moz' Spell Checker is behaving as instructed.  How does MySpell
typically handle irregular words?
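
To make the behaviour concrete, here is a minimal Python sketch of rule G applied to "canoe". It is not the MySpell implementation: the tuple layout, the suffixed_forms helper, and the use of a Python regular expression for the condition field are all assumptions made for illustration.

```python
import re

# Rule G from en-US.aff as (strip, add, condition); "0" in the file becomes "".
RULE_G = [
    ("e", "ing", "e"),     # word ends in "e": strip the "e", add "ing"
    ("", "ing", "[^e]"),   # word does not end in "e": strip nothing, add "ing"
]


def suffixed_forms(word, rules):
    """Return the forms produced by every rule whose condition matches word."""
    forms = []
    for strip, add, condition in rules:
        # Assumption: the condition is matched against the end of the word.
        if re.search(f"(?:{condition})$", word):
            forms.append(word[: len(word) - len(strip)] + add)
    return forms


# The words a checker would accept for the entry "canoe" with only flag G.
accepted = {"canoe", *suffixed_forms("canoe", RULE_G)}

print(suffixed_forms("canoe", RULE_G))  # ['canoing']
print("canoing" in accepted)            # True
print("canoeing" in accepted)           # False
```

The same rule is right for regular verbs ("walk" gives "walking"), which is why an irregular word such as "canoe" needs separate handling rather than a change to rule G itself.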

This bug remains in FF2's spellchecker.

Same problem with "GAUGING" in FF 2.0 on WinXP:

..\dictionaries\en-US.dic shows:
    gauge/SM

..\dictionaries\en-US.aff shows (see comment 0 for how to interpret them):

    SFX S Y 4
    SFX S   y     ies        [^aeiou]y
    SFX S   0     s          [aeiou]y
    SFX S   0     es         [sxzh]
    SFX S   0     s          [^sxzhy]

    SFX M Y 1
    SFX M   0     's         .

Summary: Spellchecker misspells "canoeing" as "canoing" [Spelling checker] → Spellchecker returns fall positives because of suffix errors [spelling checker]
Summary: Spellchecker returns fall positives because of suffix errors [spelling checker] → Spellchecker returns fall positives because of suffix errors [affix file, spelling checker]
Summary: Spellchecker returns fall positives because of suffix errors [affix file, spelling checker] → Spellchecker returns false positives because of suffix errors [affix file, spelling checker]

For GAUGING, I think the problem is in en-US.dic, where the correct entry should be:
    gauge/SGM

(i.e., it's missing the "G" affix, which is listed in comment #0).

Yet another error: PROACTIVELY

It's a legit word:
    http://onelook.com/?w=proactively&ls=b

..\components\myspell\en-US.dic shows:
    proactive

In other words, it specifies no affix at all for PROACTIVE.  I don't know if that means 'no affix allowed' or 'use the default affix'.
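
The effect of the flags after the slash can be sketched by expanding a dictionary entry into the set of words it generates. This is a rough Python model, not MySpell's code: the AFFIX_RULES table only contains the S, G, and M rules quoted earlier in this bug, the expand helper is hypothetical, and conditions are treated as Python regular expressions. Under this model an entry with no slash carries no flags, so only the bare word is generated.

```python
import re

# Rules quoted in this bug, as flag -> [(strip, add, condition)].
AFFIX_RULES = {
    "S": [("y", "ies", "[^aeiou]y"),
          ("", "s", "[aeiou]y"),
          ("", "es", "[sxzh]"),
          ("", "s", "[^sxzhy]")],
    "G": [("e", "ing", "e"),
          ("", "ing", "[^e]")],
    "M": [("", "'s", ".")],
}


def expand(entry):
    """Expand a .dic entry such as 'gauge/SM' into the words it generates."""
    stem, _, flags = entry.partition("/")
    words = {stem}
    for flag in flags:
        for strip, add, condition in AFFIX_RULES.get(flag, []):
            # Assumption: the condition is matched against the end of the stem.
            if re.search(f"(?:{condition})$", stem):
                words.add(stem[: len(stem) - len(strip)] + add)
    return words


print("gauging" in expand("gauge/SM"))   # False
print("gauging" in expand("gauge/SGM"))  # True
print(expand("proactive"))               # {'proactive'}
```

So adding the G flag is enough to generate "gauging", while a flagless entry such as "proactive" yields no derived forms at all in this model.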
Is bugzilla.mozilla.org the right place to report spelling errors?  Or should they be reported to OOo or someplace else?

It would be helpful to know if I'm wasting my time here.

There are many problems with the dictionary, which isn't very good at all, and neither is the spell checker itself. The plan is to replace them with Hunspell (I forget the bug #), which should give better suggestions. It is being used by OpenOffice now, and I hope it has a better English dictionary, but I'm not sure.

So I don't think we should spend time worrying about the current dictionary. The current Hunspell one might have the same problems, though, so you might want to check.

Thanks Brett.  I take it this bug is WONTFIX, but I'll leave that to someone else in case I misunderstand.

I haven't been calling these WONTFIXes. Maybe someone will go through and fix the English dictionary as a short-term solution, which would be nice. But it's not something that's very critical right now.

Implementing Hunspell is bug 319778.

FWIW I thought I'd take a glance at the dictionaries, but I'm not 100% clear on which dictionaries would be used.  http://hunspell.sourceforge.net/ refers to both of the following:
   Aspell:  ftp://ftp.gnu.org/gnu/aspell/dict/en
   Myspell: http://wiki.services.openoffice.org/wiki/Dictionaries

So Hunspell is just the engine, with no dictionaries of its own?  Looking at en_us.dic in the Myspell dictionary suggests it would have the same errors.

The question is, which dictionary will be used and who maintains it?


(In reply to comment #10)
> The question is, which dictionary will be used and who maintains it?

No idea. Probably the best place to track that is on the Hunspell bug.

Over in the Hunspell bug (bug 319778) in comment 35, Németh László, the Hunspell author, says,

( https://bugzilla.mozilla.org/show_bug.cgi?id=319778#c35 )

> Thanks a lot. I believe, I have made the last modification of MySpell en_US
> dictionary in OOo's source, but
> http://wiki.services.openoffice.org/wiki/Dictionaries hasn't updated yet. So I
> will fix the problem that you reported, and update this Wiki page. I'd like to
> use also the en_US dictionary patch from Mozilla CVS. (It seems for me that the
> best place for the future bug reports would be the OpenOffice.org Wiki, using
> Bug report pages for every languages.)
 
Per bug 319778, these words have been added to Hunspell, though it's not clear whether the affix file was fixed or the proper spellings were only added to the dictionary.

https://bugzilla.mozilla.org/show_bug.cgi?id=319778#c51

The "canoing"/"canoeing", PROACTIVELY, and GAUGING problems are all fixed by Hunspell (Firefox 3, Thunderbird 3).

canoeing is handled as an exception to the affix rule, like freeing. The others are part of the normal affix rules.
Status: NEW → RESOLVED
Closed: 15 years ago
Depends on: 319778
Resolution: --- → FIXED
FWIW personally I see http://wordlist.sourceforge.net/ as the canonical upstream for the en_US dictionaries.