Closed
Bug 254814
Opened 20 years ago
Closed 15 years ago
Spellchecker returns false positives because of suffix errors [affix file, spelling checker]
Categories
(Core :: Spelling checker, defect)
RESOLVED
FIXED
People
(Reporter: guanxi_i, Unassigned)
Moz 1.7 final on Win XP Pro

The spellchecker, due to an error in the way it adds suffixes to the word "canoe", misspells "canoeing" (which is correct) as "canoing" (which is wrong).

TO DUPLICATE
1. Create an e-mail, type "canoing" and "canoeing"
2. Spell check it
3. Note it erroneously accepts "canoing", but attempts to correct "canoeing".

CAUSE
The dictionary file, ..\components\myspell\en-US.dic, shows the following (the characters after the slash reference rules in the affix file about how suffixes are added to this word):

    canoe/DSGM

The affix file, ..\components\myspell\en-US.aff, shows the following for rule G. I'm not sure what the last line means, but clearly Spellchecker is cutting the "e" before adding the suffix.

    SFX G Y 2
    SFX G e ing e
    SFX G 0 ing [^e]

AFFIX FILE
You can find rules for interpreting the affix file here. Note that we use the Open Office format, not the Ispell format, for affix files.
http://lingucomponent.openoffice.org/affix.readme

Here's a relevant section:
----------------------------------------------------
    SFX D Y 4
    SFX D 0 d e
    SFX D y ied [^aeiou]y
    SFX D 0 ed [^ey]
    SFX D 0 ed [aeiou]y

This file is space delimited and case sensitive. So this information can be interpreted as follows:

The first line has 4 fields:

    Field
    -----
    1  SFX - indicates this is a suffix
    2  D   - is the name of the character which represents this suffix
    3  Y   - indicates it can be combined with prefixes (cross product)
    4  4   - indicates that sequence of 4 affix entries are needed to properly store the affix information

The remaining lines describe the unique information for the 4 affix entries that make up this affix. Each line can be interpreted as follows: (note fields 1 and 2 are used as a check against line 1 info)

    Field
    -----
    1  SFX       - indicates this is a suffix
    2  D         - is the name of the character which represents this affix
    3  y         - the string of chars to strip off before adding affix (a 0 here indicates the NULL string)
    4  ied       - the string of affix characters to add (a 0 here indicates the NULL string)
    5  [^aeiou]y - the conditions which must be met before the affix can be applied

Field 5 is interesting. Since this is a suffix, field 5 tells us that there are 2 conditions that must be met. The first condition is that the next to the last character in the word must *NOT* be any of the following "a", "e", "i", "o" or "u". The second condition is that the last character of the word must end in "y".
----------------------------------------------------

SOLUTIONS
Does the latest Open Office dictionary & affix contain this error? If not, can we use it legally? Technically? If not, do we report this to Open Office? If not, I suppose we must fix the affix file ourselves, or simply add "canoeing" to en-US.dic or persdict.dat.
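The field-by-field reading of the affix file quoted above can be tried out in a few lines of Python. This is only a minimal sketch under stated assumptions: the names `parse_suffix_rules` and `apply_suffix` are made up for illustration, only SFX entries are handled, and field 5 is treated as a regular expression anchored at the end of the word. It is not how MySpell itself is implemented.

```python
import re

# The rule D block quoted from affix.readme.
AFF_D = """\
SFX D Y 4
SFX D 0 d e
SFX D y ied [^aeiou]y
SFX D 0 ed [^ey]
SFX D 0 ed [aeiou]y
"""

def parse_suffix_rules(aff_text):
    """Return {flag: [(strip, add, condition), ...]} from SFX entry lines."""
    rules = {}
    for line in aff_text.splitlines():
        fields = line.split()
        # Entry lines have 5 fields; the 4-field header line is skipped.
        if len(fields) == 5 and fields[0] == "SFX":
            _, flag, strip, add, cond = fields
            strip = "" if strip == "0" else strip  # 0 means the NULL string
            add = "" if add == "0" else add
            rules.setdefault(flag, []).append((strip, add, cond))
    return rules

def apply_suffix(word, flag, rules):
    """Return every form the entries for `flag` generate from `word`."""
    forms = []
    for strip, add, cond in rules.get(flag, []):
        # Assumption: the condition must match the end of the word.
        if re.search(cond + "$", word) and word.endswith(strip):
            forms.append(word[:len(word) - len(strip)] + add)
    return forms

rules = parse_suffix_rules(AFF_D)
print(apply_suffix("bake", "D", rules))  # ['baked']
print(apply_suffix("try", "D", rules))   # ['tried']
print(apply_suffix("play", "D", rules))  # ['played']
```

Each word matches exactly one of the four entries, which is why the conditions in field 5 are written to be mutually exclusive.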
Now I understand the G rule in the affix files:

    SFX G Y 2          ;Header
    SFX G e ing e      ;If the word ends in "e", strip the "e" and add "ing"
    SFX G 0 ing [^e]   ;If the word does NOT end in "e", strip nothing and add "ing"

In other words, Moz' Spell Checker is behaving as instructed. How does MySpell typically handle irregular words?
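To make the "behaving as instructed" point concrete, here is a small Python sketch that applies rule G to "canoe". The helper names `apply_rule` and `is_accepted` are hypothetical, and the matching logic is a simplification (condition treated as a regex at the end of the word), not MySpell's actual code.

```python
import re

# Rule G as quoted above, written as (strip, add, condition) tuples.
RULE_G = [("e", "ing", "e"), ("", "ing", "[^e]")]

def apply_rule(word, entries):
    """Generate the suffixed forms that the rule entries allow for `word`."""
    forms = []
    for strip, add, cond in entries:
        if re.search(cond + "$", word) and word.endswith(strip):
            forms.append(word[:len(word) - len(strip)] + add)
    return forms

def is_accepted(candidate, stem, entries):
    """A candidate passes if it is the stem itself or a generated form."""
    return candidate == stem or candidate in apply_rule(stem, entries)

print(apply_rule("canoe", RULE_G))               # ['canoing']
print(is_accepted("canoing", "canoe", RULE_G))   # True
print(is_accepted("canoeing", "canoe", RULE_G))  # False
```

Under these two entries there is no way to produce "canoeing" from canoe/G, so an irregular form like this has to come from somewhere other than the rule, for example a separate dictionary entry.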
Same problem with "GAUGING" in FF 2.0 on WinXP:

..\dictionaries\en-US.dic shows:

    gauge/SM

..\dictionaries\en-US.aff shows (see comment 0 for how to interpret them):

    SFX S Y 4
    SFX S y ies [^aeiou]y
    SFX S 0 s [aeiou]y
    SFX S 0 es [sxzh]
    SFX S 0 s [^sxzhy]

    SFX M Y 1
    SFX M 0 's .
Summary: Spellchecker misspells "canoeing" as "canoing" [Spelling checker] → Spellchecker returns fall positives because of suffix errors [spelling checker]
Summary: Spellchecker returns fall positives because of suffix errors [spelling checker] → Spellchecker returns fall positives because of suffix errors [affix file, spelling checker]
Summary: Spellchecker returns fall positives because of suffix errors [affix file, spelling checker] → Spellchecker returns false positives because of suffix errors [affix file, spelling checker]
For GAUGING, I think the problem is in en-us.dic, where the correct entry should be, gauge/SGM (i.e., it's missing the "G" affix, which is listed in comment #0).
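The difference between gauge/SM and the proposed gauge/SGM can be checked with a short Python sketch. It assumes the S, M and G entries quoted earlier in this bug, treats each condition as a regex at the end of the word, and uses a made-up helper name `expand`; it is an illustration, not the spell checker's real lookup code.

```python
import re

# Suffix entries quoted in this bug, as (strip, add, condition) tuples.
SUFFIX_RULES = {
    "S": [("y", "ies", "[^aeiou]y"), ("", "s", "[aeiou]y"),
          ("", "es", "[sxzh]"), ("", "s", "[^sxzhy]")],
    "M": [("", "'s", ".")],
    "G": [("e", "ing", "e"), ("", "ing", "[^e]")],
}

def expand(dic_entry):
    """Return the set of words a 'stem/FLAGS' dictionary entry accepts."""
    stem, _, flags = dic_entry.partition("/")
    forms = {stem}
    for flag in flags:
        for strip, add, cond in SUFFIX_RULES.get(flag, []):
            if re.search(cond + "$", stem) and stem.endswith(strip):
                forms.add(stem[:len(stem) - len(strip)] + add)
    return forms

print(sorted(expand("gauge/SM")))        # ["gauge", "gauge's", "gauges"]
print("gauging" in expand("gauge/SM"))   # False
print("gauging" in expand("gauge/SGM"))  # True
```

Unlike "canoeing", "gauging" follows rule G regularly, so adding the missing flag is enough and the affix file itself does not need to change.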
Yet another error: PROACTIVELY

It's a legit word: http://onelook.com/?w=proactively&ls=b

..\components\myspell\en-US.dic shows:

    proactive

In other words, it specifies no affix at all for PROACTIVE. I don't know if that means 'no affix allowed' or 'use the default affix'.
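On the question of what a bare entry means: in the MySpell format an entry with no slash carries no affix flags, so only the word exactly as listed is accepted; there is no default affix. The Python sketch below illustrates that reading. The helper name `accepted_forms` is made up, and the flag "Y" with its single "-ly" entry is a hypothetical rule used only for illustration; the actual flag for "-ly" in en-US.aff should be looked up there.

```python
# Hypothetical suffix table: flag "Y" appends "ly" to any word.
# This flag and entry are assumptions for illustration only.
HYPOTHETICAL_RULES = {"Y": [("", "ly")]}

def accepted_forms(dic_entry):
    """Return the words a 'stem' or 'stem/FLAGS' entry accepts."""
    stem, _, flags = dic_entry.partition("/")
    forms = {stem}
    # With no slash, `flags` is empty and this loop never runs.
    for flag in flags:
        for strip, add in HYPOTHETICAL_RULES.get(flag, []):
            if stem.endswith(strip):
                forms.add(stem[:len(stem) - len(strip)] + add)
    return forms

print(accepted_forms("proactive"))            # {'proactive'}
print(sorted(accepted_forms("proactive/Y")))  # ['proactive', 'proactively']
```

So "proactively" would need either its own dictionary line or a suitable flag added to the "proactive" entry.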
Is bugzilla.mozilla.org the right place to report spelling errors? Or should they be reported to OOo or someplace else? It would be helpful to know if I'm wasting my time here.
Comment 7•18 years ago
There are many problems with the dictionary, which isn't very good at all, and neither is the spell checker itself. The plan is to replace them with Hunspell (I forget the bug #), which should give better suggestions. It is being used by OpenOffice now, and I hope it has a better English dictionary, but I'm not sure. So I don't think we should spend time worrying about the current dictionary. The current Hunspell one might have the same problems, though, so you might want to check.
Thanks Brett. I take it this bug is WONTFIX, but I'll leave that to someone else in case I misunderstand.
Comment 9•18 years ago
I haven't been calling these WONTFIXes. Maybe someone will go through and fix the English dictionary to get a short-term solution, and it would be nice. But it's not something that's very critical right now.
Reporter
Comment 10•18 years ago
Implementing Hunspell is bug 319778.

FWIW I thought I'd take a glance at the dictionaries, but I'm not 100% clear on which dictionaries would be used. http://hunspell.sourceforge.net/ refers to both of the following:

Aspell: ftp://ftp.gnu.org/gnu/aspell/dict/en
Myspell: http://wiki.services.openoffice.org/wiki/Dictionaries

So Hunspell is just the engine, with no dictionaries of its own? Looking at en_us.dic in the Myspell dictionary suggests it would have the same errors. The question is, which dictionary will be used and who maintains it?
Comment 11•18 years ago
(In reply to comment #10)
> The question is, which dictionary will be used and who maintains it?

No idea. Probably the best place to track that is on the Hunspell bug.
Reporter
Comment 13•18 years ago
Over in the Hunspell bug (bug 319778) in comment 35, Németh László, the Hunspell author, says ( https://bugzilla.mozilla.org/show_bug.cgi?id=319778#c35 ):

> Thanks a lot. I believe, I have made the last modification of MySpell en_US
> dictionary in OOo's source, but
> http://wiki.services.openoffice.org/wiki/Dictionaries hasn't updated yet. So I
> will fix the problem that you reported, and update this Wiki page. I'd like to
> use also the en_US dictionary patch from Mozilla CVS. (It seems for me that the
> best place for the future bug reports would be the OpenOffice.org Wiki, using
> Bug report pages for every languages.)
Reporter
Comment 14•17 years ago
Per bug 319778, these words have been added to Hunspell, though it's not clear whether the affix file has been fixed or the proper spellings have only been added to the dictionary. https://bugzilla.mozilla.org/show_bug.cgi?id=319778#c51
Comment 15•15 years ago
"canoing" and "canoeing", PROACTIVELY, GAUGING, all fixed by Hunspell (Firefox 3, Thunderbird 3). canoeing is handled as an exception to the affix rule, like freeing. The others are part of the normal affix rules.
Comment 16•15 years ago
FWIW personally I see http://wordlist.sourceforge.net/ as the canonical upstream for the en_US dictionaries.