Closed Bug 227214 Opened 21 years ago Closed 20 years ago

Try to improve the spell checker

Categories

(Thunderbird :: Message Compose Window, defect)

x86
Windows XP
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mscott, Assigned: mvl)

References

Details

Attachments

(1 file, 3 obsolete files)

The spell checker does not seem to work quite as well as the 4.x spell checker. 

Examples include:

Type in Texis. Texas is not a possible suggestion in Tbird. Overall spell
checker has issues.


I noticed spellchecker does not offer suggestions on contractions. For example,
if you type shouldnt, it will not offer shouldn't as a suggestion. Same for
wouldn' couldn't, etc....
This myspell bug may help some:

http://mozdev.org/bugs/show_bug.cgi?id=1987

Also I noticed if I copied over the open office.org dictionary that came with my
version of Open Office, it is much larger than the dictionary we ship with
mozilla . That dictionary file seems to work better, picking up words like
"towards" which  our current dictionary cannot. One way to fix this bug may be
to get a new dictionary drop. 
Status: NEW → ASSIGNED
See http://mozdev.org/bugs/show_bug.cgi?id=1987 for more details.

This helps a little bit. But in order to make this work, I needed to download
the latest english US dictionary from: 

http://ftp.stardiv.de/pub/OpenOffice.org/contrib/dictionaries/download_dictionary.html


Need to find out if we can drop in a new dictionary into myspell. Are there
license issues here?
Here's what I wrote up after looking into the licensing issues:

"
Hi Mitchell, Brendan, Chris,

Mitchell asked me to investigate the licensing issues surrounding the spell
checker a bit more closely.  This was prompted by BMS asking for French and
Japanese spell checkers.

Here's where I think things stand, but of course I'm neither a technical nor
legal expert, so pls correct me if I got any of this wrong:
1.  We're OK in terms of licensing the spell checking engine.  David Einstein
basically owns that code, and whatever he doesn't own was licensed under the
BSD.  So we are able to license and distribute this code using our triple
licenses.  No issue there AFAIK;
2.  The en_US, en_UK and IT dictionaries are licensed under the LGPL AFAIK, so
we need to make sure that we clearly indicate that these files are not licensed
under the triple license.  I don't know what our policies are around how we flag
such code and whether it goes in a separate place in our CVS repository, but
wanted to flag this issue;
3.  Other dictionary files that we or localization projects may wish to borrow
from OpenOffice are licensed under the GPL.  This is a big problem since the GPL
was not meant to deal with this type of content.  Both our licensing attorney,
Heather Meeker, and the OpenOffice folks are bewildered that these files would
be under the GPL and are weary of including them along with our code.  So let's
make sure to keep those dictionary files far away from our code and, to the
extent we do wish to include them with our software, we should probably package
them up separately. Both Mail/News and Thunderbird have a button to download
additional dictionaries at Options>Check Spelling>Language>Download More but I
was unable to get that to work.  If we can make this feature work, then the
easiest short term solution may be that we not ship dictionaries for languages
other then en_US, en_UK and IT, but make it simple for end users of localized
versions to discover this download option.

We may wish to consider solving this issue more directly by creating a set of
BSD or triple licensed dictionaries, either by ourselves or in partnership with
OpenOffice.  Kevin Hendricks proposes what appears to be a fairly
straightforward way to create such a new set of dictionaries - see his
explanation below.  The thing is, people in the OO community want to license all
these files under the LGPL, not BSD.  Still, maybe we can work together and
create the tools/scripts together with Kevin.  That Kevin can run the tools to
generate LGPL dictionaries and we can run the tools to create triple licensed
dictionaries.

I have a feeling that you guys may be well ahead of me in thinking this through,
so this mail may not be useful to you, but I want to share what I found, in case
it's helpful in any way.

Cheers,

Bart
"
Blocks: 228788
I had to redo the patch to keep in synch with the unicode string changes mvl
made just before the holidays. mvl, can you look over what I've done in the
patch to make sure it looks ok to you?

I blow up the ascii rep table into ucs2 strings so I can compare it with the
ucs2 strings coming from the document. Not sure if that's right or if I should
do it the other way around.
Attachment #136642 - Attachment is obsolete: true
Comment on attachment 138430 [details] [diff] [review]
updated patch to account for mvl's string changes

time to get these patches into the tree now that 1.7a is open. 

David, I'll change this:
if(replaceTable != NULL)
to 
if (!replaceTable) before checking in :)

This checkin also includes a new dictionary and .aff file to support
replacements.
Attachment #138430 - Flags: superreview?(bienvenu)
+mozReplaceTable *myspAffixMgr::getReplaceTable() {
+  if (!replaceTable) return nsnull;
+  return replaceTable;
+}

this can just be return replaceTable;  I think.

+  if(replaceTable != NULL)
+  {
+   delete[] replaceTable;
+  }

don't need the null check at all here.

myspAffixMgr doesn't use member variable naming conventions we know and love,
e.g., replaceTable should be mReplaceTable.
Comment on attachment 138430 [details] [diff] [review]
updated patch to account for mvl's string changes

>Index: src/myspAffixmgr.cpp

>+          replaceTable[i].pattern = NS_ConvertASCIItoUCS2(cmds[1].get());
>+          replaceTable[i].replacement = NS_ConvertASCIItoUCS2(cmds[2].get());

Is the fileformat documented somewhere? At least the dict files are not in
ascii, but in some relevant native charset. The actual charset is set in the
aff file.
If the .aff is in the charset too (and i think so), you need a real charset
convertor here. Or you store the suggestions using the charset, and teach the
suggest manager about charset conversion. Both are ugly....

(maybe we should just convert everthing when reading the files, and store as
ucs2. The idea of the charset is that it is only 1 byte per char in general,
ucs2 is 2)
Hmmm...I bet you are right and the REP entries in the aff file are encoded just
like the dictionary words are, in the character set of the dictionary. So the
conversion won't work for all languages.
FYI, I just landed the new US dictionary for seamonkey so it is now using the
same dictionary as thunderbird.

I (or a volunteer) still needs to make the REP table patch work for non english
dictionaries before this patch can get checked in.
I took a look at the charset stuff, and am very tempted in changing everything
to work with either utf8 or utf16. It caused me too much pain to find out how
all this is supposed to work.
Except maybe for the internal storage of the dictionary list, but that is an
implentation detail that shouldn't go outside that one class.
I have no objections if you want to but the smack down on the spell checker :)
Oh, how i whish i could find a specification of the .aff (and .dic) file formats...
I'm sure the myspell project page has documenation somewhere.

Maybe you can find it through the open office stuff which is the main consumer
of myspell:

http://ooo.ximian.com/lxr/search?string=myspell
Attached patch Convert to unicode (obsolete) — Splinter Review
Updated patch that converts the common misspellings to utf16 before storing
them. It doesn't convert the rest, because that patch became huge.
Note that it doesn't work, due to bug 238021. A workaround would be to convert
to utf8 before doing the search for the substring in replchars(), but that
would suck. I'd rather have strings fixed :)
Assignee: mscott → mvl
Attachment #138430 - Attachment is obsolete: true
This patch works better. It makes sure the replacements patterns are lower
case, and then lower cases the candidate before searching. It is to work around
the weird api, but also it (hopefully) reduces the number of case conversions
needed. Why check case insensitivly every time when you can just lowercase it
once?
Attachment #144351 - Attachment is obsolete: true
Attachment #144678 - Flags: review?(mscott)
Comment on attachment 144678 [details] [diff] [review]
patch that should work

thanks a lot for tacking my initial patch and making it do the right thing
here. 

One comment (that may have been part of my patch not yours)...

nsAutoString pattern, replacement;

should we move those out of the for loop and just re-use them instead of
creating them every time the for loop goes back in scope?

You can probably ask bienvenu@nventure.com to sr this.
Attachment #144678 - Flags: review?(mscott) → review+
Comment on attachment 144678 [details] [diff] [review]
patch that should work

Yeah, i will fix that.
I will also fix the |if(mReplaceTable)| check, before delete[]. And one
nsString instead of nsAutoString for a temp string.
Attachment #144678 - Flags: superreview?(bienvenu)
thx a lot for doing this! Not sure which code is yours and which is mscott's,
but here goes:

some nits:

+        PRInt32 numFields = SplitString(line,cmds,3);

should have spaces after the ','s

+          if(!cmds[0].Equals("REP")) { //consistency check
+            //complain loudly
+            continue;
+          }
How about an NS_ASSERTION(PR_FALSE, "inconsistent ..." ); to start with?

+          // We don't want to convert them for ecery lookup.

"ecery" -> "every"

+  if (!mReplaceTable)
+    return nsnull;
+  return mReplaceTable;

just "return mReplaceTable;"

also,

please put the opening function { on a line by itself like so:
mozReplaceTable *myspAffixMgr::getReplaceTable() 
{
+  if (!mReplaceTable)
+    return nsnull;
+  return mReplaceTable;
+}
+
+PRUint32 myspAffixMgr::getReplaceTableLength() 
{
+  return mReplaceTableLength;
+}
+


+  // perhaps we made a typical fault of spelling
+  res = replchars(wlst, word, &nsug);

// perhaps we made a typical spelling error.

no need for braces here:

+  if ((nsug < maxSug) && NS_SUCCEEDED(res)){
+    res = forgotchar(wlst, word, &nsug);
+  }
+ 


+// suggestions for a typical fault of spelling, that
+// differs with more, than 1 letter from the right form.

// suggestions for a typical spelling error that
// differs by more than 1 letter from the right spelling

+  if (word.Length() < 2 || ! pAMgr) return NS_OK;

we try to put the returns on a line by themselves - makes it easier to set
breakpoints.

+nsresult myspSuggestMgr::replchars(PRUnichar ** wlst,const nsAFlatString &word,
PRUint32 *ns)

can you call this  myspSuggestMgr::replacechars - repl is a bit ambiguous.

thx, sr=bienvenu with those changes. I'll go add the sr now.
Comment on attachment 144678 [details] [diff] [review]
patch that should work

sr=bienvenu with aforementioned nits.
Attachment #144678 - Flags: superreview?(bienvenu) → superreview+
checked in
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
I have had a problem with spell checker not changing my "i" into a capital
letter to make it proper.  it knows that "i" is a word, but doesnt know it has
to be capitalized.  also, when i add words to it, such as "wouldn't" it doesnt
realize that that is the best suggestion when i spell "wouldnt."
Attachment #138430 - Flags: superreview?(bienvenu)
Hi,

    I noticed this topic is considered fixed, but I have the newest Thunderbird
1.0 and the dictionary is still well below par.  I still require to use other
dictionaries to spell check my documents on a common basis.

Your friend,
Scott
sclewin7@yahoo.com, try to be more specific. Is the problem with the engine, or
with the word in the dictionary? Which correct words are flagged as incorrect?
Hi,

    I noticed this topic is considered fixed, but I have the newest Thunderbird
1.0 and the dictionary is still well below par.  I still require to use other
dictionaries to spell check my documents on a common basis.

Your friend,
Scott(In reply to comment #24)
> sclewin7@yahoo.com, try to be more specific. Is the problem with the engine, or
> with the word in the dictionary? Which correct words are flagged as incorrect?
The spell checker is fine noticing misspelled words, it just does not suggest a
list of correct words at about 50% of the time.  So, the problem may be the
dictionary, but I do not know for sure as my knowledge of programming is limited
to introductory C and Basic :)

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: