227214 - Try to improve the spell checker

Reporter

Description

•

22 years ago

The spell checker does not seem to work quite as well as the 4.x spell checker. Examples include: Type in Texis. Texas is not a possible suggestion in Tbird. Overall spell checker has issues. I noticed spellchecker does not offer suggestions on contractions. For example, if you type shouldnt, it will not offer shouldn't as a suggestion. Same for wouldn' couldn't, etc....

Scott MacGregor

Reporter

Comment 1

•

22 years ago

This myspell bug may help some: http://mozdev.org/bugs/show_bug.cgi?id=1987 Also I noticed if I copied over the open office.org dictionary that came with my version of Open Office, it is much larger than the dictionary we ship with mozilla . That dictionary file seems to work better, picking up words like "towards" which our current dictionary cannot. One way to fix this bug may be to get a new dictionary drop.

Status: NEW → ASSIGNED

Scott MacGregor

Reporter

Comment 2

•

22 years ago

Attached patch add support for REP tables in the spell checker (obsolete) — Details — Splinter Review

See http://mozdev.org/bugs/show_bug.cgi?id=1987 for more details. This helps a little bit. But in order to make this work, I needed to download the latest english US dictionary from: http://ftp.stardiv.de/pub/OpenOffice.org/contrib/dictionaries/download_dictionary.html Need to find out if we can drop in a new dictionary into myspell. Are there license issues here?

bart decrem, not checking bugmail

Comment 3

•

22 years ago

Here's what I wrote up after looking into the licensing issues: " Hi Mitchell, Brendan, Chris, Mitchell asked me to investigate the licensing issues surrounding the spell checker a bit more closely. This was prompted by BMS asking for French and Japanese spell checkers. Here's where I think things stand, but of course I'm neither a technical nor legal expert, so pls correct me if I got any of this wrong: 1. We're OK in terms of licensing the spell checking engine. David Einstein basically owns that code, and whatever he doesn't own was licensed under the BSD. So we are able to license and distribute this code using our triple licenses. No issue there AFAIK; 2. The en_US, en_UK and IT dictionaries are licensed under the LGPL AFAIK, so we need to make sure that we clearly indicate that these files are not licensed under the triple license. I don't know what our policies are around how we flag such code and whether it goes in a separate place in our CVS repository, but wanted to flag this issue; 3. Other dictionary files that we or localization projects may wish to borrow from OpenOffice are licensed under the GPL. This is a big problem since the GPL was not meant to deal with this type of content. Both our licensing attorney, Heather Meeker, and the OpenOffice folks are bewildered that these files would be under the GPL and are weary of including them along with our code. So let's make sure to keep those dictionary files far away from our code and, to the extent we do wish to include them with our software, we should probably package them up separately. Both Mail/News and Thunderbird have a button to download additional dictionaries at Options>Check Spelling>Language>Download More but I was unable to get that to work. If we can make this feature work, then the easiest short term solution may be that we not ship dictionaries for languages other then en_US, en_UK and IT, but make it simple for end users of localized versions to discover this download option. We may wish to consider solving this issue more directly by creating a set of BSD or triple licensed dictionaries, either by ourselves or in partnership with OpenOffice. Kevin Hendricks proposes what appears to be a fairly straightforward way to create such a new set of dictionaries - see his explanation below. The thing is, people in the OO community want to license all these files under the LGPL, not BSD. Still, maybe we can work together and create the tools/scripts together with Kevin. That Kevin can run the tools to generate LGPL dictionaries and we can run the tools to create triple licensed dictionaries. I have a feeling that you guys may be well ahead of me in thinking this through, so this mail may not be useful to you, but I want to share what I found, in case it's helpful in any way. Cheers, Bart "

Scott MacGregor

Reporter

Updated

•

22 years ago

Blocks: 228788

Scott MacGregor

Reporter

Comment 4

•

22 years ago

Attached patch updated patch to account for mvl's string changes (obsolete) — Details — Splinter Review

I had to redo the patch to keep in synch with the unicode string changes mvl made just before the holidays. mvl, can you look over what I've done in the patch to make sure it looks ok to you? I blow up the ascii rep table into ucs2 strings so I can compare it with the ucs2 strings coming from the document. Not sure if that's right or if I should do it the other way around.

Scott MacGregor

Reporter

Updated

•

22 years ago

Attachment #136642 - Attachment is obsolete: true

Scott MacGregor

Reporter

Comment 5

•

22 years ago

Comment on attachment 138430 [details] [diff] [review] updated patch to account for mvl's string changes time to get these patches into the tree now that 1.7a is open. David, I'll change this: if(replaceTable != NULL) to if (!replaceTable) before checking in :) This checkin also includes a new dictionary and .aff file to support replacements.

Attachment #138430 - Flags: superreview?(bienvenu)

David :Bienvenu

Comment 6

•

22 years ago

+mozReplaceTable *myspAffixMgr::getReplaceTable() { + if (!replaceTable) return nsnull; + return replaceTable; +} this can just be return replaceTable; I think. + if(replaceTable != NULL) + { + delete[] replaceTable; + } don't need the null check at all here. myspAffixMgr doesn't use member variable naming conventions we know and love, e.g., replaceTable should be mReplaceTable.

Michiel van Leeuwen (email: mvl+moz@)

Assignee

Comment 7

•

22 years ago

Comment on attachment 138430 [details] [diff] [review] updated patch to account for mvl's string changes >Index: src/myspAffixmgr.cpp >+ replaceTable[i].pattern = NS_ConvertASCIItoUCS2(cmds[1].get()); >+ replaceTable[i].replacement = NS_ConvertASCIItoUCS2(cmds[2].get()); Is the fileformat documented somewhere? At least the dict files are not in ascii, but in some relevant native charset. The actual charset is set in the aff file. If the .aff is in the charset too (and i think so), you need a real charset convertor here. Or you store the suggestions using the charset, and teach the suggest manager about charset conversion. Both are ugly.... (maybe we should just convert everthing when reading the files, and store as ucs2. The idea of the charset is that it is only 1 byte per char in general, ucs2 is 2)

Scott MacGregor

Reporter

Comment 8

•

22 years ago

Hmmm...I bet you are right and the REP entries in the aff file are encoded just like the dictionary words are, in the character set of the dictionary. So the conversion won't work for all languages.

Scott MacGregor

Reporter

Comment 9

•

22 years ago

FYI, I just landed the new US dictionary for seamonkey so it is now using the same dictionary as thunderbird. I (or a volunteer) still needs to make the REP table patch work for non english dictionaries before this patch can get checked in.

Michiel van Leeuwen (email: mvl+moz@)

Assignee

Comment 10

•

22 years ago

I took a look at the charset stuff, and am very tempted in changing everything to work with either utf8 or utf16. It caused me too much pain to find out how all this is supposed to work. Except maybe for the internal storage of the dictionary list, but that is an implentation detail that shouldn't go outside that one class.

Scott MacGregor

Reporter

Comment 11

•

22 years ago

I have no objections if you want to but the smack down on the spell checker :)

Michiel van Leeuwen (email: mvl+moz@)

Assignee

Comment 12

•

22 years ago

Oh, how i whish i could find a specification of the .aff (and .dic) file formats...

Scott MacGregor

Reporter

Comment 13

•

22 years ago

I'm sure the myspell project page has documenation somewhere. Maybe you can find it through the open office stuff which is the main consumer of myspell: http://ooo.ximian.com/lxr/search?string=myspell

Marek Wawoczny

Comment 14

•

22 years ago

Check this: http://lingucomponent.openoffice.org/affix.readme

Michiel van Leeuwen (email: mvl+moz@)

Assignee

Comment 15

•

22 years ago

Attached patch Convert to unicode (obsolete) — Details — Splinter Review

Updated patch that converts the common misspellings to utf16 before storing them. It doesn't convert the rest, because that patch became huge. Note that it doesn't work, due to bug 238021. A workaround would be to convert to utf8 before doing the search for the substring in replchars(), but that would suck. I'd rather have strings fixed :)

Michiel van Leeuwen (email: mvl+moz@)

Assignee

Updated

•

22 years ago

Assignee: mscott → mvl

Attachment #138430 - Attachment is obsolete: true

Michiel van Leeuwen (email: mvl+moz@)

Assignee

Comment 16

•

22 years ago

Attached patch patch that should work — Details — Splinter Review

This patch works better. It makes sure the replacements patterns are lower case, and then lower cases the candidate before searching. It is to work around the weird api, but also it (hopefully) reduces the number of case conversions needed. Why check case insensitivly every time when you can just lowercase it once?

Attachment #144351 - Attachment is obsolete: true

Michiel van Leeuwen (email: mvl+moz@)

Assignee

Updated

•

22 years ago

Attachment #144678 - Flags: review?(mscott)

Scott MacGregor

Reporter

Comment 17

•

22 years ago

Comment on attachment 144678 [details] [diff] [review] patch that should work thanks a lot for tacking my initial patch and making it do the right thing here. One comment (that may have been part of my patch not yours)... nsAutoString pattern, replacement; should we move those out of the for loop and just re-use them instead of creating them every time the for loop goes back in scope? You can probably ask bienvenu@nventure.com to sr this.

Attachment #144678 - Flags: review?(mscott) → review+

Michiel van Leeuwen (email: mvl+moz@)

Assignee

Comment 18

•

22 years ago

Comment on attachment 144678 [details] [diff] [review] patch that should work Yeah, i will fix that. I will also fix the |if(mReplaceTable)| check, before delete[]. And one nsString instead of nsAutoString for a temp string.

Attachment #144678 - Flags: superreview?(bienvenu)

David :Bienvenu

Comment 19

•

22 years ago

thx a lot for doing this! Not sure which code is yours and which is mscott's, but here goes: some nits: + PRInt32 numFields = SplitString(line,cmds,3); should have spaces after the ','s + if(!cmds[0].Equals("REP")) { //consistency check + //complain loudly + continue; + } How about an NS_ASSERTION(PR_FALSE, "inconsistent ..." ); to start with? + // We don't want to convert them for ecery lookup. "ecery" -> "every" + if (!mReplaceTable) + return nsnull; + return mReplaceTable; just "return mReplaceTable;" also, please put the opening function { on a line by itself like so: mozReplaceTable *myspAffixMgr::getReplaceTable() { + if (!mReplaceTable) + return nsnull; + return mReplaceTable; +} + +PRUint32 myspAffixMgr::getReplaceTableLength() { + return mReplaceTableLength; +} + + // perhaps we made a typical fault of spelling + res = replchars(wlst, word, &nsug); // perhaps we made a typical spelling error. no need for braces here: + if ((nsug < maxSug) && NS_SUCCEEDED(res)){ + res = forgotchar(wlst, word, &nsug); + } + +// suggestions for a typical fault of spelling, that +// differs with more, than 1 letter from the right form. // suggestions for a typical spelling error that // differs by more than 1 letter from the right spelling + if (word.Length() < 2 || ! pAMgr) return NS_OK; we try to put the returns on a line by themselves - makes it easier to set breakpoints. +nsresult myspSuggestMgr::replchars(PRUnichar ** wlst,const nsAFlatString &word, PRUint32 *ns) can you call this myspSuggestMgr::replacechars - repl is a bit ambiguous. thx, sr=bienvenu with those changes. I'll go add the sr now.

David :Bienvenu

Comment 20

•

22 years ago

Comment on attachment 144678 [details] [diff] [review] patch that should work sr=bienvenu with aforementioned nits.

Attachment #144678 - Flags: superreview?(bienvenu) → superreview+

Michiel van Leeuwen (email: mvl+moz@)

Assignee

Comment 21

•

22 years ago

checked in

Status: ASSIGNED → RESOLVED

Closed: 22 years ago

Resolution: --- → FIXED

Dean Winters

Comment 22

•

22 years ago

I have had a problem with spell checker not changing my "i" into a capital letter to make it proper. it knows that "i" is a word, but doesnt know it has to be capitalized. also, when i add words to it, such as "wouldn't" it doesnt realize that that is the best suggestion when i spell "wouldnt."

Mikael Parknert

Updated

•

21 years ago

Attachment #138430 - Flags: superreview?(bienvenu)

Scott

Comment 23

•

21 years ago

Hi, I noticed this topic is considered fixed, but I have the newest Thunderbird 1.0 and the dictionary is still well below par. I still require to use other dictionaries to spell check my documents on a common basis. Your friend, Scott

Michiel van Leeuwen (email: mvl+moz@)

Assignee

Comment 24

•

21 years ago

sclewin7@yahoo.com, try to be more specific. Is the problem with the engine, or with the word in the dictionary? Which correct words are flagged as incorrect?

Scott

Comment 25

•

21 years ago

Hi, I noticed this topic is considered fixed, but I have the newest Thunderbird 1.0 and the dictionary is still well below par. I still require to use other dictionaries to spell check my documents on a common basis. Your friend, Scott(In reply to comment #24) > sclewin7@yahoo.com, try to be more specific. Is the problem with the engine, or > with the word in the dictionary? Which correct words are flagged as incorrect? The spell checker is fine noticing misspelled words, it just does not suggest a list of correct words at about 50% of the time. So, the problem may be the dictionary, but I do not know for sure as my knowledge of programming is limited to introductory C and Basic :)

add support for REP tables in the spell checker 22 years ago Scott MacGregor 6.22 KB, patch		Details \| Diff \| Splinter Review
updated patch to account for mvl's string changes 22 years ago Scott MacGregor 6.76 KB, patch		Details \| Diff \| Splinter Review
Convert to unicode 22 years ago Michiel van Leeuwen (email: mvl+moz@) 13.17 KB, patch		Details \| Diff \| Splinter Review
patch that should work 22 years ago Michiel van Leeuwen (email: mvl+moz@) 14.04 KB, patch	mscott : review+ Bienvenu : superreview+	Details \| Diff \| Splinter Review