Last Comment Bug 519214 - Spell check suggestions are not very effective; needs Levenshtein Edit Distance
: Spell check suggestions are not very effective; needs Levenshtein Edit Distance
Status: NEW
:
Product: Core
Classification: Components
Component: Spelling checker (show other bugs)
: unspecified
: x86 Windows XP
: -- normal with 1 vote (vote)
: ---
Assigned To: Nobody; OK to take it and work on it
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2009-09-28 08:12 PDT by SineSwiper
Modified: 2010-04-30 19:58 PDT (History)
6 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments

Description SineSwiper 2009-09-28 08:12:36 PDT
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 GTB5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 GTB5

The spell checker suggestions are seriously lacking.  There are many, many occasions where I try to get a correction to a misspelled word, and get no or very few results.  So, I'm forced to look up the misspelling in Google.  (Oddly enough, it correctly identifies what I'm talking about 90% of the time.)

I think a major part of this is that Levenshtein Edit Distance (LED) is not implemented as part of the search suggestions.  LED determines the amount of letter changes required to change one word to another.  If the LED is small (say, less than 5), then chances are that the word is a good suggestion to the user.  For example, here are some example misspellings that fail Firefox, but have a low LED:

testimont -> testament
desimination -> dissemination
(There are others, but I would have to mark them as I find them.)

LED is described in Wikipedia here: http://en.wikipedia.org/wiki/Levenshtein_distance.  The algorithm is simple enough to implement, if given a large enough sample to run through some words.  Perhaps a looser version of the suggestion engine which is tightened up with LED?

Reproducible: Always
Comment 1 Tyler Downer [:Tyler] 2009-09-28 08:32:04 PDT
Bug 499593 is one adding words to the dictionary, and a very large patch will probably be landing on trunk in a few days that will improve the spell check. Bug 479334
Comment 2 SineSwiper 2009-09-28 10:50:25 PDT
Both of those seem to only deal with the dictionary itself, and not how it suggests words.  The words above are already in the dictionary, but the suggestion engine is ineffective enough to not figure out what I want in many occasions.
Comment 3 Curtis Bartley [:cbartley] 2009-09-30 08:58:51 PDT
We have an implementation of LevenshteinEditDistance now for Sqlite queries.  It wouldn't be hard to factor it out so that it's generally accessible, and that's something I'd like to do anyway.

Note that Levenshtein is pretty CPU-intensive, so you probably want to use it only after a faster first pass has thrown out obvious bad matches.  I think this is essentially what Brendan Byrd is suggesting at the end of comment #0.
Comment 4 SineSwiper 2009-09-30 15:52:09 PDT
Yep, give the existing engine about 100 words or so to start with and then run through a LED.  If the subroutine is already there, it sounds like it would be really easy to implement if I had a idea of where the suggestion stuff happens in the code.
Comment 5 Matt Caywood 2009-09-30 17:53:27 PDT
Mozilla code currently has nothing to do with word suggestions -- they are all suggested by Hunspell. 

Curtis, you might want to point the Hunspell authors to the Mozilla - Sqlite implementation, but it seems like it will be difficult to share that code since Hunspell is a stand-alone project. 

Brendan, can you please file an enhancement request upstream with Hunspell? 

http://sourceforge.net/tracker/?atid=756398&group_id=143754&func=browse
Comment 6 Curtis Bartley [:cbartley] 2009-09-30 18:04:48 PDT
(In reply to comment #5)
> Mozilla code currently has nothing to do with word suggestions -- they are all
> suggested by Hunspell. 
> 
> Curtis, you might want to point the Hunspell authors to the Mozilla - Sqlite
> implementation, but it seems like it will be difficult to share that code since
> Hunspell is a stand-alone project. 

The Wikipedia article on Levenshtein Distance is probably more useful than my code anyway.  If the Hunspell guys are not already using it, then I find that a little bit surprising given how well known (and how simple) it is.
Comment 7 SineSwiper 2009-10-03 10:08:15 PDT
Created bug for Hunspell here: http://sourceforge.net/tracker/?func=detail&aid=2871098&group_id=143754&atid=756398

I'm looking at the code here and I can't grep anything with the word "distance" on it.  Frankly, I'm appalled that something which is used for many many different applications isn't using something as basic as LED or even DLED.  The suggestion manager on this thing seems to rely on doing small stuff to words (like moving a letter, swapping characters) to see if it matches a good word.  Plus, I have a hard time reading C code anyway, but I'll see if this can be made into a patch somehow...

Note You need to log in before you can comment on or make changes to this bug.