Open Bug 519214 Opened 15 years ago Updated 1 year ago

Spell check suggestions are not very effective; needs Levenshtein Edit Distance

Categories

(Core :: Spelling checker, defect)

x86
Windows XP
defect

Tracking

()

REOPENED

People

(Reporter: Bugzilla-Mozilla, Unassigned)

References

(Blocks 2 open bugs)

Details

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 GTB5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 GTB5

The spell checker suggestions are seriously lacking.  There are many, many occasions where I try to get a correction to a misspelled word, and get no or very few results.  So, I'm forced to look up the misspelling in Google.  (Oddly enough, it correctly identifies what I'm talking about 90% of the time.)

I think a major part of this is that Levenshtein Edit Distance (LED) is not implemented as part of the search suggestions.  LED determines the amount of letter changes required to change one word to another.  If the LED is small (say, less than 5), then chances are that the word is a good suggestion to the user.  For example, here are some example misspellings that fail Firefox, but have a low LED:

testimont -> testament
desimination -> dissemination
(There are others, but I would have to mark them as I find them.)

LED is described in Wikipedia here: http://en.wikipedia.org/wiki/Levenshtein_distance.  The algorithm is simple enough to implement, if given a large enough sample to run through some words.  Perhaps a looser version of the suggestion engine which is tightened up with LED?

Reproducible: Always
Bug 499593 is one adding words to the dictionary, and a very large patch will probably be landing on trunk in a few days that will improve the spell check. Bug 479334
Both of those seem to only deal with the dictionary itself, and not how it suggests words.  The words above are already in the dictionary, but the suggestion engine is ineffective enough to not figure out what I want in many occasions.
We have an implementation of LevenshteinEditDistance now for Sqlite queries.  It wouldn't be hard to factor it out so that it's generally accessible, and that's something I'd like to do anyway.

Note that Levenshtein is pretty CPU-intensive, so you probably want to use it only after a faster first pass has thrown out obvious bad matches.  I think this is essentially what Brendan Byrd is suggesting at the end of comment #0.
Yep, give the existing engine about 100 words or so to start with and then run through a LED.  If the subroutine is already there, it sounds like it would be really easy to implement if I had a idea of where the suggestion stuff happens in the code.
Mozilla code currently has nothing to do with word suggestions -- they are all suggested by Hunspell. 

Curtis, you might want to point the Hunspell authors to the Mozilla - Sqlite implementation, but it seems like it will be difficult to share that code since Hunspell is a stand-alone project. 

Brendan, can you please file an enhancement request upstream with Hunspell? 

http://sourceforge.net/tracker/?atid=756398&group_id=143754&func=browse
Status: UNCONFIRMED → NEW
Ever confirmed: true
(In reply to comment #5)
> Mozilla code currently has nothing to do with word suggestions -- they are all
> suggested by Hunspell. 
> 
> Curtis, you might want to point the Hunspell authors to the Mozilla - Sqlite
> implementation, but it seems like it will be difficult to share that code since
> Hunspell is a stand-alone project. 

The Wikipedia article on Levenshtein Distance is probably more useful than my code anyway.  If the Hunspell guys are not already using it, then I find that a little bit surprising given how well known (and how simple) it is.
Created bug for Hunspell here: http://sourceforge.net/tracker/?func=detail&aid=2871098&group_id=143754&atid=756398

I'm looking at the code here and I can't grep anything with the word "distance" on it.  Frankly, I'm appalled that something which is used for many many different applications isn't using something as basic as LED or even DLED.  The suggestion manager on this thing seems to rely on doing small stuff to words (like moving a letter, swapping characters) to see if it matches a good word.  Plus, I have a hard time reading C code anyway, but I'll see if this can be made into a patch somehow...

testimont now gives testimony and desimination now gives destination. I guess it has been improved since this bug.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME

testimont still does not suggest "testament" nor desimination suggest "dissemination". Nothing has improved in the last decade! Please reopen.

Flags: needinfo?(krosylight)

Oh sorry, I thought the examples in comment #0 were about the wrong suggestions.

Quickly checked the Chrome behavior for comparison, and while it does suggest dissemination it does not suggest destination.

:dminor, do you have an idea about this?

Status: RESOLVED → REOPENED
Flags: needinfo?(krosylight) → needinfo?(dminor)
Resolution: WORKSFORME → ---

We're still in the same situation as when this bug was opened, we get our suggestions directly from hunspell, so we'd be stuck trying to land improvements to hunspell upstream, or maintaining our own local patches for this.

Chromium is also using hunspell, but they're a version behind us, and have also patched their version extensively. It looks like most of this is to support a binary dictionary format. There are changes to the suggestion code, but it looks like this is largely to support their dictionaries. It's possible that these dictionaries have improved spell suggestions baked in, it's hard to tell without reading the code in more detail than I have time for right now.

Flags: needinfo?(dminor)

Thanks! Let's say S3 for now.

Severity: normal → S3
Duplicate of this bug: 1825304
You need to log in before you can comment on or make changes to this bug.