Spell check suggestions are not very effective; needs Levenshtein Edit Distance
Categories
(Core :: Spelling checker, defect)
Tracking
()
People
(Reporter: Bugzilla-Mozilla, Unassigned)
References
(Blocks 2 open bugs)
Details
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 GTB5 Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 GTB5 The spell checker suggestions are seriously lacking. There are many, many occasions where I try to get a correction to a misspelled word, and get no or very few results. So, I'm forced to look up the misspelling in Google. (Oddly enough, it correctly identifies what I'm talking about 90% of the time.) I think a major part of this is that Levenshtein Edit Distance (LED) is not implemented as part of the search suggestions. LED determines the amount of letter changes required to change one word to another. If the LED is small (say, less than 5), then chances are that the word is a good suggestion to the user. For example, here are some example misspellings that fail Firefox, but have a low LED: testimont -> testament desimination -> dissemination (There are others, but I would have to mark them as I find them.) LED is described in Wikipedia here: http://en.wikipedia.org/wiki/Levenshtein_distance. The algorithm is simple enough to implement, if given a large enough sample to run through some words. Perhaps a looser version of the suggestion engine which is tightened up with LED? Reproducible: Always
Comment 1•15 years ago
|
||
Bug 499593 is one adding words to the dictionary, and a very large patch will probably be landing on trunk in a few days that will improve the spell check. Bug 479334
Reporter | ||
Comment 2•15 years ago
|
||
Both of those seem to only deal with the dictionary itself, and not how it suggests words. The words above are already in the dictionary, but the suggestion engine is ineffective enough to not figure out what I want in many occasions.
Comment 3•15 years ago
|
||
We have an implementation of LevenshteinEditDistance now for Sqlite queries. It wouldn't be hard to factor it out so that it's generally accessible, and that's something I'd like to do anyway. Note that Levenshtein is pretty CPU-intensive, so you probably want to use it only after a faster first pass has thrown out obvious bad matches. I think this is essentially what Brendan Byrd is suggesting at the end of comment #0.
Reporter | ||
Comment 4•15 years ago
|
||
Yep, give the existing engine about 100 words or so to start with and then run through a LED. If the subroutine is already there, it sounds like it would be really easy to implement if I had a idea of where the suggestion stuff happens in the code.
Comment 5•15 years ago
|
||
Mozilla code currently has nothing to do with word suggestions -- they are all suggested by Hunspell. Curtis, you might want to point the Hunspell authors to the Mozilla - Sqlite implementation, but it seems like it will be difficult to share that code since Hunspell is a stand-alone project. Brendan, can you please file an enhancement request upstream with Hunspell? http://sourceforge.net/tracker/?atid=756398&group_id=143754&func=browse
Comment 6•15 years ago
|
||
(In reply to comment #5) > Mozilla code currently has nothing to do with word suggestions -- they are all > suggested by Hunspell. > > Curtis, you might want to point the Hunspell authors to the Mozilla - Sqlite > implementation, but it seems like it will be difficult to share that code since > Hunspell is a stand-alone project. The Wikipedia article on Levenshtein Distance is probably more useful than my code anyway. If the Hunspell guys are not already using it, then I find that a little bit surprising given how well known (and how simple) it is.
Reporter | ||
Comment 7•15 years ago
|
||
Created bug for Hunspell here: http://sourceforge.net/tracker/?func=detail&aid=2871098&group_id=143754&atid=756398 I'm looking at the code here and I can't grep anything with the word "distance" on it. Frankly, I'm appalled that something which is used for many many different applications isn't using something as basic as LED or even DLED. The suggestion manager on this thing seems to rely on doing small stuff to words (like moving a letter, swapping characters) to see if it matches a good word. Plus, I have a hard time reading C code anyway, but I'll see if this can be made into a patch somehow...
Comment 8•2 years ago
|
||
testimont now gives testimony and desimination now gives destination. I guess it has been improved since this bug.
Comment 9•2 years ago
|
||
testimont still does not suggest "testament" nor desimination suggest "dissemination". Nothing has improved in the last decade! Please reopen.
Comment 10•2 years ago
|
||
Oh sorry, I thought the examples in comment #0 were about the wrong suggestions.
Quickly checked the Chrome behavior for comparison, and while it does suggest dissemination
it does not suggest destination
.
:dminor, do you have an idea about this?
Comment 11•2 years ago
|
||
We're still in the same situation as when this bug was opened, we get our suggestions directly from hunspell, so we'd be stuck trying to land improvements to hunspell upstream, or maintaining our own local patches for this.
Chromium is also using hunspell, but they're a version behind us, and have also patched their version extensively. It looks like most of this is to support a binary dictionary format. There are changes to the suggestion code, but it looks like this is largely to support their dictionaries. It's possible that these dictionaries have improved spell suggestions baked in, it's hard to tell without reading the code in more detail than I have time for right now.
Description
•