Bug 858138 (Closed) · Opened 12 years ago · Closed 12 years ago

[keyboard] auto-correct needs probabilities from the prediction engine

Categories

(Firefox OS Graveyard :: Gaia::Keyboard, defect)

Platform: x86, macOS
Type: defect
Priority: Not set
Severity: normal
Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: djf, Unassigned)

References

Details

The prediction engine currently returns three word suggestions to us. They are supposed to be the three most commonly used words that match (fuzzily) the user's input so far.

For auto-correct, we can't just always use the most common word that matches. I think we only want to alter the user's input if we think it is very likely that the correction is what the user wanted. If I type the letter "t", there are many common words that I might be typing. "the" is probably the most common, but there are hundreds of other common words that begin with "t" as well, so our auto-correction should not automatically convert "t" to "the".

It is easy to make the prediction engine return the word frequency along with each suggested word, but that isn't quite what we need. For auto-correction, I'd like to know the frequency of the word "the" weighted by the frequencies of all the other candidates that begin with "t". When I type "t", there is probably a < 10% chance that I actually intend to type the word "the". I probably don't want to auto-correct unless the probability is > 25% or so.

So I'm not so much interested in the frequency of each word in the language as a whole as in the frequency of the word relative to the frequencies of the other words in the search space of possible matches. Computing this will require changes to the search algorithm to retain this information during the search.
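A minimal sketch of the relative-frequency idea, assuming a flat word-to-frequency map; the real engine searches a ternary search tree, so the data and names here are hypothetical:

```ts
// Hypothetical toy dictionary; real frequencies come from the corpus.
const frequencies = new Map<string, number>([
  ["the", 220], ["to", 210], ["that", 180], ["this", 150],
  ["they", 120], ["there", 100], ["time", 90],
]);

// P(word | prefix): the word's frequency divided by the total
// frequency of every candidate sharing that prefix.
function prefixProbability(word: string, prefix: string): number {
  let total = 0;
  for (const [w, f] of frequencies) {
    if (w.startsWith(prefix)) total += f;
  }
  return total === 0 ? 0 : (frequencies.get(word) ?? 0) / total;
}

// After typing "t", "the" is only a fraction of all t-words, so it
// falls below a 25% auto-correct threshold.
console.log(prefixProbability("the", "t")); // ≈ 0.21 with this toy data
```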
Blocks: 797170
I've decided that a simpler heuristic is good enough: if the weight associated with the first suggestion is significantly higher than the weight associated with the second suggestion (where "significantly" is a tunable parameter), then the first suggestion should be used for auto-correction. This is what I did for bug 860462, and it seems to be working reasonably well, so I'm going to close this bug.
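A sketch of that heuristic, assuming suggestions arrive as (word, weight) pairs sorted by descending weight; the threshold name and value are illustrative, not the tuned parameter from bug 860462:

```ts
interface Suggestion { word: string; weight: number; }

// Hypothetical tunable parameter: how much heavier the top suggestion
// must be than the runner-up before we auto-correct.
const AUTOCORRECT_RATIO = 1.5;

function pickAutoCorrection(suggestions: Suggestion[]): string | null {
  if (suggestions.length === 0) return null;
  if (suggestions.length === 1) return suggestions[0].word;
  const [first, second] = suggestions;
  // Only correct when the first suggestion clearly dominates the second.
  return first.weight >= AUTOCORRECT_RATIO * second.weight
    ? first.word
    : null;
}

// "the" and "to" are too close to call, so no correction is made.
console.log(pickAutoCorrection([
  { word: "the", weight: 220 },
  { word: "to", weight: 210 },
])); // null
```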
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
For the record, in case we need to come back to this and re-open it, here are the rambling thoughts I wrote up while trying to decide if we really needed to do this:

Need to think about frequency some more. We need word frequencies to build this dictionary, and we may need to keep frequencies in the file for the search algorithm (do we?). But that really isn't what we want for word suggestions and auto-correction. If we know that the user has typed some prefix p, the question we're asking when doing auto-correction and prediction is "of all the words beginning with p, which word is most likely, and what is the likelihood that the user meant to type it?" (Or, instead of "of all the words", maybe we want to ask "of all the words up to 2 times the length of p".) If there are lots of high-frequency words beginning with p, then we may be able to say which is most likely, but if there are other words that are almost as likely, our confidence in the suggestion will be low, and it will not be a good choice to auto-correct.

For auto-correction we need to know the confidence as well as the frequency. Or do we? Currently I just compare the weight of the first suggestion to the weight of the second. If it is significantly higher, I auto-correct. That may actually be fine.

For suggestions and corrections, we're not just going to look at the prefix p; we're also going to assume that the user could have mistyped and consider other prefixes p' that are similar. If there are words beginning with p' that have higher frequencies than the 3 highest frequencies for p (even after being weighted by how unlikely the hypothetical mistyping is), then those words will end up on the list of suggestions. So if we're going to assign a confidence value to the suggestions, we'd have to compare words not just against the universe of words with the prefix p, but against all words that begin with similar prefixes. That's not something we can precompute.

It would still be elegant if we could convert from word frequencies into probabilities of some kind. (It would also help with scaling: once we get to long, unlikely words, we wouldn't be up against a lower limit of 1 and 2 for frequency. If we're talking about the probability of the word among words that begin with the prefix, the numbers might be higher and we could discriminate more accurately.) Can I do that? If the most common word, "the", is about 5% of English words, then its frequency of 220 means a probability of 220/4400, and we could convert all of the frequencies to probabilities by dividing by 4400. (Better would probably be to add up all the frequencies to get a total and divide each one by that total, giving the probability of each word if words were being picked randomly by throwing darts at the corpus.) We can do the same thing for any prefix: add up the frequencies of all the words that begin with that prefix and divide each one by that total, possibly limiting the search to words that aren't too much longer than the prefix. Somehow we'd then need to scale these numbers so that they didn't get too small. To allow comparisons with other prefixes p', we'd have to look at the weights for all prefixes of length n and scale them all the same, then do it for prefixes of length n+1, and so on.

Our data structure depends on being able to find the highest-probability word at a given node by traversing the center pointer chain straight down. We can't do that if the way the words are ranked changes with prefix length. If we can be certain that the highest-frequency word will always remain the highest ranked, then this works. But we can only be certain of that if we always consider all words and don't limit the universe of words based on the prefix length.

Really, I'm no longer sure any of this matters. Word frequency may be a good enough proxy for word probability, and maybe it's just fine the way it is.
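A sketch of the whole-corpus normalization described above ("throwing darts at the corpus"); the frequency table is a toy stand-in for the dictionary's real values:

```ts
// Toy frequency table; real values come from the dictionary build.
const freq: Record<string, number> = { the: 220, to: 210, of: 180 };

// Sum all frequencies once, then divide each by the total to get the
// probability of drawing each word at random from the corpus.
const total = Object.values(freq).reduce((a, b) => a + b, 0);
const prob: Record<string, number> = {};
for (const word of Object.keys(freq)) {
  prob[word] = freq[word] / total;
}

// With the comment's example numbers: if "the" is ~5% of English text
// and has frequency 220, the implied total is 220 / 0.05 = 4400, and
// every frequency would be divided by 4400.
console.log(prob["the"]); // ≈ 0.36 with this toy table
```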
Blocks: 873934