Closed
Bug 858138
Opened 12 years ago
Closed 12 years ago
[keyboard] auto-correct needs probabilities from the prediction engine
Categories
(Firefox OS Graveyard :: Gaia::Keyboard, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: djf, Unassigned)
The prediction engine currently returns three word suggestions to us. They are supposed to be the three most commonly used words that match (fuzzily) the user's input so far.
For auto-correct, we can't just always use the most common word that matches. I think we only want to alter the user's input if it is very likely that the correction is what the user wanted.
If I type the letter "t", there are many common words that I might be typing. "the" is probably the most common word. But there are hundreds of other common words that begin with t also. So our auto-correction should not automatically convert "t" to "the".
It is easy to make the prediction engine return the word frequency along with each suggested word, but this isn't quite what we need. For auto-correction, I'd like to know the frequency of the word "the" weighted by the frequencies of all other candidates that begin with "t". When I type "t", there is probably a < 10% chance that I actually intend to type the word "the". I probably don't want to auto-correct unless the probability is > 25% or something.
So anyway, I'm not so interested in the frequency of each word in the language as a whole, but instead the frequency of the word relative to the frequencies of the other words in the search space of possible matches. Computing this will require changes to the search algorithm to retain this information during the search.
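The relative frequency described above can be sketched as follows. This is only an illustration, not the prediction engine's code: the function names, the `FREQS` table, and its numbers are made up, and it uses exact prefix matching rather than the fuzzy matching the engine does. The 0.25 threshold is the "> 25% or something" figure from the description.

```python
# Hypothetical word frequencies, for illustration only.
FREQS = {
    "the": 220, "to": 200, "that": 150, "this": 120, "they": 100,
    "time": 90, "there": 80, "then": 70, "cat": 50,
}

AUTOCORRECT_THRESHOLD = 0.25  # assumed tunable cutoff


def prefix_confidence(freqs, prefix):
    """Return (best_word, share), where share is the best word's frequency
    divided by the summed frequency of all words starting with prefix."""
    candidates = {w: f for w, f in freqs.items() if w.startswith(prefix)}
    total = sum(candidates.values())
    if total == 0:
        return None, 0.0
    best = max(candidates, key=candidates.get)
    return best, candidates[best] / total


def should_autocorrect(freqs, prefix, threshold=AUTOCORRECT_THRESHOLD):
    """Return the word to auto-correct to, or None if confidence is too low."""
    best, share = prefix_confidence(freqs, prefix)
    return best if share > threshold else None
```

With these made-up numbers, "the" is the most frequent word for the prefix "t" but accounts for only about 21% of the candidate weight, so `should_autocorrect(FREQS, "t")` returns `None`.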
Comment 1 (Reporter) • 12 years ago
I've decided that a simpler heuristic is good enough: if the number associated with the first suggestion is significantly higher than the number associated with the second suggestion (where "significantly" is a tuneable parameter), then the first suggestion should be used for auto-correction.
This is what I did for bug 860462, and it seems to be working reasonably, so I'm going to close this bug.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
Comment 2 (Reporter) • 12 years ago
For the record, in case we need to come back to this and re-open it, here are the rambling thoughts I wrote up while trying to decide if we really needed to do this:
Need to think about frequency some more. We need word frequencies to
build this dictionary, and we may need to keep frequencies in the
file for the search algorithm (do we?). But that really isn't what we
want for word suggestions and auto-correction.
If we know that the user has typed some prefix p, the question we're
asking when doing auto correction and prediction is "of all the words
beginning with p, what word is most likely and what is the likelihood
that the user meant to type it?" (Or, instead of "of all the words"
maybe we want to ask "of all the words up to 2 times the length of
p".) If there are lots of high-frequency words beginning with p, then
we may be able to say which is most likely, but if there are other
words that are almost as likely, our confidence in the suggestion
will be low, and it will not be a good choice to auto-correct. For
auto correction we need to know the confidence as well as the
frequency.
Or do we? Currently I just compare the weight of the first suggestion
to the weight of the second. If it is significantly higher, I
autocorrect. That may actually be fine.
For suggestions and corrections, we're not just going to look at the
prefix p, we're also going to assume that the user could have
mistyped and consider other prefixes p' that are similar. If there
are words that begin with p' that have higher frequencies than the 3
highest frequencies for p (even after being weighted depending on how
unlikely the hypothetical mistyping is) then those words will end up
on the list of suggestions. So if we're going to assign a confidence
value to the suggestions, we'd have to compare words not just against
the universe of words with the prefix p, but all words that begin
with similar prefixes. That's not something we can precompute.
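To make the point about similar prefixes concrete, here is a rough sketch that merges candidates from the typed prefix p and from alternative prefixes p', each scaled by a penalty for how unlikely the mistyping is. The function name, the penalty values, and the frequencies are all hypothetical; the real engine's fuzzy matching is not shown in this bug.

```python
import heapq


def suggest(freqs, prefix_weights, k=3):
    """prefix_weights maps each prefix to consider (the typed prefix plus
    similar ones) to a multiplier in (0, 1]. Each word is scored as
    frequency * multiplier, keeping its best score across prefixes."""
    scored = {}
    for prefix, penalty in prefix_weights.items():
        for word, freq in freqs.items():
            if word.startswith(prefix):
                score = freq * penalty
                if score > scored.get(word, 0.0):
                    scored[word] = score
    return heapq.nlargest(k, scored.items(), key=lambda item: item[1])


# Made-up frequencies: the user typed "rh", and "th" is a plausible mistyping.
FREQS = {"the": 220, "that": 150, "this": 120, "rhythm": 4, "rhyme": 3}
```

With `{"rh": 1.0, "th": 0.3}` the three suggestions all come from the alternative prefix "th", so any confidence value would have to be computed over whichever prefixes the search happened to explore, not over a fixed set known in advance.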
It would still be elegant if we could convert from word frequencies
into probabilities of some kind. (It would also help with
scaling... Once we get to long, unlikely words, we wouldn't have to
be up against a lower limit of 1 and 2 for frequency. If we're
talking about the probability of the word among words that begin with
the prefix, the numbers might be higher and we could discriminate
more accurately.)
Can I do that? If the most common word 'the' is about 5% of English
words, then its 220 frequency means 220/4400, and we could convert
all of the frequencies to probabilities by dividing by 4400. (Better
would probably be to add up all the frequencies to get a total and
divide them all by that to give the probability of each word if words
were being picked randomly by throwing darts at the corpus.)
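Both conversions mentioned above amount to a single division. A small sketch, with a hypothetical function name and toy numbers; the optional `total` argument covers the "divide by 4400" shortcut, and omitting it sums the frequencies instead.

```python
def to_probabilities(freqs, total=None):
    """Convert raw word frequencies to probabilities.
    If total is None, use the sum of all frequencies as the denominator."""
    if total is None:
        total = sum(freqs.values())
    if total <= 0:
        return {}
    return {word: freq / total for word, freq in freqs.items()}


# 'the' with frequency 220 against an assumed corpus total of 4400.
print(to_probabilities({"the": 220}, total=4400))

# Summing a toy table instead, so the results add up to 1.
print(to_probabilities({"the": 220, "to": 200, "cat": 50}))
```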
We can do the same thing for any prefix: add up the frequencies of
all the words that begin with that prefix and divide each one by that
total, possibly limiting the search to words that aren't too much
longer than the prefix length. Somehow then we'd need to scale these
numbers so that they didn't get too small. To allow comparisons with
other prefixes p', we'd have to look at the weights for all prefixes
of length n and scale them all the same. Then do it for prefixes of
length n+1, etc.
Our data structure depends on being able to find the highest
probability word at a given node by traversing the center pointer
chain straight down. We can't do that if the way the words are
ranked changes with prefix length. If we can be certain that the
highest frequency word will always remain the highest ranked, then
this works. But we can only be certain of that if we always consider
all words and don't limit the universe of words based on the prefix
length.
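A toy illustration of how the ranking can change once the candidate set is limited by length. The names and frequencies are invented, and the "2 times the prefix length" limit is the one floated earlier in this comment; this does not model the actual dictionary data structure.

```python
def best_word(freqs, prefix, max_len_factor=None):
    """Return the highest-frequency word starting with prefix. If
    max_len_factor is given, only consider words no longer than
    max_len_factor * len(prefix)."""
    candidates = [
        (freq, word)
        for word, freq in freqs.items()
        if word.startswith(prefix)
        and (max_len_factor is None or len(word) <= max_len_factor * len(prefix))
    ]
    return max(candidates)[1] if candidates else None


# Hypothetical frequencies.
FREQS = {"the": 220, "to": 200, "that": 150, "this": 120}

print(best_word(FREQS, "t"))      # all words beginning with "t"
print(best_word(FREQS, "t", 2))   # only words of length <= 2
print(best_word(FREQS, "th", 2))  # only words of length <= 4
```

Here the best word under "t" is "the" when all words are considered but "to" when length is capped at 2, so a single precomputed "best word at this node" ordering cannot serve both cases.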
Really, I'm no longer sure any of this matters. Word frequency may be
a good enough proxy for word probability and maybe it's just fine the
way it is.