Closed Bug 859508 Opened 11 years ago Closed 11 years ago

change heuristics of predictive text engine

Categories

(Firefox OS Graveyard :: Gaia::Keyboard, defect)

Platform: x86
OS: macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

(b2g18 fixed)

RESOLVED FIXED

People

(Reporter: ckerschb, Assigned: ckerschb)

References

Details

(Whiteboard: c=keyboard)

      No description provided.
Assignee: nobody → mozilla
Thinking about different strategies to make that work. My progress can be found here: https://github.com/ckerschb/gaia/tree/heuristicsBug

It would be better to evaluate more candidates; right now we only evaluate three, because we use the linked list data structure on top of the TST. This is our major performance bottleneck. If we decide to evaluate more candidates, we definitely lack the performance for it.
Based on an email exchange with Andreas, I am working on providing different kinds of predictions.
Tapping 's' should not only provide candidates starting with 's', like 'she' or 'some', but also, for example, 'a'.
The wordlists in XML (https://github.com/mozilla-b2g/gaia/tree/master/dictionaries) provide a frequency from 0 up to 255 for every word.
If we boost the frequency just a little for words where the prefix matches, we also leave room for different predictions.

For example:
's' -> 'she' (frequency 170; just multiply the frequency by 1.2:
      170 * 1.2 = 204)
'a' (a surrounding key of 's'):
      'a'   208
      'and' 212

So we could also display results other than candidates where the prefix matches.
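
A minimal sketch of this scoring idea in Python (illustration only; the 1.2 multiplier is the one from the example above, not a tuned value):

PREFIX_BOOST = 1.2  # taken from the example above, not a tuned value

def score(frequency, prefix_matches):
    # Boost candidates whose prefix matches the typed keys, but only a
    # little, so high-frequency words on surrounding keys can still win.
    return frequency * PREFIX_BOOST if prefix_matches else frequency

# Typed 's': 'she' matches the prefix; 'and' starts with 'a', a key
# surrounding 's'.
print(score(170, True))   # 'she' -> 204.0
print(score(212, False))  # 'and' -> 212, still outranks 'she'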
Christoph,

While working on this, please also see the bugs I've marked as blocking bug 797170 (auto-correct). 

Bug 857850 looks to me like a bug in the current heuristics: the words aren't ranked correctly. 

The others are feature requests that you might be able to address while working on this one. 

Bug 858158 is mostly just a request for more documentation on the data structure and the algorithm. (For example, I don't understand the linked list on top of the TST). 

Bug 858138 has some thoughts about using probabilities instead of word frequencies. That is, instead of basing our calculations on the frequency of a given word within whatever corpus Google used when creating the dictionaries, maybe the right measure is the probability, given a prefix 'foo', that the user intends to type 'foobar'.  This probability would be computed by comparing the frequency of 'foobar' against the frequency of 'food' and all other words that begin 'foo'.
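
A sketch of that computation (Python; the word -> frequency map and the numbers are made up):

def prefix_probability(prefix, target, frequencies):
    # P(user means `target` | typed `prefix`): the frequency of `target`
    # relative to all dictionary words sharing that prefix.
    total = sum(f for w, f in frequencies.items() if w.startswith(prefix))
    return frequencies[target] / total if total else 0.0

frequencies = {'food': 210, 'foot': 180, 'foobar': 15}  # made-up values
print(prefix_probability('foo', 'foobar', frequencies))  # ~0.037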

The other thought I've been having is that in order to get the 's' -> 'a' auto-correction that Andreas wants, you're going to have to do something like limit predictions to words that are no longer than twice the number of input characters.  (That actually sounds like a pretty reasonable rule to me).
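
In code the rule would be as simple as (illustrative sketch):

def within_length_limit(prefix, candidates):
    # Keep only predictions no longer than twice the typed input, so a
    # one-letter input like 's' can still be corrected to 'a'.
    return [w for w in candidates if len(w) <= 2 * len(prefix)]

print(within_length_limit('s', ['a', 'so', 'she', 'something']))  # ['a', 'so']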

I also don't think it would be all that hard to consider the actual (x,y) coordinates of the user's touches, so that for a given touch between the A and S keys you could compute a probability that the user intended to type A and a probability that the user intended to type S. From my experiments with iOS, it seems clear that Apple is doing something like that.
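
Something along these lines (Python sketch; the key coordinates are made up - real ones would come from the keyboard layout - and sigma is a guess for typical touch scatter):

import math

KEY_CENTERS = {'a': (30, 200), 's': (70, 200)}  # made-up pixel coordinates

def key_probabilities(x, y, sigma=20.0):
    # Weight each key by a Gaussian of the distance from the touch point
    # to the key center, then normalize to probabilities.
    weights = {k: math.exp(-((x - kx) ** 2 + (y - ky) ** 2) / (2 * sigma ** 2))
               for k, (kx, ky) in KEY_CENTERS.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# A touch between 'a' and 's', slightly closer to 's':
print(key_probabilities(55, 200))  # roughly {'a': 0.38, 's': 0.62}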
(In reply to David Flanagan [:djf] from comment #4)

> Bug 858158 is mostly just a request for more documentation on the data
> structure and the algorithm. (For example, I don't understand the linked
> list on top of the TST). 

I added documentation and did a pull request - I hope this gives you a good starting point for understanding the current implementation.
 
> Bug 858138 has some thoughts about using probabilities instead of word
> frequencies. That is, instead of basing our calculations on the frequency of
> a given word within whatever corpus Google used when creating the
> dictionaries, maybe the right measure is the probability, given a prefix
> 'foo', that the user intends to type 'foobar'.  This probability would be
> computed by comparing the frequency of 'foobar' against the frequency of
> 'food' and all other words that begin 'foo'.

You are right, we are not solely relying on the frequencies from the XML file.
We had similar probability heuristics in the old prediction engine. We used a PrefixMatchMultiplier, as far as I remember - let's see if I can incorporate that into the new engine as well.

> The other thought I've been having is that in order to get the 's' -> 'a'
> auto-correction that Andreas wants, you're going to have to do something
> like limit predictions to words that are no longer than twice the number of
> input characters.  (That actually sounds like a pretty reasonable rule to
> me).

The problem with this is that the current performance boost relies on the linked list on top of the TST, which provides a pointer to the next-best node. Looking only for candidates that are at most n characters longer than the current prefix forces us either to search the whole linked list or to go back to a conventional tree traversal algorithm and inspect the search space at every node.
Either solution slows down performance.
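
To illustrate (simplified Python model; node.word and node.next_best stand in for the actual data structure):

def best_candidates(start_node, prefix_len, n=3):
    # Walk the frequency-ordered linked list on top of the TST. Without
    # the length check we could stop after the first n nodes; with it,
    # matching short words may sit arbitrarily far down the list, so the
    # walk loses its early termination.
    results = []
    node = start_node
    while node is not None and len(results) < n:
        if len(node.word) <= 2 * prefix_len:  # the proposed length cap
            results.append(node.word)
        node = node.next_best
    return results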
David,

I am totally with you. There is a realistic chance that this algorithm overestimates the actual frequency of a candidate. On the other hand, we return 3 suggestions total, which means we would still display 2 other suggestions in such a case. Having the same overestimation problem on all of the 3 suggestions is rather unlikely, isn't it?

As already mentioned, the dictionary size was a bottleneck in the previous implementation of the engine. Therefore I applied the following patch and compiled all dictionaries again to see what difference it would make if we only compress suffix-nodes with the same frequency (which would avoid averaging the frequency in compressed nodes).

diff --git a/apps/keyboard/js/imes/latin/dictionaries/xml2dict.py b/apps/keyboard/js/imes/latin/dictionaries/xml2dict.py
index e7ba3e1..04d0243 100644
--- a/apps/keyboard/js/imes/latin/dictionaries/xml2dict.py
+++ b/apps/keyboard/js/imes/latin/dictionaries/xml2dict.py
@@ -218,6 +218,7 @@ class TSTTree:
         # children are equal
         mapping[nodeA] = nodeB
         return mapping if nodeA.ch == nodeB.ch and \
+               nodeA.frequency == nodeB.frequency and \
                self.equal(nodeA.left,   nodeB.left, mapping) and \
                self.equal(nodeA.center, nodeB.center, mapping) and \
                self.equal(nodeA.right,  nodeB.right, mapping) else False

This minor fix compresses suffix nodes only if the frequency of every node matches in addition to the character, which is reflected in the bigger file size of the dictionaries:

           (averaged frequency VS exact frequency match)
de.dict    3.4MB VS 11.9MB
en_us.dict 1.8MB VS  5.5MB
es.dict    1.7MB VS  5.7MB
fr.dict    2.8MB VS  8.9MB
pt_br.dict 2.0MB VS  6.4MB

As we can see, compressing suffix nodes only if the character and the frequency in the node match increases the file size dramatically.

For the en_us.dict, for example, we create
  474298 - 79632 = 394666 nodes

in comparison to when we average the frequency in compressed nodes:
  474298 - 345723 = 128575 nodes

This means our algorithm compresses ~345k nodes if only the character in the node has to match, whereas we can compress only ~79k nodes if both the character and the frequency in the nodes have to match.
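
For reference, a simplified model of the merge test in xml2dict.py (Python sketch; the real function returns a node mapping rather than a bool). With require_equal_freq=False the converter averages frequencies and can share far more suffix nodes:

def can_merge(a, b, require_equal_freq):
    # Two suffix subtrees can be shared when their characters (and,
    # optionally, their frequencies) match all the way down.
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    if a.ch != b.ch:
        return False
    if require_equal_freq and a.frequency != b.frequency:
        return False
    return (can_merge(a.left, b.left, require_equal_freq) and
            can_merge(a.center, b.center, require_equal_freq) and
            can_merge(a.right, b.right, require_equal_freq))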

I guess if we really want to get more accuracy here, it's a better idea to switch algorithms again, because such big dictionaries are not an option, as far as I remember.
uh uh, wrong bug!
Blocks: 797170
Adding whiteboard tags for tracking via scrumbu.gs.
Whiteboard: c=keyboard
Blocks: 865484
No longer blocks: 797170
Fixed by parent bug 865484.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Uplifted to v1-train in bug 873934