Closed Bug 860624 - Opened 11 years ago - Closed 11 years ago

[keyboard] shared suffixes in dictionary data structure lose word frequency information

Categories

(Firefox OS Graveyard :: Gaia::Keyboard, defect)

Platform: x86 macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

(blocking-b2g:-, b2g18+ fixed)

RESOLVED FIXED
blocking-b2g -
Tracking Status
b2g18 + fixed

People

(Reporter: djf, Assigned: djf, NeedInfo)

References

Details

(Whiteboard: c=auto-suggest)

Attachments

(3 files)

[Copied from https://bugzilla.mozilla.org/show_bug.cgi?id=860550#c1]

Gregor found a good example and I just verified my/our worst case assumptions with the XML file.
user input: alway
suggestions: Alwyn (73), always (59.2), away (47)

One might think that 'always' should be ranked highest. The problem is that we average the frequencies when compressing shared suffix nodes.

Let's have a look at the XML:
<w f="150" flags="">away</w>
<w f="145" flags="">always</w>
<w f="55" flags="">Alwyn</w>

The problem is that so many low-ranked words with the suffix 'ways' exist in the dictionary, e.g. wireways (10), layaways (10), fadeaways (10), and many more, which drags the averaged frequency/rank of 'alway' down to 37, whereas 'Alwy' is boosted to 73.
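To make the averaging effect concrete, here is a minimal JS sketch (illustrative only, not the actual xml2dict.py or predictions.js code) of what happens to a shared '-ways' suffix node:

// Illustrative only: a shared suffix node ends up carrying the average
// frequency of every word that reaches it, so one common word is dragged
// down by many rare ones.
var wordsEndingInWays = [
  { word: 'always',    f: 145 },
  { word: 'wireways',  f: 10 },
  { word: 'layaways',  f: 10 },
  { word: 'fadeaways', f: 10 }
  // ...and many more rare '-ways' words in the real dictionary
];

var total = wordsEndingInWays.reduce(function(sum, w) { return sum + w.f; }, 0);
var sharedSuffixFrequency = total / wordsEndingInWays.length;

console.log(sharedSuffixFrequency); // 43.75 -- far below the 145 of 'always' itself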
Blocks: 797170
I wonder what would happen to the size of the dictionaries if we only shared suffixes for words with a frequency < 100? (Or some other threshold number. Maybe if we didn't share suffixes for the 5000 most common words?)
(In reply to David Flanagan [:djf] from comment #1)
> I wonder what would happen to the size of the dictionaries if we only shared
> suffixes for words with a frequency < 100? (Or some other threshold number. 
> Maybe if we didn't share suffixes for the 5000 most common words?)

Dictionaries increase in size when

sharing suffixes if frequency of node < 100:
en_us.dict 5.4MB

sharing suffixes if frequency of node < 150:
en_us.dict 5.1MB
Copying a related comment https://bugzilla.mozilla.org/show_bug.cgi?id=859508#c6 from Christoph:


As already mentioned, the dictionary size was a bottleneck in the previous implementation of the engine. Therefore I applied the following patch and compiled all dictionaries again to see what difference it would make if we only compress suffix-nodes with the same frequency (which would avoid averaging the frequency in compressed nodes).

diff --git a/apps/keyboard/js/imes/latin/dictionaries/xml2dict.py b/apps/keyboard/js/imes/latin/dictionaries/xml2dict.py
index e7ba3e1..04d0243 100644
--- a/apps/keyboard/js/imes/latin/dictionaries/xml2dict.py
+++ b/apps/keyboard/js/imes/latin/dictionaries/xml2dict.py
@@ -218,6 +218,7 @@ class TSTTree:
         # children are equal
         mapping[nodeA] = nodeB
         return mapping if nodeA.ch == nodeB.ch and \
+               nodeA.frequency == nodeB.frequency and \
                self.equal(nodeA.left,   nodeB.left, mapping) and \
                self.equal(nodeA.center, nodeB.center, mapping) and \
                self.equal(nodeA.right,  nodeB.right, mapping) else False

This minor fix compresses suffix nodes only if the frequency of every node matches in addition to the character, which is reflected in the larger file sizes of the dictionaries:

de.dict    3.4MB VS 11.9MB
en_us.dict 1.8MB VS  5.5MB
es.dict    1.7MB VS  5.7MB
fr.dict    2.8MB VS  8.9MB
pt_br.dict 2.0MB VS  6.4MB

As we can see, compressing suffix nodes only if the character and the frequency in the node match increases the file size dramatically.

For the en_us.dict for example, we create:
  (474298 - 79632 = 394666 nodes)

in comparison to when we average the frequency in compressed nodes:
  (474298 - 345723 = 128575 nodes)

This means our algorithm compresses 345k nodes if only the character in the node has to match, whereas we can only compress 79k nodes if both the character and the frequency have to match.

I guess if we really want to get more accuracy here, it's a better idea to switch algorithms again, because such big dictionaries are not an option, from what I remember.
Assignee: nobody → dflanagan
(In reply to Christoph Kerschbaumer from comment #3)
> (In reply to David Flanagan [:djf] from comment #1)
> > I wonder what would happen to the size of the dictionaries if we only shared
> > suffixes for words with a frequency < 100? (Or some other threshold number. 
> > Maybe if we didn't share suffixes for the 5000 most common words?)
> 
> Dictionaries increase in size when
> 
> sharing suffixes if frequency of node < 100:
> en_us.dict 5.4MB
> 
> sharing suffixes if frequency of node < 150:
> en_us.dict 5.1MB

We only have about 400 words with frequency > 150. That seems strange.
Would rounding frequencies help? How about rounding all frequencies up to the next 10.
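A hypothetical sketch of the rounding idea (illustrative JS; the real change would live in the dictionary build script): bucketing frequencies to the next multiple of 10 makes many more nodes carry identical frequencies, so exact-frequency suffix sharing can kick in again.

function roundUpToTen(frequency) {
  // Round a word frequency up to the next multiple of 10.
  return Math.ceil(frequency / 10) * 10;
}

console.log(roundUpToTen(145)); // 150 -- 'always'
console.log(roundUpToTen(150)); // 150 -- 'away'; both now land in the same bucket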
I'm trying to build the dictionary using code like comment 4, but with only 16 frequency levels.  It takes a very, very long time to convert the TST to a DAG, and it is getting slower as it goes. It has removed 63K of 310K nodes out of a total of 474K nodes.  But I'm too impatient to let this run to completion.


Wow, the dictionary building script takes a very, very long time to run.
If we don't compress the suffixes at all, the en and de dictionaries are 6.6 and 13.3 MB. The English dictionary compresses to 2.7MB with gzip, so we can probably get the size down.

If I strip out all the 0 bytes in the English dictionary, the size goes from 6.6MB to 2.4MB, so maybe if I use a different binary format, I can get this down. I'll work on that next.
We are using an int16 to store the frequency in every node, so we can increase frequencies and are not limited to 255 as in the XML file.

By applying the attached patch we push shorter words so that the next pointer prefers shorter candidates rather than longer ones.

For example:
  user input: 'r'
  old predictions: 'released', 'received', 'record'
  new predictions: 'run', 'role', 'road'

Please feel free to test; IMHO this improves the quality of predictions quite a bit.
Attachment #736588 - Flags: feedback?(dflanagan)
Attachment #736588 - Flags: feedback?(anygregor)
(In reply to Christoph Kerschbaumer from comment #8)
> Created attachment 736588 [details] [diff] [review]
> ranks shorter words higher up
> 
> We are using an int16 to store the frequency in every node, so we can
> increase frequencies and are not limited to 255 as in the XML file.
> 
> By applying the attached patch we push shorter words so that the next
> pointer prefers shorter candidates rather than longer ones.
> 
> For example:
>   user input: 'r'
>   old predictions: 'released', 'received', 'record'
>   new predictions: 'run', 'role', 'road'
> 
> please feel free to test, IMHO this improves quality of predictions quite a
> bit.

Don't forget to do a make in the dictionaries folder.
I'm working on a different patch that uses a variable-length encoding of each node and leaves all the suffixes unshared. But with the new compression, the resulting English dictionary is just 2.1MB, which is not much worse at all than the old 1.8MB.

I still have to modify the js to decode the new format before I'm confident that it works, but I'm pretty excited about it.

Part of my new serialization format is that I now use only 5 bits for frequency.  I'm planning on doing something similar to what you've done to increase the frequency of short words relative to long words... I'm thinking of subtracting the word length from the frequency.  Since frequency is now between 0 and 31 this seems like it might be a reasonable thing to do.
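A rough JS sketch of that scaling (the real change belongs in the Python build script xml2dict.py, so the function name and exact rounding here are illustrative). It assumes XML frequencies range 0..255, squeezes them into 5 bits (0..31), and then subtracts the word length so shorter words rank a little higher:

function scaledFrequency(xmlFrequency, word) {
  // Scale the 0..255 XML frequency into 5 bits (0..31).
  var fiveBit = Math.round((xmlFrequency / 255) * 31);
  // Subtract the word length, clamping so the result stays positive.
  return Math.max(1, fiveBit - word.length);
}

console.log(scaledFrequency(145, 'always')); // 12
console.log(scaledFrequency(150, 'away'));   // 14 -- the shorter word wins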
Christoph, this pull request moves the dictionary building script out of the app directory and converts it to use a different binary representation.

I still have to write the JS code to read the nodes in this new representation before I'll be convinced that it works, but I'd love your feedback on the format and the python code (not my strong suit) if you have the time.
Attachment #736879 - Flags: feedback?(mozilla)
That looks good to me. I just compiled all the dictionaries for a size comparison:

de.dict    3.4MB VS 3.7MB
en_us.dict 1.8MB VS 2.1MB
es.dict    1.7MB VS 2.3MB
fr.dict    2.8MB VS 3.1MB
pt_br.dict 2.0MB VS 2.4MB

I guess the tradeoff between dictionary size and quality improvement favors the new encoding. I am all up for that. Let's see how the JS counterpart comes along, and then we have to do a performance comparison. I am kind of worried about that though, because we have to perform more reads due to the variable-sized nodes. From what I remember, Josh mentioned a worst-case response time of 140ms. Let's see how it performs.

Anyway, great progress.
I've updated the pull request with modified JS that reads the new dictionary format. (I only had to change readNodes() in predictions.js).  The dictionaries are updated too.

I've also tweaked the dictionary build script to scale the frequency into 5 bits and then subtract the word length, and this seems to give good results.

It doesn't feel slower, but I haven't measured performance yet. Christoph, did you just measure performance by timing the predict() call?
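For reference, this is the kind of minimal timing harness I have in mind (a sketch only; it assumes the predictor is exposed as a synchronous Predictions.predict(prefix) call, and if the real API reports results through a callback, the end timestamp belongs in that callback instead):

function timePrediction(predictFn, prefix) {
  // Measure how long one prediction pass takes, in milliseconds.
  var start = performance.now();
  predictFn(prefix);
  return performance.now() - start;
}

// Example (hypothetical entry point):
//   console.log(timePrediction(function(p) { Predictions.predict(p); }, 'restaur') + ' ms');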
Comment on attachment 736879 [details]
link to patch on github

Christoph,

I think you would be the best one to review this patch, if you have time. If not, please re-assign the review to Gregor.
Attachment #736879 - Flags: feedback?(mozilla) → review?(mozilla)
One thing I'm concerned about in the attached patch is that running xml2dict.py on the English wordlist results in a dictionary file with a slightly different size on each run. Why is there any indeterminacy in the algorithm? Maybe the Python sort() function? And even if nodes are being sorted in different orders, why does that affect the length of the file? Any ideas about this, Christoph?
(In reply to David Flanagan [:djf] from comment #14)
> Comment on attachment 736879 [details]
> link to patch on github
> 
> Christoph,

David, I am happy to review that.

> I think you would be the best one to review this patch, if you have time. If
> not, please re-assign the review to Gregor.
Here are my performance timings for 3 trials of typing "restaurant":

R 59 65 83
Re 39 17 44
Res 39 40 29
Rest 27 30 38
Resta 24 25 32
Restau 31 23 37
Restaur 11 21 51
Restaura 23 54 68
Restauran 34 40 53
Restaurant 13 24 29
Those times don't seem bad, but they do seem to be worse than without this patch.  I'm not sure why.  The readNode() function is more complicated but it is reading fewer bytes so I wouldn't expect it to make much difference. Maybe reading bytes instead of 16-bit words is slower?

I don't think that the changes to the dictionary format have enlarged the search space at all, so predictions.js ought to be doing pretty much the same thing.

Christoph, what did you mean in comment 14 when you said that we'd have to "perform more reads because of the variable size nodes"? I think readNode() is being called exactly the same number of times.  It is probably accessing the underlying typed array more often, however.
Comment on attachment 736588 [details] [diff] [review]
ranks shorter words higher up

Cancelling the feedback requests on this patch since I've incorporated something similar into my patch that is being reviewed.
Attachment #736588 - Flags: feedback?(dflanagan)
Attachment #736588 - Flags: feedback?(anygregor)
(In reply to David Flanagan [:djf] from comment #18)
> Those times don't seem bad, but they do seem to be worse than without this
> patch.  I'm not sure why.  The readNode() function is more complicated but
> it is reading fewer bytes so I wouldn't expect it to make much difference.
> Maybe reading bytes instead of 16-bit words is slower?

Yes, that's what I am thinking. It's probably a good idea to do some loop-unrolling style optimization: load the _dict into a Uint16Array and fiddle out the bits we need. I guess it's worth a try.

> 
> I don't think that the changes to the dictionary format have enlarged the
> search space at all, so predictions.js ought to be doing pretty much the
> same thing.

The search space is definitely the same, but as mentioned above, we have to perform more reads from the typed array, which slows things down.

> Christoph, what did you mean in comment 14 when you said that we'd have to
> "perform more reads because of the variable size nodes"? I think readNode()
> is being called exactly the same number of times.  It is probably accessing
> the underlying typed array more often, however.

Yes, that's exactly what I meant, more reads from the typed array. As I said, let's try to load it into a Uint16 Array, maybe even a Uint32Array and perform bit shift operations to read the node values.
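To illustrate what I mean, a purely hypothetical sketch: the field layout below is invented for illustration and is not the real dictionary format, and the buffer name is a placeholder.

// Hypothetical: read one aligned 32-bit word per node and peel the fields
// off with shifts and masks instead of several single-byte reads.
var dictBuffer = new ArrayBuffer(64);      // stand-in for the real dictionary blob
var dict32 = new Uint32Array(dictBuffer);  // assumes nodes padded to 4-byte boundaries

function readNode32(index) {
  var word = dict32[index];
  return {
    ch: word >>> 24,                  // example: 8-bit character code
    frequency: (word >>> 19) & 0x1f,  // example: 5-bit frequency
    nextOffset: word & 0x7ffff        // example: 19-bit offset of the next node
  };
}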
David, since you are already touching the predictions.js file, I found a bug that goes nicely with this patch. It's a minor performance optimization too: every key holds itself in its list of surrounding keys, which causes unnecessary double insertions of candidates. See the following patch:

@@ -87,6 +87,8 @@ var Predictions = function() {
       var list = {};
       for (var m = 0; m < keyArray.length; ++m) {
         var key2 = keyArray[m];
+        if (key1.code == key2.code)
+          continue;
         if (SpecialKey(key2))
           continue;
         if (SquaredDistanceToEdge(/* key dimensions */
(In reply to David Flanagan [:djf] from comment #17)
> Here are my performance timings for 3 trials of typing "restaurant":
> 
> R 59 65 83
> Re 39 17 44
> Res 39 40 29
> Rest 27 30 38
> Resta 24 25 32
> Restau 31 23 37
> Restaur 11 21 51
> Restaura 23 54 68
> Restauran 34 40 53
> Restaurant 13 24 29

David, did you always perform an 'adb reboot' before recording these numbers? In my previous evaluation of the engine I always rebooted the phone so we measure time on cold code, before any optimizations are performed. I reran your tests on my phone:

I/Gecko   (  108): r 166 ms
I/Gecko   (  108): re 62 ms
I/Gecko   (  108): res 63 ms
I/Gecko   (  108): rest 46 ms
I/Gecko   (  108): resta 36 ms
I/Gecko   (  108): restau 42 ms
I/Gecko   (  108): restaur 31 ms
I/Gecko   (  108): restaura 61 ms
I/Gecko   (  108): restauran 66 ms
I/Gecko   (  108): restaurant 47 ms

I/Gecko   (  108): r 162 ms
I/Gecko   (  108): re 61 ms
I/Gecko   (  108): res 79 ms
I/Gecko   (  108): rest 42 ms
I/Gecko   (  108): resta 52 ms
I/Gecko   (  108): restau 48 ms
I/Gecko   (  108): restaur 33 ms
I/Gecko   (  108): restaura 54 ms
I/Gecko   (  108): restauran 70 ms
I/Gecko   (  108): restaurant 21 ms

I performed several trial runs with similar results. The most interesting fact is that it takes more than 140ms until a result is displayed for the first entered character, in this case 'r'.
Josh mentioned that 140ms is the upper bound for results to appear on the display so that a user still perceives the keyboard as operating smoothly.
I guess we should work on the readNode function to speed things up.
Christoph,

I'll make the prediction.js changes you suggest above.

And I'm also going to modify the xml2dict code so that it doesn't scale the frequency information to 5 bits until the very end. I'll get better sorting by doing that at the end instead of the start.

My preference is to land this as is, and try to improve performance in a separate bug.  (I'm not worried about the cold start time, as long as performance is good enough at other times.)  Is the Uint32 bit-shifting code something you'd like to work on yourself?
David, that sounds good to me. Land the patch, and I can help you from there: we can try loading the dictionary into a Uint16Array and also a Uint32Array and see how it performs.
Attachment #736879 - Flags: review?(mozilla) → review+
Christoph,

I'm going to delay landing this a bit longer because I think I'm going to want to change the dictionary format again for bug 860538 and I don't want to push tens of megabytes of dictionary files to github while there is a workweek going on and hundreds of engineers with constrained bandwidth.

So feel free to work on the performance issue using the current version of the patch even before I land it.
Also, I'd like to figure out why I get a different dictionary file length each time I run it before I land it. That smells like a potential bug.
(In reply to David Flanagan [:djf] from comment #25)
> Christoph,
> 
> I'm going to delay landing this a bit longer because I think I'm going to
> want to change the dictionary format again for bug 860538 and I don't want
> to push tens of megabytes of dictionary files to github while there is a
> workweek going on and hundreds of engineers with constrained bandwidth.

That makes sense.

> So feel free to work on the performance issue using the current version of
> the patch even before I land it.

I will work on that during the week.
(In reply to David Flanagan [:djf] from comment #26)
> Also, I'd like to figure out why I get a different dictionary file length
> each time I run it before I land it. That smells like a potential bug.

That really smells like a potential bug, I will also investigate this.
(In reply to Christoph Kerschbaumer from comment #28)
> (In reply to David Flanagan [:djf] from comment #26)
> > Also, I'd like to figure out why I get a different dictionary file length
> > each time I run it before I land it. That smells like a potential bug.
> 
> That really smells like a potential bug, I will also investigate this.

I've fixed that by sorting the nodes by character code first, and then by frequency. I'm not sure why the order wasn't stable before, but now it is.
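For illustration, the ordering is equivalent to something like this (JS sketch; the real sort lives in the Python build script xml2dict.py, and the frequency direction shown is illustrative):

var nodes = [
  { ch: 98, frequency: 31 },  // 'b'
  { ch: 97, frequency: 12 },  // 'a', low frequency
  { ch: 97, frequency: 31 }   // 'a', high frequency
];

// Sort by character code first, then by frequency, so ties no longer depend
// on the input order and repeated builds produce byte-identical output.
nodes.sort(function(a, b) {
  return (a.ch - b.ch) || (b.frequency - a.frequency);
});
// -> [{ch:97, frequency:31}, {ch:97, frequency:12}, {ch:98, frequency:31}]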
I've pushed an update to the pull request that includes the fix above, plus Christoph's fix from comment 21.  Also, it removes the code that does special node sorting for diacritics and uppercase/lowercase variants. (I may have to restore that once I understand the search algorithm more thoroughly)

Finally, this update adds a table of letters and their frequencies at the end of the dictionary. This will allow the search algorithm to be smart about diacritics and apostrophes, I think.

Only the english dictionary is updated in the pull request, so it is not ready to land.
Peter Kankowski suggests a bug fix here: https://github.com/kankowski/gaia/commit/9e98f39781d065886e36cd610413e77d14917b83

I've updated the pull request again with a variant on this fix.
With Peter's and Christoph's optimizations landed, predictions are now much faster than they were in comment 17. Christoph: it may no longer be necessary to use uint32 with bit shifting in readNode. We might be fast enough.
I'd like to land this patch now and then, in a separate bug, work on the JS implementation of the dictionary search algorithm.  But that is likely to be an iterative process, and I may end up needing to alter the dictionary format again.

Since everyone is at a workweek with limited network connectivity, I don't want to push 10s of megabytes of dictionary files to github multiple times.

So I'm going to expand the scope of this bug and its pull request to include follow-up changes to the search algorithm. When it is ready, we should be able to close multiple bugs with one patch.
(In reply to David Flanagan [:djf] from comment #32)
> With Peter's and Christoph's optimizations landed, predictions are now much
> faster than they were in comment 17. Christoph: it may no longer be
> necessary to use uint32 with bit shifting in readNode. We might be fast
> enough.

I will do some performance testing tonight. But looks good to me!
Adding whiteboard tags for tracking via srumbu.gs.
Whiteboard: c=keyboard
Preemptively asking for leo+ on all the bugs blocking the keyboard auto-correct umbrella bug 797170.
blocking-b2g: --- → leo?
Blocks: 865484
No longer blocks: 797170
Seems like a product/UX decision whether we need auto-correct in 1.1.
Flags: needinfo?(ffos-product)
Chris is driving keyboard improvement decisions.
Flags: needinfo?(ffos-product) → needinfo?(clee)
The patch I attached to this bug turned into something bigger in bug 865484. The patch has landed there, so I'm closing this one now, too.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: c=keyboard → c=auto-suggest
Tracking, since this change is wanted by UX but the quality of the fix needs to be assessed before a decision to uplift. See below:

On 5/7/13 12:03 AM, Christopher Lee wrote:

    Thanks Francis. 

    Sounds like you and Josh are going to review the latest patches from David tomorrow and provide an assessment on the quality.  

    I'm also going to get a build tomorrow and evaluate the latest changes and we can collectively make a call on if it's ready for uplift.
blocking-b2g: leo? → -
tracking-b2g18: --- → +
Uplifted to v1-train in bug 873934