Closed Bug 860624 - Opened 11 years ago - Closed 11 years ago

[keyboard] shared suffixes in dictionary data structure lose word frequency information

Categories

(Firefox OS Graveyard :: Gaia::Keyboard, defect)

Platform: x86 macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

(blocking-b2g:-, b2g18+ fixed)

RESOLVED FIXED
blocking-b2g -
Tracking Status
b2g18 + fixed

People

(Reporter: djf, Assigned: djf, NeedInfo)

References

Details

(Whiteboard: c=auto-suggest)

Attachments

(3 files)

[Copied from https://bugzilla.mozilla.org/show_bug.cgi?id=860550#c1]

Gregor found a good example and I just verified my/our worst case assumptions with the XML file.
user input: alway
suggestions: Alwyn (73), always (59.2), away (47)

One might think that 'always' should be ranked highest. The problem is that we average the frequencies when compressing shared suffix nodes.

Let's have a look at the XML:
<w f="150" flags="">away</w>
<w f="145" flags="">always</w>
<w f="55" flags="">Alwyn</w>

The problem is that so many low-ranked words with the suffix 'ways' exist in the dictionary, e.g. wireways (10), layaways (10), fadeaways (10), and many more, which drags the averaged frequency/rank of 'alway' down to 37, whereas 'Alwy' is boosted to 73.
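To make the averaging effect concrete, here is a minimal JS sketch (illustrative only, not the actual xml2dict.py or predictions.js code) of what happens to a shared '-ways' suffix node:

// Illustrative only: a shared suffix node ends up carrying the average
// frequency of every word that reaches it, so one common word is dragged
// down by many rare ones.
var wordsEndingInWays = [
  { word: 'always',    f: 145 },
  { word: 'wireways',  f: 10 },
  { word: 'layaways',  f: 10 },
  { word: 'fadeaways', f: 10 }
  // ...and many more rare '-ways' words in the real dictionary
];

var total = wordsEndingInWays.reduce(function(sum, w) { return sum + w.f; }, 0);
var sharedSuffixFrequency = total / wordsEndingInWays.length;

console.log(sharedSuffixFrequency); // 43.75 -- far below the 145 of 'always' itself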
Blocks: 797170
I wonder what would happen to the size of the dictionaries if we only shared suffixes for words with a frequency < 100? (Or some other threshold number. Maybe if we didn't share suffixes for the 5000 most common words?)
(In reply to David Flanagan [:djf] from comment #1)
> I wonder what would happen to the size of the dictionaries if we only shared
> suffixes for words with a frequency < 100? (Or some other threshold number. 
> Maybe if we didn't share suffixes for the 5000 most common words?)

Dictionaries increase in size when

sharing suffixes if frequency of node < 100:
en_us.dict 5.4MB

sharing suffixes if frequency of node < 150:
en_us.dict 5.1MB
Copying a related comment https://bugzilla.mozilla.org/show_bug.cgi?id=859508#c6 from Christoph:


As already mentioned, the dictionary size was a bottleneck in the previous implementation of the engine. Therefore I applied the following patch and compiled all dictionaries again to see what difference it would make if we only compress suffix-nodes with the same frequency (which would avoid averaging the frequency in compressed nodes).

diff --git a/apps/keyboard/js/imes/latin/dictionaries/xml2dict.py b/apps/keyboard/js/imes/latin/dictionaries/xml2dict.py
index e7ba3e1..04d0243 100644
--- a/apps/keyboard/js/imes/latin/dictionaries/xml2dict.py
+++ b/apps/keyboard/js/imes/latin/dictionaries/xml2dict.py
@@ -218,6 +218,7 @@ class TSTTree:
         # children are equal
         mapping[nodeA] = nodeB
         return mapping if nodeA.ch == nodeB.ch and \
+               nodeA.frequency == nodeB.frequency and \
                self.equal(nodeA.left,   nodeB.left, mapping) and \
                self.equal(nodeA.center, nodeB.center, mapping) and \
                self.equal(nodeA.right,  nodeB.right, mapping) else False

This minor fix compresses suffix nodes only if the frequency of every node matches in addition to the character, which is reflected in the larger file sizes of the dictionaries:

de.dict    3.4MB VS 11.9MB
en_us.dict 1.8MB VS  5.5MB
es.dict    1.7MB VS  5.7MB
fr.dict    2.8MB VS  8.9MB
pt_br.dict 2.0MB VS  6.4MB

As we can see, compressing suffix nodes only if the character and the frequency in the node match increases the file size dramatically.

For the en_us.dict for example, we create:
  (474298 - 79632 = 394666 nodes)

in comparison to when we average the frequency in compressed nodes:
  (474298 - 345723 = 128575 nodes)

This means our algorithm compresses 345k nodes if only the character in the node has to match, whereas we can only compress 79k nodes if both the character and the frequency have to match.

I guess if we really want to get more accuracy here, it's a better idea to switch algorithms again, because such big dictionaries are not an option, from what I remember.
Assignee: nobody → dflanagan
(In reply to Christoph Kerschbaumer from comment #3)
> (In reply to David Flanagan [:djf] from comment #1)
> > I wonder what would happen to the size of the dictionaries if we only shared
> > suffixes for words with a frequency < 100? (Or some other threshold number. 
> > Maybe if we didn't share suffixes for the 5000 most common words?)
> 
> Dictionaries increase in size when
> 
> sharing suffixes if frequency of node < 100:
> en_us.dict 5.4MB
> 
> sharing suffixes if frequency of node < 150:
> en_us.dict 5.1MB

We only have about 400 words with frequency > 150. That seems strange.
Would rounding frequencies help? How about rounding all frequencies up to the next 10.
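A hypothetical sketch of the rounding idea (illustrative JS; the real change would live in the dictionary build script): bucketing frequencies to the next multiple of 10 makes many more nodes carry identical frequencies, so exact-frequency suffix sharing can kick in again.

function roundUpToTen(frequency) {
  // Round a word frequency up to the next multiple of 10.
  return Math.ceil(frequency / 10) * 10;
}

console.log(roundUpToTen(145)); // 150 -- 'always'
console.log(roundUpToTen(150)); // 150 -- 'away'; both now land in the same bucket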
I'm trying to build the dictionary using code like comment 4, but with only 16 frequency levels.  It takes a very, very long time to convert the TST to a DAG, and it is getting slower as it goes. It has removed 63K of 310K nodes out of a total of 474K nodes.  But I'm too impatient to let this run to completion.


Wow, the dictionary building script takes a very, very long time to run.
If we don't compress the suffixes at all, the en and de dictionaries are 6.6 and 13.3 MB. The English dictionary compresses to 2.7MB with gzip, so we can probably get the size down.

If I strip out all the 0 bytes in the English dictionary, the size goes from 6.6MB to 2.4MB, so maybe if I use a different binary format, I can get this down. I'll work on that next.
We are using an int16 to store the frequency in every node, so we can increase frequencies and are not limited to 255 as in the XML file.

By applying the attached patch we push shorter words so that the next pointer prefers shorter candidates rather than longer ones.

For example:
  user input: 'r'
  old predictions: 'released', 'received', 'record'
  new predictions: 'run', 'role', 'road'

Please feel free to test; IMHO this improves the quality of predictions quite a bit.
Attachment #736588 - Flags: feedback?(dflanagan)
Attachment #736588 - Flags: feedback?(anygregor)
(In reply to Christoph Kerschbaumer from comment #8)
> Created attachment 736588 [details] [diff] [review]
> ranks shorter words higher up
> 
> We are using an int16 to store the frequency in every node, so we can
> increase frequencies and are not limited to 255 as in the XML file.
> 
> By applying the attached patch we push shorter words so that the next
> pointer prefers shorter candidates rather than longer ones.
> 
> For example:
>   user input: 'r'
>   old predictions: 'released', 'received', 'record'
>   new predictions: 'run', 'role', 'road'
> 
> please feel free to test, IMHO this improves quality of predictions quite a
> bit.

Don't forget to do a make in the dictionaries folder.
I'm working on a different patch that uses a variable-length encoding of each node and leaves all the suffixes unshared. But with the new compression, the resulting English dictionary is just 2.1MB, which is not much worse at all than the old 1.8MB.

I still have to modify the js to decode the new format before I'm confident that it works, but I'm pretty excited about it.

Part of my new serialization format is that I now use only 5 bits for frequency.  I'm planning on doing something similar to what you've done to increase the frequency of short words relative to long words... I'm thinking of subtracting the word length from the frequency.  Since frequency is now between 0 and 31 this seems like it might be a reasonable thing to do.
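A rough JS sketch of that scaling (the real change belongs in the Python build script xml2dict.py, so the function name and exact rounding here are illustrative). It assumes XML frequencies range 0..255, squeezes them into 5 bits (0..31), and then subtracts the word length so shorter words rank a little higher:

function scaledFrequency(xmlFrequency, word) {
  // Scale the 0..255 XML frequency into 5 bits (0..31).
  var fiveBit = Math.round((xmlFrequency / 255) * 31);
  // Subtract the word length, clamping so the result stays positive.
  return Math.max(1, fiveBit - word.length);
}

console.log(scaledFrequency(145, 'always')); // 12
console.log(scaledFrequency(150, 'away'));   // 14 -- the shorter word wins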
Christoph, this pull request moves the dictionary building script out of the app directory and converts it to use a different binary representation.

I still have to write the JS code to read the nodes in this new representation before I'll be convinced that it works, but I'd love your feedback on the format and the python code (not my strong suit) if you have the time.
Attachment #736879 - Flags: feedback?(mozilla)
That looks good to me. I just compiled all the dictionaries for a size comparison:

de.dict    3.4MB VS 3.7MB
en_us.dict 1.8MB VS 2.1MB
es.dict    1.7MB VS 2.3MB
fr.dict    2.8MB VS 3.1MB
pt_br.dict 2.0MB VS 2.4MB

I guess the tradeoff between dictionary size and quality improvement favors the new encoding. I am all up for that. Let's see how the JS counterpart comes along, and then we have to do a performance comparison. I am kind of worried about that though, because we have to perform more reads due to the variable-sized nodes. From what I remember, Josh mentioned a worst-case response time of 140ms. Let's see how it performs.

Anyway, great progress.
I've updated the pull request with modified JS that reads the new dictionary format. (I only had to change readNodes() in predictions.js).  The dictionaries are updated too.

I've also tweaked the dictionary build script to scale the frequency into 5 bits and then subtract the word length, and this seems to give good results.

It doesn't feel slower, but I haven't measured performance yet. Christoph, did you just measure performance by timing the predict() call?
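For reference, this is the kind of minimal timing harness I have in mind (a sketch only; it assumes the predictor is exposed as a synchronous Predictions.predict(prefix) call, and if the real API reports results through a callback, the end timestamp belongs in that callback instead):

function timePrediction(predictFn, prefix) {
  // Measure how long one prediction pass takes, in milliseconds.
  var start = performance.now();
  predictFn(prefix);
  return performance.now() - start;
}

// Example (hypothetical entry point):
//   console.log(timePrediction(function(p) { Predictions.predict(p); }, 'restaur') + ' ms');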
Comment on attachment 736879 [details]
link to patch on github

Christoph,

I think you would be the best one to review this patch, if you have time. If not, please re-assign the review to Gregor.
Attachment #736879 - Flags: feedback?(mozilla) → review?(mozilla)
One thing I'm concerned about in the attached patch is that running xml2dict.py on the English wordlist results in a dictionary file with a slightly different size on each run. Why is there any indeterminacy in the algorithm? Maybe the Python sort() function? And even if nodes are being sorted in different orders, why does that affect the length of the file? Any ideas about this, Christoph?
(In reply to David Flanagan [:djf] from comment #14)
> Comment on attachment 736879 [details]
> link to patch on github
> 
> Christoph,

David, I am happy to review that.

> I think you would be the best one to review this patch, if you have time. If
> not, please re-assign the review to Gregor.
Here are my performance timings for 3 trials of typing "restaurant":

R 59 65 83
Re 39 17 44
Res 39 40 29
Rest 27 30 38
Resta 24 25 32
Restau 31 23 37
Restaur 11 21 51
Restaura 23 54 68
Restauran 34 40 53
Restaurant 13 24 29
Those times don't seem bad, but they do seem to be worse than without this patch.  I'm not sure why.  The readNode() function is more complicated but it is reading fewer bytes so I wouldn't expect it to make much difference. Maybe reading bytes instead of 16-bit words is slower?

I don't think that the changes to the dictionary format have enlarged the search space at all, so predictions.js ought to be doing pretty much the same thing.

Christoph, what did you mean in comment 14 when you said that we'd have to "perform more reads because of the variable size nodes"? I think readNode() is being called exactly the same number of times.  It is probably accessing the underlying typed array more often, however.
Comment on attachment 736588 [details] [diff] [review]
ranks shorter words higher up

Cancelling the feedback requests on this patch since I've incorporated something similar into my patch that is being reviewed.
Attachment #736588 - Flags: feedback?(dflanagan)
Attachment #736588 - Flags: feedback?(anygregor)
(In reply to David Flanagan [:djf] from comment #18)
> Those times don't seem bad, but they do seem to be worse than without this
> patch.  I'm not sure why.  The readNode() function is more complicated but
> it is reading fewer bytes so I wouldn't expect it to make much difference.
> Maybe reading bytes instead of 16-bit words is slower?

Yes, that's what I am thinking. It's probably a good idea to do some loop-unrolling style optimization: load the _dict into a Uint16Array and fiddle out the bits we need. I guess it's worth a try.

> 
> I don't think that the changes to the dictionary format have enlarged the
> search space at all, so predictions.js ought to be doing pretty much the
> same thing.

The search space is definitely the same, but as mentioned above, we have to perform more reads from the typed array, which slows things down.

> Christoph, what did you mean in comment 14 when you said that we'd have to
> "perform more reads because of the variable size nodes"? I think readNode()
> is being called exactly the same number of times.  It is probably accessing
> the underlying typed array more often, however.

Yes, that's exactly what I meant, more reads from the typed array. As I said, let's try to load it into a Uint16 Array, maybe even a Uint32Array and perform bit shift operations to read the node values.
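To illustrate what I mean, a purely hypothetical sketch: the field layout below is invented for illustration and is not the real dictionary format, and the buffer name is a placeholder.

// Hypothetical: read one aligned 32-bit word per node and peel the fields
// off with shifts and masks instead of several single-byte reads.
var dictBuffer = new ArrayBuffer(64);      // stand-in for the real dictionary blob
var dict32 = new Uint32Array(dictBuffer);  // assumes nodes padded to 4-byte boundaries

function readNode32(index) {
  var word = dict32[index];
  return {
    ch: word >>> 24,                  // example: 8-bit character code
    frequency: (word >>> 19) & 0x1f,  // example: 5-bit frequency
    nextOffset: word & 0x7ffff        // example: 19-bit offset of the next node
  };
}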
David, since you are already touching the predictions.js file, I found a bug that goes nicely with this patch. It's a minor performance optimization too: every key holds itself in its list of surrounding keys, which causes unnecessary double insertions of candidates. See the following patch:

@@ -87,6 +87,8 @@ var Predictions = function() {
       var list = {};
       for (var m = 0; m < keyArray.length; ++m) {
         var key2 = keyArray[m];
+        if (key1.code == key2.code)
+          continue;
         if (SpecialKey(key2))
           continue;
         if (SquaredDistanceToEdge(/* key dimensions */
(In reply to David Flanagan [:djf] from comment #17)
> Here are my performance timings for 3 trials of typing "restaurant":
> 
> R 59 65 83
> Re 39 17 44
> Res 39 40 29
> Rest 27 30 38
> Resta 24 25 32
> Restau 31 23 37
> Restaur 11 21 51
> Restaura 23 54 68
> Restauran 34 40 53
> Restaurant 13 24 29

David, did you always perform an 'adb reboot' before recording these numbers? In my previous evaluation of the engine I always rebooted the phone so we measure time on cold code, before any optimizations are performed. I reran your tests on my phone:

I/Gecko   (  108): r 166 ms
I/Gecko   (  108): re 62 ms
I/Gecko   (  108): res 63 ms
I/Gecko   (  108): rest 46 ms
I/Gecko   (  108): resta 36 ms
I/Gecko   (  108): restau 42 ms
I/Gecko   (  108): restaur 31 ms
I/Gecko   (  108): restaura 61 ms
I/Gecko   (  108): restauran 66 ms
I/Gecko   (  108): restaurant 47 ms

I/Gecko   (  108): r 162 ms
I/Gecko   (  108): re 61 ms
I/Gecko   (  108): res 79 ms
I/Gecko   (  108): rest 42 ms
I/Gecko   (  108): resta 52 ms
I/Gecko   (  108): restau 48 ms
I/Gecko   (  108): restaur 33 ms
I/Gecko   (  108): restaura 54 ms
I/Gecko   (  108): restauran 70 ms
I/Gecko   (  108): restaurant 21 ms

I performed several trial runs with similar results. The most interesting fact is that it takes more than 140ms until a result is displayed for the first entered character, in this case 'r'.
Josh mentioned that 140ms is the upper bound for results to appear on the display so that a user still perceives the keyboard as operating smoothly.
I guess we should work on the readNode function to speed things up.
Christoph,

I'll make the prediction.js changes you suggest above.

And I'm also going to modify the xml2dict code so that it doesn't scale the frequency information to 5 bits until the very end. I'll get better sorting by doing that at the end instead of the start.

My preference is to land this as is, and try to improve performance in a separate bug.  (I'm not worried about the cold start time, as long as performance is good enough at other times.)  Is the Uint32 bit-shifting code something you'd like to work on yourself?
David, that sounds good to me. Land the patch, and I can help you from there: we can try loading the dictionary into a Uint16Array and also a Uint32Array and see how it performs.
Attachment #736879 - Flags: review?(mozilla) → review+
Christoph,

I'm going to delay landing this a bit longer because I think I'm going to want to change the dictionary format again for bug 860538 and I don't want to push tens of megabytes of dictionary files to github while there is a workweek going on and hundreds of engineers with constrained bandwidth.

So feel free to work on the performance issue using the current version of the patch even before I land it.
Also, I'd like to figure out why I get a different dictionary file length each time I run it before I land it. That smells like a potential bug.
(In reply to David Flanagan [:djf] from comment #25)
> Christoph,
> 
> I'm going to delay landing this a bit longer because I think I'm going to
> want to change the dictionary format again for bug 860538 and I don't want
> to push tens of megabytes of dictionary files to github while there is a
> workweek going on and hundreds of engineers with constrained bandwidth.

That makes sense.

> So feel free to work on the performance issue using the current version of
> the patch even before I land it.

I will work on that during the week.
(In reply to David Flanagan [:djf] from comment #26)
> Also, I'd like to figure out why I get a different dictionary file length
> each time I run it before I land it. That smells like a potential bug.

That really smells like a potential bug, I will also investigate this.
(In reply to Christoph Kerschbaumer from comment #28)
> (In reply to David Flanagan [:djf] from comment #26)
> > Also, I'd like to figure out why I get a different dictionary file length
> > each time I run it before I land it. That smells like a potential bug.
> 
> That really smells like a potential bug, I will also investigate this.

I've fixed that by sorting the nodes by character code first, and then by frequency. I'm not sure why the order wasn't stable before, but now it is.
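For illustration, the ordering is equivalent to something like this (JS sketch; the real sort lives in the Python build script xml2dict.py, and the frequency direction shown is illustrative):

var nodes = [
  { ch: 98, frequency: 31 },  // 'b'
  { ch: 97, frequency: 12 },  // 'a', low frequency
  { ch: 97, frequency: 31 }   // 'a', high frequency
];

// Sort by character code first, then by frequency, so ties no longer depend
// on the input order and repeated builds produce byte-identical output.
nodes.sort(function(a, b) {
  return (a.ch - b.ch) || (b.frequency - a.frequency);
});
// -> [{ch:97, frequency:31}, {ch:97, frequency:12}, {ch:98, frequency:31}]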
I've pushed an update to the pull request that includes the fix above, plus Christoph's fix from comment 21.  Also, it removes the code that does special node sorting for diacritics and uppercase/lowercase variants. (I may have to restore that once I understand the search algorithm more thoroughly)

Finally, this update adds a table of letters and their frequencies at the end of the dictionary. This will allow the search algorithm to be smart about diacritics and apostrophes, I think.

Only the english dictionary is updated in the pull request, so it is not ready to land.
Peter Kankowski suggests a bug fix here: https://github.com/kankowski/gaia/commit/9e98f39781d065886e36cd610413e77d14917b83

I've updated the pull request again with a variant on this fix.
With Peter's and Christoph's optimizations landed, predictions are now much faster than they were in comment 17. Christoph: it may no longer be necessary to use uint32 with bit shifting in readNode. We might be fast enough.
I'd like to land this patch now and then, in a separate bug, work on the JS implementation of the dictionary search algorithm.  But that is likely to be an iterative process, and I may end up needing to alter the dictionary format again.

Since everyone is at a workweek with limited network connectivity, I don't want to push 10s of megabytes of dictionary files to github multiple times.

So I'm going to expand the scope of this bug and its pull request to include follow-up changes to the search algorithm. When it is ready, we should be able to close multiple bugs with one patch.
(In reply to David Flanagan [:djf] from comment #32)
> With Peter's and Christoph's optimizations landed, predictions are now much
> faster than they were in comment 17. Christoph: it may no longer be
> necessary to use uint32 with bit shifting in readNode. We might be fast
> enough.

I will do some performance testing tonight. But looks good to me!
Adding whiteboard tags for tracking via srumbu.gs.
Whiteboard: c=keyboard
Preemptively asking for leo+ on all the bugs blocking the keyboard auto-correct umbrella bug 797170.
blocking-b2g: --- → leo?
Blocks: 865484
No longer blocks: 797170
Seems like a product/UX decision whether we need auto-correct in 1.1.
Flags: needinfo?(ffos-product)
Chris is driving keyboard improvement decisions.
Flags: needinfo?(ffos-product) → needinfo?(clee)
The patch I attached to this bug turned into something bigger in bug 865484. The patch has landed there, so I'm closing this one now, too.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: c=keyboard → c=auto-suggest
Tracking, since this change is wanted by UX but the quality of the fix needs to be assessed before a decision to uplift. See below:

On 5/7/13 12:03 AM, Christopher Lee wrote:

    Thanks Francis. 

    Sounds like you and Josh are going to review the latest patches from David tomorrow and provide an assessment on the quality.  

    I'm also going to get a build tomorrow and evaluate the latest changes and we can collectively make a call on if it's ready for uplift.
blocking-b2g: leo? → -
tracking-b2g18: --- → +
Uplifted to v1-train in bug 873934