Closed Bug 773366 Opened 12 years ago Closed 6 years ago

B2G: benchmark for predictive Text

Categories

(Core :: JavaScript Engine, defect)

Hardware: ARM
OS: All
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED WONTFIX

People

(Reporter: ckerschb, Unassigned)

References

Details

(Whiteboard: [js:t])

Attachments

(2 files, 1 obsolete file)

Attached file pt_bug.zip (obsolete)
User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0
Build ID: 20120423122928

Steps to reproduce:

Porting the native part of Android's predictive text to JS using Emscripten. The JS code runs too slow on the phone (see results.txt, attached).
The attached benchmark calls getSuggestions(), passing in one-letter words, two-letter words, three-letter words, and so on. In the first run the dictionary contains 10,000 words (66.22 KB, 67,816 bytes); in the second run it contains only 1 word (16 bytes). Dictionary size correlates directly with performance. The dictionary is loaded into a Uint8Array.
Any suggestions for optimizations?

ptLibrary.js is the Emscripten-generated code. I also include the native code of Android's version and a shell script that compiles the native code with Emscripten (you have to adapt the path to Emscripten in the script).
cc'ing marty for ARM expertise.

Christoph, I think it would be interesting to benchmark this on nightly (16) or even IonMonkey.
OS: Linux → All
Hardware: x86_64 → ARM
We are using trunk builds on otoro.
First we should figure out why we spend 50 ms on a round trip with a dictionary containing a single word.
Here are the first benchmark results: calling getSuggestions(), but skipping the actual library call. In other words, this is the round-trip time just for allocating/freeing memory in the Emscripten heap and copying memory from the JS heap into the Emscripten heap.

   TEST                   RUNS   TOTAL     MAX       MIN     AVG
   alphabet                 26   359 ms    32 ms     10 ms   13.81 ms
   two letter words         26   312 ms    28 ms     10 ms   12.00 ms
   three letter words       26   293 ms    14 ms     10 ms   11.27 ms
   four letter words        26   319 ms    22 ms     10 ms   12.27 ms
   five letter words        26   307 ms    21 ms     10 ms   11.81 ms
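
For concreteness, a minimal sketch of the round trip being measured, assuming the standard Emscripten runtime exports (Module._malloc, Module._free, and the HEAP32 view); the actual wiring in ptLibrary.js may differ:

function roundTrip(jsArray) {
  // Allocate space in the Emscripten heap (4 bytes per Uint32 element).
  var ptr = Module._malloc(jsArray.length * 4);
  // Copy from the JS heap into the Emscripten heap.
  for (var i = 0; i < jsArray.length; i++) {
    Module.HEAP32[(ptr >> 2) + i] = jsArray[i];
  }
  // The actual library call is skipped in this measurement.
  Module._free(ptr);
}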
Can you elaborate on what goes on specifically? How much memory is alloc/free'd, how much is copied from JS into the emscripten heap, and how much is transferred to the worker and back?
We modified the code to allocate memory in the Emscripten heap for each array only once and reuse it in subsequent calls. However, we still have to copy 5 JS arrays into the Emscripten heap every time we call getSuggestions(), and the array sizes of course change.
We tried to measure only the time for copying the JS arrays into the Emscripten heap; it takes 7 ms (on the phone) for Uint32Arrays of sizes 0, 0, 16, 128, and 8.

@Alon: is this the fastest way to copy JS arrays into the emHEAP?

function JSArrayToEmscriptenInt32Array(dstAddress, srcArray) {
  // dstAddress is a byte offset into the Emscripten heap; shift right
  // by 2 to index the Int32 view, then copy element by element.
  for (var i = 0; i < srcArray.length; i++) {
    Module.HEAP32[(dstAddress >> 2) + i] = srcArray[i];
  }
}
It would be significantly faster to use typed array .set()

https://developer.mozilla.org/en/JavaScript_typed_arrays/Int8Array#set%28%29
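
A minimal sketch of the .set() variant (a drop-in for the loop above; TypedArray.prototype.set accepts both plain arrays and typed arrays):

function JSArrayToEmscriptenInt32Array(dstAddress, srcArray) {
  // One bulk copy instead of a per-element loop; dstAddress is a byte
  // offset, so shift right by 2 to get the Int32 element index.
  Module.HEAP32.set(srcArray, dstAddress >> 2);
}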
I created the same benchmark that we use in the browser in native code. See the numbers below:

[1] (100,000 word dict, 615,554 bytes)
RUN   CUR            JS         NATIVE
  1   w              1,078 ms   35 ms 
  2   wa             1,473 ms   15 ms
  3   was              893 ms   14 ms
  4   wash             353 ms    1 ms
  5   washi            290 ms    4 ms
  6   washin           222 ms    1 ms
  7   washing          201 ms    1 ms
  8   washingt         197 ms    1 ms
  9   washingto        207 ms    5 ms
 10   washington       211 ms    5 ms

[2] (10,000 word dict, 69,780 bytes)
RUN   CUR            JS          NATIVE
  1   w              556 ms      13 ms
  2   wa             431 ms       3 ms
  3   was            283 ms       2 ms
  4   wash           138 ms       1 ms
  5   washi          110 ms       1 ms
  6   washin         109 ms      12 ms
  7   washing         76 ms       1 ms
  8   washingt        84 ms       2 ms
  9   washingto       85 ms       1 ms
 10   washington      85 ms       5 ms

[3] (1 word dict, 16 bytes)
RUN   CUR            JS        NATIVE
  1   w              12 ms     0 ms
  2   wa             11 ms     0 ms
  3   was             3 ms     0 ms
  4   wash            2 ms     0 ms
  5   washi          48 ms     0 ms
  6   washin          2 ms     0 ms
  7   washing         1 ms     0 ms
  8   washingt        2 ms     0 ms
  9   washingto      25 ms     0 ms
 10   washington      2 ms     0 ms
I forgot to mention: the 'NATIVE' numbers are from running the benchmark on the phone.
Alon, looks like we are 30x slower than native. Is there something we can do in the short term, or should we give up and use a native API?
Attachment #641548 - Attachment is obsolete: true
The JS itself looks ok, so 30x means we need JS engine people to look at what happens in the engine, I think.

What pops out is that in results.txt desktop browsers are 100x faster than mobile. Ignoring CPU speed differences, that still looks much too big. What differences are there between JS on desktop and mobile - is something like TI switched off?
The old benchmark in results.txt is not completely representative (that's why I took it out). Accurate, up-to-date measurements can be found in comment 7.

I ran the benchmark again with 'javascript.options.typeinference' set to 'true'. Same results! Are there any other options I missed setting to true or false?
JS run as chrome doesn't have TI run on it.  There's also a problem where code isn't JITted if you have the Gecko Profiler add-on turned on (bug 774330, soon to be fixed).  I don't know if either of those could be a factor here.  Probably not, but I thought I'd throw it out there just in case...
bhackett mentioned that turning on TI for chrome scripts is blocked by chrome scripts not being compile-and-go. Apparently this is a fairly small change now that CPG has landed, so it might be worth looking into that.

(I can't find a bug for either enabling CNG or TI for chrome scripts, unfortunately.)
The current benchmark is a regular web page, so it should not be affected by limitations of chrome scripts, but I assume it would need to be a chrome script at some point.
This code lives in a "keyboard app", which is normal web content.
Attached file html benchmark
After looking at one of the web benchmarks a bit with the JIT inspector (plug: https://addons.mozilla.org/addon/jit-inspector/) there are a couple things that stand out.

- First, there are several fairly hot bitops with boolean inputs, for which we generate stub calls.  Patch fixing that coming up in a bit.

- Second, there is a hot function Cd (in ptLibrary.js in the web test) with a ton of common subexpressions like 'p[b + 28 * u[c] + X]', where 'b + 28 * u[c]' is repeated about 15 times.  Doing the CSE by hand (sketched below) improves things a bit, though nowhere near as much as fixing the stub call issue.  This CSE can't be done by either JM or IM (the reads of u[c] are interspersed with writes to the same typed array), so some optimistic optimization in emscripten might be handy.  Dunno how the original C looks.  The Cd function contains a ton of x[v + i]-type accesses which bug 771864 and bug 771835 (currently on hiatus) could generate better code for.
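
To illustrate the hand-CSE (p, b, u, c are the minified names from above; the offsets and the values being stored are made up for the example):

// Before: 'b + 28 * u[c]' is recomputed for every access, and the JIT
// cannot hoist it because interspersed writes might change u[c].
p[b + 28 * u[c] + 1] = x1;
p[b + 28 * u[c] + 2] = x2;

// After: hoist the common subexpression by hand; safe only because the
// author knows the intervening writes never touch u[c].
var base = b + 28 * u[c];
p[base + 1] = x1;
p[base + 2] = x2;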

Anyways, times I get in the shell test:

Before:

alphabet, 26 runs, total: 105ms, max: 35 ms, min: 1ms, avg: 4.04 ms
two letter words, 26 runs, total: 94ms, max: 21 ms, min: 1ms, avg: 3.62 ms
three letter words, 26 runs, total: 67ms, max: 14 ms, min: 1ms, avg: 2.58 ms
four letter words, 26 runs, total: 48ms, max: 4 ms, min: 1ms, avg: 1.85 ms
five letter words, 26 runs, total: 48ms, max: 5 ms, min: 1ms, avg: 1.85 ms

With boolean bitops fixed:

alphabet, 26 runs, total: 88ms, max: 34 ms, min: 1ms, avg: 3.38 ms
two letter words, 26 runs, total: 68ms, max: 20 ms, min: 1ms, avg: 2.62 ms
three letter words, 26 runs, total: 53ms, max: 14 ms, min: 1ms, avg: 2.04 ms
four letter words, 26 runs, total: 38ms, max: 5 ms, min: 1ms, avg: 1.46 ms
five letter words, 26 runs, total: 38ms, max: 5 ms, min: 1ms, avg: 1.46 ms

With bitops + CSE fixed:

alphabet, 26 runs, total: 85ms, max: 34 ms, min: 1ms, avg: 3.27 ms
two letter words, 26 runs, total: 65ms, max: 19 ms, min: 1ms, avg: 2.50 ms
three letter words, 26 runs, total: 52ms, max: 13 ms, min: 1ms, avg: 2.00 ms
four letter words, 26 runs, total: 38ms, max: 4 ms, min: 1ms, avg: 1.46 ms
five letter words, 26 runs, total: 36ms, max: 4 ms, min: 1ms, avg: 1.38 ms

So an improvement between 25% and 40%.

I'm confused by the comments in this bug.  How do the speed differences between JS and native compare for desktop vs. mobile?  Performance differences between the platforms do not necessarily mean something drastic like TI being turned off; in some cases different operations are stubbed on different platforms, e.g. mod ops (%) are always stubbed on ARM.
Depends on: 774812
Thanks bhackett!

Yes, looks like Cd could be optimized quite a bit. We might need to do that manually, since it looks like LLVM could not do the CSE (Christoph, this was compiled with -O2 -s INLINING_LIMIT=0, correct?).

Manually optimizing this is not hard. The first step is finding the actual function name. It might be obvious from its distinctive shape in the C++ (it does a lot of 8-bit writes); if not, either compile a version without Closure and find the same shape in the JS, or profile that version.

Next, add that function to IGNORED_FUNCTIONS (same syntax as EXPORTED_FUNCTIONS), so the compiled C version will be ignored.

Then write the hand-optimized version in a JS file and use emcc's --post-js option to bundle that file with the compiled code. The bundling is done *before* Closure runs, so you can write the hand-optimized version normally and do not need to guess at the minified names of HEAP etc.
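
As a rough sketch of what that --post-js file could contain (the name _getWordSuggestions is hypothetical; use whatever name you added to IGNORED_FUNCTIONS):

// hand_optimized.js, bundled with: emcc ... --post-js hand_optimized.js
// This is prepended before Closure runs, so Module, HEAP32, etc. can be
// referenced by their normal names.
function _getWordSuggestions(dictPtr, inputPtr, outPtr) {
  // hand-optimized replacement for the compiled C body goes here
}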

As for the manual optimization itself: aside from the CSE, a lot of those writes store sequential bytes individually. It seems LLVM could not prove alignment, so we don't bundle them into faster 32-bit writes. It might be worth doing a runtime alignment check and having two code paths in this function, assuming (as seems likely) that the data is 4-byte aligned here.
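
A minimal sketch of the two-path idea (assuming little-endian byte order, which is how Emscripten's HEAP views are laid out; the function name is made up):

function write4Bytes(ptr, a, b, c, d) {
  if ((ptr & 3) === 0) {
    // 4-byte aligned: bundle the four byte writes into one 32-bit write.
    Module.HEAP32[ptr >> 2] = a | (b << 8) | (c << 16) | (d << 24);
  } else {
    // Unaligned fallback: keep the individual byte writes.
    Module.HEAPU8[ptr]     = a;
    Module.HEAPU8[ptr + 1] = b;
    Module.HEAPU8[ptr + 2] = c;
    Module.HEAPU8[ptr + 3] = d;
  }
}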
Thanks bhackett and alon!

So there is nothing big standing out that could bring our 200 - 800msec overhead down to a max response time of around 50msec. :(
Christoph will be working on the native implementation in the next few days.
(In reply to Gregor Wagner [:gwagner] from comment #20)
> So there is nothing big standing out that could bring our 200 - 800msec
> overhead down to a max response time of around 50msec. :(

I don't know that there isn't. The big open question is why ARM performance is so much worse than x86: we are seeing just a few ms on x86, and even if ARM were 10x slower, that should still get us into the 50 ms range.

I cc'd and pinged mjrosenberg about this; we might want to wait on that before giving up on the JS version. We might just be missing some specific optimization not yet implemented on ARM.

Also, I would benchmark this on IonMonkey, since I assume that is what will soon matter.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Whiteboard: [js:t]
Assignee: general → nobody
Mass closing as we are no longer working on b2g/firefox os.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX