Closed Bug 773366 Opened 12 years ago Closed 6 years ago

B2G: benchmark for predictive Text

Categories

(Core :: JavaScript Engine, defect)

Hardware: ARM
OS: All
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED WONTFIX

People

(Reporter: ckerschb, Unassigned)

References

Details

(Whiteboard: [js:t])

Attachments

(2 files, 1 obsolete file)

Attached file pt_bug.zip (obsolete)
User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0
Build ID: 20120423122928

Steps to reproduce:

Porting the native part of Android's predictive text to JS using Emscripten. The JS code runs too slow on the phone (see results.txt, attached).
The attached benchmark calls getSuggestions(), passing in one-letter words, two-letter words, three-letter words, and so on. In the first run the dictionary contains 10,000 words (66.22 KB, 67,816 bytes); in the second run it contains only 1 word (16 bytes). Dictionary size correlates directly with performance. The dictionary is loaded into a Uint8Array.
Any suggestions for optimizations?

ptLibrary.js is the Emscripten-generated code. I also include the native code of Android's version and a shell script that compiles the native code with Emscripten (you have to adapt the path to Emscripten in the script).
cc'ing marty for ARM expertise.

Christoph, I think it would be interesting to benchmark this on nightly (16) or even IonMonkey.
OS: Linux → All
Hardware: x86_64 → ARM
We are using trunk builds on otoro.
First we should figure out why we spend 50 ms on a round trip with a dictionary containing a single word.
Here are the first benchmark results: calling getSuggestions(), but skipping the actual library call. In other words, this is the round-trip time just for allocating/freeing memory in the Emscripten heap and copying memory from the JS heap into the Emscripten heap.

   TEST                   RUNS   TOTAL     MAX       MIN     AVG
   alphabet                 26   359 ms    32 ms     10 ms   13.81 ms
   two letter words         26   312 ms    28 ms     10 ms   12.00 ms
   three letter words       26   293 ms    14 ms     10 ms   11.27 ms
   four letter words        26   319 ms    22 ms     10 ms   12.27 ms
   five letter words        26   307 ms    21 ms     10 ms   11.81 ms
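
For concreteness, a minimal sketch of the round trip being measured, assuming the standard Emscripten runtime exports (Module._malloc, Module._free, and the HEAP32 view); the actual wiring in ptLibrary.js may differ:

function roundTrip(jsArray) {
  // Allocate space in the Emscripten heap (4 bytes per Uint32 element).
  var ptr = Module._malloc(jsArray.length * 4);
  // Copy from the JS heap into the Emscripten heap.
  for (var i = 0; i < jsArray.length; i++) {
    Module.HEAP32[(ptr >> 2) + i] = jsArray[i];
  }
  // The actual library call is skipped in this measurement.
  Module._free(ptr);
}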
Can you elaborate on what goes on specifically? How much memory is alloc/free'd, how much is copied from JS into the emscripten heap, and how much is transferred to the worker and back?
We modified the code to allocate memory in the Emscripten heap for each array only once and reuse it in subsequent calls. However, we still have to copy 5 JS arrays into the Emscripten heap every time we call getSuggestions(), and the array sizes of course change.
We tried to measure only the time for copying the JS arrays into the Emscripten heap; it takes 7 ms (on the phone) for Uint32Arrays of sizes 0, 0, 16, 128, and 8.

@Alon: is this the fastest way to copy JS arrays into the emHEAP?

function JSArrayToEmscriptenInt32Array(dstAddress, srcArray) {
  // dstAddress is a byte offset into the Emscripten heap; shift right
  // by 2 to index the Int32 view, then copy element by element.
  for (var i = 0; i < srcArray.length; i++) {
    Module.HEAP32[(dstAddress >> 2) + i] = srcArray[i];
  }
}
It would be significantly faster to use typed array .set()

https://developer.mozilla.org/en/JavaScript_typed_arrays/Int8Array#set%28%29
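
A minimal sketch of the .set() variant (a drop-in for the loop above; TypedArray.prototype.set accepts both plain arrays and typed arrays):

function JSArrayToEmscriptenInt32Array(dstAddress, srcArray) {
  // One bulk copy instead of a per-element loop; dstAddress is a byte
  // offset, so shift right by 2 to get the Int32 element index.
  Module.HEAP32.set(srcArray, dstAddress >> 2);
}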
I created the same benchmark that we use in the browser in native code. See the numbers below:

[1] (100,000 word dict, 615,554 bytes)
RUN   CUR            JS         NATIVE
  1   w              1,078 ms   35 ms 
  2   wa             1,473 ms   15 ms
  3   was              893 ms   14 ms
  4   wash             353 ms    1 ms
  5   washi            290 ms    4 ms
  6   washin           222 ms    1 ms
  7   washing          201 ms    1 ms
  8   washingt         197 ms    1 ms
  9   washingto        207 ms    5 ms
 10   washington       211 ms    5 ms

[2] (10,000 word dict, 69,780 bytes)
RUN   CUR            JS          NATIVE
  1   w              556 ms      13 ms
  2   wa             431 ms       3 ms
  3   was            283 ms       2 ms
  4   wash           138 ms       1 ms
  5   washi          110 ms       1 ms
  6   washin         109 ms      12 ms
  7   washing         76 ms       1 ms
  8   washingt        84 ms       2 ms
  9   washingto       85 ms       1 ms
 10   washington      85 ms       5 ms

[3] (1 word dict, 16 bytes)
RUN   CUR            JS        NATIVE
  1   w              12 ms     0 ms
  2   wa             11 ms     0 ms
  3   was             3 ms     0 ms
  4   wash            2 ms     0 ms
  5   washi          48 ms     0 ms
  6   washin          2 ms     0 ms
  7   washing         1 ms     0 ms
  8   washingt        2 ms     0 ms
  9   washingto      25 ms     0 ms
 10   washington      2 ms     0 ms
I forgot to mention: the 'NATIVE' numbers are from running the benchmark on the phone.
Alon, looks like we are 30x slower than native. Is there something we can do in the short term, or should we give up and use a native API?
Attachment #641548 - Attachment is obsolete: true
The JS itself looks ok, so 30x means we need JS engine people to look at what happens in the engine, I think.

What pops out is that in results.txt desktop browsers are 100x faster than mobile. Ignoring CPU speed differences, that still looks much too big. What differences are there between JS on desktop and mobile - is something like TI switched off?
The old benchmark in results.txt is not completely representative (that's why I took it out). Accurate, up-to-date measurements can be found in comment 7.

I ran the benchmark again with 'javascript.options.typeinference' set to 'true'. Same results! Are there any other options I missed setting to true or false?
JS run as chrome doesn't have TI run on it.  There's also a problem where code isn't JITted if you have the Gecko Profiler add-on turned on (bug 774330, soon to be fixed).  I don't know if either of those could be a factor here.  Probably not, but I thought I'd throw it out there just in case...
bhackett mentioned that turning on TI for chrome scripts is blocked by chrome scripts not being compile-and-go. Apparently this is a fairly small change now that CPG has landed, so it might be worth looking into that.

(I can't find a bug for either enabling CNG or TI for chrome scripts, unfortunately.)
The current benchmark is a regular web page, so it should not be affected by limitations of chrome scripts, but I assume it would need to be a chrome script at some point.
This code lives in a "keyboard app", which is normal web content.
Attached file html benchmark
After looking at one of the web benchmarks a bit with the JIT inspector (plug: https://addons.mozilla.org/addon/jit-inspector/) there are a couple things that stand out.

- First, there are several fairly hot bitops with boolean inputs, for which we generate stub calls.  Patch fixing that coming up in a bit.

- Second, there is a hot function Cd (in ptLibrary.js in the web test) with a ton of common subexpressions like 'p[b + 28 * u[c] + X]', where 'b + 28 * u[c]' is repeated about 15 times.  Doing the CSE by hand (sketched below) improves things a bit, though nowhere near as much as fixing the stub call issue.  This CSE can't be done by either JM or IM (the reads of u[c] are interspersed with writes to the same typed array), so some optimistic optimization in emscripten might be handy.  Dunno how the original C looks.  The Cd function contains a ton of x[v + i]-type accesses which bug 771864 and bug 771835 (currently on hiatus) could generate better code for.
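
To illustrate the hand-CSE (p, b, u, c are the minified names from above; the offsets and the values being stored are made up for the example):

// Before: 'b + 28 * u[c]' is recomputed for every access, and the JIT
// cannot hoist it because interspersed writes might change u[c].
p[b + 28 * u[c] + 1] = x1;
p[b + 28 * u[c] + 2] = x2;

// After: hoist the common subexpression by hand; safe only because the
// author knows the intervening writes never touch u[c].
var base = b + 28 * u[c];
p[base + 1] = x1;
p[base + 2] = x2;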

Anyways, times I get in the shell test:

Before:

alphabet, 26 runs, total: 105ms, max: 35 ms, min: 1ms, avg: 4.04 ms
two letter words, 26 runs, total: 94ms, max: 21 ms, min: 1ms, avg: 3.62 ms
three letter words, 26 runs, total: 67ms, max: 14 ms, min: 1ms, avg: 2.58 ms
four letter words, 26 runs, total: 48ms, max: 4 ms, min: 1ms, avg: 1.85 ms
five letter words, 26 runs, total: 48ms, max: 5 ms, min: 1ms, avg: 1.85 ms

With boolean bitops fixed:

alphabet, 26 runs, total: 88ms, max: 34 ms, min: 1ms, avg: 3.38 ms
two letter words, 26 runs, total: 68ms, max: 20 ms, min: 1ms, avg: 2.62 ms
three letter words, 26 runs, total: 53ms, max: 14 ms, min: 1ms, avg: 2.04 ms
four letter words, 26 runs, total: 38ms, max: 5 ms, min: 1ms, avg: 1.46 ms
five letter words, 26 runs, total: 38ms, max: 5 ms, min: 1ms, avg: 1.46 ms

With bitops + CSE fixed:

alphabet, 26 runs, total: 85ms, max: 34 ms, min: 1ms, avg: 3.27 ms
two letter words, 26 runs, total: 65ms, max: 19 ms, min: 1ms, avg: 2.50 ms
three letter words, 26 runs, total: 52ms, max: 13 ms, min: 1ms, avg: 2.00 ms
four letter words, 26 runs, total: 38ms, max: 4 ms, min: 1ms, avg: 1.46 ms
five letter words, 26 runs, total: 36ms, max: 4 ms, min: 1ms, avg: 1.38 ms

So an improvement between 25% and 40%.

I'm confused by the comments in this bug.  How do the speed differences between JS and native compare for desktop vs. mobile?  Performance differences between the platforms do not necessarily mean something drastic like TI being turned off; in some cases different operations are stubbed on different platforms, e.g. mod ops (%) are always stubbed on ARM.
Depends on: 774812
Thanks bhackett!

Yes, looks like Cd could be optimized quite a bit. We might need to do that manually, since it looks like LLVM could not do the CSE (Christoph, this was compiled with -O2 -s INLINING_LIMIT=0, correct?).

Manually optimizing this is not hard. The first step is finding the actual function name. It might be obvious from its distinctive shape in the C++ (it does a lot of 8-bit writes); if not, either compile a version without Closure and find the same shape in the JS, or profile that version.

Next, add that function to IGNORED_FUNCTIONS (same syntax as EXPORTED_FUNCTIONS), so the compiled C version will be ignored.

Then write the hand-optimized version in a JS file and use emcc's --post-js option to bundle that file with the compiled code. The bundling is done *before* Closure runs, so you can write the hand-optimized version normally and do not need to guess at the minified names of HEAP etc.
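
As a rough sketch of what that --post-js file could contain (the name _getWordSuggestions is hypothetical; use whatever name you added to IGNORED_FUNCTIONS):

// hand_optimized.js, bundled with: emcc ... --post-js hand_optimized.js
// This is prepended before Closure runs, so Module, HEAP32, etc. can be
// referenced by their normal names.
function _getWordSuggestions(dictPtr, inputPtr, outPtr) {
  // hand-optimized replacement for the compiled C body goes here
}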

As for the manual optimization itself: aside from the CSE, a lot of those writes store sequential bytes individually. It seems LLVM could not prove alignment, so we don't bundle them into faster 32-bit writes. It might be worth doing a runtime alignment check and having two code paths in this function, assuming (as seems likely) that the data is 4-byte aligned here.
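
A minimal sketch of the two-path idea (assuming little-endian byte order, which is how Emscripten's HEAP views are laid out; the function name is made up):

function write4Bytes(ptr, a, b, c, d) {
  if ((ptr & 3) === 0) {
    // 4-byte aligned: bundle the four byte writes into one 32-bit write.
    Module.HEAP32[ptr >> 2] = a | (b << 8) | (c << 16) | (d << 24);
  } else {
    // Unaligned fallback: keep the individual byte writes.
    Module.HEAPU8[ptr]     = a;
    Module.HEAPU8[ptr + 1] = b;
    Module.HEAPU8[ptr + 2] = c;
    Module.HEAPU8[ptr + 3] = d;
  }
}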
Thanks bhackett and alon!

So there is nothing big standing out that could bring our 200 - 800msec overhead down to a max response time of around 50msec. :(
Christoph will be working on the native implementation in the next few days.
(In reply to Gregor Wagner [:gwagner] from comment #20)
> So there is nothing big standing out that could bring our 200 - 800msec
> overhead down to a max response time of around 50msec. :(

I don't know that there isn't. The big open question is why ARM performance is so much worse than x86: we are seeing just a few ms on x86, and even if ARM were 10x slower, that should still get us into the 50 ms range.

I cc'd and pinged mjrosenberg about this; we might want to wait on that before giving up on the JS version. We might just be missing some specific optimization not yet implemented on ARM.

Also, I would benchmark this on IonMonkey, since I assume that is what will soon matter.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Whiteboard: [js:t]
Assignee: general → nobody
Mass closing as we are no longer working on b2g/firefox os.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX