Closed Bug 1747265 Opened 10 months ago Closed 9 months ago

Ion instruction reordering very slow on Emscripten 3 code (was: Wasm built with Emscripten >2.0 is 8 times slower to compile than WASM built with Emscripten 1.40.1 on Android phones RAM <=6GB)

Categories

(Core :: JavaScript: WebAssembly, defect, P2)

Unspecified
Android
defect

Tracking

()

RESOLVED FIXED
97 Branch
Tracking Status
firefox97 --- fixed

People

(Reporter: kbrosnan, Assigned: jandem)

Details

Attachments

(4 files)

From github: https://github.com/mozilla-mobile/fenix/issues/22947.

with firefox defaut profile, WebAssembly.compile time of built with Emscripten >2.0 is 8,9s (em3.0.1.wasm in tha attachment) ,while WASM built with Emscripten 1.40.1 is 1,2s (em1.40.1.wasm in the attachment), tested on Xiaomi K30(Ram= 6GB, 8 cores), firefox 96.0.0 beta.
wasm.zip
Then I observed the CPU used of wasm (Emscripten >2.0) compilation is 48%, I guess wasm (Emscripten >2.0) compilation maybe only use one core.
I set javascript.options.wasm_optimizingjit false and compile time for both the wasms is about 140ms compared to 100ms for Chromium 94. I saw a "wasm_verbose" switch, is there a relevant guidance document link?

Refer to * When wasm bytecode arrives, we choose the compilation strategy based on * switches and on aspects of the code and the hardware. If switches allow * tiered compilation to happen (the normal case), the following logic applies. * * If the code is sufficiently large that tiered compilation would be beneficial * but not so large that it might blow our compiled code budget and make * compilation fail, we choose tiered compilation. Otherwise we go straight to * optimized code.
it seems with default profile, the compile goes to "optimized code". is this reasonable?

  • Android device: SAMSUNG A51 (RAM 4G),Xiaomi K30(Ram= 6GB),etc
  • Fenix version: firefox android 96.0.0 beta

Change performed by the Move to Bugzilla add-on.

Component: JavaScript Engine → Javascript: WebAssembly

This could indicate:

  • tiering heuristics shunt 2.0 content over to ion directly instead of via baseline, which could happen if the 2.0 content is smaller and falls just below the limit while the 1.40 content was just over the limit
  • background compiler gets only one compile thread, which could happen because i believe we're now using the gecko scheduler
  • tiering logic could make its decision based on a faulty assumption about how many threads it is going to get

We should also try to figure out if the ram size has anything to do with it; it could, if the scheduler takes memory size into account.

"More data needed". We should prioritize this.

Severity: -- → S2
Priority: -- → P2
Attached file wasm.zip

I think the problem may be something else. Ion takes >3x longer to compile the 3.0 content than the 1.40 content on the foreground thread and >10x longer on background threads. On my x64 desktop system with this morning's mozilla-central,

Running on foreground thread only:

$ ~/m-u/obj-release/dist/bin/js --wasm-compiler=ion --no-threads em1.40.1.wasm.js
1451.291015625ms

$ ~/m-u/obj-release/dist/bin/js --wasm-compiler=ion --no-threads em3.0.1.wasm.js
5109.98388671875ms

Running multithreaded:

$ ~/m-u/obj-release/dist/bin/js --wasm-compiler=ion em1.40.1.wasm.js
349.660888671875ms

$ ~/m-u/obj-release/dist/bin/js --wasm-compiler=ion em3.0.1.wasm.js
4049.534912109375ms

Obviously this will have a profound impact on everything. The fact that multithreaded compilation slows down so much suggests that Ion gets stuck in one function somewhere and so the slowest background thread determines the runtime of the whole thing. I think focussing on the foreground compilation case first makes sense.

Attached file em3.0.1.wasm.js
Attached file em1.40.1.wasm.js

Perf says 65% of the time is spent in ReorderInstructions, mostly in MoveBefore, and indeed when disabling that pass the running time drops back down to what it should be.

Nothing interesting seems to have been done with that pass in quite some time; Jan added an early exit check this summer, but nothing since. We should try to bisect for the sake of having done it but I expect that what happens here is that emscripten 3 emits code that triggers nonlinear behavior in code that uses MoveBefore - MoveBefore has an embedded loop and itself is called from a loop, so maybe that's it, or maybe it's more complex.

Summary: Wasm built with Emscripten >2.0 is 8 times slower to compile than WASM built with Emscripten 1.40.1 on Android phones RAM <=6GB → Ion instruction reordering very slow on Emscripten 3 code (was: Wasm built with Emscripten >2.0 is 8 times slower to compile than WASM built with Emscripten 1.40.1 on Android phones RAM <=6GB)

It seems to be mostly from the code we have for moving constants, this can be O(n^2) and this is very slow for large blocks. I think we can rewrite this to be linear.

Flags: needinfo?(jdemooij)

Instruction reordering has code to move constants with a single use to the start of
the block, to allow more instruction reordering. This uses MoveBefore, which renumbers
instructions appropriately but unfortunately is also quadratic.

We can fix this by moving these constants before we do the initial renumbering.
On certain Wasm workloads this pass moves the same number of constants (and other
instructions) as before, but is now much faster.

Moving constants up this eagerly is a bit questionable and at some point we might want
to revisit the reordering pass completely, but for now this is a pretty simple rewrite
that fixes a bad perf cliff.

Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
Flags: needinfo?(jdemooij)
Pushed by jdemooij@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fd0ef0d3b8d3
Fix quadratic behavior when moving constants as part of instruction reordering. r=iain
Status: ASSIGNED → RESOLVED
Closed: 9 months ago
Resolution: --- → FIXED
Target Milestone: --- → 97 Branch
You need to log in before you can comment on or make changes to this bug.