js::wasm::IonCompileFunctions slow and resource consuming on onnx 1.20.x wasm
Categories: Core :: JavaScript: WebAssembly, defect, P1
People: Reporter: tarek, Assigned: jseward
References: Depends on 3 open bugs, Blocks 3 open bugs
Whiteboard: [genai]
I am working on integrating the new Transformers.js lib v3, and it works but I noticed a huge RSS memory usage in the inference worker.
I profiled the code and found that js::wasm::IonCompileFunctions spins trying to compile some WASM. I could not do a memory profile because it locks the report in about:memory.
This is the profile I got from the profiler https://share.firefox.dev/4g6ZrQh
and to reproduce, follow these steps:
- apply patch https://phabricator.services.mozilla.com/D220519
- in about:config, set browser.ml.enable to true
- go to about:inference and select the NER preset
- run the inference and wait for the models to download and the inference to finish (you will see results in the console on that page)
That will put the inference process in the mentioned state. You can check its activity in about:processes.
Updated•16 days ago
Reporter
Comment 1•16 days ago
Notice that if I set javascript.options.wasm_optimizingjit to false, the problem is gone.
Reporter
Comment 2•16 days ago
Interesting: things seem to be faster for that lib when that pref is off.
Reporter
Comment 3•16 days ago
The wasm files used can be fetched from https://cdn.jsdelivr.net/npm/onnxruntime-web@1.20.0-dev.20240829-be76e1e1b8/dist/: ort-wasm-simd-threaded.wasm and ort-wasm-simd-threaded.jsep.wasm.
Assignee
Comment 4•15 days ago
On x86_64-linux, I managed to Ion-compile both ort-wasm-simd-threaded.wasm and
ort-wasm-simd-threaded.jsep.wasm to completion. The latter is larger and took
about 10 minutes and around 4GB of memory.
It contains some very large functions, the largest of which is 1233871 wasm
bytecode bytes, producing 388722 LIRs in 132855 basic blocks. This takes the
allocator a long time to process (several minutes), but it doesn't loop. I
imagine it will take about twice as long on an ARM64 platform, since ARM64 has
about twice as many integer registers to search through.
Assignee
Comment 5•15 days ago
(In reply to Tarek Ziadé (:tarek) from comment #2)
> Interesting: things seem to be faster for that lib when that pref is off.
If you mean the lib appears to run faster when
javascript.options.wasm_optimizingjit is set to false, I think that is
expected, because the baseline compiled code isn't competing
against Ion's register allocator for compute resources. We know that
-- especially for large inputs -- Ion's register allocator has very poor
memory locality, and it could be that the resulting avalanche of
traffic to shared parts of the memory hierarchy -- the L3 cache and
DRAM -- slows down the baseline code.
Assignee
Comment 6•15 days ago
Are you in control of the building of ort-wasm-simd-threaded.jsep.wasm?
From our point of view, the simplest "fix" would be to reduce the aggressiveness
of inlining, or whatever is causing the formation of such a huge wasm function,
so as to keep the register allocator away from such pathological behaviour.
I should add .. there are several very large functions in that file, not just one.
In order of increasing size, the top 9 sizes (in wasm bytecode bytes) are
101299, 103011, 111524, 133122, 146727, 149709, 404850, 651603, 1233871.
ort-wasm-simd-threaded.wasm also has very large functions, although
somewhat smaller than these.
Reporter
Comment 7•15 days ago
> Are you in control of the building of ort-wasm-simd-threaded.jsep.wasm?
No, but the cmake script is here: https://github.com/microsoft/onnxruntime/blob/main/cmake/onnxruntime_webassembly.cmake
If there are obvious changes we could make there, we could build our own artifacts.
> The latter is larger and took about 10 minutes and around 4GB of memory.
10 minutes and 4GB is a no-go for our users. Is it possible to provide already-compiled versions so they can skip that step?
If not, would it be possible for now to run that specific WASM with Ion deactivated, in case the fix takes a long time to land since it's upstream?
Assignee
Comment 8•15 days ago
(In reply to Tarek Ziadé (:tarek) from comment #7)
> 10 minutes and 4GB is a no-go for our users.
Oh, indeed, 10 mins / 4GB is unreasonable in any scenario.
> Is it possible to provide already-compiled versions so they can skip that step?
We don't have any (simple) way to do that, but ..
> If not, would it be possible for now to run that specific WASM with Ion deactivated, in case the fix takes a long time to land since it's upstream?
.. yeah, something like that would be easy to do. The downside is that you would get 60%-70% of the performance of Ion-compiled code. Is that acceptable? Are these .wasms performance-critical?
Reporter
Comment 9•15 days ago
> Is that acceptable? Are these .wasms performance-critical?
It is performance-critical for sure.
This is the cmake file for building them: https://github.com/microsoft/onnxruntime/blob/main/cmake/onnxruntime_webassembly.cmake
I'll reach out to the project to see if we can get some help.
Reporter
Comment 10•15 days ago
Julian, could you provide the steps you used to manually compile the WASMs? I would also be curious to run the same thing on Chrome/Chromium to compare the time it takes.
Reporter
Comment 11•15 days ago
Added this for cross-visibility: https://github.com/microsoft/onnxruntime/issues/21978
Assignee
Comment 12•15 days ago
(In reply to Tarek Ziadé (:tarek) from comment #9)
> I'll reach out to the project to see if we can get some help.
That might be worth doing, also because other wasm implementations may be similarly unhappy at having to do optimised compilation of such huge functions.
Assignee
Comment 13•15 days ago
(In reply to Tarek Ziadé (:tarek) from comment #10)
> Julian, could you provide the steps you used to manually compile the WASMs?
Put this in a file (e.g. testCompileWasm.js):
if (scriptArgs.length != 1) {
print("usage: testCompileWasm /path/to/file.wasm");
quit(0);
}
print("testCompileWasm: reading");
let b2 = os.file.readFile(scriptArgs[0], "binary");
print("testCompileWasm: compiling");
let m2 = new WebAssembly.Module(b2);
print("testCompileWasm: done " + m2);
Then run (eg)
/path/to/dist/bin/js --no-ion --no-threads --wasm-compiler=ion \
-P wasm_experimental_inline_depth_limit=0 \
-P wasm_experimental_inline_size_limit=0 \
-P wasm_experimental_inline_call_ref_threshold=0 \
testCompileWasm.js /path/to/ort-wasm-simd-threaded.jsep.wasm
If it doesn't like the -P bits, remove them.
Change --wasm-compiler=ion to --wasm-compiler=baseline as needed.
--no-ion applies only to JS; it has no effect on wasm.
--no-threads makes it a lot easier to profile/benchmark/debug.
Assignee
Comment 14•15 days ago
To build the shell, there are various ways; here is what I use
on x86_64-linux. Pretty old-fashioned I suspect, but it works.
Check out mozilla-central; then:
cd <mozilla-central>/js
mkdir BUILDX64OPT
cd BUILDX64OPT
CC="ccache clang" CXX="ccache clang++" ../src/configure --disable-debug --enable-optimize="-g -O2"
make -j8
Resulting binary should be BUILDX64OPT/dist/bin/js
Reporter
Comment 15•15 days ago
FYI - we want to land that new lib in central ASAP because it unlocks a lot of features/improvements we need; in an ideal world, before the end of September.
If we can't resolve this issue by then, it would be great to be able to disable that compilation for those libs and use baseline compilation for now, until we resolve it.
Reporter
Comment 16•14 days ago
I tried the V8 engine (version 12.7.224.16) through d8 with:
let wasm = "ort-wasm-simd-threaded.jsep.wasm";
const wasmCode = read(wasm, "binary");
const wasmModule = new WebAssembly.Module(wasmCode);
print(wasmModule);
and:
➜ compwasm /usr/bin/time -l d8 --liftoff --wasm-tier-up testCompileWasm.js
[object WebAssembly.Module]
0,04 real 0,11 user 0,01 sys
75153408 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
5038 page reclaims
3 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
314 involuntary context switches
956823949 instructions retired
286924688 cycles elapsed
60901696 peak memory footprint
It's returning the module instantly, using 70MB of RSS. Maybe I am not calling it the right way?
My understanding is that --wasm-tier-up compiles it with TurboFan and --liftoff is the baseline compilation.
Comment 17•14 days ago
The main problem in this case with the register allocator is that for each virtual register, it maintains a linked list of ranges, sorted by start position. Inserting and removing ranges is O(n), so this blows up when splitting large ranges.
If I change this data structure from a linked list to an AvlTree, it improves ort-wasm-simd-threaded.jsep.wasm from 292 seconds to 41 seconds locally. This can likely be optimized more, as my patch is a naive implementation. I'm also not sure this list really needs to be sorted; there are a few places where we depend on that, but there might be better ways to handle it. I'll look into that some more.
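To illustrate the pathology, here is a toy sketch (hypothetical names, not SpiderMonkey's actual data structure) of a linked list of ranges kept sorted by start position. Every insertion scans from the head to find its slot, so n inserts cost O(n^2) overall, which is what blows up when a huge live range is split many times; a balanced tree such as an AVL tree keyed on start would cut the lookup to O(log n).

```javascript
// Toy model of per-virtual-register live ranges kept in a linked list
// sorted by start position (illustrative sketch, not the real allocator).
class Range {
  constructor(start, end) {
    this.start = start;
    this.end = end;
    this.next = null;
  }
}

class SortedRangeList {
  constructor() { this.head = null; }

  // O(n): scans from the head to keep the list sorted by start.
  // With an AVL tree keyed on start, this lookup would be O(log n).
  insert(range) {
    let prev = null, cur = this.head;
    while (cur !== null && cur.start < range.start) {
      prev = cur;
      cur = cur.next;
    }
    range.next = cur;
    if (prev === null) this.head = range;
    else prev.next = range;
  }

  starts() {
    const out = [];
    for (let r = this.head; r !== null; r = r.next) out.push(r.start);
    return out;
  }
}

// Ranges arrive out of order (as they do during splitting), yet each
// insert pays a linear scan to keep the list sorted.
const list = new SortedRangeList();
for (const [s, e] of [[40, 50], [10, 20], [30, 35], [0, 5]]) {
  list.insert(new Range(s, e));
}
console.log(list.starts()); // [ 0, 10, 30, 40 ]
```

Splitting one large range into thousands of pieces therefore degenerates to quadratic time with the list, while a balanced tree keeps each split logarithmic.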
Comment 18•14 days ago
Now that we are tiering progressively, maybe a solution would be to never tier up extremely large functions, where we know the performance/battery-life cost of compiling with Ion would outweigh the benefit.
Comment 19•14 days ago
> Now that we are tiering progressively, maybe a solution would be to never tier up extremely large functions, where we know the performance/battery-life cost of compiling with Ion would outweigh the benefit.
Not yet. The release timeline of the experimental compilation pipeline is not clear yet, and the reporter is expecting a solution by "the end of September".
Comment 20•13 days ago
This runs in 14 seconds for me locally with some more changes. The next issue is MIR dominator-tree building. The algorithm we have doesn't scale well with large MIR graphs and I want to see if we can change that to Semi-NCA which is a more recent algorithm that's also used by LLVM.
Comment 21•11 days ago
(In reply to Jan de Mooij [:jandem] from comment #20)
> This runs in 14 seconds for me locally with some more changes. The next issue is MIR dominator-tree building. The algorithm we have doesn't scale well with large MIR graphs and I want to see if we can change that to Semi-NCA which is a more recent algorithm that's also used by LLVM.
I've reimplemented ComputeImmediateDominators using Semi-NCA and it's a large improvement: 14 seconds to 7.9 seconds for ort-wasm-simd-threaded.jsep.wasm. I also verified it produces exactly the same immediate dominators for all jit-tests and this Wasm file.
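As background on what ComputeImmediateDominators produces: a block's immediate dominator is the closest block through which every path from the entry to it must pass. The sketch below uses the simple iterative Cooper-Harvey-Kennedy algorithm, not Semi-NCA (which computes the same table but scales better on very large graphs); blocks are assumed to be numbered in reverse postorder with block 0 as the entry.

```javascript
// Sketch of immediate-dominator computation using the iterative
// Cooper-Harvey-Kennedy algorithm (NOT Semi-NCA; shown only to
// illustrate what an idom table is). preds[i] lists the predecessor
// block numbers of block i; numbering is reverse postorder, entry = 0.
function computeIdoms(preds) {
  const n = preds.length;
  const idom = new Array(n).fill(-1);
  idom[0] = 0; // the entry dominates itself

  // Walk up the (partial) dominator tree until the two paths meet.
  const intersect = (a, b) => {
    while (a !== b) {
      while (a > b) a = idom[a];
      while (b > a) b = idom[b];
    }
    return a;
  };

  let changed = true;
  while (changed) {
    changed = false;
    for (let b = 1; b < n; b++) {
      let newIdom = -1;
      for (const p of preds[b]) {
        if (idom[p] === -1) continue; // predecessor not processed yet
        newIdom = newIdom === -1 ? p : intersect(p, newIdom);
      }
      if (newIdom !== idom[b]) {
        idom[b] = newIdom;
        changed = true;
      }
    }
  }
  return idom;
}

// Diamond CFG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3. Neither 1 nor 2
// dominates 3, so 3's immediate dominator is the entry block 0.
console.log(computeIdoms([[], [0], [0], [1, 2]])); // [ 0, 0, 0, 0 ]
```

The fixed-point loop can rescan the whole graph several times on deep or irreducible graphs, which is one reason Semi-NCA (a one-pass semidominator-based scheme) wins on MIR graphs with ~130k blocks.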
After that, the next problem is that the register allocator allocates a bitmap for each basic block, with a bit for each virtual register. This scales poorly for huge graphs (in both time and memory usage). Replacing this with a HashSet per block for very large graphs improves it to 5.6 seconds.
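A rough back-of-the-envelope sketch of the dense-versus-sparse trade-off, using the graph size from comment 4 and a hypothetical per-block live count (illustrative numbers only, not measurements of the allocator):

```javascript
// Dense scheme: one bit per (basic block, virtual register) pair,
// paid even when only a handful of registers are live in a block.
function denseLivenessBytes(numBlocks, numVregs) {
  return numBlocks * Math.ceil(numVregs / 8);
}

// Sparse scheme: a hash set per block only pays for registers that
// are actually live there (bytesPerEntry is a guessed overhead).
function sparseLivenessBytes(liveCountPerBlock, bytesPerEntry = 16) {
  return liveCountPerBlock.reduce((sum, c) => sum + c * bytesPerEntry, 0);
}

// Sizes from comment 4: ~132855 blocks, ~388722 virtual registers.
// Assume (hypothetically) ~50 registers live per block on average.
const blocks = 132855;
const vregs = 388722;
const dense = denseLivenessBytes(blocks, vregs);              // several GB
const sparse = sparseLivenessBytes(new Array(blocks).fill(50)); // ~100 MB
console.log(`dense:  ${(dense / 2 ** 20).toFixed(0)} MiB`);
console.log(`sparse: ${(sparse / 2 ** 20).toFixed(0)} MiB`);
```

Under these assumptions the dense bitmaps cost gigabytes while the sparse sets stay around a hundred megabytes, which is why switching to a per-block HashSet pays off only once graphs get very large.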
These changes make Ion compilation >50x faster for this module.
A very large OpenOffice Wasm file of 200 MB improves only a little bit (6530 to 6450 ms) because this is mostly just fixing pathological cases we don't see with other Wasm modules. I also have to measure how each of these changes perform for normal size JS/Wasm MIR graphs.
Updated•9 days ago