Open Bug 1916442 Opened 16 days ago Updated 3 days ago

js::wasm::IonCompileFunctions slow and resource consuming on onnx 1.20.x wasm

Categories

(Core :: JavaScript: WebAssembly, defect, P1)


People

(Reporter: tarek, Assigned: jseward)

References

(Depends on 3 open bugs, Blocks 3 open bugs)

Details

(Whiteboard: [genai])

I am working on integrating the new Transformers.js lib v3, and it works, but I noticed huge RSS memory usage in the inference worker.

I profiled the code and found that js::wasm::IonCompileFunctions spins while compiling some Wasm. I could not capture a memory profile because it locks up the report in about:memory.

This is the profile I got from the profiler https://share.firefox.dev/4g6ZrQh
and to reproduce, follow these steps:

  • apply patch https://phabricator.services.mozilla.com/D220519
  • in about:config, set browser.ml.enable to true
  • go to about:inference and select the NER preset
  • run the inference and wait for the models to download and the inference to finish (you will see results on the console on that page)

That will put the inference process into the described state. You can check its activity in about:processes.

Blocks: 1913071

Note that if I set javascript.options.wasm_optimizingjit to false, the problem goes away.

Component: Machine Learning → JavaScript: WebAssembly

Interesting: things seem to be faster for that lib when that pref is off.

the wasm used can be picked here: https://cdn.jsdelivr.net/npm/onnxruntime-web@1.20.0-dev.20240829-be76e1e1b8/dist/

ort-wasm-simd-threaded.wasm and ort-wasm-simd-threaded.jsep.wasm

Assignee: nobody → jseward

On x86_64-linux, I managed to Ion-compile both ort-wasm-simd-threaded.wasm and
ort-wasm-simd-threaded.jsep.wasm to completion. The latter is larger and took
about 10 minutes and around 4GB of memory.

It contains some very large functions, the largest of which is 1233871 wasm
bytecode bytes, producing 388722 LIRs in 132855 basic blocks. This takes the
allocator a long time to process (several minutes), but it doesn't loop. I
imagine it will take about twice as long on an ARM64 platform, since ARM64 has
about twice as many integer registers to search through.

(In reply to Tarek Ziadé (:tarek) from comment #2)

Interesting: things seem to be faster for that lib when that pref is off.

If you mean the lib appears to run faster when
javascript.options.wasm_optimizingjit is set to false, I think that is
expected, because the baseline compiled code isn't competing
against Ion's register allocator for compute resources. We know that
-- especially for large inputs -- Ion's register allocator has very poor
memory locality, and it could be that the resulting avalanche of
traffic to shared parts of the memory hierarchy -- the L3 cache and
DRAM -- slows down the baseline code.

Are you in control of the building of ort-wasm-simd-threaded.jsep.wasm?
From our point of view, the simplest "fix" would be to reduce the aggressiveness
of inlining, or whatever is causing the formation of such a huge wasm function,
so as to keep the register allocator away from such pathological behaviour.

I should add .. there are several very large functions in that file, not just one.
In order of increasing size, the top 9 sizes (in wasm bytecode bytes) are
101299, 103011, 111524, 133122, 146727, 149709, 404850, 651603, 1233871.
ort-wasm-simd-threaded.wasm also has very large functions, although
somewhat smaller than these.

Flags: needinfo?(tziade)

Are you in control of the building of ort-wasm-simd-threaded.jsep.wasm?

No, but the cmake script is here:
https://github.com/microsoft/onnxruntime/blob/main/cmake/onnxruntime_webassembly.cmake

If there are obvious changes we could make there, we could build our own artifacts.

The latter is larger and took about 10 minutes and around 4GB of memory.

10 minutes and 4 GB is a no-go for our users. Is it possible to provide already-compiled versions so they skip that step?
If not, would it be possible for now to run that specific Wasm with Ion deactivated, in case the fix takes a long time since it's upstream?

Flags: needinfo?(tziade)

(In reply to Tarek Ziadé (:tarek) from comment #7)

10 minutes and 4 GB is a no-go for our users.

Oh, indeed, 10 mins / 4GB is unreasonable in any scenario.

Is it possible to provide already-compiled versions so they skip that step?

We don't have any (simple) way to do that, but ..

If not, would it be possible for now to run that specific Wasm with Ion deactivated, in case the fix takes a long time since it's upstream?

.. yeah, something like that would be easy to do. The downside is that you
would get roughly 60%-70% of the performance of Ion-compiled code.
Is that acceptable? Are these .wasms performance-critical?

Is that acceptable? Are these .wasms performance-critical?

It is performance critical for sure.

This is the cmake file for building them https://github.com/microsoft/onnxruntime/blob/main/cmake/onnxruntime_webassembly.cmake

I'll reach out to the project to see if we can get some help.

Julian, could you provide the steps you used to manually compile the WASMs? I would also be curious to run the same thing on Chrome/Chromium to compare how long it takes.

Flags: needinfo?(jseward)

(In reply to Tarek Ziadé (:tarek) from comment #9)

I'll reach out to the project to see if we can get some help.

That might be worth doing, also because other wasm implementations may
be just as unhappy at having to do optimised compilation of such huge
functions.

(In reply to Tarek Ziadé (:tarek) from comment #10)

Julian, could you provide the steps you used to manually compile the WASMs?

Put this in a file (eg testCompileWasm.js):

if (scriptArgs.length != 1) {
    print("usage: testCompileWasm /path/to/file.wasm");
    quit(0);
}
print("testCompileWasm: reading");
let b2 = os.file.readFile(scriptArgs[0], "binary");
print("testCompileWasm: compiling");
let m2 = new WebAssembly.Module(b2);
print("testCompileWasm: done " + m2);

Then run (eg)

 /path/to/dist/bin/js --no-ion --no-threads --wasm-compiler=ion \
   -P wasm_experimental_inline_depth_limit=0 \
   -P wasm_experimental_inline_size_limit=0 \
   -P wasm_experimental_inline_call_ref_threshold=0 \
   testCompileWasm.js /path/to/ort-wasm-simd-threaded.jsep.wasm

If it doesn't like the -P bits, remove them.
Change --wasm-compiler=ion to --wasm-compiler=baseline as needed.
--no-ion applies only to JS; has no effect on wasm.
--no-threads makes it a lot easier to profile/benchmark/debug.

Flags: needinfo?(jseward)

To build the shell, there are various ways; here is what I use
on x86_64-linux. Pretty old-fashioned I suspect, but it works.

Check out mozilla-central; then:

cd <mozilla-central>/js
mkdir BUILDX64OPT
cd BUILDX64OPT
CC="ccache clang" CXX="ccache clang++" ../src/configure --disable-debug --enable-optimize="-g -O2"
make -j8

Resulting binary should be BUILDX64OPT/dist/bin/js

FYI: we want to land that new lib in central ASAP because it unlocks a lot of features/improvements we need; in an ideal world, before the end of September.

If we can't resolve this issue before then, it would be great to be able to disable Ion compilation for that lib and use baseline compilation for now, until we resolve it.

Severity: -- → S3
Priority: -- → P1
See Also: → 1887312

I tried the V8 engine (version 12.7.224.16) through d8 with:

let wasm = "ort-wasm-simd-threaded.jsep.wasm";
const wasmCode = read(wasm, "binary");
const wasmModule = new WebAssembly.Module(wasmCode);
print(wasmModule);

and:

➜  compwasm /usr/bin/time -l d8 --liftoff --wasm-tier-up testCompileWasm.js
[object WebAssembly.Module]
        0,04 real         0,11 user         0,01 sys
            75153408  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                5038  page reclaims
                   3  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                 314  involuntary context switches
           956823949  instructions retired
           286924688  cycles elapsed
            60901696  peak memory footprint

It returns the module instantly, using ~70 MB of RSS. Maybe I am not calling it the right way?
My understanding is that --wasm-tier-up compiles with TurboFan and --liftoff is the baseline compiler.

The main problem in this case with the register allocator is that for each virtual register, it maintains a linked list of ranges, sorted by start position. Inserting and removing ranges is O(n) so this blows up when splitting large ranges.

If I change this data structure from a linked list to an AvlTree, it improves ort-wasm-simd-threaded.jsep.wasm from 292 seconds to 41 seconds locally. This can likely be optimized more as my patch is a naive implementation. I'm also not sure if this list really needs to be sorted; there are a few places where we depend on it but there might be better ways to handle that. I'll look into that some more.
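The linked-list cost described above can be illustrated with a toy model (this is not SpiderMonkey's actual code; the names and structure are invented for illustration). Each insert into a start-sorted singly linked list walks O(n) nodes, so building a list of n ranges costs O(n^2) comparisons overall, whereas a balanced tree like the AvlTree mentioned above makes each insert O(log n):

```javascript
// Toy model (not SpiderMonkey code): a virtual register's live ranges
// kept in a linked list sorted by start position. Every insert is O(n),
// so inserting n ranges costs O(n^2) comparisons in total.
function insertSorted(head, range) {
  if (head === null || range.start < head.range.start) {
    return { head: { range, next: head }, comparisons: 1 };
  }
  let node = head;
  let comparisons = 1;
  while (node.next !== null && node.next.range.start <= range.start) {
    node = node.next;
    comparisons++;
  }
  node.next = { range, next: node.next };
  return { head, comparisons };
}

function totalComparisons(n) {
  let head = null, total = 0;
  // Worst case: ranges arrive in increasing start order, so every
  // insert walks the entire list before finding its spot.
  for (let i = 0; i < n; i++) {
    const r = insertSorted(head, { start: i });
    head = r.head;
    total += r.comparisons;
  }
  return total;
}

console.log(totalComparisons(1000)); // ~n^2/2 comparisons
console.log(totalComparisons(2000)); // ~4x the work for 2x the ranges
```

Doubling the number of ranges roughly quadruples the work, which matches the blow-up seen when the allocator splits large ranges.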

Summary: js::wasm::IonCompileFunctions spins and eat RAM → js::wasm::IonCompileFunctions slow and resource consuming on onnx 1.20.x wasm

Now that we are tiering progressively, maybe a solution would be to never tier up extremely large functions, where we know the compilation cost would outweigh any performance/battery-life benefit of Ion-compiled code.

Now that we are tiering progressively, maybe a solution would be to never tier up extremely large functions, where we know the compilation cost would outweigh any performance/battery-life benefit of Ion-compiled code.

Not yet: the release timeline for the experimental compilation pipeline is still unclear, and the reporter is expecting a solution by "the end of September".
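The size-threshold idea discussed above could be sketched as follows (the cutoff value and the function shape here are invented for illustration; this is not a real SpiderMonkey API or pref):

```javascript
// Hypothetical sketch: with lazy tiering, skip Ion tier-up for functions
// whose bytecode is so large that compile time and memory would outweigh
// any runtime benefit. The cutoff below is an invented placeholder.
const TIER_UP_MAX_BYTECODE = 512 * 1024;

function shouldTierUp(func) {
  // Baseline code would stay in place for pathological functions like
  // the 1233871-byte one in ort-wasm-simd-threaded.jsep.wasm.
  return func.bytecodeSize <= TIER_UP_MAX_BYTECODE;
}

console.log(shouldTierUp({ bytecodeSize: 101299 }));  // true
console.log(shouldTierUp({ bytecodeSize: 1233871 })); // false
```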

This runs in 14 seconds for me locally with some more changes. The next issue is MIR dominator-tree building: the algorithm we have doesn't scale well with large MIR graphs, and I want to see if we can change it to Semi-NCA, a more recent algorithm that's also used by LLVM.

(In reply to Jan de Mooij [:jandem] from comment #20)

This runs in 14 seconds for me locally with some more changes. The next issue is MIR dominator-tree building: the algorithm we have doesn't scale well with large MIR graphs, and I want to see if we can change it to Semi-NCA, a more recent algorithm that's also used by LLVM.

I've reimplemented ComputeImmediateDominators using Semi-NCA and it's a large improvement: 14 seconds to 7.9 seconds for ort-wasm-simd-threaded.jsep.wasm. I also verified it produces exactly the same immediate dominators for all jit-tests and this Wasm file.
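For illustration, here is a minimal iterative immediate-dominator computation in the Cooper/Harvey/Kennedy style; this is not the Semi-NCA implementation (nor SpiderMonkey code), just a sketch of what ComputeImmediateDominators produces for a small control-flow graph:

```javascript
// Minimal iterative idom computation (Cooper/Harvey/Kennedy style, for
// illustration only). Blocks are numbered 0..n-1 with block 0 as the
// entry; preds[i] lists the predecessors of block i.
function immediateDominators(preds) {
  const n = preds.length;
  // Build successor lists and a reverse postorder from the entry.
  const succs = Array.from({ length: n }, () => []);
  for (let b = 0; b < n; b++) for (const p of preds[b]) succs[p].push(b);
  const order = [], seen = new Array(n).fill(false);
  (function dfs(b) {
    seen[b] = true;
    for (const s of succs[b]) if (!seen[s]) dfs(s);
    order.push(b);
  })(0);
  order.reverse(); // reverse postorder, entry first
  const rpoNum = new Array(n);
  order.forEach((b, i) => { rpoNum[b] = i; });

  const idom = new Array(n).fill(-1);
  idom[0] = 0;
  // Walk both blocks up the current dominator tree until they meet.
  const intersect = (a, b) => {
    while (a !== b) {
      while (rpoNum[a] > rpoNum[b]) a = idom[a];
      while (rpoNum[b] > rpoNum[a]) b = idom[b];
    }
    return a;
  };
  let changed = true;
  while (changed) {
    changed = false;
    for (const b of order) {
      if (b === 0) continue;
      let newIdom = -1;
      for (const p of preds[b]) {
        if (idom[p] === -1) continue; // predecessor not processed yet
        newIdom = newIdom === -1 ? p : intersect(p, newIdom);
      }
      if (idom[b] !== newIdom) { idom[b] = newIdom; changed = true; }
    }
  }
  return idom;
}

// Diamond CFG: 0 -> {1, 2} -> 3. Every block's idom is the entry.
console.log(immediateDominators([[], [0], [0], [1, 2]]));
```

This iterative fixpoint approach is simple but can take many passes on large irreducible graphs, which is the kind of scaling problem Semi-NCA avoids.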

After that, the next problem is that the register allocator allocates a bitmap for each basic block with a bit for each virtual register. This scales poorly for huge graphs (both time and memory usage). Replacing this with a HashSet per block for very large graphs improves it to 5.6 seconds.
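A back-of-envelope estimate of that per-block bitmap cost, using the figures quoted earlier in this bug (I'm using the LIR count of 388722 as a stand-in for the virtual-register count, which is an approximation):

```javascript
// Rough model of the cost described above: one bitmap per basic block,
// one bit per virtual register, so memory scales as blocks * vregs.
function bitmapBytes(blocks, vregs) {
  return blocks * Math.ceil(vregs / 8);
}

// 132855 basic blocks; 388722 LIRs used as an approximation of the
// virtual-register count (figures from comment 4).
const bytes = bitmapBytes(132855, 388722);
console.log((bytes / (1024 ** 3)).toFixed(1) + " GiB"); // prints "6.0 GiB"
```

Even as an approximation, this shows why a dense bitmap per block is untenable for the largest function in this module, and why a sparse HashSet per block helps for very large graphs.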

These changes make Ion compilation >50x faster for this module.

A very large OpenOffice Wasm file of 200 MB improves only a little bit (6530 to 6450 ms) because this is mostly just fixing pathological cases we don't see with other Wasm modules. I also have to measure how each of these changes perform for normal size JS/Wasm MIR graphs.

Depends on: 1917766
Depends on: 1917817
Depends on: 1918357
Depends on: 1918970
Depends on: 1919025