Closed Bug 1916442 Opened 5 months ago Closed 4 months ago

js::wasm::IonCompileFunctions slow and resource consuming on onnx 1.20.x wasm

Categories

(Core :: JavaScript: WebAssembly, defect, P1)


Tracking


RESOLVED FIXED

People

(Reporter: tarek, Assigned: jandem)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(Whiteboard: [genai])

Attachments

(1 file)

I am working on integrating the new Transformers.js lib v3. It works, but I noticed huge RSS memory usage in the inference worker.

I profiled the code and found that js::wasm::IonCompileFunctions spins trying to compile some Wasm. I could not take a memory profile because it locks up the report in about:memory.

This is the profile I got from the profiler: https://share.firefox.dev/4g6ZrQh
To reproduce, follow these steps:

  • apply the patch https://phabricator.services.mozilla.com/D220519
  • in about:config, set browser.ml.enable to true
  • go to about:inference and select the NER preset
  • run the inference and wait for the models to download and the inference to finish (you will see results in the console on that page)

That will put the inference process into the state described above. You can check its activity in about:processes.

Blocks: 1913071

Notice that if I set javascript.options.wasm_optimizingjit to false, the problem is gone.

Component: Machine Learning → JavaScript: WebAssembly

Interesting: things seem to be faster for that lib when that pref is off.

The Wasm files used can be fetched here: https://cdn.jsdelivr.net/npm/onnxruntime-web@1.20.0-dev.20240829-be76e1e1b8/dist/

ort-wasm-simd-threaded.wasm and ort-wasm-simd-threaded.jsep.wasm

Assignee: nobody → jseward

On x86_64-linux, I managed to Ion-compile both ort-wasm-simd-threaded.wasm and
ort-wasm-simd-threaded.jsep.wasm to completion. The latter is larger and took
about 10 minutes and around 4GB of memory.

It contains some very large functions, the largest of which is 1233871 wasm
bytecode bytes, producing 388722 LIRs in 132855 basic blocks. This takes the
allocator a long time to process (several minutes), but it doesn't loop. I
imagine it will take about twice as long on an ARM64 platform, since ARM64 has
about twice as many integer registers to search through.

(In reply to Tarek Ziadé (:tarek) from comment #2)

Interesting: things seem to be faster for that lib when that pref is off.

If you mean the lib appears to run faster when
javascript.options.wasm_optimizingjit is set to false, I think that is
expected, because the baseline compiled code isn't competing
against Ion's register allocator for compute resources. We know that
-- especially for large inputs -- Ion's register allocator has very poor
memory locality, and it could be that the resulting avalanche of
traffic to shared parts of the memory hierarchy -- the L3 cache and
DRAM -- slows down the baseline code.

Are you in control of the building of ort-wasm-simd-threaded.jsep.wasm?
From our point of view, the simplest "fix" would be to reduce the aggressiveness
of inlining, or whatever is causing the formation of such a huge wasm function,
so as to keep the register allocator away from such pathological behaviour.

I should add: there are several very large functions in that file, not just one.
In order of increasing size, the top 9 sizes (in wasm bytecode bytes) are
101299, 103011, 111524, 133122, 146727, 149709, 404850, 651603, 1233871.
ort-wasm-simd-threaded.wasm also has very large functions, although
somewhat smaller than these.

Flags: needinfo?(tziade)

Are you in control of the building of ort-wasm-simd-threaded.jsep.wasm?

No, but the cmake script is here:
https://github.com/microsoft/onnxruntime/blob/main/cmake/onnxruntime_webassembly.cmake

If there are obvious changes we could make there, we could build our own artifacts.

The latter is larger and took about 10 minutes and around 4GB of memory.

10 minutes and 4GB is a no-go for our users. Is it possible to provide already-compiled versions so they skip that step?
If not, would it be possible for now to run that specific Wasm with Ion deactivated, in case the fix takes a long time since it's upstream?

Flags: needinfo?(tziade)

(In reply to Tarek Ziadé (:tarek) from comment #7)

10 minutes and 4GB is a no-go for our users.

Oh, indeed, 10 mins / 4GB is unreasonable in any scenario.

Is it possible to provide already-compiled versions so they skip that step?

We don't have any (simple) way to do that, but…

If not, would it be possible for now to run that specific Wasm with Ion deactivated, in case the fix takes a long time since it's upstream?

…yeah, something like that would be easy to do. The downside is that
you would get 60%–70% of the performance of Ion-compiled code.
Is that acceptable? Are these .wasms performance-critical?

Is that acceptable? Are these .wasms performance-critical?

It is performance-critical for sure.

This is the cmake file for building them https://github.com/microsoft/onnxruntime/blob/main/cmake/onnxruntime_webassembly.cmake

I'll reach out to the project to see if we can get some help.

Julian, could you provide the steps you used to manually compile the WASMs? I would also be curious to run the same thing on Chrome/Chromium to compare the time it takes.

Flags: needinfo?(jseward)

(In reply to Tarek Ziadé (:tarek) from comment #9)

I'll reach out to the project to see if we can get some help.

That might be worth doing, also because other wasm implementations
might be equally unhappy at having to do optimised compilation
of such huge functions.

(In reply to Tarek Ziadé (:tarek) from comment #10)

Julian, could you provide the steps you used to manually compile the WASMs?

Put this in a file (e.g. testCompileWasm.js):

if (scriptArgs.length != 1) {
    print("usage: testCompileWasm /path/to/file.wasm");
    quit(0);
}
print("testCompileWasm: reading");
let b2 = os.file.readFile(scriptArgs[0], "binary");
print("testCompileWasm: compiling");
let m2 = new WebAssembly.Module(b2);
print("testCompileWasm: done " + m2);

Then run (e.g.):

 /path/to/dist/bin/js --no-ion --no-threads --wasm-compiler=ion \
   -P wasm_experimental_inline_depth_limit=0 \
   -P wasm_experimental_inline_size_limit=0 \
   -P wasm_experimental_inline_call_ref_threshold=0 \
   testCompileWasm.js /path/to/ort-wasm-simd-threaded.jsep.wasm

If it doesn't like the -P bits, remove them.
Change --wasm-compiler=ion to --wasm-compiler=baseline as needed.
--no-ion applies only to JS; it has no effect on wasm.
--no-threads makes it a lot easier to profile/benchmark/debug.

Flags: needinfo?(jseward)

To build the shell there are various ways; here is what I use
on x86_64-linux. Pretty old-fashioned, I suspect, but it works.

Check out mozilla-central; then:

cd <mozilla-central>/js
mkdir BUILDX64OPT
cd BUILDX64OPT
CC="ccache clang" CXX="ccache clang++" ../src/configure --disable-debug --enable-optimize="-g -O2"
make -j8

The resulting binary should be BUILDX64OPT/dist/bin/js.

FYI, we want to land that new lib in central ASAP because it unlocks a lot of features/improvements we need; in an ideal world, before the end of September.

If we can't resolve this issue before then, it would be great to be able to disable that compilation for those libs and use baseline compilation for now, until we resolve it.

Severity: -- → S3
Priority: -- → P1
See Also: → 1887312

I tried the V8 engine (version 12.7.224.16) through d8 with:

let wasm = "ort-wasm-simd-threaded.jsep.wasm";
const wasmCode = read(wasm, "binary");
const wasmModule = new WebAssembly.Module(wasmCode);
print(wasmModule);

and:

➜  compwasm /usr/bin/time -l d8 --liftoff --wasm-tier-up testCompileWasm.js
[object WebAssembly.Module]
        0,04 real         0,11 user         0,01 sys
            75153408  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                5038  page reclaims
                   3  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                 314  involuntary context switches
           956823949  instructions retired
           286924688  cycles elapsed
            60901696  peak memory footprint

It returns the module instantly, using 70MB of RSS. Maybe I am not calling it the right way?
My understanding is that --wasm-tier-up compiles it with TurboFan and --liftoff is the baseline compiler.

The main problem in this case with the register allocator is that for each virtual register, it maintains a linked list of ranges, sorted by start position. Inserting and removing ranges is O(n) so this blows up when splitting large ranges.

If I change this data structure from a linked list to an AvlTree, it improves ort-wasm-simd-threaded.jsep.wasm from 292 seconds to 41 seconds locally. This can likely be optimized more as my patch is a naive implementation. I'm also not sure if this list really needs to be sorted; there are a few places where we depend on it but there might be better ways to handle that. I'll look into that some more.
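
The linked-list vs. balanced-tree cost difference can be sketched with a toy operation count (illustrative only, not SpiderMonkey code; the function names and the worst-case assumption are mine):

```python
import math

def linked_list_total_ops(n):
    """Node visits for n inserts into a sorted linked list, assuming each
    insert scans to the end of the current list (the worst case that
    range splitting can hit): O(n) per insert, O(n^2) total."""
    return sum(range(n))  # 0 + 1 + ... + (n - 1)

def balanced_tree_total_ops(n):
    """Comparisons for n inserts into a balanced search tree such as an
    AVL tree: O(log i) for the i-th insert, O(n log n) total."""
    return sum(max(1, math.ceil(math.log2(i + 1))) for i in range(1, n + 1))
```

For n = 10,000 ranges that is about 50 million node visits versus roughly 120 thousand comparisons, the kind of asymptotic gap behind the 292 s → 41 s improvement reported above.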

Summary: js::wasm::IonCompileFunctions spins and eat RAM → js::wasm::IonCompileFunctions slow and resource consuming on onnx 1.20.x wasm

Now that we are tiering progressively, maybe a solution would be to never tier up extremely large functions where we know the compilation cost would outweigh the performance/battery-life benefit of the Ion-compiled code.

Now that we are tiering progressively, maybe a solution would be to never tier up extremely large functions where we know the compilation cost would outweigh the performance/battery-life benefit of the Ion-compiled code.

Not yet. The release timeline for the experimental compilation pipeline is not clear yet, and the reporter is expecting a solution by "the end of September".

This runs in 14 seconds for me locally with some more changes. The next issue is MIR dominator-tree building. The algorithm we have doesn't scale well to large MIR graphs, and I want to see if we can change it to Semi-NCA, a more recent algorithm that's also used by LLVM.

(In reply to Jan de Mooij [:jandem] from comment #20)

This runs in 14 seconds for me locally with some more changes. The next issue is MIR dominator-tree building. The algorithm we have doesn't scale well to large MIR graphs, and I want to see if we can change it to Semi-NCA, a more recent algorithm that's also used by LLVM.

I've reimplemented ComputeImmediateDominators using Semi-NCA and it's a large improvement: 14 seconds to 7.9 seconds for ort-wasm-simd-threaded.jsep.wasm. I also verified it produces exactly the same immediate dominators for all jit-tests and this Wasm file.
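
For illustration, here is a minimal immediate-dominator computation in the classic Cooper-Harvey-Kennedy iterative style (my own sketch, not the SpiderMonkey code); Semi-NCA computes exactly the same idom tree, just with much better scaling on huge graphs:

```python
def immediate_dominators(succs, entry):
    """Iterative idom computation (Cooper-Harvey-Kennedy style).
    `succs` maps each node to its successor list. Returns {node: idom},
    with the entry node mapped to itself."""
    preds = {n: [] for n in succs}
    for n, ss in succs.items():
        for s in ss:
            preds[s].append(n)

    # Depth-first search to get a reverse-postorder numbering.
    post, seen = [], set()
    def dfs(n):
        seen.add(n)
        for s in succs[n]:
            if s not in seen:
                dfs(s)
        post.append(n)
    dfs(entry)
    rpo = post[::-1]
    num = {n: i for i, n in enumerate(rpo)}

    idom = {entry: entry}
    def intersect(a, b):
        # Walk both nodes up the (partial) dominator tree until they meet.
        while a != b:
            while num[a] > num[b]:
                a = idom[a]
            while num[b] > num[a]:
                b = idom[b]
        return a

    changed = True
    while changed:
        changed = False
        for n in rpo:
            if n == entry:
                continue
            new = None
            for p in preds[n]:
                if p in idom:
                    new = p if new is None else intersect(p, new)
            if idom.get(n) != new:
                idom[n] = new
                changed = True
    return idom
```

On a diamond CFG (A → B, C; B, C → D) this yields A as the immediate dominator of B, C, and D, matching what any correct dominator pass must produce.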

After that, the next problem is that the register allocator allocates a bitmap for each basic block with a bit for each virtual register. This scales poorly for huge graphs (both time and memory usage). Replacing this with a HashSet per block for very large graphs improves it to 5.6 seconds.
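
A minimal sketch of the two representations (simplified; class names are mine, the real allocator tracks liveness per LIR virtual register):

```python
class DenseLiveSet:
    """One bit per virtual register, allocated up front: size is
    ceil(num_vregs / 8) bytes per block no matter how few are live."""
    def __init__(self, num_vregs):
        self.bits = bytearray((num_vregs + 7) // 8)
    def add(self, vreg):
        self.bits[vreg // 8] |= 1 << (vreg % 8)
    def contains(self, vreg):
        return bool(self.bits[vreg // 8] & (1 << (vreg % 8)))

class SparseLiveSet:
    """Hash set: memory proportional to the registers actually live,
    which is what pays off on huge graphs."""
    def __init__(self):
        self.live = set()
    def add(self, vreg):
        self.live.add(vreg)
    def contains(self, vreg):
        return vreg in self.live
```

With ~200k virtual registers, the dense form costs ~24 KB per basic block even when only a handful of registers are live in it.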

These changes make Ion compilation >50x faster for this module.

A very large OpenOffice Wasm file of 200 MB improves only a little (6530 to 6450 ms) because these changes mostly fix pathological cases we don't see with other Wasm modules. I also have to measure how each of these changes performs on normal-size JS/Wasm MIR graphs.

Depends on: 1917766
Depends on: 1917817
Blocks: sm-regalloc
No longer blocks: 1913071
Blocks: 1913071
No longer blocks: sm-jits
Depends on: 1918357
Depends on: 1918970
Depends on: 1919025

Tarek, this should be much better on the latest mozilla-central.

I'll land a few more performance improvements. After that I also want to look into the memory usage - not sure how much we can do there but there's probably some low-hanging fruit.

Thanks so much for all the work you've done here, much appreciated! I've landed the lib, as it's now much more efficient. Looking forward to more improvements. I have also reached out to the onnx devs; they have an inference test suite they run on their side, and we could try it against Ion to see how things are looking.

Attaching a zipped copy of ort-wasm-simd-threaded.jsep.wasm to make sure we don't lose it. It has been useful to find perf cliffs in the compiler backend.

Note that onnx pushes its builds to https://cdn.jsdelivr.net/npm/onnxruntime-web@1.20.0-dev.20240917-afd642a194/dist/ so they can be grabbed from there at any time.

Depends on: 1920430
Depends on: 1920951

With the patches in the open bugs, this is down to 3.9 seconds on my machine, 75x faster than before. That's good enough until we enable lazy tiering.

I still want to look into the memory usage a little to see if there's low-hanging fruit there.

(In reply to Jan de Mooij [:jandem] from comment #26)

I still want to look into the memory usage a little to see if there's low-hanging fruit there.

The high memory usage was mostly caused by the dense bit sets: 132856 basic blocks with a bit set for 199477 virtual registers is at least 3.1 GB total. Bug 1920430 is making this a sparse bit set.
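
The 3.1 GB figure follows directly from storing one bit per (block, virtual register) pair; a quick check of the arithmetic:

```python
blocks = 132856  # basic blocks in the largest function
vregs = 199477   # virtual registers
gib = blocks * vregs / 8 / 2**30  # one bit per pair, converted to GiB
print(f"{gib:.1f} GiB")  # about 3.1 GiB
```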

Comparing a JS shell build with these changes to a shell build before these changes:

before: Maximum resident set size (kbytes): 4349284
after:  Maximum resident set size (kbytes): 1278296

Without parallel compilation it's 720 MB max. Most of that is memory allocated by Ion compilation of the largest function. We could optimize this more but I didn't see low-hanging fruit that'd reduce this a lot.

Thanks for the report. Quick question: on some hardware, where the inference engine will take around 200MB of RSS, we would like to avoid using more RAM.

Could we have a mode where we bypass Ion for this Wasm? I know it'll come at the cost of speed, but if we don't have that option, we won't be able to offer inference at all on lower-end devices where memory is scarce.

Flags: needinfo?(jdemooij)

Please file a separate bug for that in the WebAssembly component :) It's possible we should wait for the lazy tiering work to land first. That would also make it easier for us to not tier up very large functions (everywhere, or only on low-end devices).

Flags: needinfo?(jdemooij)
Depends on: 1922227
Depends on: 1922499
Assignee: jseward → jdemooij
Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
No longer depends on: 1918357
Depends on: 1918357