Bug 1921425 Opened 1 year ago Closed 1 year ago

Allow optional Ion compilation for the inference engine

Categories

(Core :: JavaScript: WebAssembly, enhancement, P3)

Tracking

RESOLVED FIXED
134 Branch
Tracking Status
firefox134 --- fixed

People

(Reporter: tarek, Assigned: rhunt)

Details

Attachments

(1 file)

Per Bug 1916442, we've found that the ONNX wasm runtime has huge functions that consume a lot of RSS when Ion compiles them.

The team did a great job reducing the issue, but we're still facing a challenge.

Some inference tasks that take less than 300MB of RSS, and could therefore run on low-end devices, will require 700MB+ of RSS for the compilation, which might go over the memory ceiling we're going to set. So we might not be able to do any inference on those devices, unless we can skip the Ion compilation (or part of it), even if that makes execution a bit slower.

Thanks!

Hi Tarek, thanks for the report!

Our compilation pipeline does have OOM handling code. So theoretically, if we hit an OOM while Ion-compiling a module, we will fall back to only baseline-compiling the code. This is not well tested, so it might not work in practice. If you see crashes happening, we may be missing a case in our OOM handling code (a report or STR would be great).

We are experimenting with a feature to only Ion-compile hot functions, and that could also help out here. But we are still some way from shipping that.

Hey Ryan, thanks.
I am not looking at deferring the OOM handling to Ion, but at disabling it altogether or giving it an RSS ceiling WAY before the OOM point, because we need RAM for other stuff too.

In practice, we will have a max RSS per inference process, and once the model has been loaded into RAM, the remaining RAM will be what's available for the JS execution and all the other work happening -- and that could be decided upfront.

Flags: needinfo?(rhunt)

Could we provide some kind of mechanism that optionally causes SM not to Ion-compile functions over a certain size, hence avoiding the large memory overhead of Ion-compiling huge functions? That would also have the benefit of allowing the models to run Ion-generated code for all other functions, rather than being stuck in baseline code for all functions.

A cheap hack for the current code base would be to put the switching in ExecuteCompileTask() (WasmGenerator.cpp), so we use BaselineCompileFunctions even when task->compilerEnv.tier() == Tier::Optimized, if the function size is over some threshold.

Looking forward, with the new compilation pipeline (lazy tiering), a more principled scheme would be to have WasmHandleRequestTierUp() ignore tier-up requests for functions over a certain size, if requested to do so.

I'm not saying we would want this kludgery for wasm inputs in general; only as a special request for our own language-model stuff. So there would have to be a way to restrict it to our own internal use.

I'll take a look at this.

Tarek, do you think it'll be sufficient to just disable Ion compilation for the large functions, or do you think we may need to disable it altogether?

I'd also like to find a way to limit this to just low-end devices. I think we can get the physical memory size of the device we're on. Would limiting this to only apply to devices with less than X GiB of memory work? Maybe with X == 2 GiB? It could also be configurable with a browser pref.

Flags: needinfo?(rhunt) → needinfo?(tziade)
Assignee: nobody → rhunt

Actually, skipping Ion compilation for just a subset of functions is going to be difficult here. Our second tier must be serializable. Baseline currently compiles a tier-up check into the prologue that won't work when the module is deserialized in the future. There may be other things that prevent baseline code from being serializable; we haven't tested this before. So it feels high-risk for something that should be a quick fix. I think it makes more sense to just disable Ion altogether on a low-end device.

I am working on looking up the memory before creating the worker; see https://bugzilla.mozilla.org/show_bug.cgi?id=1924499. So I do have the memory info upfront and use it to make decisions. I could definitely pass it along before the wasm is loaded.

Happy to do an alternative call for non-Ion execution.

Flags: needinfo?(tziade)

Ah, I was thinking of making the decision internally in the wasm engine. But if you want to make the decision yourself I can expose a chrome-only API for you.

The easiest thing for us to do is to extend the compile APIs like:

WebAssembly.compileStreaming(response, { disableOptimizingCompiler: true });
WebAssembly.instantiateStreaming(response, { disableOptimizingCompiler: true });
etc.

Would that work for you?
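For illustration, a minimal sketch of how privileged code might drive that option, assuming it lands as proposed; the memory value would come from the upfront lookup Tarek describes above (Bug 1924499), and the threshold, variable names, and wasm URL are placeholders:

// Hedged sketch: only 'disableOptimizingCompiler' comes from the proposal above;
// LOW_MEMORY_BYTES, availableMemoryBytes, importObject, and the URL are hypothetical.
const LOW_MEMORY_BYTES = 2 * 1024 ** 3; // the 2 GiB floor floated earlier
const skipIon = availableMemoryBytes < LOW_MEMORY_BYTES;
const module = await WebAssembly.compileStreaming(
  fetch("onnx-runtime.wasm"),
  { disableOptimizingCompiler: skipIon }
);
const instance = await WebAssembly.instantiate(module, importObject);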

(In reply to Ryan Hunt [:rhunt])

Ah, I was thinking of making the decision internally in the wasm engine. But if you want to make the decision yourself I can expose a chrome-only API for you.

For my use case I think it could be interesting to drive that decision myself, because there will be multiple workers running in parallel, each one with the wasm loaded -- so even if the device has more memory, I could make the decision on the spot depending on the memory pressure and the type of task that is running.

For example, some tasks might do inference in less than 3 seconds total, then finish and get removed, so for them running Ion would be useless.
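A sketch of that per-task policy, assuming the option proposed above; the task-duration estimate and memory-pressure signal are hypothetical inputs the scheduler would supply:

// Hypothetical policy: skip Ion for short-lived or memory-pressured tasks,
// since the compile cost would never pay for itself.
function shouldSkipIon(expectedSeconds, memoryPressure) {
  return expectedSeconds < 3 || memoryPressure === "high";
}

const module = await WebAssembly.compileStreaming(fetch(task.wasmUrl), {
  disableOptimizingCompiler: shouldSkipIon(task.expectedSeconds, task.pressure),
});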

(In reply to Tarek Ziadé (:tarek) from comment #9)

For example, some tasks might do inference in less than 3 seconds total, then finish and get removed, so for them running Ion would be useless.

Currently, there is (HTTP) caching logic present. If everything is set up right, no Ion/baseline compilation should happen at all, just loading of machine code. We cache only Ion-compiled code, which is more compact than baseline code.

There is also a mechanism to quickly retrieve a compiled WebAssembly.Module from storage of your choice (see https://webassembly.org/getting-started/js-api/):

A Module object is stateless and supports structured cloning which means that the compiled code can be stored in IndexedDB and/or shared between windows and workers via postMessage.

If it is in an extension context, the module can be lazily retrieved from the cache, held for some time, and supplied for short tasks.
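A minimal sketch of that sharing pattern (file names, the worker script, and importObject are placeholders):

// Coordinator: compile once, then hand the compiled Module to workers.
const module = await WebAssembly.compileStreaming(fetch("model.wasm"));
const worker = new Worker("inference-worker.js");
worker.postMessage(module); // Module supports structured cloning

// inference-worker.js: instantiate without recompiling.
self.onmessage = async (event) => {
  const instance = await WebAssembly.instantiate(event.data, importObject);
  // ... run the short inference task, then the worker can be torn down ...
};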

(In reply to Yury Delendik (:yury) from comment #11)

There is also a mechanism to quickly retrieve a compiled WebAssembly.Module from storage of your choice (see https://webassembly.org/getting-started/js-api/):

A Module object is stateless and supports structured cloning which means that the compiled code can be stored in IndexedDB and/or shared between windows and workers via postMessage.

If it is in an extension context, the module can be lazily retrieved from the cache, held for some time, and supplied for short tasks.

I don't believe we support caching modules in IndexedDB anymore. We do support caching of HTTP requests, though.

Thanks! The wasm files are picked from RemoteSettings and then put into IndexedDB.
So maybe I would need to trigger the storage of the cached version manually in my code?

Pushed by rhunt@eqrion.net:
https://hg.mozilla.org/integration/autoland/rev/0a0bf085c738
wasm: Add 'disableOptimizingCompiler' option to compile for privileged code. r=yury
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 134 Branch