Allow optional Ion compilation for the inference engine
Categories
(Core :: JavaScript: WebAssembly, enhancement, P3)
Tracking
| | Tracking | Status |
|---|---|---|
| firefox134 | --- | fixed |
People
(Reporter: tarek, Assigned: rhunt)
Details
Attachments
(1 file)
Per Bug 1916442, we found that the ONNX wasm runtime has huge functions that consume a lot of RSS when Ion compiles them.
The team did a great job reducing the issue, but we're still facing a challenge.
Some inference tasks take less than 300MB of RSS and could run on low-end devices, but will require 700MB+ of RSS for the compilation, which might go over the memory ceiling we're going to set. So we might not be able to do any inference on those devices, unless we can skip the Ion compilation (or part of it), even if that makes things a bit slower.
Thanks!
| Assignee |
Comment 1•1 year ago
Hi Tarek, thanks for the report!
Our compilation pipeline does have OOM handling code. So theoretically, if we hit an OOM while Ion-compiling a module, we will fall back to only baseline-compiling the code. This is not well tested, so it might not work in practice. If you see crashes happening, we may be missing a case in our OOM handling code (a report or STR would be great).
We are experimenting with a feature to only Ion compile hot functions, and that could also help out here. But we are a bit of a ways from shipping that.
| Reporter |
Comment 2•1 year ago
Hey Ryan, thanks.
I am not looking at deferring to the OOM handling in Ion, but at disabling it altogether, or giving it an RSS ceiling WAY before the OOM point, because we need RAM for other stuff too.
In practice, we will have a max RSS per inference process. Once the model has been loaded in RAM, the remaining RAM is what's available for the JS execution and all the other work happening -- and that could be decided upfront.
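To make the arithmetic concrete (the per-process ceiling here is a hypothetical value; the 300MB and 700MB figures are from the description):

```js
// Illustrative budget arithmetic; the ceiling value is made up.
const maxRssPerProcess = 1 * 1024 ** 3; // e.g. a 1 GiB ceiling per inference process
const modelRss = 300 * 1024 ** 2;       // an inference task that fits in ~300 MB
const remainingBudget = maxRssPerProcess - modelRss;

// Ion compilation peaking at 700 MB+ of RSS would blow this budget,
// so the decision to skip Ion could be made upfront from these numbers.
const shouldSkipIon = remainingBudget < 700 * 1024 ** 2;
```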
Comment 3•1 year ago
Could we (or: we could?) provide some kind of mechanism that optionally causes
SM not to Ion-compile functions over a certain size, hence avoiding the large
memory overhead of Ion-compiling huge functions? That would also have the
benefit of allowing the models to run Ion-generated code for all other
functions, rather than being stuck in baseline code for all functions.
A cheap hack for the current code base would be to put the switching in
ExecuteCompileTask() (WasmGenerator.cpp), so we use BaselineCompileFunctions
even when task->compilerEnv.tier() == Tier::Optimized if the function size is
over some threshold.
Looking forwards, with the new compilation pipeline (lazy tiering), a more
principled scheme would be to have WasmHandleRequestTierUp() ignore tier-up
requests for functions over a certain size, if requested to do so.
I'm not saying we would want this kludgery for wasm inputs in general; only as
a special-request for our own language-model stuff. So there would have to be
a way to restrict it to our own internal use.
| Assignee |
Comment 4•1 year ago
I'll take a look at this.
Tarek, do you think it'll be sufficient to just disable Ion compilation for the large functions, or do you think we may need to disable it altogether?
I'd also like to find a way to limit this to just low end devices. I think we can get the physical memory size for the device we're on. Would limiting this to only apply to devices with less than X GiB of memory work? Maybe with X == 2GiB? It could also be configurable with a browser pref.
| Assignee |
Comment 5•1 year ago
Actually, skipping Ion compilation for just a subset of functions is going to be difficult here. Our second tier must be serializable, and baseline currently compiles a tier-up check into the prologue that won't work when the module is deserialized in the future. There may be other things that prevent baseline code from being serializable; we haven't tested this before. So it feels high-risk for something that should be a quick fix. I think it makes more sense to just disable Ion altogether on a low-end device.
| Reporter |
Comment 6•1 year ago
I am working on looking up the memory before creating the worker; see https://bugzilla.mozilla.org/show_bug.cgi?id=1924499
So I do have the memory info upfront and use it to make decisions. I could definitely pass it along before the WASM is loaded.
Happy to do an alternative call for non-Ion execution.
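For illustration, here is a minimal sketch of the kind of decision I have in mind, assuming chrome-privileged code where Services.sysinfo is available (the worker message shape and the 2 GiB policy are made up):

```js
// Sketch only: runs in chrome-privileged code before the inference
// worker compiles the wasm. "memsize" is the total physical memory
// in bytes reported by nsISystemInfo.
const GiB = 1024 ** 3;
const physicalMemory = Services.sysinfo.getProperty("memsize");

// Hypothetical policy: treat anything under 2 GiB as a low-end device.
const isLowEndDevice = physicalMemory < 2 * GiB;

// Forward the decision to the worker before the wasm is loaded;
// the message shape here is invented for the example.
inferenceWorker.postMessage({ type: "init", skipIonCompilation: isLowEndDevice });
```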
| Assignee |
Comment 7•1 year ago
Ah, I was thinking of making the decision internally in the wasm engine. But if you want to make the decision yourself I can expose a chrome-only API for you.
The easiest thing for us to do is to extend the compile APIs like:
WebAssembly.compileStreaming(response, { disableOptimizingCompiler: true });
WebAssembly.instantiateStreaming(response, { disableOptimizingCompiler: true });
etc
Would that work for you?
| Assignee |
Comment 8•1 year ago
| Reporter |
Comment 9•1 year ago
> Ah, I was thinking of making the decision internally in the wasm engine. But if you want to make the decision yourself I can expose a chrome-only API for you.
For my use case I think it would be useful to drive that decision myself, because there will be multiple workers running in parallel, each one with the wasm loaded -- so even if the device has more memory, I could make this decision on the spot depending on the memory pressure and the type of task that is running.
For example, some tasks might do their inference in less than 3 seconds total, finish, and get removed, so for them running Ion would be useless.
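Something like this sketch is what I have in mind, using the `disableOptimizingCompiler` option proposed in comment 7 (not shipped; the thresholds here are made up):

```js
// Sketch only: decide per task whether Ion is worth it.
async function compileForTask(response, { expectedRuntimeMs, availableBytes }) {
  // Made-up heuristics: skip Ion for short-lived tasks, or when the
  // process has little RSS headroom left for compilation.
  const skipIon =
    expectedRuntimeMs < 3000 || availableBytes < 700 * 1024 ** 2;
  return WebAssembly.compileStreaming(response, {
    disableOptimizingCompiler: skipIon,
  });
}
```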
Comment 10•1 year ago
(In reply to Tarek Ziadé (:tarek) from comment #9)
> For example, some tasks might do their inference in less than 3 seconds total, finish, and get removed, so for them running Ion would be useless.
Currently, there is (HTTP) caching logic present. If everything is set up right, no Ion/baseline compilation should happen at all, just loading of machine code. We cache only Ion-compiled code, which is more compact than baseline code.
Comment 11•1 year ago
There is also a mechanism to quickly retrieve a compiled WebAssembly.Module from storage of your choice (see https://webassembly.org/getting-started/js-api/):
> A Module object is stateless and supports structured cloning which means that the compiled code can be stored in IndexedDB and/or shared between windows and workers via postMessage.
If it is in an extension context, the module can be lazily retrieved from the cache, held for some time, and supplied for short tasks.
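A minimal sketch of the structured-clone sharing (the wasmUrl and worker list are placeholder names):

```js
// Compile once, then share the Module with several workers: a Module
// structured-clones via postMessage, so each worker can instantiate it
// without recompiling.
const module = await WebAssembly.compileStreaming(fetch(wasmUrl));
for (const worker of inferenceWorkers) {
  worker.postMessage({ type: "module", module });
}

// In each worker:
//   const { module } = event.data;
//   const instance = await WebAssembly.instantiate(module, imports);
```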
| Assignee |
Comment 12•1 year ago
(In reply to Yury Delendik (:yury) from comment #11)
> There is also a mechanism to quickly retrieve a compiled WebAssembly.Module from storage of your choice (see https://webassembly.org/getting-started/js-api/):
> A Module object is stateless and supports structured cloning which means that the compiled code can be stored in IndexedDB and/or shared between windows and workers via postMessage.
> If it is in an extension context, the module can be lazily retrieved from the cache, held for some time, and supplied for short tasks.
I don't believe we support caching modules in IndexedDB anymore. We do support caching of HTTP requests though.
| Reporter |
Comment 13•1 year ago
Thanks! The wasm files are picked from RemoteSettings and then put into IndexedDB.
So maybe I would need to trigger the storage of the cached version manually in my code?
Comment 14•1 year ago
Comment 15•1 year ago
| bugherder |