Bug 1921425 Opened 1 year ago Closed 1 year ago

Allow optional Ion compilation for the inference engine

Categories

(Core :: JavaScript: WebAssembly, enhancement, P3)

Tracking

RESOLVED FIXED
134 Branch
Tracking Status
firefox134 --- fixed

People

(Reporter: tarek, Assigned: rhunt)

Details

Attachments

(1 file)

Per Bug 1916442, we've found that the ONNX wasm runtime has huge functions that consume a lot of RSS when Ion compiles them.

The team did a great job reducing the issue, but we're still facing a challenge.

Some inference tasks that take less than 300MB of RSS, and could therefore run on low-end devices, will require 700MB+ of RSS for the compilation, which might go over the memory ceiling we're going to set. So we might not be able to do any inference on those devices, unless we can skip the Ion compilation (or part of it), even if that makes execution a bit slower.

Thanks!

Hi Tarek, thanks for the report!

Our compilation pipeline does have OOM handling code. So theoretically, if we hit an OOM while Ion-compiling a module, we will fall back to only baseline-compiling the code. This is not well tested, so it might not work in practice. If you see crashes happening, we may be missing a case in our OOM handling code (a report or STR would be great).

We are experimenting with a feature to only Ion-compile hot functions, and that could also help out here. But we are still some way from shipping that.

Hey Ryan, thanks.
I am not looking at deferring the OOM handling to Ion, but at disabling it altogether or giving it an RSS ceiling WAY before the OOM point, because we need RAM for other stuff too.

In practice, we will have a max RSS per inference process, and once the model has been loaded into RAM, the remaining RAM will be what's available for the JS execution and all the other work happening -- and that could be decided upfront.

Flags: needinfo?(rhunt)

Could we provide some kind of mechanism that optionally causes SM not to Ion-compile functions over a certain size, hence avoiding the large memory overhead of Ion-compiling huge functions? That would also have the benefit of allowing the models to run Ion-generated code for all other functions, rather than being stuck in baseline code for all functions.

A cheap hack for the current code base would be to put the switching in ExecuteCompileTask() (WasmGenerator.cpp), so we use BaselineCompileFunctions even when task->compilerEnv.tier() == Tier::Optimized, if the function size is over some threshold.

Looking forward, with the new compilation pipeline (lazy tiering), a more principled scheme would be to have WasmHandleRequestTierUp() ignore tier-up requests for functions over a certain size, if requested to do so.

I'm not saying we would want this kludgery for wasm inputs in general; only as a special request for our own language-model stuff. So there would have to be a way to restrict it to our own internal use.

I'll take a look at this.

Tarek, do you think it'll be sufficient to just disable Ion compilation for the large functions, or do you think we may need to disable it altogether?

I'd also like to find a way to limit this to just low-end devices. I think we can get the physical memory size of the device we're on. Would limiting this to only apply to devices with less than X GiB of memory work? Maybe with X == 2 GiB? It could also be configurable with a browser pref.

Flags: needinfo?(rhunt) → needinfo?(tziade)
Assignee: nobody → rhunt

Actually, skipping Ion compilation for just a subset of functions is going to be difficult here. Our second tier must be serializable. Baseline currently compiles a tier-up check into the prologue that won't work when the module is deserialized in the future. There may be other things that prevent baseline code from being serializable; we haven't tested this before. So it feels high-risk for something that should be a quick fix. I think it makes more sense to just disable Ion altogether on a low-end device.

I am working on looking up the memory before creating the worker; see https://bugzilla.mozilla.org/show_bug.cgi?id=1924499. So I do have the memory info upfront and use it to make decisions. I could definitely pass it along before the wasm is loaded.

Happy to do an alternative call for non-Ion execution.

Flags: needinfo?(tziade)

Ah, I was thinking of making the decision internally in the wasm engine. But if you want to make the decision yourself I can expose a chrome-only API for you.

The easiest thing for us to do is to extend the compile APIs like:

WebAssembly.compileStreaming(response, { disableOptimizingCompiler: true });
WebAssembly.instantiateStreaming(response, { disableOptimizingCompiler: true });
etc.

Would that work for you?
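For illustration, a minimal sketch of how privileged code might drive that option, assuming it lands as proposed; the memory value would come from the upfront lookup Tarek describes above (Bug 1924499), and the threshold, variable names, and wasm URL are placeholders:

// Hedged sketch: only 'disableOptimizingCompiler' comes from the proposal above;
// LOW_MEMORY_BYTES, availableMemoryBytes, importObject, and the URL are hypothetical.
const LOW_MEMORY_BYTES = 2 * 1024 ** 3; // the 2 GiB floor floated earlier
const skipIon = availableMemoryBytes < LOW_MEMORY_BYTES;
const module = await WebAssembly.compileStreaming(
  fetch("onnx-runtime.wasm"),
  { disableOptimizingCompiler: skipIon }
);
const instance = await WebAssembly.instantiate(module, importObject);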

(In reply to Ryan Hunt [:rhunt])

Ah, I was thinking of making the decision internally in the wasm engine. But if you want to make the decision yourself I can expose a chrome-only API for you.

For my use case I think it could be interesting to drive that decision myself, because there will be multiple workers running in parallel, each one with the wasm loaded -- so even if the device has more memory, I could make the decision on the spot depending on the memory pressure and the type of task that is running.

For example, some tasks might do inference in less than 3 seconds total, then finish and get removed, so for them running Ion would be useless.
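A sketch of that per-task policy, assuming the option proposed above; the task-duration estimate and memory-pressure signal are hypothetical inputs the scheduler would supply:

// Hypothetical policy: skip Ion for short-lived or memory-pressured tasks,
// since the compile cost would never pay for itself.
function shouldSkipIon(expectedSeconds, memoryPressure) {
  return expectedSeconds < 3 || memoryPressure === "high";
}

const module = await WebAssembly.compileStreaming(fetch(task.wasmUrl), {
  disableOptimizingCompiler: shouldSkipIon(task.expectedSeconds, task.pressure),
});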

(In reply to Tarek Ziadé (:tarek) from comment #9)

For example, some tasks might do inference in less than 3 seconds total, then finish and get removed, so for them running Ion would be useless.

Currently, there is (HTTP) caching logic present. If everything is set up right, no Ion/baseline compilation should happen at all, just loading of machine code. We cache only Ion-compiled code, which is more compact than baseline code.

There is also a mechanism to quickly retrieve a compiled WebAssembly.Module from storage of your choice (see https://webassembly.org/getting-started/js-api/):

A Module object is stateless and supports structured cloning which means that the compiled code can be stored in IndexedDB and/or shared between windows and workers via postMessage.

If it is in an extension context, the module can be lazily retrieved from the cache, held for some time, and supplied for short tasks.
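A minimal sketch of that sharing pattern (file names, the worker script, and importObject are placeholders):

// Coordinator: compile once, then hand the compiled Module to workers.
const module = await WebAssembly.compileStreaming(fetch("model.wasm"));
const worker = new Worker("inference-worker.js");
worker.postMessage(module); // Module supports structured cloning

// inference-worker.js: instantiate without recompiling.
self.onmessage = async (event) => {
  const instance = await WebAssembly.instantiate(event.data, importObject);
  // ... run the short inference task, then the worker can be torn down ...
};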

(In reply to Yury Delendik (:yury) from comment #11)

There is also a mechanism to quickly retrieve a compiled WebAssembly.Module from storage of your choice (see https://webassembly.org/getting-started/js-api/):

A Module object is stateless and supports structured cloning which means that the compiled code can be stored in IndexedDB and/or shared between windows and workers via postMessage.

If it is in an extension context, the module can be lazily retrieved from the cache, held for some time, and supplied for short tasks.

I don't believe we support caching modules in IndexedDB anymore. We do support caching of HTTP requests, though.

Thanks! The wasm files are picked from RemoteSettings and then put into IndexedDB.
So maybe I would need to trigger the storage of the cached version manually in my code?

Pushed by rhunt@eqrion.net:
https://hg.mozilla.org/integration/autoland/rev/0a0bf085c738
wasm: Add 'disableOptimizingCompiler' option to compile for privileged code. r=yury
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 134 Branch