Enable GPU support for onnx-native
Categories: Core :: Machine Learning: On Device (enhancement, P3)
People: Reporter gregtatum; Unassigned (NeedInfo)
References: Blocks 1 open bug
With onnx-native, we can potentially leverage GPUs and NPUs for inference work, bringing an order of magnitude speed increase.
:padenot, :tarek, can you provide some additional details on the requirements here? I can then split out more bugs for the particulars.
I think we've discussed running GPU-enabled onnx-native in its own process. Is this accurate? Is it the HWInference process in Bug 1940906? I'm guessing that "HW" stands for hardware here.
We'd also need CI tests running on GPU-capable machines to make this happen.
Comment 1 • 5 days ago (Reporter)
I had Claude Code specify the work here. Take it with a grain of salt of course. The biggest concern I have here is binary size estimates.
Current ONNX-Native Implementation
Architecture Overview
Firefox currently ships two ONNX backends:
- `onnx` (WASM backend): the JavaScript/WebAssembly ONNX Runtime
- `onnx-native` (native backend): the C++ ONNX Runtime, loaded as a shared library
How ONNX-Native Works
1. WebIDL Interface
Location: dom/webidl/ONNX.webidl
This interface exposes two main classes to JavaScript:
Tensor
Represents multi-dimensional tensor data.
- Supports multiple data types (e.g. `int64`, `float32`)
- Has a `location` attribute indicating where data lives: `cpu`, `gpu-buffer`, `ml-tensor`, etc.
InferenceSession
Manages model loading and inference.
- `create()` – Loads an ONNX model from bytes
- `run()` – Executes inference with input tensors
- Supports session options, including `executionProviders`
2. C++ Implementation
Location: dom/onnx/InferenceSession.cpp
Dynamic Library Loading (lines 151–211)
const OrtApi* GetOrtAPI() {
  // Condensed excerpt: caching and error handling elided.
  // Dynamically loads libonnxruntime.so/.dylib/.dll from the Firefox installation.
  PRLibrary* handle = PR_LoadLibraryWithFlags(lspec, PR_LD_NOW | PR_LD_LOCAL);
  // ortGetApiBaseFnPtr is the OrtGetApiBase entry point resolved from `handle`.
  const OrtApiBase* apiBase = ortGetApiBaseFnPtr();
  // Requests the versioned ONNX Runtime C API table.
  const OrtApi* ortAPI = apiBase->GetApi(ORT_API_VERSION);
  return ortAPI;
}
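To make the C API flow concrete, here is a minimal sketch (not the Firefox implementation) of how the `OrtApi` table returned by `GetOrtAPI()` is typically used to create a session from model bytes. The `sAPI` and `CreateSessionFromBytes` names are illustrative; the calls themselves are standard ONNX Runtime C API functions.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>
#include "onnxruntime_c_api.h"

static const OrtApi* sAPI = nullptr;  // assumed to hold the pointer from GetOrtAPI()

bool CreateSessionFromBytes(const std::vector<uint8_t>& aModel,
                            OrtEnv** aEnv, OrtSession** aSession) {
  // Every OrtApi call returns an OrtStatus*; nullptr means success.
  if (OrtStatus* s = sAPI->CreateEnv(ORT_LOGGING_LEVEL_WARNING, "firefox-ml", aEnv)) {
    std::fprintf(stderr, "CreateEnv failed: %s\n", sAPI->GetErrorMessage(s));
    sAPI->ReleaseStatus(s);
    return false;
  }
  OrtSessionOptions* options = nullptr;
  if (OrtStatus* s = sAPI->CreateSessionOptions(&options)) {
    sAPI->ReleaseStatus(s);
    return false;
  }
  // The model arrives as bytes (see InferenceSession.create()), so the
  // from-array variant is used rather than a file path.
  OrtStatus* s = sAPI->CreateSessionFromArray(*aEnv, aModel.data(), aModel.size(),
                                              options, aSession);
  sAPI->ReleaseSessionOptions(options);
  if (s) {
    std::fprintf(stderr, "CreateSession failed: %s\n", sAPI->GetErrorMessage(s));
    sAPI->ReleaseStatus(s);
    return false;
  }
  return true;
}
```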
Current Device Support
CPU-only, hardcoded in ONNXPipeline.mjs (lines 99–100):
supportedDevices: ["cpu"],
defaultDevices: ["cpu"],
Session Configuration (lines 70–149)
Supported options include:
- Threading: `intraOpNumThreads`, `interOpNumThreads`
- Optimization: `graphOptimizationLevel` (basic / extended / all)
- Memory: `enableCpuMemArena`, `enableMemPattern`
- Execution mode: sequential or parallel
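These options map fairly directly onto ONNX Runtime C API calls. The following is a hedged sketch of that mapping, not the actual Firefox code; `ApplyOptions` and `SessionOptionsIn` are illustrative names standing in for the WebIDL dictionary plumbing.

```cpp
#include "onnxruntime_c_api.h"

struct SessionOptionsIn {  // stand-in for the WebIDL session-options dictionary
  int intraOpNumThreads = 1;
  int interOpNumThreads = 1;
  GraphOptimizationLevel graphOptimizationLevel = ORT_ENABLE_ALL;
  bool enableCpuMemArena = true;
  bool enableMemPattern = true;
  ExecutionMode executionMode = ORT_SEQUENTIAL;
};

void ApplyOptions(const OrtApi* aAPI, OrtSessionOptions* aOpts,
                  const SessionOptionsIn& aIn) {
  auto check = [&](OrtStatus* aStatus) {
    // Error handling elided; a real implementation would surface the message.
    if (aStatus) aAPI->ReleaseStatus(aStatus);
  };
  // Threading
  check(aAPI->SetIntraOpNumThreads(aOpts, aIn.intraOpNumThreads));
  check(aAPI->SetInterOpNumThreads(aOpts, aIn.interOpNumThreads));
  // Graph optimization level (basic / extended / all)
  check(aAPI->SetSessionGraphOptimizationLevel(aOpts, aIn.graphOptimizationLevel));
  // Memory arena and memory-pattern optimization
  check(aIn.enableCpuMemArena ? aAPI->EnableCpuMemArena(aOpts)
                              : aAPI->DisableCpuMemArena(aOpts));
  check(aIn.enableMemPattern ? aAPI->EnableMemPattern(aOpts)
                             : aAPI->DisableMemPattern(aOpts));
  // Sequential vs. parallel execution of independent graph nodes
  check(aAPI->SetSessionExecutionMode(aOpts, aIn.executionMode));
}
```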
3. Integration with Transformers.js
Location: toolkit/components/ml/content/backends/ONNXPipeline.mjs
The native backend is exposed via a global symbol:
const onnxruntime = {
InferenceSession,
Tensor,
supportedDevices: ["cpu"], // currently hardcoded
defaultDevices: ["cpu"],
};
globalThis[Symbol.for("onnxruntime")] = onnxruntime;
What GPU Support Would Require
1. Execution Providers
ONNX Runtime supports multiple Execution Providers (EPs) for hardware acceleration:
| Provider | Platform | Hardware |
|---|---|---|
| CoreML | macOS / iOS | Apple Neural Engine + GPU |
| DirectML | Windows | DirectX 12 compatible GPUs |
| CUDA | Linux / Windows | NVIDIA GPUs |
| ROCm | Linux | AMD GPUs |
| TensorRT | Linux / Windows | NVIDIA GPUs (optimized) |
| OpenVINO | Multi-platform | Intel hardware |
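A useful building block for runtime device detection: the C API's `GetAvailableProviders()` reports which EPs were actually compiled into the shipped `libonnxruntime`. Below is a minimal sketch (illustrative, not existing Firefox code); `LogAvailableProviders` is an assumed name and the `aAPI` pointer is assumed to come from `GetOrtAPI()`.

```cpp
#include <cstdio>
#include "onnxruntime_c_api.h"

void LogAvailableProviders(const OrtApi* aAPI) {
  char** providers = nullptr;
  int count = 0;
  if (OrtStatus* s = aAPI->GetAvailableProviders(&providers, &count)) {
    aAPI->ReleaseStatus(s);
    return;
  }
  for (int i = 0; i < count; i++) {
    // e.g. "CPUExecutionProvider", "CoreMLExecutionProvider", ...
    std::fprintf(stderr, "EP available: %s\n", providers[i]);
  }
  if (OrtStatus* s = aAPI->ReleaseAvailableProviders(providers, count)) {
    aAPI->ReleaseStatus(s);
  }
}
```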
2. Code Changes Needed
A. Update WebIDL Options
The following already exists but is unused:
dictionary InferenceSessionSessionOptions {
sequence<any> executionProviders; // exists but unused
// ...
}
B. Implement Execution Provider Selection
Location: InferenceSession.cpp (around line 70)
if (aOptions.mExecutionProviders.WasPassed()) {
for (const auto& provider : aOptions.mExecutionProviders.Value()) {
// Parse provider name and options
// Example: "CoreML", "DirectML", "CUDA"
status =
sAPI->SessionOptionsAppendExecutionProvider_XXX(
sessionOptions, ...
);
}
}
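To make the placeholder above more concrete, here is a hedged sketch of EP registration through the generic `SessionOptionsAppendExecutionProvider` entry point of the `OrtApi` table. Which provider names ("CoreML", "DML", ...) this call accepts depends on the ONNX Runtime version and on which EPs were compiled into `libonnxruntime`; some EPs may instead require their dedicated factory functions (e.g. `OrtSessionOptionsAppendExecutionProvider_CoreML`). `AppendRequestedProviders` is an illustrative name, not existing code.

```cpp
#include <string>
#include <vector>
#include "onnxruntime_c_api.h"

bool AppendRequestedProviders(const OrtApi* aAPI, OrtSessionOptions* aOpts,
                              const std::vector<std::string>& aProviders) {
  for (const std::string& name : aProviders) {
    // No provider-specific options are passed in this sketch (zero key/value pairs).
    OrtStatus* status = aAPI->SessionOptionsAppendExecutionProvider(
        aOpts, name.c_str(), /* keys */ nullptr, /* values */ nullptr,
        /* num_keys */ 0);
    if (status) {
      // A real implementation would log the error message and fall back to the
      // next requested provider (ultimately the CPU EP, which is always present).
      aAPI->ReleaseStatus(status);
      return false;
    }
  }
  return true;
}
```

Registration order matters: ONNX Runtime assigns graph nodes to EPs in the order they are appended and falls back to the built-in CPU EP for unsupported operators, which is what makes a GPU-first, CPU-fallback device list workable.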
C. Update Supported Devices
Location: ONNXPipeline.mjs (lines 99–100)
const onnxruntime = {
InferenceSession,
Tensor,
supportedDevices: await detectAvailableDevices(), // ["cpu", "gpu", "npu"]
defaultDevices: ["gpu", "cpu"], // GPU preferred, CPU fallback
};
3. Build System Changes
Current state:
- Firefox ships `libonnxruntime` as a shared library
- The library is built without GPU execution providers
Required changes:
- Build ONNX Runtime with EP support (CoreML, DirectML, etc.)
- Handle platform-specific dependencies
- Manage binary size growth (GPU EPs significantly increase size)
4. Process Architecture Considerations
Based on Bug 1940906, moving to a HWInference process is a likely path.
Current State
- `onnx-native` runs in the Inference process (`INFERENCE_REMOTE_TYPE`)
Future GPU-Enabled State
- Could run in the HWInference process, with:
  - GPU access
  - Hardware acceleration permissions
  - Better isolation from content processes
Relevant Code (lines 213–219)
bool InferenceSession::InInferenceProcess(JSContext*, JSObject*) {
if (!ContentChild::GetSingleton()) {
return false;
}
return ContentChild::GetSingleton()->GetRemoteType().Equals(
INFERENCE_REMOTE_TYPE);
}
This logic would need updating if migrating to HWInference.
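Purely as a hypothetical sketch of that update, the check might grow a second accepted remote type. `HWINFERENCE_REMOTE_TYPE` and `InGpuCapableInferenceProcess` are assumed names (Bug 1940906 has not landed), and this assumes the HWInference process is addressable as a content-process remote type at all, which that bug may change.

```cpp
bool InferenceSession::InGpuCapableInferenceProcess(JSContext*, JSObject*) {
  ContentChild* child = ContentChild::GetSingleton();
  if (!child) {
    return false;
  }
  const nsACString& remoteType = child->GetRemoteType();
  // Accept either the existing Inference process or the HW-accelerated one.
  return remoteType.Equals(INFERENCE_REMOTE_TYPE) ||
         remoteType.Equals(HWINFERENCE_REMOTE_TYPE);
}
```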
5. Testing Requirements
To support GPU acceleration in CI:
- CI machines with real GPUs
- Platform-specific EP testing:
  - CoreML on macOS CI
  - DirectML on Windows CI
- Fallback behavior testing (GPU requested but unavailable)
- Performance benchmarking
Technical Challenges
- Binary size: GPU EPs add roughly 10–50 MB+ per platform
- Platform fragmentation: different EPs per OS (CoreML, DirectML, CUDA, ROCm)
- Tensor data location: the `TensorDataLocation` enum includes `gpu-buffer`, but it is not implemented
- Zero-copy execution: avoid CPU ↔ GPU memory copies to get real gains
- Driver dependencies: some EPs require specific GPU driver versions
Performance Expectations
From the bug description: “order of magnitude speed increase”
Typical improvements:
- Image models (ResNet, ViT): 5–20× faster
- Transformer / LLM models: 10–50× faster with optimization
- Small models: May be slower due to GPU overhead
Next Steps for Implementation
- Investigate how `libonnxruntime` is currently built and shipped
- Decide EP priority (e.g. CoreML first on macOS)
- Prototype in the HWInference process (Bug 1940906)
- Expose EP selection to `ONNXPipeline.mjs`
- Integrate EPs into the build system
- Add GPU-capable CI infrastructure
Comment 2 • 5 days ago
Claude isn't far off and most of the message above is useful, but I can provide some details that it couldn't have known.
Here's what we need to do (the order matters, but some items can be done in parallel):
- Land https://bugzilla.mozilla.org/show_bug.cgi?id=1940906. This contains the code to create a new process type with a dedicated sandbox policy that will only do hardware-accelerated inference (like the GPU process, but with no connection to the display server, no I/O except the shader cache, etc.). I'm addressing review comments and expect to land it in a matter of weeks (I'm slightly sidetracked by other work); it already works well on macOS.
- Write a sandbox policy for Linux Desktop and Windows (more or less the same as the GPU process with a few bits removed; this isn't particularly involved). I plan to do this because we need it for llama/whisper.
- Decide on execution providers for `onnx-runtime`, per platform, and adjust our build scripts to compile them. We have a fork at https://github.com/mozilla/onnxruntime/; building is done on Treeherder and everything is already set up (script at and around https://searchfox.org/firefox-main/source/taskcluster/scripts/misc/build-onnxruntime.sh). There's a good chance the result will be a bit big; I'm investigating options, since I have the same issue for `libggml`/llama/whisper, and quick tests show it compresses well. One thing I'm not sure about is how a particular Firefox version fetches and uses a particular `onnx-runtime` version; we might need to do something here. In the past, we have received help from build system folks on this.
- Land https://bugzilla.mozilla.org/show_bug.cgi?id=1970667 and https://bugzilla.mozilla.org/show_bug.cgi?id=1968939. Without these, we might well keep compiling graphs too often, which would reduce our gains. Because we use transformers.js, we want to wait until they merge v4 (https://github.com/huggingface/transformers.js/tree/v4, imminent as far as I know): we can't break ABI, and this needs a more recent API than what is currently in use. I have no idea how long updating transformers.js takes. Newer `transformers.js` claims to fix the GPU->CPU->GPU roundtrips that make any performance improvement from GPU acceleration essentially vanish.
- Subsequently, improve `InferenceSession.cpp` to work with GPU resources (what Claude said). This isn't particularly involved; it requires plumbing API between transformers.js and our C++ code, via WebIDL.
- If we don't care about transformers.js, we can sidestep all that and make calls into `onnx-runtime` directly from C++.
- Decide how it works in terms of process topology. For now, the Inference (non-HW) process is a content process. It runs `transformers.js`, and I think we like that library. This Inference (non-HW) process cannot run GPU workloads; what it could do is talk to the `HWInference` process (simply forwarding calls via IPC). That means we'd have two inference processes. We need to discuss this: there's a tension between having a single process to save resources and limiting the attack surface of the `HWInference` process. My original idea is that it wouldn't run JS, for example, but that it could call more dangerous syscalls and have more capabilities.
- In terms of CI, it's not hard: we have been running a few workloads on GPU-equipped machines for a while now, for example to do real testing of hardware-accelerated media encoders and decoders. You'll find taskcluster files referencing worker types that usually have `-gpu-` somewhere in the name; for media it was just a matter of moving some jobs to that type of machine.
Comment 3 • 5 days ago (Reporter)
> If we don't care about transformers.js, we can sidestep all that, directly make calls into onnx-runtime from C++.
Transformers.js is in charge of things like tokenization and basically makes it ergonomic to call into the onnx-runtime. If we remove it, we'll be asking anyone who interacts with the Firefox AI Runtime to care a lot about implementation details: getting their data into the proper tensor shape so it's ready to run, and then interpreting the tensor results when they come out. I guess it depends on how much of that we want to own, and on the benefit/risk of moving away from it to use ONNX Runtime directly.
Perhaps we can revisit this after Bug 1992255, which could move tokenizers to C++.
> HWInference process: my original idea is that it wouldn't run JS, for example, but that it could call more dangerous syscalls and have more capabilities
I'm happy with having a JS-free process. A lot of the overhead of the Inference process comes from loading a full JS runtime and its dependencies into the process. For instance, I see font loading and other things we absolutely don't need in the Inference process. Having lower-level code executing in a sandboxed environment feels good to me.