Open Bug 2010070 Opened 5 days ago Updated 3 days ago

Enable GPU support for onnx-native

Categories

(Core :: Machine Learning: On Device, enhancement, P3)


People

(Reporter: gregtatum, Unassigned, NeedInfo)

References

(Blocks 1 open bug)

Details

With onnx-native, we can potentially leverage GPUs and NPUs for doing work, bringing an order-of-magnitude speed increase.

:padenot, :tarek, can you provide some additional details here on requirements? I can then split out more bugs for any of the particulars.

I think we've discussed running GPU-enabled onnx-native in its own process. Is this accurate? Is it the HWInference process in Bug 1940906? I'm guessing that "HW" stands for hardware here.

We'd also need some tests on GPU-capable machines to run in CI to make this happen.

Flags: needinfo?(tarek)
Flags: needinfo?(padenot)

I had Claude Code specify the work here. Take it with a grain of salt, of course. The biggest concern I have here is the binary size estimates.

Current ONNX-Native Implementation

Architecture Overview

Firefox currently ships two ONNX backends:

  1. onnx (WASM backend)
    JavaScript/WebAssembly-based ONNX Runtime

  2. onnx-native (Native backend)
    C++ native ONNX Runtime loaded as a shared library


How ONNX-Native Works

1. WebIDL Interface

Location: dom/webidl/ONNX.webidl

This interface exposes two main classes to JavaScript:

Tensor

Represents multi-dimensional tensor data.

  • Supports multiple data types (e.g. int64, float32)

  • Has a location attribute indicating where data lives:

    • cpu
    • gpu-buffer
    • ml-tensor
    • etc.

InferenceSession

Manages model loading and inference.

  • create() – Loads an ONNX model from bytes
  • run() – Executes inference with input tensors
  • Supports session options, including executionProviders

2. C++ Implementation

Location: dom/onnx/InferenceSession.cpp

Dynamic Library Loading (lines 151–211)

const OrtApi* GetOrtAPI() {
  // Dynamically loads libonnxruntime.so/.dylib/.dll from the Firefox installation
  PRLibrary* handle = PR_LoadLibraryWithFlags(lspec, PR_LD_NOW | PR_LD_LOCAL);

  // Resolves OrtGetApiBase from the library and gets the ONNX Runtime C API table
  const OrtApiBase* apiBase = ortGetApiBaseFnPtr();
  const OrtApi* ortAPI = apiBase->GetApi(ORT_API_VERSION);
  return ortAPI;
  // (abridged: lspec / ortGetApiBaseFnPtr setup and error handling omitted)
}
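
Because libonnxruntime is loaded at runtime rather than linked, any GPU execution provider entry points would also have to be resolved dynamically. A minimal sketch, assuming the library handle from above and assuming the shipped libonnxruntime were built with the CoreML EP so that it actually exports this factory function (neither is true today):

// Hypothetical: resolve an EP-specific factory from a GPU-enabled build.
// OrtSessionOptionsAppendExecutionProvider_CoreML only exists if the library
// was compiled with the CoreML execution provider.
using AppendCoreMLFn = OrtStatus* (*)(OrtSessionOptions*, uint32_t);

AppendCoreMLFn GetCoreMLAppendFn(PRLibrary* aHandle) {
  // PR_FindFunctionSymbol returns nullptr when the symbol is missing,
  // which is how a CPU-only build would be detected at runtime.
  return reinterpret_cast<AppendCoreMLFn>(PR_FindFunctionSymbol(
      aHandle, "OrtSessionOptionsAppendExecutionProvider_CoreML"));
}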

Current Device Support

CPU-only, hardcoded in ONNXPipeline.mjs (lines 99–100):

supportedDevices: ["cpu"],
defaultDevices: ["cpu"],

Session Configuration (lines 70–149)

Supported options include (the corresponding ONNX Runtime C API calls are sketched after this list):

  • Threading

    • intraOpNumThreads
    • interOpNumThreads
  • Optimization

    • graphOptimizationLevel (basic / extended / all)
  • Memory

    • enableCpuMemArena
    • enableMemPattern
  • Execution mode

    • Sequential
    • Parallel
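
As a reference for what the option plumbing amounts to, here is a minimal sketch of the corresponding ONNX Runtime C API calls, assuming sAPI holds the const OrtApi* obtained by GetOrtAPI() and skipping the OrtStatus error handling each call actually requires:

// Sketch only: maps the options listed above to OrtApi entry points.
OrtSessionOptions* sessionOptions = nullptr;
sAPI->CreateSessionOptions(&sessionOptions);

// Threading
sAPI->SetIntraOpNumThreads(sessionOptions, 4);
sAPI->SetInterOpNumThreads(sessionOptions, 1);

// Optimization: ORT_ENABLE_BASIC / ORT_ENABLE_EXTENDED / ORT_ENABLE_ALL
sAPI->SetSessionGraphOptimizationLevel(sessionOptions, ORT_ENABLE_ALL);

// Memory
sAPI->EnableCpuMemArena(sessionOptions);
sAPI->EnableMemPattern(sessionOptions);

// Execution mode: ORT_SEQUENTIAL or ORT_PARALLEL
sAPI->SetSessionExecutionMode(sessionOptions, ORT_SEQUENTIAL);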

3. Integration with Transformers.js

Location: toolkit/components/ml/content/backends/ONNXPipeline.mjs

The native backend is exposed via a global symbol:

const onnxruntime = {
  InferenceSession,
  Tensor,
  supportedDevices: ["cpu"], // currently hardcoded
  defaultDevices: ["cpu"],
};

globalThis[Symbol.for("onnxruntime")] = onnxruntime;

What GPU Support Would Require

1. Execution Providers

ONNX Runtime supports multiple Execution Providers (EPs) for hardware acceleration:

Provider – Platform – Hardware

  • CoreML – macOS / iOS – Apple Neural Engine + GPU
  • DirectML – Windows – DirectX 12 compatible GPUs
  • CUDA – Linux / Windows – NVIDIA GPUs
  • ROCm – Linux – AMD GPUs
  • TensorRT – Linux / Windows – NVIDIA GPUs (optimized)
  • OpenVINO – Multi-platform – Intel hardware

2. Code Changes Needed

A. Update WebIDL Options

The following already exists but is unused:

dictionary InferenceSessionSessionOptions {
  sequence<any> executionProviders; // exists but unused
  // ...
}

B. Implement Execution Provider Selection

Location: InferenceSession.cpp (around line 70)

if (aOptions.mExecutionProviders.WasPassed()) {
  for (const auto& provider : aOptions.mExecutionProviders.Value()) {
    // Parse provider name and options
    // Example: "CoreML", "DirectML", "CUDA"
    status =
      sAPI->SessionOptionsAppendExecutionProvider_XXX(
        sessionOptions, ...
      );
  }
}
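
For what the _XXX placeholder could look like in practice: recent ONNX Runtime versions expose a generic SessionOptionsAppendExecutionProvider entry point that takes a provider name plus key/value options. A hedged sketch follows; which provider names are accepted depends on the ONNX Runtime version and on which EPs the library was built with, and the "CoreML" option key/value shown here are illustrative rather than confirmed for our build:

// Sketch: append an execution provider by name via the generic C API.
// Some EPs also have dedicated factory functions such as
// OrtSessionOptionsAppendExecutionProvider_CoreML.
const char* keys[] = {"MLComputeUnits"};   // illustrative option key
const char* values[] = {"CPUAndGPU"};      // illustrative option value
OrtStatus* status = sAPI->SessionOptionsAppendExecutionProvider(
    sessionOptions, "CoreML", keys, values, 1);
if (status) {
  // EP unavailable in this build / on this device: release the status and
  // fall back to CPU-only execution.
  sAPI->ReleaseStatus(status);
}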

C. Update Supported Devices

Location: ONNXPipeline.mjs (lines 99–100)

const onnxruntime = {
  InferenceSession,
  Tensor,
  supportedDevices: await detectAvailableDevices(), // hypothetical helper, e.g. ["cpu", "gpu", "npu"]
  defaultDevices: ["gpu", "cpu"], // GPU preferred, CPU fallback
};

3. Build System Changes

Current state:

  • Firefox ships libonnxruntime as a shared library
  • Built without GPU execution providers

Required changes:

  1. Build ONNX Runtime with EP support (CoreML, DirectML, etc.)
  2. Handle platform-specific dependencies
  3. Manage binary size growth (GPU EPs significantly increase size)

4. Process Architecture Considerations

Based on Bug 1940906, moving to a HWInference process is a likely path.

Current State

  • onnx-native runs in the Inference process
    (INFERENCE_REMOTE_TYPE)

Future GPU-Enabled State

  • Could run in HWInference process with:

    • GPU access
    • Hardware acceleration permissions
    • Better isolation from content processes

Relevant Code (lines 213–219)

bool InferenceSession::InInferenceProcess(JSContext*, JSObject*) {
  if (!ContentChild::GetSingleton()) {
    return false;
  }
  return ContentChild::GetSingleton()->GetRemoteType().Equals(
      INFERENCE_REMOTE_TYPE);
}

This logic would need updating if migrating to HWInference.
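
A hedged sketch of what that update could look like, where HWINFERENCE_REMOTE_TYPE is a placeholder for whatever remote type constant Bug 1940906 ends up defining:

// Hypothetical: accept either the existing Inference process or the new
// HWInference process. HWINFERENCE_REMOTE_TYPE is a placeholder name.
bool InferenceSession::InInferenceProcess(JSContext*, JSObject*) {
  ContentChild* child = ContentChild::GetSingleton();
  if (!child) {
    return false;
  }
  const nsACString& remoteType = child->GetRemoteType();
  return remoteType.Equals(INFERENCE_REMOTE_TYPE) ||
         remoteType.Equals(HWINFERENCE_REMOTE_TYPE);
}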


5. Testing Requirements

To support GPU acceleration in CI:

  • CI machines with real GPUs

  • Platform-specific EP testing:

    • CoreML on macOS CI
    • DirectML on Windows CI
  • Fallback behavior testing (GPU requested but unavailable)

  • Performance benchmarking


Technical Challenges

  1. Binary size
    GPU EPs add ~10–50 MB+ per platform

  2. Platform fragmentation
    Different EPs per OS (CoreML, DirectML, CUDA, ROCm)

  3. Tensor data location
    TensorDataLocation enum includes gpu-buffer, but it’s not implemented

  4. Zero-copy execution
    Avoid CPU ↔ GPU memory copies to get real gains

  5. Driver dependencies
    Some EPs require specific GPU driver versions


Performance Expectations

From the bug description: “order of magnitude speed increase”

Typical improvements:

  • Image models (ResNet, ViT): 5–20× faster
  • Transformer / LLM models: 10–50× faster with optimization
  • Small models: May be slower due to GPU overhead

Next Steps for Implementation

  1. Investigate how libonnxruntime is currently built and shipped
  2. Decide EP priority (e.g. CoreML first on macOS)
  3. Prototype in the HWInference process (Bug 1940906)
  4. Expose EP selection to ONNXPipeline.mjs
  5. Integrate EPs into the build system
  6. Add GPU-capable CI infrastructure

Claude isn't far off and most of the message above is useful, but I can provide some details that it couldn't have known.

Here's what we need to do (the order matters, but some items can be done in parallel):

  • Land https://bugzilla.mozilla.org/show_bug.cgi?id=1940906. This contains the code to create a new process type with a dedicated sandbox policy that will only do hardware accelerated inference (like the GPU process, but without a connection to the display server, no IO except the shader cache, etc.). I'm addressing review comments and expect to land it in a matter of weeks (I'm slightly sidetracked by other work); it already works well on macOS
  • Write a sandbox policy for Linux desktop and Windows (more or less the same as the GPU process with a few bits removed; this isn't particularly involved). I plan to do this because we need it for llama/whisper
  • Decide on execution providers for the onnx-runtime on each platform, and adjust our build scripts to compile them. We have a fork at https://github.com/mozilla/onnxruntime/, building is done on treeherder, and everything is already set up; the script is at and around https://searchfox.org/firefox-main/source/taskcluster/scripts/misc/build-onnxruntime.sh. There's a good chance the result will be a bit big. I'm investigating options, since I have the same issue for libggml / llama / whisper; I have done some quick tests and it compresses well. Something I'm not sure about is how a particular Firefox version fetches and uses a particular onnx-runtime version; we might need to do something here. In the past, we have received help from build system folks on this
  • Land https://bugzilla.mozilla.org/show_bug.cgi?id=1970667 and https://bugzilla.mozilla.org/show_bug.cgi?id=1968939. Without these, we might well keep compiling graphs too often, which will shrink our gains. Because we use transformers.js, we want to wait until they merge v4 (https://github.com/huggingface/transformers.js/tree/v4, imminent as far as I know): we can't break ABI, and this needs a more recent API than what is currently in use. I have no idea how long it takes to update transformers.js. Newer transformers.js claims to fix the GPU->CPU->GPU roundtrips that make any performance improvement from GPU acceleration essentially vanish
  • Subsequently, improve InferenceSession.cpp to work with GPU resources (what Claude said). This isn't particularly involved; it requires plumbing API between transformers.js and our C++ code, via WebIDL.
  • If we don't care about transformers.js, we can sidestep all that and directly make calls into onnx-runtime from C++.
  • Decide how this works in terms of process topology. For now, the Inference (non-HW) process is a content process. It runs transformers.js, and I think we like that library. This Inference (non-HW) process cannot run GPU workloads. What it could do is talk to the HWInference process (simply forwarding calls via IPC). This means we'd have two inference processes. We need to discuss this, but there's a tension between having a single process to save resources and limiting the attack surface of the HWInference process: my original idea is that it wouldn't run JS, for example, but that it could call more dangerous syscalls and have more capabilities.
  • In terms of CI, it's not hard: we have been running a few workloads on GPU-equipped machines for a while now, for example to do real testing of hardware accelerated media encoders and decoders -- you'll find taskcluster files referencing worker types that usually have -gpu- somewhere in the name -- for media it was just a matter of moving some jobs to that type of machine
Flags: needinfo?(padenot)

> If we don't care about transformers.js, we can sidestep all that and directly make calls into onnx-runtime from C++.

Transformers.js is in charge of things like tokenization and basically making it ergonomic to call into the onnx-runtime. If we remove it, we'll be asking anyone who interacts with the Firefox AI Runtime to care a lot about implementation details: getting their data into the proper tensor shape so it's ready to run, and then interpreting the tensor results when they come out. I guess it depends on how much of that we want to own, and the benefit/risk of moving away from it to use onnx runtime directly.

Perhaps we can revisit this after Bug 1992255 which could move tokenizers to C++.

> HWInference process: my original idea is that it wouldn't run JS, for example, but that it could call more dangerous syscalls and have more capabilities

I'm happy with having a JS-free process. A lot of the overhead of the Inference process comes from loading a full JS runtime and its dependencies into the process. For instance, I see font loading and other things we absolutely don't need in the Inference process. Having lower level code executing in a sandboxed environment feels good to me.
