Open Bug 2010070 Opened 5 days ago Updated 3 days ago

Enable GPU support for onnx-native

Categories

(Core :: Machine Learning: On Device, enhancement, P3)


People

(Reporter: gregtatum, Unassigned, NeedInfo)

References

(Blocks 1 open bug)

Details

With onnx-native, we can potentially leverage GPUs and NPUs for doing work, bringing an order-of-magnitude speed increase.

:padenot, :tarek, can you provide some additional details here on requirements? I can then split out more bugs for any of the particulars.

I think we've discussed running GPU-enabled onnx-native in its own process. Is this accurate? Is it the HWInference process in Bug 1940906? I'm guessing that "HW" stands for hardware here.

We'd also need some tests on GPU-capable machines to run in CI to make this happen.

Flags: needinfo?(tarek)
Flags: needinfo?(padenot)

I had Claude Code specify the work here. Take it with a grain of salt, of course. The biggest concern I have here is the binary size estimates.

Current ONNX-Native Implementation

Architecture Overview

Firefox currently ships two ONNX backends:

  1. onnx (WASM backend)
    JavaScript/WebAssembly-based ONNX Runtime

  2. onnx-native (Native backend)
    C++ native ONNX Runtime loaded as a shared library


How ONNX-Native Works

1. WebIDL Interface

Location: dom/webidl/ONNX.webidl

This interface exposes two main classes to JavaScript:

Tensor

Represents multi-dimensional tensor data.

  • Supports multiple data types (e.g. int64, float32)

  • Has a location attribute indicating where data lives:

    • cpu
    • gpu-buffer
    • ml-tensor
    • etc.

InferenceSession

Manages model loading and inference.

  • create() – Loads an ONNX model from bytes
  • run() – Executes inference with input tensors
  • Supports session options, including executionProviders

2. C++ Implementation

Location: dom/onnx/InferenceSession.cpp

Dynamic Library Loading (lines 151–211)

const OrtApi* GetOrtAPI() {
  // Dynamically loads libonnxruntime.so/.dylib/.dll from the Firefox installation
  PRLibrary* handle = PR_LoadLibraryWithFlags(lspec, PR_LD_NOW | PR_LD_LOCAL);

  // Resolves OrtGetApiBase from the library and gets the ONNX Runtime C API table
  const OrtApiBase* apiBase = ortGetApiBaseFnPtr();
  const OrtApi* ortAPI = apiBase->GetApi(ORT_API_VERSION);
  return ortAPI;
  // (abridged: lspec / ortGetApiBaseFnPtr setup and error handling omitted)
}
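
Because libonnxruntime is loaded at runtime rather than linked, any GPU execution provider entry points would also have to be resolved dynamically. A minimal sketch, assuming the library handle from above and assuming the shipped libonnxruntime were built with the CoreML EP so that it actually exports this factory function (neither is true today):

// Hypothetical: resolve an EP-specific factory from a GPU-enabled build.
// OrtSessionOptionsAppendExecutionProvider_CoreML only exists if the library
// was compiled with the CoreML execution provider.
using AppendCoreMLFn = OrtStatus* (*)(OrtSessionOptions*, uint32_t);

AppendCoreMLFn GetCoreMLAppendFn(PRLibrary* aHandle) {
  // PR_FindFunctionSymbol returns nullptr when the symbol is missing,
  // which is how a CPU-only build would be detected at runtime.
  return reinterpret_cast<AppendCoreMLFn>(PR_FindFunctionSymbol(
      aHandle, "OrtSessionOptionsAppendExecutionProvider_CoreML"));
}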

Current Device Support

CPU-only, hardcoded in ONNXPipeline.mjs (lines 99–100):

supportedDevices: ["cpu"],
defaultDevices: ["cpu"],

Session Configuration (lines 70–149)

Supported options include (the corresponding ONNX Runtime C API calls are sketched after this list):

  • Threading

    • intraOpNumThreads
    • interOpNumThreads
  • Optimization

    • graphOptimizationLevel (basic / extended / all)
  • Memory

    • enableCpuMemArena
    • enableMemPattern
  • Execution mode

    • Sequential
    • Parallel
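
As a reference for what the option plumbing amounts to, here is a minimal sketch of the corresponding ONNX Runtime C API calls, assuming sAPI holds the const OrtApi* obtained by GetOrtAPI() and skipping the OrtStatus error handling each call actually requires:

// Sketch only: maps the options listed above to OrtApi entry points.
OrtSessionOptions* sessionOptions = nullptr;
sAPI->CreateSessionOptions(&sessionOptions);

// Threading
sAPI->SetIntraOpNumThreads(sessionOptions, 4);
sAPI->SetInterOpNumThreads(sessionOptions, 1);

// Optimization: ORT_ENABLE_BASIC / ORT_ENABLE_EXTENDED / ORT_ENABLE_ALL
sAPI->SetSessionGraphOptimizationLevel(sessionOptions, ORT_ENABLE_ALL);

// Memory
sAPI->EnableCpuMemArena(sessionOptions);
sAPI->EnableMemPattern(sessionOptions);

// Execution mode: ORT_SEQUENTIAL or ORT_PARALLEL
sAPI->SetSessionExecutionMode(sessionOptions, ORT_SEQUENTIAL);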

3. Integration with Transformers.js

Location: toolkit/components/ml/content/backends/ONNXPipeline.mjs

The native backend is exposed via a global symbol:

const onnxruntime = {
  InferenceSession,
  Tensor,
  supportedDevices: ["cpu"], // currently hardcoded
  defaultDevices: ["cpu"],
};

globalThis[Symbol.for("onnxruntime")] = onnxruntime;

What GPU Support Would Require

1. Execution Providers

ONNX Runtime supports multiple Execution Providers (EPs) for hardware acceleration:

Provider – Platform – Hardware

  • CoreML – macOS / iOS – Apple Neural Engine + GPU
  • DirectML – Windows – DirectX 12 compatible GPUs
  • CUDA – Linux / Windows – NVIDIA GPUs
  • ROCm – Linux – AMD GPUs
  • TensorRT – Linux / Windows – NVIDIA GPUs (optimized)
  • OpenVINO – Multi-platform – Intel hardware

2. Code Changes Needed

A. Update WebIDL Options

The following already exists but is unused:

dictionary InferenceSessionSessionOptions {
  sequence<any> executionProviders; // exists but unused
  // ...
}

B. Implement Execution Provider Selection

Location: InferenceSession.cpp (around line 70)

if (aOptions.mExecutionProviders.WasPassed()) {
  for (const auto& provider : aOptions.mExecutionProviders.Value()) {
    // Parse provider name and options
    // Example: "CoreML", "DirectML", "CUDA"
    status =
      sAPI->SessionOptionsAppendExecutionProvider_XXX(
        sessionOptions, ...
      );
  }
}
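
For what the _XXX placeholder could look like in practice: recent ONNX Runtime versions expose a generic SessionOptionsAppendExecutionProvider entry point that takes a provider name plus key/value options. A hedged sketch follows; which provider names are accepted depends on the ONNX Runtime version and on which EPs the library was built with, and the "CoreML" option key/value shown here are illustrative rather than confirmed for our build:

// Sketch: append an execution provider by name via the generic C API.
// Some EPs also have dedicated factory functions such as
// OrtSessionOptionsAppendExecutionProvider_CoreML.
const char* keys[] = {"MLComputeUnits"};   // illustrative option key
const char* values[] = {"CPUAndGPU"};      // illustrative option value
OrtStatus* status = sAPI->SessionOptionsAppendExecutionProvider(
    sessionOptions, "CoreML", keys, values, 1);
if (status) {
  // EP unavailable in this build / on this device: release the status and
  // fall back to CPU-only execution.
  sAPI->ReleaseStatus(status);
}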

C. Update Supported Devices

Location: ONNXPipeline.mjs (lines 99–100)

const onnxruntime = {
  InferenceSession,
  Tensor,
  supportedDevices: await detectAvailableDevices(), // hypothetical helper, e.g. ["cpu", "gpu", "npu"]
  defaultDevices: ["gpu", "cpu"], // GPU preferred, CPU fallback
};

3. Build System Changes

Current state:

  • Firefox ships libonnxruntime as a shared library
  • Built without GPU execution providers

Required changes:

  1. Build ONNX Runtime with EP support (CoreML, DirectML, etc.)
  2. Handle platform-specific dependencies
  3. Manage binary size growth (GPU EPs significantly increase size)

4. Process Architecture Considerations

Based on Bug 1940906, moving to a HWInference process is a likely path.

Current State

  • onnx-native runs in the Inference process
    (INFERENCE_REMOTE_TYPE)

Future GPU-Enabled State

  • Could run in HWInference process with:

    • GPU access
    • Hardware acceleration permissions
    • Better isolation from content processes

Relevant Code (lines 213–219)

bool InferenceSession::InInferenceProcess(JSContext*, JSObject*) {
  if (!ContentChild::GetSingleton()) {
    return false;
  }
  return ContentChild::GetSingleton()->GetRemoteType().Equals(
      INFERENCE_REMOTE_TYPE);
}

This logic would need updating if migrating to HWInference.
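
A hedged sketch of what that update could look like, where HWINFERENCE_REMOTE_TYPE is a placeholder for whatever remote type constant Bug 1940906 ends up defining:

// Hypothetical: accept either the existing Inference process or the new
// HWInference process. HWINFERENCE_REMOTE_TYPE is a placeholder name.
bool InferenceSession::InInferenceProcess(JSContext*, JSObject*) {
  ContentChild* child = ContentChild::GetSingleton();
  if (!child) {
    return false;
  }
  const nsACString& remoteType = child->GetRemoteType();
  return remoteType.Equals(INFERENCE_REMOTE_TYPE) ||
         remoteType.Equals(HWINFERENCE_REMOTE_TYPE);
}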


5. Testing Requirements

To support GPU acceleration in CI:

  • CI machines with real GPUs

  • Platform-specific EP testing:

    • CoreML on macOS CI
    • DirectML on Windows CI
  • Fallback behavior testing (GPU requested but unavailable)

  • Performance benchmarking


Technical Challenges

  1. Binary size
    GPU EPs add ~10–50 MB+ per platform

  2. Platform fragmentation
    Different EPs per OS (CoreML, DirectML, CUDA, ROCm)

  3. Tensor data location
    TensorDataLocation enum includes gpu-buffer, but it’s not implemented

  4. Zero-copy execution
    Avoid CPU ↔ GPU memory copies to get real gains

  5. Driver dependencies
    Some EPs require specific GPU driver versions


Performance Expectations

From the bug description: “order of magnitude speed increase”

Typical improvements:

  • Image models (ResNet, ViT): 5–20× faster
  • Transformer / LLM models: 10–50× faster with optimization
  • Small models: May be slower due to GPU overhead

Next Steps for Implementation

  1. Investigate how libonnxruntime is currently built and shipped
  2. Decide EP priority (e.g. CoreML first on macOS)
  3. Prototype in the HWInference process (Bug 1940906)
  4. Expose EP selection to ONNXPipeline.mjs
  5. Integrate EPs into the build system
  6. Add GPU-capable CI infrastructure

Claude isn't far off and most of the message above is useful, but I can provide some details that it couldn't have known.

Here's what we need to do (the order matters, but some items can be done in parallel):

  • Land https://bugzilla.mozilla.org/show_bug.cgi?id=1940906. This contains the code to create a new process type with a dedicated sandbox policy that will only do hardware accelerated inference (like the GPU process, but without a connection to the display server, no IO except the shader cache, etc.). I'm addressing review comments and expect to land it in a matter of weeks (I'm slightly sidetracked by other work); it already works well on macOS
  • Write a sandbox policy for Linux desktop and Windows (more or less the same as the GPU process with a few bits removed; this isn't particularly involved). I plan to do this because we need it for llama/whisper
  • Decide on execution providers for the onnx-runtime on each platform, and adjust our build scripts to compile them. We have a fork at https://github.com/mozilla/onnxruntime/, building is done on treeherder, and everything is already set up; the script is at and around https://searchfox.org/firefox-main/source/taskcluster/scripts/misc/build-onnxruntime.sh. There's a good chance the result will be a bit big. I'm investigating options, since I have the same issue for libggml / llama / whisper; I have done some quick tests and it compresses well. Something I'm not sure about is how a particular Firefox version fetches and uses a particular onnx-runtime version; we might need to do something here. In the past, we have received help from build system folks on this
  • Land https://bugzilla.mozilla.org/show_bug.cgi?id=1970667 and https://bugzilla.mozilla.org/show_bug.cgi?id=1968939. Without these, we might well keep compiling graphs too often, which will shrink our gains. Because we use transformers.js, we want to wait until they merge v4 (https://github.com/huggingface/transformers.js/tree/v4, imminent as far as I know): we can't break ABI, and this needs a more recent API than what is currently in use. I have no idea how long it takes to update transformers.js. Newer transformers.js claims to fix the GPU->CPU->GPU roundtrips that make any performance improvement from GPU acceleration essentially vanish
  • Subsequently, improve InferenceSession.cpp to work with GPU resources (what Claude said). This isn't particularly involved; it requires plumbing API between transformers.js and our C++ code, via WebIDL.
  • If we don't care about transformers.js, we can sidestep all that and directly make calls into onnx-runtime from C++.
  • Decide how this works in terms of process topology. For now, the Inference (non-HW) process is a content process. It runs transformers.js, and I think we like that library. This Inference (non-HW) process cannot run GPU workloads. What it could do is talk to the HWInference process (simply forwarding calls via IPC). This means we'd have two inference processes. We need to discuss this, but there's a tension between having a single process to save resources and limiting the attack surface of the HWInference process: my original idea is that it wouldn't run JS, for example, but that it could call more dangerous syscalls and have more capabilities.
  • In terms of CI, it's not hard: we have been running a few workloads on GPU-equipped machines for a while now, for example to do real testing of hardware accelerated media encoders and decoders -- you'll find taskcluster files referencing worker types that usually have -gpu- somewhere in the name -- for media it was just a matter of moving some jobs to that type of machine
Flags: needinfo?(padenot)

> If we don't care about transformers.js, we can sidestep all that and directly make calls into onnx-runtime from C++.

Transformers.js is in charge of things like tokenization and basically making it ergonomic to call into the onnx-runtime. If we remove it, we'll be asking anyone who interacts with the Firefox AI Runtime to care a lot about implementation details: getting their data into the proper tensor shape so it's ready to run, and then interpreting the tensor results when they come out. I guess it depends on how much of that we want to own, and the benefit/risk of moving away from it to use onnx runtime directly.

Perhaps we can revisit this after Bug 1992255 which could move tokenizers to C++.

> HWInference process: my original idea is that it wouldn't run JS, for example, but that it could call more dangerous syscalls and have more capabilities

I'm happy with having a JS-free process. A lot of the overhead of the Inference process comes from loading a full JS runtime and its dependencies into the process. For instance, I see font loading and other things we absolutely don't need in the Inference process. Having lower level code executing in a sandboxed environment feels good to me.
