Closed Bug 1746631 Opened 3 years ago Closed 2 years ago

Add matrix multiply intrinsics for Firefox Translations

Categories

(Core :: JavaScript: WebAssembly, task, P2)

Tracking


RESOLVED FIXED
98 Branch
Tracking Status
firefox98 --- fixed

People

(Reporter: aaggarwal, Assigned: aaggarwal)

References

Details

Attachments

(4 files)

Assignee: nobody → aaggarwal
Status: NEW → ASSIGNED
Depends on: 1722102

As a sanity check, I tested my implementation of the matrix-multiply intrinsics along with the intgemm sources in Firefox Nightly for the Bergamot project, and inference ran successfully on macOS.

Further, I benchmarked the following setups to compare the translation speeds:

  1. Wasm Gemm : Gemm library (intgemm) compiled to wasm
  2. Wormhole : Gemm library (intgemm) compiled to wasm, but using the wormhole for the 3 most expensive Intel instructions
  3. Native Firefox gemm : Entire Gemm library (intgemm) exported as intrinsics from within Firefox
    1. I could benchmark both SSSE3 and AVX2

I am using the same translator configuration for benchmarking.

Models used for evaluation: English -> German, English -> Spanish

Length of the text used for translation: ~5000 words
System: MacBook Pro (15-inch, 2017), MacOS version 11.6.2, 3.1 GHz Quad-Core Intel Core i7 processor, 16 GB 2133 MHz RAM
wps: Translation speed measured in words per second

Results for English -> German:

  1. Wasm Gemm : 95 wps
  2. Wormhole : 390 wps (+310% to Wasm Gemm)
  3. Native Firefox gemm
    1. SSSE3 : 490 wps (+25% to Wormhole, +415% to Wasm Gemm)
    2. AVX2 : 560 wps (+43% to Wormhole, +489% to Wasm Gemm)

Results for English -> Spanish:

  1. Wasm Gemm : 105 wps
  2. Wormhole : 440 wps (+319% to Wasm Gemm)
  3. Native Firefox gemm
    1. SSSE3 : 550 wps (+25% to Wormhole, +423% to Wasm Gemm)
    2. AVX2 : 625 wps (+43% to Wormhole, +495% to Wasm Gemm)
Severity: -- → N/A
Priority: -- → P2

Some information to take into account: we are mixing 128-bit and 256-bit SIMD instructions here. SM uses 128-bit and intgemm's AVX2 path can use 256-bit. The Intel documentation says that mixing these will slow down execution. The recommendation is to use _mm256_zeroupper() at the boundary.

Can we test this theory by adding _mm256_zeroupper() at the beginning and end of the multiply intrinsic?

Dividing the whole change into the following units as per my discussion with Yury:

  • Add support for intrinsics with a larger number of arguments (12) than currently supported (4)
  • Add the integer gemm intrinsics and their implementation
  • Test cases for the implementation

Earlier, functions with up to 4 arguments were supported.
Now, the support has been extended to 12 arguments.

  • Implements 7 intrinsic functions
  • These intrinsics are only enabled on x86/x86-64 platforms and for
    privileged extensions
  • These intrinsics should never be accessible to web pages
    -- Added a corresponding mochitest

Depends on D136023

  • Test cases for all 7 intrinsic functions

Depends on D136430

  • The tests previously ran on Beta and Release
    • Now they run only on Nightly
    • Refactored the test-skipping code
      • Moved it from CommonTestSetup to a directive
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 98 Branch
Regressions: 1754207

(In reply to Yury Delendik (:yury) from comment #3)

SM uses 128-bit and intgemm's AVX2 path can use 256-bit. The Intel documentation says that mixing these will slow down execution. The recommendation is to use _mm256_zeroupper() at the boundary.

Checked the build artifacts: the C++ compiler properly inserts VZEROUPPER instructions before exiting the functions that use YMM registers, so it looks like we are good here and there is no danger of a slowdown.
