Add matrix multiply intrinsics for Firefox Translations
Categories
(Core :: JavaScript: WebAssembly, task, P2)
Tracking
Tracking | Status
---|---
firefox98 | fixed
People
(Reporter: aaggarwal, Assigned: aaggarwal)
References
Details
Attachments
(4 files)
- Add the actual matrix multiply intrinsics to the mozIntGemm() module (the module itself was added via https://bugzilla.mozilla.org/show_bug.cgi?id=1720514 and https://bugzilla.mozilla.org/show_bug.cgi?id=1721686)
- The actual intrinsics are to be added in this bug
As a sanity check, I tested my implementation of the matrix multiply intrinsics along with the intgemm sources in Firefox Nightly for the Bergamot project, and inference ran successfully on macOS.
Further, I benchmarked the following setups to compare translation speeds:
Wasm Gemm: Gemm library (intgemm) compiled to wasm
Wormhole: Gemm library (intgemm) compiled to wasm, but using the wormhole for the 3 most expensive Intel instructions
Native Firefox gemm: Entire Gemm library (intgemm) exported as intrinsics from within Firefox (I could benchmark both SSSE3 and AVX2)
I am using the same translator configuration for benchmarking.
Models used for evaluation: English -> German, English -> Spanish
Length of the text used for translation: ~5000 words
System: MacBook Pro (15-inch, 2017), macOS 11.6.2, 3.1 GHz Quad-Core Intel Core i7 processor, 16 GB 2133 MHz RAM
wps: Translation speed measured in words per second
Results for English -> German:
Results for English -> Spanish:
Wasm Gemm: 105 wps
Wormhole: 440 wps (+319% vs. Wasm Gemm)
Native Firefox gemm:
- SSSE3: 550 wps (+25% vs. Wormhole, +423% vs. Wasm Gemm)
- AVX2: 625 wps (+43% vs. Wormhole, +495% vs. Wasm Gemm)
Updated•3 years ago
Comment 3•3 years ago
Some information to take into account: we are mixing 128-bit and 256-bit SIMD instructions here. SpiderMonkey uses 128-bit SIMD, while the intgemm AVX2 path can use 256-bit. The Intel documentation says that mixing these will slow down execution; the recommendation is to use _mm256_zeroupper() at the boundary.
Can we test this theory by adding _mm256_zeroupper() at the beginning and end of the multiply intrinsic?
Dividing the whole change into the following units, as per my discussion with Yury:
- Add support for intrinsics with more arguments (12) than currently supported (4)
- Add the integer gemm intrinsics and their implementation
- Add test cases for the implementation
Earlier, intrinsic functions with up to 4 arguments were supported.
Now, the support has been extended to 12 arguments.
- Implements 7 intrinsic functions
- These intrinsics are only enabled for the x86/x86-64 platforms and for privileged extensions; they should never be accessible to web pages
  - Added a corresponding mochitest
Depends on D136023
- Test cases for all 7 intrinsic functions
Depends on D136430
- The tests previously ran on beta and release
- Now they run only on Nightly
- Refactored the whole test-skipping code
- Moved that code from CommonTestSetup to a directive
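For reference, SpiderMonkey jit-tests express per-file skipping with a one-line directive at the top of the test file. A hedged sketch of what a Nightly-only directive could look like (the exact condition used in the patch is not shown in this bug):

```js
// |jit-test| skip-if: getBuildConfiguration()['release_or_beta']
```

This replaces imperative skip logic in shared setup code with a declarative condition the test harness evaluates before running the file.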
Comment 9•3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/82953c5e0fc2
https://hg.mozilla.org/mozilla-central/rev/11cf2672ce7f
https://hg.mozilla.org/mozilla-central/rev/d4be0d9bbd68
Comment 10•3 years ago
bugherder
Comment 11•3 years ago
(In reply to Yury Delendik (:yury) from comment #3)
> SM is using 128-bit and intgemm AVX2 can use 256-bit. The Intel documentation say that this will slow down the execution. The recommendation is to use _mm256_zeroupper() and the boundary.

I checked the build artifacts: the C++ compiler properly inserts VZEROUPPER instructions before exiting the functions that use YMM registers, so it looks like we are good here and there is no danger of a slowdown.