Add matrix multiply intrinsics for Firefox Translations
Categories
(Core :: JavaScript: WebAssembly, task, P2)
Tracking
Tracking | Status
---|---
firefox98 | fixed
People
(Reporter: aaggarwal, Assigned: aaggarwal)
References
Details
Attachments
(4 files)
- Add the actual matrix multiply intrinsics to the mozIntGemm() module (the module itself was added via https://bugzilla.mozilla.org/show_bug.cgi?id=1720514 and https://bugzilla.mozilla.org/show_bug.cgi?id=1721686)
- The actual intrinsics are to be added in this bug
As a sanity check, I tested my implementation of the matrix multiply intrinsics along with the intgemm sources in Firefox Nightly for the Bergamot project, and inference ran successfully on macOS.
Further, I benchmarked the following setups to compare translation speeds:
Wasm Gemm: Gemm library (intgemm) compiled to wasm
Wormhole: Gemm library (intgemm) compiled to wasm, but using the wormhole for the 3 most expensive Intel instructions
Native Firefox gemm: Entire Gemm library (intgemm) exported as intrinsics from within Firefox (I could benchmark both SSSE3 and AVX2)
I am using the same translator configuration for benchmarking.
Models used for evaluation: English -> German, English -> Spanish
Length of the text used for translation: ~5000 words
System: MacBook Pro (15-inch, 2017), macOS 11.6.2, 3.1 GHz Quad-Core Intel Core i7 processor, 16 GB 2133 MHz RAM
wps: Translation speed measured in words per second
Results for English -> German:
Results for English -> Spanish:
Wasm Gemm: 105 wps
Wormhole: 440 wps (+319% vs. Wasm Gemm)
Native Firefox gemm:
- SSSE3: 550 wps (+25% vs. Wormhole, +423% vs. Wasm Gemm)
- AVX2: 625 wps (+43% vs. Wormhole, +495% vs. Wasm Gemm)
Updated•3 years ago
Comment 3•3 years ago
Some information to take into account: we are mixing 128-bit and 256-bit SIMD instructions here. SpiderMonkey uses 128-bit SIMD, while the intgemm AVX2 path can use 256-bit. The Intel documentation says that mixing these will slow down execution; the recommendation is to use _mm256_zeroupper() at the boundary.
Can we test this theory by adding _mm256_zeroupper() at the beginning and end of the multiply intrinsic?
Dividing the whole change into the following units, as per my discussion with Yury:
- Add support for intrinsics with more arguments (12) than currently supported (4)
- Add the integer gemm intrinsics and their implementation
- Add test cases for the implementation
Earlier, intrinsic functions with up to 4 arguments were supported.
Now, the support has been extended to 12 arguments.
- Implements 7 intrinsic functions
- These intrinsics are only enabled for the x86/x86-64 platforms and for privileged extensions; they should never be accessible to web pages
  - Added a corresponding mochitest
Depends on D136023
- Test cases for all 7 intrinsic functions
Depends on D136430
- The tests previously ran on beta and release
- Now they run only on Nightly
- Refactored the whole test-skipping code
- Moved that code from CommonTestSetup to a directive
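For reference, SpiderMonkey jit-tests express per-file skipping with a one-line directive at the top of the test file. A hedged sketch of what a Nightly-only directive could look like (the exact condition used in the patch is not shown in this bug):

```js
// |jit-test| skip-if: getBuildConfiguration()['release_or_beta']
```

This replaces imperative skip logic in shared setup code with a declarative condition the test harness evaluates before running the file.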
Comment 9•3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/82953c5e0fc2
https://hg.mozilla.org/mozilla-central/rev/11cf2672ce7f
https://hg.mozilla.org/mozilla-central/rev/d4be0d9bbd68
Comment 10•3 years ago
bugherder
Comment 11•3 years ago
(In reply to Yury Delendik (:yury) from comment #3)
> SM is using 128-bit and intgemm AVX2 can use 256-bit. The Intel documentation say that this will slow down the execution. The recommendation is to use _mm256_zeroupper() and the boundary.

I checked the build artifacts: the C++ compiler properly inserts VZEROUPPER instructions before exiting the functions that use YMM registers, so it looks like we are good here and there is no danger of a slowdown.