[meta] WebAssembly optimizations for Firefox Translations
Categories
(Firefox :: Translations, enhancement)
People
(Reporter: anatal, Unassigned)
References
(Depends on 2 open bugs)
Details
(Keywords: meta)
This meta bug captures and groups, as dependent bugs, the potential performance improvements for the neural machine translation wasm component [1] used by Firefox Translations. [2]
[1] https://github.com/mozilla/bergamot-translator
[2] https://phabricator.services.mozilla.com/D114810
Comment 1 (Reporter)•4 years ago
:lars, could you please add the other open bugs you have on this subject as blockers of this one so we can keep track of all of them?
Comment 2•4 years ago
Bug 1699355 is a discussion about exposing certain Firefox builtins to Wasm code. We have a credible design for this (mozilla-internal; ask me for a link). We would use a system like this to expose architecture-optimal (specialized to AVX/AVX2) and audited implementations of very hot functions (eg the matrix multiplication primitives) to wasm code roughly under the same terms as the wormhole, ie, available to privileged extensions. See also https://github.com/mozilla-extensions/bergamot-browser-extension/issues/75 which dives into this in some detail. The idea meshes API-wise with some aspects of WebNN (https://webmachinelearning.github.io/webnn/#api-mlgraphbuilder-gemm) and is attractive for that reason. It also exposes no machine dependencies and ports readily to ARM64 without future Bergamot changes. This idea is modest in terms of scope and work (2-3 weeks realistically).
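To make the contract concrete: a minimal scalar sketch of the WebNN-style gemm semantics (C = alpha*A*B + beta*C) that an audited, architecture-optimal host builtin would have to match. All names here are hypothetical; this is the reference behavior, not the proposed API or an optimized kernel.

```cpp
#include <cstddef>

// Hypothetical reference semantics for an audited host "gemm" builtin
// exposed to wasm, modeled on WebNN's MLGraphBuilder.gemm:
//   C = alpha * (A x B) + beta * C
// A is m x k, B is k x n, C is m x n, all row-major. A real builtin
// would dispatch to an AVX/AVX2 (or ARM64) kernel; this scalar loop
// only pins down the contract those kernels must satisfy.
void gemm_reference(const float* A, const float* B, float* C,
                    size_t m, size_t k, size_t n,
                    float alpha, float beta) {
  for (size_t i = 0; i < m; ++i) {
    for (size_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (size_t p = 0; p < k; ++p)
        acc += A[i * k + p] * B[p * n + j];
      C[i * n + j] = alpha * acc + beta * C[i * n + j];
    }
  }
}
```

With A = [[1,2],[3,4]], B = [[5,6],[7,8]], alpha = 1, beta = 0, this produces C = [[19,22],[43,50]], which is what the optimized implementations would be audited against.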
Bug 1699192 is a discussion about improving the wormhole in order to unlock better performance without any changes to Bergamot, eg, by specializing the wormhole to available hardware so that we could make use of AVX/AVX2 for the wormhole instructions in isolation. This would probably yield somewhat better matrix multiply performance (AVX has three-address instructions and AVX code therefore has lower register pressure and less register shuffling) but there are a number of uncertainties about how much better. This would be a small amount of work (a few days).
Related to the previous item (no bug filed yet), it would be possible for appropriately privileged code to sniff the architecture the program is running on and to ask for wormhole instructions that are available on that architecture (or equivalently to ask whether certain instructions are available before using them). This would be a small amount of work for us (a few days) and could also yield better performance but would require Bergamot changes to switch among mmul implementations depending on the hardware.
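The Bergamot-side dispatch could look something like the following sketch, assuming a GCC/Clang build on x86. The probe and kernel names are illustrative only; the real privileged probe would come from the wormhole/builtin mechanism, not from `__builtin_cpu_supports`.

```cpp
#include <cstddef>

// Portable kernel: stands in for the baseline SIMD/scalar path.
float dot_scalar(const float* a, const float* b, size_t n) {
  float acc = 0.0f;
  for (size_t i = 0; i < n; ++i) acc += a[i] * b[i];
  return acc;
}

using DotFn = float (*)(const float*, const float*, size_t);

// Sniff the hardware once and pick a kernel. In a real build the AVX2
// branch would return an AVX2 specialization; this sketch falls back to
// the portable kernel either way, so it runs correctly on any CPU.
DotFn select_dot_kernel() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) {
    return dot_scalar;  // hypothetically: return dot_avx2;
  }
#endif
  return dot_scalar;
}
```

The cost of the extra indirection is paid once at startup; the selected function pointer (or a template instantiation chosen at init) is then used on the hot path.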
The wormhole instructions may become standardized through the relaxed SIMD proposal (bug 1706922), especially if we make a case for them.
Finally, bug 1708743 tracks porting our entire SIMD implementation to target AVX2 when AVX2 is available. This is probably a fairly large amount of work (multiple weeks) with a fair amount of uncertainty in the estimate.
For all these ideas we are up against an uncertainty, namely, that last time we tried to enable AVX in our JIT we saw performance regressions, as described in more detail in bug 1708743. There is a risk that even if we implement these changes we won't be able to ship them for that reason, or that we will have to spend significant time investigating and mitigating such regressions.
I think we have two paths to really excellent (and cross-platform) performance. One combines an AVX2 retargeting of the JIT with enhanced wormhole instructions and some infrastructure around that; eventually the instructions become standardized and cross-platform. This is in a sense my preferred solution as it has a standards-track future, but it's a significant amount of work and has significant uncertainty. The other exposes audited matrix multiply primitives to wasm under an appropriate flag/privilege system; this is cross-platform in its nature (we can provide a portable C++ fallback solution if we want), is a modest amount of work, delivers known-good performance, can be polyfilled with pure wasm code, and does not create uncertainty about how the JIT will adapt to AVX/AVX2 (as the JIT remains unchanged).
Comment 3•4 years ago
From my side, I am happy to make Bergamot changes to improve performance, so don't worry too much about whether changes to Bergamot are required. Switching between 8-bit matrix multiply implementations that make different approximations (such as saturating with maddubs or not saturating at all with ARM SDOT) is fine by me, even if done as one opaque call, though WebNN will want to make this distinction.
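The approximation difference can be shown with scalar stand-ins (illustrative only, not Bergamot code; also note real pmaddubsw takes u8 x s8 operands while real SDOT takes signed bytes — the same operand types are used for both here just to isolate the saturation behavior):

```cpp
#include <cstdint>
#include <algorithm>

// x86 pmaddubsw multiplies byte pairs and saturates the pairwise sum
// to the int16 range.
int32_t maddubs_pair(uint8_t a0, int8_t b0, uint8_t a1, int8_t b1) {
  int32_t sum = int32_t(a0) * b0 + int32_t(a1) * b1;
  return std::clamp(sum, int32_t(INT16_MIN), int32_t(INT16_MAX));
}

// ARM SDOT accumulates into a full int32 lane, so a single pairwise
// sum never saturates.
int32_t sdot_pair(uint8_t a0, int8_t b0, uint8_t a1, int8_t b1) {
  return int32_t(a0) * b0 + int32_t(a1) * b1;
}
```

For small operands the two agree; for 255*127 + 255*127 the maddubs model clamps to 32767 while the SDOT model yields 64770. An opaque mmul call hides this, which is fine per the comment above, but an API like WebNN that specifies exact numerics would have to name which behavior it provides.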