[meta] WebAssembly optimizations for Firefox Translations
Categories
(Firefox :: Translations, enhancement)
People
(Reporter: anatal, Unassigned)
References
(Depends on 2 open bugs)
Details
(Keywords: meta)
This meta bug captures and groups, as dependent bugs, the potential performance improvements for the neural machine translation wasm component [1] used by Firefox Translations. [2]
[1] https://github.com/mozilla/bergamot-translator
[2] https://phabricator.services.mozilla.com/D114810
Comment 1 (Reporter)•4 years ago
:lars, could you please add the other open bugs you have on this subject as blockers of this one so we can keep track of all of them?
Comment 2•4 years ago
Bug 1699355 is a discussion about exposing certain Firefox builtins to Wasm code. We have a credible design for this (mozilla-internal; ask me for a link). We would use a system like this to expose architecture-optimal (specialized to AVX/AVX2) and audited implementations of very hot functions (eg the matrix multiplication primitives) to wasm code roughly under the same terms as the wormhole, ie, available to privileged extensions. See also https://github.com/mozilla-extensions/bergamot-browser-extension/issues/75 which dives into this in some detail. The idea meshes API-wise with some aspects of WebNN (https://webmachinelearning.github.io/webnn/#api-mlgraphbuilder-gemm) and is attractive for that reason. It also exposes no machine dependencies and ports readily to ARM64 without future Bergamot changes. This idea is modest in terms of scope and work (2-3 weeks realistically).
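To make the contract concrete: a minimal scalar sketch of the WebNN-style gemm semantics (C = alpha*A*B + beta*C) that an audited, architecture-optimal host builtin would have to match. All names here are hypothetical; this is the reference behavior, not the proposed API or an optimized kernel.

```cpp
#include <cstddef>

// Hypothetical reference semantics for an audited host "gemm" builtin
// exposed to wasm, modeled on WebNN's MLGraphBuilder.gemm:
//   C = alpha * (A x B) + beta * C
// A is m x k, B is k x n, C is m x n, all row-major. A real builtin
// would dispatch to an AVX/AVX2 (or ARM64) kernel; this scalar loop
// only pins down the contract those kernels must satisfy.
void gemm_reference(const float* A, const float* B, float* C,
                    size_t m, size_t k, size_t n,
                    float alpha, float beta) {
  for (size_t i = 0; i < m; ++i) {
    for (size_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (size_t p = 0; p < k; ++p)
        acc += A[i * k + p] * B[p * n + j];
      C[i * n + j] = alpha * acc + beta * C[i * n + j];
    }
  }
}
```

With A = [[1,2],[3,4]], B = [[5,6],[7,8]], alpha = 1, beta = 0, this produces C = [[19,22],[43,50]], which is what the optimized implementations would be audited against.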
Bug 1699192 is a discussion about improving the wormhole in order to unlock better performance without any changes to Bergamot, eg, by specializing the wormhole to available hardware so that we could make use of AVX/AVX2 for the wormhole instructions in isolation. This would probably yield somewhat better matrix multiply performance (AVX has three-address instructions and AVX code therefore has lower register pressure and less register shuffling) but there are a number of uncertainties about how much better. This would be a small amount of work (a few days).
Related to the previous item (no bug filed yet), it would be possible for appropriately privileged code to sniff the architecture the program is running on and to ask for wormhole instructions that are available on that architecture (or equivalently to ask whether certain instructions are available before using them). This would be a small amount of work for us (a few days) and could also yield better performance but would require Bergamot changes to switch among mmul implementations depending on the hardware.
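The Bergamot-side dispatch could look something like the following sketch, assuming a GCC/Clang build on x86. The probe and kernel names are illustrative only; the real privileged probe would come from the wormhole/builtin mechanism, not from `__builtin_cpu_supports`.

```cpp
#include <cstddef>

// Portable kernel: stands in for the baseline SIMD/scalar path.
float dot_scalar(const float* a, const float* b, size_t n) {
  float acc = 0.0f;
  for (size_t i = 0; i < n; ++i) acc += a[i] * b[i];
  return acc;
}

using DotFn = float (*)(const float*, const float*, size_t);

// Sniff the hardware once and pick a kernel. In a real build the AVX2
// branch would return an AVX2 specialization; this sketch falls back to
// the portable kernel either way, so it runs correctly on any CPU.
DotFn select_dot_kernel() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) {
    return dot_scalar;  // hypothetically: return dot_avx2;
  }
#endif
  return dot_scalar;
}
```

The cost of the extra indirection is paid once at startup; the selected function pointer (or a template instantiation chosen at init) is then used on the hot path.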
The wormhole instructions may become standardized through the relaxed SIMD proposal (bug 1706922), especially if we make a case for them.
Finally, bug 1708743 tracks porting our entire SIMD implementation to target AVX2 when AVX2 is available. This is probably a fairly large amount of work (multiple weeks) with a fair amount of uncertainty in the estimate.
For all these ideas we are up against an uncertainty, namely, that last time we tried to enable AVX in our JIT we saw performance regressions, as described in more detail in bug 1708743. There is a risk that even if we implement these changes we won't be able to ship them for that reason, or that we will have to spend significant time investigating and mitigating such regressions.
I think we have two paths to really excellent (and cross-platform) performance. One combines an AVX2 retargeting of the JIT with enhanced wormhole instructions and some infrastructure around that; eventually the instructions become standardized and cross-platform. This is in a sense my preferred solution as it has a standards-track future, but it's a significant amount of work and has significant uncertainty. The other exposes audited matrix multiply primitives to wasm under an appropriate flag/privilege system; this is cross-platform in its nature (we can provide a portable C++ fallback solution if we want), is a modest amount of work, delivers known-good performance, can be polyfilled with pure wasm code, and does not create uncertainty about how the JIT will adapt to AVX/AVX2 (as the JIT remains unchanged).
Comment 3•4 years ago
From my side, I am happy to make Bergamot changes to improve performance, so don't worry too much about whether changes to Bergamot are required. Switching between 8-bit matrix multiply implementations that make different approximations (such as saturating with maddubs or not saturating at all with ARM SDOT) is fine by me, even if done as one opaque call, though WebNN will want to make this distinction.
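The approximation difference can be shown with scalar stand-ins (illustrative only, not Bergamot code; also note real pmaddubsw takes u8 x s8 operands while real SDOT takes signed bytes — the same operand types are used for both here just to isolate the saturation behavior):

```cpp
#include <cstdint>
#include <algorithm>

// x86 pmaddubsw multiplies byte pairs and saturates the pairwise sum
// to the int16 range.
int32_t maddubs_pair(uint8_t a0, int8_t b0, uint8_t a1, int8_t b1) {
  int32_t sum = int32_t(a0) * b0 + int32_t(a1) * b1;
  return std::clamp(sum, int32_t(INT16_MIN), int32_t(INT16_MAX));
}

// ARM SDOT accumulates into a full int32 lane, so a single pairwise
// sum never saturates.
int32_t sdot_pair(uint8_t a0, int8_t b0, uint8_t a1, int8_t b1) {
  return int32_t(a0) * b0 + int32_t(a1) * b1;
}
```

For small operands the two agree; for 255*127 + 255*127 the maddubs model clamps to 32767 while the SDOT model yields 64770. An opaque mmul call hides this, which is fine per the comment above, but an API like WebNN that specifies exact numerics would have to name which behavior it provides.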