MozIntGemm is not working for larger translation models
Categories
(Firefox :: Translations, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox135 | --- | verified |
People
(Reporter: nordzilla, Assigned: sergesanspaille)
Details
Attachments
(6 files)
Description
We are experimenting with the tradeoffs between model size, translation quality, and translation performance. In #169 we added a new en-ru
model that is roughly twice the size (~40MB) of the current en-ru
model (~20MB).
Unfortunately, the new larger-size model doesn't seem to be working in Firefox's Translations ecosystem. I've narrowed down the cause to be related to MozIntGemm. The model translates correctly when the javascript.options.wasm_moz_intgemm
pref is set to false
, but the Translation fails when the pref is set to true
.
MozIntGemm
is a critical component for maintaining performant translations, so it is imperative that we attempt to fix the issue if we are going to support larger-size models.
I ran our tentative manual benchmarks with the following results:
Small Model | MozIntGemm
- 9.03 seconds (base metric)
- 786.66 words per second (base metric)
Small Model | Fallback IntGemm
- 16.49 seconds (82.61% increase from base)
- 420.92 words per second (46.49% decrease from base)
Large Model | Fallback IntGemm
- 29.19 seconds(223.25% increase from base)
- 237.79 words per second (69.77% decrease from base)
Steps to reproduce
Note: This STR is based on the current state of the new larger-size
en-ru
model being available in theProd (preview)
channel of Remote Settings. When this bug is resolved and that model is no longer actively in theProd (preview)
channel, this STR will no longer be valid.
- Open a Nightly or locally built version of Firefox to
about:config
and ensure that thebrowser.translations.LogLevel
pref is set toAll
instead of its defaultError
. This will enable Translations logging in the browser console. Also ensure that thejavascript.options.wasm_moz_intgemm
pref is set totrue
, which should be the default. - Navigate to the
remote-settings-devtools
releases page and download and install the latest XPI file for the extension. - Open the Remote Settings Devtools extension. The currently selected Environment should be
Prod
. Click theClear all
button, then switch the environment toProd (preview)
. Click theClear all
button again in theProd (preview)
environment and then click theSync
button once the clear finishes. Note, if theSync
button appears to do nothing, then you may have to restart the browser, re-open the extension, and try the sync again. I think this is a bug in Remote Settings. - Navigate to a translatable web page, e.g. https://en.wikipedia.org/wiki/Firefox.
- Open the Application (hamburger) menu near the top-right of the browser.
- Click the "Translate page..." menu item in the Application menu.
- When the Full Page Translations Panel opens, set the source language to English and the target language to Russian.
- Click Translate.
- Check the browser console log to ensure that the
en-ru
models that are downloaded are version1.0a3
.
Expected Results
The page translates from English to Russian
Actual Results
Much of the page content is lost due to Translation requests coming back empty.
- Open
about:config
and set thejavascript.options.wasm_moz_intgemm
pref tofalse
. - Restart the browser.
- Repeat steps 4) through 9).
- Observe that the translation from English to Russian works with the
javascript.options.wasm_moz_intgemm
set tofalse.
Reporter | ||
Updated•10 months ago
|
Comment 1•9 months ago
|
||
Comment 2•9 months ago
|
||
So it appears that there is a NaN being computed somewhere during the running of the Marian ExpressionGraph code while decoding. There is a behavior difference between how this is handled in the normal Wasm fallback instructions and the gemmology instructions, which causes the final failures. The NaNs are produced in int8_prepare_bias
.
In the larger Russian model, a bias matrix is being computed that results in 0xffffffff
NaN values during the computation.
Steps to reproduce:
- Apply
instr_berg3.diff
- Build and run Firefox
- Install remote settings devtools
- Switch to Production (Preview)
- Go to about:translations
- Type "test" in the left side
- Select "Russian" on the right
- Logging information will be in the browser console
I wrote a tool to analyze the binary values of the model, and have determined there are no NaN values in the source model. There are also checks in Marian code for NaNs, although I don't know where the training log is for the model.
I think the fix here is to update gemmology to have the same behavior as our fallback when dealing with NaNs.
:yury can you double check that what I have summarized here is accurate
:sergesanspaille Would you be able to look into this?
Comment 3•9 months ago
|
||
I'm updating this to a P1 on our side, since it's potentially going to block shipping many Balto-Slavic languages we're targeting to train here in the next month or so.
Updated•9 months ago
|
Comment 4•9 months ago
|
||
I just realized I had myself assigned, but this is something that serge will most likely need to look at.
Comment 5•9 months ago
|
||
:yury can you double check that what I have summarized here is accurate
It is accurate.
Few things to add. The 0xffffffff (for NaN) is quite unusual result from float32 arithmetic operations. It seams that fallback operations ignore such values since NaN is not present in the result.
Assignee | ||
Comment 6•9 months ago
|
||
The NaNs are produced in int8_prepare_bias
This function ultimately calls
GEMMOLOGY_DISPATCH(Shift::PrepareBias)(
(const int8_t*)inputMatrixBPreparedPtr,
rowsB,
colsB,
gemmology::callbacks::UnquantizeAndAddBiasAndWrite(unquantFactor, (const float*)inputBiasPtr, (float*)outputPtr)
);
Shift::PrepareBias itself only manipulates integers (so no NaN), until it calls the callback routine gemmology::callbacks::UnquantizeAndAddBiasAndWrite
passed as last parameter.
When inspecting UnquantizeAndAddBiasAndWrite
I don't see any operation that can produce NaN, except maybe if inputBiasPtr
conains NaN. Is that possible?
I think the fix here is to update gemmology to have the same behavior as our fallback when dealing with NaNs.
I can replace NaN by another value as output of Shift::PrepareBias
(or we could do that in the calling code. What is the behavior of our fallback in that case?
Comment 7•9 months ago
|
||
Yes, it looks like inputBiasPtr contains a NaN.
This hack seems to fix the output:
int8_prepare_bias(...args) {
const [ input_B_prepared, scaleA, zeroPointA, scaleB, zeroPointB, width, cols_b, input_bias, output, ] = args;
const bias_view = new Float32Array(wasmMemory.buffer, input_bias, cols_b)
for (let i = 0; i < bias_view.length; i++) {
const value = bias_view[i]
if (Number.isNaN(value)) {
bias_view[i] = 0
}
}
return r.int8_prepare_bias(...args);
}
In the code path this is getting called, and input_bias
has two NaN at index 256 and 257, which is a concerningly round number when the bias matrix is 512 in length.
Index: 256
input_bias: 0xffffffff
output: 0xb8ab0d2
Index: 257
input_bias: 0xffffffff
output: 0xaa0d0aa
This is happening here:
intgemm::Int8Shift::PrepareBias(
input_B_prepared,
width,
cols_B,
intgemm::callbacks::UnquantizeAndAddBiasAndWrite(unquant_factor, input_bias, output));
All the other scalar values going in seem fine.
!!! { scaleA: 9.656866073608398, zeroPointA: 0, scaleB: 119.19967651367188, zeroPointB: 0, width: 512, cols_b: 512 }
I'm going to poke a bit more at the model being used to see if I can find this FF FF FF FF FF FF FF FF
inside of the bias matrix, which feels an awful lot like some kind of memory write outside of an array bounds. Poking through the code this input is inside of the ExpressionGraph definition of the model. I wrote a tool to try to find NaN values in the model, and didn't see any. I'll look again.
Comment 8•9 months ago
|
||
I poked again, and I'm reasonably sure the biases do not have NaN in the serialized ExpressionGraph data. In fact, the input_bias is full of 0 values, except for the two NaNs.
Comment 9•9 months ago
•
|
||
Additional discovery:
- The input_bias == 0 (and it is a nullptr in wasm code).
- The fallback has different logic for this case: https://github.com/browsermt/marian-dev/blob/2781d735d4a10dca876d61be587afdab2726293c/src/tensors/cpu/wasm_intgemm_fallback.cpp#L57 does not use UnquantizeAndAddBiasAndWrite, but uses UnquantizeAndWrite
- The code at
js::intgemm::IntrI8PrepareBias
https://searchfox.org/mozilla-central/source/js/src/intgemm/IntegerGemmIntrinsic.cpp#294 does not expect inputBias be 0, which results into reading random bytes from wasm memory (that are mostly 0, except these NaNs). Not a security issue.
The input_bias == 0 case needs to be properly handled here:
- Call
UnquantizeAndWrite
instead - Don't check data bounds of
inputBias
.
Updated•9 months ago
|
Assignee | ||
Comment 10•9 months ago
|
||
Otherwise random bytes are read, which can lead to various kind of
issues, including reading nan values.
Updated•9 months ago
|
Assignee | ||
Comment 11•9 months ago
|
||
Patch implementing the approach suggested by Yury submitted, I've followed OP steps to reproduce the error, and the translation works fine.
Assignee | ||
Updated•9 months ago
|
Updated•9 months ago
|
Comment 12•9 months ago
|
||
Comment 13•9 months ago
|
||
bugherder |
Updated•7 months ago
|
Comment 14•7 months ago
|
||
Trying to reproduce this issue using an old build before the fix before I verify that is fixed. I noticed that Translation does not contain the Russian language in the Translate to field. Is there a way to make it available or any other lang to use in order to repro?
Assignee | ||
Comment 15•7 months ago
|
||
mmh, Russian should "just appear", at least that was the case on my setup. You can check and download it under about:preferences#general under Translations
Comment 16•6 months ago
|
||
Yeah, still not appearing on the old Nightly build I'm using in order to reproduce (2024-11-25), even downloading it from about:preferences will not make it appear in the Translate to, I do have it in the "from" field though (about:translations is the same)
Checking on latest Nightly the language is available from the start without needing to download anything extra.
Any ideas? (I am using the steps from comment 0 since I can't actually build Fx at this time as per steps from comment 2).
Reporter | ||
Comment 17•6 months ago
|
||
Bogdan, the issue is from comment 0 where it says:
Note: This STR is based on the current state of the new larger-size en-ru model being available in the Prod (preview) channel of Remote Settings. When this bug is resolved and that model is no longer actively in the Prod (preview) channel, this STR will no longer be valid.
I have re-uploaded the problematic en-ru
model to the Dev (Preview)
Remote Settings environment for you to test with.
You will have to switch to the Dev (Preview)
environment using the Remote Settings Devtools Extension.
(See attached video for an example.)
Comment 18•6 months ago
|
||
(In reply to Erik Nordin [:nordzilla] from comment #17)
Created attachment 9462278 [details]
rs-devtools-dev-preview.webmBogdan, the issue is from comment 0 where it says:
Note: This STR is based on the current state of the new larger-size en-ru model being available in the Prod (preview) channel of Remote Settings. When this bug is resolved and that model is no longer actively in the Prod (preview) channel, this STR will no longer be valid.
I have re-uploaded the problematic
en-ru
model to theDev (Preview)
Remote Settings environment for you to test with.You will have to switch to the
Dev (Preview)
environment using the Remote Settings Devtools Extension.(See attached video for an example.)
Thanks for all that, but the issue I am having still persists even in Dev (Preview), the Russian lang is still missing for me in the Nightly build before the fix. It is available in the latest Nightly from today though. I've attached a video showing all this. Both Nightly builds are opened with clean profiles and have the prefs from comment 0 + the Remote Settings Devtools addon.
Reporter | ||
Comment 19•6 months ago
|
||
Bogdan, that is really baffling to me, and I am sorry that it is giving you trouble reproducing.
I've attached a step-by-step video of myself reproducing the issue on a fresh download of Firefox Nightly 134.
I hope that helps!
Comment 20•6 months ago
|
||
Yeah no worries, do you mind checking if Firefox 135.0rc is fixed for you when you have time so we can close this one as verified fixed?
Reporter | ||
Comment 21•6 months ago
|
||
Sure, I'd be happy to. Can you confirm that this is the right one before I do it?
https://ftp.mozilla.org/pub/firefox/candidates/135.0-candidates/build1/linux-x86_64/en-US/
Comment 22•6 months ago
|
||
(In reply to Erik Nordin [:nordzilla] from comment #21)
Sure, I'd be happy to. Can you confirm that this is the right one before I do it?
https://ftp.mozilla.org/pub/firefox/candidates/135.0-candidates/build1/linux-x86_64/en-US/
Thanks, that's the one indeed.
Reporter | ||
Comment 23•6 months ago
|
||
I can confirm that everything is working in 135.0rc (see video).
I'm using Prod (Preview)
, instead of Dev (Preview)
, because some of our other configurations are a bit different now in production, and no longer compatible with the old 134 configuration, which is the intent.
We are planning to release this en-ru
model in 135, so the Russian model is the same one, and it works.
Comment 24•6 months ago
|
||
(In reply to Erik Nordin [:nordzilla] from comment #23)
Created attachment 9462816 [details]
release-candidate-test.webmI can confirm that everything is working in 135.0rc (see video).
I'm using
Prod (Preview)
, instead ofDev (Preview)
, because some of our other configurations are a bit different now in production, and no longer compatible with the old 134 configuration, which is the intent.We are planning to release this
en-ru
model in 135, so the Russian model is the same one, and it works.
Awesome. Thanks for verifying and sorry for all this this back and forth. Marking as verified based on your testing.
Description
•