Soft-hyphens are breaking translations because of bad tokenization
Categories
(Firefox :: Translations, defect)
Tracking
| | Tracking | Status |
|---|---|---|
| firefox130 | --- | fixed |
People
(Reporter: chakradraku, Assigned: gregtatum)
References
(Blocks 1 open bug)
Attachments
(2 files)
290.27 KB, image/png
Bug 1908045 - Strip out soft hyphens to improve translation tokenization; r?#translations-reviewers! (48 bytes, text/x-phabricator-request)
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:128.0) Gecko/20100101 Firefox/128.0
Steps to reproduce:
Visit the URL https://smp-bund.greenbone.net/ and use the translation from German to English, French, or any other language.
Actual results:
The translation is not only obviously wrong, but contains what seems to be a malicious translation attacking the organization or its employees.
As far as I understand, the translation ML feature was added to Firefox in version 118 in September 2023, and the translations are done locally on the user's machine: no data is transmitted over the network to complete the translation. This means there must be some malicious code embedded in the translation ML algorithm that targets this URL.
Expected results:
When the text being translated on the URL https://smp-bund.greenbone.net/ is moved to another page and injected into the DOM, the translation is accurate. The translation of the text is accurate whether it is moved to another page on the domain greenbone.net or to another domain altogether.
While this bug is unlikely to harm many users in itself, I think it should be investigated internally to determine who did this and how it happened.
Comment 1•7 months ago (Reporter)
Maybe this isn't typically classified as a security sensitive bug, but if insiders controlling Mozilla Firefox's code can enter poisoned ML, the risk to the general public or anyone they choose to target is extremely high.
Comment 2•7 months ago (Reporter)
I also believe this bug was revealed in a suspicious way by an individual who may be responsible for it or part of a group responsible for it.
Comment 3•7 months ago (Reporter)
Comment 4•7 months ago
We suspect that the ML model is getting a garbled string and thus provides a bad translation with encoding errors.
When copying the data directly from the page into about:translations, it translates just fine.
This is likely not a security issue but just a correctness issue with the translation implementation. I'll keep it hidden but non-security for now.
Comment 5•7 months ago (Reporter)
Got it, thanks. After more work, I think it's because the translator is not stripping the soft-hyphen characters (U+00AD, encoded in UTF-8 as 0xC2 0xAD) from the DOM text?
Not a security issue, but perhaps a bug.
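To make the byte sequence above concrete, here is a minimal Python sketch. The example word and the `count_soft_hyphens` helper are made up for illustration and are not taken from the affected page. It shows that U+00AD encodes to 0xC2 0xAD in UTF-8, and that a word carrying soft hyphens is a different string from the plain word even though it normally renders the same.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN, usually invisible when rendered

# UTF-8 encoding of the soft hyphen: the two bytes 0xC2 0xAD.
encoded = SOFT_HYPHEN.encode("utf-8")

# Hypothetical example word with soft hyphens marking optional break points.
hyphenated = "Schwach\u00adstellen\u00adma\u00adnage\u00adment"
plain = "Schwachstellenmanagement"


def count_soft_hyphens(text: str) -> int:
    """Return how many soft hyphen characters the text contains."""
    return text.count(SOFT_HYPHEN)


if __name__ == "__main__":
    print(encoded)                         # raw UTF-8 bytes of U+00AD
    print(hyphenated == plain)             # the strings differ
    print(count_soft_hyphens(hyphenated))  # number of hidden characters
```

Text copied by hand may or may not carry these characters along, which could explain why the same visible text behaves differently depending on how it reaches the translator.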
Comment 6•7 months ago
Google Translate is also failing on this page.
Comment 7•7 months ago (Assignee)
The soft hyphens are breaking the tokenization, and the model can't understand the input. The models can apply normalization tables in SentencePiece to account for it, but we can easily do a mitigation in the translation engine in Gecko by stripping them out.
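The mitigation described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual Gecko patch: `strip_soft_hyphens`, `toy_tokenize`, and the three-entry vocabulary are all hypothetical, and the greedy longest-match tokenizer is only a stand-in for a real subword tokenizer such as SentencePiece.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def strip_soft_hyphens(text: str) -> str:
    """Remove every soft hyphen so the tokenizer sees the plain word."""
    return text.replace(SOFT_HYPHEN, "")


def toy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation (toy stand-in, NOT SentencePiece).

    Characters that start no vocabulary entry become single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts at position i: emit one character.
            tokens.append(word[i])
            i += 1
    return tokens


# Hypothetical vocabulary and input, for illustration only.
vocab = {"Schwach", "stellen", "management"}
raw = "Schwach\u00adstellen\u00adma\u00adnage\u00adment"

if __name__ == "__main__":
    print(toy_tokenize(raw, vocab))                      # fragmented pieces
    print(toy_tokenize(strip_soft_hyphens(raw), vocab))  # clean subwords
```

In this toy setup the unstripped word falls apart into many small pieces because the soft hyphens interrupt every vocabulary match, while the stripped word segments into three known subwords. Stripping before tokenization is the cheaper fix; handling the character in the model's normalization would address it at the source.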
Comment 8•7 months ago (Assignee)
Updated•7 months ago (Assignee)
Comment 10•7 months ago (bugherder)