Closed Bug 1908045 Opened 7 months ago Closed 7 months ago

Soft-hyphens are breaking translations because of bad tokenization

Categories

(Firefox :: Translations, defect)

Firefox 128
defect

Tracking

RESOLVED FIXED
130 Branch
Tracking Status
firefox130 --- fixed

People

(Reporter: chakradraku, Assigned: gregtatum)

References

(Blocks 1 open bug)

Attachments

(2 files)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:128.0) Gecko/20100101 Firefox/128.0

Steps to reproduce:

Visit the URL https://smp-bund.greenbone.net/ and use the translation feature to translate from German to English, French, or any other language.

Actual results:

The translation is not only obviously wrong, but contains what seems to be a malicious translation attacking the organization or its employees.

As far as I understand, the translation ML feature was only added to Firefox in version 118 (September 2023), and the translations are done locally on the user's machine - no data is transmitted over the network to complete the translation. This means there must be some malicious code embedded in the translation ML algorithm that targets this URL.

Expected results:

When the text being translated on the URL https://smp-bund.greenbone.net/ is moved to another page and injected into the DOM, the translation is accurate. The translation of the text from the target page is accurate when it's moved to another page on the domain greenbone.net or to another domain altogether.

While this is not a bug that many users could be harmed by in itself, I think that this should be investigated internally to determine who did this and how it happened.

Maybe this isn't typically classified as a security-sensitive bug, but if insiders controlling Mozilla Firefox's code can insert poisoned ML models, the risk to the general public or anyone they choose to target is extremely high.

I also believe this bug was revealed in a suspicious way by an individual who may be responsible for it or part of a group responsible for it.

We suspect that the ML model is getting a garbled string and thus produces a bad translation with encoding errors.
When the text is copied directly from the page into about:translations, it translates just fine.

This is likely not a security issue but just a correctness issue with the translation implementation. I'll keep it hidden but non-security for now.

Group: firefox-core-security → mozilla-employee-confidential
Status: UNCONFIRMED → NEW
Component: Untriaged → Translations
Ever confirmed: true
Summary: The Firefox translation feature is targeting a specific site with a malicious translation → The Firefox translation feature is producing broken translations on a specific site

Got it, thanks. After more investigation, I think it's because the translator is not stripping the UTF-8 soft-hyphen characters (0xC2 0xAD) from the DOM text?
Not a security issue, but perhaps a bug.
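
For reference, the byte pair mentioned above can be checked with a few lines of Python. This is only an illustrative sketch: the sample German word is made up for the example and is not taken from the affected page.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN, an invisible hyphenation hint

# UTF-8 encodes U+00AD as the two bytes 0xC2 0xAD.
encoded = SOFT_HYPHEN.encode("utf-8")
print(encoded.hex(" "))  # c2 ad

# Hypothetical sample word with a soft hyphen in the middle.
# It renders like the plain word, but the string is one character longer.
word = "Schwach\u00adstellen"
plain = word.replace(SOFT_HYPHEN, "")
print(len(word), len(plain))
print(word == plain)
```

Because the character is normally invisible, two strings that look identical on screen can differ in their underlying code points.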

Blocks: 1845772
Group: mozilla-employee-confidential

Google Translate is also failing on this page.

The soft hyphens are breaking the tokenization, and the model can't understand the input. The models can apply normalization tables in SentencePiece to account for it, but we can easily do a mitigation in the translation engine in Gecko by stripping them out.
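
A minimal sketch of the idea, in Python, is shown below. It is not the Gecko patch and it does not use SentencePiece: `strip_soft_hyphens`, `toy_tokenize`, and the three-entry vocabulary are all hypothetical names and data invented for illustration. The toy tokenizer does a greedy longest match and emits `<unk>` for anything it cannot match, which is enough to show how a soft hyphen can split a known word.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def strip_soft_hyphens(text: str) -> str:
    """Remove soft hyphens before the text is handed to a tokenizer."""
    return text.replace(SOFT_HYPHEN, "")


def toy_tokenize(text: str, vocab: set) -> list:
    """Greedy longest-match tokenizer over a toy vocabulary (illustration only)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that starts at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matches here: emit an unknown token.
            tokens.append("<unk>")
            i += 1
    return tokens


# Hypothetical vocabulary and input, not taken from any real model or page.
vocab = {"Schwach", "stellen", "Schwachstellen"}
raw = "Schwach\u00adstellen"

print(toy_tokenize(raw, vocab))                      # ['Schwach', '<unk>', 'stellen']
print(toy_tokenize(strip_soft_hyphens(raw), vocab))  # ['Schwachstellen']
```

Stripping in the engine is a mitigation on the input side. Handling the character in the model's own normalization step, as mentioned above, would address the same problem at the tokenizer level.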

Assignee: nobody → gtatum
Severity: -- → S3
Summary: The Firefox translation feature is producing broken translations on a specific site → Soft-hyphens are breaking translations because of bad tokenization
Pushed by gtatum@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a08fff923dd7 Strip out soft hyphens to improve translation tokenization; r=translations-reviewers,nordzilla
Status: NEW → RESOLVED
Closed: 7 months ago
Resolution: --- → FIXED
Target Milestone: --- → 130 Branch