Open Bug 1904628 Opened 8 months ago Updated 3 days ago

Investigate making language detection cheap enough to run on every page load

Categories

(Firefox :: Translations, enhancement)


People

(Reporter: gregtatum, Unassigned)

References

(Blocks 1 open bug)

Details

Currently we don't run language detection on every page load, but maybe we should. It's really hard to get translations correct from the language tag alone. I think the solution would be to integrate some kind of ML solution on the C++ side, cheap enough that we don't have to keep a big Wasm instance around in the background. This bug is a central place to collect that work and the decision.

We have Bug 1842777 to switch to fastText, but I'm skeptical it will be fast enough. CLD3 can be run from the C++ side; we have ONNX Runtime in Nightly through Wasm, and ONNX Runtime runs in C++ as well. We should look at all of these options and figure out how best to solve this language ID problem.

Blocks: 1845772
See Also: → 1842777

A colleague of mine did some benchmarks. He found this to be the fastest:
https://github.com/saffsd/langid.c

Here are his benchmarks, but note that his use-case was running it from Python:
https://github.com/LiableFish/langid.pyc/blob/master/benchmark/benchmark.ipynb

The execution time also remains low with longer text; see the second-to-last image in the benchmark notebook.

He also noted that CLD3 is much less accurate than CLD2.

:adam, thank you for this thoughtful contribution, and sorry for the delay in replying.

When we get to the point of exploring this more concretely, we will surely take these benchmarks into consideration as part of the process.

Blocks: 1938135

There's also https://github.com/ZJaume/heliport which we could investigate, and https://github.com/mbanon/fastspell. Lots of options :)

Also worth noting that there were some recent optimizations on fastText that make it faster than before: https://github.com/facebookresearch/fastText/pull/1341.

CLD2 originally landed in Bug 971047. There was some contention about adding it as a Wasm library, which was a novel choice at the time. It might be worth building it as native code so we don't have to spin up a worker every time. CLD2 may be good enough for our purposes, as it uses a naive Bayes classifier. We should investigate how competitive it is.
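For anyone unfamiliar with the approach: the core of a CLD2-style detector is just a naive Bayes classifier over character n-grams, which is why it's so cheap to run. Here's a minimal sketch of that idea in Python. This is an illustration only, not CLD2's actual implementation; the training samples and the trigram featurization are placeholders:

```python
# Sketch of a character-trigram naive Bayes language detector,
# the general technique CLD2 uses. Training data is a toy placeholder.
import math
from collections import Counter

def trigrams(text):
    # Pad so word boundaries produce trigrams too.
    text = f"  {text.lower()} "
    return [text[i:i + 3] for i in range(len(text) - 2)]

class NaiveBayesLangID:
    def __init__(self):
        self.counts = {}   # lang -> Counter of trigram frequencies
        self.totals = {}   # lang -> total trigram count

    def train(self, lang, text):
        c = self.counts.setdefault(lang, Counter())
        c.update(trigrams(text))
        self.totals[lang] = sum(c.values())

    def classify(self, text):
        best_lang, best_score = None, -math.inf
        for lang, c in self.counts.items():
            total, vocab = self.totals[lang], len(c) + 1
            # Uniform prior over languages; add-one smoothing per trigram.
            score = sum(
                math.log((c[t] + 1) / (total + vocab)) for t in trigrams(text)
            )
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang

detector = NaiveBayesLangID()
detector.train("en", "the quick brown fox jumps over the lazy dog and runs away")
detector.train("es", "el rápido zorro marrón salta sobre el perro perezoso y corre")
print(detector.classify("the dog runs over the fox"))  # → en
```

In practice the model tables are precomputed offline (that's what CLD2's big generated data files are), so classification is just table lookups and additions, which is what makes the "run it on every page load" budget plausible.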

I removed all of the data tables manually and computed that roughly 5,000 lines of C++ code would need to be audited, tested, fuzzed, etc.

~/dev/firefox/toolkit/components/translation/cld2/internal on firefox/fxt-lang-id
➤ tokei --exclude cld_generated_*
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C++                     2         7772         5366         1215         1191
===============================================================================
 Total                   2         7772         5366         1215         1191
===============================================================================

We could also look at other sandboxing strategies we have available these days.

In addition, the data tables could be stored in new ways, either compressed or memory-mapped.
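To sketch the memory-mapping idea: instead of compiling the tables into the binary or copying them onto the heap, the file can be mapped read-only so the OS pages entries in lazily. The file name and table layout below are made up for illustration:

```python
# Hedged sketch: memory-map a detector data table of little-endian
# uint32 entries and read one entry without loading the whole file.
import mmap
import os
import struct
import tempfile

# Write a fake table of 1000 uint32 entries to disk (stand-in for a
# real generated data table).
path = os.path.join(tempfile.mkdtemp(), "lang_table.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1000I", *range(1000)))

# Map it read-only; the mapping is shared between processes that open it.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    entry = struct.unpack_from("<I", mm, 42 * 4)[0]  # read entry #42
    mm.close()
print(entry)  # → 42
```

A side benefit is that a mapped read-only table can be shared across content processes, which matters more for us than for a single-process consumer.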

From https://modelpredict.com/language-identification-survey and https://github.com/pemistahl/lingua-py, it looks like CLD2 has lower accuracy than fastText, though the gap seems to be mostly on short sentences?
We could build our own benchmark made of recent web pages and check.
