Open Bug 1904628 Opened 8 months ago Updated 3 days ago

Investigate making language detection cheap enough to run on every page load

Categories

(Firefox :: Translations, enhancement)


People

(Reporter: gregtatum, Unassigned)

References

(Blocks 1 open bug)

Details

Currently we don't run language detection on every page load, but maybe we should. It's really hard to get translations correct from the language tag alone. I think the solution would be to integrate some kind of ML solution on the C++ side, cheap enough that we don't have to keep a big Wasm instance around in the background. This bug is a central place to collect that work and the decision.

We have Bug 1842777 to switch to fastText, but I'm skeptical it will be fast enough. CLD3 can be run from the C++ side; we have ONNX Runtime in Nightly through Wasm, and ONNX Runtime runs in C++ as well. We should look at all of these options and figure out how best to solve this language ID problem.

Blocks: 1845772
See Also: → 1842777

A colleague of mine did some benchmarks. He found this to be the fastest:
https://github.com/saffsd/langid.c

Here are his benchmarks, but note that his use-case was running it from Python:
https://github.com/LiableFish/langid.pyc/blob/master/benchmark/benchmark.ipynb

The execution time also remains low with longer text; see the second-to-last image in the benchmark notebook.

He also noted that CLD3 is much less accurate than CLD2.

:adam, thank you for this thoughtful contribution, and sorry for the delay in replying.

When we get to the point of exploring this more concretely, we will surely take these benchmarks into consideration as part of the process.

Blocks: 1938135

There's also https://github.com/ZJaume/heliport which we could investigate, and https://github.com/mbanon/fastspell. Lots of options :)

Also worth noting that there were some recent optimizations on fastText that make it faster than before: https://github.com/facebookresearch/fastText/pull/1341.

CLD2 originally landed in Bug 971047. There was some contention about adding it as a Wasm library, which was a novel choice at the time. It might be worth building it as native code so we don't have to spin up a worker every time. CLD2 may be good enough for our purposes, as it uses a naive Bayes classifier. We should investigate how competitive it is.
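For anyone unfamiliar with the approach: the core of a CLD2-style detector is just a naive Bayes classifier over character n-grams, which is why it's so cheap to run. Here's a minimal sketch of that idea in Python. This is an illustration only, not CLD2's actual implementation; the training samples and the trigram featurization are placeholders:

```python
# Sketch of a character-trigram naive Bayes language detector,
# the general technique CLD2 uses. Training data is a toy placeholder.
import math
from collections import Counter

def trigrams(text):
    # Pad so word boundaries produce trigrams too.
    text = f"  {text.lower()} "
    return [text[i:i + 3] for i in range(len(text) - 2)]

class NaiveBayesLangID:
    def __init__(self):
        self.counts = {}   # lang -> Counter of trigram frequencies
        self.totals = {}   # lang -> total trigram count

    def train(self, lang, text):
        c = self.counts.setdefault(lang, Counter())
        c.update(trigrams(text))
        self.totals[lang] = sum(c.values())

    def classify(self, text):
        best_lang, best_score = None, -math.inf
        for lang, c in self.counts.items():
            total, vocab = self.totals[lang], len(c) + 1
            # Uniform prior over languages; add-one smoothing per trigram.
            score = sum(
                math.log((c[t] + 1) / (total + vocab)) for t in trigrams(text)
            )
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang

detector = NaiveBayesLangID()
detector.train("en", "the quick brown fox jumps over the lazy dog and runs away")
detector.train("es", "el rápido zorro marrón salta sobre el perro perezoso y corre")
print(detector.classify("the dog runs over the fox"))  # → en
```

In practice the model tables are precomputed offline (that's what CLD2's big generated data files are), so classification is just table lookups and additions, which is what makes the "run it on every page load" budget plausible.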

I removed all of the data tables manually and computed that roughly 5,000 lines of C++ code would need to be audited, tested, fuzzed, etc.

~/dev/firefox/toolkit/components/translation/cld2/internal on firefox/fxt-lang-id
➤ tokei --exclude cld_generated_*
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C++                     2         7772         5366         1215         1191
===============================================================================
 Total                   2         7772         5366         1215         1191
===============================================================================

We could also look at other sandboxing strategies we have available these days.

In addition, the data tables could be stored in new ways, either compressed or memory-mapped.
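To sketch the memory-mapping idea: instead of compiling the tables into the binary or copying them onto the heap, the file can be mapped read-only so the OS pages entries in lazily. The file name and table layout below are made up for illustration:

```python
# Hedged sketch: memory-map a detector data table of little-endian
# uint32 entries and read one entry without loading the whole file.
import mmap
import os
import struct
import tempfile

# Write a fake table of 1000 uint32 entries to disk (stand-in for a
# real generated data table).
path = os.path.join(tempfile.mkdtemp(), "lang_table.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1000I", *range(1000)))

# Map it read-only; the mapping is shared between processes that open it.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    entry = struct.unpack_from("<I", mm, 42 * 4)[0]  # read entry #42
    mm.close()
print(entry)  # → 42
```

A side benefit is that a mapped read-only table can be shared across content processes, which matters more for us than for a single-process consumer.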

From https://modelpredict.com/language-identification-survey and https://github.com/pemistahl/lingua-py, it looks like CLD2 has lower accuracy than fastText, though the gap seems to be mostly on short sentences?
We could build our own benchmark made of recent web pages and check.
