Figure out how to localize the interventions QueryScorer
Categories
(Firefox :: Address Bar, enhancement, P3)
Tracking
firefox75: wontfix
People
(Reporter: bugzilla, Unassigned)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
We should involve l10n to understand how to handle the matches with Fluent.
Reporter
Comment 1•4 years ago
Mike, did you end up meeting with the l10n team about this? iirc you mentioned a meeting last week.
Comment 2•4 years ago
We haven't talked yet, but we should. Hopefully next week?
Here are a few topics to cover:
CJKT
These languages use scripts where a single glyph can stand for a whole word, so the Levenshtein distance between the equivalents of "my house is broken" and "my browser is broken" can be as low as 1. Basically, we can't use Levenshtein for those scripts.
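To make the failure mode concrete, here's a minimal sketch (the example sentences are illustrative, not real localized phrases): a generic Levenshtein works on strings or on arrays of word tokens, and in a script where one glyph is a whole word, every character edit behaves like a word-level edit.

```javascript
// Classic dynamic-programming Levenshtein distance over any indexable
// sequence: a string (character level) or an array of tokens (word level).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      cur[j] = Math.min(
        prev[j] + 1,     // deletion
        cur[j - 1] + 1,  // insertion
        prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
    prev = cur;
  }
  return prev[b.length];
}

// Character level: swapping one word costs several edits, so a
// typo-tolerant threshold like "distance <= 2" still rejects it.
const charDist = levenshtein("my house is broken", "my browser is broken");

// Word level: the same pair differs by exactly one edit. In a script
// where each glyph is a whole word, character distance behaves like
// this, so "distance <= 2" would match semantically unrelated phrases.
const wordDist = levenshtein(
  "my house is broken".split(" "),
  "my browser is broken".split(" ")
);
```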
Multiple Scripts
Some languages have multiple scripts, for example Japanese (a logographic script plus syllabaries) or Serbian (Latin and Cyrillic).
Impure use of Scripts
I wonder whether Vietnamese users type their proper script when searching, or whether they just use the closest plain Latin characters.
https://en.wikipedia.org/wiki/Vietnamese_alphabet
Multiple Languages in Searches
The search language will often not match the UI language: I use the German UI but search in English for tech terms. The Russian UI is widely used in the former Soviet Union, so in the Baltics I'd also expect a mix of searches in the local language, Russian, and English.
Mixed Language Searches
Overhearing conversations on Indian buses, folks in India freely mix English and their local language. I'd expect that to be the case for searches too, and I don't know how deterministic that mixing is.
I'm not convinced this is an exhaustive list; it's more like stuff I've picked up over time.
Comment 3•4 years ago
We can at least use exact string matching for languages where edit distance isn't well defined, or where we just don't implement it. We can also recognize phrases in multiple languages/scripts for any given locale/language -- for example, recognizing both German and English phrases for German-speaking locales, or both kana and kanji phrases for Japanese -- although I can see that leading to an explosion in the number of phrases we'd need to recognize. Along those lines, would it make sense to recognize English in all locales/languages?
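A sketch of the multi-language exact matching described above; the phrase lists and names here are made up for illustration, not proposed data:

```javascript
// Hypothetical per-locale phrase sets that include both the local
// language and English, since searches often mix the two.
const PHRASES_BY_LOCALE = new Map([
  ["de", new Set(["browser ist kaputt", "browser is broken"])],
  ["en", new Set(["browser is broken"])],
]);

// Exact matching after trivial normalization; no edit distance involved,
// so it's safe for scripts where Levenshtein isn't meaningful.
function matchesIntervention(locale, query) {
  const phrases = PHRASES_BY_LOCALE.get(locale);
  return !!phrases && phrases.has(query.trim().toLowerCase());
}
```

The obvious cost, as noted above, is that every supported phrase has to be enumerated in every language and script variant we want to recognize.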
Comment 4•4 years ago
Following up on the results of the discussion with l10n: we can proceed with our own object that defines matching keywords per locale (or per group of locales, like "en-XX"), or alternatively we can provide one global list for all locales to better support mixed-language matching. That object will be managed by us, not by localizers.
The source we pick for the keywords is critical; SUMO may be a good one. The source could also hint at whether it's better to group matchers by locale or just make one big blob.
The fuzzy matching algorithm may depend on the locale, though; for instance, we can use Levenshtein only for Western languages. If we go with a per-locale definition, it could also specify which fuzzy matcher to use among the available ones. If no fuzzy matcher is provided, we'll just do exact matching.
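The per-locale definition could be sketched like this. All names and the distance threshold are assumptions for illustration; this isn't the actual QueryScorer code:

```javascript
// Simple character-level Levenshtein distance (dynamic programming).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      cur[j] = Math.min(
        prev[j] + 1,
        cur[j - 1] + 1,
        prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)
      );
    }
    prev = cur;
  }
  return prev[b.length];
}

// Registry of available fuzzy matchers; a locale opts into one by name.
const MATCHERS = {
  levenshtein: (query, keyword) => levenshtein(query, keyword) <= 2,
};

// Hypothetical per-locale definitions: "en" uses fuzzy matching, "ja"
// omits the matcher name and therefore falls back to exact matching.
const LOCALE_DEFS = {
  en: { keywords: ["clear history"], fuzzy: "levenshtein" },
  ja: { keywords: ["履歴を消去"] },
};

function matches(locale, query) {
  const def = LOCALE_DEFS[locale];
  if (!def) return false;
  const fuzzy = MATCHERS[def.fuzzy];
  return def.keywords.some(k => (fuzzy ? fuzzy(query, k) : query === k));
}
```

The design choice here is that exact matching is the universal fallback, so a locale is never broken just because no suitable fuzzy algorithm exists for its script.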
We also discussed using RemoteSettings for this object: if some of the matchers end up being "bogus", we want to be able to fix them outside the usual product update time frames. Mark provided me and Harry with some additional insight into using RemoteSettings.
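A record in such a RemoteSettings collection might look like this; every field name here is an illustrative assumption, not a final schema:

```json
{
  "locale": "de",
  "interventions": [
    {
      "id": "clear-history",
      "keywords": ["verlauf löschen", "clear history"],
      "fuzzyMatcher": "levenshtein"
    }
  ]
}
```

Shipping the data this way would let us correct a bogus matcher or keyword list without waiting for a release.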
If I missed or misunderstood something from the discussion, please correct me.
Comment 5•4 years ago
Another point I forgot: the l10n team asked us to apply strings to tips and interventions using data-l10n-id in the DOM, rather than passing translated strings from the providers. This is another bug that should be filed, along with the RemoteSettings one.
Comment 6•4 years ago
I'm looking into using NLP.js after Mike mentioned it to me, both for this bug and bug 1606915: https://github.com/axa-group/nlp.js
Comment 7•4 years ago
Another interesting project, used by Chrome, is https://github.com/google/cld3 (if we care about detecting the language of the words the user typed).
Comment 8•4 years ago
Making this a bit more general than "matching data."