Bug 1606916 (Open) - Opened 4 years ago, updated 8 months ago

Figure out how to localize the interventions QueryScorer

Categories

(Firefox :: Address Bar, enhancement, P3)

Points:
5

Tracking


Iteration:
76.2 - Mar 23 - Apr 5
Tracking Status
firefox75 --- wontfix

People

(Reporter: bugzilla, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

We should involve l10n to understand how to handle the matches with Fluent.

Priority: -- → P2

Mike, did you end up meeting with the l10n team about this? iirc you mentioned a meeting last week.

Flags: needinfo?(mdeboer)

We haven't talked yet, but we should. Hopefully next week?

Here are a few topics to cover:

CJKT

These languages use scripts where a single glyph can stand for a whole word, so the equivalent of "my house is broken" vs. "my browser is broken" can come out at a Levenshtein distance of 1. Basically, we can't use Levenshtein for those scripts.
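To make that concrete, here's a rough sketch with a textbook character-level Levenshtein function (not the actual QueryScorer code; the Chinese strings are only my own illustration):

function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let curr = [i];
    for (let j = 1; j <= b.length; j++) {
      let cost = a[i - 1] == b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

levenshtein("my house is broken", "my browser is broken");
// => 4 edits spread over ~20 characters, which a threshold can handle.
levenshtein("我的猫", "我的狗");
// => 1, even though "my cat" and "my dog" are completely different queries.

Any distance threshold tuned for Latin text would treat that second pair as a near-match.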

Multiple Scripts

Some languages have multiple scripts, for example Japanese (one word-based, one syllable-based), or Serbian (Latin and Cyrillic).

Impure use of Scripts

I wonder whether Vietnamese users type their proper script when searching, or whether they just use the closest plain Latin characters.
https://en.wikipedia.org/wiki/Vietnamese_alphabet

Multiple Languages in Searches

The search language will often not match the UI language. I use the German UI but search in English for tech terms. The Russian UI is widely used across the former Soviet Union, so in the Baltics I'd also expect a mix of searches in the local language, Russian, or English.

Mixed Language Searches

From overhearing conversations on buses in India, folks there freely mix English and the local language. I would expect that to be the case for searches too, and I don't know how deterministic that mixing is.

I'm not convinced this is an exhaustive list; it's more like things I've picked up over time.

We can at least use exact string matching for languages where edit distance isn't well defined, or where we just don't implement it. We can also recognize phrases in multiple languages/scripts for any given locale/language -- for example, recognizing both German and English phrases in German-speaking locales, or both kana and kanji phrases for Japanese -- although I can see that leading to an explosion in the number of phrases we'd need to recognize. Along those lines, would it make sense to recognize English in all locales/languages?
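As a strawman for the exact-matching fallback plus multi-language phrases (the object shape and IDs below are invented for illustration, not an existing QueryScorer API):

// Hypothetical phrase data for a German-speaking locale: each intervention ID
// maps to phrases in every language/script we want to recognize there.
const PHRASES_DE = {
  "intervention-clear": [
    "clear firefox history",   // English, possibly recognized in all locales
    "firefox verlauf löschen", // German
  ],
};

// Exact matching only, so it works for any script, including ones where edit
// distance isn't meaningful.
function findExactMatch(query, phrases) {
  let q = query.trim().toLowerCase();
  for (let [id, list] of Object.entries(phrases)) {
    if (list.some(p => p.toLowerCase() == q)) {
      return id;
    }
  }
  return null;
}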

Following up on the results of the discussion with l10n: we can proceed with our own object that defines matching keywords per locale (or per group of locales, like "en-XX"), or alternatively provide a single global list for all locales to better support mixed-language matching. That object will be managed by us, not by localizers.
Which source we pick for the keywords is critical; SUMO may be a good source. The source could also provide hints on whether it's better to group matchers by locale or just make one big blob.
The fuzzy matching algorithm may depend on the locale, though; indeed, we can use Levenshtein only for Western languages. If we go with a per-locale definition, it could also define which of the available fuzzy matchers to use. If no fuzzy matcher is provided, we will just do exact matching.
We also discussed using RemoteSettings for this object, because if some of the matchers end up being "bogus", we want to be able to fix them outside the usual product update time frames. Mark provided me and Harry with some additional insight into using RemoteSettings.
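To make the above a bit more concrete, here's one possible shape for that object, e.g. as RemoteSettings records. Every field name here is hypothetical and only meant to capture the points above (keywords per locale or globally, an optional per-locale fuzzy matcher, exact matching as the fallback):

const MATCHING_DATA = [
  {
    locales: ["en-US", "en-GB"],  // or "*" for one big global blob
    fuzzyMatcher: "levenshtein",  // only meaningful for Western languages
    keywords: {
      "intervention-update": ["update firefox", "firefox update"],
    },
  },
  {
    locales: ["ja"],
    // No fuzzyMatcher here: edit distance isn't useful, fall back to exact matching.
    keywords: {
      "intervention-update": ["firefox 更新", "firefox アップデート"],
    },
  },
];

function matcherForLocale(locale) {
  let entry = MATCHING_DATA.find(
    e => e.locales.includes(locale) || e.locales.includes("*")
  );
  return entry ? entry.fuzzyMatcher || "exact" : "exact";
}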

If I missed or misunderstood something from the discussion, please correct me.

Another point I forgot: the l10n team asked us to apply strings to tips and interventions using data-l10n-id in the DOM, rather than passing translated strings from the providers. This is another bug that should be filed, along with the RemoteSettings one.
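Roughly what they're asking for, as I understand it (the l10n ID below is made up, not the final markup):

// Instead of the provider handing the view a pre-translated string...
//   element.textContent = translatedTitle;
// ...set a Fluent ID on the element and let DOM localization resolve it:
element.setAttribute("data-l10n-id", "urlbar-intervention-refresh-profile");
// or equivalently from chrome JS:
// document.l10n.setAttributes(element, "urlbar-intervention-refresh-profile");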

Depends on: 1612496

I'm looking into using NLP.js after Mike mentioned it to me, both for this bug and bug 1606915: https://github.com/axa-group/nlp.js
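For the record, the basic usage I'm experimenting with, based on the project's README (the intents below are made up for this bug):

const { NlpManager } = require("node-nlp");

async function demo() {
  const manager = new NlpManager({ languages: ["en"] });
  manager.addDocument("en", "clear my history", "intervention.clear");
  manager.addDocument("en", "firefox is running slow", "intervention.refresh");
  await manager.train();
  let result = await manager.process("en", "how do I clear my browsing history");
  // result.intent and result.score give the best-matching intent and its confidence.
  return result;
}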

Assignee: nobody → adw
Status: NEW → ASSIGNED
Iteration: --- → 75.1 - Feb 10 - Feb 23

Another interesting project in this area, which is being used by Chrome, is https://github.com/google/cld3 (if we care about detecting the language of the words the user typed).

Iteration: 75.1 - Feb 10 - Feb 23 → 75.2 - Feb 24 - Mar 8

Making this a bit more general than "matching data."

Summary: Figure out how to localize the QueryScorer matching data → Figure out how to localize the QueryScorer
Flags: needinfo?(mdeboer)
Summary: Figure out how to localize the QueryScorer → Figure out how to localize the interventions QueryScorer
Iteration: 75.2 - Feb 24 - Mar 8 → 76.1 - Mar 9 - Mar 22
Iteration: 76.1 - Mar 9 - Mar 22 → 76.2 - Mar 23 - Apr 5
Severity: normal → S3
Assignee: adw → nobody
Status: ASSIGNED → NEW
Priority: P2 → P3