Open Bug 1358628 Opened 7 years ago Updated 2 years ago

Rethink multi-negotiation in LocaleService

Categories

(Core :: Internationalization, enhancement)

People

(Reporter: zbraniecki, Unassigned)

References

(Blocks 1 open bug)

Details

At the moment the LocaleService's language negotiation has three strategies we can use:

 - lookup
 - matching
 - filtering

But all three return the available locale(s) in the result.

This works great for when we're matching a single list of requested locales to a single list of available locales. For example:

requested: ['en']
available: ['en-US', 'en-GB', 'de', 'fr']
result: ['en-US']

is the right call because the negotiation can evaluate the closest match for 'en' and guess that 'en-US' is probably the right locale to use. It uses the `likelySubtags` list from CLDR for that.
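
The single-list case above can be sketched in Python. This is only an illustration, not the LocaleService code: `LIKELY_SUBTAGS` is a hand-written stand-in for a few CLDR entries, and `language` and `negotiate` are hypothetical helper names.

```python
# Hypothetical, hand-written stand-in for a few CLDR likelySubtags entries.
LIKELY_SUBTAGS = {"en": "en-US", "de": "de-DE", "fr": "fr-FR"}


def language(tag):
    """Return the primary language subtag, e.g. 'en' for 'en-GB'."""
    return tag.split("-")[0].lower()


def negotiate(requested, available):
    """Pick one available locale per requested locale, best match first."""
    result = []
    for req in requested:
        maximized = LIKELY_SUBTAGS.get(req, req)
        # Preference order: exact match, likely-subtags match, same language.
        candidates = (
            [a for a in available if a == req]
            + [a for a in available if a == maximized]
            + [a for a in available if language(a) == language(req)]
        )
        for candidate in candidates:
            if candidate not in result:
                result.append(candidate)
                break
    return result


print(negotiate(["en"], ["en-US", "en-GB", "de", "fr"]))  # ['en-US']
```

Because the likely-subtags step runs before the plain same-language step, 'en' resolves to 'en-US' even if 'en-GB' is listed first in the available list.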

But that approach has a limitation when we negotiate between multiple available lists.
An example of such a scenario is when we have L10nRegistry locales and ChromeRegistry locales, or Java resources and L10nRegistry locales, or Intl.DateTimeFormat.availableLocales and L10nRegistry resources.

In all those scenarios we have to do multi-negotiation and there are generally two approaches we can take:

1) We can create an intersection of `available1 x available2 x available3` and then negotiate the availableIntersection against requested

2) We can negotiate available1 against requested, and available2 against requested and available3 against requested and then create an intersection of the results.



The former is tricky to achieve. Imagine this:

available1: ['en']
available2: ['en-GB']

What is the result of ['en'] ∩ ['en-GB']? Is it an empty set? Or do we do some fuzzy matching to get both and pretend that the intersection of those two available lists is ['en', 'en-GB']?

The latter seems to work better for identifying the exact resource we'll use:

available1: ['en']
available2: ['en-GB']
requested: ['en-US']

supported1: ['en-US'] x ['en'] = ['en']
supported2: ['en-US'] x ['en-GB'] = ['en-GB']

But the problem then is that we need to filter out the requested that we don't have a match in all availables for to avoid this:

available1: ['de', 'en']
available2: ['en-GB']
requested: ['de', 'en-US']

supported1: ['de', 'en-US'] x ['de', 'en'] = ['de', 'en']
supported2: ['de', 'en-US'] x ['en-GB'] = ['en-GB']

which would lead to supported1 using 'de', while supported2 uses 'en-GB'.
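
A small Python sketch of this mismatch, assuming a deliberately simplified matcher that only compares the primary language subtag (`language` and `negotiate_available` are hypothetical names, not LocaleService API):

```python
def language(tag):
    """Return the primary language subtag, e.g. 'en' for 'en-GB'."""
    return tag.split("-")[0].lower()


def negotiate_available(requested, available):
    """Return the matching *available* locales, in requested order."""
    result = []
    for req in requested:
        for avail in available:
            if language(avail) == language(req) and avail not in result:
                result.append(avail)
    return result


requested = ["de", "en-US"]
supported1 = negotiate_available(requested, ["de", "en"])
supported2 = negotiate_available(requested, ["en-GB"])

# Each source picks its own first entry, so the two sources disagree.
print(supported1[0], supported2[0])  # de en-GB
```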

My proposal is to introduce a new parameter to languageNegotiation for filtering strategy only that allows to filter requested locales, instead of available.

The end result would look like this:

available1: ['de', 'en']
available2: ['en-GB']
requested: ['de', 'en-US']

supported1: ['de', 'en-US'] x ['de', 'en'] = ['de', 'en-US']
supported2: ['de', 'en-US'] x ['en-GB'] = ['en-US']

supported: ['de', 'en-US'] ∩ ['en-US'] = ['en-US']

and this is what "appLocales" will be.
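
The proposed flow could look roughly like the following Python sketch. It assumes the same simplified language-subtag matching, and `filter_requested` and `negotiate_multi` are hypothetical names used only for illustration:

```python
def language(tag):
    """Return the primary language subtag, e.g. 'en' for 'en-GB'."""
    return tag.split("-")[0].lower()


def filter_requested(requested, available):
    """Keep the *requested* locales that `available` has some match for."""
    available_languages = {language(a) for a in available}
    return [r for r in requested if language(r) in available_languages]


def negotiate_multi(requested, *available_sets):
    """Intersect the filtered requested lists across all available sets."""
    supported = list(requested)
    for available in available_sets:
        kept = filter_requested(requested, available)
        supported = [r for r in supported if r in kept]
    return supported


app_locales = negotiate_multi(["de", "en-US"], ["de", "en"], ["en-GB"])
print(app_locales)  # ['en-US']
```

Since every per-source result is expressed in the vocabulary of the requested list, the intersection is a plain list intersection and needs no fuzzy matching.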

Then, when we look into a particular component, this may happen:

supported x ChromeRegistry.availableLocales = ['en']
supported x L10nRegistry.availableLocales = ['en-GB']
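
As a sketch of that per-component step, assuming ChromeRegistry and L10nRegistry hold the `available1` and `available2` lists from the example above (`resolve` is a hypothetical helper, and the matching is again language-subtag only):

```python
def language(tag):
    """Return the primary language subtag, e.g. 'en' for 'en-GB'."""
    return tag.split("-")[0].lower()


def resolve(app_locales, available):
    """Map each negotiated locale to the component's own matching resources."""
    return [
        avail
        for locale in app_locales
        for avail in available
        if language(avail) == language(locale)
    ]


app_locales = ["en-US"]
chrome_locales = resolve(app_locales, ["de", "en"])  # assumed ChromeRegistry list
l10n_locales = resolve(app_locales, ["en-GB"])       # assumed L10nRegistry list
print(chrome_locales, l10n_locales)  # ['en'] ['en-GB']
```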

This is also what the Intl API is doing - its language negotiation returns which of the requested locales it found a match for, regardless of the actual locale ID it used.
It basically means - you asked for ['de-AT', 'fr-CA'], we have locale resources that will match 'de-AT' (even if the actual resources are for 'de').

This allows you to go around and see if you can find a match for 'de-AT' for another Intl API or maybe locale resources, or maybe fonts, whatever.

This would require a little more logic in language negotiation, but since we're still negotiating a small number of requestedLocales (<10?) and language switching is rare, I hope it won't be a significant problem.
:jfkthame, :pike, :stas, :rnewman - thoughts?

I'm getting close to starting the review for L10nRegistry, and for the upcoming Fennec project we need to figure out how to negotiate between Java Resources, L10nRegistry and ChromeRegistry all together :)

:stas also wanted to get Intl API x L10nRegistry for Fluent.
Flags: needinfo?(stas)
Flags: needinfo?(rnewman)
Flags: needinfo?(l10n)
Flags: needinfo?(jfkthame)
> But the problem then is that we need to filter out the requested that we don't have a match in all availables…

The proposal to make the negotiation comprise two steps solves this exact problem, right? The first step filters the requested locales and the second one negotiates against the available ones of each source.

In this context, ECMA 402's use of the word "supported" also makes sense: it denotes a subset of the requested locales which can be supported by the given set of the available locales.

> My proposal is to introduce a new parameter to languageNegotiation for filtering strategy only that allows to filter requested locales, instead of available.

Do we expect these two different use-cases to take the same options?  For instance, would we want to allow the `defaultLocale` option when negotiating the requested locales? Perhaps it's worth considering having two explicit operations, BestAvailableLocales and SupportedLocales (to use the naming from ECMA 402) and expose them as two separate methods?
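
One way to picture the two-method split, as a Python sketch only. The function names are snake_case renderings of the ECMA 402 operation names, the matching is simplified to the language subtag, and leaving `default_locale` out of the filtering step is an assumption, since that is exactly the open question here:

```python
def language(tag):
    """Return the primary language subtag, e.g. 'de' for 'de-AT'."""
    return tag.split("-")[0].lower()


def supported_locales(requested, available):
    """Subset of `requested` that `available` can support (no default here)."""
    available_languages = {language(a) for a in available}
    return [r for r in requested if language(r) in available_languages]


def best_available_locales(requested, available, default_locale=None):
    """Members of `available` to use, falling back to `default_locale`."""
    result = []
    for req in requested:
        for avail in available:
            if language(avail) == language(req) and avail not in result:
                result.append(avail)
    if not result and default_locale is not None:
        result.append(default_locale)
    return result


available = ["de", "en-US"]
print(supported_locales(["de-AT", "fr-CA"], available))       # ['de-AT']
print(best_available_locales(["de-AT", "fr-CA"], available))  # ['de']
print(best_available_locales(["ja"], available, "en-US"))     # ['en-US']
```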
Flags: needinfo?(stas)
The description of your problem sounds suspiciously like a rephrasing of set cover, which is NP-complete.

I suspect you're making a general-purpose algorithm to solve a specific case.

The specific case I'm aware of is:

- One part of the app has a list of supported locales: A, B, C, D, E.
- One part of the app has a subset of that list, or even an overlapping set: B, C, F.
- Both sets are a subset of a broader set of locales: A, B, C, D, E, F, G.

Note that the third statement bounds the universe. It also implies commonality in region code. For Fennec you're not trying to solve

requested: ['fr']
available1: ['fr-AL', 'en-GB']
available2: ['es-FR', 'pl']

you're trying to solve:

requested: ['fr']
available1: ['fr', 'es-MX', 'en-US']
available2: ['fr', 'en-US']

or

requested: ['fr']
available1: ['fr', 'es-MX', 'en-US']
available2: ['en-US']

-- markedly simpler, because the two sets are regular and predictable.

Furthermore, in the Fennec case it's initially the case that `available2` is always a subset of `available1`, and moving forward it's likely that the two sets will accrue new members in the same operation. Again, that simplifies the problem.

Can you clarify what the LocaleService's language negotiation is *for*, beyond the above?
Flags: needinfo?(rnewman)
> Can you clarify what the LocaleService's language negotiation is *for*, beyond the above?


For anyone lacking background, the intro is:
 - https://www.ietf.org/rfc/rfc4647.txt
 - https://www.w3.org/TR/ltli/
 - https://www.w3.org/International/questions/qa-when-lang-neg

but :rnewman rephrased his question to

> I suppose my question is: if you closed this bug WONTFIX, what doesn't work?

If we close this as wontfix, we would have a language negotiation mechanism that can only operate on a single set of "available".
That, in turn, would lead to us having to always guarantee that one set of availables is a subset of another and negotiate between requested and the smallest set of available in scenarios where we have more than one set of availables.

That's going to be tricky since:
 - Java and Gecko use different locale codes. Aligning that so that each code is fully matching ("sr-Cyrl" vs "sr-SR" etc.) will be tricky
 - Gecko and ICU sets may not overlap. If we'll want to narrow down the selection to one of them, we'd need investment in ICU build system to make sure that we include the data that is a superset or subset of Gecko locales.
 - Gecko and WebExtensions locales would need to be bound

My take is that getting a robust language negotiation that guards us in all scenarios is a more stable foundation for the future.
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #4)

> That's going to be tricky since:
>  - Java and Gecko use different locale codes. Aligning that so that each
> code is fully matching ("sr-Cyrl" vs "sr-SR" etc.) will be tricky
>  - Gecko and ICU sets may not overlap. If we'll want to narrow down the
> selection to one of them, we'd need investment in ICU build system to make
> sure that we include the data that is a superset or subset of Gecko locales.
>  - Gecko and WebExtensions locales would need to be bound

OK, that helps clarify, I think.

I think I understand your filtering approach, too, though correct me if I'm wrong! You're declaring the 'requested' set as the bounds and the vocabulary for the solution, with the idea being that the members of that set will yield something reasonable when re-applied to each of the available sets.

Am I correct in understanding that if I throw "en-FR" at, say, ICU, and it has "en-GB", it'll use that?


My next concern is about unpredictable behavior introduced by changes in one of the sets.

To take your worked example:

---
available1: ['de', 'en']
available2: ['en-GB']
requested: ['de', 'en-US']

supported1: ['de', 'en-US'] x ['de', 'en'] = ['de', 'en-US']
supported2: ['de', 'en-US'] x ['en-GB'] = ['en-US']

supported: ['de', 'en-US'] ∩ ['en-US'] = ['en-US']
---

Let's say `available2` _adds_ `de-FR`. We re-negotiate in the background, and `supported` becomes `['de', 'en-US']`. By adding a new ICU dictionary, my browser, or one of its add-ons, just switched to German. That's what'll happen with filtering renegotiation across multiple sets, right?

You could argue that this is what I wanted when I checked "Deutsch" and "English (US)" in some picker a month ago, but I suspect this would be surprising.
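
To make the concern concrete, here is a Python sketch of the filtering negotiation run before and after `available2` gains `de-FR`. It assumes simplified language-subtag matching, and `negotiate_multi` is a hypothetical name rather than an existing API:

```python
def language(tag):
    """Return the primary language subtag, e.g. 'de' for 'de-FR'."""
    return tag.split("-")[0].lower()


def negotiate_multi(requested, *available_sets):
    """Keep the requested locales that every available set can match."""
    return [
        req
        for req in requested
        if all(
            any(language(avail) == language(req) for avail in available)
            for available in available_sets
        )
    ]


requested = ["de", "en-US"]
before = negotiate_multi(requested, ["de", "en"], ["en-GB"])
after = negotiate_multi(requested, ["de", "en"], ["en-GB", "de-FR"])

print(before)  # ['en-US']
print(after)   # ['de', 'en-US']
```

The first entry flips from 'en-US' to 'de' even though the requested list never changed.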

What would happen for this worked example?

---
available1: ['de', 'fr']
available2: ['en', 'es']
requested: ['de', 'en-US']
---

I think it's:

---
supported1: ['de', 'fr'] x ['de', 'en-US'] = ['de']
supported2: ['en', 'es'] x ['de', 'en-US'] = ['en-US']

supported: ['de'] ∩ ['en-US'] = []
---

which… is no use, right?


I can't help but think that there are one or two of these sets that are 'primary', and the rest shouldn't factor into the equation at all.

That is: when I tell Firefox which locale I want to use, I don't expect ICU or WebExtensions to have any say in the matter; I expect them to either download necessary content, or fall back to a reasonable fallback. I do not expect them to use a strictly different locale to my primary UI -- all components involved in this negotiation should have the same target.

Given that -- as I assumed above -- these things will fall back to whatever they can from my chosen locale, why do we need to involve them in the negotiation?

Are you solely trying to address the situation where I have one extension supplying French, one extension supplying German, and I'm Swiss, with ["de", "fr", "en"] checked, and so I'd like Extension A to show French UI and Extension B to show German, rather than both falling back to whatever hacky 'en' default they ship?

Is this a situation that really occurs?

Sorry if I'm being dense!
> supported: ['de'] ∩ ['en-US'] = []
> ---
> which… is no use, right?

Which is why in language negotiation we have the "defaultLocale" which is the last resort locale we use when we can't match anything.
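
A Python sketch of that fallback for the example above. The `default_locale` value of 'en-US' is an assumption chosen for illustration, the matching is language-subtag only, and the function name is hypothetical:

```python
def language(tag):
    """Return the primary language subtag, e.g. 'en' for 'en-US'."""
    return tag.split("-")[0].lower()


def negotiate_with_default(requested, available_sets, default_locale):
    """Filter requested locales across all sets; fall back when none survive."""
    supported = [
        req
        for req in requested
        if all(
            any(language(avail) == language(req) for avail in available)
            for available in available_sets
        )
    ]
    # Last resort: nothing in `requested` is matched by every set.
    return supported or [default_locale]


result = negotiate_with_default(
    ["de", "en-US"],
    [["de", "fr"], ["en", "es"]],
    default_locale="en-US",  # assumed value
)
print(result)  # ['en-US']
```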

> I can't help but think that there are one or two of these sets that are 'primary', and the rest shouldn't factor into the equation at all.

Completely agree!

That's why, for example, I don't think that in Firefox we would "block" a locale on missing ICU data.
But with a robust multi-set language negotiation solution we're in control of choosing which sets are "primary" and which are not.

For example, in the Fennec case we have two "primary" sets (Java resources and L10nRegistry resources) and multiple secondary ones (extensions, ICU, etc.)

> Given that -- as I assumed above -- these things will fall back to whatever they can from my chosen locale, why do we need to involve them in the negotiation?

Yes, for secondary sets, you just do: availableSecondary x negotiated

so you treat the negotiated languages as the requested set. That means that if we filtered out "de" because we don't have Java resources for it, then it's not in the "negotiated".
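
Sketched in Python, with made-up locale lists for the Java, L10nRegistry and extension sources (all names and data here are hypothetical, and matching is simplified to the language subtag):

```python
def language(tag):
    """Return the primary language subtag, e.g. 'en' for 'en-GB'."""
    return tag.split("-")[0].lower()


def matches(available, locale):
    """True if `available` has any locale with the same language subtag."""
    return any(language(avail) == language(locale) for avail in available)


requested = ["de", "en-US"]

# Primary sets decide what survives negotiation (assumed example data).
primary = {"java": ["en-US"], "l10nregistry": ["de", "en-US"]}
negotiated = [
    req for req in requested
    if all(matches(available, req) for available in primary.values())
]

# A secondary set only resolves against the negotiated list.
extension_available = ["de", "en-GB"]
extension_locales = [
    avail
    for locale in negotiated
    for avail in extension_available
    if language(avail) == language(locale)
]

print(negotiated)         # ['en-US']
print(extension_locales)  # ['en-GB']
```

The extension ships 'de', but since 'de' was dropped for lack of Java resources, the extension follows the primary UI and uses 'en-GB'.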

> Are you solely trying to address the situation where I have one extension supplying French, one extension supplying German, and I'm Swiss, with ["de", "fr", "en"] checked, and so I'd like Extension A to show French UI and Extension B to show German, rather than both falling back to whatever hacky 'en' default they ship?

I'd like to get to the point where we have a solution to more than one "primary" set of availables.
That may mean different things in different scenarios.

I may want to negotiate languages together for all extensions, or for Java and Gecko resources, or for Fluent and ICU for some non-Gecko case.

I believe that if we can come up with a strategy (and I think that Stas's proposal is a good candidate), we'll be able to freely choose what's "primary" for a given product and get a good selection of negotiated locales based on a bulletproof negotiation strategy rather than a bulletproof build system.

That may help us later when we get to places where we don't control all of the build system or want to handle scenarios with third-party input.
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #4)
> 
> > I suppose my question is: if you closed this bug WONTFIX, what doesn't work?
> 
> If we close this as wontfix, we would have a language negotiation mechanism
> that can only operate on a single set of "available".
> That, in turn, would lead to us having to always guarantee that one set of
> availables is a subset of another and negotiate between requested and the
> smallest set of available in scenarios where we have more than one set of
> availables.

I don't think that's necessarily the case. We've already come to two counterexamples here where we don't want the negotiation to happen cross-silo, namely ICU and extensions.

I also think that the example of Android/Gecko is a counterexample, as that problem isn't one of matching user selection, but of mapping one to the other.
Flags: needinfo?(l10n)
Flags: needinfo?(jfkthame)
Severity: normal → S3