Closed Bug 525494 Opened 10 years ago Closed 2 years ago

Make chrome registry follow BCP 47 (RFC 4646 and 4647), plus tweaking

Categories

(Core :: Internationalization, defect)

defect
Not set

Tracking

()

RESOLVED WONTFIX

People

(Reporter: Pike, Assigned: GPHemsley)

References

(Blocks 2 open bugs, )

Details

(Whiteboard: [bcp47])

Attachments

(1 file, 1 obsolete file)

With the growing number of locales, we're exposing some shortcomings of the locale matching algorithm that the chrome registry uses right now.

We should support script tags, as well as some out-of-bcp-47 mechanism that allows us to determine how different versions for Spanish match. Say, you're looking for the best match for es-CL within es-ES, es-MX, es-AR.

Maybe we can implement the core code in js these days, parsing bcp 47 in c++ on string iterators would be pain.

Benjamin, comments? (And move the bug to the right component, as you please.)
Don't really care, a JS implementation sounds fine.
I hope I'm not barking up completely the wrong tree here...

I wonder if svg <switch> should follow bcp 47, particularly if allowReorder="yes" If so a c++ implementation would be preferable so it could replace nsSVGFeatures::GetBestLanguagePreferenceRank.

For reference:

http://www.w3.org/TR/SVG/struct.html#SwitchElement
http://www.w3.org/TR/SMIL3/smil-content.html#adef-allowReorder
Why does that make a C++ implementation preferable? Seems like a JS impl would work just fine for that case.
OK. My knowledge of C++ calling JS is rather limited.
Blocks: 486686
Blocks: 380287
We have similar locale selection code in the add-ons manager, it'd be nice for that to be able to use something general there too:

http://hg.mozilla.org/projects/addonsmgr/file/a8ef9131036b/toolkit/mozapps/extensions/XPIProvider.jsm#l154
My current idea is to expose the code to the chrome reg, and at least the jetpack feature api around l10n. Not yet sure if I can go as far as "don't call that code by default at all, if needed, call xpcom wrapping jsm". The latter might be too expensive and too many files, I'm not sure.

Mossop, what (more or less) exactly is that code you pointed at used for?
(In reply to comment #6)
> Mossop, what (more or less) exactly is that code you pointed at used for?

Given a list of localized name/description/other strings that an extension provides in its install.rdf it tries to select the best one for the current application locale.
I'm starting to hack on this.
Assignee: smontagu → l10n
Status: NEW → ASSIGNED
(In reply to comment #7)
> (In reply to comment #6)
> > Mossop, what (more or less) exactly is that code you pointed at used for?
> 
> Given a list of localized name/description/other strings that an extension
> provides in its install.rdf it tries to select the best one for the current
> application locale.

Just to mention in terms of API, as far as the EM is concerned (and maybe the chrome registry too), having something I can pass an array of locales to and having it give back the best match would be pretty much ideal.
I'm not sure if I want to pass arrays through xpcom, sounds like overweight.

The rough idea I have for xpcom is to fast-path anything that only has one entry, and if more than one entry is present, I might do something like this in pseudo code:

handler = bcp47service.getHandler(preferredLocale, domain);
for (loc in locales) {
  handler.visit(loc);
}
return handler.bestMatch;

That sounds like I don't have to create arrays I need to pass through xpcom. oth, I need to create the handler object, but at least I can do that in js land.

The 'domain' is to enable locale selection per extension or jetpack again, I don't think it'll hurt adding it to the API.

Probably a question for Benjamin, IIRC, he dropped that when creating the toolkit chrome registry (the xpfe one had that feature). Do you think that was a good choice, and would it still be?
(In reply to comment #10)
> The rough idea I have for xpcom is to fast-path anything that only has one
> entry, and if more than one entry is present, I might do something like this in
> pseudo code:
> 
> handler = bcp47service.getHandler(preferredLocale, domain);
> for (loc in locales) {
>   handler.visit(loc);
> }
> return handler.bestMatch;

The problem with this is that from C++ to the JS service you'll be crossing xpconnect for every call which doesn't seem terribly efficient.

PRUnichar arrays aren't terribly painful to construct in C++ and pass to the JS component. Some rough psuedo code:

nsTArray<const PRUnichar*> locales;
for every locale
  locales.AppendElement(locale)
bcp47service.GetBestMatch(locales.Elements(), locales.Length(), match);
... or should I pass an nsIUTF8StringEnumerator? I'd probably end up writing directly on top of the nsProviderArray. Could be feasible to be called from js that way, too?
(In reply to comment #12)
> ... or should I pass an nsIUTF8StringEnumerator? I'd probably end up writing
> directly on top of the nsProviderArray. Could be feasible to be called from js
> that way, too?

That probably works too
Blocks: 654467
Blocks: bcp47
Whiteboard: [bcp47]
Assignee: l10n → gphemsley
Is this a sufficient proof of concept for the code that is intended with this bug?

https://github.com/GPHemsley/BCP47/blob/master/bcp47.js

Some code demonstrating how to use it is here:
https://github.com/GPHemsley/BCP47/blob/master/bcp47.php

One thing to note about this is that it assumes the user knows best, so it checks every value of their priority list literally before it uses the BCP 47 Lookup algorithm to find the next-best match.

This means (for example) if your priority list is [ zh-Hant-CN, en-US ] and your available list is [ en-US, zh-Hant ], you will get back 'en-US' rather than 'zh-Hant'. I don't know if that's the desired behavior.

Not also that the getRelatedTags() function gives the behavior Axel referred to in comment 0, where we can add additional related tags to the end of the priority list (e.g. to pick the best fit for a regional locale we don't support directly). Again, this assumes the user knows best, so the related tags are added to the end of the list, not in the middle.

If this code is along the right lines, where should I go about putting it in the Mozilla codebase?
Attached patch Add BCP47 module (WIP v1) (obsolete) — Splinter Review
I've integrated that external code into patch to m-c. It's still a work in progress, but I'd like to know if I'm heading in the right direction with this.

This patch incorporates the code introduced in bug 730209 (and bug 741842). I haven't yet tied it in with the rest of the module, but I assume that this is a better place to keep it. (I envision a follow-up bug to remove the duplicate code from InlineSpellChecker.jsm later on.)

I'm not sure if the parseTag() function is necessary—someone please enlighten me of situations where it might come in handy. (One such situation might be the getDisplayName() integration mentioned above.)

The tests included in this patch may not be exhaustive. I know testLookup() certainly isn't. But I wanted to get something thrown together so that I could get a better sense of where this should be going.

Axel, Simon: Thoughts?
Attachment #631019 - Flags: feedback?(smontagu)
Attachment #631019 - Flags: feedback?(l10n)
Blocks: 743385
Blocks: 181515
Comment on attachment 631019 [details] [diff] [review]
Add BCP47 module (WIP v1)

Review of attachment 631019 [details] [diff] [review]:
-----------------------------------------------------------------

I think there are a host of js comments here. Probably best to get some input from toolkit hackers.

Things I noticed:

BCP47('en-US') vs new BCP47('en-US') can probably be handled. I also found the BCP47.BCP47 thing surprising.

The string bundles should probably be in a separate wrapper, and use getters to get the bundles late, and if needed.

The code could use some more structure and comments, I have a hard time navigating it. Maybe slice it up in one object that implements the API and is exported and shows the general structure of what the code is doing, and one that's the workhorse?

The regular expressions are nightmare-ish. I wouldn't know when there's a () and a (:?) and what. I wonder if the parser can be written cleaner by using only chunks, and offsets. Might be faster, even, and easier to maintain.

For the test, gathering feedback from whimboo. I'm not sure if we have a strategy for running that phase of our tests against non-en-US builds?
Attachment #631019 - Flags: feedback?(l10n)
Attachment #631019 - Flags: feedback?(hskupin)
Attachment #631019 - Flags: feedback-
(In reply to Axel Hecht [:Pike] from comment #16)
> Comment on attachment 631019 [details] [diff] [review]
> Add BCP47 module (WIP v1)
> 
> Review of attachment 631019 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I think there are a host of js comments here. Probably best to get some
> input from toolkit hackers.

I'm sure there are, but my primary concern right now is just nailing down the feature set: What do we need this module to do? What don't we need it to do?

> Things I noticed:
> 
> BCP47('en-US') vs new BCP47('en-US') can probably be handled. I also found
> the BCP47.BCP47 thing surprising.

I've since consolidated the object definition into a simple 'let BCP47 ='. As for BCP47.BCP47(), we could probably do without it. All it does right now is call BCP47.parseTag(), anyway.

> The string bundles should probably be in a separate wrapper, and use getters
> to get the bundles late, and if needed.
>
> The regular expressions are nightmare-ish. I wouldn't know when there's a ()
> and a (:?) and what. I wonder if the parser can be written cleaner by using
> only chunks, and offsets. Might be faster, even, and easier to maintain.

The contents of getDisplayName are copied directly from the code for bug 730209 (and bug 741842). Much of that is now duplicated in parseTag, so I intend to consolidate them eventually. (For example, there is no need to have two different regexps—in two different formats—doing the same thing in the same object.)

With that being said, which method do you prefer: The one in parseTag, where it's all chunked out? Or the one in getDisplayName, where it's a single long regexp? I'd be inclined to favor the former, which also has an explanatory comment that describes which index corresponds to which part of the tag.

> For the test, gathering feedback from whimboo. I'm not sure if we have a
> strategy for running that phase of our tests against non-en-US builds?

The same goes for the tests: They already exist in the codebase from the aforementioned bugs, but I imagine that a more comprehensive test suite will be needed to test the entire object. (Those tests are only for getDisplayName.)

Running the tests on localized builds would obviously be a great improvement on the situation, but I'm not sure that it's within the scope of this particular bug.

> The code could use some more structure and comments, I have a hard time
> navigating it. Maybe slice it up in one object that implements the API and
> is exported and shows the general structure of what the code is doing, and
> one that's the workhorse?

Could you elaborate on what you mean here? Do you mean having the internal parser functions in a separate object that is not export? I suppose that makes more sense.

Beyond the internal parser functions, there are the following:
* getPreferredValue: Translates deprecated language (sub)tags in favor of their Preferred-Value.
* getRelatedTags: This is intended to implement the functionality requested in comment 0, using additional information that is not available in the language tag to implement more reliable fallback mechanisms. (It doesn't currently show anything beyond the 'en' fallback for 'en-US'.)
* parseTag: Parses a given tag into a uniform object that can be manipulated.
* makeTag: Collapses a parsed tag object into a language tag string. (Not sure if this is strictly necessary. I don't currently use it anywhere.)
* lookup: This uses the Lookup algorithm to perform the language tag fallback.
* parseAcceptLanguage: This parses an Accept-Language header into a priority list that can be used by the rest of the code. (It might be possible to move this somewhere else, or perhaps use existing code? I don't know if we already have corresponding code someplace else.)
* getDisplayName: The translates a language tag into a display name that can be used in the UI. This is the exact same code as introduced in bug 730209 (and bug 741842) for use in the spellchecker dictionary list. (The idea is, that code would eventually be updated to make use of the new centralized BCP47 module.)

I hope that helps explain things a little better. And thanks for getting to this, Axel. I greatly appreciate it.
(In reply to Gordon P. Hemsley [:gphemsley] from comment #17)
> (In reply to Axel Hecht [:Pike] from comment #16)
> > Comment on attachment 631019 [details] [diff] [review]
> > Add BCP47 module (WIP v1)
> > 
> > Review of attachment 631019 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > I think there are a host of js comments here. Probably best to get some
> > input from toolkit hackers.
> 
> I'm sure there are, but my primary concern right now is just nailing down
> the feature set: What do we need this module to do? What don't we need it to
> do?

I can't tell, tbh. Start small with a good design, and it'll be easy to add stuff that we'll need later on.

> > Things I noticed:
> > 
> > BCP47('en-US') vs new BCP47('en-US') can probably be handled. I also found
> > the BCP47.BCP47 thing surprising.
> 
> I've since consolidated the object definition into a simple 'let BCP47 ='.
> As for BCP47.BCP47(), we could probably do without it. All it does right now
> is call BCP47.parseTag(), anyway.

I guess something like 

function BCP47(tag) {
  if (this isinstance BCP47) {
    return this.parse(tag);
  }
  else {
    return new BPC47(tag);
  }
}

or so wallpapers successfully against new vs not.

> > The string bundles should probably be in a separate wrapper, and use getters
> > to get the bundles late, and if needed.
> >
> > The regular expressions are nightmare-ish. I wouldn't know when there's a ()
> > and a (:?) and what. I wonder if the parser can be written cleaner by using
> > only chunks, and offsets. Might be faster, even, and easier to maintain.
> 
> The contents of getDisplayName are copied directly from the code for bug
> 730209 (and bug 741842). Much of that is now duplicated in parseTag, so I
> intend to consolidate them eventually. (For example, there is no need to
> have two different regexps—in two different formats—doing the same thing in
> the same object.)
> 
> With that being said, which method do you prefer: The one in parseTag, where
> it's all chunked out? Or the one in getDisplayName, where it's a single long
> regexp? I'd be inclined to favor the former, which also has an explanatory
> comment that describes which index corresponds to which part of the tag.

I think neither. Constructing the big regexp is better than not, but we should write a human-readable parser instead. You can use the sticky flag 'y' (https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/RegExp) and set lastIndex accordingly to what you've already parsed. That should be about as fast or faster, depending on whether you brace yourself into regexp-hell.

> > For the test, gathering feedback from whimboo. I'm not sure if we have a
> > strategy for running that phase of our tests against non-en-US builds?
> 
> The same goes for the tests: They already exist in the codebase from the
> aforementioned bugs, but I imagine that a more comprehensive test suite will
> be needed to test the entire object. (Those tests are only for
> getDisplayName.)
> 
> Running the tests on localized builds would obviously be a great improvement
> on the situation, but I'm not sure that it's within the scope of this
> particular bug.

Yeah, I know it's not new, but that doesn't mean that we need to add, if we're facing a problem here.

> > The code could use some more structure and comments, I have a hard time
> > navigating it. Maybe slice it up in one object that implements the API and
> > is exported and shows the general structure of what the code is doing, and
> > one that's the workhorse?
> 
> Could you elaborate on what you mean here? Do you mean having the internal
> parser functions in a separate object that is not export? I suppose that
> makes more sense.

Yeah.

> Beyond the internal parser functions, there are the following:
> * getPreferredValue: Translates deprecated language (sub)tags in favor of
> their Preferred-Value.
> * getRelatedTags: This is intended to implement the functionality requested
> in comment 0, using additional information that is not available in the
> language tag to implement more reliable fallback mechanisms. (It doesn't
> currently show anything beyond the 'en' fallback for 'en-US'.)
> * parseTag: Parses a given tag into a uniform object that can be manipulated.
> * makeTag: Collapses a parsed tag object into a language tag string. (Not
> sure if this is strictly necessary. I don't currently use it anywhere.)
> * lookup: This uses the Lookup algorithm to perform the language tag
> fallback.
> * parseAcceptLanguage: This parses an Accept-Language header into a priority
> list that can be used by the rest of the code. (It might be possible to move
> this somewhere else, or perhaps use existing code? I don't know if we
> already have corresponding code someplace else.)
> * getDisplayName: The translates a language tag into a display name that can
> be used in the UI. This is the exact same code as introduced in bug 730209
> (and bug 741842) for use in the spellchecker dictionary list. (The idea is,
> that code would eventually be updated to make use of the new centralized
> BCP47 module.)

As I said before, we can expand the API as needed if we get the basic algorithms right.

> I hope that helps explain things a little better. And thanks for getting to
> this, Axel. I greatly appreciate it.

Sorry for the lag.
I've made an attempt to address some of your concerns, but I haven't addressed them all. (In particular, I haven't changed the regexp to use the sticky flag, and I haven't used getters to get the string bundles.)

I have, however, made the code a little more cohesive. The getDisplayName() code is now integrated into the BCP47 module, and I have split out the language tag parsing code into a separate BCP47Parser module.

Axel, Simon: What do you think?
Attachment #631019 - Attachment is obsolete: true
Attachment #631019 - Flags: feedback?(smontagu)
Attachment #631019 - Flags: feedback?(hskupin)
Attachment #633818 - Flags: feedback?(smontagu)
Attachment #633818 - Flags: feedback?(l10n)
Attachment #633818 - Flags: feedback?(hskupin)
Comment on attachment 633818 [details] [diff] [review]
Add BCP47 module (WIP v2)

I haven't checked the whole code but only the test, so please excuse silly questions. 

>+++ b/intl/locale/tests/unit/test_BCP47.js
>+Components.utils.import("resource://gre/modules/BCP47.jsm");
>+let bcp47 = new BCP47();

It is necessary to have a class for which has to be initialized each time or would a simple object suffice?

>+  testLookup: function() {
>+    var priorityList = [];
>+    var availableList = [ 'en-GB', 'en-US', 'en-ZA', 'es-AR', 'es-CL', 'es-ES', 'es-MX', 'ga-IE', 'zh-Hant', 'zh-Hans' ];
>+
>+    // Empty priority list returns default language tag.
>+    do_check_eq( bcp47.lookup( [], availableList ), 'en-US' );
>+
>+    // The first entry in the priority list is in the available list, so return that.
>+    do_check_eq( bcp47.lookup( [ 'en-US', 'en-CA', 'en' ], availableList ), 'en-US' );
>+
>+    // Use user's stated preferences over any lookup matching.
>+    do_check_eq( bcp47.lookup( [ 'zh-Hant-CN', 'en-US' ], availableList ), 'en-US' );

You probably want to use the pre-declared priorityList variable here and loop over tests.

>+  testGetDisplayName: function() {
>+    // Check non-well-formed language tag.
>+    do_check_eq(bcp47.getDisplayName("-invalid-"), "-invalid-");

I would also check wrong capitalization e.g. 'en-Us'. 

>+    // Check invalid but well-formed language subtag and invalid but well-formed script subtag.
>+    do_check_eq(bcp47.getDisplayName("qaz-Qaaz"), "qaz / Qaaz");
>+    do_check_eq(bcp47.getDisplayName("qaz-Qaaz-fonipa"), "qaz / Qaaz (fonipa)");
>+    do_check_eq(bcp47.getDisplayName("qaz-Qaaz-qxqaaaaz"), "qaz / Qaaz (qxqaaaaz)");
>+    do_check_eq(bcp47.getDisplayName("qaz-Qaaz-US"), "qaz (United States) / Qaaz");
>+    do_check_eq(bcp47.getDisplayName("qaz-Qaaz-US-fonipa"), "qaz (United States) / Qaaz (fonipa)");
>+    do_check_eq(bcp47.getDisplayName("qaz-Qaaz-US-qxqaaaaz"), "qaz (United States) / Qaaz (qxqaaaaz)");
>+    do_check_eq(bcp47.getDisplayName("qaz-Qaaz-QZ"), "qaz (QZ) / Qaaz");
>+    do_check_eq(bcp47.getDisplayName("qaz-Qaaz-QZ-fonipa"), "qaz (QZ) / Qaaz (fonipa)");
>+    do_check_eq(bcp47.getDisplayName("qaz-Qaaz-QZ-qxqaaaaz"), "qaz (QZ) / Qaaz (qxqaaaaz)");

What about moving all those test data into an array you loop over in the test method? It will make the lines shorter.

>+[test_BCP47.js]
>\ No newline at end of file

Please add the requested final line break.
Attachment #633818 - Flags: feedback?(hskupin) → feedback+
FYI, I've implemented Section 6.2 and Section 9 (excluding Best Fit methods) in pure JavaScript here:
https://github.com/GPHemsley/BCP47/blob/e446989ae909bb2fcb3abfafb1f2f1f34c340fac/bcp47.js

I'll have to port it over to m-c at a later point, but if you have feedback now, that'd be great.
For comparison, here's my prototype of the Intl module:
http://lindenbergsoftware.com/ecmascript/Intl.js

I'm currently porting this into SpiderMonkey under bug 769872, although the current version still has to make some serious concessions to limitations in the self-hosted JavaScript environment.

You can find test cases in the conformance test suite at
http://hg.ecmascript.org/tests/test262/file/21bf9dc75c14/test/suite/intl402
Comment on attachment 633818 [details] [diff] [review]
Add BCP47 module (WIP v2)

Good improvement.

I guess the string bundles can be removed from the api, and possibly be getters to only load them if required.

Re the i18n api, I love that we're seeing that making progress, I wonder if there's stuff to refactor in this patch to not depend on the i18n api landing on the one hand, and not making it impossible to use it once it does.
Attachment #633818 - Flags: feedback?(l10n) → feedback+
(In reply to Axel Hecht [:Pike] from comment #24)
> Re the i18n api, I love that we're seeing that making progress, I wonder if
> there's stuff to refactor in this patch to not depend on the i18n api
> landing on the one hand, and not making it impossible to use it once it does.

The way I was thinking about it was that this module would create a backend that would feed the Intl object and could also be used outside of it. (Not sure what Norbert thinks of that.)

In the code I mentioned in comment 22, I use similar code as the most recent patch to feed the Intl methods IsStructurallyValidLanguageTag() and CanonicalizeLanguageTag() (Section 6.2.2 and 6.2.3). I figure we should leave the BCP47 parsing to a separate module like this so that we can use it for things that don't involve Intl.

One thing that we need which Intl doesn't provide, for example, is the ability to get a display name from a language tag. We need this for things like the Language Preferences dialog and the spellchecker list.

Axel, since the primary motivation for addressing this bug (as I understand it) is to facilitate the creation of l10n teams with more complex locale codes, what particular support do we need in order to do that?
My main idea was to use Intl.foo compatible APIs, if feasible, so that we can easily port. Of course, you and Norbert need to sort out who's chicken and who's egg.
I have to admit, I haven't thought about sharing code used for the Intl module across Mozilla yet - my current mission is to bring up some reasonably complete implementation of the API in order to verify that it's implementable, better before TC 39 approves the standard in two weeks.

There are some constraints:

- The negotiation part of the Internationalization API is pretty tightly specified, while I noticed more leniency in Gordon's implementation - e.g. Gordon't BestAvailableLocale handles language tags that aren't well formed or not canonicalized; the spec version simply doesn't allow them. There may be good reasons for this leniency in the context where you're using the code, but it would have to be implemented in a wrapper around the Intl version so as to not affect standards conformance.

- My understanding is that SpiderMonkey is used in projects outside of Mozilla, so it can't have dependencies on other parts of Firefox.

One way to handle this would be to expose a language negotiation API as an extension to Intl, which then other Mozilla code can use. But maybe you have better ideas?
Depends on: 769872
With LocaleService::negotiateLanguages and mozIntl extending our Intl API, I think we now have all the capabilities that were asked for here.

We're also moving away from Chrome Registry for l10n, so I hope I can resolve this bug as WONTFIX.
Please, reopen if needed.
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.