Allow fuller use of the CLD2 API from LanguageDetector.detectLanguage

RESOLVED FIXED in Firefox 47

Status

()

Firefox
Translation
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: kmag, Assigned: kmag)

Tracking

unspecified
Firefox 47
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox47 fixed)

Details

MozReview Requests

()

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(1 attachment)

There are two language detection methods in the WebExtensions API which require us to return more information about the detected languages than the LanguageDetector API currently provides. One of them also requires us to be able to pass in more information about the page that we're detecting the language of.

Since these APIs come from Chromium, they're already designed to work directly with CLD2. We just need to expose more of the interface.

In particular, we need to be able to return the top three detected languages, and their percentages, based on the analysis of a plain text string. We also need to be able to return the primary language of a given tab, for which we need to pass in content-language, TLD, and charset hints.
Created attachment 8719625 [details]
MozReview Request: Bug 1248501 - Allow fuller use of the CLD2 API from LanguageDetector.detectLanguage. r?florian r=azakai

Review commit: https://reviewboard.mozilla.org/r/35041/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/35041/
Attachment #8719625 - Flags: review?(florian)
Comment on attachment 8719625 [details]
MozReview Request: Bug 1248501 - Allow fuller use of the CLD2 API from LanguageDetector.detectLanguage. r?florian r=azakai

https://reviewboard.mozilla.org/r/35041/#review31651

Is there documentation for the API you need to implement?

This looks generally fine, but I'm not fully comfortable reviewing the .idl file and the C++ code, so I would prefer if you could ask someone else to have a second look at these.

::: browser/components/translation/cld2/Makefile:21
(Diff revision 1)
> +# worker.

Should we somehow release the worker after several language detections, or after several minutes of inactivity to mitigate the consequences of this?

::: browser/components/translation/cld2/Makefile:71
(Diff revision 1)
> -	$(CC) $(FLAGS) -I. -o cld-worker.js $^ --post-js post.js -s EXPORTED_FUNCTIONS="['_detectLangCode', '_lastResultReliable']"
> +	$(EMCC) $(FLAGS) -I. -o cld-worker.js $(OBJECTS) --post-js cld.js --post-js post.js

it's not obvious to me what generates cld.js. Is it a side-effect of the webidl compiler?

::: browser/components/translation/cld2/cld.idl:6
(Diff revision 1)
> +  [Const] DOMString getName();

What's the use case for returning the language name from CLD? This code currently returns only the language code, because a non-localized language name is rarely useful, and it's not too difficult to get a localized language name based on the language code.

::: browser/components/translation/cld2/cld.idl:20
(Diff revision 1)
> +                                     DOMString? tldHint, long encodingHint,

Why is encodingHint not optional? Why do we have 2 different versions of the detectLanguage method when it seems they are the same with more optional parameters on the second version?

::: browser/components/translation/cld2/cld.idl:23
(Diff revision 1)
> +  boolean getIsReliable();

I'm not really familiar with this idl variant, but it seems a lot of the methods with a name starting with "get" here are things that I would usually define as readonly attributes in xpidl.

::: browser/components/translation/cld2/post.js:6
(Diff revision 1)
> +// on the constructor, which isn't very practical.

Why is this not practical?

::: browser/components/translation/cld2/post.js:161
(Diff revision 1)
> +      'languages': new Array(3).fill(0).map((_, index) => {

Does it make sense to return a fixed-size array here even when CLD only detected one language?

::: browser/components/translation/test/unit/test_cld2.js:395
(Diff revision 1)
> +      let pair = Array.isArray(item[0]) ? item[i] : item;

I find this test hard to read/confusing. I think redefining the "pair" variable with a similar but slightly different line of code really doesn't help.
Attachment #8719625 - Flags: review?(florian)
https://reviewboard.mozilla.org/r/35041/#review31651

> Should we somehow release the worker after several language detections, or after several minutes of inactivity to mitigate the consequences of this?

I was thinking about that too, yeah. It would probably be easy enough to destroy it after some inactivity delay, after it analyzes a particularly large string.

> it's not obvious to me what generates cld.js. Is it a side-effect of the webidl compiler?

Yeah, it generates both cld.js and cld.cpp. There's not really a way to specify that in the Makefile without having it run the IDL compiler twice, but I suppose I can add a comment.

> What's the use case for returning the language name from CLD? This code currently returns only the language code, because a non-localized language name is rarely useful, and it's not too difficult to get a localized language name based on the language code.

I don't have a use for it at the moment, but it's readily available and easy enough to return, so I figured I may as well, in case someone else had a use for it.

> Why is encodingHint not optional? Why do we have 2 different versions of the detectLanguage method when it seems they are the same with more optional parameters on the second version?

The WebIDL binder is pretty crude. It only supports overloading for explicit prototypes. The ? just means that the value is nullable. Though, if it comes to it, it accepts null even without the ?. I added it mainly for my comfort.

> I'm not really familiar with this idl variant, but it seems a lot of the methods with a name starting with "get" here are things that I would usually define as readonly attributes in xpidl.

I initially wrote them as attributes, but this binding generator doesn't support attributes very well. It implements `readonly attribute boolean isReliable` by adding a `get_isReliable` method that returns the value of the `isReliable` member variable. I found that confusing, so I decided to make it explicit.

> Why is this not practical?

I suppose it's not so much impractical as clumsy. Calling methods directly on the prototype of a constructor is not something you see (or should see) very often in JavaScript. I'd rather write `LanguageInfo.detectLanguage()` than `LanguageInfo.prototype.detectLanguage()`.

> Does it make sense to return a fixed-size array here even when CLD only detected one language?

Probably not. I went with that approach to match the behavior of the underlying API, but filtering out the unknown/0% elements should be fine too.

> I find this test hard to read/confusing. I think redefining the "pair" variable with a similar but slightly different line of code really doesn't help.

Yeah, I wasn't really happy with it either. It was mostly a last minute refactor to deal with one outlier. I'll try to come up with something better.
(In reply to Kris Maglione [:kmag] from comment #3)

> > What's the use case for returning the language name from CLD? This code currently returns only the language code, because a non-localized language name is rarely useful, and it's not too difficult to get a localized language name based on the language code.
> 
> I don't have a use for it at the moment, but it's readily available and easy
> enough to return, so I figured I may as well, in case someone else had a use
> for it.

I would prefer that we don't return it if it's not used and we don't want to encourage usage of it.

It's relatively easy to get a localized name based on the lang code:
               Cc["@mozilla.org/intl/stringbundle;1"]
                 .getService(Ci.nsIStringBundleService)
                 .createBundle("chrome://global/locale/languageNames.properties")
                 .GetStringFromName(code);
(In reply to Florian Quèze [:florian] [:flo] from comment #4)
> I would prefer that we don't return it if it's not used and we don't want to
> encourage usage of it.
> 
> It's relatively easy to get a localized name based on the lang code:
>                Cc["@mozilla.org/intl/stringbundle;1"]
>                  .getService(Ci.nsIStringBundleService)
>                 
> .createBundle("chrome://global/locale/languageNames.properties")
>                  .GetStringFromName(code);

OK. That's fine with me.
Attachment #8719625 - Attachment description: MozReview Request: Bug 1248501 - Allow fuller use of the CLD2 API from LanguageDetector.detectLanguage. r?florian → MozReview Request: Bug 1248501 - Allow fuller use of the CLD2 API from LanguageDetector.detectLanguage. r?florian r?azakai
Attachment #8719625 - Flags: review?(florian)
Attachment #8719625 - Flags: review?(azakai)
Comment on attachment 8719625 [details]
MozReview Request: Bug 1248501 - Allow fuller use of the CLD2 API from LanguageDetector.detectLanguage. r?florian r=azakai

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/35041/diff/1-2/
I think this addresses all of your comments. I removed the language name and useless results, added an idle timeout to destroy the worker after processing large strings, added some comments to the IDL, and cleaned up the tests a bit.

azakai, you reviewed the original version of this module. Would you mind reviewing the binding portions of this one?

Comment 8

2 years ago
Comment on attachment 8719625 [details]
MozReview Request: Bug 1248501 - Allow fuller use of the CLD2 API from LanguageDetector.detectLanguage. r?florian r=azakai

https://reviewboard.mozilla.org/r/35041/#review31765

::: browser/components/translation/LanguageDetector.jsm:12
(Diff revision 2)
> -XPCOMUtils.defineLazyModuleGetter(this, "Promise",
> +// Since Emscripten can handle heap growth, but not heap shrinkage, we

Would it be useful if emscripten did support memory shrinking?
Attachment #8719625 - Flags: review?(azakai)

Comment 9

2 years ago
Emscripten parts look ok to me.

(Sorry, first time I see this review tool, not sure if I'm doing this right.)
https://reviewboard.mozilla.org/r/35041/#review31765

> Would it be useful if emscripten did support memory shrinking?

I'm honestly not sure. In our case, it's easy enough to just restart the worker when it gets to big. In the general case, we'd probably wind up with heap fragmentation that would prevent it from shrinking properly anyway, so it probably wouldn't help much without some kind of segment-based allocator.

The SPLIT_MEMORY setting looks like it's close to the right thing, but this code doesn't compile with it, and I'm not sure how well it would deal with allocations larger than the segment size.
(In reply to Alon Zakai (:azakai) from comment #9)
> Emscripten parts look ok to me.
> 
> (Sorry, first time I see this review tool, not sure if I'm doing this right.)

I'm not sure it's possible to do mozreview right, to be honest. It has its upsides, but no-one ever seems to really get the hang of it.

Thanks.
Attachment #8719625 - Flags: review?(florian)
Comment on attachment 8719625 [details]
MozReview Request: Bug 1248501 - Allow fuller use of the CLD2 API from LanguageDetector.detectLanguage. r?florian r=azakai

https://reviewboard.mozilla.org/r/35041/#review32121

::: browser/components/translation/LanguageDetector.jsm:15
(Diff revision 2)
> +// These values define the cut-off string length and the idle timeout

Could you please rephrase this comment a little bit? I had to read the code to figure out if this meant:
A. the worker will be destroyed right after processing a large string, and after an idle period (for any size of processed string).
B. the worker will be destroyed some time after processing a large string.

::: browser/components/translation/LanguageDetector.jsm:17
(Diff revision 2)
> +var LARGE_STRING = 1.5 * 1014 * 1024;

Where is this "1014" magic constant coming from? And the 1.5?

::: browser/components/translation/LanguageDetector.jsm:33
(Diff revision 2)
> +      // string.

I find this code difficult to understand, but can't find any obvious simplification to suggest. Maybe word the comment to make it unambiguous that this will run after the worker has already replied?

::: browser/components/translation/LanguageDetector.jsm:118
(Diff revision 2)
> +   *  - 'bestGuess' An object describing the best guess at the language.

Is 'bestGuess' useful? It doesn't provide more than the existing 'language' field, right?

::: browser/components/translation/cld2/cldapp.cc:41
(Diff revision 2)
> +    CLD2::Language languages[3] = {};

Should this "3" number that appears 5 times in this file be a constant?

::: browser/components/translation/test/unit/test_cld2.js:441
(Diff revision 2)
> +  yield LanguageDetector.detectLanguage(test_string);

I would suggest checking the result here too, just to be extra sure setting up the timer the first time isn't interferring with returning the result.

::: browser/components/translation/test/unit/test_cld2.js:444
(Diff revision 2)
> +  ok(detectorModule.workerManager._idleTimeout != null,

nit: ok(foo != null, "message") can be simplified to ok(foo, "message"). And equal(foo, null, "message") can be simplified to ok(!foo, "message").
https://reviewboard.mozilla.org/r/35041/#review32121

> Where is this "1014" magic constant coming from? And the 1.5?

Oops. Should have been 1024

> Is 'bestGuess' useful? It doesn't provide more than the existing 'language' field, right?

Not anymore, no. I decided to keep it for consistency with the `languages` array, and in case we want to add more details later, but I'm not really attached to it.

> Should this "3" number that appears 5 times in this file be a constant?

I guess. I was trying to be consistent with the CLD headers, which always use a literal `3`, but a constant would probably be better.

> I would suggest checking the result here too, just to be extra sure setting up the timer the first time isn't interferring with returning the result.

Good point

> nit: ok(foo != null, "message") can be simplified to ok(foo, "message"). And equal(foo, null, "message") can be simplified to ok(!foo, "message").

I wanted to guard against the possibility of `setTimeout` ever returning `0` as an ID. It's not likely, but explicitly testing against `null` isn't much more complicated than testing it as a boolean.
(In reply to Kris Maglione [:kmag] from comment #13)

> > Is 'bestGuess' useful? It doesn't provide more than the existing 'language' field, right?
> 
> Not anymore, no. I decided to keep it for consistency with the `languages`
> array, and in case we want to add more details later, but I'm not really
> attached to it.

Let's simplify then. Removing it also lets us merge the Language and LanguageGuess interfaces in the idl definitions.
Attachment #8719625 - Attachment description: MozReview Request: Bug 1248501 - Allow fuller use of the CLD2 API from LanguageDetector.detectLanguage. r?florian r?azakai → MozReview Request: Bug 1248501 - Allow fuller use of the CLD2 API from LanguageDetector.detectLanguage. r?florian r=azakai
Attachment #8719625 - Flags: review?(florian)
Attachment #8719625 - Flags: review?(azakai)
Comment on attachment 8719625 [details]
MozReview Request: Bug 1248501 - Allow fuller use of the CLD2 API from LanguageDetector.detectLanguage. r?florian r=azakai

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/35041/diff/2-3/
Comment on attachment 8719625 [details]
MozReview Request: Bug 1248501 - Allow fuller use of the CLD2 API from LanguageDetector.detectLanguage. r?florian r=azakai

https://reviewboard.mozilla.org/r/35041/#review32269

Looks good to me.
Attachment #8719625 - Flags: review?(florian) → review+
Comment on attachment 8719625 [details]
MozReview Request: Bug 1248501 - Allow fuller use of the CLD2 API from LanguageDetector.detectLanguage. r?florian r=azakai

https://reviewboard.mozilla.org/r/35041/#review32331

Looks good to me.
Attachment #8719625 - Flags: review?(azakai)
Thanks
https://hg.mozilla.org/integration/fx-team/rev/77cd9b2e6c8c168de7810a8a0f9c6dfa8ab98caf
Bug 1248501 - Allow fuller use of the CLD2 API from LanguageDetector.detectLanguage. r=florian r=azakai

Comment 20

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/77cd9b2e6c8c
Status: NEW → RESOLVED
Last Resolved: 2 years ago
status-firefox47: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → Firefox 47
You need to log in before you can comment on or make changes to this bug.