Bug 971047 Opened 6 years ago Closed 6 years ago

Import the CLD language detection library

Categories

(Firefox :: Translation, defect)

Tracking

VERIFIED FIXED
Firefox 30

People

(Reporter: Felipe, Assigned: florian)

References

Details

(Whiteboard: [translation] p=8 s=it-30c-29a-28b.3 [qa-])

Attachments

(2 files, 4 obsolete files)

Chromium has a language detection library called CLD (Compact Language Detector), which we are planning to use for our translation pilot.
CLD v1 is BSD-licensed and v2 is Apache-licensed.

It is a C++ library that we want to compile to asm.js with Emscripten. To do that we need to write some wrapper code to serve as a call entry point to:
  - pass the correct inparams and receive the outparams
  - convert our 16-bit char strings to UTF-8, the format the library uses

There's already an existing Emscripten-compiled project called cld.js (BSD-licensed) which has some of the wrapper code needed, but it uses CLD v1 and ideally we want v2.
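For reference, the JS-facing API this converges on looks roughly like this (a sketch based on the discussion further down; the module path follows the layout proposed later in this bug):

Components.utils.import("resource:///modules/translation/LanguageDetector.jsm");

LanguageDetector.detectLanguage(pageText).then(result => {
  // result.language: code of the detected language, e.g. "en" or "pt"
  // result.confident: whether the detector is confident in the result
});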
Blocks: 971048
Whiteboard: p=0
(In reply to :Felipe Gomes from comment #0)

> It is a C++ library that we want to compile to asm.js with Emscripten.

Out of curiosity, why is this solution preferred to compiling the library as a binary, and either wrapping it with an XPCOM component or opening it with jsctypes?
We should evaluate both options. Avoiding the security/stability downsides of C++ would be nice if the performance is acceptable.
(In reply to Florian Quèze [:florian] [:flo] from comment #1)
> (In reply to :Felipe Gomes from comment #0)
> 
> > It is a C++ library that we want to compile to asm.js with Emscripten.
> 
> Out of curiosity, why is this solution preferred to compiling the library as
> a binary, and either wrapping it with an XPCOM component or opening it with
> jsctypes?

Yeah, the security advantage of importing an external library as JS is very compelling, and it also makes it easy to run the library code in a separate thread with a Worker.

I started looking from the Emscripten angle because at first we thought we'd ship it through an add-on, and not having binary components would make that much easier. That is no longer the case, but these other advantages made the idea stick.
Some more details: The existing project that I mentioned is this one:
https://github.com/jaukia/cld-js/

Besides the basic compilation from the library code, we need to be able to pass it the new params and receive all the return values, which are the language detected, a boolean "confident" mark, and a numeric confidence score.

There's also the matter of converting the JS strings to UTF-8, which cld.js does here: https://github.com/jaukia/cld-js/blob/master/src/assets/pre.js, but hopefully we have something in the tree that does that?
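(For the record, a minimal sketch of the in-tree option: TextEncoder is available in workers, and it's what later patches in this bug end up using.)

var utf8Array = new TextEncoder().encode(jsString); // Uint8Array of UTF-8 bytes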

Also, the return value we get now is the English name of the detected language (e.g., "English", "Portuguese", etc.), but it would be better if we got the language code instead ("en", "pt"). I haven't checked yet whether this conversion is done by cld.js or whether the actual library returns the name as the result.
The existing project looks over 2 years old, and since it sounds like you want a major upgrade of the project, it might be simplest to port from scratch and not base off of cld-js.

The actual porting is fairly simple; the harder part is learning the library, its API, and so forth (which you need to do anyhow, whether native or JS-compiled). I can help with the Emscripten porting part once you have the general stuff worked out - that is, once you have a Makefile that works natively and generates a binary that uses a C API you created for the purposes you want here. When the next task is to port that Makefile so it works in JS against the same C API, I could do that part.
Blocks: 973275
Whiteboard: p=0 → [translation] p=0
Assignee: nobody → florian
Status: NEW → ASSIGNED
Whiteboard: [translation] p=0 → [translation] p=8 s=it-30c-29a-28b.3
Whiteboard: [translation] p=8 s=it-30c-29a-28b.3 → [translation] p=8 s=it-30c-29a-28b.3 [qa-]
Attached file langdet-wip1.patch.gz (obsolete) —
Work in progress. Unfortunately, the Emscripten-compiled cld files weigh 5.4MB.

There's also a problem with some of the tests failing by returning the value that was expected for the previous test. I suspect a buffer somewhere isn't cleaned up correctly. I would expect a call to detectLanguage to not be affected by whatever was done in the previous call.

This attachment is gzipped because of its size. It contains a patch that adds a "langdet" folder in browser/components/

This folder contains a JS module LanguageDetector.jsm, the worker cld-worker.js (5.4MB), CLD2's source code with the script I used to compile it (src/build.sh), and the CLD2 unit tests that I converted to an xpcshell test.
Note that about 1/3 of the size of that patch file is the memory initializer, which you can turn into a separate binary file with --memory-init-file 1 (the 1/3 figure is for the entire patch file; I didn't measure what fraction it is of the emscripten output file, but that would be larger).

You might also be able to reduce this in a few other ways:

* Build source files with -Os or -Oz to optimize for size
* Build bitcode to JS with -O3, which runs the expensive variable reuse pass
* Use LLVM LTO during bitcode to JS: -s INLINING_LIMIT=1 --llvm-lto 1 (that also disables inlining, as otherwise code tends to increase in size)
* Use closure on the outside non-asm.js code: --closure 1 (can break some code)
(In reply to Florian Quèze [:florian] [:flo] from comment #6)
> Created attachment 8386262 [details]
> langdet-wip1.patch.gz
> 
> Work in progress. Unfortunately, the emscripten compiled cld files weights
> 5.4M.
> 

I wonder how much of this is from CLD2 being just bigger than CLD1, and how much is due to the work that already went into reducing the file size for jaukia's cld.js.

> There's also a problem with some of the tests failing by returning the value
> that was expected for the previous test. I suspect a buffer somewhere isn't
> cleaned up correctly. I would expect a call to detectLanguage to not be
> affected by whatever was done in the previous call.

I saw the same thing happening in my tests using the cld.js from GitHub. It is a bit concerning; I didn't look further into it.

> 
> This attachment is gzipped because of its size. It contains a patch that
> adds a "langdet" folder in browser/components/
> 
> This folder contains a JS module LanguageDetector.jsm, the worker
> cld-worker.js (5.4MB), CLD2's source code with the script I used to compile
> it (src/build.sh), and the CLD2 unit tests that I converted to an xpcshell
> test.

FWIW, in bug 971054 I am proposing a structure of:
 - browser/modules/translation/   ->  resource:///modules/translation/
Where most of the .jsm related to translation could go

and a subfolder (and submodule folder) for specific engines:
 - browser/modules/translation/engines/ -> resource:///modules/translation/engines/  where we can add the modules related to each engine, and perhaps cld-worker.js too
(In reply to Alon Zakai (:azakai) from comment #7)
> Note that about 1/3 of the size of that patch file is the memory
> initializer, which you can turn into a separate binary file with   
> --memory-init-file 1         (the 1/3 figure is of the entire patch file, I
> didn't measure how much it is of the emscripten output file, but that would
> be larger).

Would we then need to package 2 different files? Would the result be smaller?

> You might also be able to reduce this in a few other ways:
> 
> * Build source files with -Os or -Oz to optimize for size

I will try this.

> * Build bitcode to JS with -O3 which runs the expensive variable reuse pass

The line I used to compile is:
emcc -s USE_TYPED_ARRAYS=2 -O3 --closure 0 -DCLD_WINDOWS -I. -o ../cld-worker.js cldapp.cc $SOURCES --pre-js pre.js --post-js post.js -s EXPORTED_FUNCTIONS="['_detectLanguageRaw']"

Do you mean I should build the object files individually with different flags than the final link?

> * Use llvm LTO during bitcode to JS     -s INLINING_LIMIT=1 --llvm-lto 1    
> (that also disables inlining, as otherwise code tends to increase in size)

I will try.

> * Use closure on the outside non-asm.js code         --closure 1        (can
> break some code)

I tried it and the resulting code didn't work.
Yes, you would then have 2 files: one with JS, and one with binary data to initialize memory. Having the binary one is almost always smaller overall, since the text representation of numbers is far less efficient than binary.

You can specify different options during compilation of sources and compilation to JS. If you compile from cc to js as in that command, then -O3 is used for both. If you want to separate them, you could do

emcc cldapp.cc -o cldapp.o -Oz # C opts for size
emcc cldapp.o -o cldapp.js -O3 # Maximal JS opts

Regarding closure, it might not be worth the effort, as it only reduces the surrounding code - maybe 100K or so total difference.
Using -Os instead of -O3 for the .o files saves 14kB.
-Oz isn't recognized by the compiler ("Exception: Invalid optimization level: -Oz")

Adding -s INLINING_LIMIT=1 --llvm-lto 1 saves another 43kB.

As expected, "--memory-init-file 1" offers a much larger size win, and the new sizes of the files are:
715K  cld-worker.js
1.5M  cld-worker.js.mem
Unfortunately, this is at the cost of introducing a different bug: the first time the library is used, the return value is an empty string. The following calls give the same result as without the separate memory init file.
Oh, sorry, -Oz is only on emscripten's incoming branch, which requires the new backend (https://github.com/kripken/emscripten/wiki/LLVM-Backend). That takes some setting up, so if you haven't done so already, it might not be worth doing right now. But both -Os and -Oz are much more effective there, so it might be worth trying at some point.

The problem of not working the first time is likely because the memory init file is loaded asynchronously, and you are calling it before it has arrived. We should actually assert when that happens but I guess we missed a code path - how are you calling into the compiled code?

In any case the solution for that is to wait for a callback telling you when the init file arrived, using addOnPreMain(callback) (the callback is called right before main would be executed, which is after all asynchronous setup has been done).
Another way to do that is to call your callback manually from main():

int main() {
  EM_ASM( myJSCallback() ); // calls a global JS func
  return 0;
}
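For the first option, a minimal sketch of what could go in a --pre-js/--post-js file (the "ready" message name is just an assumption for illustration, not part of the emscripten API):

// Runs after all asynchronous startup work, including the memory init
// file download, has completed.
addOnPreMain(function() {
  postMessage("ready"); // tell the owning thread the worker can accept work
});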
(In reply to Alon Zakai (:azakai) from comment #12)

> In any case the solution for that is to wait for a callback telling you when
> the init file arrived, using addOnPreMain(callback)

This worked, thanks! It took me a bit of time to get it working, because I assumed that messages posted to workers are queued until the 'onmessage' property is set by the worker code. That assumption was wrong: messages are only queued until the end of the synchronous code that runs when the worker starts.
After adding a queue in LanguageDetector.jsm for messages that need to be sent to the cld worker, things worked!
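A minimal sketch of that queuing pattern on the LanguageDetector.jsm side (simplified; pendingStrings matches the patch excerpts quoted later in this bug, and the "ready" message is an assumption for illustration):

let worker = new ChromeWorker("resource:///modules/translation/cld-worker.js");
let workerReady = false;
let pendingStrings = [];

worker.onmessage = function(aMsg) {
  if (!workerReady && aMsg.data == "ready") {
    // The worker's addOnPreMain callback fired: flush the queued strings.
    workerReady = true;
    for (let string of pendingStrings)
      worker.postMessage(string);
    pendingStrings = [];
  }
  // ... otherwise aMsg.data is a detection result ...
};

function detectLanguage(aString) {
  if (workerReady)
    worker.postMessage(aString);
  else
    pendingStrings.push(aString);
}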


I have some more questions:
- the .js.mem file is next to the .js file, and in my current local build they are regular files. Will this still work after |make package| once these files are packed into omni.ja? If not, do you have suggestions for how we could get it to work? (or if you have no idea, some explanations about how the file is read would already be useful.)
- it seems the generated .js code contains a fair amount of code that will never get run (eg. all the things that are conditional on ENVIRONMENT_IS_* variables. The generated code will only run in a worker context.) Is there a compiler flag to specify the target environment?
I actually don't know how omni.ja stuff works. But basically the script will just try to open an xhr to the .mem file. That code could easily be changed to load in a different way if necessary. But if omni supports xhrs then there should be no need.
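Roughly, the generated loader does something like this (a paraphrase, not the actual emscripten output; the target offset name is an assumption):

var xhr = new XMLHttpRequest();
xhr.open('GET', 'cld-worker.js.mem');
xhr.responseType = 'arraybuffer';
xhr.onload = function() {
  // Copy the downloaded bytes into the heap where static data lives.
  HEAPU8.set(new Uint8Array(xhr.response), STATIC_BASE /* assumption */);
};
xhr.send(null);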

There isn't actually that much to strip out based on ENVIRONMENT, and there's no option for that. There is just a lot of general code an emscripten app needs - we ship a whole libc, basically. But in a sizable app like this one, the asm.js part is by far the biggest part anyhow, even minified.

Note that closure as mentioned before can help a lot if the non-asm.js code does matter, we could fix the problems preventing it from working if it's worth the size reduction. Typically it amounts to making sure to do x['y'] instead of x.y so closure is not confused. We could also try closure at a lower level of optimization.
(In reply to Alon Zakai (:azakai) from comment #15)
> I actually don't know how omni.ja stuff works. But basically the script will
> just try to open an xhr to the .mem file. That code could easily be changed
> to load in a different way if necessary. But if omni supports xhrs then
> there should be no need.

omni.ja is mostly a zip file with some specific optimizations. XHR should work. I'll verify tomorrow.

> There isn't actually that much to strip out based on ENVIRONMENT, and no
> option to use for that. There is just a lot of general code an emscripten
> app needs - we ship a whole libc, basically. But in a sizable app like here,
> the asm.js part is by far the biggest anyhow, even minified.

In my generated .js file, the asm.js part is 398kB out of 715kB.

I would be surprised if this library used more than a very tiny portion of libc. Is there any automated way to do dead code elimination? (I don't think reviewing all this code to remove the unused parts would be a good use of anybody's time.)

And by the way, I'm continuing this discussion about size reduction because I'm curious to see how much more we can win here, but I think the size reduction we got with the --memory-init-file trick is already good enough to make this a viable option compared to shipping a native library (for comparison, the native library weighs 1.7MB on Mac x86_64).

> Note that closure as mentioned before can help a lot if the non-asm.js code
> does matter, we could fix the problems preventing it from working if it's
> worth the size reduction. Typically it amounts to making sure to do x['y']
> instead of x.y so closure is not confused. We could also try closure at a
> lower level of optimization.

Is the x['y'] -> x.y thing only something I need to look for in the code in the files included with --pre and --post? Is this documented anywhere?

Thank you so much for all the help here!
> 398kB out of 715kB.

Oh, right - I was thinking about the older numbers. Yes, this matters a lot here :)

The way to get dead code elimination is to use closure, basically, it's very good at it.

The x['y'] thing is a closure-specific issue; see https://developers.google.com/closure/compiler/docs/api-tutorial3 for more details. But really, the point is that your --pre-js and --post-js files need to use the x['y'] form when it is NOT ok to minify 'y'. If you will call Module.myAPI from outside the closure-compiled code, then you need Module['myAPI']; otherwise closure can reduce it to Module.a.

I can take a look at those files if you run into problems here.
Oh, if it's only for things that need to be called from the outside, the changes should be minimal :).
Yes, the way to think about it is: if closure can see all uses of x.prop, then it can safely minify prop to 'a'. But if x.prop is accessed by code closure cannot see, then closure needs to be told not to minify it, and the way to do that is to write x['prop'], which closure only changes to x.prop and no further.
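Concretely (hypothetical property names):

// If closure can see every use of this property, it may minify it:
Module.internalHelper = function() {};  // can become Module.a
// Quoted access is preserved, so outside code can still call it:
Module['myAPI'] = function() {};        // stays accessible as Module.myAPI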
Another question I have is: is there any way to debug the compiled code?
I assume the answer is no, and I'm thinking about building a binary of the library so that I can use gdb to debug the issue mentioned in comment 6 (the library sometimes returning the previous result instead of the new one). But I figured it's worth asking just in case.


Felipe, would you like my next patch to have all the files in browser/modules/translation/cld/ instead of browser/components/langdet?
In a debug build with -g4, emscripten emits source maps, so you can use the browser debugger and see the C/C++ source. But no direct support for variable values etc. In general, native builds are easier to debug if you like debuggers.

But if you like printf debugging instead, then JS builds are superior - in a non-minified build you can modify the build to add arbitrary JS for debugging purposes. I even wrote this: http://mozakai.blogspot.com/2012/06/debugging-javascript-is-awesome.html
(In reply to Florian Quèze [:florian] [:flo] from comment #11)

> 715K  cld-worker.js

With --closure 1, this is down to 476K. Interestingly, I didn't have anything to fix in the JS code for closure to work, so I guess some of the changes I made yesterday for other reasons 'fixed' it.
Attached file langdet-wip2.patch.gz (obsolete) —
I think this now does (most of) what we need.
Attachment #8386262 - Attachment is obsolete: true
Same as attachment 8387642 [details], but without the imported C++ source code, so that I can attach a raw patch for easy bugzilla reviewing.
Attachment #8387647 - Flags: feedback?(felipc)
Attachment #8387647 - Flags: feedback?(azakai)
Comment on attachment 8387647 [details] [diff] [review]
langdet-wip2 without the C++ source code

Review of attachment 8387647 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/components/langdet/src/cldapp.cc
@@ +9,5 @@
> +using namespace CLD2;
> +
> +bool g_is_reliable;
> +
> +const char* detectLangCode(const char* src) {

The API I really wanted here was more:
const char* detectLangCode(const char* src, bool *is_reliable);

This would save us the g_is_reliable ugly global variable and the lastResultReliable function.

::: browser/components/langdet/src/post.js
@@ +8,5 @@
> +    var encoder = new TextEncoder();
> +    encoder['encoding'] = "utf-8";
> +    var utf8Array = encoder['encode'](aMsg.data);
> +    var ptr = Module['_malloc'](utf8Array.length);
> +    new Uint8Array(Module['HEAPU8'].buffer, ptr, utf8Array.length).set(utf8Array);

azakai, is this correct, or do I need to add a null byte at the end of the array myself?

@@ +10,5 @@
> +    var utf8Array = encoder['encode'](aMsg.data);
> +    var ptr = Module['_malloc'](utf8Array.length);
> +    new Uint8Array(Module['HEAPU8'].buffer, ptr, utf8Array.length).set(utf8Array);
> +    var lang = Pointer_stringify(_detectLangCode(ptr));
> +    var confident = !!Module['ccall']("lastResultReliable", "number");

This is because I couldn't find how to pass a pointer to a boolean as a second parameter to detectLangCode.

::: browser/components/langdet/test/test_cld2.js
@@ +1,1 @@
> +// Copyright 2013 Google Inc. All Rights Reserved.

Not sure about the licensing here. Should this file keep the original license? Should I add an MPL2 header at the top? I wrote the last 17 lines of the file (the trivial test code). The rest of the file is a conversion into a JS format of CLD2 unit tests that were in C++. For now I've copied the header that was in the original unit test.
Gerv, can you check that things make sense here from a licensing point of view?

Summary of what I'm doing here: CLD2 is an Apache2 licensed C++ library that is being emscripten compiled into asm.js; we plan to ship it as part of Firefox. The library's code is generally unmodified, but the xpcshell test I'm adding here contains a large part that's directly adapted from a C++ unit test. (See last paragraph of comment 25).
Flags: needinfo?(gerv)
We should check in the C++ even if we also check in the JS version to avoid a build-time emscripten dependency. If it's Apache, no additional work needs to be done in about:license.

If your test is adapted from a C++ unit test which is Apache-licensed, the easiest thing to do is to license your test in the same way, with an Apache header.

Does that answer your questions?

Gerv
Flags: needinfo?(gerv)
(In reply to Gervase Markham [:gerv] from comment #27)
> We should check in the C++ even if we also check in the JS version to avoid
> a build-time emscripten dependency.

This is the plan. Attachment 8387642 [details] is what we would check in; it contains both the JS version and the C++ version (with the Makefile I used).

> If your test is adapted from a C++ unit test which is Apache-licensed, the
> easiest thing to do is to license your test in the same way, with an Apache
> header.
> 
> Does that answer your questions?

Yes, thanks! Looks like I don't have any changes to make.
(In reply to Florian Quèze [:florian] [:flo] from comment #25)
> Comment on attachment 8387647 [details] [diff] [review]
> langdet-wip2 without the C++ source code
> 
> Review of attachment 8387647 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: browser/components/langdet/src/cldapp.cc
> @@ +9,5 @@
> > +using namespace CLD2;
> > +
> > +bool g_is_reliable;
> > +
> > +const char* detectLangCode(const char* src) {
> 
> The API I really wanted here was more:
> const char* detectLangCode(const char* src, bool *is_reliable);
> 
> This would save us the g_is_reliable ugly global variable and the
> lastResultReliable function.
> 
> ::: browser/components/langdet/src/post.js
> @@ +8,5 @@
> > +    var encoder = new TextEncoder();
> > +    encoder['encoding'] = "utf-8";
> > +    var utf8Array = encoder['encode'](aMsg.data);
> > +    var ptr = Module['_malloc'](utf8Array.length);
> > +    new Uint8Array(Module['HEAPU8'].buffer, ptr, utf8Array.length).set(utf8Array);
> 
> azakai, is this correct, or do I need to add a null byte at the end of the
> array myself?

You need to allocate 1 extra byte in the malloc call for the null terminator, and to write it out, yes. Note that there is a utility function for this, writeStringToMemory (see src/preamble.js). Actually, looking at that code, I think your implementation is better...
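Concretely, the fix to the snippet above would be along these lines (sketch):

var strLength = utf8Array.length;
var ptr = Module['_malloc'](strLength + 1);  // +1 for the null terminator
Module['HEAPU8'].set(utf8Array, ptr);        // copy the UTF-8 bytes to the heap
Module['HEAPU8'][ptr + strLength] = 0;       // write the terminating null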

> 
> @@ +10,5 @@
> > +    var utf8Array = encoder['encode'](aMsg.data);
> > +    var ptr = Module['_malloc'](utf8Array.length);
> > +    new Uint8Array(Module['HEAPU8'].buffer, ptr, utf8Array.length).set(utf8Array);
> > +    var lang = Pointer_stringify(_detectLangCode(ptr));
> > +    var confident = !!Module['ccall']("lastResultReliable", "number");
> 
> This is because I couldn't find how to pass a pointer to a boolean as a
> second parameter to detectLangCode.
> 

Pointers are literally just integers, so you can pass them around normally. Then you can read from them using getValue (or directly using the HEAP* views).
(In reply to Alon Zakai (:azakai) from comment #29)

> > > +    var lang = Pointer_stringify(_detectLangCode(ptr));
> > > +    var confident = !!Module['ccall']("lastResultReliable", "number");
> > 
> > This is because I couldn't find how to pass a pointer to a boolean as a
> > second parameter to detectLangCode.
> > 
> 
> Pointers are literally just integers, so you can pass them around normally.
> Then you can read from them using getValue (or directly using the HEAP*
> views).

So I would need to malloc a boolean (what's the size of a boolean?) and pass the pointer returned by malloc to the function as a parameter?

Would this be better than the current implementation with a global variable and a second C function?
Yes. A boolean is an int as well, so 4 bytes.

The out-param could save a call, but only if you use HEAP to access the bool rather than getValue. But actually, your current API seems nicer to me: function calls for everything. The perf difference hardly matters here anyhow.
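For reference, the out-param variant would look something like this (hypothetical sketch; the two-argument detectLangCode signature is the one proposed in comment 25, not what the patch currently does):

let boolPtr = Module['_malloc'](4);  // room for one C int/bool
// strPtr: pointer to the null-terminated UTF-8 string, as before
let lang = Pointer_stringify(Module['_detectLangCode'](strPtr, boolPtr));
let confident = !!Module['getValue'](boolPtr, 'i32');  // read the out-param
Module['_free'](boolPtr);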
Comment on attachment 8387647 [details] [diff] [review]
langdet-wip2 without the C++ source code

Review of attachment 8387647 [details] [diff] [review]:
-----------------------------------------------------------------

So for the folder paths, what if we put everything cld-related under browser/modules/translation/cld, with the LanguageDetector.jsm module directly under browser/translation?
Do we need to include a license file from cld? If so, that would be a good place to include it, in the same folder.

::: browser/components/langdet/LanguageDetector.jsm
@@ +6,5 @@
> +
> +Components.utils.import("resource://gre/modules/XPCOMUtils.jsm");
> +
> +XPCOMUtils.defineLazyModuleGetter(this, "Promise",
> +                                  "resource://gre/modules/commonjs/sdk/core/promise.js");

I think the recommended module to import now is resource://gre/modules/Promise.jsm

@@ +24,5 @@
> +      for (let string of pendingStrings)
> +        worker.postMessage(string);
> +      pendingStrings = [];
> +    }
> +    else

common style around here is:

} else {

::: browser/components/langdet/test/test_cld2.js
@@ +389,5 @@
> +function run_test() {
> +  do_test_pending();
> +  let i = 0;
> +  function testPair() {
> +    LanguageDetector.detectLanguage(kTestPairs[i][2]).then(result => {

since detectLanguage returns a promise, it would be nicer to have this function use add_task which uses Task.jsm. http://mxr.mozilla.org/mozilla-central/source/testing/xpcshell/head.js#1374

Then your test can be written more clearly as (pseudo-code):

add_task(function test_pairs() {
  for (let pair of kTestPairs) {
    let result = yield LanguageDetector.detectLanguage(pair[2]);
    do_check_eq(result.lang, pair[0]);
  }
});
Attachment #8387647 - Flags: feedback?(felipc) → feedback+
browser/modules was intended as a place to put modules that don't deserve their own top-level/components directory - translation seems big enough to not be in that class. Can we keep all the translation code together in browser/components/translation?
Comment on attachment 8387647 [details] [diff] [review]
langdet-wip2 without the C++ source code

Review of attachment 8387647 [details] [diff] [review]:
-----------------------------------------------------------------

emscripten glue bits look good to me, just need to handle the null terminator as discussed before.
Attachment #8387647 - Flags: feedback?(azakai) → feedback+
Summary of conversations I've had today:

- I was wondering whether it's better to do the UTF-8 encoding (the TextEncoder.encode call) on the main thread so that we can transfer an ArrayBuffer to the worker (which is transferred without a memory copy), or to transfer the JS string (which will be copied) and do the UTF-8 encoding in the worker (see the sketch below). I asked Yoric, and he told me that the measurements he did last year showed that both are more or less equivalent from a performance point of view. I said we are going to have large strings here (because we will have the whole text content of the current web page), and Yoric advised that we prefer transferring the string and doing the UTF-8 encoding in the worker. The rationale was that while the performance of both solutions is currently similar, optimizing the string transfer is something that _could_ be done in the future.
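For reference, the transfer-based alternative we decided against would have looked something like this (sketch):

// Encode on the main thread, then hand the underlying buffer to the worker
// via a transfer list, so it moves without a memory copy.
let utf8 = new TextEncoder().encode(pageText);
worker.postMessage(utf8.buffer, [utf8.buffer]);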

I discussed with Gavin what needs to happen for this to land, more specifically how we avoid affecting users with the build size increase. Conclusions are:
- for the size reduction effort, I'll finish exploring the ideas I already have, but only spend an hour or two on them, and then we will go with what we have. While we will likely want to revisit this later, what we have now is already good enough for mozilla-central.
- to avoid shipping this by default until the feature is usable, we debated whether it was worth adding a configure switch, and decided that was overkill and that it would be easier to just comment out one line in a moz.build file for now.
(In reply to :Felipe Gomes from comment #32)

> @@ +24,5 @@
> > +      for (let string of pendingStrings)
> > +        worker.postMessage(string);
> > +      pendingStrings = [];
> > +    }
> > +    else
> 
> common style around here is:
> 
> } else {

Really?

I've been told (arguably a long time ago!) in bug 339102 comment 24:
> General comment: code-style in browser/ is usually
> 
> function func_name(aFoo)
> {
>   if (...)
>     one-line
> 
>   if (...) {
>     multiple lines
>   }
>   else if (...) {
>    ...
>   }
> 
>   try {
>    ...
>   }
>   catch(ex) { }
> }
> 
> I don't mind if you choose something else, but please try to be consistent.

and in bug 339102 comment 44:
> general comment: please avoid |} else [if] {| code style in new code.

I remember discussing it with Mano, and the rationale was that always having a \n before "else" makes it line up with "if", which makes the code flow easier to understand when scrolling through the code. Since then, I've always put a line break immediately before the 'else' keyword, and required it in code for which I've been the reviewer.
The cuddled-else coding style is one of those things that keeps coming back, since both sides have their advantages (one uses only one line for the else; the other has better alignment with the if).
The official coding style (https://developer.mozilla.org/en-US/docs/Developer_Guide/Coding_Style) says:

Control structures
Use K&R bracing style: left brace at end of first line, cuddle else on both sides.

That said, a lot of modules instead use the uncuddled version, for readability reasons. So far I think we've tried to keep modules internally consistent, but constantly failed at that (at least in modules I know, both forms can be found). It's probably one of those things that will never reach consensus.
Using the table file cld2_generated_quadchrome0122_16.cc (4.5MB) instead of cld2_generated_quadchrome0122_2.cc (7MB) seems to save 438KB, and the tests still pass. I'll use the smaller table file for the next version of the patch.
(In reply to Florian Quèze [:florian] [:flo] from comment #28)
> (In reply to Gervase Markham [:gerv] from comment #27)

> > Does that answer your questions?
> 
> Yes, thanks! Looks like I don't have any change to do.

I actually have another licensing question (that Felipe was asking in comment 32): I'll have the C++ code of the cld2 library that we are compiling (ie. I will have a subset of the files that are in the cld repository) in a 'cld' folder in mozilla-central. Should I also include in that folder the 'LICENSE' file (https://code.google.com/p/cld2/source/browse/trunk/LICENSE), or is it enough to keep the Apache2 header in each C++ file? I kinda assumed having the headers was enough as we already have the full text of the apache2 license at about:license#apache, but I guess it's better to be sure :).
Flags: needinfo?(gerv)
It's a good idea to include the LICENSE file, as that makes it clear that this particular sub-directory of code is Apache-licensed.

Gerv
Flags: needinfo?(gerv)
Attached patch Patch v3 (obsolete) — Splinter Review
I think this is now ready for review.

I pushed to try at https://tbpl.mozilla.org/?tree=Try&rev=99ddc2ccde2f to verify that the tests pass on all OSes.
Attachment #8387642 - Attachment is obsolete: true
Attachment #8387647 - Attachment is obsolete: true
Attachment #8389166 - Flags: review?(felipc)
(In reply to Florian Quèze [:florian] [:flo] from comment #41)
> Created attachment 8389166 [details] [diff] [review]
> Patch v3
> 
> I think this is now ready for review.
> 
> I pushed to try at https://tbpl.mozilla.org/?tree=Try&rev=99ddc2ccde2f to
> verify that the tests pass on all OSes.

Note: the try server push includes a 'translation' line added to browser/components/moz.build, but the patch I attached here doesn't, so that checking this in to mozilla-central doesn't cause any Nightly size increase until the translation feature is actually usable.
Comment on attachment 8389166 [details] [diff] [review]
Patch v3

Review of attachment 8389166 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/components/translation/LanguageDetector.jsm
@@ +1,4 @@
> +/* This Source Code Form is subject to the terms of the Mozilla Public
> + * License, v. 2.0. If a copy of the MPL was not distributed with this file,
> + * You can obtain one at http://mozilla.org/MPL/2.0/. */
> +

"use strict"

@@ +30,5 @@
> +  }
> +  return worker;
> +});
> +
> +this.LanguageDetector = {

We need to document this. Let's include a javadoc-style comment here describing what this function does, that it returns a promise, and that the promise resolves to an object with two fields, .language and .confident. Otherwise it's hard to look at the worker code and see it (if you don't know you have to look in post.js).
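Something along these lines (a sketch of the requested comment):

/**
 * Attempt to determine the language of a string.
 *
 * @param aString the string whose language is to be detected.
 * @returns a promise resolved with an object with two fields:
 *          language  - the code of the detected language (e.g. "en"),
 *          confident - whether the detector is confident in the result.
 */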

::: browser/components/translation/cld/Makefile
@@ +37,5 @@
> +
> +FLAGS=-s USE_TYPED_ARRAYS=2 -O3 -s INLINING_LIMIT=1 --llvm-lto 1 --memory-init-file 1 --closure 1
> +
> +all: $(SOURCES:.cc=.o)
> +	$(CC) $(FLAGS) -I. -o cld-worker.js $^ --post-js post.js -s EXPORTED_FUNCTIONS="['_detectLangCode', '_lastResultReliable']"

is exporting _lastResultReliable still needed after you moved to the global var?

::: browser/components/translation/cld/post.js
@@ +20,5 @@
> +    heap[ptr + strLength] = 0;
> +
> +    var lang = Pointer_stringify(_detectLangCode(ptr));
> +    var confident = !!Module['ccall']("lastResultReliable", "number");
> +    postMessage({lang: lang, confident: confident});

let's call it "language" instead of "lang"

@@ +22,5 @@
> +    var lang = Pointer_stringify(_detectLangCode(ptr));
> +    var confident = !!Module['ccall']("lastResultReliable", "number");
> +    postMessage({lang: lang, confident: confident});
> +
> +    Module['_free'](ptr);

I'm not able to confidently review the Emscripten API usage throughout this function, so I think azakai should give a final ok

::: browser/components/translation/moz.build
@@ +1,4 @@
> +# This Source Code Form is subject to the terms of the Mozilla Public
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +

Since we'll be adding quite a number of jsms here, I'd like to keep everything related to Translation in its own subfolder path, to avoid polluting the global resource:///modules/ path.

I don't know if we should go with resource:///modules/translation/ or just resource:///translation/. I'd say the former, but opinions are welcome. To do it you just need to add:

JS_MODULES_PATH = 'modules/translation'

to this moz.build.

after that change then don't forget to update the WORKER_URL and the import at the test_cld2.js

@@ +9,5 @@
> +]
> +
> +XPCSHELL_TESTS_MANIFESTS += [
> +    'test/xpcshell.ini'
> +]

And in any case, every new moz.build file needs to be reviewed by a build peer, even if simple.

::: browser/components/translation/test/test_cld2.js
@@ +386,5 @@
> +
> +add_task(function test_pairs() {
> +  for (let pair of kTestPairs) {
> +    let result = yield LanguageDetector.detectLanguage(pair[2]);
> +    do_check_eq(result.lang, pair[0]);

might as well check result.confident too. Are there any test cases here where it gives false?
Attachment #8389166 - Flags: review?(felipc) → review+
(In reply to :Felipe Gomes from comment #43)

> might as well check result.confident too..  Are there any test cases here
> where it gives false?

There was one with the table I originally used. There may be more now.
(In reply to :Felipe Gomes from comment #43)

> > +	$(CC) $(FLAGS) -I. -o cld-worker.js $^ --post-js post.js -s EXPORTED_FUNCTIONS="['_detectLangCode', '_lastResultReliable']"
> 
> is exporting _lastResultReliable still needed after you moved to the global
> var?

Yes, lastResultReliable is the function that we use to get the value of the global variable.
Attached patch Patch v4Splinter Review
r?felipe just in case he wants to check his comments have correctly been addressed.

r?azakai for a final check of the emscripten bits (the only change compared to the attachment you already looked at is adding the terminating null byte).

r?gps for the new moz.build file.
Attachment #8389166 - Attachment is obsolete: true
Attachment #8389356 - Flags: review?(gps)
Attachment #8389356 - Flags: review?(felipc)
Attachment #8389356 - Flags: review?(azakai)
Attachment #8389356 - Flags: review?(felipc) → review+
Comment on attachment 8389356 [details] [diff] [review]
Patch v4

Review of attachment 8389356 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/components/translation/cld2/post.js
@@ +14,5 @@
> +    // Copy the UTF8 byte array to the heap.
> +    var strLength = utf8Array.length;
> +    var ptr = Module['_malloc'](strLength + 1);
> +    var heap = Module['HEAPU8'];
> +    new Uint8Array(heap.buffer, ptr, strLength).set(utf8Array);

This line could have been

  HEAPU8.subarray(ptr, ptr + strLength).set(utf8Array)

but both work and are fine. I think both allocate a new view object, so nothing can be saved here.
Attachment #8389356 - Flags: review?(azakai) → review+
cuddled elses 4 life!
Comment on attachment 8389356 [details] [diff] [review]
Patch v4

Review of attachment 8389356 [details] [diff] [review]:
-----------------------------------------------------------------

I realize I'm supposed to review just the moz.build file. That would get an r+. (Although, without a corresponding entry in a DIRS variable elsewhere, this file won't get picked up - is that intentional?)

I'm granting an r- because it looks to me like we're shipping a 1.1 MB binary file in browser/components/translation/cld/cld-worker.js.mem and I don't think that will fly. The Debian people will go ape (I expect glandium to chime in here). It appears to only target a single platform (Windows) (although perhaps my knowledge of what emscripten is doing here is wrong). I /think/ this is against some policy somewhere (written or otherwise). I seem to remember stinks being raised for similar transgressions.

The Makefile isn't up to the standards for a piece of shipping code that integrates with our build system:

* The Makefile isn't installed into the object directory and we don't want to pollute the source directory with junk
* emcc needs to be detected in configure
* Automation will need emscripten installed (assuming we want to generate the binaries as part of the build)
* This should be behind a feature flag and detected in configure

But then again, this Makefile isn't part of the build, so...

I only casually perused the bug comments. Maybe I missed something, but it looks like nobody considered doing this the right way in build and automation (reproducing the binary via emscripten as part of the build).

Furthermore, I do not feel comfortable giving r+ to a patch checking in binary platform code. But if this is OK by others, I can be swayed.

More context, please.
Attachment #8389356 - Flags: review?(gps) → review-
Build peeps: See my last comment.
(In reply to Gregory Szorc [:gps] from comment #49)
> I'm granting an r- because it looks to me like we're shipping a 1.1 MB
> binary file in browser/components/translation/cld/cld-worker.js.mem and I
> don't think that will fly. The Debian people will go ape (I expect glandium
> to chime in here).

Why?

> The Makefile isn't up to the standards for a piece of shipping code that
> integrates with our build system:

> But then again, this Makefile isn't part of the build, so...

Indeed, we're just checking it in for reference.

> I only casually perused the bug comments. Maybe I missed something, but it
> looks like nobody considered doing this the right way in build and
> automation (reproducing the binary via emscripten as part of the build).

What's the benefit? We don't expect to be modifying the CLD code frequently; a manual build process for the cases where we do seems fine.
Regarding the 1.1MB binary '.mem' file, if size is a concern we could gzip it and decompress it as it loads. That type of file tends to compress well.
(In reply to Gregory Szorc [:gps] from comment #49)

> I realize I'm supposed to review just the moz.build file. That would get an
> r+. (Although, without a corresponding entry in a DIRS variable elsewhere,
> this file won't get picked up - is that intentional?)

It's intentional, this is how we avoid building it until the translation feature is fully implemented (see last paragraph of comment 35).

For now, we add a 'translation' line to browser/components/moz.build locally and for try server builds.

> I'm granting an r- because it looks to me like we're shipping a 1.1 MB
> binary file in browser/components/translation/cld/cld-worker.js.mem and I
> don't think that will fly. The Debian people will go ape (I expect glandium
> to chime in here).

I think the reason why we are providing in mozilla-central the C++ source code and the Makefile that was used to produce the binary is exactly so that people who are unhappy about using a pre-existing binary file can regenerate it.

> It appears to only target a single platform (Windows)

The -DCLD_WINDOWS in the Makefile is a leftover from a previous version of the library. It doesn't do anything with the current version, and the code has been verified to work on all the platforms we care about (https://tbpl.mozilla.org/?tree=Try&rev=99ddc2ccde2f is all green).

> The Makefile isn't up to the standards for a piece of shipping code that
> integrates with our build system

Not integrating it in the build system was intentional.

> I only casually perused the bug comments. Maybe I missed something, but it
> looks like nobody considered doing this the right way in build and
> automation (reproducing the binary via emscripten as part of the build).

Nobody considered requiring all Firefox developers to install emscripten.
As I see it, this adds "binary" platform-independent code, along with the corresponding source and build rules to rebuild it. This is, in fact, much better than another kind of similar case in the tree, where we have binary platform-dependent code without any build rules to rebuild it (and it's always a PITA to update, and there are probably only two people who know how to do it; for the curious, I'm talking about other-licenses/7zstub/firefox/7zSD.sfx). This is also better than yet another kind of similar case in several places in the tree, where we imported a minified jquery without the corresponding source and minifier.

I'm not saying these are fine, that would be a question better asked to gerv. I'm just putting this in perspective compared to what we already have in the tree.

Note that *not* including the "binary" platform-independent code would mean adding a build dependency on emscripten, which would add a serious degree of pain for people who want to build the tree.
Gerv, see comment 54.
Flags: needinfo?(gerv)
(In reply to Alon Zakai (:azakai) from comment #52)
> Regarding the 1.1MB binary '.mem' file, if size is a concern we could gzip
> it and decompress it as it loads. That type of file tends to compress well.

It doesn't compress that well: gzipping it saves only 243kB. But I don't think size is really gps' concern here.
(In reply to Mike Hommey [:glandium] from comment #54)

> I'm not saying these are fine, that would be a question better asked to
> gerv.

gerv answered in comment 27.
(In reply to Florian Quèze [:florian] [:flo] from comment #57)
> gerv answered in comment 27.

That settles it, then. At least for me. It would be good to have better integration in the build system for people who *do* want to rebuild those files.

Greg, thoughts?
Flags: needinfo?(gerv) → needinfo?(gps)
(BTW, and FTR, we also have plenty of binary platform-dependent code with no source in other-licenses/nsis)
Comment on attachment 8389356 [details] [diff] [review]
Patch v4

Review of attachment 8389356 [details] [diff] [review]:
-----------------------------------------------------------------

I was confused about what was going on. Changing review to r+. Sorry for raising everyone's blood pressure.

Related question: at 1.1 MB, what's the performance impact of having that large .mem file in omni.jar? I /think/ we do omni.jar I/O synchronously on the main thread. Could there be large jank when this file is first accessed? Also, since that file is in omni.jar, we'll get compression on distribution for free.
Attachment #8389356 - Flags: review- → review+
And FTR I wasn't suggesting we *require* all developers to install Emscripten. But I was suggesting that any developers who wanted this feature enabled would need Emscripten. Since we don't plan on updating this file frequently, that addressed one of my concerns about large binary files checked into source control.
Flags: needinfo?(gps)
(In reply to Gregory Szorc [:gps] from comment #60)
> Related question: at 1.1 MB, what's the performance impact of having that
> large .mem file in omni.jar? I /think/ we do omni.jar I/O synchronously on
> the main thread. Could there be large jank when this file is first accessed?

Good questions - we should at the very least have full talos runs before this lands as an attempt to gauge any perf impact.

Perhaps vladan has some ideas off-hand for other ways to measure the perf impact, or specific concerns that merit investigation.
(In reply to :Gavin Sharp (email gavin@gavinsharp.com) from comment #62)
> (In reply to Gregory Szorc [:gps] from comment #60)
> > Related question: at 1.1 MB, what's the performance impact of having that
> > large .mem file in omni.jar? I /think/ we do omni.jar I/O synchronously on
> > the main thread. Could there be large jank when this file is first accessed?
> 
> Good questions - we should at the very least have full talos runs before
> this lands as an attempt to gauge any perf impact.

What do you mean by "before this lands" here?

If we land the patch here, the 1.1MB file won't be included in omni.ja, until we actually add a 'translation' line in browser/components/moz.build.

And even then, the cld-worker is lazily started, so the file won't be accessed until some code calls the LanguageDetector.detectLanguage function.

The WIP I have in bug 979424 would, I think, be more interesting to send to talos.
Flags: needinfo?(gavin.sharp)
Yes, sorry I wasn't clear. Obviously no need to talos a patch that isn't built, but wherever we end up enabling packaging/use of this code we should be diligent.
Flags: needinfo?(gavin.sharp)
https://hg.mozilla.org/integration/fx-team/rev/f33b13605010

I removed -DCLD_WINDOWS from the Makefile, as that was confusing.
We read omni.ja's in their entirety on startup (on the main thread), so I expect this patch will regress startup time. ts_paint might not show the regression, since it tests for "warm" Firefox startups. It might be better to leave CLD out of omni.ja so it only gets read in when we need it.

Why are we compiling an external C++ library to asm.js? Is it because we have particular concerns about safety/stability?
This decision seems very strange to me as well. What are the safety concerns here, exactly? Have we contacted the Chromium folks to ask what kinds of security fixes they have done on their code? We don't compile the rest of the C++ code we have in the platform to JS for safety reasons.
(In reply to Vladan Djeric (:vladan) from comment #66)
> We read omni.ja's in their entirety on startup (on the main thread)

Except maybe on mac, we shouldn't be.
I can't speak for the people doing the planning and work in this bug, but in general I think that sandboxing new C/C++ code we include into Firefox makes sense. Unless we want to audit it very carefully (both initially, and over time as it changes). Even so, Chrome sandboxes its own components where performance does not matter too much (like their new microphone process), using NaCl, so it makes sense for us to do something similar (NaCl could be another option for self-sandboxing, in theory, btw). All of this is just because of the familiar security risks with C/C++ code; Firefox (and other browsers) just got hacked yesterday at a security conference for example.

With JS we get good performance and a very high guarantee of sandboxing - it's high enough for us to be confident running content straight off the web, after all :) So where performance is not critical, this might make sense then.

Three other possible benefits of this approach are (1) code can be loaded dynamically only if/when actually needed (although perhaps not from omni.ja, based on discussion above?), (2) it is almost trivial to run computational code in a worker, avoiding jank on the main thread, and (3) when you build the codebase once into JS, it runs on all platforms - no worries with header issues or platform-specific bugs in the code.

There are downsides too of course, as mentioned above by others. I don't know enough about the other factors here to argue if the tradeoff is overall good or not in this case. Just wanted to lay out the potential benefits as I see them.
(In reply to comment #69)
> I can't speak for the people doing the planning and work in this bug, but in
> general I think that sandboxing new C/C++ code we include into Firefox makes
> sense. Unless we want to audit it very carefully (both initially, and over time
> as it changes). Even so, Chrome sandboxes its own components where performance
> does not matter too much (like their new microphone process), using NaCl, so it
> makes sense for us to do something similar (NaCl could be another option for
> self-sandboxing, in theory, btw). All of this is just because of the familiar
> security risks with C/C++ code; Firefox (and other browsers) just got hacked
> yesterday at a security conference for example.

We import C++ code from other projects into our tree fairly regularly, and test it through a whole variety of means such as auditing, debugging assertions, fuzzing, etc.  I understand the security benefits of JS over C++ in general, but it's not like we don't have tools to deal with imported C++ code.  Also note that depending on the source of the imported code, various degrees of effort might have already been put into making it secure.

> With JS we get good performance and a very high guarantee of sandboxing - it's
> high enough for us to be confident running content straight off the web, after
> all :) So where performance is not critical, this might make sense then.

The thing is, it's not just the runtime performance of the code that we should consider. We should consider other things such as the size of the JS code versus the compiled C++ code (which affects how much additional time we spend reading the binaries from disk on the main thread during startup, if the C++ code is linked into libxul), the memory usage of the C++-compiled code vs the JS-compiled code, the possibility of that affecting what we end up doing when we decide to port this code to mobile, etc.

> Three other possible benefits of this approach are (1) code can be loaded
> dynamically only if/when actually needed (although perhaps not from omni.ja,
> based on discussion above?),

C++ code can also be dynamically loaded.  ;-)

> (2) it is almost trivial to run computational code
> in a worker, avoiding jank on the main thread

It's also trivial to run computation code in a background thread, as I'm sure you're aware.  :-)

> (3) when you build the
> codebase once into JS, it runs on all platforms - no worries with header issues
> or platform-specific bugs in the code.

We have managed to get that under control for the C++ parts of the code base, I don't see why that should be different here.

> There are downsides too of course, as mentioned above by others. I don't know
> enough about the other factors here to argue if the tradeoff is overall good or
> not in this case. Just wanted to lay out the potential benefits as I see them.

Yes, understood. What I'm trying to say is that by compiling C++ code into JS, we're adding a whole bunch of new ways to shoot ourselves in the foot on top of the ways already available in C++. But we do have quite a bit of experience dealing with the C++ footguns, and at least from the discussion in this bug so far, it doesn't seem like the new set of footguns has been fully explored here.
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #70)
> Yes, understood.  What I'm trying to say is that by compiling C++ code into
> JS, we're also adding a whole bunch of new ways to shoot ourselves in the
> foot in addition to the available ways of doing that in C++.

Which footguns are you concerned about, specifically?

It's not clear to me what you think the downsides with this approach are, other than "I'm used to dealing with C++".

We should obviously be aware of all the tradeoffs we're making - comparing run-time memory use between this and the "native code" approach would definitely be useful, for example. But the benefits to this approach seem big enough that it shouldn't be dismissed just because we've not done it before.
(In reply to comment #71)
> (In reply to :Ehsan Akhgari (needinfo? me!) from comment #70)
> > Yes, understood.  What I'm trying to say is that by compiling C++ code into
> > JS, we're also adding a whole bunch of new ways to shoot ourselves in the
> > foot in addition to the available ways of doing that in C++.
> 
> Which footguns are you concerned about, specifically?

Code size, memory usage, effect on startup time, runtime performance, and the porting story to mobile, off the top of my head.

> It's not clear to me what you think the downsides with this approach are, other
> than "I'm used to dealing with C++".
> 
> We should obviously be aware of all the tradeoffs we're making - comparing
> run-time memory use between this and the "native code" approach would
> definitely be useful, for example. But the benefits to this approach seem big
> enough that it shouldn't be dismissed just because we've not done it before.

Sure, no doubt about that.  But the downsides of it should not be disregarded either because "I'm used to dealing with JS".  :-)

Also, please consider the other point that I raised about it being quite normal for us to import C++ code from other projects.  Compiling to JS has not been the default answer in a majority of cases.
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #72)
> > Which footguns are you concerned about, specifically?
> 
> Code size, memory usage, effect on startup time, runtime performance, and
> the porting story to mobile off the top of my head.

These are questions we'll need to answer about any newly added code, regardless of what language it's in. We're going to understand the costs before we ship the code.

> Also, please consider the other point that I raised about it being quite
> normal for us to import C++ code from other projects.

"being normal" doesn't mean it's a good idea :)
It's definitely an interesting approach, and possibly a viable one.

IMO the discussion of whether we should sandbox external code in JS is probably bigger than the scope of this bug, so maybe take it to dev.platform?

However, to have a good discussion with some data to back up the arguments, it'd probably be best if there was a concrete case to assess and analyze, and from that perspective, this bug might fit the bill.

Assuming that for this bug we could switch relatively easily (TM/TBD) between pure C++ and Emscripten, this bug could be an interesting test case for the approach.
https://hg.mozilla.org/mozilla-central/rev/f33b13605010
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 30
(In reply to comment #74)
> It's definitely an interesting approach, and possibly a viable one.
> 
> IMO the discussion if we should sandbox external code in JS is probably bigger
> than the scope of this bug, so maybe on dev.platform?

It's really hard to have any discussion without data, which we don't seem to have here. :(  Without it, any question one asks about the approach employed here can be turned around and asked of us.

> However, to have a good discussion with some data to back up the arguments,
> it'd probably be best if there was a concrete case to assess and analyze, and
> from that perspective, this bug might fit the bill.

It would definitely be awesome if someone gathered some data here.

> Assuming that for this bug we could switch relatively easily (TM/TBD) between
> pure C++ and Emscripten, this bug could be an interesting test case for the
> approach.

We can get data on many of the questions I raised without actually needing any facility to switch between the implementations per se; for example, the code size question can easily be answered by compiling the C++ code to x86/x86-64/ARM using the optimization flags we build with. In fact, I think the only one we cannot answer accurately without that kind of switch is the runtime memory usage.

But as this patch has landed, and it seems like this falls under the perf umbrella more than anything else, I'll defer further pushing on this to Vladan. For the record, I remain unconvinced so far that there is a good reason to avoid natively compiling the C++ code here.
Flags: needinfo?(vdjeric)
(In reply to :Gavin Sharp (email gavin@gavinsharp.com) from comment #73)
> (In reply to :Ehsan Akhgari (needinfo? me!) from comment #72)
> > > Which footguns are you concerned about, specifically?
> > 
> > Code size, memory usage, effect on startup time, runtime performance, and
> > the porting story to mobile off the top of my head.
> 
> These are questions we'll need to answer about any newly added code,
> regardless of what language it's in. We're going to understand the costs
> before we ship the code.

Is there a bug on file for that investigation?

> > Also, please consider the other point that I raised about it being quite
> > normal for us to import C++ code from other projects.
> 
> "being normal" doesn't mean it's a good idea :)

No, it just means that we have plenty of experience with how to do it right.
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #76)
> We can get data on many of the questions that I raised here without actually
> needing any facilities to switch between the implementations per se...
> ...
> But as this patch has landed and it seems like this falls into the perf
> umbrella more than anything else, I'll defer further pushing on this to
> Vladan.  For the record, I remain unconvinced so far that there is a good
> reason why we should avoid natively compiling the C++ code here.

Indeed, we can test this stuff relatively quickly. However, my reasoning for "easy switch" was from a maintenance point of view: if down the road we need to abandon this approach, it would hurt if switching were hard.

As for pros/cons, IMO the big pro is better sandboxing. The cons are mostly the price(s) we pay for it, and the question is whether it's worth that price (probably mostly in perf and maintainability) - which I don't have an answer to.
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #76)
> For the record, I remain unconvinced so far that there is a good
> reason why we should avoid natively compiling the C++ code here.

One of the main motivators, IMO, is the security aspect. Note that this library will consume arbitrary content from webpages. And half of the library code is statistically-generated C++ code. So while it must be "safe enough" for Chrome to be using it, the security advantage of running it in JS is significant, at least without a sandboxed process around it.

All else being equal, I'd choose this option. Florian has some preliminary numbers for memory usage and code size which were looking really good, but I'll let him comment so that I don't quote the wrong numbers. And we should get more data soon. I think the main concern is cold startup cost from the omni.ja size increase, and if that's the only blocker, we can investigate keeping this file outside of omni.ja.
(In reply to comment #79)
> (In reply to :Ehsan Akhgari (needinfo? me!) from comment #76)
> > For the record, I remain unconvinced so far that there is a good
> > reason why we should avoid natively compiling the C++ code here.
> 
> One of the main motivators, IMO, is the security aspect. Note that this library
> will consume arbitrary content from webpages. And half of the library code is
> statistically-generated C++ code. So while it must be "safe enough" for Chrome
> to be using it, the security advantage of running it in JS is significant, at
> least without a sandboxed process around it.

OK, here's some data on the amount of code we're talking about.  As you know, the majority of the CLD2 code is data tables, which don't introduce security risks on their own.  Without that, we have:

$ git status
HEAD detached at origin/fx-team
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	deleted:    cld2_generated_cjk_compatible.cc
	deleted:    cld2_generated_deltaoctachrome0122.cc
	deleted:    cld2_generated_distinctoctachrome0122.cc
	deleted:    cld2_generated_quadchrome0122_16.cc
	deleted:    cld_generated_cjk_delta_bi_4.cc
	deleted:    cld_generated_cjk_uni_prop_80.cc
	deleted:    cld_generated_score_quad_octa_0122_2.cc
	deleted:    generated_distinct_bi_0.cc
	deleted:    generated_entities.cc
	deleted:    generated_language.cc
	deleted:    generated_language.h
	deleted:    generated_ulscript.cc
	deleted:    generated_ulscript.h

no changes added to commit (use "git add" and/or "git commit -a")
$ perl ~/Downloads/cloc-1.60.pl *.h *.cc
      35 text files.
      35 unique files.
       0 files ignored.

http://cloc.sourceforge.net v 1.60  T=0.11 s (313.3 files/s, 151869.3 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C++                             13           1203           2301           6866
C/C++ Header                    22           1088           1689           3818
-------------------------------------------------------------------------------
SUM:                            35           2291           3990          10684
-------------------------------------------------------------------------------

So we're talking about auditing at most 10kloc, and there is a chance that we won't use all of that code if we, for example, only call DetectLanguage() without asking for HTML parsing.  It seems to me that this code is at least auditable, and if you're worried about the quality of the code, it should be audited regardless of whether we compile it natively or not, since a buffer overrun in native code would be a bug that probably breaks language detection in the JS-compiled code as well.
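
For reference, the plain-text entry point is a single call.  Here is a minimal sketch of using it directly; the include path and sample text are illustrative assumptions, not code from this patch:

#include <cstdio>
#include <cstring>
#include "public/compact_lang_det.h"  // assumed location in the CLD2 sources

int main() {
  const char* text = "Ceci est un exemple de texte en français.";
  bool is_reliable = false;
  // is_plain_text = true keeps us on the plain-text path, so the HTML
  // handling code should not be exercised.
  CLD2::Language lang = CLD2::DetectLanguage(
      text, static_cast<int>(strlen(text)),
      /* is_plain_text = */ true, &is_reliable);
  printf("%s (%s), reliable=%d\n", CLD2::LanguageName(lang),
         CLD2::LanguageCode(lang), is_reliable);
  return 0;
}

Built against the CLD2 sources, this should print the detected language's name and code plus the reliability bit, touching only the detection core rather than the full API surface.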
I pushed to try a moz.build file that makes the build create a cld2 binary library.
Unfortunately the Linux and Windows builds failed: https://tbpl.mozilla.org/?tree=Try&rev=3985f2aa1ac5

Here is the data I have about binary vs emscripten'ed size:

emscripten:
cld-worker.js 481372 bytes + cld-worker.js.mem 1108936 bytes
Total: 1590308 bytes (1553kB)

binary:
From my local x86_64 Mac build:
libcld2.dylib 1300792 bytes (1270kB)
From the try server's universal build:
libcld2.dylib 2576004 bytes (2515kB)


About memory usage:
The cld-min.js file from https://github.com/jaukia/cld-js/ was using about 30MB after being loaded.
Our cld-worker.js is compiled with a newer version of emscripten that outputs asm.js code. It takes 16MB of memory when loaded, and when I tried using it on real web pages, I saw the memory usage grow to 18MB.

I don't have memory usage data from the binary library, and don't know a way to easily collect them.
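
One low-tech way to get a roughly comparable number would be a standalone harness linked against libcld2 that watches its own resident set size around a detection call. A Linux-only sketch (assumed include path and sample text; on Mac you'd need task_info instead of /proc):

#include <cstdio>
#include <cstring>
#include "public/compact_lang_det.h"  // assumed location in the CLD2 sources

// Parse VmRSS (resident set size, in kB) out of /proc/self/status.
static long ResidentKB() {
  FILE* f = fopen("/proc/self/status", "r");
  if (!f)
    return -1;
  char line[256];
  long rss = -1;
  while (fgets(line, sizeof(line), f)) {
    if (strncmp(line, "VmRSS:", 6) == 0) {
      sscanf(line + 6, "%ld", &rss);
      break;
    }
  }
  fclose(f);
  return rss;
}

int main() {
  long before = ResidentKB();
  bool reliable = false;
  const char* text = "Das ist ein kurzer deutscher Beispieltext.";
  CLD2::Language lang = CLD2::DetectLanguage(
      text, static_cast<int>(strlen(text)),
      /* is_plain_text = */ true, &reliable);
  long after = ResidentKB();
  printf("detected %s, RSS %ld kB -> %ld kB\n",
         CLD2::LanguageCode(lang), before, after);
  return 0;
}

This wouldn't be apples-to-apples with the worker's about:memory numbers, but averaged over a few invocations it would at least put the two approaches on the same scale.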
(In reply to comment #81)
> Created attachment 8391241 [details] [diff] [review]
>   --> https://bugzilla.mozilla.org/attachment.cgi?id=8391241&action=edit
> Build a binary to get data about code size
> 
> I pushed to try a moz.build file that makes the build create a cld2 binary
> library.

Thanks!

> Unfortunately the Linux and Windows builds failed:
> https://tbpl.mozilla.org/?tree=Try&rev=3985f2aa1ac5
> 
> Here is the data I have about binary vs emscripten'ed size:
> 
> emscripten:
> cld-worker.js 481372 bytes + cld-worker.js.mem 1108936 bytes
> Total: 1590308 bytes (1553kB)

So that is about 1.5MB read on the main thread during application startup currently.  That's bad.

> binary:
> From my local x86_64 Mac build:
> libcld2.dylib 1300792 bytes (1270kB)
> From the try server's universal build:
> libcld2.dylib 2576004 bytes (2515kB)

Comparing to the universal build is not really useful, FWIW.

> About memory usage:
> The cld-min.js file from https://github.com/jaukia/cld-js/ was using about 30MB
> after being loaded.
> Our cld-worker.js is compiled with a newer version of emscripten that outputs
> asm.js code. It takes 16MB of memory when loaded, and when I tried using it on
> real web pages, I saw the memory usage grow to 18MB.

This amount of memory usage is extremely worrying.  Nick, what do you think?
Flags: needinfo?(n.nethercote)
> I don't have memory usage data from the binary library, and don't know a way
> to easily collect them.

Can you just compare one of the top-line numbers for both versions? 'resident' is probably the best one to compare. It's also a bit noisy, but if you do measurements on three or more invocations for both versions, it's probably enough for a reasonable comparison.

48 MiB of memory is quite a lot. It would likely relegate this feature to "desktop only" territory.
> 48 MiB of memory is quite a lot. It would likely relegate this feature to
> "desktop only" territory.

It also would be nice if it were something that could be off by default, and enabled only by those who want it, because it seems like most people wouldn't need it. (Though I admit to not understanding the purpose of this language detection; this bug doesn't explain that.)
Flags: needinfo?(n.nethercote)
(In reply to Nicholas Nethercote [:njn] from comment #83)

> 48 MiB of memory is quite a lot. It would likely relegate this feature to
> "desktop only" territory.

Where is this 48MiB figure coming from? I said during my testing it takes 16MiB once loaded, and 18MiB after using it for a while.
(In reply to Nicholas Nethercote [:njn] from comment #83)
> > I don't have memory usage data from the binary library, and don't know a way
> > to easily collect them.
> 
> Can you just compare one of the top-line numbers for both versions?

We don't have a usable binary version. I just compiled a binary library on the try server to check the code size, but we don't have any code using that binary library.
I did 30 + 18 MiB, from comment 81. Sorry if that's wrong. You haven't explained what you're measuring and how, and there are lots of different memory metrics. Numbers from about:memory would be great if you have them, because their meaning is clear.
(In reply to Nicholas Nethercote [:njn] from comment #87)
> I did 30 + 18 MiB, from comment 81. Sorry if that's wrong. You haven't
> explained what you're measuring and how, and there are lots of different
> memory metrics. Numbers from about:memory would be great if you have them,
> because their meaning is clear.

For all these values, I'm looking at the size indicated by about:memory for the relevant worker.

The 30MB figure is the size at startup of a version that we are _not_ using (we used it for a previous prototype though).

The 16MB figure is the memory usage just after our worker is loaded, before it's actually used. All of it is in "non-heap/elements/asm.js".

The 18MB figure is our worker, after it has been used for a while. Here is how I see it in about:memory:
│   ├──18.38 MB (06.90%) -- worker(resource:///modules/translation/cld-worker.js, 0x119fc8400)
│   │  ├──17.11 MB (06.42%) -- zone(0x12507d800)
│   │  │  ├──16.89 MB (06.34%) -- compartment(web-worker)
│   │  │  │  ├──16.84 MB (06.32%) -- classes
│   │  │  │  │  ├──16.00 MB (06.00%) -- class(ArrayBuffer)
│   │  │  │  │  │  ├──16.00 MB (06.00%) -- objects
│   │  │  │  │  │  │  ├──16.00 MB (06.00%) ── non-heap/elements/asm.js
│   │  │  │  │  │  │  └───0.00 MB (00.00%) ++ (2 tiny)
│   │  │  │  │  │  └───0.00 MB (00.00%) ++ shapes/gc-heap
│   │  │  │  │  └───0.84 MB (00.32%) ++ (6 tiny)
│   │  │  │  └───0.05 MB (00.02%) ++ (3 tiny)
│   │  │  └───0.21 MB (00.08%) ++ (6 tiny)
│   │  └───1.27 MB (00.48%) ++ (4 tiny)
Thanks for the clarification. 18 MiB is obviously better than 48, but still high enough to be a problem for Fennec and B2G, and making it enable-on-request-only for desktop would be a good idea.

Actually... the 16 MiB of asm.js elements is clearly the asm.js heap, which may only be partially used, in which case the actual physical memory consumption may be less than this suggests. (Though virtual memory consumption is still a problem on win32.) Can you get 'resident' values with this enabled (and in use) and with it off? That would give us an idea of how much of that heap memory is committed. Thanks!
(In reply to comment #89)
> Thanks for the clarification. 18 MiB is obviously better than 48, but still
> high enough to be a problem for Fennec and B2G, and making it
> enable-on-request-only for desktop would be a good idea.

We should keep mobile in mind even if we are planning to initially ship this on desktop.  Chrome for Android, for example, supports this; I don't see why we wouldn't want to support it on Fennec and B2G.
Status: RESOLVED → VERIFIED
We're not even enabling it on desktop yet, so the advice about B2G integration or mobile implications is a bit premature :)

I appreciate the concern and the desire to ensure we do our due diligence here, I really do. I know that code occasionally lands that regresses perf/memory without us knowing about it ahead of time, so it's good that we're thinking about this stuff right from the start.

Florian, it sounds like we need a bug on file to do some performance and memory-use comparisons between native code and emscriptened asm.js implementations. Want to file that and throw it in the backlog?
Flags: needinfo?(vdjeric)
Flags: needinfo?(florian)
No longer blocks: fxdesktopbacklog
Flags: firefox-backlog+
(In reply to comment #91)
> Florian, it sounds like we need a bug on file to do some performance and
> memory-use comparisons between native code and emscriptened asm.js
> implementations. Want to file that and throw it in the backlog?

Was this bug ever filed?
I filed bug 996119 so that we investigate the performance and memory usage impact of CLD.
Flags: needinfo?(florian)
Depends on: 1003117
Depends on: 1003120
Depends on: 1003125
Component: General → Translation