Last Comment Bug 994357 - Use DOM overlays for safely setting node content
: Use DOM overlays for safely setting node content
Status: RESOLVED FIXED
: sec-low, sec-want
Product: Firefox OS
Classification: Client Software
Component: Gaia::L10n (show other bugs)
: unspecified
: All All
P2 normal (vote)
: 2.2 S9 (3apr)
Assigned To: Staś Małolepszy :stas
:
:
Mentors:
https://github.com/mozilla-b2g/gaia/b...
Depends on: 1020138 994290 994519 1084502 1089710
Blocks: 999779 1148895 936534 1006359 1027117 1142526 1154438
  Show dependency treegraph
 
Reported: 2014-04-09 15:29 PDT by Zibi Braniecki [:gandalf][:zibi]
Modified: 2015-06-08 07:33 PDT (History)
12 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
2.2+
fixed
fixed


Attachments
WIP (14.32 KB, patch)
2014-04-10 06:01 PDT, Staś Małolepszy :stas
gandalf: feedback+
Details | Diff | Splinter Review
WIP (Gaia repo) (20.29 KB, patch)
2014-05-20 05:28 PDT, Staś Małolepszy :stas
no flags Details | Diff | Splinter Review
Gaia patch, no tests yet (22.31 KB, patch)
2014-06-19 02:05 PDT, Staś Małolepszy :stas
no flags Details | Diff | Splinter Review
Gaia patch, no tests yet, less reordering of code (16.43 KB, patch)
2014-06-19 13:52 PDT, Staś Małolepszy :stas
no flags Details | Diff | Splinter Review
[gaia] stasm:994357-dom-overlays > mozilla-b2g:master (46 bytes, text/x-github-pull-request)
2015-03-10 22:46 PDT, Autolander
gandalf: review+
Details | Review | Splinter Review
[gaia] stasm:994357-reland > mozilla-b2g:master (46 bytes, text/x-github-pull-request)
2015-03-30 23:19 PDT, Autolander
stas: review+
jocheng: approval‑gaia‑v2.2+
Details | Review | Splinter Review

Description User image Zibi Braniecki [:gandalf][:zibi] 2014-04-09 15:29:19 PDT
Currently the heuristics for node content l10n are bizarre.

We should aim at simplifying it and making it more predictable for developers.
Comment 1 User image Staś Małolepszy :stas 2014-04-09 16:34:08 PDT
What do you think about using the DOM-overlay approach we implement in L20n master?  I'd argue it's useful and more intuitive than the current setTextContent function.

https://github.com/l20n/l20n.js/blob/master/docs/html.md#make-html-elements-localizable

See https://github.com/l20n/l20n.js/blob/986794bcd653b5b3c18a1a629852ec4c15612f22/bindings/l20n/html.js#L254-L406
Comment 2 User image Zibi Braniecki [:gandalf][:zibi] 2014-04-10 00:36:35 PDT
I like it
Comment 3 User image Staś Małolepszy :stas 2014-04-10 06:01:36 PDT
Created attachment 8404666 [details] [diff] [review]
WIP

I ported DOM overlays from master which turned out to be pretty easy.  Here's a WIP, with a few open questions:

 - it doesn't support dataset nor style attributes for now, because it uses setAttribute everywhere.  It would be helpful to fix bug 994290 first.

 - I removed setTextContext, but it was also used in mozL10n.localize.  Let's clean it up in bug 994519 first

 - this might not work well with the buildtime pretranslation, since some custom attributes from the default localization might leak into other localizations (done on the fly)

See examples/regular.html for a working example.  To flash Gaia, ake sure you add win.Node.ELEMENT_NODE = 1 to build/webapp-optimize.js.
Comment 4 User image Staś Małolepszy :stas 2014-04-10 11:45:11 PDT
(In reply to Staś Małolepszy :stas from comment #3)

>  - this might not work well with the buildtime pretranslation, since some
> custom attributes from the default localization might leak into other
> localizations (done on the fly)

To elaborate on this:  the implemented approach (taken directly from L20n master) uses the source HTML to define which elements are allowed in the translation.  There's also a whitelist of allowed elements and attributes which are safe to use anywhere.  When the translation is applied on the fly, the attributes are set on the DOM nodes.  When the language then changes again, the attributes from the previous language, which are not present in the new translation would leak since they are not explicitly removed.  So that this doesn't happen, the code keeps the original DOM node in memory nad uses it as the baseline for any new translation.

Gaia's buildtime optimization which pretranslates the HTML into the default language would make that default language be the original DOM.  So any attributes set in the default language would leak to any other localization *on runtime*.
Comment 5 User image Zibi Braniecki [:gandalf][:zibi] 2014-05-16 14:53:57 PDT
Comment on attachment 8404666 [details] [diff] [review]
WIP

Review of attachment 8404666 [details] [diff] [review]:
-----------------------------------------------------------------

I believe that we cannot resolve the pretranslate case leaking right now. Fortunately, except of style.maxWidth which we're going to get rid of, there's no reason for locales to have different attributes.

In the future we may end up getting rid of pretranslation overall, and move to shadowDOM will help us with the leaking as well.

I suggest landing it as is (with tests) if we get good performance numbers.

::: bindings/l20n/dom.js
@@ +95,5 @@
>  
> +  if (value) {
> +    // if there is no HTML in the translation nor no HTML entities are used,
> +    // just replace the textContent
> +    if (value.indexOf('<') === -1 && value.indexOf('&') === -1) {

I'm worried about this line. It seems to be in the hot path and I would like to see the number on how it impact non-default locale translation.
Comment 6 User image Staś Małolepszy :stas 2014-05-17 16:19:30 PDT
(In reply to Zibi Braniecki [:gandalf] from comment #5)

> > +    if (value.indexOf('<') === -1 && value.indexOf('&') === -1) {
> 
> I'm worried about this line. It seems to be in the hot path and I would like
> to see the number on how it impact non-default locale translation.

It looks like it doesn't have any measurable impact.

cold_load_time diffs, negative means DOM overlays were faster:

		median	mean	stdev
-------------------------------------
Browser		0	-7	-2
Calendar	6	7	1
Camera		3	6	9
Clock		9	12	48
Contacts	7	11	24
Email		-2	5	43
Messages	-1	5	76
Music		10	12	5
Phone		3	1	-13
Settings	23	21	-2
Usage		10	12	-2
Video		-22	-14	3
Gallery		-5	1	-20
-------------------------------------
Avg		3	6	13
Comment 7 User image Staś Małolepszy :stas 2014-05-19 14:08:30 PDT
When this bug is fixed, we should discourage all uses of node.textContent = 'foo' and node.innerHTML = 'foo' done manually by developers.  Instead, all translations should be inserted into the DOM using the overlays logic from this bug.  This is a more secure and robust solution.

This has interesting tie-ins with bug 992473 and bug 994519.

We could expose a mozL10n.translateDOMFragment(node, id, args) method that, given a node, fetches the translation, interpolates args, and inserts the result using the overlays logic.

However, with Mutation Observers (bug 992473) there will no longer by any need to explicitly call translateElement on the node that is being inserted.  It only needs a data-l10n-id attribute to be picked up by the observer and to be correctly localized.

For args (program data used as values for placeables in translations), we need a way to persist them until the observer kicks in.  Currently this is done with mozL10n.localize() settings data-l10n-args to a serialized JSON.  The observer then reads it and parses to a JS object.  This should be good for now.  In the future, I think we can consider moving to the shadow DOM and/or storing args as regular JS objects on the node.

So a complete solution using Mutation Observers could be as simple as:

  node.setAttribute('data-l10n-id', 'hello');
  node.setAttribute('data-l10n-args', JSON.stringify({ username: 'Mary' }));
  document.body.appendChild(node);

In bug 994519 we will discuss whether or not we should wrap this in a method of mozL10n.
Comment 8 User image Staś Małolepszy :stas 2014-05-20 05:28:23 PDT
Created attachment 8425415 [details] [diff] [review]
WIP (Gaia repo)

Vivien, I'm interested in hearing your thoughts on this approach.  You might remember something similar that we once discussed in Paris 2 or 3 years ago for L20n :)

In a nutshell, this allows localizers to safely use text-level semantics HTML in translations (em, sup, strong).  At the same time developers need not to worry about malicious innerHTML anymore, thanks to the <template> element which creates an inert DOM that can be filtered according to a whitelist.

Also, if a node with data-l10n-id contains another interactive element, like a button or an input, the translation will be inserted around it and into it, and the element will not be replaced.  This allows any two-bindings or event handlers attached to the element to be preserved during the localization of the DOM.

Note, this is still work in progress.  I'll need a few more days next week to write tests.  We'll also need bug 994290 which will remove nested attributes (which were used for foo.dataset.something, but aren't anymore;  see bug 994290's dependencies).

The code has been taken directly from L20n master, where it passed a security review last year (bug 925579).
Comment 9 User image Staś Małolepszy :stas 2014-06-19 02:05:10 PDT
Created attachment 8442698 [details] [diff] [review]
Gaia patch, no tests yet

I cleaned the patch up and rebased it on top of the current master, in particular, on top of l10n mutation observers (bug 992473).

Perf looks good.  Here are the cold startup times measured against master;  no statistically significant differences:

App       Δ median  Δ mean  Sig?
--------  --------  ------  ----
Browser        -13       2      
Calendar         5       9      
Camera           5      -4      
Clock          -12     -15      
Contacts        -8      -5      
Email            1       2      
Messages         6     -18      
Phone           18      15      
Settings         8      31      
Usage           -1      -5 

A few tests are still failing and I suspect that this is due to the setTextContent logic which I remove in the patch.  It was introduced way back in bug 902363 and roughly, it only inserts translations in place of the first TEXT_NODE-type node, leaving any source HTML children intact.  DOM overlays are a better solution to the same problem (and then some);  you can reorder nodes, or inject text both before and after the HTML children.

https://github.com/mozilla-b2g/gaia/pull/19413
https://travis-ci.org/mozilla-b2g/gaia/builds/27903664
https://tbpl.mozilla.org/?tree=Gaia-Try&rev=1245cc552b1d

Vivien, let me know what you think.  I still have a lot of tests to write which I'll do early next week, hence the feedback request, not a review request.  However, the l10n.js part of the patch should be considered complete.
Comment 10 User image Vivien Nicolas (:vingtetun) (:21) - (NOT reading bugmails, needinfo? please) 2014-06-19 06:40:51 PDT
Comment on attachment 8442698 [details] [diff] [review]
Gaia patch, no tests yet

Review of attachment 8442698 [details] [diff] [review]:
-----------------------------------------------------------------

::: shared/js/l10n.js
@@ +1453,5 @@
> +    }
> +    return index;
> +  }
> +
> +  /* jshint -W104 */

This is very hard for me to see the diff between the 2 versions since the code has been moved around and splinter just shows me new code.

Can you keep the code at the same place ?
Comment 11 User image Vivien Nicolas (:vingtetun) (:21) - (NOT reading bugmails, needinfo? please) 2014-06-19 06:47:12 PDT
(In reply to Staś Małolepszy :stas from comment #1)
> What do you think about using the DOM-overlay approach we implement in L20n
> master?  I'd argue it's useful and more intuitive than the current
> setTextContent function.
> 
> https://github.com/l20n/l20n.js/blob/master/docs/html.md#make-html-elements-
> localizable
> 

Do you have an example of how it will work with .properties ? (in order to make it clearer for poor mortal like me that does not know the .lol syntax very well).
Comment 12 User image Vivien Nicolas (:vingtetun) (:21) - (NOT reading bugmails, needinfo? please) 2014-06-19 06:50:25 PDT
(In reply to Staś Małolepszy :stas from comment #8)
> Created attachment 8425415 [details] [diff] [review]
> WIP (Gaia repo)
> 
> In a nutshell, this allows localizers to safely use text-level semantics
> HTML in translations (em, sup, strong).  At the same time developers need
> not to worry about malicious innerHTML anymore, thanks to the <template>
> element which creates an inert DOM that can be filtered according to a
> whitelist.
> 

I'm not against the idea of allowing safe text-level semantics elements in a secure way. But I would like to see the clean diff of the patch in order what do you mean exactly with the usage of the template element.

> Also, if a node with data-l10n-id contains another interactive element, like
> a button or an input, the translation will be inserted around it and into
> it, and the element will not be replaced.  This allows any two-bindings or
> event handlers attached to the element to be preserved during the
> localization of the DOM.
> 

This part sounds good.
Comment 13 User image Vivien Nicolas (:vingtetun) (:21) - (NOT reading bugmails, needinfo? please) 2014-06-19 06:50:59 PDT
Comment on attachment 8442698 [details] [diff] [review]
Gaia patch, no tests yet

I will give you a feedback as soon I have a patch where I can actually see the real diffs ;)
Comment 14 User image Staś Małolepszy :stas 2014-06-19 13:52:22 PDT
Created attachment 8443028 [details] [diff] [review]
Gaia patch, no tests yet, less reordering of code
Comment 15 User image Staś Małolepszy :stas 2014-06-19 13:54:21 PDT
Comment on attachment 8442698 [details] [diff] [review]
Gaia patch, no tests yet

(Pressed Enter to soon.)

Here's an updated patch which should be much easier to read.  I needed to define the whitelist early and initially decided to move the rest of the DOM localization code close to it, but you're right, the huge diff is not worth it.

I'll post an example in a few.
Comment 16 User image Staś Małolepszy :stas 2014-06-19 14:17:18 PDT
I have recently come across a great real world example of how DOM overlays could be used to simplify developers' and localizers' life.  Behold :)

In System's FxAccounts code, you'll find the following three l10n strings defined:

https://github.com/mozilla-b2g/gaia/blob/ccd70903544486bea04e85d8a4aacf63f1de2a72/apps/system/fxa/locales/fxa.en-US.properties#L62-L68

  fxa-notice = By proceeding, I agree to the {{tos}} and {{pn}} of Firefox cloud services.
  fxa-tos = Terms of Service
  fxa-pn = Privacy Notice

On runtime, there's a logic defined here which replaces {{tos}} and {{pn}} with <a> elements (step 1) and then insert the translation of the text of the link (step 2):

https://github.com/mozilla-b2g/gaia/blob/master/apps/system/fxa/js/screens/fxam_enter_email.js#L65-L75
https://github.com/mozilla-b2g/gaia/blob/master/apps/system/fxa/js/screens/fxam_enter_email.js#L77-81

With DOM overlays, you could first define two <a> children of the fxaNotice element, like so:

  <p data-l10n-id="fxa-notice">
    <a id="fxa-terms" href="…"></a><a id="fxa-privacy" href="…"></a>
  </p>

You don't need any en-US text content in there, but you can put the desired values for ids and hrefs.  Then, you'd only need one l10n string:

  fxa-notice = By proceeding, I agree to the <a>Terms of Service</a> and <a>Privacy Notice</a> of Firefox cloud services.

The translation text will be inserted around and into the <a> elements defined in the source.  They will be the same elements, so any event listeners attached earlier will continue to work.  The localizer doesn't have to worry about the URL nor can they break the id.  In fact, even if they try to add an id, or class, or style, or onclick to the <a> element in the translation, l10n.js will securely filter it out.

I like this example also because it demonstrates a limitation of the current DOM overlay implementation, which I think is fair.  In case two (or more) elements of the same type are present, the localizer cannot reorder them in the translation.  If there was an input and a button, or a span and an anchor, they could reorder them no problem, because there is no ambiguity as to which element should correspond to which.

Hope this help.  Please do let me know if you any more questions, I'd be more than happy to answer them.

You also probably see now why I'll need some time to write good tests for this.  It's complex, but not too complicated, to quote the Zen of Python :)
Comment 17 User image Staś Małolepszy :stas 2014-06-19 14:18:44 PDT
I meant to post permalinks:

(In reply to Staś Małolepszy :stas from comment #16)

> On runtime, there's a logic defined here which replaces {{tos}} and {{pn}}
> with <a> elements (step 1) and then insert the translation of the text of
> the link (step 2):

https://github.com/mozilla-b2g/gaia/blob/ccd70903544486bea04e85d8a4aacf63f1de2a72/apps/system/fxa/js/screens/fxam_enter_email.js#L65-75
https://github.com/mozilla-b2g/gaia/blob/ccd70903544486bea04e85d8a4aacf63f1de2a72/apps/system/fxa/js/screens/fxam_enter_email.js#L77-81
Comment 18 User image Francesco Lodolo [:flod] 2014-06-20 07:34:39 PDT
(In reply to Staś Małolepszy :stas from comment #16)
>   fxa-notice = By proceeding, I agree to the {{tos}} and {{pn}} of Firefox
> cloud services.
>   fxa-tos = Terms of Service
>   fxa-pn = Privacy Notice

There's one thing I really like in this approach (I'm clearly biased, being the one who suggested it): there's not a single element of markup in the string. There are two variables, but variables are easy to parse, therefore errors on the localization side are easy to spot.

>   fxa-notice = By proceeding, I agree to the <a>Terms of Service</a> and
> <a>Privacy Notice</a> of Firefox cloud services.

That's faster to localize (1 string vs 3 strings), and clearer (no need of localization notes). But this looks also more fragile than the other string, if we ignore locales who like to translate variable names.

Do we have some sort of way to catch if localization removes one of the html element? If that's the case, the second approach is definitely better.

> In case two (or more) elements of the same type are present, 
> the localizer cannot reorder them in the translation.

That would definitely be a limitation developers must be aware of. Out of curiosity, I checked what locales are currently doing with that string, and nobody inverted tos and pn. If that happens, we'd end up with the wrong URL.
Comment 19 User image Zibi Braniecki [:gandalf][:zibi] 2014-06-20 08:33:41 PDT
(In reply to Francesco Lodolo [:flod] from comment #18)
> (In reply to Staś Małolepszy :stas from comment #16)
> >   fxa-notice = By proceeding, I agree to the {{tos}} and {{pn}} of Firefox
> > cloud services.
> >   fxa-tos = Terms of Service
> >   fxa-pn = Privacy Notice
> 
> There's one thing I really like in this approach (I'm clearly biased, being
> the one who suggested it): there's not a single element of markup in the
> string.

That's also its limitation.

> >   fxa-notice = By proceeding, I agree to the <a>Terms of Service</a> and
> > <a>Privacy Notice</a> of Firefox cloud services.
> 
> That's faster to localize (1 string vs 3 strings), and clearer (no need of
> localization notes). But this looks also more fragile than the other string,
> if we ignore locales who like to translate variable names.

What do you mean by fragile? I believe we do expect our localizers to understand basic HTML when they localize HTML markup.

> Do we have some sort of way to catch if localization removes one of the html
> element? If that's the case, the second approach is definitely better.

I don't think we have now, but since the markup will be a clean DOMFragment it'll be easy to add if we want it.

> > In case two (or more) elements of the same type are present, 
> > the localizer cannot reorder them in the translation.
> 
> That would definitely be a limitation developers must be aware of. Out of
> curiosity, I checked what locales are currently doing with that string, and
> nobody inverted tos and pn. If that happens, we'd end up with the wrong URL.

That may be a reason why they don't do this ;) Ability to control the order of elements inside DOMFragment is an important feature of DOMOverlays to me.
Comment 20 User image Francesco Lodolo [:flod] 2014-06-20 08:45:32 PDT
(In reply to Zibi Braniecki [:gandalf] from comment #19)
> What do you mean by fragile? I believe we do expect our localizers to
> understand basic HTML when they localize HTML markup.

Working daily on mozilla.org localization, it's easy for an "</a>" to become "<a>", or get lost in translation. If we have checks in place, definitely better.
Comment 21 User image Vivien Nicolas (:vingtetun) (:21) - (NOT reading bugmails, needinfo? please) 2014-07-11 06:14:34 PDT
I'm still thinking about that.

There is one thing that makes me a bit uncomfortable by opening the ability for localizers to insert nodes in the document.

The app author may expect the DOM tree to be of a certain shape, and so format the JS code to rely on .firstElementChild, lastElementChild or even CSS rules to touch a particular element based on it's ordering in a list.

Letting localizers alter this may break the assertions made by the developer when he wrote the code, or did I misunderstood something ?
Comment 22 User image Staś Małolepszy :stas 2014-07-11 06:28:19 PDT
(In reply to Vivien Nicolas (:vingtetun) (:21) - (NOT reading bugmails, needinfo? please) from comment #21)

> Letting localizers alter this may break the assertions made by the developer
> when he wrote the code, or did I misunderstood something ?

I think it boils down to educating developers about this.  We taught them not to regexp for English strings and not to append "s" to create plurals, so I think we can teach them about this too :)

You're right:  they won't be able to make such assumptions anymore.  Which is good, because they shouldn't, really.  If a translation requires to reverse the order of child elements, it should be allowed to do it.

In general, developers should treat localized content as black box:  they shouldn't tinker with it, parse it, String.replace it, nothing.  DOM overlays approach solves a common problem while keeping the black box paradigm.
Comment 23 User image Vivien Nicolas (:vingtetun) (:21) - (NOT reading bugmails, needinfo? please) 2014-08-04 07:46:42 PDT
Comment on attachment 8443028 [details] [diff] [review]
Gaia patch, no tests yet, less reordering of code

As discussed in the meeting, I will be much more comfortable if those changes can prove that they don't broke any tests in the future. Having the testing infrastructure works for those localization bits would allow the developer to create a test and ensure it works, whatever the localizer is doing.
Comment 24 User image Zibi Braniecki [:gandalf][:zibi] 2014-08-13 21:18:38 PDT
As a temporary step on the way to this, I'd like to simplify setTextContent o just set textContent. I filed bug 1053629.
Comment 25 User image Frederik Braun [:freddyb] 2014-08-21 08:17:39 PDT
Just to chime in from a security perspective: This is potentially bad for security, considering that innerHTML might bring us into XSS and content injection troubles: https://github.com/mozilla-b2g/gaia/blob/1166fa5ebeeaac147e97e4cf71b784ddbc7cd91d/shared/js/l10n.js#L1711
Comment 26 User image Zibi Braniecki [:gandalf][:zibi] 2014-10-06 17:22:27 PDT
Frederik: that's exactly what we're planning to remove. DOM Overlays branch does not use innerHTML at all and we have a bug (bug 1027117) to remove all innerHTML uses once we land DOM Overlays.

Here are a few examples where we currently use custom approaches to interpolate localizable DOM Fragments:
 - https://github.com/mozilla-b2g/gaia/blob/71238815f3e632d26d78253e041a4428c4946a76/apps/calendar/js/templates/account.js#L19
 - https://github.com/mozilla-b2g/gaia/blob/71238815f3e632d26d78253e041a4428c4946a76/apps/clock/js/banner/main.js#L73-L77
 - https://github.com/mozilla-b2g/gaia/blob/71238815f3e632d26d78253e041a4428c4946a76/apps/settings/js/simcard_lock.js#L63-L67
Comment 27 User image Frederik Braun [:freddyb] 2014-10-07 01:31:52 PDT
We're on the same page then. Thanks for explaining this! :)
Comment 28 User image Zibi Braniecki [:gandalf][:zibi] 2014-10-07 09:52:42 PDT
Freddy's template experiment: https://github.com/mozfreddyb/template-experiment - wondering if we can reuse it in DOM Overlays?
Comment 29 User image Autolander 2015-03-10 22:46:19 PDT
Created attachment 8575788 [details] [review]
[gaia] stasm:994357-dom-overlays > mozilla-b2g:master
Comment 30 User image Staś Małolepszy :stas 2015-03-11 02:24:30 PDT
Comment on attachment 8575788 [details] [review]
[gaia] stasm:994357-dom-overlays > mozilla-b2g:master

Zibi, can you take a look at this rebased pull request?  I got a green try build:

https://treeherder.mozilla.org/#/jobs?repo=gaia-try&revision=7373e02e86c9

and I'm planning on adding more tests early next week.
Comment 31 User image Zibi Braniecki [:gandalf][:zibi] 2015-03-11 10:02:05 PDT
Comment on attachment 8575788 [details] [review]
[gaia] stasm:994357-dom-overlays > mozilla-b2g:master

yesss :)
Comment 32 User image Staś Małolepszy :stas 2015-03-17 10:52:40 PDT
Comment on attachment 8575788 [details] [review]
[gaia] stasm:994357-dom-overlays > mozilla-b2g:master

Zibi, I updated the pull request with a second commit in which HTML values are marked as such in the parser.  Re-requesting review on that part.
Comment 33 User image Ted Clancy [:tedders1] 2015-03-19 13:05:19 PDT
Hey, I'm sorry to be a intrusive latecomer to this when obviously so much work has been done on it already and it's just about to land after several months of development.

But I have some concerns about this, and I think they should be addressed before people start using this feature. Because once people start using this feature, it will be too late to change the way it works, and we all want this to be the best tool out there.

My concerns are the same concerns that have been voiced by others in this thread:

1) The fact that the translator can't re-order the elements. I think this is a MAJOR problem, and it needs to be solved. Sometimes languages require a different word order.

2) It would be nice if the translator didn't have to see HTML tags, because that's a programmer thing, not a translator thing. (After all, if the programmer wants to change to A to an EM, they shouldn't have to re-translate the strings.)

I have a suggestion. 

Using the same example as previous comments, maybe this is what the syntax should look like:

// HTML file

<p data-l10n-id="fxa-notice">
  <a id="fxa-terms" href="…"></a><a id="fxa-privacy" href="…"></a>
</p>


// English l20n file

fxa-notice = By proceeding, I agree to the {{fxa-terms}} and {{fxa-privacy}} of Firefox cloud services.
fxa-terms = Terms of Service
fxa-privacy = Privacy Notice

// French l20n file

fxa-notice = En poursuivant, je suis agréeable à {{fxa-privacy}} et {{fxa-terms}} des Services nuageant de Firefox.
fxa-terms = les Termes de service
fxa-privacy = l'Avis de privé

The magic here is that the id's of the HTML elements are the SAME as the names of the corresponding placeholders in the l20n file. That way, the translator can re-order the placeholders, and L20n can still find the matching HTML tag. Also, the translators deal with placeholders, not HTML tags.

What do you think, Staś? Would this work?

If you agree, I can do the new development work, because I don't want to create more work for you.
Comment 34 User image Zibi Braniecki [:gandalf][:zibi] 2015-03-19 16:52:22 PDT
(In reply to Ted Clancy [:tedders1] from comment #33)
> I have a suggestion. 
> 
> Using the same example as previous comments, maybe this is what the syntax
> should look like:
> 
> // HTML file
> 
> <p data-l10n-id="fxa-notice">
>   <a id="fxa-terms" href="…"></a><a id="fxa-privacy" href="…"></a>
> </p>
> 
> 
> // English l20n file
> 
> fxa-notice = By proceeding, I agree to the {{fxa-terms}} and {{fxa-privacy}}
> of Firefox cloud services.
> fxa-terms = Terms of Service
> fxa-privacy = Privacy Notice
> 
> // French l20n file
> 
> fxa-notice = En poursuivant, je suis agréeable à {{fxa-privacy}} et
> {{fxa-terms}} des Services nuageant de Firefox.
> fxa-terms = les Termes de service
> fxa-privacy = l'Avis de privé
> 
> The magic here is that the id's of the HTML elements are the SAME as the
> names of the corresponding placeholders in the l20n file. That way, the
> translator can re-order the placeholders, and L20n can still find the
> matching HTML tag. Also, the translators deal with placeholders, not HTML
> tags.

I like the proposal because of its simplicity. For many cases that would be totally enough. The problem with it is that we do expect localizers to localize HTML attributes as well. It, unfortunately is a HTML Fragment localization so we do need our localizers to understand basic HTML.

Trying to workaround the fact that we need HTML Fragment translation by creating a simplified pseudo-translation is going to be significantly limiting for what the localizers can do with the fragment.

What you focus on is element ordering, which is also solved by Stas's solution, but what you don't solve is everything else that localizers can do including element styling and ability to create additional semantics.

What I mean by this is that we want to enable localizers to do sth like this:

HTML:

<div l10n-id="foo">
  <p/>
  <a href="http://www.mozilla.org"/>
</div>

en-US:

<foo """
  <p>Hello <strong>world</strong>.</p>
  <a title="Press the left mouse button">Click me mr. Smith</a>
""">

ab-CD:

<foo """
  <p><strong style="text-decoration:none">World</strong>&nbsp;<span style="font-size: 1.1em">hello</span></p>
  <a title="Press the left mouse button">Click me <sup>mr.</sup> Smith</a>
""">

----------------

I believe that there's a difference in the perspective that you expressed in another bug saying "you have more trust in localizers" - and I believe it's true. Over last 10+ years we've built powerful local communities of passionate people who are willing to go above and beyond to create high quality localizations. DOM Overlays is enabling them to produce better quality products if they will be willing to put additional work into learning how.

That doesn't mean they make no mistakes, and we want them to think about "code" as little as possible, but I believe that when we talk about localization of HTML fragments, we should give them ability to localize HTML fragment, not a pseudo-templating stripped-down version of in-between the code and the translation.
Comment 35 User image Staś Małolepszy :stas 2015-03-20 10:29:32 PDT
(In reply to Ted Clancy [:tedders1] from comment #33)
> Hey, I'm sorry to be a intrusive latecomer to this when obviously so much
> work has been done on it already and it's just about to land after several
> months of development.

Thanks for taking time to comment, Ted, it's great that you're sharing comments!

> 1) The fact that the translator can't re-order the elements. I think this is
> a MAJOR problem, and it needs to be solved. Sometimes languages require a
> different word order.

I'll be the first person to agree that languages might require different word order :)  However, in my experience as both a localizers and a developer, it's actually quite rare that HTML elements of the same type (like 2 <a> elements) require a different order.  Surely it's possible that they might -- but if it's an edge case, I think it's reasonable to handle it as such.  Let's file a new bug and add this later on.  I think there's already a lot of value we will be providing with the current implementation which covers many many cases which are not currently possible, so it's already an improvement.

Also, I'd like to point out that some reordering is possible, albeit in a limited form.  Elements of different types can be safely reordered and the algorithm will match them just fine.  For elements of the same type (like 2 <a> elements) we should probably do something similar to what you propose: use identifiers to allow localizers to explicitly state which element is which.  We could either re-use the id attribute for this, or invent a new one.  See http://informationisart.com/8/ (it's an old blog post but still relevant) and bug 922576 comment 11 for some ideas.

> 2) It would be nice if the translator didn't have to see HTML tags, because
> that's a programmer thing, not a translator thing. (After all, if the
> programmer wants to change to A to an EM, they shouldn't have to
> re-translate the strings.)

I don't think the objective should be to protect localizers from using HTML.  Especially when it comes down to text-level semantics [1].  We want localizers to be able to use &nbsp; for spaces before punctuation, <em> for borrowed words, <sup> for Madame, <bdi> for embedded text etc.  Furthermore, as Zibi mentioned, we also want them to localize attributes:  titles, labels, placeholders etc.

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/text-level-semantics.html

> I have a suggestion. 
> […snip…]
> What do you think, Staś? Would this work?

That's an interesting suggestion.  One way to make it more robust would be to not use the id attribute but something custom (data-l10n-something) so that we don't run into problems when one string is used multiple times.  I'm not convinced however that the goal should be to avoid HTML at all cost.  I do believe that text-level HTML is easy enough for localizers to use it.  It's probably easier than Mediawiki's syntax for example, especially given the fact that DOM overlays remove the need for localizers to deal with non-localizable attributes like class, style, href etc.

Is

  By proceeding, I agree to the {{fxa-terms}} and {{fxa-privacy}} of Firefox cloud services.

really all that easier than the following?

  By proceeding, I agree to the <a>Terms of Service</a> and <a>Privacy Notice</a> of Firefox cloud services.

Admittedly, the latter example makes it easier to see the whole translation at once and to make edits without having to look up the values of the placeables.  It allows the localizers to use other text-level elements, like em, sup etc; and to add title attributes to <a> elements.
Comment 36 User image Staś Małolepszy :stas 2015-03-20 12:10:15 PDT
Zibi, can you take a look at https://github.com/stasm/gaia/commit/08d2f619691fae02e2a428de29ec07debe3ee401 ?
Comment 37 User image Ted Clancy [:tedders1] 2015-03-20 12:19:15 PDT
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #34)

Hi Zibi,

> What you focus on is element ordering, which is also solved by Stas's
> solution

I don't think Stas's patch actually does take care of element ordering.

What I'm talking about is a situation like this:

// index.html
<p data-l10n-id="donation-notification">
  <a id="recipient" href="recipient.html">
  <a id="gift" href="gift.html">
</p>

// en_US.l10n
donation-notification = You have given <a>{{$recipient.name}</a> a <a>{{$gift}}</a>!

// fr_CA.l10n
donation-notification = Tu as donné <a>{{$gift}}</a> à <a>{{$recipient.name}}</a> !

In English, it's:
You have given <a id="recipient" href="recipient.html">Susan</a> a <a id="gift" href="gift.html">coin</a>!

In French, it's: 
Tu as donné <a id="recipient" href="recipient.html">une pièce de monnaie</a> à <a id="gift" href="gift.html">Susan</a> !".

The problem is, the links are backwards. Even if you set the href's using javascript, it still won't work, because the id's are backwards too.

But I now understand the point you and Staś are making, and I'll address it further in my response to Staś below.
Comment 38 User image Zibi Braniecki [:gandalf][:zibi] 2015-03-20 12:27:24 PDT
(In reply to Staś Małolepszy :stas from comment #36)
> Zibi, can you take a look at
> https://github.com/stasm/gaia/commit/
> 08d2f619691fae02e2a428de29ec07debe3ee401 ?

I like the code in bindings now, I'm concerned about the amount of Arrays and Objects introduced in Resolver. I'd like to talk about how can we avoid that, at least for the hotpath (simple string).
Comment 39 User image Ted Clancy [:tedders1] 2015-03-20 13:43:51 PDT
(In reply to Staś Małolepszy :stas from comment #35)

Hi Staś,

> We want
> localizers to be able to use &nbsp; for spaces before punctuation, <em> for
> borrowed words, <sup> for Madame, <bdi> for embedded text etc.  

That's a good point. I see the need for that now. Although there are Unicode characters that can replace &nbsp; and <bdi>, Unicode plaintext is inadequate for superscript and italics, so we'll need HTML tags.

(I consider this to be a fault in Unicode, by the way. The Unicode committee thinks that italics are primarily a presentation issue, not a semantic issue. They are wrong.)

I guess what I'm hoping for is a stronger separation between the tags which are the programmer's concern (a, span, mark) and the tags which are the translator's concern (i, sup, wbr, rt, ...). It's nice that the translator can insert a <sup> tag, but what if the programmer was using <sup> for something else? For example, "Hello {{$name}}. <sup onclick="showTOS();">Click here for Terms of Service</sup>".

If the translator sets $name to "M<sup>me</sup> LeClerc", what happens?

We currently have a list of whitelisted tags. Could we possibly split that into two lists: a list of "translator-safe" tags, and a list of "non-translator-safe" tags? 

And if a tag is in the translator safe list, then translators can add or remove them at will, which means our algorithm shouldn't try to do matching on those tags. Developers should be warned about using them for any purposes other than their usual purpose, since they can be removed by a translator. (Perhaps we issue a warning if developers set any attributes on those tags.) 

Zibi gave an example where a translator changes a tag from <strong> to <strong style="text-decoration:none">. I think it would be easier if the translator could just remove the <strong> tag.

And if a tag isn't on the translator-safe list, the translator should be prevented from adding or removing them. Each tag in the source string needs a match in the translation.

> Elements of different types can be safely reordered and the
> algorithm will match them just fine.  For elements of the same type (like 2
> <a> elements) we should probably do something similar to what you propose:
> use identifiers to allow localizers to explicitly state which element is
> which.  We could either re-use the id attribute for this, or invent a new
> one.

Here's an idea: For elements on the non-translator-safe list (a, span, mark), we require them to have id's. The id's are included in the translation file, so translators are able to re-order them. Our algorithm will match based on tag and id, so the order won't matter.

And if a tag is on the translator-safe list, we *prevent* them from having an id, since they might be removed by a translator and the programmer shouldn't be trying to do anything special with them.

How does that sound?

> Let's file a new bug and add this later on.  I think there's
> already a lot of value we will be providing with the current implementation
> which covers many many cases which are not currently possible, so it's
> already an improvement.

If you land it, will people start using it? I'm just worried that once people are using it, it'll be harder to change later. There will be pressure from users to not change it.

But if you want to land it, go ahead. It is an improvement, as you say. I'll file a bug for the other issues.
Comment 40 User image Ted Clancy [:tedders1] 2015-03-20 13:48:59 PDT
Just to continue the example I gave before, this is what I'm suggesting it should look like:

// index.html
<p data-l10n-id="donation-notification">
  <a id="recipient" href="recipient.html">
  <a id="gift" href="gift.html">
</p>

// en_US.l10n
donation-notification = You have given <a id="recipient">{{$recipient.name}</a> a <a id="gift">{{$gift}}</a>!

// fr_CA.l10n
donation-notification = Tu as donné <a id="gift">{{$gift}}</a> à <a id="recipient">{{$recipient.name}}</a> !
Comment 41 User image Zibi Braniecki [:gandalf][:zibi] 2015-03-20 14:02:50 PDT
(In reply to Ted Clancy [:tedders1] from comment #39)
> (I consider this to be a fault in Unicode, by the way. The Unicode committee
> thinks that italics are primarily a presentation issue, not a semantic
> issue. They are wrong.)

Very interesting point. I believe that with DOM Overlays we'll be able to identify the subset of tags that people actually need for their locales.

> Zibi gave an example where a translator changes a tag from <strong> to
> <strong style="text-decoration:none">. I think it would be easier if the
> translator could just remove the <strong> tag.

We have to allow for that.

> If you land it, will people start using it? I'm just worried that once
> people are using it, it'll be harder to change later. There will be pressure
> from users to not change it.

I believe that we'll have time to work on that even after we land it. We're currently closing 2.2 cycle, and we aim DOM Overlays for 3.0. There's a lot of time in 3.0 cycle before we'll start getting close to l10n-freeze and localization cycle. During that time, some adventurous localizers will play with that and we'll get feedback and also start using it for en-US identifying use cases, patterns and anti-patterns.

I would like us to aim at having this pattern stable within the next six months tbh, and if we can get there in this timeframe, I don't think we'll have any legacy l10n to update.

Also, I know Stas expressed it already, but I wanted to say that it's awesome to see your perspective on our solutions. It's rare. Thanks so much for taking time to share it and help us get a better APIs!
Comment 42 User image Staś Małolepszy :stas 2015-03-23 15:33:52 PDT
(In reply to Ted Clancy [:tedders1] from comment #39)
> (In reply to Staś Małolepszy :stas from comment #35)
> 
> […] but what if the programmer was using <sup> for
> something else? For example, "Hello {{$name}}. <sup
> onclick="showTOS();">Click here for Terms of Service</sup>".
> 
> If the translator sets $name to "M<sup>me</sup> LeClerc", what happens?

Good point.  The solution in its current form doesn't address this scenario.

> We currently have a list of whitelisted tags. Could we possibly split that
> into two lists: a list of "translator-safe" tags, and a list of
> "non-translator-safe" tags? 
> 
> And if a tag is in the translator safe list, then translators can add or
> remove them at will, which means our algorithm shouldn't try to do matching
> on those tags. Developers should be warned about using them for any purposes
> other than their usual purpose, since they can be removed by a translator.
> (Perhaps we issue a warning if developers set any attributes on those tags.) 

The idea of warning developers about using text-level semantic elements which might be worth considering, at least now until we add some sort of solution for the problem you presented.

> And if a tag isn't on the translator-safe list, the translator should be
> prevented from adding or removing them. Each tag in the source string needs
> a match in the translation.

But that doesn't solve the reordering problem which you gave examples of, does it?

> Zibi gave an example where a translator changes a tag from <strong> to
> <strong style="text-decoration:none">. I think it would be easier if the
> translator could just remove the <strong> tag.

Yes, the localizer can remove the <strong> tag.  There's no explicit requirement to include all the tags from the original string in the translation.  (Furthermore the style attribute is not allowed on any elements in DOM overlays, so <strong style="text-decoration:none"> wouldn't actually work).

 > Here's an idea: For elements on the non-translator-safe list (a, span,
> mark), we require them to have id's. The id's are included in the
> translation file, so translators are able to re-order them. Our algorithm
> will match based on tag and id, so the order won't matter.
> 
> And if a tag is on the translator-safe list, we *prevent* them from having
> an id, since they might be removed by a translator and the programmer
> shouldn't be trying to do anything special with them.
> 
> How does that sound?

These are good ideas.  I'll admit that I dislike id's because some strings might be used many times in one document.  So maybe something like data-l10n-source could work in your proposal?

However, there's one thing about identifiers which I'm not fond of:  they need to be defined by developers, which means that if developers forget to do it, localizers don't have any means of re-ordering elements. This is asking for localizability bugs.  The blog post which I linked to in comment 35 mentions l10n-path (or better: data-l10n-path) which makes use of XPath to specify the relationships between child elements of the source and the translation.  Granted this is a bit more complex than identifiers but in its simplest form (a[1]) still pretty easy to do and also much more powerful and not depending on additional work by the developer.

With this approach, <sup data-l10n-path="sup[1]"> means to match this <sup> in the translation with the first <sup> in source.  We might also need something to say: don't match this with anything;  maybe an empty data-l10n-path=""?

Like Zibi, I also think it's safe to land this as-is right now and then work on the details of the re-ordering solution in a follow-up bug.
Comment 43 User image Ted Clancy [:tedders1] 2015-03-24 17:10:51 PDT
(In reply to Staś Małolepszy :stas from comment #42)

> However, there's one thing about identifiers which I'm not fond of:  they need to be defined by developers, which means that if developers forget to do it, localizers don't have any means of re-ordering elements. 

Yeah, that's why I think providing an id should be compulsory on tags like <a> or <span>. We should throw an exception if there's a non-translator-safe tag without an id.

> Like Zibi, I also think it's safe to land this as-is right now and then work
> on the details of the re-ordering solution in a follow-up bug.

Okay. Sounds good.
Comment 44 User image Zibi Braniecki [:gandalf][:zibi] 2015-03-25 15:48:18 PDT
Comment on attachment 8575788 [details] [review]
[gaia] stasm:994357-dom-overlays > mozilla-b2g:master

Ok, the code looks good. I guess we'll have a chance to revisit Resolver architecture with 3.0 and maybe come up with a better solution than passing locals, but for now I like it more than the previous patch.
Comment 45 User image Staś Małolepszy :stas 2015-03-29 00:34:51 PDT
I added tests and the latest build is green:

https://treeherder.mozilla.org/#/jobs?repo=gaia&revision=a8bafb01c9f0e07f4dfb80b1c9d8b960574cc796

Let's land this.
Comment 46 User image Autolander 2015-03-29 01:09:09 PDT
http://docs.taskcluster.net/tools/task-graph-inspector/#FwsCsVZqTayT2HWe-XAsMQ

The pull request failed to pass integration tests. It could not be landed, please try again.
Comment 47 User image Staś Małolepszy :stas 2015-03-29 03:18:53 PDT
Looks like the taskcluster failures weren't related to the patch:

fatal: unable to access 'https://github.com/mozilla-b2g/gaia/': gnutls_handshake() failed: A TLS packet with unexpected length was received.  

Let's try again.
Comment 48 User image Autolander 2015-03-29 03:42:36 PDT
Pull request has landed in master: https://github.com/mozilla-b2g/gaia/commit/be25b16efa19bab8d54be08f8fe45dcc93bf93d0
Comment 50 User image Staś Małolepszy :stas 2015-03-29 04:14:18 PDT
(In reply to Staś Małolepszy :stas from comment #42)
> Like Zibi, I also think it's safe to land this as-is right now and then work
> on the details of the re-ordering solution in a follow-up bug.

I filed bug 1148895 about this.
Comment 51 User image James Lal [:lightsofapollo] (inactive) 2015-03-30 18:16:29 PDT
https://github.com/mozilla-b2g/gaia/commit/8835f2b85602dd2a7a9961623cbdf03ce681d10f

backed out due to infra issues... ( I am going to re-land this myself soon)
Comment 52 User image Staś Małolepszy :stas 2015-03-30 23:14:34 PDT
I think bug 1142526 is responsible for the infra problems because it makes l10n.js use the DOM overlays mechanism (in particular, <template> which causes bug 1084502 in pre-36 Gecko) on existing translations.  This bug alone should be safe to reland.
Comment 53 User image Autolander 2015-03-30 23:19:26 PDT
Created attachment 8585900 [details] [review]
[gaia] stasm:994357-reland > mozilla-b2g:master
Comment 54 User image Staś Małolepszy :stas 2015-03-31 00:01:33 PDT
Comment on attachment 8585900 [details] [review]
[gaia] stasm:994357-reland > mozilla-b2g:master

Carry over r+.
Comment 55 User image Autolander 2015-03-31 00:42:07 PDT
Pull request has landed in master: https://github.com/mozilla-b2g/gaia/commit/db95aaf2a25f8e948c614e0d6e933c55ba3e9f61
Comment 56 User image Zibi Braniecki [:gandalf][:zibi] 2015-04-17 12:47:29 PDT
Stas, is that really fixed?

Can people start using it or due to bug 1142526 they should not?
Comment 57 User image Staś Małolepszy :stas 2015-04-21 09:06:49 PDT
We need to fix the bug which caused bug 1142526 to be backed out first. See bug 1142526 comment 8.  The bug is that some builder automation still uses an old version of Gecko/XRE which causes build failures with DOM overlays.  This only happens on certain pvt builds; custom builds and treeherder use newer Geckos and all works fine there, but we can't start using DOM overlays just now.  Can you help me get in touch with :jlal re. the backout of bug 1142526, please?
Comment 58 User image Zibi Braniecki [:gandalf][:zibi] 2015-04-21 09:10:36 PDT
Ok, just wanted a confirmation that DOM Overlays are not on yet.
Comment 59 User image Staś Małolepszy :stas 2015-04-27 10:45:55 PDT
I re-landed bug 1142526 and will monitor the tree today and tomorrow.  James says all should be fine regarding the previous infra issues.
Comment 60 User image Ted Clancy [:tedders1] 2015-06-03 14:34:54 PDT
This is needed to land Bug 1154438 on v2.2.
Comment 61 User image Ted Clancy [:tedders1] 2015-06-03 14:39:52 PDT
Staś, there are two patches attached to this ticket. Which one do I apply for uplift on?
Comment 62 User image Ted Clancy [:tedders1] 2015-06-04 12:24:36 PDT
Comment on attachment 8585900 [details] [review]
[gaia] stasm:994357-reland > mozilla-b2g:master

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 
RTL support for B2G (Bug 906270)

This patch is required for Bug 1166203, which is confirmed as blocking 2.2. (And probably a bunch of similar issues which haven't been spotted yet.)

User impact if declined: 
Punctuation marks will appear in the wrong place when LTR phrases appear within RTL text, or vice versa.

Testing completed: 
This code has been on master for months already with no problems.

Risk to taking this patch (and alternatives if risky): 
None foreseen.

String or UUID changes made by this patch:
None.
Comment 63 User image Staś Małolepszy :stas 2015-06-04 20:29:15 PDT
(In reply to Ted Clancy [:tedders1] from comment #61)
> Staś, there are two patches attached to this ticket. Which one do I apply
> for uplift on?

I'm OOTO, sorry for not responding earlier.  The -reland one is the right one.
Comment 64 User image Josh Cheng [:josh] 2015-06-06 22:13:50 PDT
Approving for 2.2 as this blocks bug 1154438.

Note You need to log in before you can comment on or make changes to this bug.