Implement document walker to grab webpage's text content and format it for translation service

VERIFIED FIXED in Firefox 31

Status


Type: defect
Status: VERIFIED FIXED
Opened: 6 years ago
Closed: 5 years ago

People

(Reporter: Felipe, Assigned: Felipe)

Tracking

Trunk
Firefox 31
Points:
---
Dependency tree / graph
Bug Flags:
firefox-backlog +
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [translation] p=13 s=it-31c-30a-29b.2 [qa-])

Attachments

(1 attachment, 5 obsolete attachments)

When a webpage's content has finished loading, we need to grab all of the page's text content and parse it into a form that is suitable for the translation service.

It must generate a tree similar to a DOM tree, but it must skip non-textual nodes that may exist inside the page's body (such as <script>, <style>, others?).
It also does not need to preserve the whole tree structure, as it can discard any nodes which do not directly have text content.
And it should assign unique ids to each node, which do not need to reflect the ids from the webpage.

For example, the following page:

<html>
<body>
  <h1 id="header" class="h1class">Welcome to <b>Felipe's</b> webpage</h1>
  <div>My Sidebar
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
    </ul>
  </div>

  <div class="main-page">
    <p>This is the main page's body</p>
  </div>

  <script>analyticsTracking();</script>
  <style>div { color: blue }</style>
</body></html>

Should get a tree (in a JS struct) like:

<body>
  <h1 id="1">Welcome to <b id="2">Felipe's</b> webpage</h1>
  <div id="3">My Sidebar
    <li id="4">Item 1</li>
    <li id="5">Item 2</li>
  </div>
  <p id="6">This is the main page's body</p>
</body>


In addition, this module needs to be able to parse a string serialization of this structure (as above) back into an equal JS struct.
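The rules above (skip non-content tags, discard elements with no direct text, assign sequential ids) can be sketched roughly like this. This is an illustrative recursion over plain node-like objects, not the actual patch (which, per a later comment, deliberately avoids recursion); the `el`/`txt` mock helpers and all names here are hypothetical:

```javascript
// Mock node constructors so the sketch is self-contained and testable
// outside a browser. Real code would walk actual DOM nodes.
const el = (localName, ...children) => ({ nodeType: 1, localName, childNodes: children });
const txt = s => ({ nodeType: 3, nodeValue: s });

// Tags whose text content should never reach the translation service.
const SKIP_TAGS = new Set(["script", "style", "noscript"]);

function buildTranslationList(root) {
  let items = [];
  let nextId = 1;
  (function walk(parent) {
    for (let child of parent.childNodes) {
      if (child.nodeType === 3 && /\S/.test(child.nodeValue)) {
        // Parent directly contains non-blank text: assign it one id.
        if (!items.some(item => item.node === parent)) {
          items.push({ id: nextId++, node: parent });
        }
      } else if (child.nodeType === 1 && !SKIP_TAGS.has(child.localName)) {
        // Structure-only elements are descended into but get no id,
        // which flattens <ul> and the content-less <div> away.
        walk(child);
      }
    }
  })(root);
  return items;
}
```

Run against a mock of the example page, this yields six items with ids 1-6 on h1, b, div, li, li, and p, matching the expected tree above.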
Blocks: 971048
Blocks: 971054
Blocks: 971060
Whiteboard: p=0
Assignee: nobody → felipc
Status: NEW → ASSIGNED
Can't you use the DOM TreeWalker? <https://developer.mozilla.org/en/docs/Web/API/TreeWalker>

(Note that we may have some performance problems with it, it's probably not very commonly used, so may not be well optimized.)
Yep, that's what I'm using! https://hg.mozilla.org/users/felipc_gmail.com/fx-translate/file/8aa04f1ceb76/content/DocumentTokenizer.jsm (note: not complete yet)

But I still need some clever acceptNode code to skip unnecessary nodes to generate the minimal output possible.

And on the other direction I intend to use DOMParser to get the result back as a DOM tree.
Posted patch TranslationDocument (obsolete) — Splinter Review
Here it is, the first complete version of this module, which I called TranslationDocument as it will represent a document during and after the process of being translated.

This module has the following functionality:
 - Walks through a document to generate a related JS tree representing all the content from the document which is useful for translation

 - From this tree, it re-generates an HTML representation of it (which can be broken down into several chunks) which is suitable to be submitted to the translation service

 - Receives the HTML serialized string that was returned by the translation service, parses it and adds the information to its document's tree representation

 - Swaps the original/translated content from the document


Some of the functions look unnecessarily complicated, but a big part of it is because I went to great lengths to not use any recursive algorithms.
The _init function is especially complicated and I'm not 100% confident that it's correct, but it has been working. I might revisit it.


The getNthNonEmptyTextNode function et al. are necessary for the following reason. Consider the following example:

English:    <p>Welcome to <b>Mozilla's</b> website</p>

Portuguese: <p>Bem-vindo à página da <b>Mozilla</b></p>

See how going from en to pt, what were two text nodes (one before, one after the <b> tag), became only one text node with <p> as a parent.

In the other direction it's tricky too, as a single text node (first child of <p>) is being split up into two..

The addTranslation/swapText functions handle these cases well. There are other trickier edge cases that involve restructuring the tree or duplicating elements.. I don't think we necessarily want to handle them perfectly because changing anything in a page structure, other than the content of text nodes, has a big chance of breaking the page's style or even scripts. (and text nodes are already a non-0 risk)
Comment on attachment 8380955 [details] [diff] [review]
TranslationDocument

Liberally sending feedback requests here.. Anyone who wants to take a look is more than welcome!
Attachment #8380955 - Flags: feedback?(ehsan)
Attachment #8380955 - Flags: feedback?(hsivonen)
Attachment #8380955 - Flags: feedback?(gavin.sharp)
Actually the _init function can be ignored for now as I'm reworking it. All other functions are stable and I don't intend to rewrite them before getting more feedback.
Blocks: 973288
Blocks: 976554
Blocks: 976556
Blocks: 973275
Comment on attachment 8380955 [details] [diff] [review]
TranslationDocument

Review of attachment 8380955 [details] [diff] [review]:
-----------------------------------------------------------------

I have to say I didn't spend a lot of time on this patch, and my feedback below is very high level.

I have one architectural question.  It seems like your goal is to extract sentences/phrases from the original document that may span different HTML elements, is that correct?  If it is, you need to pay attention to inline/block elements and try to break sentences/paragraphs around the boundaries of block elements (and deal with things like line breaks, tables, and special text content such as image alt text for accessibility, etc.)  As far as I can see, this patch doesn't do any of that work, and I doubt that it will work on all kinds of documents.

::: browser/modules/translation/TranslationDocument.jsm
@@ +151,5 @@
> +          unwind = true;
> +        }
> +     }
> +
> +    } while (limit--);

So this won't give you any performance guarantees since the amount of work that _createObjFor needs to do is determined by content.  Which is a fancy way to say that this will cause janks.  I think you need to do something fancier here, such as yielding to the event loop.  But then you have to deal with dynamic changes...

@@ +164,5 @@
> +  acceptNode: function(node) {
> +    return globalAcceptNode(node);
> +  },
> +
> +  _createObjFor: function(node) {

How do you deal with content like this here?

 <p>Here's my w<span>or</span>d</p>

@@ +178,5 @@
> +      // <td> tags stripped from their <table><tr> structures are
> +      // ignored by the parser. Change them to a <div> instead,
> +      // since the tag type does not really matter, what matters
> +      // is only the tag id.
> +      get tag() this.tagName != "td" ? this.tagName : "div"

This is risky business.  For example, you're not handling th here.  Should you be?

@@ +201,5 @@
> +   * point to the same objects (i.e., new ones are not created here).
> +   *
> +   * In case translation was chunked, this function is able to
> +   * produce a final translation by being called once
> +   * for each resulting chunk.

How do you deal with dynamic changes in the document between the time when you extract your tree and the time when the translation is available?

@@ +359,5 @@
> +
> +  let nodeName = node.nodeName.toLowerCase();
> +
> +  // Ignore non-text nodes that commonly appear in the body
> +  if (["script", "style", "iframe", "code", "noscript"].indexOf(nodeName)) {

This seems very fragile.  Why not test node.style.display == "none" or something?

Also, how do you deal with SVG/MathML content here?

@@ +392,5 @@
> + * while still keeping it a valid HTML structure (i.e., closing all the open tags and
> + * reopening them in the following chunk). There may be node duplication on the edges
> + * of two consecutive chunks, but that is expected. The contents will not be duplicated,
> + * only the node with the same id will be seen in the two chunks. It's necessary for
> + * addTranslation to rebuild that entire node with the translation results.

This scares me a lot.  Can you prove that this function is correct?

@@ +434,5 @@
> +  let str = "";
> +
> +  function closeOpen() {
> +    while (siblingStack.length && siblingStack[0] == 0) {
> +      str += "</" + openStack.shift().tag + ">";

Where does .tag come from?  If you cannot prove that it's sanitized, this can be XSS-prone.
Attachment #8380955 - Flags: feedback?(ehsan) → feedback-
(In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness, emailapocalypse) from comment #1)
> Can't you use the DOM TreeWalker?
> <https://developer.mozilla.org/en/docs/Web/API/TreeWalker>
> 
> (Note that we may have some performance problems with it, it's probably not
> very commonly used, so may not be well optimized.)

If we have even had performance problems with TreeWalker, what benefit does it provide over implementing the canonical DOM walk loop?

        var current = root;
        var next;
        domwalk: for (;;) {
            if (current.nodeType == Node.ELEMENT_NODE) {
                // do stuff
                
                if ((next = current.firstChild)) {
                    current = next;
                    continue;
                }
            }
            for (;;) {
                if ((next = current.nextSibling)) {
                    current = next;
                    break;
                }
                current = current.parentNode;
                if (current == root)
                    break domwalk;
            }
        }
Comment on attachment 8380955 [details] [diff] [review]
TranslationDocument

>+    let unwind = false;

Having to keep track of unwinding like this suggests that TreeWalker is getting in the way and walking the tree yourself (see my previous comment) might make the code easier to follow.

>+    treeWalker.firstChild() || treeWalker.nextNode();

(Again, TreeWalker is weird.)

>+    let tagName = node.nodeName.toLowerCase();

This is a bug. Please use node.localName (after filtering out nodes that aren't in the HTML namespace). (tagName/nodeName should never be used for anything.)

>+      // <td> tags stripped from their <table><tr> structures are
>+      // ignored by the parser. Change them to a <div> instead,
>+      // since the tag type does not really matter, what matters
>+      // is only the tag id.

If the translation service doesn't care about the semantics of HTML tags (does it indeed not?), why bother giving it anything but spans?

>+    // XXX DOMParser not available from jsm, what to do?

MDN says:
var parser = Components.classes["@mozilla.org/xmlextras/domparser;1"]
             .createInstance(Components.interfaces.nsIDOMParser);
Attachment #8380955 - Flags: feedback?(hsivonen) → feedback-
(In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness, emailapocalypse) from comment #6)
> Comment on attachment 8380955 [details] [diff] [review]
> TranslationDocument
> 
> Review of attachment 8380955 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I have to say I didn't spend a lot of time on this patch, and my feedback
> below is very high level.
> 
> I have one architectural question.  It seems like your goal is to extract
> sentences/phrases from the original document that may span different HTML
> elements, is that correct?  If it is, you need to pay attention to
> inline/block elements and try to break sentences/paragraphs around the
> boundaries of block elements (and deal with things like line breaks, tables,
> and special text content such as image alt text for accessibility, etc.)  As
> far as I can see, this patch doesn't do any of that work, and I doubt that
> it will work on all kinds of documents.


Hmm, yes and no, depends on what you mean by it. I'm not extracting the content out of nodes and merging or doing anything with them; I'm just creating a smaller tree that doesn't contain blank text nodes or elements that don't directly contain content. So I'm not removing any of the element markers that define the boundaries around text content. This means that all </p> markers are in place, for example. I don't need to do any sentence breaking or anything.

I haven't dealt with content in attributes yet (for things like alt text or input placeholder labels). I'll deal with them later but I have a plan.


> 
> ::: browser/modules/translation/TranslationDocument.jsm
> @@ +151,5 @@
> > +          unwind = true;
> > +        }
> > +     }
> > +
> > +    } while (limit--);
> 
> So this won't give you any performance guarantees since the amount of work
> that _createObjFor needs to do is determined by content.  Which is a fancy
> way to say that this will cause janks.  I think you need to do something
> fancier here, such as yielding to the event loop.  But then you have to deal
> with dynamic changes...

"the amount of work is determined by content": do you mean for the text node case, because of the text length?
It is a hard problem, because ultimately this code needs to walk the DOM, and that can only be done in the main thread. The idea for the |limit| was just to put a finite limit on how much time we spend here, and then we can optimize the function more and more if it's causing problems.

Yielding to the event loop would be a _nightmare_, because content could keep changing in a way that makes us never finish walking it.. And handling the dynamic changes would be no small feat.  Unless we can only yield to UI events while keeping content paused, like alert does. But that probably opens another can of worms.

However, an idea occurred to me. I think that for text nodes, instead of copying the string right away, I could also create an obj for it with a weakRef to the node. This way it would be fixed time, and I only need to copy the string when serializing the tree into HTML again. Then we can sort of know that there's an upper bound to how much time it takes to traverse 10000 nodes.
But acceptNode still needs access to the string and run a regex on it, so I don't know if this would help.

Realistically the only solution to this is not running everything in one thread ;) But until there, we need to optimize things around this limitation.

> 
> @@ +164,5 @@
> > +  acceptNode: function(node) {
> > +    return globalAcceptNode(node);
> > +  },
> > +
> > +  _createObjFor: function(node) {
> 
> How do you deal with content like this here?
> 
>  <p>Here's my w<span>or</span>d</p>

I don't expect any translation engine or our code to deal with a word broken down like that in any reasonable way. The engines mostly treat tags as empty spaces (separators), so it looks like "w or d".
We can't even know if it visually looks like a single word; <span> could have margins or a display:block..

The important part is allowing inline elements to move freely around its siblings (other elements or text nodes).

> 
> @@ +178,5 @@
> > +      // <td> tags stripped from their <table><tr> structures are
> > +      // ignored by the parser. Change them to a <div> instead,
> > +      // since the tag type does not really matter, what matters
> > +      // is only the tag id.
> > +      get tag() this.tagName != "td" ? this.tagName : "div"
> 
> This is risky business.  For example, you're not handling th here.  Should
> you be?
> 

Yeah I don't know yet how to cover 100% of cases here

> @@ +201,5 @@
> > +   * point to the same objects (i.e., new ones are not created here).
> > +   *
> > +   * In case translation was chunked, this function is able to
> > +   * produce a final translation by being called once
> > +   * for each resulting chunk.
> 
> How do you deal with dynamic changes in the document between the time when
> you extract your tree and the time when the translation is available?

We hold weak refs to the original elements, so if they're gone we just don't do anything about it. Not done yet in this patch but I'm also thinking of checking if the original text is still there unchanged before replacing it with the translation.
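The strategy described here (weak ref plus an original-text check before swapping) could look something like this sketch. The shape of `item` and the `nodeRef` wrapper are assumptions for illustration, not the patch's actual data structures:

```javascript
// Before applying a translated string, confirm the weakly-held text node
// still exists AND still carries the original text; otherwise the page
// mutated underneath us and we silently skip this node.
function applyTranslation(item) {
  let node = item.nodeRef.get();        // weak reference; null if the node is gone
  if (!node || node.nodeValue !== item.original) {
    return false;                       // stale: do nothing rather than clobber
  }
  node.nodeValue = item.translation;
  return true;
}
```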
> 
> @@ +359,5 @@
> > +
> > +  let nodeName = node.nodeName.toLowerCase();
> > +
> > +  // Ignore non-text nodes that commonly appear in the body
> > +  if (["script", "style", "iframe", "code", "noscript"].indexOf(nodeName)) {
> 
> This seems very fragile.  Why not test node.style.display == "none" or
> something?

We can't test for visibility because many websites have menus, navigation elements, "see more" content etc., that are not always visible but can't go untranslated.

We need to know what types of elements have text nodes in the body of a page but don't have any value as actual content.
(script and style, for example, were throwing most language detection attempts off due to the use of english keywords and property names)

> 
> Also, how do you deal with SVG/MathML content here?
> 

Good point, I'll do as Henri suggested and skip anything not in the html namespace.
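Combining Henri's namespace suggestion with the tag blacklist might look like this sketch (hypothetical; the quoted patch differs, and note its `indexOf(nodeName)` test is subtly wrong anyway, since indexOf returns a truthy -1 for "not found" and a falsy 0 for the first entry, which a Set membership test avoids):

```javascript
// Accept only elements in the HTML namespace, then skip tags whose text
// is never page content (they were throwing off language detection).
const HTML_NS = "http://www.w3.org/1999/xhtml";
const SKIP = new Set(["script", "style", "iframe", "code", "noscript"]);

function acceptForTranslation(node) {
  return node.namespaceURI === HTML_NS && !SKIP.has(node.localName);
}
```

The namespace check rejects SVG/MathML content wholesale, and using localName sidesteps the tagName/nodeName prefix problem Henri raised.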

> @@ +392,5 @@
> > + * while still keeping it a valid HTML structure (i.e., closing all the open tags and
> > + * reopening them in the following chunk). There may be node duplication on the edges
> > + * of two consecutive chunks, but that is expected. The contents will not be duplicated,
> > + * only the node with the same id will be seen in the two chunks. It's necessary for
> > + * addTranslation to rebuild that entire node with the translation results.
> 
> This scares me a lot.  Can you prove that this function is correct?

Yeah it's no big deal, this function is not very complicated. It's just a traditional depth-first tree traversal, with the addition that we keep track of all the nested elements that we've gone through (i.e., which are open for our currentNode). When we hit the limit we output closing tags for all of them, then start a new string reopening the same.

(Now, I can't prove that _init is correct ;) which is why i'm rewriting that one)
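The close/reopen trick described above can be shown with two small helpers. This is a sketch of the idea only (the names and the `{tag, id}` item shape are assumed, not taken from the patch): when a chunk hits its size limit, every element still open is closed in reverse order, then reopened with the same ids at the start of the next chunk so addTranslation can match nodes across chunks.

```javascript
// Close all currently-open elements so the chunk is valid HTML on its own.
function closeChunk(openStack) {
  return openStack.map(o => "</" + o.tag + ">").reverse().join("");
}

// Reopen the same elements (same ids, no content duplicated) so the next
// chunk picks up exactly where this one left off.
function reopenChunk(openStack) {
  return openStack.map(o => "<" + o.tag + ' id="' + o.id + '">').join("");
}
```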

> 
> @@ +434,5 @@
> > +  let str = "";
> > +
> > +  function closeOpen() {
> > +    while (siblingStack.length && siblingStack[0] == 0) {
> > +      str += "</" + openStack.shift().tag + ">";
> 
> Where does .tag come from?  If you cannot prove that it's sanitized, this
> can be XSS-prone.

This tag is not from a DOM node, it's the tag property from a object created by _createObjFor
(In reply to Henri Sivonen (:hsivonen) from comment #8)
> Comment on attachment 8380955 [details] [diff] [review]
> TranslationDocument
> 
> >+    let unwind = false;
> 
> Having to keep track of unwinding like this suggests that TreeWalker is
> getting in the way and walking the tree yourself (see my previous comment)
> might make the code easier to follow.
> 
> >+    treeWalker.firstChild() || treeWalker.nextNode();
> 
> (Again, TreeWalker is weird.)

Yeah, TreeWalker's API has some disappointing behaviors.. I'm gonna stick with just using nextNode() from TreeWalker or NodeIterator.

I didn't understand from your first comment if you're suggesting that it's better to hand implement a domwalker loop or not. But the main benefit I think is simplicity in just calling nextNode() instead of writing code to implement the same thing..

> 
> >+    let tagName = node.nodeName.toLowerCase();
> 
> This is a bug. Please use node.localName (after filtering out nodes that
> aren't in the HTML namespace). (tagName/nodeName should never be used for
> anything.)

Sure, I'll do that. But what's the reason that tagName/nodeName should never be used?

> 
> >+      // <td> tags stripped from their <table><tr> structures are
> >+      // ignored by the parser. Change them to a <div> instead,
> >+      // since the tag type does not really matter, what matters
> >+      // is only the tag id.
> 
> If the translation service doesn't care about the semantics of HTML tags
> (does it indeed not?), why bother giving it anything but spans?

What I meant by "doesn't matter" is more for our side, when parsing the translation result. In theory the translation service could care, as it is semantic information that it may use. But it does not need to be perfect.. I think having at least a clear distinction between block-level and inline elements would be useful. I even thought of converting every block level to <p> and every inline to <b> to reduce the payload size. But <p> can't be nested, so I'd need to use div for it.

But this was basically a hack to handle some websites that I've been testing. I still need a better way to handle this case in general as I'm sure there are others. Thoughts?
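The payload-shrinking idea floated above (every block-level element becomes one block tag, everything else becomes an inline tag, keeping only the id for round-tripping) would be a one-liner; this is purely a sketch of that never-landed idea, with an illustrative and deliberately incomplete block-tag list:

```javascript
// Normalize tags for the translation payload: block-level -> <div>
// (since <p> can't be nested), everything else -> <b>. The set below is
// illustrative, not exhaustive.
const BLOCK = new Set(["div", "p", "h1", "h2", "h3", "h4", "h5", "h6",
                       "ul", "ol", "li", "table", "tr", "td", "th",
                       "section", "blockquote", "pre"]);

function normalizedTag(localName) {
  return BLOCK.has(localName) ? "div" : "b";
}
```

This would also subsume the td-to-div hack, since `<td>` normalizes to `<div>` along with every other block element.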
(In reply to comment #9)
> (In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness,
> emailapocalypse) from comment #6)
> > Comment on attachment 8380955 [details] [diff] [review]
> > TranslationDocument
> > 
> > Review of attachment 8380955 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > I have to say I didn't spend a lot of time on this patch, and my feedback
> > below is very high level.
> > 
> > I have one architectural question.  It seems like your goal is to extract
> > sentences/phrases from the original document that may span different HTML
> > elements, is that correct?  If it is, you need to pay attention to
> > inline/block elements and try to break sentences/paragraphs around the
> > boundaries of block elements (and deal with things like line breaks, tables,
> > and special text content such as image alt text for accessibility, etc.)  As
> > far as I can see, this patch doesn't do any of that work, and I doubt that
> > it will work on all kinds of documents.
> 
> Hmm, yes and no, depends on what you mean by it. I'm not extracting the content
> out of nodes and merging or doing anything with them, i'm just creating a
> smaller tree that doesn't contain blank text nodes or elements that don't
> directly contain content. So I'm not removing any of the element markers that
> define the boundaries around text content. This means that all </p> markers are
> in place for example. I don't need to do any sentence breaking or anything.

Oh, I see.  So for example, do you preserve or remove <br>?  What about <hr>?

> I haven't dealt with content in attributes yet (for things like alt text or
> input placeholder labels). I'll deal with them later but I have a plan.

OK.

> > ::: browser/modules/translation/TranslationDocument.jsm
> > @@ +151,5 @@
> > > +          unwind = true;
> > > +        }
> > > +     }
> > > +
> > > +    } while (limit--);
> > 
> > So this won't give you any performance guarantees since the amount of work
> > that _createObjFor needs to do is determined by content.  Which is a fancy
> > way to say that this will cause janks.  I think you need to do something
> > fancier here, such as yielding to the event loop.  But then you have to deal
> > with dynamic changes...
> 
> "the amount of work is determined by content": do you mean for the text node
> case, because of the text length?

Yes, exactly.

> It is a hard problem, because ultimately this code needs to walk the DOM, and
> that can only be done in the main thread. The idea for the |limit| was just to
> put a finite limit on how much time we spend here, and then we can optimize the
> function more and more if it's causing problems.

But I hope that it's clear now that limit doesn't actually do that.  I think it's harmful to have constructs in the code which mislead the people who read the code to believe that the code has certain performance characteristics that it actually doesn't.

> Yielding to the event loop would be a _nightmare_, because content could keep
> changing in a way that makes us never finish walking it.. And handling the
> dynamic changes would be no small feat.  Unless we can only yield to UI events
> while keeping content paused, like alert does. But that probably opens another
> can of worms.

Yeah, agreed!

> However, an idea occurred to me. I think that for text nodes, instead of copying
> the string right away, I could also create an obj for it with a weakRef to the
> node. This way it would be fixed time, and I only need to copy the string when
> serializing the tree into HTML again. Then we can sort of know that there's an
> upper bound to how much time it takes to traverse 10000 nodes.
> But acceptNode still needs access to the string and run a regex on it, so I
> don't know if this would help.

Yeah I sort of expect that it wouldn't.  :/  And you still need to come up with a way to handle dynamic changes then.

(Note that handling dynamic changes is at least plausible by registering a mutation event listener and remembering which nodes you have previously traversed and which ones you have never seen before somehow.  No idea how to do the second part efficiently in JS though.)
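The bookkeeping that suggestion implies might be sketched like this (hypothetical helper, not from any attached patch): a MutationObserver would deliver records whose addedNodes feed into a WeakSet of already-traversed nodes, so only genuinely new nodes get walked, and the WeakSet doesn't keep removed nodes alive.

```javascript
// Given MutationObserver-style records, return only the added nodes that
// have never been traversed, marking them as seen along the way.
function collectNewNodes(mutations, seen) {
  let fresh = [];
  for (let record of mutations) {
    for (let node of record.addedNodes) {
      if (!seen.has(node)) {
        seen.add(node);
        fresh.push(node);
      }
    }
  }
  return fresh;
}
```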

> Realistically the only solution to this is not running everything in one thread
> ;) But until there, we need to optimize things around this limitation.

To be super clear, nobody is planning to implement multi-threaded DOM.  :-)

> > @@ +164,5 @@
> > > +  acceptNode: function(node) {
> > > +    return globalAcceptNode(node);
> > > +  },
> > > +
> > > +  _createObjFor: function(node) {
> > 
> > How do you deal with content like this here?
> > 
> >  <p>Here's my w<span>or</span>d</p>
> 
> I don't expect any translation engine or our code to deal with a word broken
> down like that in any reasonable way. The engines mostly treat tags as empty
> spaces (separators), so it looks like "w or d".

That should be considered a bug!  We don't need to inherit that bug.

> We can't even know if it visually looks like a single word; <span> could have
> margins or a display:block..

Which is why I suggested earlier that you should look at inline and block elements.  If you did that, you could basically ignore all inline elements and this problem would go away.

> The important part is allowing inline elements to move freely around its
> siblings (other elements or text nodes).

Not sure why that is needed.

> > @@ +201,5 @@
> > > +   * point to the same objects (i.e., new ones are not created here).
> > > +   *
> > > +   * In case translation was chunked, this function is able to
> > > +   * produce a final translation by being called once
> > > +   * for each resulting chunk.
> > 
> > How do you deal with dynamic changes in the document between the time when
> > you extract your tree and the time when the translation is available?
> 
> We hold weak refs to the original elements, so if they're gone we just don't do
> anything about it. Not done yet in this patch but I'm also thinking of checking
> if the original text is still there unchanged before replacing it with the
> translation.

But what do you mean by "gone"?  What happens if I take an element and move it somewhere else in the DOM?  The weak ref would still be valid, right?

> > @@ +359,5 @@
> > > +
> > > +  let nodeName = node.nodeName.toLowerCase();
> > > +
> > > +  // Ignore non-text nodes that commonly appear in the body
> > > +  if (["script", "style", "iframe", "code", "noscript"].indexOf(nodeName)) {
> > 
> > This seems very fragile.  Why not test node.style.display == "none" or
> > something?
> 
> We can't test for visibility because many websites have menus, navigation
> elements, "see more" content etc., that are not always visible but can't go
> untranslated.
> 
> We need to know what types of elements have text nodes in the body of a page
> but don't have any value as actual content.
> (script and style, for example, were throwing most language detection attempts
> off due to the use of english keywords and property names)

We're entering the edge case territory here, but my point was that script and style are not special at all.  You can add script,style{display:block} to a page and ruin this heuristic.  :-)

> > Also, how do you deal with SVG/MathML content here?
> > 
> 
> Good point, I'll do as Henri suggested and skip anything not in the html
> namespace.

That's a fair suggestion!
(In reply to :Felipe Gomes from comment #10)
> I didn't understand from your first comment if you're suggesting that it's
> better to hand implement a domwalker loop or not.

I was suggesting hand-implementing the loop. I find it clearer than TreeWalker, since the structure of the algorithm itself offers a place for code that should run when walking back up the tree without having to maintain a boolean about that, but if you prefer TreeWalker, that's OK.

> Sure, I'll do that. But what's the reason that tagName/nodeName should never
> be used?

It can contain a meaningless prefix.

> What I meant by "doesn't matter" is more for our side, when parsing the
> translation result. In theory the translation service could care, as it is
> semantic information that it may use. But it does not need to be perfect.. I
> think having at least a clear distinction between block-level and inline
> elements would be useful. I even thought of converting every block level to
> <p> and every inline to <b> to reduce the payload size. But <p> can't be
> nested, so I'd need to use div for it.

Yeah, you'd need <div> for that.

> But this was basically a hack to handle some websites that I've been
> testing. I still need a better way to handle this case in general as I'm
> sure there are others. Thoughts?

What's worthwhile really depends on the internals of the service. Can we find out what markup the service cares about?
Whiteboard: p=0 → [translation] p=0
(In reply to :Ehsan Akhgari (needinfo? me!) (slow responsiveness, emailapocalypse) from comment #11)
> 
> > However, an idea occurred to me. I think that for text nodes, instead of copying
> > the string right away, I could also create an obj for it with a weakRef to the
> > node. This way it would be fixed time, and I only need to copy the string when
> > serializing the tree into HTML again. Then we can sort of know that there's an
> > upper bound to how much time it takes to traverse 10000 nodes.
> > But acceptNode still needs access to the string and run a regex on it, so I
> > don't know if this would help.
> 
> Yeah I sort of expect that it wouldn't.  :/  And you still need to come up
> with a way to handle dynamic changes then.
> 
> (Note that handling dynamic changes is at least plausible by registering a
> mutation event listener and remembering which nodes you have previously
> traversed and which ones you have never seen before somehow.  No idea how to
> do the second part efficiently in JS though.)

But that's not the hard part of dealing with dynamic changes.. The hard part is keeping a tree representation while dynamic changes are happening, which is impossible as a change may cause it to become a cyclic or disjoint graph and no longer a tree.


I see two possible approaches:

(1) I looked more into other code that uses it, I think it may not be so crazy to use EnterModalState to freeze the webpage for a bit while yielding to the UI and other pages, until we can complete the walk on the page.

(2) If there is a way to perform an efficient copy of the page DOM, we could do that and then walk on the snapshot'ed copy instead of the live page, at our own pace (i.e. while yielding to the event loop).
Even better if we could send it through a non-copy move to a worker thread and use a worker for that. But I don't believe any of this is possible without a ton of other work (but I'd be glad to learn otherwise).


What do you think? If you're not strongly opposed to it I'm gonna get a working version of option 1 and see how well it works.


> 
> > Realistically the only solution to this is not running everything in one thread
> > ;) But until there, we need to optimize things around this limitation.
> 
> To be super clear, nobody is planning to implement multi-threaded DOM.  :-)

I know, I was thinking about e10s rather than multi-threaded DOM. To me it's expected that a page may need to be paused for a moment while it is being translated, as long as that doesn't affect other pages or the UI :)
Whiteboard: [translation] p=0 → [translation] p=13 s=it-30c-29a-28b.3
qa- as the end result(s) will be qa+ in the bugs this one blocks
Whiteboard: [translation] p=13 s=it-30c-29a-28b.3 → [translation] p=13 s=it-30c-29a-28b.3 [qa-]
Comment on attachment 8380955 [details] [diff] [review]
TranslationDocument

I don't think I have feedback on this beyond the issues Ehsan/Henri have raised. I think we agreed browser/modules/translation shouldn't exist in the other bug.
Attachment #8380955 - Flags: feedback?(gavin.sharp)
Whiteboard: [translation] p=13 s=it-30c-29a-28b.3 [qa-] → [translation] p=13 s=it-31c-30a-29b.1 [qa-]
Posted patch GetTranslationNodes (obsolete) — Splinter Review
Olli: this patch creates a function called getTranslationNodes which iterates through the document and finds all the useful nodes for translation. A "translation node" is a node that has content that should be translated.

The code itself should be ready for review. I'm requesting only feedback because I've clearly thrown the new interface nsITranslationNode into the same file just to get things working, but I don't know where I should put it. I was hoping you could suggest a good place for it :)
And also I need tests for it.
Attachment #8380955 - Attachment is obsolete: true
Attachment #8396923 - Flags: feedback?(bugs)
And FTR, the "isTranslationRoot" and "isBlockFrame" information will be used higher up by the JS code that calls this function and formats the data for the translation service.
Comment on attachment 8396923 [details] [diff] [review]
GetTranslationNodes


>+
>+  nsSimpleContentList* elements = new nsSimpleContentList(doc);
You leak |elements| and all the elements you append to it.


>+    // An element is a translation node if it contains
>+    // at least one text node that has meaningful data
>+    // for translation
>+    for (nsIContent* child = content->GetFirstChild();
>+         child;
>+         child = child->GetNextSibling()) {
>+
>+      if (child->IsNodeOfType(nsINode::eTEXT) &&
>+          child->TextHasContentForTranslation()) {
>+          elements->AppendElement(content);
>+        break;
>+      }
>+    }   
>+  }
So this can get a bit slow with large documents. nsSimpleContentList uses nsTArray as storage, and nsTArray really isn't too good when one needs to append lots of stuff without being able to set the capacity beforehand - it ends up doing lots of alloc/dealloc/memmove.


+  for (uint32_t i = 0; i < count; i++) {
+    nsIContent* node = elements->Item(i);
+
+    bool isBlockFrame = false;
+    nsIFrame* frame = node->GetPrimaryFrame();
+    if (frame) {
+      isBlockFrame = frame->IsFrameOfType(nsIFrame::eBlockFrame);
+    }
+
+    bool isTranslationRoot = isBlockFrame;
+    if (!isBlockFrame) {
+      // If an element is not a block element, it still
+      // can be considered a translation root if the parent
+      // of this element didn't make into the list of nodes
+      // to be translated.
+      bool parentInList = false;
+      nsIContent* parent = node->GetParent();
+      if (parent) {
+        parentInList = (elements->IndexOf(parent) != -1);
Hmm, isn't this O(nlogn), when it could be O(1) 
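The cost the reviewer is flagging can be sketched in plain JavaScript (toy strings standing in for nsIContent pointers — this is an illustration, not Gecko code). An array `indexOf` scans linearly on every parent lookup, while a hash-based set answers membership in roughly constant time; a later revision of the patch indeed switches to a hashtable (`translationNodesHash.Contains(parent)`).

```javascript
// Array-based membership: indexOf walks the list, so checking the
// parent of every collected node costs O(n) per lookup, O(n^2) overall.
function parentInListArray(elements, parent) {
  return elements.indexOf(parent) !== -1; // linear scan
}

// Set-based membership: average O(1) per lookup, O(n) overall.
function parentInListSet(elementSet, parent) {
  return elementSet.has(parent);
}

const nodes = ["body", "p", "b", "span"]; // stand-ins for collected nodes
const nodeSet = new Set(nodes);

console.log(parentInListArray(nodes, "p"));   // true
console.log(parentInListSet(nodeSet, "div")); // false
```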


>+NS_INTERFACE_MAP_BEGIN(nsTranslationNode)
>+  NS_INTERFACE_MAP_ENTRY_AMBIGUOUS(nsISupports, nsITranslationNode)
I don't see anything ambiguous here

Is the idea that some js implemented service will use nsIDOMWindowUtils.getTranslationNodes?
If so, I'm a bit worried about the performance since this would create wrappers for many nodes which wouldn't
otherwise have wrappers. And there will be wrappers also for nsITranslationNode objects.
And after translation most of those wrappers would be collected.
So C++ only implementation _might_ be good here.


And _if_ we want to do this all in JS, couldn't we just add [ChromeOnly] bool textHasContentForTranslation() to text nodes
and [ChromeOnly] bool isBlockLevel() (btw, how should inline-block be handled?) to elements? That would at least
remove the need for nsITranslationNode and reduce some addref/release.
Attachment #8396923 - Flags: feedback?(bugs) → feedback-
Posted patch GetTranslationNodes (obsolete) — Splinter Review
This implements a list-style interface as we discussed, which made things much faster. Also used the hashtable and got rid of the second for-loop, as it was possible to avoid it by using the PreContentIterator.

I added an in-param aRoot so that this is not always tied to the body of the document as it was before. I'll post tests later today.
Attachment #8396923 - Attachment is obsolete: true
Attachment #8397626 - Flags: review?(bugs)
Posted patch GetTranslationNodes + tests (obsolete) — Splinter Review
Now with tests
Attachment #8397626 - Attachment is obsolete: true
Attachment #8397626 - Flags: review?(bugs)
Attachment #8398574 - Flags: review?(bugs)
Comment on attachment 8398574 [details] [diff] [review]
GetTranslationNodes + tests

>--- a/content/base/public/nsIContent.h
You should update IID

>+  unsigned char ch;
>+  for (; cp < end; cp++) {
>+    ch = *cp;
>+
>+    // These are the characters that are letters
>+    // in the first 256 UTF-8 codepoints.
>+    if ((ch >= 'a' && ch <= 'z') ||
>+       (ch >= 'A' && ch <= 'Z') ||
>+       (ch >= 192 && ch <= 214) ||
>+       (ch >= 216 && ch <= 246) ||
>+       (ch >= 248)) {
>+      return true;
>+    }
Do we really not need numbers? Numbers may affect the context quite a bit.
Couldn't we just check the whitespace bit and return true if that is not set?

>+  nsTHashtable<nsPtrHashKey<nsIContent> > translationNodesHash(1000);
no need for the space between > and >

>+
>+  nsTranslationNodeList* list = new nsTranslationNodeList;
You really want nsRefPtr<nsTranslationNodeList> list
and then .swap() when setting the value of the out param.

>+
>+  nsCOMPtr<nsIContentIterator> iter;
>+  iter = NS_NewPreContentIterator();
Hmm, why you don't use nsINode::GetNextNode ?

>+
>+    if (content->OwnerDoc() != doc) {
>+      continue;
>+    }
Don't understand this check.


>+
>+    // An element is a translation node if it contains
>+    // at least one text node that has meaningful data
>+    // for translation
>+    for (nsIContent* child = content->GetFirstChild();
>+         child;
>+         child = child->GetNextSibling()) {
>+
>+      if (child->IsNodeOfType(nsINode::eTEXT) &&
>+          child->TextHasContentForTranslation()) {
Why you need the check for node type?


>+nsTranslationNodeList::Item(uint32_t index, nsIDOMNode** aRetVal)
>+{
>+  NS_ENSURE_ARG_POINTER(aRetVal);
>+  if (index >= mLength) {
>+    *aRetVal = nullptr;
>+    return NS_OK;
>+  }
>+
>+  NS_ADDREF(*aRetVal = mNodes.ElementAt(index));

Isn't there SafeElementAt()? nsCOMArray certainly would have that.

> [scriptable, uuid(ef70a299-033c-4adc-b214-6649aed9d828)]
> interface nsIDOMWindowUtils : nsISupports {
You need to update uuid
Attachment #8398574 - Flags: review?(bugs) → review-
Posted patch translationnodes (obsolete) — Splinter Review
> 
> >+  unsigned char ch;
> >+  for (; cp < end; cp++) {
> >+    ch = *cp;
> >+
> >+    // These are the characters that are letters
> >+    // in the first 256 UTF-8 codepoints.
> >+    if ((ch >= 'a' && ch <= 'z') ||
> >+       (ch >= 'A' && ch <= 'Z') ||
> >+       (ch >= 192 && ch <= 214) ||
> >+       (ch >= 216 && ch <= 246) ||
> >+       (ch >= 248)) {
> >+      return true;
> >+    }
> Do we really not need numbers? Numbers may affect to the context quite a bit.
> Couldn't we just check the whitespace bit and return true if that is not set?

A number in the middle of a phrase will get translated; it's only a node that is purely a number that won't. In my experiments with various popular websites [1], not translating these nodes represents some significant savings in the amount of text translated. This is a balancing act and we'll keep experimenting; if we feel numbers should be included later we can do it easily. There are also a lot of punctuation-only nodes out there, so we can't use TextIsOnlyWhitespace satisfactorily here.
[1] e.g. see how many nodes we can save from translating in this page by not including pure number nodes: http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes
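The "has at least one letter" test from the C++ hunk can be mirrored as a small JavaScript sketch (an illustration, not the Gecko implementation; the quoted ranges are really Latin-1 code points, and in this sketch anything above that range also counts as a letter):

```javascript
// Returns true if the text contains at least one letter, mirroring the
// ranges in the C++ patch: a-z, A-Z, and the Latin-1 letter blocks.
// Whitespace-, digit-, and punctuation-only nodes return false.
function hasTextForTranslation(text) {
  for (let i = 0; i < text.length; i++) {
    const c = text.charCodeAt(i);
    if ((c >= 0x61 && c <= 0x7a) || // a-z
        (c >= 0x41 && c <= 0x5a) || // A-Z
        (c >= 0xc0 && c <= 0xd6) || // À-Ö
        (c >= 0xd8 && c <= 0xf6) || // Ø-ö
        (c >= 0xf8)) {              // ø and above (sketch only)
      return true;
    }
  }
  return false;
}

console.log(hasTextForTranslation("Item 1")); // true  - has letters
console.log(hasTextForTranslation("42"));     // false - pure number node
console.log(hasTextForTranslation("..."));    // false - punctuation only
```

This is how a pure-number node like a Wikipedia episode count gets skipped while "Item 1" still gets sent for translation.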
Attachment #8398574 - Attachment is obsolete: true
Attachment #8399462 - Flags: review?(bugs)
Comment on attachment 8399462 [details] [diff] [review]
translationnodes

Review of attachment 8399462 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/nsDOMWindowUtils.cpp
@@ +1591,5 @@
> +
> +  nsIContent* content = root;
> +  while ((content = content->GetNextNode(root))) {
> +  //for (iter->Next(); !iter->IsDone() && limit; iter->Next()) {
> +    //nsCOMPtr<nsIContent> content = do_QueryInterface(iter->GetCurrentNode());

ignore these two commented lines that I left
Whiteboard: [translation] p=13 s=it-31c-30a-29b.1 [qa-] → [translation] p=13 s=it-31c-30a-29b.2 [qa-]
No longer blocks: fxdesktopbacklog
Flags: firefox-backlog+
Comment on attachment 8399462 [details] [diff] [review]
translationnodes


>   /**
>+   * Method to see if the text node contains data that is useful
>+   * for a translation: i.e., it consists of more than just whitespace,
>+   * digits and punctuation.
>+   * NOTE: Always returns false for elements.
>+   */
>+  virtual bool TextHasContentForTranslation() = 0;
Maybe HasTextForTranslation() would be tiny bit better name.
>+nsDOMWindowUtils::GetTranslationNodes(nsIDOMNode* aRoot,
>+                                      nsITranslationNodeList** aRetVal)
>+{
>+  if (!nsContentUtils::IsCallerChrome()) {
>+    return NS_ERROR_DOM_SECURITY_ERR;
>+  }
>+
>+  NS_ENSURE_ARG_POINTER(aRetVal);
>+  nsCOMPtr<nsIContent> root = do_QueryInterface(aRoot);
>+  NS_ENSURE_STATE(root);
>+  nsCOMPtr<nsIDocument> doc = GetDocument();
>+  NS_ENSURE_STATE(doc);
>+
>+  if (root->OwnerDoc() != doc) {
>+    return NS_ERROR_DOM_WRONG_DOCUMENT_ERR;
>+  }
>+
>+  nsTHashtable<nsPtrHashKey<nsIContent>> translationNodesHash(1000);
>+  nsRefPtr<nsTranslationNodeList> list = new nsTranslationNodeList;
>+
>+  uint32_t limit = 20000;
>+  // We begin iteration with iter->Next() because we want to explictly
>+  // skip the root tag from being a translation node.
>+
>+  nsIContent* content = root;
>+  while ((content = content->GetNextNode(root))) {
>+  //for (iter->Next(); !iter->IsDone() && limit; iter->Next()) {
>+    //nsCOMPtr<nsIContent> content = do_QueryInterface(iter->GetCurrentNode());
Remove these leftovers 


>+
>+    if (!content->IsHTML()) {
>+      continue;
>+    }
>+
>+    nsIAtom* localName = content->Tag();
>+
>+    // Skip elements that usually contain non-translatable text content.
>+    if (localName == nsGkAtoms::script ||
>+        localName == nsGkAtoms::iframe ||
>+        localName == nsGkAtoms::frameset ||
>+        localName == nsGkAtoms::frame ||
>+        localName == nsGkAtoms::code ||
>+        localName == nsGkAtoms::noscript ||
>+        localName == nsGkAtoms::style) {
I know this is still a preliminary approach, but in the future we might 
want to pass in a list of element names to bypass. That way js could easily filter out stuff.
But not needed now.


>+
>+        bool isTranslationRoot = isBlockFrame;
>+        if (!isBlockFrame) {
>+          // If an element is not a block element, it still
>+          // can be considered a translation root if the parent
>+          // of this element didn't make into the list of nodes
>+          // to be translated.
>+          bool parentInList = false;
>+          nsIContent* parent = content->GetParent();
>+          if (parent) {
>+            parentInList = translationNodesHash.Contains(parent);
>+          }
>+          isTranslationRoot = !parentInList;
>+        }
What is the use case for this? Could you explain.


>+        list->AppendElement(content->AsDOMNode(), isTranslationRoot);
>+        --limit;
Hmm, you don't actually use limit anywhere 

>+nsTranslationNodeList::Item(uint32_t index, nsIDOMNode** aRetVal)
aIndex

>+  NS_ENSURE_ARG_POINTER(aRetVal);
>+  NS_ADDREF(*aRetVal = mNodes.SafeElementAt(index));
NS_IF_ADDREF


>+nsTranslationNodeList::IsTranslationRootAtIndex(uint32_t index, bool* aRetVal)
aIndex

>+  void AppendElement(nsIDOMNode* element, bool isRoot)
aElement

>diff --git a/dom/base/test/test_getTranslationNodes.html b/dom/base/test/test_getTranslationNodes.html
I think you should also test a case where there are lots of nodes (since you missed the use of the limit variable).
Attachment #8399462 - Flags: review?(bugs) → review+
Comment on attachment 8399462 [details] [diff] [review]
translationnodes

Oops r-
Attachment #8399462 - Flags: review+ → review-
The use case for translation roots, and making inline elements as roots, is like this:

A run of text under the same block element (say, a <p>) can have nested inline elements, and none of them should become roots as long as their parent element is also a translation node (which means they will be connected to the <p> as the root).

For example:
<p>Lorem ipsum <b>dolor sit <a>amet</a></b>.</p>

Everything is a translation node (because every element has text), but only <p> should be a root here (so we translate it as a single meaningful unit).

Now, if an inline element is disconnected from a block element that has text in it (meaning: its parent is not a translation node), we make the inline element itself the root, so that we can discard the block element as useless and save data in the translation payload. For example:

<table>
  <tr>
    <td>
      <div>
        <span>Content</span>
      </div>
    </td>
  </tr>
</table>

We make <span> the root element, and that's the only thing that we're gonna translate.
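The rule described above condenses to: a node is a translation root if it is block-level, or if it is inline but its parent never made it into the node list. Here is a toy sketch of that rule in plain JavaScript (not Gecko code — node records with hypothetical `name`/`parent`/`isBlock` fields stand in for nsIContent plus its primary-frame info):

```javascript
// Given the translation nodes found by the tree walk, mark which ones
// are translation roots: block-level nodes, or inline nodes whose
// parent is not itself in the list.
function markTranslationRoots(nodes) {
  const inList = new Set(nodes.map(n => n.name));
  return nodes.map(n => ({
    ...n,
    isTranslationRoot: n.isBlock || !inList.has(n.parent),
  }));
}

// <p>Lorem ipsum <b>dolor sit <a>amet</a></b>.</p>
// -> only <p> is a root; <b> and <a> hang off it.
const nested = markTranslationRoots([
  { name: "p", parent: "body", isBlock: true },
  { name: "b", parent: "p",    isBlock: false },
  { name: "a", parent: "b",    isBlock: false },
]);

// <div><span>Content</span></div> where <div> has no text of its own,
// so <div> never made it into the node list.
// -> <span> becomes the root.
const orphan = markTranslationRoots([
  { name: "span", parent: "div", isBlock: false },
]);
```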


I added a new extra testcase for this, to make sure that nested inline elements behave correctly.
Fixed comments and added a test for limit
Attachment #8399462 - Attachment is obsolete: true
Attachment #8402031 - Flags: review?(bugs)
Comment on attachment 8402031 [details] [diff] [review]
translationnodes

>+  //~nsTranslationNodeList() { };
You could just remove this.

>+
>+  NS_DECL_ISUPPORTS
>+  NS_DECL_NSITRANSLATIONNODELIST
Hmm, how do we expect to use this stuff? Should it be cycle collectable?
I guess we can live with this, since content js shouldn't have an edge to this object.

>+  /**
>+   * Get a list of nodes that have meaningful textual content to
>+   * be translated.
>+   *
>+   * This method requires chrome privileges.
>+   */
>+  nsITranslationNodeList getTranslationNodes(in nsIDOMNode aRoot);
Could you perhaps add a comment here that the API is rather experimental.
Attachment #8402031 - Flags: review?(bugs) → review+
https://hg.mozilla.org/mozilla-central/rev/79db871959fd
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → Firefox 31
Status: RESOLVED → VERIFIED
Mass move of translation bugs to the new Translation component.
Component: General → Translation
Depends on: 1015535