Last Comment Bug 777966 - Change reader mode replaceBrs logic
: Change reader mode replaceBrs logic
Status: RESOLVED FIXED
:
Product: Firefox for Android
Classification: Client Software
Component: Reader View (show other bugs)
: unspecified
: ARM Android
: -- normal (vote)
: Firefox 17
Assigned To: Brian Nicholson (:bnicholson)
:
Mentors:
Depends on:
Blocks: reader 779796
  Show dependency treegraph
 
Reported: 2012-07-26 15:28 PDT by Brian Nicholson (:bnicholson)
Modified: 2012-08-20 21:26 PDT (History)
2 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
fixed


Attachments
add _replaceBrs() method to Readability (5.22 KB, patch)
2012-07-26 15:33 PDT, Brian Nicholson (:bnicholson)
no flags Details | Diff | Splinter Review
Add _replaceBrs() method to Readability, v2 (5.97 KB, patch)
2012-08-03 23:01 PDT, Brian Nicholson (:bnicholson)
lucasr.at.mozilla: review+
lukasblakk+bugs: approval‑mozilla‑aurora+
Details | Diff | Splinter Review

Description Brian Nicholson (:bnicholson) 2012-07-26 15:28:49 PDT
From Readability.js:

    // Turn all double br's into p's. Note, this is pretty costly as far
    // as processing goes. Maybe optimize later.
    doc.body.innerHTML =
        doc.body.innerHTML.replace(this.REGEXPS.replaceBrs, '</p><p>').

Aside from this being an expensive operation, it also produces invalid HTML. For example, with this regex replacement,

<div>
  hello<br><br>
  <div>foo<br><br>bar</div>
</div>

becomes

<div>
  hello</p><p>
  <div>foo</p><p>bar</div>
</div>

which is a terrible mess.
Comment 1 Brian Nicholson (:bnicholson) 2012-07-26 15:33:59 PDT
Created attachment 646368 [details] [diff] [review]
add _replaceBrs() method to Readability

This patch uses DOM traversal rather than regex to do the replacements. I tested several sites, and they all seemed to parse correctly. Here are some benchmarks:

http://en.m.wikipedia.org/wiki/List_of_j%C5%8Dy%C5%8D_kanji
before: ~1600ms; after: ~600ms

http://m.yahoo.com/w/legobpengine/news/report-colorado-shooting-suspect-sent-plans-notebook-psychiatrist-170815999.html?orig_host_hdr=news.yahoo.com&.intl=US&.lang=en-US
before: ~16ms; after: ~40ms

http://www.nytimes.com/2012/07/27/business/media/aamir-khan-a-bollywood-star-remakes-himself-into-tv-conscience-of-social-ills.html?_r=1&hp
before: ~45ms; after: ~2ms
Comment 2 Lucas Rocha (:lucasr) 2012-07-30 10:17:28 PDT
Comment on attachment 646368 [details] [diff] [review]
add _replaceBrs() method to Readability

Review of attachment 646368 [details] [diff] [review]:
-----------------------------------------------------------------

The function looks a bit non-obvious at first sight. It would be nice if you could comment it a bit more. I'll only give r+ after my questions are answered and comments are added.

::: mobile/android/chrome/content/Readability.js
@@ +246,5 @@
> +    function nextElement(node) {
> +      let next = node;
> +      while (next
> +          && (next.nodeType != Node.ELEMENT_NODE)
> +          && !whitespace.test(next.textContent)) {

Does this condition cover empty tags? i.e. no text at all

@@ +253,5 @@
> +      return next;
> +    }
> +
> +    let brs = elem.getElementsByTagName("br");
> +    brs = Array.prototype.slice.call(brs, 0);

This array can get quite big depending on the page...

@@ +254,5 @@
> +    }
> +
> +    let brs = elem.getElementsByTagName("br");
> +    brs = Array.prototype.slice.call(brs, 0);
> +    for (let i = 0; i < brs.length; i++) {

Traverse in reverse order for better performance.

@@ +259,5 @@
> +      let br = brs[i];
> +      let next = br.nextSibling;
> +      let replaced = false;
> +
> +      while ((next = nextElement(next))

Isn't this call going to "skip" a element? You're calling nextElement() on br.nextSibling (which means getting the element after the next sibling from the current "br"). What am I missing?

@@ +260,5 @@
> +      let next = br.nextSibling;
> +      let replaced = false;
> +
> +      while ((next = nextElement(next))
> +          && (next.localName == "br")) {

nit: no need to break line here.

@@ +263,5 @@
> +      while ((next = nextElement(next))
> +          && (next.localName == "br")) {
> +        replaced = true;
> +        let sibling = next.nextSibling;
> +        next.parentNode.removeChild(next);

Does this mean that some brs might become orphans by the time the loop gets to them?

@@ +278,5 @@
> +          if (next.localName == "br") {
> +            let nextElem = nextElement(next);
> +            if (nextElem && nextElem.localName == "br") {
> +              break;
> +            }

Does this mean that you won't enclose a first blob of text followed by <br><br> with <p>? For example:

BLOB1
<br><br>
BLOB2

will become:

BLOB1
<p>BLOB2</p>

Right? Just wondering what kind of issues we might get when styling the resulting HTML in reader.
Comment 3 Brian Nicholson (:bnicholson) 2012-07-30 11:14:25 PDT
(In reply to Lucas Rocha (:lucasr) from comment #2)
> Comment on attachment 646368 [details] [diff] [review]
> add _replaceBrs() method to Readability
> 
> Review of attachment 646368 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> The function looks a bit non-obvious at first sight. It would be nice if you
> could comment it a bit more. I'll only give r+ after my questions are
> answered and comments are added.
> 
> ::: mobile/android/chrome/content/Readability.js
> @@ +246,5 @@
> > +    function nextElement(node) {
> > +      let next = node;
> > +      while (next
> > +          && (next.nodeType != Node.ELEMENT_NODE)
> > +          && !whitespace.test(next.textContent)) {
> 
> Does this condition cover empty tags? i.e. no text at all
> 

This check just ignores whitespace between elements, not the elements themselves, so it does not ignore empty tags (which is what we want - the original regex didn't either).

> @@ +253,5 @@
> > +      return next;
> > +    }
> > +
> > +    let brs = elem.getElementsByTagName("br");
> > +    brs = Array.prototype.slice.call(brs, 0);
> 
> This array can get quite big depending on the page...

Elsewhere in Readability.js, we have a call to "page.getElementsByTagName('*')", which returns a much larger set of elements (all of them in the page). The only difference in this case is that we're putting them in an array, but it's just an array of object references, so I don't think this should be a concern.

> 
> @@ +254,5 @@
> > +    }
> > +
> > +    let brs = elem.getElementsByTagName("br");
> > +    brs = Array.prototype.slice.call(brs, 0);
> > +    for (let i = 0; i < brs.length; i++) {
> 
> Traverse in reverse order for better performance.
> 

I would normally, but that would change the algorithm in this case. For example, iterating forward:

<div>a<br><br>b</div> ==> <div>a<p>b</p></div>

iterating backward:

<div>a<br><br>b</div> ==> <div><p>a</p>b</div>

> @@ +259,5 @@
> > +      let br = brs[i];
> > +      let next = br.nextSibling;
> > +      let replaced = false;
> > +
> > +      while ((next = nextElement(next))
> 
> Isn't this call going to "skip" a element? You're calling nextElement() on
> br.nextSibling (which means getting the element after the next sibling from
> the current "br"). What am I missing?
> 

Sorry, this code needed more commenting like you said. But nextElement() looks at the current element before moving to the next sibling, so if the "next" node given to nextElement is an element node, nextElement() will return that same node.

> @@ +263,5 @@
> > +      while ((next = nextElement(next))
> > +          && (next.localName == "br")) {
> > +        replaced = true;
> > +        let sibling = next.nextSibling;
> > +        next.parentNode.removeChild(next);
> 
> Does this mean that some brs might become orphans by the time the loop gets
> to them?

Yes. This means the nextSibling is null, so nextElement() returns null, so the while loop is skipped, and then we move on (another thing I'll add more comments to).

> 
> @@ +278,5 @@
> > +          if (next.localName == "br") {
> > +            let nextElem = nextElement(next);
> > +            if (nextElem && nextElem.localName == "br") {
> > +              break;
> > +            }
> 
> Does this mean that you won't enclose a first blob of text followed by
> <br><br> with <p>? For example:
> 
> BLOB1
> <br><br>
> BLOB2
> 
> will become:
> 
> BLOB1
> <p>BLOB2</p>
> 
> Right? Just wondering what kind of issues we might get when styling the
> resulting HTML in reader.

Correct - but, as far as I could tell, this is what was happening already. All <br> chains were being replaced with </p><p>, resulting in the HTML from comment 0. I agree we may want to change this, though.
Comment 4 Kartikaya Gupta (email:kats@mozilla.com) 2012-07-30 13:02:14 PDT
(In reply to Brian Nicholson (:bnicholson) from comment #3)
> > > +    let brs = elem.getElementsByTagName("br");
> > > +    brs = Array.prototype.slice.call(brs, 0);
> > 
> > This array can get quite big depending on the page...
> 
> Elsewhere in Readability.js, we have a call to
> "page.getElementsByTagName('*')", which returns a much larger set of
> elements (all of them in the page). The only difference in this case is that
> we're putting them in an array, but it's just an array of object references,
> so I don't think this should be a concern.
> 

Drive-by comment: page.getElementsByTagName('*') returns a lazily-evaluated "live" NodeList, so that call alone is extremely cheap. It's actually walking through the list and grabbing references to the elements that has most of the cost. I don't know how applicable that information is to this particular situation.
Comment 5 Brian Nicholson (:bnicholson) 2012-07-30 14:12:50 PDT
(In reply to Kartikaya Gupta (:kats) from comment #4)
> 
> Drive-by comment: page.getElementsByTagName('*') returns a lazily-evaluated
> "live" NodeList, so that call alone is extremely cheap. It's actually
> walking through the list and grabbing references to the elements that has
> most of the cost. I don't know how applicable that information is to this
> particular situation.

Yeah, the fact that a live NodeList is returned is the reason I put these in an array (non-live) to begin with. I don't recall what the specific problem was, though; I'll look at it again to determine why/if a static list is necessary.

We'd have to fully iterate the list either way (array or no array), so I don't think there's a huge performance cost here (and even if there is, it still beats regex if my measurements from comment 1 were accurate). From comment 2, it sounded like Lucas was concerned with memory usage.
Comment 6 Lucas Rocha (:lucasr) 2012-07-31 02:54:38 PDT
(In reply to Brian Nicholson (:bnicholson) from comment #3)
> I would normally, but that would change the algorithm in this case. For
> example, iterating forward:
> 
> <div>a<br><br>b</div> ==> <div>a<p>b</p></div>
> 
> iterating backward:
> 
> <div>a<br><br>b</div> ==> <div><p>a</p>b</div>

Hmm, the loop iterations look index-independent. For instance, you don't seem to rely on the index of the "br" in the array to fetch the next sibling. Not a big deal though.
 
> > @@ +259,5 @@
> > > +      let br = brs[i];
> > > +      let next = br.nextSibling;
> > > +      let replaced = false;
> > > +
> > > +      while ((next = nextElement(next))
> > 
> > Isn't this call going to "skip" a element? You're calling nextElement() on
> > br.nextSibling (which means getting the element after the next sibling from
> > the current "br"). What am I missing?
> > 
> 
> Sorry, this code needed more commenting like you said. But nextElement()
> looks at the current element before moving to the next sibling, so if the
> "next" node given to nextElement is an element node, nextElement() will
> return that same node.

This function probably needs a better name then :-)

> > @@ +263,5 @@
> > > +      while ((next = nextElement(next))
> > > +          && (next.localName == "br")) {
> > > +        replaced = true;
> > > +        let sibling = next.nextSibling;
> > > +        next.parentNode.removeChild(next);
> > 
> > Does this mean that some brs might become orphans by the time the loop gets
> > to them?
> 
> Yes. This means the nextSibling is null, so nextElement() returns null, so
> the while loop is skipped, and then we move on (another thing I'll add more
> comments to).

Ok.

> > 
> > @@ +278,5 @@
> > > +          if (next.localName == "br") {
> > > +            let nextElem = nextElement(next);
> > > +            if (nextElem && nextElem.localName == "br") {
> > > +              break;
> > > +            }
> > 
> > Does this mean that you won't enclose a first blob of text followed by
> > <br><br> with <p>? For example:
> > 
> > BLOB1
> > <br><br>
> > BLOB2
> > 
> > will become:
> > 
> > BLOB1
> > <p>BLOB2</p>
> > 
> > Right? Just wondering what kind of issues we might get when styling the
> > resulting HTML in reader.
> 
> Correct - but, as far as I could tell, this is what was happening already.
> All <br> chains were being replaced with </p><p>, resulting in the HTML from
> comment 0. I agree we may want to change this, though.

It's the perfect time to fix this bug then :-)
Comment 7 Brian Nicholson (:bnicholson) 2012-08-03 23:01:21 PDT
Created attachment 648952 [details] [diff] [review]
Add _replaceBrs() method to Readability, v2

I couldn't think of a better name for nextElement(), so I kept that and added a comment. Let me know if you can think of something better.

I think the first block not being inside of a <p> might be better handled in another bug. The problem is that we don't know where we even need a <p> until we hit the first <br><br>, and we would have to backtrack through the previous siblings to add them to the first <p>. Though the existing implementation isn't ideal from an HTML standpoint, I haven't noticed any visible problems with it, and I don't know whether it's worth fixing if it results in a performance penalty.

Also, I removed the Array.prototype.slice.call() here because this bug should land with bug 779796. The JSDOMParser in bug 779796 won't support dynamic NodeLists; getElementById() in that implementation will just return an array (so there's no need to convert it from a dynamic to static data structure).
Comment 8 Lucas Rocha (:lucasr) 2012-08-06 06:43:08 PDT
Comment on attachment 648952 [details] [diff] [review]
Add _replaceBrs() method to Readability, v2

Review of attachment 648952 [details] [diff] [review]:
-----------------------------------------------------------------

How much faster is this code comparing to the current regex-based one?
Comment 9 Brian Nicholson (:bnicholson) 2012-08-06 10:43:24 PDT
(In reply to Lucas Rocha (:lucasr) from comment #8)
> Comment on attachment 648952 [details] [diff] [review]
> Add _replaceBrs() method to Readability, v2
> 
> Review of attachment 648952 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> How much faster is this code comparing to the current regex-based one?

I tested a few pages in comment 1, and the results varied. Aside from performance, though, I think the most important thing we gain from this bug is the ability to use bug 779796 (which cannot be used with invalid HTML).
Comment 10 Brian Nicholson (:bnicholson) 2012-08-06 10:59:51 PDT
Filed bug 780664 for dealing with the first <p> block.
Comment 11 Brian Nicholson (:bnicholson) 2012-08-09 10:45:35 PDT
http://hg.mozilla.org/integration/mozilla-inbound/rev/704e0a2a1a90
Comment 12 Ryan VanderMeulen [:RyanVM] 2012-08-09 19:58:15 PDT
https://hg.mozilla.org/mozilla-central/rev/704e0a2a1a90
Comment 13 Brian Nicholson (:bnicholson) 2012-08-20 10:05:40 PDT
Comment on attachment 648952 [details] [diff] [review]
Add _replaceBrs() method to Readability, v2

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 
User impact if declined: reader parses may be slower
Testing completed (on m-c, etc.): m-c
Risk to taking this patch (and alternatives if risky): low risk
String or UUID changes made by this patch: none

If we want bug 779796 in 16, this patch is required.
Comment 14 Brian Nicholson (:bnicholson) 2012-08-20 21:26:11 PDT
https://hg.mozilla.org/releases/mozilla-aurora/rev/e6ff85ba5b2e

Note You need to log in before you can comment on or make changes to this bug.