777966 - Change reader mode replaceBrs logic

Assignee

Description

•

12 years ago

From Readability.js:

    // Turn all double br's into p's. Note, this is pretty costly as far
    // as processing goes. Maybe optimize later.
    doc.body.innerHTML =
        doc.body.innerHTML.replace(this.REGEXPS.replaceBrs, '</p><p>').

Aside from this being an expensive operation, it also produces invalid HTML. For example, with this regex replacement,

<div>
  hello<br><br>
  <div>foo<br><br>bar</div>
</div>

becomes

<div>
  hello</p><p>
  <div>foo</p><p>bar</div>
</div>

which is a terrible mess.

Brian Nicholson (:bnicholson)

Assignee

Comment 1

•

12 years ago

Attached patch add _replaceBrs() method to Readability (obsolete) — Details — Splinter Review

This patch uses DOM traversal rather than regex to do the replacements. I tested several sites, and they all seemed to parse correctly. Here are some benchmarks:

http://en.m.wikipedia.org/wiki/List_of_j%C5%8Dy%C5%8D_kanji
before: ~1600ms; after: ~600ms

http://m.yahoo.com/w/legobpengine/news/report-colorado-shooting-suspect-sent-plans-notebook-psychiatrist-170815999.html?orig_host_hdr=news.yahoo.com&.intl=US&.lang=en-US
before: ~16ms; after: ~40ms

http://www.nytimes.com/2012/07/27/business/media/aamir-khan-a-bollywood-star-remakes-himself-into-tv-conscience-of-social-ills.html?_r=1&hp
before: ~45ms; after: ~2ms

Attachment #646368 - Flags: review?(lucasr.at.mozilla)

Lucas Rocha (:lucasr)

Comment 2

•

12 years ago

Comment on attachment 646368 [details] [diff] [review]
add _replaceBrs() method to Readability

Review of attachment 646368 [details] [diff] [review]:
-----------------------------------------------------------------

The function looks a bit non-obvious at first sight. It would be nice if you could comment it a bit more. I'll only give r+ after my questions are answered and comments are added.

::: mobile/android/chrome/content/Readability.js
@@ +246,5 @@
> +    function nextElement(node) {
> +      let next = node;
> +      while (next
> +          && (next.nodeType != Node.ELEMENT_NODE)
> +          && !whitespace.test(next.textContent)) {

Does this condition cover empty tags? i.e. no text at all

@@ +253,5 @@
> +      return next;
> +    }
> +
> +    let brs = elem.getElementsByTagName("br");
> +    brs = Array.prototype.slice.call(brs, 0);

This array can get quite big depending on the page...

@@ +254,5 @@
> +    }
> +
> +    let brs = elem.getElementsByTagName("br");
> +    brs = Array.prototype.slice.call(brs, 0);
> +    for (let i = 0; i < brs.length; i++) {

Traverse in reverse order for better performance.

@@ +259,5 @@
> +      let br = brs[i];
> +      let next = br.nextSibling;
> +      let replaced = false;
> +
> +      while ((next = nextElement(next))

Isn't this call going to "skip" a element? You're calling nextElement() on br.nextSibling (which means getting the element after the next sibling from the current "br"). What am I missing?

@@ +260,5 @@
> +      let next = br.nextSibling;
> +      let replaced = false;
> +
> +      while ((next = nextElement(next))
> +          && (next.localName == "br")) {

nit: no need to break line here.

@@ +263,5 @@
> +      while ((next = nextElement(next))
> +          && (next.localName == "br")) {
> +        replaced = true;
> +        let sibling = next.nextSibling;
> +        next.parentNode.removeChild(next);

Does this mean that some brs might become orphans by the time the loop gets to them?

@@ +278,5 @@
> +          if (next.localName == "br") {
> +            let nextElem = nextElement(next);
> +            if (nextElem && nextElem.localName == "br") {
> +              break;
> +            }

Does this mean that you won't enclose a first blob of text followed by <br><br> with <p>? For example:

BLOB1
<br><br>
BLOB2

will become:

BLOB1
<p>BLOB2</p>

Right? Just wondering what kind of issues we might get when styling the resulting HTML in reader.

Brian Nicholson (:bnicholson)

Assignee

Comment 3

•

12 years ago

(In reply to Lucas Rocha (:lucasr) from comment #2)
> Comment on attachment 646368 [details] [diff] [review]
> add _replaceBrs() method to Readability
> 
> Review of attachment 646368 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> The function looks a bit non-obvious at first sight. It would be nice if you
> could comment it a bit more. I'll only give r+ after my questions are
> answered and comments are added.
> 
> ::: mobile/android/chrome/content/Readability.js
> @@ +246,5 @@
> > +    function nextElement(node) {
> > +      let next = node;
> > +      while (next
> > +          && (next.nodeType != Node.ELEMENT_NODE)
> > +          && !whitespace.test(next.textContent)) {
> 
> Does this condition cover empty tags? i.e. no text at all
> 

This check just ignores whitespace between elements, not the elements themselves, so it does not ignore empty tags (which is what we want - the original regex didn't either).

> @@ +253,5 @@
> > +      return next;
> > +    }
> > +
> > +    let brs = elem.getElementsByTagName("br");
> > +    brs = Array.prototype.slice.call(brs, 0);
> 
> This array can get quite big depending on the page...

Elsewhere in Readability.js, we have a call to "page.getElementsByTagName('*')", which returns a much larger set of elements (all of them in the page). The only difference in this case is that we're putting them in an array, but it's just an array of object references, so I don't think this should be a concern.

> 
> @@ +254,5 @@
> > +    }
> > +
> > +    let brs = elem.getElementsByTagName("br");
> > +    brs = Array.prototype.slice.call(brs, 0);
> > +    for (let i = 0; i < brs.length; i++) {
> 
> Traverse in reverse order for better performance.
> 

I would normally, but that would change the algorithm in this case. For example, iterating forward:

<div>a<br><br>b</div> ==> <div>a<p>b</p></div>

iterating backward:

<div>a<br><br>b</div> ==> <div><p>a</p>b</div>

> @@ +259,5 @@
> > +      let br = brs[i];
> > +      let next = br.nextSibling;
> > +      let replaced = false;
> > +
> > +      while ((next = nextElement(next))
> 
> Isn't this call going to "skip" a element? You're calling nextElement() on
> br.nextSibling (which means getting the element after the next sibling from
> the current "br"). What am I missing?
> 

Sorry, this code needed more commenting like you said. But nextElement() looks at the current element before moving to the next sibling, so if the "next" node given to nextElement is an element node, nextElement() will return that same node.

> @@ +263,5 @@
> > +      while ((next = nextElement(next))
> > +          && (next.localName == "br")) {
> > +        replaced = true;
> > +        let sibling = next.nextSibling;
> > +        next.parentNode.removeChild(next);
> 
> Does this mean that some brs might become orphans by the time the loop gets
> to them?

Yes. This means the nextSibling is null, so nextElement() returns null, so the while loop is skipped, and then we move on (another thing I'll add more comments to).

> 
> @@ +278,5 @@
> > +          if (next.localName == "br") {
> > +            let nextElem = nextElement(next);
> > +            if (nextElem && nextElem.localName == "br") {
> > +              break;
> > +            }
> 
> Does this mean that you won't enclose a first blob of text followed by
> <br><br> with <p>? For example:
> 
> BLOB1
> <br><br>
> BLOB2
> 
> will become:
> 
> BLOB1
> <p>BLOB2</p>
> 
> Right? Just wondering what kind of issues we might get when styling the
> resulting HTML in reader.

Correct - but, as far as I could tell, this is what was happening already. All <br> chains were being replaced with </p><p>, resulting in the HTML from comment 0. I agree we may want to change this, though.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 4

•

12 years ago

(In reply to Brian Nicholson (:bnicholson) from comment #3)
> > > +    let brs = elem.getElementsByTagName("br");
> > > +    brs = Array.prototype.slice.call(brs, 0);
> > 
> > This array can get quite big depending on the page...
> 
> Elsewhere in Readability.js, we have a call to
> "page.getElementsByTagName('*')", which returns a much larger set of
> elements (all of them in the page). The only difference in this case is that
> we're putting them in an array, but it's just an array of object references,
> so I don't think this should be a concern.
> 

Drive-by comment: page.getElementsByTagName('*') returns a lazily-evaluated "live" NodeList, so that call alone is extremely cheap. It's actually walking through the list and grabbing references to the elements that has most of the cost. I don't know how applicable that information is to this particular situation.

Brian Nicholson (:bnicholson)

Assignee

Comment 5

•

12 years ago

(In reply to Kartikaya Gupta (:kats) from comment #4)
> 
> Drive-by comment: page.getElementsByTagName('*') returns a lazily-evaluated
> "live" NodeList, so that call alone is extremely cheap. It's actually
> walking through the list and grabbing references to the elements that has
> most of the cost. I don't know how applicable that information is to this
> particular situation.

Yeah, the fact that a live NodeList is returned is the reason I put these in an array (non-live) to begin with. I don't recall what the specific problem was, though; I'll look at it again to determine why/if a static list is necessary.

We'd have to fully iterate the list either way (array or no array), so I don't think there's a huge performance cost here (and even if there is, it still beats regex if my measurements from comment 1 were accurate). From comment 2, it sounded like Lucas was concerned with memory usage.

Lucas Rocha (:lucasr)

Comment 6

•

12 years ago

(In reply to Brian Nicholson (:bnicholson) from comment #3)
> I would normally, but that would change the algorithm in this case. For
> example, iterating forward:
> 
> <div>a<br><br>b</div> ==> <div>a<p>b</p></div>
> 
> iterating backward:
> 
> <div>a<br><br>b</div> ==> <div><p>a</p>b</div>

Hmm, the loop iterations look index-independent. For instance, you don't seem to rely on the index of the "br" in the array to fetch the next sibling. Not a big deal though.
 
> > @@ +259,5 @@
> > > +      let br = brs[i];
> > > +      let next = br.nextSibling;
> > > +      let replaced = false;
> > > +
> > > +      while ((next = nextElement(next))
> > 
> > Isn't this call going to "skip" a element? You're calling nextElement() on
> > br.nextSibling (which means getting the element after the next sibling from
> > the current "br"). What am I missing?
> > 
> 
> Sorry, this code needed more commenting like you said. But nextElement()
> looks at the current element before moving to the next sibling, so if the
> "next" node given to nextElement is an element node, nextElement() will
> return that same node.

This function probably needs a better name then :-)

> > @@ +263,5 @@
> > > +      while ((next = nextElement(next))
> > > +          && (next.localName == "br")) {
> > > +        replaced = true;
> > > +        let sibling = next.nextSibling;
> > > +        next.parentNode.removeChild(next);
> > 
> > Does this mean that some brs might become orphans by the time the loop gets
> > to them?
> 
> Yes. This means the nextSibling is null, so nextElement() returns null, so
> the while loop is skipped, and then we move on (another thing I'll add more
> comments to).

Ok.

> > 
> > @@ +278,5 @@
> > > +          if (next.localName == "br") {
> > > +            let nextElem = nextElement(next);
> > > +            if (nextElem && nextElem.localName == "br") {
> > > +              break;
> > > +            }
> > 
> > Does this mean that you won't enclose a first blob of text followed by
> > <br><br> with <p>? For example:
> > 
> > BLOB1
> > <br><br>
> > BLOB2
> > 
> > will become:
> > 
> > BLOB1
> > <p>BLOB2</p>
> > 
> > Right? Just wondering what kind of issues we might get when styling the
> > resulting HTML in reader.
> 
> Correct - but, as far as I could tell, this is what was happening already.
> All <br> chains were being replaced with </p><p>, resulting in the HTML from
> comment 0. I agree we may want to change this, though.

It's the perfect time to fix this bug then :-)

Brian Nicholson (:bnicholson)

Assignee

Comment 7

•

12 years ago

Attached patch Add _replaceBrs() method to Readability, v2 — Details — Splinter Review

I couldn't think of a better name for nextElement(), so I kept that and added a comment. Let me know if you can think of something better.

I think the first block not being inside of a <p> might be better handled in another bug. The problem is that we don't know where we even need a <p> until we hit the first <br><br>, and we would have to backtrack through the previous siblings to add them to the first <p>. Though the existing implementation isn't ideal from an HTML standpoint, I haven't noticed any visible problems with it, and I don't know whether it's worth fixing if it results in a performance penalty.

Also, I removed the Array.prototype.slice.call() here because this bug should land with bug 779796. The JSDOMParser in bug 779796 won't support dynamic NodeLists; getElementById() in that implementation will just return an array (so there's no need to convert it from a dynamic to static data structure).

Attachment #646368 - Attachment is obsolete: true

Attachment #646368 - Flags: review?(lucasr.at.mozilla)

Attachment #648952 - Flags: review?(lucasr.at.mozilla)

Brian Nicholson (:bnicholson)

Assignee

Updated

•

12 years ago

Blocks: 779796

Lucas Rocha (:lucasr)

Comment 8

•

12 years ago

Comment on attachment 648952 [details] [diff] [review]
Add _replaceBrs() method to Readability, v2

Review of attachment 648952 [details] [diff] [review]:
-----------------------------------------------------------------

How much faster is this code comparing to the current regex-based one?

Attachment #648952 - Flags: review?(lucasr.at.mozilla) → review+

Brian Nicholson (:bnicholson)

Assignee

Comment 9

•

12 years ago

(In reply to Lucas Rocha (:lucasr) from comment #8)
> Comment on attachment 648952 [details] [diff] [review]
> Add _replaceBrs() method to Readability, v2
> 
> Review of attachment 648952 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> How much faster is this code comparing to the current regex-based one?

I tested a few pages in comment 1, and the results varied. Aside from performance, though, I think the most important thing we gain from this bug is the ability to use bug 779796 (which cannot be used with invalid HTML).

Brian Nicholson (:bnicholson)

Assignee

Comment 10

•

12 years ago

Filed bug 780664 for dealing with the first <p> block.

Brian Nicholson (:bnicholson)

Assignee

Comment 11

•

12 years ago

http://hg.mozilla.org/integration/mozilla-inbound/rev/704e0a2a1a90

Brian Nicholson (:bnicholson)

Assignee

Updated

•

12 years ago

status-firefox16: --- → affected

Ryan VanderMeulen [:RyanVM]

Comment 12

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/704e0a2a1a90

Status: ASSIGNED → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Target Milestone: --- → Firefox 17

Brian Nicholson (:bnicholson)

Assignee

Comment 13

•

12 years ago

Comment on attachment 648952 [details] [diff] [review]
Add _replaceBrs() method to Readability, v2

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 
User impact if declined: reader parses may be slower
Testing completed (on m-c, etc.): m-c
Risk to taking this patch (and alternatives if risky): low risk
String or UUID changes made by this patch: none

If we want bug 779796 in 16, this patch is required.

Attachment #648952 - Flags: approval-mozilla-aurora?

Lukas Blakk [:lsblakk] use ?needinfo

Updated

•

12 years ago

Attachment #648952 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Brian Nicholson (:bnicholson)

Assignee

Comment 14

•

12 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/e6ff85ba5b2e

status-firefox16: affected → fixed

BMO Automation

Updated

•

3 years ago

Product: Firefox for Android → Firefox for Android Graveyard

add _replaceBrs() method to Readability 12 years ago Brian Nicholson (:bnicholson) 5.22 KB, patch		Details \| Diff \| Splinter Review
Add _replaceBrs() method to Readability, v2 12 years ago Brian Nicholson (:bnicholson) 5.97 KB, patch	lucasr : review+ lsblakk : approval-mozilla-aurora+	Details \| Diff \| Splinter Review