Closed Bug 1177619 Opened 4 years ago Closed 3 years ago

Reader mode isn't offered on Blogger/Blogspot based blogs

Categories

(Toolkit :: Reader Mode, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox52 --- fixed

People

(Reporter: braiamp, Assigned: evanxd)

References

(Blocks 1 open bug, )

Details

(Whiteboard: [reader-mode-readability-algorithm][reader-mode-isreadable-detection])

Attachments

(1 file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0 Iceweasel/38.0.1
Build ID: 20150526223604

Steps to reproduce:

Open any article/post on a blogger/blogspot blog with adecuate lenght.



Expected results:

Reader icon should appear as it does on WordPress based blogs (and any other content with some considerable lenght)
I'm not sure if this should block bug 1102450 (platform says Android?) or bug 558882.
Hardware: Unspecified → All
OS: Unspecified → All
Hardware: All → Unspecified
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:41.0) Gecko/20100101 Firefox/41.0
20150626030206
Status: UNCONFIRMED → NEW
Component: Untriaged → Reader Mode
Ever confirmed: true
Product: Firefox → Toolkit
Hardware: Unspecified → All
Also on desktop.
Version: 38 Branch → Trunk
Whiteboard: [reader-mode-readability-algorithm][reader-mode-isreadable-detection]
Blocks: 1286221
Hi Gijs,

I've sent a pull request[1] to the GitHub repo to fix the issue.

The solution here is score "<br>" tags because blogger use "<br>" tags to typeset their articles not "<p>" tags.

Their DOM structure looks like
```
<div>
  Dare to predict and suggest and improve your prediction based on time. Always try to add some more value to the user.<br />
  Its a continuous process, reiterate the above point &nbsp;time to time.<br />
  Let me know your thoughts how to get a wow ....<br />
  <br />
  Dare to predict and suggest and improve your prediction based on time. Always try to add some more value to the user.<br />
  Its a continuous process, reiterate the above point &nbsp;time to time.<br />
  Let me know your thoughts how to get a wow ....<br />
</div>
```

[1]: https://github.com/mozilla/readability/pull/310
Could you review it on the PR[1]?

Thanks.

[1]: https://github.com/mozilla/readability/pull/310
Flags: needinfo?(gijskruitbosch+bugs)
Assignee: nobody → evan
Commented on github.
Flags: needinfo?(gijskruitbosch+bugs)
Try to figure out an algorithm[1] to detect blogger's articles. 

[1]: https://github.com/mozilla/readability/pull/310#discussion_r82941697
Updated patch[1] for review comments. And need to check the benchmarks of the patch. It might be OK if  it's 5-10% slower.

[1]: https://github.com/mozilla/readability/pull/310/commits/9583038ec0822fb4ef04b897d3e4a7e306f92808
Did benchmark experiment and created a report[1] for it. The result is 7.46% slower after patched the patch[2]. Looks like the benchmarks of the patch could be acceptable, just like we discussed on GitHub[3].

I Run `npm run perf-reference` for 30 times for original Readability.js and patched[2] Readability.js and counted the average values to get the slower rate(7.46%). 

[1]: https://docs.google.com/spreadsheets/d/1dxFYsg7bk8bkZqVd5BZuozcnjjXNWxlmhmkocX0TiCk
[2]: https://github.com/mozilla/readability/pull/310/commits/001295afa1254fec562c26414dce03206ef9fc2f
[3]: https://github.com/mozilla/readability/pull/310#discussion_r83250236
But there are still some errors with the patch[1]. Let me continue to fix it...

```
  1) Test pages ars-1 jsdom "before all" hook:
     TypeError: Cannot read property 'trim' of undefined
      at Object.<anonymous> (Readability.js:1762:47)
      at Array.some (native)
      at Object.Readability._someNode (Readability.js:195:33)
      at Object.Readability.isProbablyReaderable (Readability.js:1748:17)
      at Context.<anonymous> (test/test-readability.js:62:35)

  2) Test pages remove-extra-brs jsdom "before all" hook:
     TypeError: Cannot read property 'trim' of undefined
      at Object.<anonymous> (Readability.js:1762:47)
      at Array.some (native)
      at Object.Readability._someNode (Readability.js:195:33)
      at Object.Readability.isProbablyReaderable (Readability.js:1748:17)
      at Context.<anonymous> (test/test-readability.js:62:35)

  3) Test pages replace-brs jsdom should probably be readerable:

      AssertionError: expected false to deeply equal true
      + expected - actual

      -false
      +true
      
      at Assertion.assertEql (node_modules/chai/lib/chai/core/assertions.js:473:10)
      at Assertion.ctx.(anonymous function) [as eql] (node_modules/chai/lib/chai/utils/addMethod.js:40:25)
      at Context.<anonymous> (test/test-readability.js:140:43)

  4) Test pages salon-1 jsdom "before all" hook:
     TypeError: Cannot read property 'trim' of undefined
      at Object.<anonymous> (Readability.js:1762:47)
      at Array.some (native)
      at Object.Readability._someNode (Readability.js:195:33)
      at Object.Readability.isProbablyReaderable (Readability.js:1748:17)
      at Context.<anonymous> (test/test-readability.js:62:35)

  5) Test pages tmz-1 jsdom "before all" hook:
     TypeError: Cannot read property 'trim' of undefined
      at Object.<anonymous> (Readability.js:1762:47)
      at Array.some (native)
      at Object.Readability._someNode (Readability.js:195:33)
      at Object.Readability.isProbablyReaderable (Readability.js:1748:17)
      at Context.<anonymous> (test/test-readability.js:62:35)

  6) Test pages wapo-1 jsdom "before all" hook:
     TypeError: Cannot read property 'trim' of undefined
      at Object.<anonymous> (Readability.js:1762:47)
      at Array.some (native)
      at Object.Readability._someNode (Readability.js:195:33)
      at Object.Readability.isProbablyReaderable (Readability.js:1748:17)
      at Context.<anonymous> (test/test-readability.js:62:35)

  7) Test pages wapo-2 jsdom "before all" hook:
     TypeError: Cannot read property 'trim' of undefined
      at Object.<anonymous> (Readability.js:1762:47)
      at Array.some (native)
      at Object.Readability._someNode (Readability.js:195:33)
      at Object.Readability.isProbablyReaderable (Readability.js:1748:17)
      at Context.<anonymous> (test/test-readability.js:62:35)

  8) Test pages webmd-1 jsdom "before all" hook:
     TypeError: Cannot read property 'trim' of undefined
      at Object.<anonymous> (Readability.js:1762:47)
      at Array.some (native)
      at Object.Readability._someNode (Readability.js:195:33)
      at Object.Readability.isProbablyReaderable (Readability.js:1748:17)
      at Context.<anonymous> (test/test-readability.js:62:35)

  9) Test pages webmd-2 jsdom "before all" hook:
     TypeError: Cannot read property 'trim' of undefined
      at Object.<anonymous> (Readability.js:1762:47)
      at Array.some (native)
      at Object.Readability._someNode (Readability.js:195:33)
      at Object.Readability.isProbablyReaderable (Readability.js:1748:17)
      at Context.<anonymous> (test/test-readability.js:62:35)
```

[1]: https://github.com/mozilla/readability/pull/310/commits/001295afa1254fec562c26414dce03206ef9fc2f
The benchmark report with the fix on Comment 12.

The result(with jsdom)[2] is 38.46% slower after patched the patch[1]. Additionally, the result with JSDOMParser[3] is 24.38% slower. Can check the detail on the reports[2][3].

Discussing the benchmark report on the GitHub's pull request...

P.S. The slower rate = (Patched Result - Original Result)/Original Result

[1]: https://github.com/mozilla/readability/pull/310/commits/b5602a4b715efb9211200af3b7660d0929005ed6
[2]: https://docs.google.com/spreadsheets/d/1s8CXR8QCdSZuAyrVvuYM8Zh-4aq6H9ZRP5g6wpKHIi0
[3]: https://docs.google.com/spreadsheets/d/1I4S0yQZ4-I9p3YI3NIP4jiuSasL3qDtp4QGEcKuqt7Y
Updated patch[1] for review comments. Continue to do benchmark experiment.

[1]: https://github.com/mozilla/readability/pull/310/commits/61af7898401869d49a0ffbb1e0c769b6b8893bd6
The benchmark report with the fix on Comment 14.

The result(with jsdom)[1] is 38.87% slower.

[1]: https://docs.google.com/spreadsheets/d/140SfoJ3YTkExhrsBgFSuL8SWTinPNY_kVl7TS7EfQ6U
Did a benchmark experiment with gecko for review comments[2].

Added two performance checkpoints[1] to evaluate the performance of patched `isProbablyReaderable` method. We could check how much time took by `isProbablyReaderable` method in the performance experiment. For sure, lower is better.
```
var t0 = content.performance.now();
var isProbablyReaderable = ReaderMode.isProbablyReaderable(content.document);
var t1 = content.performance.now();
dump("Call to isProbablyReaderable took " + (t1 - t0) + " milliseconds. \n");
```

Then randomly chose 4 articles (listed in the below) from yahoo, wikipedia, and blogger to test. Two articles have 0 `<br>` node, and two articles some `<br>` node (selected by `div > br`).
* Article 1: https://www.yahoo.com/tech/virtual-reality-systems-compared-144526498.html (with 0 `<br>` node)
* Article 2: https://en.wikipedia.org/wiki/Mozilla (with 0 `<br>` node)
* Article 3: http://www.stevenhoneyman.co.uk/2016/02/bga-on-budget.html (with 11 `<br>` nodes)
* Article 4: http://siliconexposed.blogspot.tw/2016/05/open-verilog-flow-for-silego-greenpak4.html (with 75 `<br>` nodes)

Did performance test for 10 times for each article and averaged them. The experiment result is
|                    | Article 1 | Article 2 | Article 3 | Article 4 |
|--------------------|-----------|-----------|-----------|-----------|
| Original           | 4.121 ms  | 4.009 ms  | 4.139 ms  | 3.924 ms  |
| Patched            | 4.201 ms  | 4.199 ms  | 4.397 ms  | 5.077 ms  |
| Time Increase Rate | 1.9%      | 4.7%      | 6.2%      | 29.38     |
P.S. We could say `1.9%` slower here if Time Increase Rate is `1.9%`. You can check the experiment details here[2].

Then we could find out
* If there is no `<br>` node (can be selected by `div > br`), the performance could be almost same after patched.
* If there are too many(`> 75`) `<br>` node (can be selected by `div > br`), the performance is not good.

Performance issues might be occurred if there are too many `<br>` node (can be selected by `div > br`). But maybe we should accept it because
1. Seems there might be no other way to fix this issue.
2. Only webpages with many `<br>` nodes have the performance issue. The performance of others looks good.

[1]: https://gist.github.com/evanxd/fc5eb3e33732567957aa0c02baaffc00#file-tab-content-js-L9-L12
[2]: https://docs.google.com/spreadsheets/d/1vk7bU_Z0KG4F7OEGlVWX_SA8IrH94lwrNVcw6zYBR6E
[3]: https://github.com/mozilla/readability/pull/310#issuecomment-255344452
Weird, the failures seems not be caused by the patch. Investigate...

```
/home/travis/build/mozilla/readability/Readability.js

  1846:14  error  Expected indentation of 6 spaces but found 13  indent

  1847:14  error  Expected indentation of 6 spaces but found 13  indent

  1848:14  error  Expected indentation of 6 spaces but found 13  indent

  1849:14  error  Expected indentation of 6 spaces but found 13  indent

  1850:14  error  Expected indentation of 6 spaces but found 13  indent

  1851:14  error  Expected indentation of 6 spaces but found 13  indent

  1852:14  error  Expected indentation of 6 spaces but found 13  indent
```
Fix the eslint failures. The root cause is mentioned here[1].

[1]: https://github.com/mozilla/readability/pull/310#issuecomment-257528601
We should probably do an update of Readability in m-c (this is not automatic). It should basically just mean copying over Readability.js and JSDOMParser.js and fixing the disclaimer at the top of the file so that doesn't change, and then pushing that to try to make sure we haven't accidentally broken any m-c tests. Evan, can you do this?
Flags: needinfo?(evan)
Sure, I like to do that.
Flags: needinfo?(evan)
I've sent a review request and triggered a try.
Comment on attachment 8806287 [details]
No bug - update readability from github repo, includes fix for Bug 1177619,

https://reviewboard.mozilla.org/r/89790/#review89218

Were these the only changes? I thought there was some other stuff in the pipeline that hadn't made it to m-c yet.
Attachment #8806287 - Flags: review?(gijskruitbosch+bugs) → review+
(In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #27)
> Yes, other changes on
> https://github.com/mozilla/readability/pull/310/commits/
> be37f569d32ce4d7e59bcb30d52a0e8a2b130824 is about tests(run on Travis CI).

Yes, but the last change to readability.js was in April, and there were commits in github in July and August.
Got it. Misunderstood before.

I've added all changes into mozilla-central. Could you check the patch again? And this is the try[2].

If anything is OK, could you help land the patch?(I don't have level 3) Thanks.

One question, why don't we apply the commits in July and August before?

Thanks for reviewing.

[1]: https://reviewboard.mozilla.org/r/89790/diff/3#index_header
[2]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=259cbb7ff1d7
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #32)
> One question, why don't we apply the commits in July and August before?

Because I forgot. :-)

Also because there isn't a whole lot in there that makes a big difference to Firefox - some of the bugfixes are specifically for cases where people use the library with a "real" DOM implementation, and so the combination of using querySelectorAll or getElementsByTagName and removing some of the items while iterating plays out differently than it does with jsdom or JSDOMParser.

In any case, now that there's something we want for us, we should update all the files.

We should probably also add the .eslintrc. I just added a github commit to move it to .eslintrc.js like we do in mozilla-central. Can you add it to the commit and then ping me again so I can land? :-)
Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(evan)
Pushed by gijskruitbosch@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/ef3d294a1cf6
No bug - update readability from github repo, includes fix for Bug 1177619, r=Gijs
Hi Gijs,

We got the eslint errors on the try[1] after added `.eslintrc` file.
```
TEST-UNEXPECTED-ERROR | /home/worker/checkouts/gecko/toolkit/components/reader/AboutReader.jsm:121:2 | Missing semicolon. (semi)
TEST-UNEXPECTED-ERROR | /home/worker/checkouts/gecko/toolkit/components/reader/AboutReader.jsm:699:6 | Missing semicolon. (semi)
TEST-UNEXPECTED-ERROR | /home/worker/checkouts/gecko/toolkit/components/reader/AboutReader.jsm:710:10 | Missing semicolon. (semi)
TEST-UNEXPECTED-ERROR | /home/worker/checkouts/gecko/toolkit/components/reader/ReaderMode.jsm:292:8 | Missing semicolon. (semi)
```

We should write a patch for that.

And one more thing, looks like we also need to remove the below two lines in .eslintignore file[2] to let eslint check the files, right?
```
toolkit/components/reader/Readability.js
toolkit/components/reader/JSDOMParser.js
```

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=803d231136ab&selectedJob=30300566
[2]: http://searchfox.org/mozilla-central/source/.eslintignore#240-241
Flags: needinfo?(evan) → needinfo?(gijskruitbosch+bugs)
Getting sheriffs to land a followup to fix the bustage. Not taking the eslintignore off Readability.js and JSDOMParser.js because eslint still enables rules enabled in files higher up, meaning that those files generate errors for block-spacing and spaced-comment:

gijs-mbp:firefox-unified gkruitbosch$ ./mach eslint toolkit/components/reader/
/Users/gkruitbosch/dev/firefox-unified/toolkit/components/reader/JSDOMParser.js
     1:1   error  Expected space or tab after '/*' in comment.  spaced-comment (eslint)
  1024:69  error  Expected space or tab after '/*' in comment.  spaced-comment (eslint)

/Users/gkruitbosch/dev/firefox-unified/toolkit/components/reader/Readability.js
    1:1   error  Expected space or tab after '/*' in comment.   spaced-comment (eslint)
  301:17  error  Requires a space after '{'.                    block-spacing (eslint)
  301:60  error  Requires a space before '}'.                   block-spacing (eslint)
  637:3   error  Expected space or tab after '/**' in comment.  spaced-comment (eslint)

(this is without re-adding the disclaimers at the top so YMMV).

I'll file a followup to enable even more rules and take the exceptions out.
Flags: needinfo?(gijskruitbosch+bugs)
https://hg.mozilla.org/mozilla-central/rev/ef3d294a1cf6
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
QA Whiteboard: [good first verify]
You need to log in before you can comment on or make changes to this bug.