Closed
Bug 1300697
Opened 8 years ago
Closed 7 years ago
Reader View missed first few paragraphs on New York Times website
Categories
(Toolkit :: Reader Mode, defect)
Toolkit
Reader Mode
Tracking
()
RESOLVED
FIXED
mozilla54
Tracking | Status | |
---|---|---|
firefox54 | --- | fixed |
People
(Reporter: fireattack, Assigned: evanxd)
References
(Blocks 2 open bugs)
Details
(Whiteboard: [reader-mode-readability-algorithm])
Attachments
(1 file)
User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0 Build ID: 20160902105049 Steps to reproduce: Open http://www.nytimes.com/2016/07/30/business/dealbook/yahoos-sale-to-verizon-leaves-shareholders-with-little-say.html Enter Reader View. Actual results: The first few paragraphs are missing in the Reader View. Expected results: The first few paragraphs should be there.
Reporter | ||
Updated•8 years ago
|
Component: Untriaged → Reader Mode
Product: Firefox → Toolkit
Assignee | ||
Updated•7 years ago
|
Whiteboard: [reader-mode-readability-algorithm]
Reader view sometimes drops the first few paragraphs, e.g. if they're in a larger font, or drops the intended article and uses the text from an entirely different article.
Assignee | ||
Comment 2•7 years ago
|
||
I can reproduce the issue mentioned on Comment 0 for 100%. The root cause is that we still choose a wrong top candidate `<div class="story-body story-body-2">`. We should choose the `<article id="story" class="story theme-main">` element since the structure of DOM tree is ``` <article id="story" class="story theme-main"> <div class="story-body-supplemental"> <div class="story-body story-body-1"> </div> <div class="story-body-supplemental"> <div class="story-body story-body-2"> </div> </article> ``` The `<div class="story-body story-body-1">` element contains the missing paragraphs. Choosing a better top candidate can fix it. But not sure how to do it yet.
Comment 3•7 years ago
|
||
(In reply to Evan Tseng [:evanxd] from comment #2) > I can reproduce the issue mentioned on Comment 0 for 100%. > > The root cause is that we still choose a wrong top candidate `<div > class="story-body story-body-2">`. We should choose the `<article id="story" > class="story theme-main">` element since the structure of DOM tree is > ``` > <article id="story" class="story theme-main"> > <div class="story-body-supplemental"> > <div class="story-body story-body-1"> > </div> > <div class="story-body-supplemental"> > <div class="story-body story-body-2"> > </div> > </article> > ``` > > The `<div class="story-body story-body-1">` element contains the missing > paragraphs. Choosing a better top candidate can fix it. But not sure how to > do it yet. Is this not fixed by https://github.com/mozilla/readability/pull/309 ?
Flags: needinfo?(evan)
Reporter | ||
Updated•7 years ago
|
Summary: Reader View missed first view paragraphs on New York Times website → Reader View missed first few paragraphs on New York Times website
Assignee | ||
Comment 4•7 years ago
|
||
No, PR 309 doesn't fix it. Because other top candidates don't close enough with the `topCandidate` node. And there is only one other top candidate is part of main content. We can update the two number, 0.25 to 0.8 and `numberOfHighScoreNode >= 3` to `numberOfHighScoreNode >= 1`, to fix this issue with PR 309. But I think that way is not what we want. I think we need to find a new way to fix it. I have an idea, we can try to select a new `topCandidate` in below dom structure if there is other node with `story-body` or `story-body-2` class and its `innerText` is long enough. (We could try to believe other nodes with same class which `topCandidate` has might be part of main content) Then we can select their parent as new `topCandidate`, e.g. in this case, if we search node with `story-body` class and ensure its `innerText` is long enough we'll find out `<article id="story" class="story theme-main">` is the new `topCandidate`(parent of `<div class="story-body story-body-1">` and `<div class="story-body story-body-2">` nodes). ``` <article id="story" class="story theme-main"> <div class="story-body-supplemental"> <div class="story-body story-body-1"> </div> <div class="story-body-supplemental"> <div class="story-body story-body-2"> </div> </article> ``` I know the above idea might be a little bit tricky. What do you think or have other idea?
Flags: needinfo?(evan) → needinfo?(gijskruitbosch+bugs)
Comment 5•7 years ago
|
||
Doing some kind of class matching does seem like an interesting idea. Something else though, is that at least in the mock DOM structure you provided in comment #2 and comment #4 is that there's <div> elements with only one element child, another <div>. I believe that some of the code in the last few commits in pr https://github.com/mozilla/readability/pull/313 (specifically, https://github.com/mozilla/readability/pull/313/commits/8ec087f0ef72585ae03a63249868eb3b2689a9d8 and https://github.com/mozilla/readability/pull/313/commits/310b636e79bea25e2a1f8583e8b13b740610f01b ) might help with solving this, as it specifically mentions nytimes.com and IIRC was concerned with making the "sibling candidate" logic that we have right now work in trees that have these singly-nested cases.
Flags: needinfo?(gijskruitbosch+bugs)
Assignee | ||
Comment 6•7 years ago
|
||
This patch[1] could find a better `topCandidate` for the article[2] directly, unless we change `while (parentOfTopCandidate.tagName != "BODY" && parentOfTopCandidate.children.length == 1)` to `while (parentOfTopCandidate.tagName != "BODY" && parentOfTopCandidate.children.length <= 2)`. But this change will still cause many existed test failed[3] and seems not make sense(why <= 2?). Sorry I didn't show full dom structure of the `<article>` node previously. The full dom structure looks like the below (at least two elements belong to each `<div class="story-body-supplemental">` node) So I think we still need a new idea to fix this issue. How about the previous idea on Comment 4, do class matching? What do you think? ``` <article id="story" class="story theme-main "> <div id="TragedyAd" class="ad tragedy-ad nocontent robots-nocontent"></div> <div class="dealbook-branding"></div> <header id="story-header" class="story-header"></header> <div class="story-body-supplemental"> <div class="story-body story-body-1"> <p></p> <p></p> <p></p> <p></p> </div> <div class="supplemental first short" id="supplemental-1" data-between-flex-ads="true" data-pre-height="673" data-max-items="1" data-remaining="673" data-minimum="400" data-last-item-height="873" data-flex-ad-adjacency="true" data-post-height="673"></div> </div> <div class="story-interrupter" id="story-continues-1"></div> <div class="story-body-supplemental"> <div class="story-body story-body-2"> <p></p> <p></p> <p></p> <p></p> </div> <div class="supplemental " id="supplemental-2" data-pre-height="2132" data-max-items="2" data-remaining="242" data-minimum="400" data-last-item-height="945" data-flex-ad-adjacency="false" data-post-height="2132"></div> </div> <div class="reader-satisfaction-survey prompt feedback-prompt story-content hidden"></div> <div id="storage-drawer" class="hidden"></div> </article> ``` [1]: https://github.com/mozilla/readability/pull/337/files#diff-06d8d22df421dacde90a2268083424abR874 [2]: http://www.nytimes.com/2016/07/30/business/dealbook/yahoos-sale-to-verizon-leaves-shareholders-with-little-say.html [3]: https://travis-ci.org/mozilla/readability/builds/191589520
Assignee | ||
Updated•7 years ago
|
Flags: needinfo?(gijskruitbosch+bugs)
Assignee | ||
Updated•7 years ago
|
Assignee: nobody → evan
Assignee | ||
Comment 7•7 years ago
|
||
Already wrote tests for the issue: https://github.com/mozilla/readability/pull/337/files#diff-06d8d22df421dacde90a2268083424abR874
Comment 8•7 years ago
|
||
You don't seem to have used the second commit that removes empty <div> elements, which also seems useful and specifically mentions the NYT (in fact, I think we can probably extend it to remove any empty container such as <div> or <section> or <header> or <h#n> - of course we can't remove empty <br> or <img> or <video>, so we can't remove *every* empty element... but quite some!). It's hard to say from the test failures whether they are serious. At a glance, it seems like we end up picking different container elements. That's only really bad if we end up including content we shouldn't be, or excluding content we do need, and I can't tell whether we are or are not doing that based on the diff and the test results. Can you maybe create a gist of the changes that we'd need to make to the testcase to make this pass?
Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(evan)
Assignee | ||
Comment 9•7 years ago
|
||
Thanks for the review comments. Let's continue to discuss on the pull request page[1]. I've updated for the comments and fixed some test failures. [1]: https://github.com/mozilla/readability/pull/337#issuecomment-273445003
Flags: needinfo?(evan)
Assignee | ||
Comment 10•7 years ago
|
||
Fixed test failures and discussed the patch with reviewer[1]. [1]: https://github.com/mozilla/readability/pull/337/commits/6b133591224d4e485ed6abbb294b362842ddc64f
Assignee | ||
Comment 11•7 years ago
|
||
Updated patch for review comments: https://github.com/mozilla/readability/pull/337/commits/37185c7702eaa84ab871ecdf95efd248ba7db9ea
Assignee | ||
Updated•7 years ago
|
Status: NEW → ASSIGNED
Assignee | ||
Comment 12•7 years ago
|
||
Landed in GitHub repo: https://github.com/mozilla/readability/commit/15e1f0326181c9cd96cf1cda3d1695641cc915bd
Comment 14•7 years ago
|
||
Looks like the same issue: https://www.nytimes.com/2017/01/30/technology/technology-companies-fight-trump-immigration-order-in-court.html expected first paragraph: SEATTLE — Technology executives have for days assailed President Trump’s executive order suspending immigration from seven mostly Muslim countries, framing their arguments largely in moral terms. Found: “Expedia believes that the executive order jeopardizes its corporate mission and could have a detrimental impact on its business and employees, as well as the broader U.S. and global travel and tourism industry,” Robert Dzielak, the company’s general counsel, wrote in the filing. which happens to be after an advertisement.
Assignee | ||
Comment 15•7 years ago
|
||
Kus, Thanks for reporting the issue. The patch[1] can also fix it. [1]: https://github.com/mozilla/readability/commit/15e1f0326181c9cd96cf1cda3d1695641cc915bd
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 18•7 years ago
|
||
mozreview-review |
Comment on attachment 8833844 [details] Bug 1300697, Bug 1259763, Bug 1167568 - Update readability from github repo, https://reviewboard.mozilla.org/r/109976/#review110948 ::: toolkit/components/reader/Readability.js:1983 (Diff revision 2) > } > }; > > if (typeof module === "object") { > module.exports = Readability; > } Let's ensure the try is good[1] before we land it. [1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ed0984533adf
Assignee | ||
Comment 19•7 years ago
|
||
The test failure "INFO TEST-UNEXPECTED-FAIL | testing/mochitest/tests/Harness_sanity/test_SpecialPowersPushPermissions.html | Test timed out."[1] looks like not related with the patch here. Let's re-trigger it again. The failure might be gone. [1]: https://treeherder.mozilla.org/logviewer.html#?job_id=74690063&repo=try&lineNumber=2080
Assignee | ||
Comment 20•7 years ago
|
||
mozreview-review |
Comment on attachment 8833844 [details] Bug 1300697, Bug 1259763, Bug 1167568 - Update readability from github repo, https://reviewboard.mozilla.org/r/109976/#review110988 ::: toolkit/components/reader/Readability.js:1983 (Diff revision 2) > } > }; > > if (typeof module === "object") { > module.exports = Readability; > } The try is good.
Assignee | ||
Comment 21•7 years ago
|
||
The try is good now, ready to land.
Comment 22•7 years ago
|
||
mozreview-review |
Comment on attachment 8833844 [details] Bug 1300697, Bug 1259763, Bug 1167568 - Update readability from github repo, https://reviewboard.mozilla.org/r/109976/#review111022 Can you update the commit message to move the bug numbers to the beginning?
Attachment #8833844 -
Flags: review?(gijskruitbosch+bugs) → review+
Comment hidden (mozreview-request) |
Assignee | ||
Comment 24•7 years ago
|
||
Sure, done. Could you help land the patch? Thanks. :)
Flags: needinfo?(gijskruitbosch+bugs)
Comment 25•7 years ago
|
||
Pushed by gijskruitbosch@gmail.com: https://hg.mozilla.org/integration/autoland/rev/d1bef3268b21 Bug 1259763, Bug 1167568 - Update readability from github repo, r=Gijs
Updated•7 years ago
|
Flags: needinfo?(gijskruitbosch+bugs)
Comment 26•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/d1bef3268b21
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
status-firefox54:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
You need to log in
before you can comment on or make changes to this bug.
Description
•