Closed Bug 1300697 Opened 8 years ago Closed 7 years ago

Reader View missed first few paragraphs on New York Times website

Categories

(Toolkit :: Reader Mode, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox54 --- fixed

People

(Reporter: fireattack, Assigned: evanxd)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [reader-mode-readability-algorithm])

Attachments

(1 file)

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0
Build ID: 20160902105049

Steps to reproduce:

Open http://www.nytimes.com/2016/07/30/business/dealbook/yahoos-sale-to-verizon-leaves-shareholders-with-little-say.html

Enter Reader View.




Actual results:

The first few paragraphs are missing in the Reader View.


Expected results:

The first few paragraphs should be there.
Component: Untriaged → Reader Mode
Product: Firefox → Toolkit
Blocks: 1324630
Whiteboard: [reader-mode-readability-algorithm]
Reader view sometimes drops the first few paragraphs, e.g. if they're in a larger font, or drops the intended article and uses the text from an entirely different article.
I can reproduce the issue mentioned on Comment 0 for 100%.

The root cause is that we still choose a wrong top candidate `<div class="story-body story-body-2">`. We should choose the `<article id="story" class="story theme-main">` element since the structure of DOM tree is
```
<article id="story" class="story theme-main">
  <div class="story-body-supplemental">
    <div class="story-body story-body-1">
  </div>
  <div class="story-body-supplemental">
    <div class="story-body story-body-2">
  </div>
</article>
```

The `<div class="story-body story-body-1">` element contains the missing paragraphs. Choosing a better top candidate can fix it. But not sure how to do it yet.
(In reply to Evan Tseng [:evanxd] from comment #2)
> I can reproduce the issue mentioned on Comment 0 for 100%.
> 
> The root cause is that we still choose a wrong top candidate `<div
> class="story-body story-body-2">`. We should choose the `<article id="story"
> class="story theme-main">` element since the structure of DOM tree is
> ```
> <article id="story" class="story theme-main">
>   <div class="story-body-supplemental">
>     <div class="story-body story-body-1">
>   </div>
>   <div class="story-body-supplemental">
>     <div class="story-body story-body-2">
>   </div>
> </article>
> ```
> 
> The `<div class="story-body story-body-1">` element contains the missing
> paragraphs. Choosing a better top candidate can fix it. But not sure how to
> do it yet.

Is this not fixed by https://github.com/mozilla/readability/pull/309 ?
Flags: needinfo?(evan)
Summary: Reader View missed first view paragraphs on New York Times website → Reader View missed first few paragraphs on New York Times website
No, PR 309 doesn't fix it. Because other top candidates don't close enough with the `topCandidate` node. And there is only one other top candidate is part of main content.

We can update the two number, 0.25 to 0.8 and `numberOfHighScoreNode >= 3` to `numberOfHighScoreNode >= 1`, to fix this issue with PR 309. But I think that way is not what we want.

I think we need to find a new way to fix it. I have an idea, we can try to select a new `topCandidate` in below dom structure if there is other node with `story-body` or `story-body-2` class and its `innerText` is long enough. (We could try to believe other nodes with same class which `topCandidate` has might be part of main content) 

Then we can select their parent as new `topCandidate`, e.g. in this case, if we search node with `story-body` class and ensure its `innerText` is long enough we'll find out `<article id="story" class="story theme-main">` is the new `topCandidate`(parent of `<div class="story-body story-body-1">` and `<div class="story-body story-body-2">` nodes).
```
<article id="story" class="story theme-main">
  <div class="story-body-supplemental">
    <div class="story-body story-body-1">
  </div>
  <div class="story-body-supplemental">
    <div class="story-body story-body-2">
  </div>
</article>
```

I know the above idea might be a little bit tricky. What do you think or have other idea?
Flags: needinfo?(evan) → needinfo?(gijskruitbosch+bugs)
Doing some kind of class matching does seem like an interesting idea. Something else though, is that at least in the mock DOM structure you provided in comment #2 and comment #4 is that there's <div> elements with only one element child, another <div>. I believe that some of the code in the last few commits in pr https://github.com/mozilla/readability/pull/313 (specifically, https://github.com/mozilla/readability/pull/313/commits/8ec087f0ef72585ae03a63249868eb3b2689a9d8 and https://github.com/mozilla/readability/pull/313/commits/310b636e79bea25e2a1f8583e8b13b740610f01b ) might help with solving this, as it specifically mentions nytimes.com and IIRC was concerned with making the "sibling candidate" logic that we have right now work in trees that have these singly-nested cases.
Flags: needinfo?(gijskruitbosch+bugs)
This patch[1] could find a better `topCandidate` for the article[2] directly, unless we change `while (parentOfTopCandidate.tagName != "BODY" && parentOfTopCandidate.children.length == 1)` to `while (parentOfTopCandidate.tagName != "BODY" && parentOfTopCandidate.children.length <= 2)`. But this change will still cause many existed test failed[3] and seems not make sense(why <= 2?).

Sorry I didn't show full dom structure of the `<article>` node previously. The full dom structure looks like the below (at least two elements belong to each `<div class="story-body-supplemental">` node)

So I think we still need a new idea to fix this issue. How about the previous idea on Comment 4, do class matching? What do you think?

```
<article id="story" class="story theme-main   ">

    <div id="TragedyAd" class="ad tragedy-ad nocontent robots-nocontent"></div>

    <div class="dealbook-branding"></div>

    <header id="story-header" class="story-header"></header>

    <div class="story-body-supplemental">
        <div class="story-body story-body-1">
			<p></p>
			<p></p>
			<p></p>
			<p></p>
        </div>
        <div class="supplemental first short" id="supplemental-1" data-between-flex-ads="true" data-pre-height="673" data-max-items="1" data-remaining="673" data-minimum="400" data-last-item-height="873" data-flex-ad-adjacency="true" data-post-height="673"></div>
    </div>

    <div class="story-interrupter" id="story-continues-1"></div>

    <div class="story-body-supplemental">
        <div class="story-body story-body-2">
			<p></p>
			<p></p>
			<p></p>
			<p></p>
        </div>
        <div class="supplemental " id="supplemental-2" data-pre-height="2132" data-max-items="2" data-remaining="242" data-minimum="400" data-last-item-height="945" data-flex-ad-adjacency="false" data-post-height="2132"></div>
    </div>

    <div class="reader-satisfaction-survey prompt feedback-prompt story-content hidden"></div>

    <div id="storage-drawer" class="hidden"></div>
</article>
```

[1]: https://github.com/mozilla/readability/pull/337/files#diff-06d8d22df421dacde90a2268083424abR874
[2]: http://www.nytimes.com/2016/07/30/business/dealbook/yahoos-sale-to-verizon-leaves-shareholders-with-little-say.html
[3]: https://travis-ci.org/mozilla/readability/builds/191589520
Flags: needinfo?(gijskruitbosch+bugs)
Assignee: nobody → evan
You don't seem to have used the second commit that removes empty <div> elements, which also seems useful and specifically mentions the NYT (in fact, I think we can probably extend it to remove any empty container such as <div> or <section> or <header> or <h#n> - of course we can't remove empty <br> or <img> or <video>, so we can't remove *every* empty element... but quite some!).

It's hard to say from the test failures whether they are serious. At a glance, it seems like we end up picking different container elements. That's only really bad if we end up including content we shouldn't be, or excluding content we do need, and I can't tell whether we are or are not doing that based on the diff and the test results. Can you maybe create a gist of the changes that we'd need to make to the testcase to make this pass?
Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(evan)
Thanks for the review comments. Let's continue to discuss on the pull request page[1]. I've updated for the comments and fixed some test failures.

[1]: https://github.com/mozilla/readability/pull/337#issuecomment-273445003
Flags: needinfo?(evan)
Fixed test failures and discussed the patch with reviewer[1].

[1]: https://github.com/mozilla/readability/pull/337/commits/6b133591224d4e485ed6abbb294b362842ddc64f
Status: NEW → ASSIGNED
Looks like the same issue: 

https://www.nytimes.com/2017/01/30/technology/technology-companies-fight-trump-immigration-order-in-court.html 

expected first paragraph: 
SEATTLE — Technology executives have for days assailed President Trump’s executive order suspending immigration from seven mostly Muslim countries, framing their arguments largely in moral terms. 

Found: 
“Expedia believes that the executive order jeopardizes its corporate mission and could have a detrimental impact on its business and employees, as well as the broader U.S. and global travel and tourism industry,” Robert Dzielak, the company’s general counsel, wrote in the filing. 

which happens to be after an advertisement.
Kus,

Thanks for reporting the issue.

The patch[1] can also fix it.

[1]: https://github.com/mozilla/readability/commit/15e1f0326181c9cd96cf1cda3d1695641cc915bd
Comment on attachment 8833844 [details]
Bug 1300697, Bug 1259763, Bug 1167568 - Update readability from github repo,

https://reviewboard.mozilla.org/r/109976/#review110948

::: toolkit/components/reader/Readability.js:1983
(Diff revision 2)
>    }
>  };
>  
>  if (typeof module === "object") {
>    module.exports = Readability;
>  }

Let's ensure the try is good[1] before we land it.

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ed0984533adf
The test failure "INFO TEST-UNEXPECTED-FAIL | testing/mochitest/tests/Harness_sanity/test_SpecialPowersPushPermissions.html | Test timed out."[1] looks like not related with the patch here.

Let's re-trigger it again. The failure might be gone.

[1]: https://treeherder.mozilla.org/logviewer.html#?job_id=74690063&repo=try&lineNumber=2080
Comment on attachment 8833844 [details]
Bug 1300697, Bug 1259763, Bug 1167568 - Update readability from github repo,

https://reviewboard.mozilla.org/r/109976/#review110988

::: toolkit/components/reader/Readability.js:1983
(Diff revision 2)
>    }
>  };
>  
>  if (typeof module === "object") {
>    module.exports = Readability;
>  }

The try is good.
The try is good now, ready to land.
Comment on attachment 8833844 [details]
Bug 1300697, Bug 1259763, Bug 1167568 - Update readability from github repo,

https://reviewboard.mozilla.org/r/109976/#review111022

Can you update the commit message to move the bug numbers to the beginning?
Attachment #8833844 - Flags: review?(gijskruitbosch+bugs) → review+
Sure, done. Could you help land the patch? Thanks. :)
Flags: needinfo?(gijskruitbosch+bugs)
Flags: needinfo?(gijskruitbosch+bugs)
https://hg.mozilla.org/mozilla-central/rev/d1bef3268b21
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Blocks: 1329358
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: