Closed Bug 1255978 Opened 4 years ago Closed 3 years ago

Reader View displays list of other articles when used on article on www.independent.co.uk

Categories

(Toolkit :: Reader Mode, defect, P2)

45 Branch
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: cagdasalagoz, Assigned: evanxd)

References

(Blocks 1 open bug)

Details

(Whiteboard: [reader-mode-readability-algorithm])

Attachments

(1 file)

Attached image 2.png
User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0
Build ID: 20160309193552

Steps to reproduce:

1. Go here: http://www.independent.co.uk/news/business/news/seven-secrets-that-hotel-owners-dont-want-you-to-know-10506160.html
2. Start the reader view from clicking the icon in the right side of the address bar.



Actual results:

It doesn't show you what you need. Displays other texts from anywhere in the page.


Expected results:

It should display the text people wanted to read. Wich is news itself.
Component: Untriaged → Reader Mode
Product: Firefox → Toolkit
Priority: -- → P2
Summary: Reader View doesn't display quite right at www.independent.co.uk → Reader View displays list of other articles when used on article on www.independent.co.uk
Status: UNCONFIRMED → NEW
Ever confirmed: true
Whiteboard: [reader-mode-readability-algorithm]
Blocks: 1286221
Hi Gijs,

Looks like the root cause is that we score the `ul` list contains lots of paragraphs of text in the webpage[1].

In other some cases, list might contain lots of text. We might need to avoid scoring list(e.g. <ul> or <ol> tags) like below(HTML code from the webpage listed on Comment 0[1]) to fix this issue. What do you think? Is that a good approach to fix this issue?

```
<ul class="legends">
    <li data-gallery-legend="1">
        <h2><span class="counter">1/50</span> United goin down </h2>
        <p>Manchester United's absence from the Champions League hurt more than the fans' pride last year - it also dented the bottom line. Revenues at the New Uork listed club dipped 8.8 per cent to £395.2m in the year to June, triggering a £1.2 million loss after broadcasting and sponsorship deals dried up. The club said it was now looking to raise $400m from a share issue.</p>
        <p class="credits">GETTY IMAGES</p>
    </li>
    <li>
        ...
    </li>
    ...
</ul>
```
(full html code[2])

[1]: http://www.independent.co.uk/news/business/news/seven-secrets-that-hotel-owners-dont-want-you-to-know-10506160.html
[2]: https://gist.github.com/evanxd/15286711e489a8469595aed586a81ebb
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #1)
> In other some cases, list might contain lots of text. We might need to avoid
> scoring list(e.g. <ul> or <ol> tags) like below(HTML code from the webpage
> listed on Comment 0[1]) to fix this issue. What do you think? Is that a good
> approach to fix this issue?

No, the lists in the main content of the webpage could still be important.

Instead, we should downscore the 'legends' class (using one of these regexes: https://github.com/mozilla/readability/blob/master/Readability.js#L115-L119 ), and hope that that's sufficient.
Flags: needinfo?(gijskruitbosch+bugs)
Thanks, Gijs. Adding `legends` into the `unlikelyCandidates` regex can work.

But there is another issue here after I added it, `stripUnlikelyCandidates` will be `false` when `matchString` is `legends `. Looks like this might be another new issue not related with this bug. Do you know what's going on there?

Current I add a workaround in a WIP patch(including tests)[1] to make it work. Could you take a look? Thanks.

[1]: https://github.com/evanxd/readability/commit/1651de7b8b4eb3354bc66bac42b9b0dc1e800937
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #3)
> Thanks, Gijs. Adding `legends` into the `unlikelyCandidates` regex can work.
> 
> But there is another issue here after I added it, `stripUnlikelyCandidates`
> will be `false` when `matchString` is `legends `. Looks like this might be
> another new issue not related with this bug. Do you know what's going on
> there?
> 
> Current I add a workaround in a WIP patch(including tests)[1] to make it
> work. Could you take a look? Thanks.
> 
> [1]:
> https://github.com/evanxd/readability/commit/
> 1651de7b8b4eb3354bc66bac42b9b0dc1e800937

Note: your patch replaces "popup" with "popu" in that regex.

As for what's wrong, the readability algorithm will first run and remove unlikely candidates. If too much stuff is gone from the HTML by then, and it can't find anything to return, it'll rerun on the original DOM and not remove unlikely candidates. So it sounds like the first time around when we would strip unlikely candidates, maybe we already strip some ancestor of the sidebar, but also end up stripping the main content, then find nothing and rerun the algorithm without stripping unlikely candidates? So what's causing the main content to be removed or not recognized the first time we run (when we *are* stripping unlikely candidates?
Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(evan)
(In reply to :Gijs Kruitbosch (PTO Nov. 17-20 inclusive) from comment #4)
> (In reply to Evan Tseng [:evanxd][:愛聞插低] from comment #3)
> > Thanks, Gijs. Adding `legends` into the `unlikelyCandidates` regex can work.
> > 
> > But there is another issue here after I added it, `stripUnlikelyCandidates`
> > will be `false` when `matchString` is `legends `. Looks like this might be
> > another new issue not related with this bug. Do you know what's going on
> > there?
> > 
> > Current I add a workaround in a WIP patch(including tests)[1] to make it
> > work. Could you take a look? Thanks.
> > 
> > [1]:
> > https://github.com/evanxd/readability/commit/
> > 1651de7b8b4eb3354bc66bac42b9b0dc1e800937
> 
> Note: your patch replaces "popup" with "popu" in that regex.
Thanks, updated[4].

> As for what's wrong, the readability algorithm will first run and remove
> unlikely candidates. If too much stuff is gone from the HTML by then, and it
> can't find anything to return, it'll rerun on the original DOM and not
> remove unlikely candidates. So it sounds like the first time around when we
> would strip unlikely candidates, maybe we already strip some ancestor of the
> sidebar, but also end up stripping the main content, then find nothing and
> rerun the algorithm without stripping unlikely candidates? So what's causing
> the main content to be removed or not recognized the first time we run (when
> we *are* stripping unlikely candidates?

The `legends` <ul> node in the webpage[1] supports to be grabbed and removed in the first cycle of the loop in the `_grabArticle` method[2] with using the patch, but it doesn't.

I don't know the root cause yet. I'll continue to investigate and find out the root cause. Thanks for the hint.

[1]: https://raw.githubusercontent.com/evanxd/readability/1651de7b8b4eb3354bc66bac42b9b0dc1e800937/test/test-pages/bug-1255978/source.html
[2]: https://github.com/mozilla/readability/blob/master/Readability.js#L644
[3]: https://github.com/evanxd/readability/commit/1651de7b8b4eb3354bc66bac42b9b0dc1e800937
[4]: https://github.com/evanxd/readability/commit/8b4a9de7e9156110d62814c902659b714e5705dc
Flags: needinfo?(evan)
Found out the root cause. The `legends` <ul> node cannot be grabbed in first round because its parent node's id, "gigya-share-btns-2_gig_containerParent" contains the "share" keyword which means our algorithm will remove the unlikely candidate.

According to the DOM structure of the parent node, like below, I would like to a new item `text` into the `okMaybeItsACandidate` regexp to fix it.

```
<div class="text-wrapper" itemprop="articleBody" id="gigya-share-btns-2_gig_containerParent">
    <p>Most people go to hotels for the pleasure of sleeping in a giant bed with clean white sheets and waking up to fresh towels in the morning.</p>
    ...
    ...
        <ul class="legends">
            <li data-gallery-legend="1">
            ...
```
Sent a pull request for reviewing.
Assignee: nobody → evan
Discussed regression issues with reviewer: https://github.com/mozilla/readability/pull/319#issuecomment-262446512
The patch[1] in Bug 1173548 updates readability from github repo, includes fix for Bug 1173548 and Bug 1255978. It's on review process now.

[1]: https://reviewboard.mozilla.org/r/95940/diff/2/
Flags: needinfo?(gijskruitbosch+bugs)
Flags: needinfo?(gijskruitbosch+bugs)
landed: https://hg.mozilla.org/mozilla-central/rev/e9ff40069326
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.