Article showing twice in reader mode

RESOLVED FIXED in Firefox 33

Status

()

Firefox for Android
Reader View
RESOLVED FIXED
5 years ago
2 years ago

People

(Reporter: Daniel, Assigned: eedens)

Tracking

({testcase})

24 Branch
Firefox 33
ARM
Android
testcase
Points:
---

Firefox Tracking Flags

(firefox23 affected, firefox24 affected, firefox33 verified, fennec+)

Details

(Whiteboard: [lang=js])

Attachments

(3 attachments, 3 obsolete attachments)

(Reporter)

Description

5 years ago
Created attachment 760230 [details]
Screenshot of duplicated article

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20130608 Firefox/24.0 (Nightly/Aurora)
Build ID: 20130608031212

Steps to reproduce:

Visit http://infrequently.org/2013/06/sfuturepromiseg/
Enter reader mode.


Actual results:

The text of the article is duplicated as shown in the screenshot (The tiny font size is only to fit the whole article). The first instance has no paragraphs and smaller font size than usual.

I'm using an Asus Transformer TF300.
(Reporter)

Updated

5 years ago
OS: Windows 7 → Android
Hardware: x86_64 → ARM

Updated

5 years ago
Status: UNCONFIRMED → NEW
tracking-fennec: --- → ?
status-firefox23: --- → affected
status-firefox24: --- → affected
Ever confirmed: true
Flags: needinfo?(bnicholson)
Sounds like an issue with the Readability.js parse, but I don't really have any information outside of that.
Flags: needinfo?(bnicholson)

Updated

5 years ago
Flags: needinfo?(lucasr.at.mozilla)
Minimumish test case at http://people.mozilla.org/~kbrosnan/tmp/881091-edit.html still need to check what classes cause it.
Created attachment 760661 [details]
minimal test case
Keywords: testcase
Assignee: nobody → bnicholson
tracking-fennec: ? → 23+
This is happening because of the div with a class called "author". We're picking it as the author blob for the article. We should probably improve our author-fetching logic in the parser to cover this case.
Flags: needinfo?(lucasr.at.mozilla)
(Reporter)

Comment 5

5 years ago
This bug affects https://blog.mozilla.org/javascript/2013/07/18/clawing-our-way-back-to-precision/ as well.

There is a "show-author" class but no plain "author" class on the article element.
23+ ship has sailed. Need to re-triage this.
tracking-fennec: 23+ → ?

Updated

5 years ago
Whiteboard: [mentor=lucasr][lang=js]
tracking-fennec: ? → +
Mmmm ... not sure what to suggest here ...

Readability.js makes a best guess as to the node to use for the author information based on whether the class has a string that regex matches |/byline|author|dateline|writtenby/i| ... (it uses the first one found)

The example http://infrequently.org/2013/06/sfuturepromiseg/ has multiple matches for |author| ... the body has |s-author-alex| (which it uses) and also there's a div class match |author-alex|, a span class |author|, a div class |comment-author| and etc.

The Minimumish test case at http://people.mozilla.org/~kbrosnan/tmp/881091-edit.html matches div class |author-alex| as designed ...

The example https://blog.mozilla.org/javascript/2013/07/18/clawing-our-way-back-to-precision/ matches an article class |show-author|, an address class |author fn|, and somewhere around 8 other variations.
Relying on class names is, well, not reliable. Some possible ideas to improve it:

- Fetch author information from meta tags if present e.g. <meta name="author" content="Lucas Rocha">
- Fetch author information from twitter cards if present. See: https://dev.twitter.com/docs/cards/markup-reference
- Fetch author information from well-known/widely used metadata schemas/conversions in major news sites. See: http://schema.org/CreativeWork, http://microformats.org/wiki/hnews#Schema, http://dublincore.org/documents/dcmi-terms/, http://ogp.me.

The idea is to prioritize proper metadata over a random element in the page with an author-like class name. We should at least limit the length of the author blob. If it's too long, ignore it.

The same idea probably applies to article publish date.
Should probably unassign myself if Lucas wants to take this as a mentor bug!
Assignee: bnicholson → nobody
(Assignee)

Updated

4 years ago
Assignee: nobody → eedens
(Assignee)

Comment 10

4 years ago
Created attachment 8423381 [details] [diff] [review]
WIP - Detect whether byline is a huge chunk of text.

If the byline is really big (over 40 char, or bigger than 1/10 of article), then it is rejected and an empty string is returned.

lucasr, I like your idea for adding extractors based on meta tags, html5 tags, etc. I'll create a bug to manage those, since they're more of an enhancement, rather than a fix for this particular issue.
Attachment #8423381 - Flags: feedback?(lucasr.at.mozilla)
Comment on attachment 8423381 [details] [diff] [review]
WIP - Detect whether byline is a huge chunk of text.

Review of attachment 8423381 [details] [diff] [review]:
-----------------------------------------------------------------

For this particular issue, I think we can just do the simplest thing and just check for a maximum acceptable length before setting articleByline (something like 100 is probably good enough?) i.e. no need to compare with the article's length ratio.

I'd rather focus on fetching the right data from meta tags, twitter cards, and the like.
Attachment #8423381 - Flags: feedback?(lucasr.at.mozilla) → feedback+
(Assignee)

Comment 12

4 years ago
Created attachment 8424045 [details] [diff] [review]
REV - Detect whether byline is a huge chunk of text.

Simplified check of byline length: simply checks whether less than 100 chars.
Attachment #8423381 - Attachment is obsolete: true
Attachment #8424045 - Flags: review?(lucasr.at.mozilla)
Comment on attachment 8424045 [details] [diff] [review]
REV - Detect whether byline is a huge chunk of text.

Review of attachment 8424045 [details] [diff] [review]:
-----------------------------------------------------------------

::: mobile/android/chrome/content/Readability.js
@@ +1530,5 @@
>      // }
>  
>      let excerpt = this._getExcerpt(articleContent);
>  
> +    let byline = this._getByline(articleContent);

If you do this here, you might be missing an opportunity to fetch the 'right' node while parsing the page. In other words, maybe we shouldn't even assign a value to this._articleByline if the length doesn't seem right. So, maybe this check should be done here?

http://mxr.mozilla.org/mozilla-central/source/mobile/android/chrome/content/Readability.js#449
Attachment #8424045 - Flags: review?(lucasr.at.mozilla)
(Assignee)

Comment 14

4 years ago
Created attachment 8426559 [details] [diff] [review]
REV - Detect whether byline is a huge chunk of text

I like your suggestion. This version accepts bylines that are less than 100 chars. If a byline fails, it is tried again on the next pass.
Attachment #8424045 - Attachment is obsolete: true
Attachment #8426559 - Flags: review?(lucasr.at.mozilla)
Comment on attachment 8426559 [details] [diff] [review]
REV - Detect whether byline is a huge chunk of text

Review of attachment 8426559 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, please test on a few other sites looking for any regressions. Unfortunately, the integration tests I wrote in bug 786638 got blocked on some infrastructure issues...

::: mobile/android/chrome/content/Readability.js
@@ +723,5 @@
>    /**
> +   * Check whether the input string could be a byline. 
> +   * This verifies that the input is a string, and that the length 
> +   * is less than 100 chars.
> +   * 

nit: remove trailing space.
Attachment #8426559 - Flags: review?(lucasr.at.mozilla) → review+
Eric, any reason for not having landed this yet?
Flags: needinfo?(eedens)
Mentor: lucasr.at.mozilla@gmail.com
Whiteboard: [mentor=lucasr][lang=js] → [lang=js]
Mentor: lucasr.at.mozilla@gmail.com
Duplicate of this bug: 1026147
(Assignee)

Comment 18

4 years ago
Created attachment 8443024 [details] [diff] [review]
bug-881091-fix.patch

Fixed nit  (removed 8 trailing whitespace characters)
Attachment #8426559 - Attachment is obsolete: true
(Assignee)

Comment 19

4 years ago
Lucas, no reason -- this just fell off the radar.

I fixed the trailing whitepace, and also tested on a fresh MC pull.

+checkin-needed
Flags: needinfo?(eedens)
Keywords: checkin-needed
Can we please run this through Try first? Thanks :)
Keywords: checkin-needed
(Assignee)

Comment 21

4 years ago
Thanks Ryan :)
All passing: https://tbpl.mozilla.org/?tree=Try&rev=cf9b0f97e987
Keywords: checkin-needed
https://hg.mozilla.org/integration/fx-team/rev/64fb83a88fac
Keywords: checkin-needed
Whiteboard: [lang=js] → [lang=js][fixed-in-fx-team]
https://hg.mozilla.org/mozilla-central/rev/64fb83a88fac
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Whiteboard: [lang=js][fixed-in-fx-team] → [lang=js]
Target Milestone: --- → Firefox 33
Verified as fixed in:
Build: Firefox for Android 33.0a2 (2014-06-30)
Device: Asus Transformer Pad TF300T (Android 4.2.1) and Nexus 4 (Android 4.4.2)
status-firefox33: --- → verified
You need to log in before you can comment on or make changes to this bug.