Closed Bug 881091 Opened 11 years ago Closed 10 years ago

Article showing twice in reader mode

Categories

(Firefox for Android Graveyard :: Reader View, defect)

24 Branch
ARM
Android
defect
Not set
normal

Tracking

(firefox23 affected, firefox24 affected, firefox33 verified, fennec+)

RESOLVED FIXED
Firefox 33
Tracking Status
firefox23 --- affected
firefox24 --- affected
firefox33 --- verified
fennec + ---

People

(Reporter: daniel.nr01, Assigned: eedens)

References

Details

(Keywords: testcase, Whiteboard: [lang=js])

Attachments

(3 files, 3 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20130608 Firefox/24.0 (Nightly/Aurora)
Build ID: 20130608031212

Steps to reproduce:

Visit http://infrequently.org/2013/06/sfuturepromiseg/
Enter reader mode.


Actual results:

The text of the article is duplicated as shown in the screenshot (The tiny font size is only to fit the whole article). The first instance has no paragraphs and smaller font size than usual.

I'm using an Asus Transformer TF300.
OS: Windows 7 → Android
Hardware: x86_64 → ARM
Status: UNCONFIRMED → NEW
tracking-fennec: --- → ?
Ever confirmed: true
Flags: needinfo?(bnicholson)
Sounds like an issue with the Readability.js parse, but I don't really have any information outside of that.
Flags: needinfo?(bnicholson)
Flags: needinfo?(lucasr.at.mozilla)
Minimumish test case at http://people.mozilla.org/~kbrosnan/tmp/881091-edit.html still need to check what classes cause it.
Keywords: testcase
Assignee: nobody → bnicholson
tracking-fennec: ? → 23+
This is happening because of the div with a class called "author". We're picking it as the author blob for the article. We should probably improve our author-fetching logic in the parser to cover this case.
Flags: needinfo?(lucasr.at.mozilla)
This bug affects https://blog.mozilla.org/javascript/2013/07/18/clawing-our-way-back-to-precision/ as well.

There is a "show-author" class but no plain "author" class on the article element.
23+ ship has sailed. Need to re-triage this.
tracking-fennec: 23+ → ?
Whiteboard: [mentor=lucasr][lang=js]
tracking-fennec: ? → +
Mmmm ... not sure what to suggest here ...

Readability.js makes a best guess as to the node to use for the author information based on whether the class has a string that regex matches |/byline|author|dateline|writtenby/i| ... (it uses the first one found)

The example http://infrequently.org/2013/06/sfuturepromiseg/ has multiple matches for |author| ... the body has |s-author-alex| (which it uses) and also there's a div class match |author-alex|, a span class |author|, a div class |comment-author| and etc.

The Minimumish test case at http://people.mozilla.org/~kbrosnan/tmp/881091-edit.html matches div class |author-alex| as designed ...

The example https://blog.mozilla.org/javascript/2013/07/18/clawing-our-way-back-to-precision/ matches an article class |show-author|, an address class |author fn|, and somewhere around 8 other variations.
Relying on class names is, well, not reliable. Some possible ideas to improve it:

- Fetch author information from meta tags if present e.g. <meta name="author" content="Lucas Rocha">
- Fetch author information from twitter cards if present. See: https://dev.twitter.com/docs/cards/markup-reference
- Fetch author information from well-known/widely used metadata schemas/conversions in major news sites. See: http://schema.org/CreativeWork, http://microformats.org/wiki/hnews#Schema, http://dublincore.org/documents/dcmi-terms/, http://ogp.me.

The idea is to prioritize proper metadata over a random element in the page with an author-like class name. We should at least limit the length of the author blob. If it's too long, ignore it.

The same idea probably applies to article publish date.
Should probably unassign myself if Lucas wants to take this as a mentor bug!
Assignee: bnicholson → nobody
Assignee: nobody → eedens
If the byline is really big (over 40 char, or bigger than 1/10 of article), then it is rejected and an empty string is returned.

lucasr, I like your idea for adding extractors based on meta tags, html5 tags, etc. I'll create a bug to manage those, since they're more of an enhancement, rather than a fix for this particular issue.
Attachment #8423381 - Flags: feedback?(lucasr.at.mozilla)
Comment on attachment 8423381 [details] [diff] [review]
WIP - Detect whether byline is a huge chunk of text.

Review of attachment 8423381 [details] [diff] [review]:
-----------------------------------------------------------------

For this particular issue, I think we can just do the simplest thing and just check for a maximum acceptable length before setting articleByline (something like 100 is probably good enough?) i.e. no need to compare with the article's length ratio.

I'd rather focus on fetching the right data from meta tags, twitter cards, and the like.
Attachment #8423381 - Flags: feedback?(lucasr.at.mozilla) → feedback+
Simplified check of byline length: simply checks whether less than 100 chars.
Attachment #8423381 - Attachment is obsolete: true
Attachment #8424045 - Flags: review?(lucasr.at.mozilla)
Comment on attachment 8424045 [details] [diff] [review]
REV - Detect whether byline is a huge chunk of text.

Review of attachment 8424045 [details] [diff] [review]:
-----------------------------------------------------------------

::: mobile/android/chrome/content/Readability.js
@@ +1530,5 @@
>      // }
>  
>      let excerpt = this._getExcerpt(articleContent);
>  
> +    let byline = this._getByline(articleContent);

If you do this here, you might be missing an opportunity to fetch the 'right' node while parsing the page. In other words, maybe we shouldn't even assign a value to this._articleByline if the length doesn't seem right. So, maybe this check should be done here?

http://mxr.mozilla.org/mozilla-central/source/mobile/android/chrome/content/Readability.js#449
Attachment #8424045 - Flags: review?(lucasr.at.mozilla)
I like your suggestion. This version accepts bylines that are less than 100 chars. If a byline fails, it is tried again on the next pass.
Attachment #8424045 - Attachment is obsolete: true
Attachment #8426559 - Flags: review?(lucasr.at.mozilla)
Comment on attachment 8426559 [details] [diff] [review]
REV - Detect whether byline is a huge chunk of text

Review of attachment 8426559 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, please test on a few other sites looking for any regressions. Unfortunately, the integration tests I wrote in bug 786638 got blocked on some infrastructure issues...

::: mobile/android/chrome/content/Readability.js
@@ +723,5 @@
>    /**
> +   * Check whether the input string could be a byline. 
> +   * This verifies that the input is a string, and that the length 
> +   * is less than 100 chars.
> +   * 

nit: remove trailing space.
Attachment #8426559 - Flags: review?(lucasr.at.mozilla) → review+
Eric, any reason for not having landed this yet?
Flags: needinfo?(eedens)
Mentor: lucasr.at.mozilla
Whiteboard: [mentor=lucasr][lang=js] → [lang=js]
Mentor: lucasr.at.mozilla
Fixed nit  (removed 8 trailing whitespace characters)
Attachment #8426559 - Attachment is obsolete: true
Lucas, no reason -- this just fell off the radar.

I fixed the trailing whitepace, and also tested on a fresh MC pull.

+checkin-needed
Flags: needinfo?(eedens)
Keywords: checkin-needed
Can we please run this through Try first? Thanks :)
Keywords: checkin-needed
Thanks Ryan :)
All passing: https://tbpl.mozilla.org/?tree=Try&rev=cf9b0f97e987
Keywords: checkin-needed
https://hg.mozilla.org/integration/fx-team/rev/64fb83a88fac
Keywords: checkin-needed
Whiteboard: [lang=js] → [lang=js][fixed-in-fx-team]
https://hg.mozilla.org/mozilla-central/rev/64fb83a88fac
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Whiteboard: [lang=js][fixed-in-fx-team] → [lang=js]
Target Milestone: --- → Firefox 33
Verified as fixed in:
Build: Firefox for Android 33.0a2 (2014-06-30)
Device: Asus Transformer Pad TF300T (Android 4.2.1) and Nexus 4 (Android 4.4.2)
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: