Created attachment 760230 [details] Screenshot of duplicated article User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20130608 Firefox/24.0 (Nightly/Aurora) Build ID: 20130608031212 Steps to reproduce: Visit http://infrequently.org/2013/06/sfuturepromiseg/ Enter reader mode. Actual results: The text of the article is duplicated as shown in the screenshot (The tiny font size is only to fit the whole article). The first instance has no paragraphs and smaller font size than usual. I'm using an Asus Transformer TF300.
Status: UNCONFIRMED → NEW
tracking-fennec: --- → ?
status-firefox23: --- → affected
status-firefox24: --- → affected
Ever confirmed: true
Sounds like an issue with the Readability.js parse, but I don't really have any information outside of that.
Minimumish test case at http://people.mozilla.org/~kbrosnan/tmp/881091-edit.html still need to check what classes cause it.
Assignee: nobody → bnicholson
tracking-fennec: ? → 23+
This is happening because of the div with a class called "author". We're picking it as the author blob for the article. We should probably improve our author-fetching logic in the parser to cover this case.
23+ ship has sailed. Need to re-triage this.
tracking-fennec: 23+ → ?
Relying on class names is, well, not reliable. Some possible ideas to improve it: - Fetch author information from meta tags if present e.g. <meta name="author" content="Lucas Rocha"> - Fetch author information from twitter cards if present. See: https://dev.twitter.com/docs/cards/markup-reference - Fetch author information from well-known/widely used metadata schemas/conversions in major news sites. See: http://schema.org/CreativeWork, http://microformats.org/wiki/hnews#Schema, http://dublincore.org/documents/dcmi-terms/, http://ogp.me. The idea is to prioritize proper metadata over a random element in the page with an author-like class name. We should at least limit the length of the author blob. If it's too long, ignore it. The same idea probably applies to article publish date.
Should probably unassign myself if Lucas wants to take this as a mentor bug!
Assignee: bnicholson → nobody
Created attachment 8423381 [details] [diff] [review] WIP - Detect whether byline is a huge chunk of text. If the byline is really big (over 40 char, or bigger than 1/10 of article), then it is rejected and an empty string is returned. lucasr, I like your idea for adding extractors based on meta tags, html5 tags, etc. I'll create a bug to manage those, since they're more of an enhancement, rather than a fix for this particular issue.
Attachment #8423381 - Flags: feedback?(lucasr.at.mozilla)
Comment on attachment 8423381 [details] [diff] [review] WIP - Detect whether byline is a huge chunk of text. Review of attachment 8423381 [details] [diff] [review]: ----------------------------------------------------------------- For this particular issue, I think we can just do the simplest thing and just check for a maximum acceptable length before setting articleByline (something like 100 is probably good enough?) i.e. no need to compare with the article's length ratio. I'd rather focus on fetching the right data from meta tags, twitter cards, and the like.
Attachment #8423381 - Flags: feedback?(lucasr.at.mozilla) → feedback+
Created attachment 8424045 [details] [diff] [review] REV - Detect whether byline is a huge chunk of text. Simplified check of byline length: simply checks whether less than 100 chars.
Comment on attachment 8424045 [details] [diff] [review] REV - Detect whether byline is a huge chunk of text. Review of attachment 8424045 [details] [diff] [review]: ----------------------------------------------------------------- ::: mobile/android/chrome/content/Readability.js @@ +1530,5 @@ > // } > > let excerpt = this._getExcerpt(articleContent); > > + let byline = this._getByline(articleContent); If you do this here, you might be missing an opportunity to fetch the 'right' node while parsing the page. In other words, maybe we shouldn't even assign a value to this._articleByline if the length doesn't seem right. So, maybe this check should be done here? http://mxr.mozilla.org/mozilla-central/source/mobile/android/chrome/content/Readability.js#449
Created attachment 8426559 [details] [diff] [review] REV - Detect whether byline is a huge chunk of text I like your suggestion. This version accepts bylines that are less than 100 chars. If a byline fails, it is tried again on the next pass.
Comment on attachment 8426559 [details] [diff] [review] REV - Detect whether byline is a huge chunk of text Review of attachment 8426559 [details] [diff] [review]: ----------------------------------------------------------------- Looks good, please test on a few other sites looking for any regressions. Unfortunately, the integration tests I wrote in bug 786638 got blocked on some infrastructure issues... ::: mobile/android/chrome/content/Readability.js @@ +723,5 @@ > /** > + * Check whether the input string could be a byline. > + * This verifies that the input is a string, and that the length > + * is less than 100 chars. > + * nit: remove trailing space.
Attachment #8426559 - Flags: review?(lucasr.at.mozilla) → review+
Eric, any reason for not having landed this yet?
Whiteboard: [mentor=lucasr][lang=js] → [lang=js]
Duplicate of this bug: 1026147
Created attachment 8443024 [details] [diff] [review] bug-881091-fix.patch Fixed nit (removed 8 trailing whitespace characters)
Attachment #8426559 - Attachment is obsolete: true
Lucas, no reason -- this just fell off the radar. I fixed the trailing whitepace, and also tested on a fresh MC pull. +checkin-needed
Can we please run this through Try first? Thanks :)
Thanks Ryan :) All passing: https://tbpl.mozilla.org/?tree=Try&rev=cf9b0f97e987
Whiteboard: [lang=js] → [lang=js][fixed-in-fx-team]
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Whiteboard: [lang=js][fixed-in-fx-team] → [lang=js]
Target Milestone: --- → Firefox 33
Verified as fixed in: Build: Firefox for Android 33.0a2 (2014-06-30) Device: Asus Transformer Pad TF300T (Android 4.2.1) and Nexus 4 (Android 4.4.2)
status-firefox33: --- → verified
You need to log in before you can comment on or make changes to this bug.