<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Comment 2

•

12 years ago

Minimal-ish test case at http://people.mozilla.org/~kbrosnan/tmp/881091-edit.html. Still need to check which classes cause it.

Comment 3

•

12 years ago

Attached file minimal test case — Details

Mark Finkle (:mfinkle) (use needinfo?)

Updated

•

12 years ago

Keywords: testcase

Updated

•

12 years ago

Assignee: nobody → bnicholson

tracking-fennec: ? → 23+

Comment 4

•

12 years ago

This is happening because of the div with a class called "author". We're picking it as the author blob for the article. We should probably improve our author-fetching logic in the parser to cover this case.

Flags: needinfo?(lucasr.at.mozilla)

Daniel

Reporter

Comment 5

•

12 years ago

This bug affects https://blog.mozilla.org/javascript/2013/07/18/clawing-our-way-back-to-precision/ as well. There is a "show-author" class but no plain "author" class on the article element.

Comment 6

•

12 years ago

23+ ship has sailed. Need to re-triage this.

tracking-fennec: 23+ → ?

Mark Finkle (:mfinkle) (use needinfo?)

Updated

•

12 years ago

Whiteboard: [mentor=lucasr][lang=js]

Updated

•

12 years ago

tracking-fennec: ? → +

Mark Capella [:capella]

Comment 7

•

12 years ago

Brian Nicholson (:bnicholson)

Comment 8

•

12 years ago

Relying on class names is, well, not reliable. Some possible ideas to improve it:
- Fetch author information from meta tags if present e.g. <meta name="author" content="Lucas Rocha">
- Fetch author information from twitter cards if present. See: https://dev.twitter.com/docs/cards/markup-reference
- Fetch author information from well-known/widely used metadata schemas/conversions in major news sites. See: http://schema.org/CreativeWork, http://microformats.org/wiki/hnews#Schema, http://dublincore.org/documents/dcmi-terms/, http://ogp.me.

The idea is to prioritize proper metadata over a random element in the page with an author-like class name. We should at least limit the length of the author blob. If it's too long, ignore it. The same idea probably applies to article publish date.
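
A minimal sketch of the "metadata first" idea, for illustration only. Nothing here is from Readability.js: the helper name `pickAuthorFromMeta`, the key list and its priority order, and the plain-object input shape (`{ name, property, content }` standing in for `<meta>` elements) are all assumptions. The 100-character cap is likewise just a placeholder for "limit the length of the author blob".

```javascript
// Hypothetical helper, not part of Readability.js.
// Keys are checked in this (assumed) priority order.
const AUTHOR_META_KEYS = [
  "author",          // <meta name="author" ...>
  "twitter:creator", // Twitter card markup
  "article:author",  // Open Graph article property
  "dc.creator",      // Dublin Core in HTML meta
];

// Placeholder limit for "too long to be an author blob".
const MAX_AUTHOR_LENGTH = 100;

// metaTags: array of plain objects like { name, property, content }.
function pickAuthorFromMeta(metaTags) {
  for (const key of AUTHOR_META_KEYS) {
    for (const tag of metaTags) {
      const tagKey = (tag.name || tag.property || "").toLowerCase();
      const content = (tag.content || "").trim();
      if (tagKey === key && content.length > 0 &&
          content.length < MAX_AUTHOR_LENGTH) {
        return content;
      }
    }
  }
  // No usable metadata; a caller could fall back to class-name matching.
  return null;
}
```

In a real page the input would be built from something like `document.querySelectorAll("meta")`; the plain-object form only keeps the sketch runnable outside a browser.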

Comment 9

•

12 years ago

Should probably unassign myself if Lucas wants to take this as a mentor bug!

Assignee: bnicholson → nobody

Assignee

Updated

•

11 years ago

Assignee: nobody → eedens

Assignee

Comment 10

•

11 years ago

Attached patch WIP - Detect whether byline is a huge chunk of text. (obsolete) — Details — Splinter Review

If the byline is really big (over 40 chars, or bigger than 1/10 of the article), then it is rejected and an empty string is returned. lucasr, I like your idea for adding extractors based on meta tags, html5 tags, etc. I'll create a bug to manage those, since they're more of an enhancement than a fix for this particular issue.
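
For readers without access to the attachment, the heuristic described above might look roughly like the following. This is a sketch reconstructed from the description only, not the patch itself; the function name `filterByline`, the constant names, and the trimming step are assumptions.

```javascript
// Hypothetical sketch of the WIP heuristic; not the attached patch.
const MAX_BYLINE_CHARS = 40;     // absolute cap from the comment above
const MAX_BYLINE_RATIO = 1 / 10; // byline may not exceed 1/10 of the article

function filterByline(byline, articleText) {
  const trimmed = (byline || "").trim();
  const tooLong = trimmed.length > MAX_BYLINE_CHARS;
  const tooLargeShare = trimmed.length > articleText.length * MAX_BYLINE_RATIO;
  // Rejected bylines become an empty string, as the comment describes.
  return (tooLong || tooLargeShare) ? "" : trimmed;
}
```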

Attachment #8423381 - Flags: feedback?(lucasr.at.mozilla)

Comment 11

•

11 years ago

Comment on attachment 8423381 [details] [diff] [review]
WIP - Detect whether byline is a huge chunk of text.

Review of attachment 8423381 [details] [diff] [review]:
-----------------------------------------------------------------

For this particular issue, I think we can just do the simplest thing and just check for a maximum acceptable length before setting articleByline (something like 100 is probably good enough?) i.e. no need to compare with the article's length ratio. I'd rather focus on fetching the right data from meta tags, twitter cards, and the like.

Attachment #8423381 - Flags: feedback?(lucasr.at.mozilla) → feedback+

Assignee

Comment 12

•

11 years ago

Attached patch REV - Detect whether byline is a huge chunk of text. (obsolete) — Details — Splinter Review

Simplified the byline length check: it now only checks whether the byline is shorter than 100 chars.

Attachment #8423381 - Attachment is obsolete: true

Attachment #8424045 - Flags: review?(lucasr.at.mozilla)

Comment 13

•

11 years ago

Comment on attachment 8424045 [details] [diff] [review]
REV - Detect whether byline is a huge chunk of text.

Review of attachment 8424045 [details] [diff] [review]:
-----------------------------------------------------------------

::: mobile/android/chrome/content/Readability.js
@@ +1530,5 @@
> // }
>
> let excerpt = this._getExcerpt(articleContent);
>
> + let byline = this._getByline(articleContent);

If you do this here, you might be missing an opportunity to fetch the 'right' node while parsing the page. In other words, maybe we shouldn't even assign a value to this._articleByline if the length doesn't seem right. So, maybe this check should be done here? http://mxr.mozilla.org/mozilla-central/source/mobile/android/chrome/content/Readability.js#449

Attachment #8424045 - Flags: review?(lucasr.at.mozilla)

Assignee

Comment 14

•

11 years ago

Attached patch REV - Detect whether byline is a huge chunk of text (obsolete) — Details — Splinter Review

I like your suggestion. This version accepts bylines that are less than 100 chars. If a byline fails, it is tried again on the next pass.
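
A rough sketch of the behaviour described above, under stated assumptions: the names `isValidByline` and `findByline`, the class-name pattern, the non-empty check, and the plain-object candidates (`{ className, text }` standing in for DOM nodes) are illustrative and not taken from the patch. The point is that a rejected candidate never sets the byline, so a later node can still be picked.

```javascript
// Hypothetical sketch; not the attached patch.
const MAX_BYLINE_LENGTH = 100;

// Accept only strings shorter than 100 chars.
// The non-empty requirement is an extra assumption.
function isValidByline(text) {
  if (typeof text !== "string") {
    return false;
  }
  const trimmed = text.trim();
  return trimmed.length > 0 && trimmed.length < MAX_BYLINE_LENGTH;
}

// candidates: array of { className, text } in document order.
function findByline(candidates) {
  for (const node of candidates) {
    // Assumed pattern; also matches classes such as "show-author".
    const looksLikeByline = /byline|author/i.test(node.className || "");
    if (looksLikeByline && isValidByline(node.text)) {
      return node.text.trim();
    }
    // Otherwise leave the byline unset and keep scanning.
  }
  return null;
}
```

With this shape, a huge `div` with class "author" is skipped and a later, shorter byline element is still eligible.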

Attachment #8424045 - Attachment is obsolete: true

Attachment #8426559 - Flags: review?(lucasr.at.mozilla)

Comment 15

•

11 years ago

Comment on attachment 8426559 [details] [diff] [review]
REV - Detect whether byline is a huge chunk of text

Review of attachment 8426559 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, please test on a few other sites looking for any regressions. Unfortunately, the integration tests I wrote in bug 786638 got blocked on some infrastructure issues...

::: mobile/android/chrome/content/Readability.js
@@ +723,5 @@
> /**
> + * Check whether the input string could be a byline.
> + * This verifies that the input is a string, and that the length
> + * is less than 100 chars.
> + *

nit: remove trailing space.

Attachment #8426559 - Flags: review?(lucasr.at.mozilla) → review+

Nobody; OK to take it and work on it

Comment 16

•

11 years ago

Eric, any reason for not having landed this yet?

Flags: needinfo?(eedens)

Updated

•

11 years ago

Mentor: lucasr.at.mozilla

Whiteboard: [mentor=lucasr][lang=js] → [lang=js]

Mark Capella [:capella]

Updated

•

11 years ago

Mentor: lucasr.at.mozilla

Assignee

Comment 18

•

11 years ago

Attached patch bug-881091-fix.patch — Details — Splinter Review

Fixed nit (removed 8 trailing whitespace characters)

Attachment #8426559 - Attachment is obsolete: true

Ryan VanderMeulen (PTO, back 6-April)

Assignee

Comment 19

•

11 years ago

Lucas, no reason -- this just fell off the radar. I fixed the trailing whitespace, and also tested on a fresh MC pull. +checkin-needed

Flags: needinfo?(eedens)

Keywords: checkin-needed

Comment 20

•

11 years ago

Can we please run this through Try first? Thanks :)

Keywords: checkin-needed