Open Bug 1144407 Opened 8 years ago Updated 5 years ago

Incorrect byline if there are multiple .byline elements on a page

Categories: Toolkit :: Reader Mode, defect, P3

Reporter: Margaret
Assignee: Unassigned

References: Blocks 1 open bug

Whiteboard: [reader-mode-readability-algorithm]

We get the wrong byline from this page:
http://www.salon.com/2015/01/02/coca_colas_anti_american_outsourcing_scheme_how_big_soda_gets_the_public_to_shoulder_its_costs/

Looking at Readability.js, it doesn't seem like we do anything smart to handle the possibility of multiple nodes that look like bylines:
https://github.com/mozilla/readability/blob/master/Readability.js#L454
Duplicate of this bug: 1112911
The Salon URL works for me (WFM), but The Atlantic fails. But yeah -- the sidebar and headers on the site show previews of other articles, and the authors listed there use similar byline styling.

Seems like this is probably tricky to fix though. E.g., on The Atlantic the first byline isn't the one we want (it's in the header). We'd need to somehow find the one closest to the article body, or something like that?
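
One way to read "closest to the article body" is sketched below. This is only an illustration, not what Readability.js does: `closestByline` and `distanceToSharedAncestor` are hypothetical names, and nodes are modelled as plain objects with a `parent` link (standing in for `parentNode`) so the sketch runs without a DOM. It assumes the article body node has already been identified, and it prefers the byline candidate that needs the fewest steps up the tree to reach an ancestor shared with that node.

```javascript
// Hypothetical helpers; none of these names exist in Readability.js.
// A "node" here is any object with a `parent` property (null at the root).

// Collect the node and all of its ancestors, innermost first.
function ancestors(node) {
  const chain = [];
  for (let n = node; n; n = n.parent) {
    chain.push(n);
  }
  return chain;
}

// Number of steps from `node` up to the nearest ancestor it shares with `target`.
function distanceToSharedAncestor(node, target) {
  const targetChain = new Set(ancestors(target));
  let steps = 0;
  for (let n = node; n; n = n.parent) {
    if (targetChain.has(n)) {
      return steps;
    }
    steps++;
  }
  return Infinity; // The two nodes are in different trees.
}

// Pick the candidate whose shared ancestor with the article body is nearest.
// Ties keep the earlier candidate; an empty list yields null.
function closestByline(candidates, articleBody) {
  let best = null;
  let bestDistance = Infinity;
  for (const candidate of candidates) {
    const distance = distanceToSharedAncestor(candidate, articleBody);
    if (distance < bestDistance) {
      best = candidate;
      bestDistance = distance;
    }
  }
  return best;
}
```

With a header preview byline nested a few levels under the page header and a second byline that is a sibling of the article content, this picks the second one even though it comes later in document order. It would still guess wrong on pages where the real byline sits outside the article container, so it is a heuristic at best.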

Oh, Bryan notes that (at least on The Atlantic) there's microdata correctly specifying who the author is -- so we should use that data first, and then fall back to heuristics?
Priority: -- → P4
(In reply to Justin Dolske [:Dolske] from comment #3)

> Oh, Bryan notes that (at least on The Atlantic) there's microdata correctly
> specifying who the author is -- so we should use that data first, and then
> fallback to heuristics?

We do already have logic for getting the author metadata (which is why this problem has improved since this bug was originally filed):
https://github.com/mozilla/readability/blob/master/Readability.js#L920

But that Atlantic article doesn't actually have what we're looking for.

I do see this:

<meta name='parsely-page' content='{"author":"Olga Khazan","pub_date":"2015-03-10T12:22:00-04:00","section":"Health","type":"post","post_id":"mt387388","image_url":"//cdn.theatlantic.com/static/newsroom/img/mt/2015/03/6018412616_feda2f2e6a_b/lead_large.jpg?nl0c7k","link":"http://www.theatlantic.com/health/archive/2015/03/the-power-of-good-enough/387388/","title":"The Power of &#039;Good Enough&#039;"}'>

Maybe we should add support for this parsely-page metadata?
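
A minimal sketch of what reading the author out of that tag could look like. `authorFromParselyPage` is a hypothetical helper, not something in Readability.js; it assumes the `content` attribute holds a JSON object with a string `author` field, as in the tag quoted above, and returns null for anything else.

```javascript
// Hypothetical helper: extract the author from the `content` attribute of a
// <meta name="parsely-page"> tag. Returns null when the value is not JSON
// or carries no usable author string.
function authorFromParselyPage(content) {
  let data;
  try {
    data = JSON.parse(content);
  } catch (e) {
    return null; // Malformed JSON: treat as "no metadata".
  }
  if (data === null || typeof data !== "object") {
    return null;
  }
  const author = data.author;
  if (typeof author !== "string") {
    return null;
  }
  const trimmed = author.trim();
  return trimmed ? trimmed : null;
}
```

In a page context the input would come from something like `document.querySelector('meta[name="parsely-page"]')` followed by `getAttribute("content")`. For the tag above this would yield "Olga Khazan".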

Bryan, what data were you seeing?
Flags: needinfo?(clarkbw)
I used this in the browser console:

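// Load the PageMetadata module lazily (needs the chrome-privileged browser console).
Cu.import("resource://gre/modules/XPCOMUtils.jsm");
XPCOMUtils.defineLazyModuleGetter(this, "PageMetadata", "resource://gre/modules/PageMetadata.jsm");
// Collect page metadata for the document in the selected tab.
var data = this.PageMetadata.getData(gBrowser.selectedBrowser._contentWindow.document);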

Looking at it again, I don't see an Author name like I thought I saw last time. There is an Author array that includes the URL but not the textContent of the element, so we don't get the name. I suppose we could do a lookup for it, or change the microdata retrieval to include it.
Flags: needinfo?(clarkbw)
Duplicate of this bug: 1208783
To save readers the trouble of following the link to 1208783: The page in question is this one:

http://www.resilience.org/stories/2015-09-10/ecomodernism-a-response-to-my-critics

(And yes, I did search before filing the duplicate, but my search-fu was clearly not good enough.)
I'd like to air an opinion on this bug. I believe it is in fact quite a serious bug, far from a cosmetic issue. The reason is that it may trick readers into attributing what they read to the wrong author, and then into propagating the wrongful attribution in social media and other writing. We all know how easily wrong information can propagate and how hard it is to stamp out, and Firefox should not be complicit in initiating it.

It follows that the reader software should be extremely conservative in assigning the byline. If a correct byline cannot be found with near 100% confidence, it is better to leave it out altogether. The user can always exit reader view to find that information if they wish.
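
To make the "leave it out" policy concrete, here is a rough sketch under stated assumptions. `conservativeByline` and `normalizeByline` are hypothetical names, not part of Readability.js, and the input is assumed to be the text of every byline-looking node found on the page. A byline is returned only when all candidates agree on a single name after normalization; otherwise the result is null and no byline would be shown.

```javascript
// Hypothetical helpers illustrating a conservative byline policy.

// Collapse whitespace and drop a leading "By " so that "By  Jane Doe"
// and "jane doe" compare as the same name.
function normalizeByline(text) {
  return text.replace(/\s+/g, " ").trim().replace(/^by\s+/i, "");
}

// Return the byline only if every candidate names the same author
// (case-insensitively); return null when there is none or they disagree.
function conservativeByline(candidateTexts) {
  const distinct = new Map();
  for (const text of candidateTexts) {
    const name = normalizeByline(text);
    const key = name.toLowerCase();
    if (name && !distinct.has(key)) {
      distinct.set(key, name); // Keep the first spelling seen.
    }
  }
  return distinct.size === 1 ? [...distinct.values()][0] : null;
}
```

On a page whose sidebar previews list other authors, the candidates disagree and the byline is omitted rather than guessed. The trade-off is that such pages lose their byline even when one of the candidates was correct.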
Priority: P4 → P3
Whiteboard: [reader-mode-readability-algorithm]
See Also: → 1444982