Closed Bug 1151087 Opened 9 years ago Closed 8 years ago

Failed to load article from page when viewing Bing search results in reader mode

Categories

(Toolkit :: Reader Mode, defect, P3)

x86
macOS
defect

RESOLVED WORKSFORME

People

(Reporter: pdehaan, Unassigned)

Attachments

(1 file)

### Steps to reproduce:
1. Go to http://www.bing.com/search?q=Deion+Sanders+son+tweet&filters=tnTID:%22D3BF6235-CD99-45c9-B4D4-585CC9C8DBBE%22+tnVersion:%22834471%22+segment:%22popularnow.carousel%22+tnCol:%220%22+tnOrder:%226b69c5ac-9800-4514-b15d-f9e4b969d752%22&FORM=BSPN01&crslsl=0
2. Enter reader view


### Actual results:
"Failed to load article from page"


### Expected results:
Not sure if readability view makes sense for search results, but I wouldn't expect a reader view button if reader view isn't available.
The same issue exists with articles on http://www.newscientist.com .

1) Go to http://www.newscientist.com
2) Click on any article, such as:
   http://www.newscientist.com/article/dn27289-baby-star-beforeandafter-shows-how-it-gets-massive.html#.VR_bPn2Aksd or
http://www.newscientist.com/article/mg22630153.600-is-this-et-mystery-of-strange-radio-bursts-from-space.html?page=1#.VR_Zzn2Aksc

3) Click on the "Reading Mode" icon in the address bar
4) Reading Mode opens but it says "Failed to load article from page".
I'm using Firefox Nightly 40.0a1 (2015-04-03) on Windows 7 Service Pack 1.
(In reply to Peter deHaan [:pdehaan] from comment #0)
> Created attachment 8588172 [details]
> Failed_to_load_article_from_page.png
> 
> ### Steps to reproduce:
> 1. Go to
> http://www.bing.com/search?q=Deion+Sanders+son+tweet&filters=tnTID:
> %22D3BF6235-CD99-45c9-B4D4-585CC9C8DBBE%22+tnVersion:%22834471%22+segment:
> %22popularnow.carousel%22+tnCol:%220%22+tnOrder:%226b69c5ac-9800-4514-b15d-
> f9e4b969d752%22&FORM=BSPN01&crslsl=0
> 2. Enter reader view
> 
> 
> ### Actual results:
> "Failed to load article from page"
> 
> 
> ### Expected results:
> Not sure if readability view makes sense for search results, but I wouldn't
> expect a reader view button if reader view isn't available.

Yes, I think what we should be doing here is refining `isProbablyReaderable` to return false.

(In reply to Gert Van Waelvelde from comment #1)
> The same issue exists with articles on http://www.newscientist.com .
> 
> 1) Go to http://www.newscientist.com
> 2) Click on any article, such as:
>   
> http://www.newscientist.com/article/dn27289-baby-star-beforeandafter-shows-
> how-it-gets-massive.html#.VR_bPn2Aksd or
> http://www.newscientist.com/article/mg22630153.600-is-this-et-mystery-of-
> strange-radio-bursts-from-space.html?page=1#.VR_Zzn2Aksc
> 
> 3) Click on the "Reading Mode" icon in the address bar
> 4) Reading Mode opens but it says "Failed to load article from page".

Both of these testcases work for me on today's Nightly (2015-04-06). Do they still fail for you? I wonder what could cause that to behave differently for us. Do you have any add-ons installed that modify page content?
(In reply to :Margaret Leibovic from comment #3)
> Both of these testcases work for me on today's Nightly (2015-04-06). Do they
> still fail for you? I wonder what could cause that to behave differently for
> us. Do you have any add-ons installed that modify page content?

Both of these testcases still fail for me on Nightly (2015-04-06). I have no add-ons enabled.
Reproduced this issue with latest 40.0a1 (2015-04-13) under Windows 7 64 bit, OS X 10.9.5 and Ubuntu 14.04 32 bit on:
- http://en.wikipedia.org/wiki/Phoenicia
- http://www.bbc.com 
-- E.g.: http://www.bbc.com/sport/0/ or http://www.bbc.com/culture

Screencast: http://goo.gl/BthI3m
(In reply to Alexandra Lucinet, QA Mentor [:adalucinet] from comment #5)
> Reproduced this issue with latest 40.0a1 (2015-04-13) under Windows 7 64
> bit, OS X 10.9.5 and Ubuntu 14.04 32 bit on:
> - http://en.wikipedia.org/wiki/Phoenicia
> - http://www.bbc.com 
> -- E.g.: http://www.bbc.com/sport/0/ or http://www.bbc.com/culture
> 
> Screencast: http://goo.gl/BthI3m

This makes no sense. comment #0 was about bing.com, and neither of these sites are bing.com. Same with the other comments relating to newscientist.com. One issue per bug. The error is generic and can happen due to any number of things.

(In reply to Gert Van Waelvelde from comment #4)
> Both of these testcases still fail for me on Nightly (2015-04-06). I have no
> add-ons enabled.

I can't reproduce this either, they seem to work on current Nightly, both when loaded directly and when clicking other articles from the newscientist.com website. I have a feeling the latter might have been fixed by bug 1147337 which landed after the 6th. Can you retest with current nightly, and if you're still seeing this issue, file a new bug with detailed steps of what you're doing? Thanks!
Flags: needinfo?(gvanwaelvelde)
(In reply to :Gijs Kruitbosch from comment #6)
> I can't reproduce this either, they seem to work on current Nightly, both
> when loaded directly and when clicking other articles from the
> newscientist.com website. I have a feeling the latter might have been fixed
> by bug 1147337 which landed after the 6th. Can you retest with current
> nightly, and if you're still seeing this issue, file a new bug with detailed
> steps of what you're doing? Thanks!

I can no longer reproduce the newscientist.com testcases and I can only reproduce one of the BBC testcases.

On Nightly (2015-04-14) I can reproduce the following testcases:

- http://www.bing.com/search?q=Deion+Sanders+son+tweet&filters=tnTID:%22D3BF6235-CD99-45c9-B4D4-585CC9C8DBBE%22+tnVersion:%22834471%22+segment:%22popularnow.carousel%22+tnCol:%220%22+tnOrder:%226b69c5ac-9800-4514-b15d-f9e4b969d752%22&FORM=BSPN01&crslsl=0

- http://en.wikipedia.org/wiki/Phoenicia
- http://www.bbc.com/sport/0/

I CANNOT reproduce these testcases on Nightly (2015-04-14):

- http://www.bbc.com/culture 
- http://www.newscientist.com/article/dn27289-baby-star-beforeandafter-shows-how-it-gets-massive.html#.VR_bPn2Aksd
- http://www.newscientist.com/article/mg22630153.600-is-this-et-mystery-of-strange-radio-bursts-from-space.html?page=1#.VR_Zzn2Aksc

I hope this information is useful.
Flags: needinfo?(gvanwaelvelde)
(In reply to Gert Van Waelvelde from comment #7)
> I hope this information is useful.

Great, thanks!

I strongly suspect that the bing results issue is a regression from bug 1149859. :-(

I almost wonder if it's worth disabling reader mode by domain for a number of high profile sites...

Margaret, NiKo, thoughts? Other ways we can mitigate this?
Blocks: 1149859
Flags: needinfo?(nperriault)
Flags: needinfo?(margaret.leibovic)
We should definitely not detect search engine result pages as "readerable". I don't have a clear sense how to achieve that reliably without relying on some "blacklist" though :(
Flags: needinfo?(nperriault)
(In reply to Nicolas Perriault (:NiKo`) — needinfo me if you need my attention from comment #9)
> We should definitely not detect search engine result pages as "readerable".
> I don't have a clear sense how to achieve that reliably without relying on
> some "blacklist" though :(

I think search engine results pages may have a few unique elements that may make them easily distinguishable from regular webpages.

1) They have an <input> that already contains a value when you load the page (the search string), whereas other pages tend to have only empty <input> fields when you load them.

2) That search string will also be in the <title> of the page
3) That search string will also be in the page URL as a querystring parameter.

So, if you have a page with an <input> already containing a value, and that same value appears in <title> and also in the page URL as a querystring parameter, then there's a good chance it's a search engine results page.

What do you think?
Perhaps some other criteria can be added to make detection more reliable.
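
The three signals listed above can be combined into a rough check. The following is only a minimal Python sketch for illustration, not code from Readability.js or Firefox: the function name `looks_like_search_results` and the helper class are hypothetical, and it inspects static HTML only, so an `<input>` filled in by script would not be seen.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qsl


class _InputAndTitleCollector(HTMLParser):
    """Hypothetical helper: collects pre-filled <input> values and the <title> text."""

    def __init__(self):
        super().__init__()
        self.input_values = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "input":
            attr_map = dict(attrs)
            input_type = (attr_map.get("type") or "text").lower()
            value = (attr_map.get("value") or "").strip()
            # Signal 1: a text-like input that already holds a value on load.
            if input_type in ("text", "search") and value:
                self.input_values.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def looks_like_search_results(url, html):
    """Return True if a pre-filled input value is also in the title and the query string."""
    collector = _InputAndTitleCollector()
    collector.feed(html)
    title = collector.title.lower()
    # parse_qsl decodes "+" and percent-escapes, so values compare as plain text.
    query_values = {value.lower() for _, value in parse_qsl(urlparse(url).query)}
    for value in collector.input_values:
        candidate = value.lower()
        # Signals 2 and 3: same string in <title> and as a query string value.
        if candidate in title and candidate in query_values:
            return True
    return False
```

A page is flagged only when all three signals agree, which keeps ordinary articles with an empty search box from matching.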
Flags: needinfo?(nperriault)
(In reply to Gert Van Waelvelde from comment #10)
> (In reply to Nicolas Perriault (:NiKo`) — needinfo me if you need my
> attention from comment #9)
> > We should definitely not detect search engine result pages as "readerable".
> > I don't have a clear sense how to achieve that reliably without relying on
> > some "blacklist" though :(
> 
> I think search engine results pages may have a few unique elements that may
> make them easily distinguishable from regular webpages.

I think you underestimate "easy". I mean, the code is not hard to write - from that perspective it is indeed easy. But now imagine that we have to run this detection code for every single webpage you load and you start seeing why custom-making algorithms for pages that need to be excluded will start to impact page load times and why it is less feasible as a strategy to fix an issue like this... Lists of sites where we block reader mode might be "uglier", but they are also much much quicker to check than scanning the entire page DOM...

> 1) They have an <input> that already contains a value when you load the page
> (the search string), whereas other pages tend to have only empty <input>
> fields when you load them. 

A lot of pages write their own "placeholder" attribute support with JS which would break this logic.

> 2) That search string will also be in the <title> of the page
> 3) That search string will also be in the page URL as a querystring
> parameter.

It isn't on the version of Google served to modern browsers. It might not even be in the hash anymore, I haven't looked at these in detail in a while.
(In reply to :Gijs Kruitbosch from comment #11)
> (In reply to Gert Van Waelvelde from comment #10)
> I think you underestimate "easy".

Yeah, I think everybody who has touched this code has been burnt by the same findings. Very hard to stay generic, while you don't usually want to start implementing lots of tests trying to cover all the thousands of ultra-specific cases… For search engines, while they might look really similar visually, they all do things differently, technically. You can always suggest a patch on github[0] (including tests) if you think you can cover most of the cases though.

[0]: https://github.com/mozilla/readability/
Flags: needinfo?(nperriault)
(In reply to Nicolas Perriault (:NiKo`) — needinfo me if you need my attention from comment #12)
> (In reply to :Gijs Kruitbosch from comment #11)
> > (In reply to Gert Van Waelvelde from comment #10)
> > I think you underestimate "easy".
> 
> Yeah, I think everybody who has touched this code has been burnt by the
> same findings. Very hard to stay generic, while you don't usually want to
> start implementing lots of tests trying to cover all the thousands of
> ultra-specific cases… For search engines, while they might look really
> similar visually, they all do things differently, technically. You can
> always suggest a patch on github[0] (including tests) if you think you can
> cover most of the cases though.
> 
> [0]: https://github.com/mozilla/readability/

I have suggested a patch. Maybe it's worth taking a look at? Thanks.
https://github.com/mozilla/readability/pull/146/files
(In reply to :Gijs Kruitbosch from comment #6)
> This makes no sense. comment #0 was about bing.com, and neither of these
> sites are bing.com. Same with the other comments relating to
> newscientist.com. One issue per bug. The error is generic and can happen due
> to any number of things.

Where do you want me to log the issue from comment 4? On github or Bugzilla?
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to Alexandra Lucinet, QA Mentor [:adalucinet] from comment #14)
> (In reply to :Gijs Kruitbosch from comment #6)
> > This makes no sense. comment #0 was about bing.com, and neither of these
> > sites are bing.com. Same with the other comments relating to
> > newscientist.com. One issue per bug. The error is generic and can happen due
> > to any number of things.
> 
> Where do you want me to log the issue from comment 4? On github or Bugzilla?

Github please, thanks!
Flags: needinfo?(gijskruitbosch+bugs)
(In reply to :Gijs Kruitbosch from comment #8)
> (In reply to Gert Van Waelvelde from comment #7)
> > I hope this information is useful.
> 
> Great, thanks!
> 
> I strongly suspect that the bing results issue is a regression from bug
> 1149859. :-(
> 
> I almost wonder if it's worth disabling reader mode by domain for a number
> of high profile sites...
> 
> Margaret, NiKo, thoughts? Other ways we can mitigate this?

Sorry for the slow reply in this bug, but we talked about this a bit earlier.

I think that we should explore turning the dial a bit back from what we did in bug 1149859. I think 100 characters is too few to consider a valid "content" paragraph, so maybe turning that up to 150 might help this problem.
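
To illustrate why the minimum paragraph length matters, here is a simplified Python sketch. It is not the actual `isProbablyReaderable` logic: `count_content_paragraphs` and the sample paragraph lengths are assumptions made up for the example. Only the 100 and 150 character thresholds come from the paragraph above.

```python
def count_content_paragraphs(paragraph_texts, min_chars):
    """Count paragraphs long enough to be treated as article content."""
    return sum(1 for text in paragraph_texts if len(text.strip()) >= min_chars)


# Made-up lengths: ten short result snippets versus a page with two long paragraphs.
search_snippets = ["s" * 120 for _ in range(10)]
article_paragraphs = ["a" * 400, "b" * 650, "c" * 90]

for min_chars in (100, 150):
    print(
        min_chars,
        count_content_paragraphs(search_snippets, min_chars),
        count_content_paragraphs(article_paragraphs, min_chars),
    )
```

With these assumed lengths, every snippet counts as content at 100 characters and none does at 150, while the long article paragraphs count either way.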

We could also take advantage of the Firefox search service to check to see if a URL is one that came from one of your installed Firefox search engines, so this may be a more performant/robust way to never show the button for search engine results pages.
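
As a rough illustration of that idea, the Python sketch below matches a URL against a list of known search engines. It does not use the Firefox search service API: `INSTALLED_SEARCH_ENGINES` is a hypothetical stand-in for whatever the service would report, and the host, path and query parameter entries are taken from the Bing and Yahoo search URLs quoted in this bug.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical stand-in for the installed engines: (host, path, query parameter).
INSTALLED_SEARCH_ENGINES = [
    ("www.bing.com", "/search", "q"),
    ("search.yahoo.com", "/search", "p"),
]


def is_search_results_url(url, engines=INSTALLED_SEARCH_ENGINES):
    """Return True if the URL matches the results-page pattern of a known engine."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    params = parse_qs(parts.query)
    return any(
        host == engine_host and parts.path == engine_path and param in params
        for engine_host, engine_path, param in engines
    )
```

This looks only at the URL, so it would not require scanning the page DOM.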
Flags: needinfo?(margaret.leibovic)
> I almost wonder if it's worth disabling reader mode by domain for a number
> of high profile sites...

> We could also take advantage of the Firefox search service to check to see
> if a URL is one that came from one of your installed Firefox search engines,
> so this may be a more performant/robust way to never show the button for
> search engine results pages.

I agree.
I would suggest disabling reader mode for the domains of well-known search engines and additionally using the Firefox search provider to weed out any lesser-known search engines that some users might have installed as a search provider. That way I guess we'll probably cover 99% of all normal searches.
Any other method would be messy at best and will probably break sooner or later.
Priority: -- → P3
What is a search engine domain?
For example, yahoo.com does searches and has news articles.

https://www.yahoo.com/ -- shows news and looks like a portal.
https://search.yahoo.com/search;_ylt=AvmjjITBpPWGk0Ot7RFKxbqbvZx4?p=stir+friday&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t-339&fp=1 -- search results from the search box at the top of www.yahoo.com
http://news.yahoo.com/ -- another portal-y page with random results pointing to 3rd party sites.
https://www.yahoo.com/tech/exclusive-william-shatners-30-billion-116672789084.html -- article page on yahoo.com
http://sports.yahoo.com/news/2-weeks-mayweather-pacquiao-not-ticket-seen-181405519--box.html -- random sports article.

So it doesn't look like you could disable reader mode for www.yahoo.com since it looks like both a portal and has actual hosted articles. We could block search.yahoo.com since I doubt that the site has content. Not sure about news.yahoo.com, it looked like a portal, but there may be random links that point to hosted content.
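
The granularity point can be sketched as a blocklist keyed on the exact hostname rather than the registered domain. This is a minimal Python illustration, not a proposed patch: `BLOCKED_HOSTS` and `reader_mode_blocked` are hypothetical names, and the single entry is just the host discussed above.

```python
from urllib.parse import urlparse

# Hypothetical blocklist: exact hostnames only, so sibling hosts are unaffected.
BLOCKED_HOSTS = frozenset({"search.yahoo.com"})


def reader_mode_blocked(url, blocked_hosts=BLOCKED_HOSTS):
    """Return True if the URL's exact hostname is on the blocklist."""
    host = (urlparse(url).hostname or "").lower()
    return host in blocked_hosts
```

With this shape, search.yahoo.com is excluded while article pages on www.yahoo.com and sports.yahoo.com are left alone. Whether news.yahoo.com belongs on such a list remains the open question raised above.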
I am seeing the same problem here. Any ETA on this?

There are two pages that are coded almost identically.

#1 - https://simplyfound.com/article/6d733be2aef4/apple-pay-slow-and-steady-wins-the-race

The above works very well when clicking on the READER icon.

#2 - https://simplyfound.com/article/eb9a5e137034/raspberry-pi-3-the-credit-card-sized-pc-that-cost-only-35-all-time-bestselling-computer-in-uk

The above fails with the "Failed to load article from page" message.

The same logic produces both pages, but one works and the other does not.

It would be useful to have an error number of some sort displayed alongside the error message for debugging purposes.

Any help in debugging this is much appreciated.


Extra info:
OS X El Capitan - 10.11.3
Firefox 44.0.2
Page links: Refer to the message above

Thanks
(In reply to un33kvu from comment #19)
> I am seeing the same problem here. Any ETA on this?
> 
> There are two pages that are coded almost identically.
> 
> #1 -
> https://simplyfound.com/article/6d733be2aef4/apple-pay-slow-and-steady-wins-
> the-race
> 
> The above works very well when clicking on the READER icon.
> 
> #2 -
> https://simplyfound.com/article/eb9a5e137034/raspberry-pi-3-the-credit-card-
> sized-pc-that-cost-only-35-all-time-bestselling-computer-in-uk
> 
> The above fails with the "Failed to load article from page" message.

Those aren't search result pages and so this bug is not related to the issue you're seeing - you're just getting the same error message.

Please file an issue on github at https://github.com/mozilla/readability/issues instead.
Flags: needinfo?(un33kvu)
The original case here is WFM, and it seems like we're unlikely to get a reply for comment #20 at this stage, so marking this WFM. If people are still seeing this on pages, please file issues in the github repo instead.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(un33kvu)
Resolution: --- → WORKSFORME