Open Bug 1355164 Opened 7 years ago Updated 3 months ago

Lazily loaded / srcset-using images are dropped on bbc.com

Categories

(Toolkit :: Reader Mode, enhancement, P2)

enhancement

Tracking

()

People

(Reporter: justindarc, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [reader-mode-readability-algorithm])

As noted in Bug 1340907, <img> elements with [src="data:..."] attributes are dropped by readability.js.

Here is a URL to an example article: http://www.bbc.co.uk/sport/football/39017152

In this particular case, images under the "How social media reacted" section of the article are dropped completely. This behavior is the same on both Firefox Desktop and Firefox for iOS. Inspecting the original content's DOM reveals that these <img> elements are using "data:" URLs.
Blocks: 1340907
Priority: -- → P2
Whiteboard: [reader-mode-readability-algorithm]

:Gijs, I am not able to see any images on that web-page under "How social media reacted" even in the normal mode in Firefox, whereas I do not face this issue in Chrome. Is this a Firefox bug or a Reader Mode bug?

(In reply to sonali18317 from comment #1)

:Gijs, I am not able to see any images on that web-page under "How social media reacted" even in the normal mode in Firefox, whereas I do not face this issue in Chrome. Is this a Firefox bug or a Reader Mode bug?

I see images. Perhaps you've enabled strict content blocking or something?

Flags: needinfo?(sonali18317)

:Gijs, My content blocking is set to standard.
I am not even able to see the rest of the images on the page but there I see a notification which says "Media playback is not supported on this device". I am using MacBook Air.
If I disable content blocking for this website, then I am able to see all the images.

Flags: needinfo?(sonali18317)

:Gijs, can I work on this bug?

(In reply to sonali18317 from comment #4)

:Gijs, can I work on this bug?

Sure! I'm not sure what causes the images to be removed, though... step 1 will be figuring that out.

I spent time yesterday trying to figure out the issue. It seems like while fixing the URLs in readability.js the _fixRelativeUris function changes the img tag to <img src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" data-src="https://ichef.bbci.co.uk/onesport/cps/{width}{hidpi}/cpsprodpb/8C70/production/_94725953_will.jpg" data-sizes="auto" alt="Will">
We need to change this by removing the current src href (its a 1x1 pixel image and I think thats why it isnt visible) and making data-src as src.
Further we would need {width}{hidpi} replaced by actual values present in the data-srcset img attribute of the original web page. For example: <img src="https://ichef.bbci.co.uk/onesport/cps/624/cpsprodpb/8C70/production/_94725953_will.jpg" data-sizes="auto" alt="Will"> seemed to work.

What I think is that the function _fixRelativeUris in readability.js does not take into account the presence of data-src or data-srcset attributes in an img tag.

(In reply to sonali18317 from comment #7)

What I think is that the function _fixRelativeUris in readability.js does not take into account the presence of data-src or data-srcset attributes in an img tag.

This doesn't make a lot of sense to me. There's a github issue to be able to fix up data-src/data-srcset and other lazy image loading things (e.g. https://github.com/mozilla/readability/issues/299 ).

But here, there's no JS involved. It just looks like the srcset attribute gets removed when we open the page in reader mode. Why does that happen? Although I guess it's a remote possibility, I doubt that has anything to do with _fixRelativeUris...

Flags: needinfo?(sonali18317)

The problem seems to be because of the lazyloading feature of images (the images have the class attribute as "lazyload"). It prevents the browser from loading images until they are in the browser's viewport. Therefore when I reach the images, the <img> tag changes, it's class changes from "lazy load" to "lazy loaded" and the rest of the attributes: data-srcset, srcset etc appear.

The image attributes in reader mode are from before the images are loaded (as no one manually scrolls down and brings the image in the viewport) and that's why those attributes are missing.

Flags: needinfo?(sonali18317)

(In reply to sonali18317 from comment #9)

The problem seems to be because of the lazyloading feature of images (the images have the class attribute as "lazyload"). It prevents the browser from loading images until they are in the browser's viewport. Therefore when I reach the images, the <img> tag changes, it's class changes from "lazy load" to "lazy loaded" and the rest of the attributes: data-srcset, srcset etc appear.

The image attributes in reader mode are from before the images are loaded (as no one manually scrolls down and brings the image in the viewport) and that's why those attributes are missing.

On the BBC url in comment #0, for me the images get the srcset attribute immediately.

(In reply to :Gijs (he/him) from comment #10)

Hey, this is how it is for me https://youtu.be/fV2GlLAw3Vs?t=76. Perhaps the image was near viewport or browser scrolled down automatically when you clicked on the img tag in inspect element??

If not, maybe it is an OS Specific nature?

(In reply to sonali18317 from comment #9)

The problem seems to be because of the lazyloading feature of images (the images have the class attribute as "lazyload"). It prevents the browser from loading images until they are in the browser's viewport. Therefore when I reach the images, the <img> tag changes, it's class changes from "lazy load" to "lazy loaded" and the rest of the attributes: data-srcset, srcset etc appear.

The image attributes in reader mode are from before the images are loaded (as no one manually scrolls down and brings the image in the viewport) and that's why those attributes are missing.

On the BBC url in comment #0, for me the images get the srcset attribute immediately.

Gijs, I need some help. Could you please point me to the relevant files/functions for this bug?

(In reply to sonali18317 from comment #12)

Gijs, I need some help. Could you please point me to the relevant files/functions for this bug?

Well, it depends. Readability.js is the main thing, which might be stripping something it shouldn't. You can turn on debug mode near the top of the file, which should hopefully dump some output in the browser console and/or stdout/stderr.

The other thing that could be going wrong here is https://searchfox.org/mozilla-central/rev/201450283cddc9e409cec707acb65ba6cf6037b1/toolkit/components/reader/AboutReader.jsm#881-884 , which strips some content for security reasons. It's possible that that is doing the removals; using the JS debugger around that code should make it possible to work out whether that's happening. You can use the "Browser Content Toolbox" to set breakpoints in AboutReader.jsm .

Does that help?

Flags: needinfo?(sonali18317)

Thanks Gijs, I had a look at Readability.js debug output.
The image tag for all 4 stages (pre-prep, post-prep, after paging and grabbing) is the same
<img class="sp-lazyload" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" data-src="https://ichef.bbci.co.uk/onesport/cps/{width}{hidpi}/cpsprodpb/D716/production/_94726055_jordan_archer.jpg" data-sizes="auto" alt="Jordan Archer"/>

I was trying to set breakpoints in resource://gre/modules/AboutReader.jsm but I am not able to find the file. Go to file shows: https://pasteboard.co/I8piuIW.png

Flags: needinfo?(sonali18317)

(In reply to sonali18317 from comment #14)

Thanks Gijs, I had a look at Readability.js debug output.
The image tag for all 4 stages (pre-prep, post-prep, after paging and grabbing) is the same
<img class="sp-lazyload" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" data-src="https://ichef.bbci.co.uk/onesport/cps/{width}{hidpi}/cpsprodpb/D716/production/_94726055_jordan_archer.jpg" data-sizes="auto" alt="Jordan Archer"/>

I'm a bit confused - are you scrolling down to get the srcset attribute to be set for you before invoking readability, or no?

I was trying to set breakpoints in resource://gre/modules/AboutReader.jsm but I am not able to find the file. Go to file shows: https://pasteboard.co/I8piuIW.png

This shows the browser toolbox, not the browser content toolbox, which is privileged (like the browser toolbox, and unlike the "normal" developer tools) and runs against the content process (unlike the browser toolbox, which runs against the parent/browser process).

Flags: needinfo?(sonali18317)

Ah yes, I wasn't scrolling before invoking readability. I was able to spot the issue after that.
Readablity.js seems to be working fine, it doesnt't modify the image tag.
Then I guess the problem should be in the other code you sent. I have been working for a couple of days trying to understand and fix the issue, but it's slow because of my exams.

Flags: needinfo?(sonali18317)

Gijs, I printed article.content and contentFragment.firstElementChild.innerHTML to see how the content changes in parseFragment function in AboutReader.jsm (on line 885). article has the srcset tag whereas contentFragment does not. parseFragment seems to be stripping away that tag. I am looking at https://searchfox.org/mozilla-central/source/parser/html/nsIParserUtils.idl to understand what the function does but the actual code is in cpp. I am not very fluent with that.

(In reply to sonali18317 from comment #17)

Gijs, I printed article.content and contentFragment.firstElementChild.innerHTML to see how the content changes in parseFragment function in AboutReader.jsm (on line 885). article has the srcset tag whereas contentFragment does not. parseFragment seems to be stripping away that tag. I am looking at https://searchfox.org/mozilla-central/source/parser/html/nsIParserUtils.idl to understand what the function does but the actual code is in cpp. I am not very fluent with that.

Ah. Thanks, this is helpful. I will file a separate dependent bug to fix this in the tree sanitizer. It might not be straightforward because it'll require teaching our sanitizer how to parse the srcset attribute...

Do you have something else to work on in the meantime? Perhaps you could tackle the JS side of making sure the data-srcset / data-src stuff gets transformed correctly, so scrolling down is not necessary?

Flags: needinfo?(sonali18317)
Depends on: 1546235

Glad I could help somewhere. Have been trying to fix this for quite some time now.

I had tried to look at the JS side earlier, but I could not come up with any possible solutions with it. There were not many resources on the web either, that I could refer to/learn from.

I was trying to print img.getAttribute("data-srcset"); in Redability.js but that returns undefined if I don't scroll down.
Could you please give me a lead to follow?

Flags: needinfo?(sonali18317)

(In reply to sonali18317 from comment #19)

Glad I could help somewhere. Have been trying to fix this for quite some time now.

I had tried to look at the JS side earlier, but I could not come up with any possible solutions with it. There were not many resources on the web either, that I could refer to/learn from.

I was trying to print img.getAttribute("data-srcset"); in Redability.js but that returns undefined if I don't scroll down.
Could you please give me a lead to follow?

Sorry for the slow response. It looks like I misread, and for you the BBC article doesn't include a data-srcset, only a template for what the srcset should be. Because the template has {width}{hidpi} in it, and reader mode doesn't know how to fill in those values, I don't think there's anything else we can do here. I'm sorry you've done so much investigative work here to get such an unsatisfactory outcome. :-(

Perhaps you would like to work on bug 1188382 or bug 1395824 next? They're both quite different but should be relatively straightforward.

Sorry, I meant to needinfo for comment #20. :-)

Flags: needinfo?(sonali18317)

Gijs, Is it possible to increase the size of the viewport before collecting HTML of the page? That way maybe we could collect the complete HTML instead of those template urls?
I had a feeling that fixing the BBC website was tough so I was also looking at some other webpages where data-src attribute could be transformed to get the images to load without scrolling.

Specifically medium articles from the link you said above https://github.com/mozilla/readability/issues/299 . Here I observed that data-src attribute is present and still the images don't load even if I transform data-src to src because the div tag in which the image is nested in is conditionally cleaned by Readability.js.
I don't understand why the div is cleaned but the issue seems fixable to some extent. It would be really cool to have more images in reader mode.
I might pick this up later to fix if you agree.

Thanks, I will have a look at the bugs and start working on one.

Flags: needinfo?(sonali18317)

(In reply to sonali18317 from comment #22)

Gijs, Is it possible to increase the size of the viewport before collecting HTML of the page? That way maybe we could collect the complete HTML instead of those template urls?

I don't think this is easily doable on Firefox, no...

I had a feeling that fixing the BBC website was tough so I was also looking at some other webpages where data-src attribute could be transformed to get the images to load without scrolling.

Right. Someone (I think not you?) has just (last 24h) put up a pull request for this on github, at https://github.com/mozilla/readability/pull/542 .

Specifically medium articles from the link you said above https://github.com/mozilla/readability/issues/299 . Here I observed that data-src attribute is present and still the images don't load even if I transform data-src to src because the div tag in which the image is nested in is conditionally cleaned by Readability.js.
I don't understand why the div is cleaned but the issue seems fixable to some extent. It would be really cool to have more images in reader mode.
I might pick this up later to fix if you agree.

Sure, I think the PR I referenced mentioned that this is being cleaned up because of the "media" class matching with the negative candidates regular expression. Fixing that might be tricky, as is pointed out in that pull request...

Thanks, I will have a look at the bugs and start working on one.

Perfect, let me know on the bug which one and I can help there if you have more questions.

Summary: Images with data: URLs are dropped → Lazily loaded / srcset-using images are dropped on bbc.com
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.