Lazily loaded / srcset-using images are dropped on bbc.com
Categories
(Toolkit :: Reader Mode, enhancement, P2)
Tracking
()
People
(Reporter: justindarc, Unassigned)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
(Whiteboard: [reader-mode-readability-algorithm])
Updated•8 years ago
|
Comment 1•6 years ago
|
||
:Gijs, I am not able to see any images on that web-page under "How social media reacted" even in the normal mode in Firefox, whereas I do not face this issue in Chrome. Is this a Firefox bug or a Reader Mode bug?
Comment 2•6 years ago
|
||
(In reply to sonali18317 from comment #1)
:Gijs, I am not able to see any images on that web-page under "How social media reacted" even in the normal mode in Firefox, whereas I do not face this issue in Chrome. Is this a Firefox bug or a Reader Mode bug?
I see images. Perhaps you've enabled strict content blocking or something?
Comment 3•6 years ago
|
||
:Gijs, My content blocking is set to standard.
I am not even able to see the rest of the images on the page but there I see a notification which says "Media playback is not supported on this device". I am using MacBook Air.
If I disable content blocking for this website, then I am able to see all the images.
Comment 4•6 years ago
|
||
:Gijs, can I work on this bug?
Comment 5•6 years ago
|
||
(In reply to sonali18317 from comment #4)
:Gijs, can I work on this bug?
Sure! I'm not sure what causes the images to be removed, though... step 1 will be figuring that out.
Comment 6•6 years ago
|
||
I spent time yesterday trying to figure out the issue. It seems like while fixing the URLs in readability.js the _fixRelativeUris function changes the img tag to <img src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" data-src="https://ichef.bbci.co.uk/onesport/cps/{width}{hidpi}/cpsprodpb/8C70/production/_94725953_will.jpg" data-sizes="auto" alt="Will">
We need to change this by removing the current src href (its a 1x1 pixel image and I think thats why it isnt visible) and making data-src as src.
Further we would need {width}{hidpi} replaced by actual values present in the data-srcset img attribute of the original web page. For example: <img src="https://ichef.bbci.co.uk/onesport/cps/624/cpsprodpb/8C70/production/_94725953_will.jpg" data-sizes="auto" alt="Will"> seemed to work.
Comment 7•6 years ago
|
||
What I think is that the function _fixRelativeUris in readability.js does not take into account the presence of data-src or data-srcset attributes in an img tag.
Comment 8•6 years ago
|
||
(In reply to sonali18317 from comment #7)
What I think is that the function _fixRelativeUris in readability.js does not take into account the presence of data-src or data-srcset attributes in an img tag.
This doesn't make a lot of sense to me. There's a github issue to be able to fix up data-src/data-srcset and other lazy image loading things (e.g. https://github.com/mozilla/readability/issues/299 ).
But here, there's no JS involved. It just looks like the srcset attribute gets removed when we open the page in reader mode. Why does that happen? Although I guess it's a remote possibility, I doubt that has anything to do with _fixRelativeUris...
Comment 9•6 years ago
|
||
The problem seems to be because of the lazyloading feature of images (the images have the class attribute as "lazyload"). It prevents the browser from loading images until they are in the browser's viewport. Therefore when I reach the images, the <img> tag changes, it's class changes from "lazy load" to "lazy loaded" and the rest of the attributes: data-srcset, srcset etc appear.
The image attributes in reader mode are from before the images are loaded (as no one manually scrolls down and brings the image in the viewport) and that's why those attributes are missing.
Comment 10•6 years ago
|
||
(In reply to sonali18317 from comment #9)
The problem seems to be because of the lazyloading feature of images (the images have the class attribute as "lazyload"). It prevents the browser from loading images until they are in the browser's viewport. Therefore when I reach the images, the <img> tag changes, it's class changes from "lazy load" to "lazy loaded" and the rest of the attributes: data-srcset, srcset etc appear.
The image attributes in reader mode are from before the images are loaded (as no one manually scrolls down and brings the image in the viewport) and that's why those attributes are missing.
On the BBC url in comment #0, for me the images get the srcset attribute immediately.
Comment 11•6 years ago
|
||
(In reply to :Gijs (he/him) from comment #10)
Hey, this is how it is for me https://youtu.be/fV2GlLAw3Vs?t=76. Perhaps the image was near viewport or browser scrolled down automatically when you clicked on the img tag in inspect element??
If not, maybe it is an OS Specific nature?
(In reply to sonali18317 from comment #9)
The problem seems to be because of the lazyloading feature of images (the images have the class attribute as "lazyload"). It prevents the browser from loading images until they are in the browser's viewport. Therefore when I reach the images, the <img> tag changes, it's class changes from "lazy load" to "lazy loaded" and the rest of the attributes: data-srcset, srcset etc appear.
The image attributes in reader mode are from before the images are loaded (as no one manually scrolls down and brings the image in the viewport) and that's why those attributes are missing.
On the BBC url in comment #0, for me the images get the srcset attribute immediately.
Comment 12•6 years ago
|
||
Gijs, I need some help. Could you please point me to the relevant files/functions for this bug?
Comment 13•6 years ago
|
||
(In reply to sonali18317 from comment #12)
Gijs, I need some help. Could you please point me to the relevant files/functions for this bug?
Well, it depends. Readability.js is the main thing, which might be stripping something it shouldn't. You can turn on debug mode near the top of the file, which should hopefully dump some output in the browser console and/or stdout/stderr.
The other thing that could be going wrong here is https://searchfox.org/mozilla-central/rev/201450283cddc9e409cec707acb65ba6cf6037b1/toolkit/components/reader/AboutReader.jsm#881-884 , which strips some content for security reasons. It's possible that that is doing the removals; using the JS debugger around that code should make it possible to work out whether that's happening. You can use the "Browser Content Toolbox" to set breakpoints in AboutReader.jsm .
Does that help?
Comment 14•6 years ago
|
||
Thanks Gijs, I had a look at Readability.js debug output.
The image tag for all 4 stages (pre-prep, post-prep, after paging and grabbing) is the same
<img class="sp-lazyload" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" data-src="https://ichef.bbci.co.uk/onesport/cps/{width}{hidpi}/cpsprodpb/D716/production/_94726055_jordan_archer.jpg" data-sizes="auto" alt="Jordan Archer"/>
I was trying to set breakpoints in resource://gre/modules/AboutReader.jsm but I am not able to find the file. Go to file shows: https://pasteboard.co/I8piuIW.png
Comment 15•6 years ago
|
||
(In reply to sonali18317 from comment #14)
Thanks Gijs, I had a look at Readability.js debug output.
The image tag for all 4 stages (pre-prep, post-prep, after paging and grabbing) is the same
<img class="sp-lazyload" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" data-src="https://ichef.bbci.co.uk/onesport/cps/{width}{hidpi}/cpsprodpb/D716/production/_94726055_jordan_archer.jpg" data-sizes="auto" alt="Jordan Archer"/>
I'm a bit confused - are you scrolling down to get the srcset attribute to be set for you before invoking readability, or no?
I was trying to set breakpoints in
resource://gre/modules/AboutReader.jsmbut I am not able to find the file. Go to file shows: https://pasteboard.co/I8piuIW.png
This shows the browser toolbox, not the browser content toolbox, which is privileged (like the browser toolbox, and unlike the "normal" developer tools) and runs against the content process (unlike the browser toolbox, which runs against the parent/browser process).
Comment 16•6 years ago
|
||
Ah yes, I wasn't scrolling before invoking readability. I was able to spot the issue after that.
Readablity.js seems to be working fine, it doesnt't modify the image tag.
Then I guess the problem should be in the other code you sent. I have been working for a couple of days trying to understand and fix the issue, but it's slow because of my exams.
Comment 17•6 years ago
|
||
Gijs, I printed article.content and contentFragment.firstElementChild.innerHTML to see how the content changes in parseFragment function in AboutReader.jsm (on line 885). article has the srcset tag whereas contentFragment does not. parseFragment seems to be stripping away that tag. I am looking at https://searchfox.org/mozilla-central/source/parser/html/nsIParserUtils.idl to understand what the function does but the actual code is in cpp. I am not very fluent with that.
Comment 18•6 years ago
|
||
(In reply to sonali18317 from comment #17)
Gijs, I printed
article.contentandcontentFragment.firstElementChild.innerHTMLto see how the content changes in parseFragment function in AboutReader.jsm (on line 885).articlehas the srcset tag whereascontentFragmentdoes not. parseFragment seems to be stripping away that tag. I am looking at https://searchfox.org/mozilla-central/source/parser/html/nsIParserUtils.idl to understand what the function does but the actual code is in cpp. I am not very fluent with that.
Ah. Thanks, this is helpful. I will file a separate dependent bug to fix this in the tree sanitizer. It might not be straightforward because it'll require teaching our sanitizer how to parse the srcset attribute...
Do you have something else to work on in the meantime? Perhaps you could tackle the JS side of making sure the data-srcset / data-src stuff gets transformed correctly, so scrolling down is not necessary?
Comment 19•6 years ago
|
||
Glad I could help somewhere. Have been trying to fix this for quite some time now.
I had tried to look at the JS side earlier, but I could not come up with any possible solutions with it. There were not many resources on the web either, that I could refer to/learn from.
I was trying to print img.getAttribute("data-srcset"); in Redability.js but that returns undefined if I don't scroll down.
Could you please give me a lead to follow?
Comment 20•6 years ago
|
||
(In reply to sonali18317 from comment #19)
Glad I could help somewhere. Have been trying to fix this for quite some time now.
I had tried to look at the JS side earlier, but I could not come up with any possible solutions with it. There were not many resources on the web either, that I could refer to/learn from.
I was trying to print img.getAttribute("data-srcset"); in Redability.js but that returns undefined if I don't scroll down.
Could you please give me a lead to follow?
Sorry for the slow response. It looks like I misread, and for you the BBC article doesn't include a data-srcset, only a template for what the srcset should be. Because the template has {width}{hidpi} in it, and reader mode doesn't know how to fill in those values, I don't think there's anything else we can do here. I'm sorry you've done so much investigative work here to get such an unsatisfactory outcome. :-(
Perhaps you would like to work on bug 1188382 or bug 1395824 next? They're both quite different but should be relatively straightforward.
Comment 21•6 years ago
|
||
Sorry, I meant to needinfo for comment #20. :-)
Comment 22•6 years ago
|
||
Gijs, Is it possible to increase the size of the viewport before collecting HTML of the page? That way maybe we could collect the complete HTML instead of those template urls?
I had a feeling that fixing the BBC website was tough so I was also looking at some other webpages where data-src attribute could be transformed to get the images to load without scrolling.
Specifically medium articles from the link you said above https://github.com/mozilla/readability/issues/299 . Here I observed that data-src attribute is present and still the images don't load even if I transform data-src to src because the div tag in which the image is nested in is conditionally cleaned by Readability.js.
I don't understand why the div is cleaned but the issue seems fixable to some extent. It would be really cool to have more images in reader mode.
I might pick this up later to fix if you agree.
Thanks, I will have a look at the bugs and start working on one.
Comment 23•6 years ago
|
||
(In reply to sonali18317 from comment #22)
Gijs, Is it possible to increase the size of the viewport before collecting HTML of the page? That way maybe we could collect the complete HTML instead of those template urls?
I don't think this is easily doable on Firefox, no...
I had a feeling that fixing the BBC website was tough so I was also looking at some other webpages where data-src attribute could be transformed to get the images to load without scrolling.
Right. Someone (I think not you?) has just (last 24h) put up a pull request for this on github, at https://github.com/mozilla/readability/pull/542 .
Specifically medium articles from the link you said above https://github.com/mozilla/readability/issues/299 . Here I observed that data-src attribute is present and still the images don't load even if I transform data-src to src because the div tag in which the image is nested in is conditionally cleaned by Readability.js.
I don't understand why the div is cleaned but the issue seems fixable to some extent. It would be really cool to have more images in reader mode.
I might pick this up later to fix if you agree.
Sure, I think the PR I referenced mentioned that this is being cleaned up because of the "media" class matching with the negative candidates regular expression. Fixing that might be tricky, as is pointed out in that pull request...
Thanks, I will have a look at the bugs and start working on one.
Perfect, let me know on the bug which one and I can help there if you have more questions.
Updated•6 years ago
|
Updated•3 years ago
|
Comment 25•4 months ago
|
||
Just ran into this on a ft.com article in bug 1980330. In that case, there's an article with a bunch of high-res images, which have lower-res versions available via srcset.
Firefox's reader-mode drops the srcset version and hence only knows about the high-res version; so we end up showing unnecessarily-high-res images in reader-mode (e.g. a 5158-pixel-wide image in a 600px-wide img element), and we end up producing massive file-sizes when printing from reader-mode, relative to other browsers which preserve and honor the srcset attribute to get lower-res versions of the image.
Updated•4 months ago
|
Comment 26•4 months ago
|
||
(In reply to Daniel Holbert [:dholbert] from comment #25)
Just ran into this on a ft.com article in bug 1980330. In that case, there's an article with a bunch of high-res images, which have lower-res versions available via
srcset.Firefox's reader-mode drops the
srcsetversion and hence only knows about the high-res version; so we end up showing unnecessarily-high-res images in reader-mode (e.g. a5158-pixel-wide image in a600px-wide img element), and we end up producing massive file-sizes when printing from reader-mode, relative to other browsers which preserve and honor thesrcsetattribute to get lower-res versions of the image.
Is this still a result of the sanitizer issue, ie bug 1546235 ? That's what I'd expect...
Comment 27•4 months ago
|
||
Yeah, I think you're right - I'll restore the summary here and morph bug 1980330 to be a dupe of that bug.
Updated•4 months ago
|
Description
•