Insufficient caching of images leading to resource wastage and malicious exploitation by websites
Categories
(Core :: Networking: Cache, enhancement, P3)
People
(Reporter: pedrov, Unassigned)
Details
(Whiteboard: [necko-triaged])
This issue is essentially Bug 120809, but back then it was considered mostly an annoyance, so I'll try to explain how it's a bug that is actively being exploited.
It's a tricky problem, possibly consisting of the following elements:
- Even when an image is cached, "Open Image in New Tab" / "Save Image As..." can still make a network request. Likely the URL is not the only cache key, so these functions behave as if the user had manually opened a new tab and pasted the URL there.
- As I understand from the other bug reports, the image file may not be cached at all: it may just be converted to the internally used image format, with the file itself discarded if it's not allowed to be cached. Wanting to pin such files in the cache may seem like an enhancement request, but given how this doesn't match users' expectations, and how it's being exploited, I deem the issue a bug at this point.
Some examples of how this problem can be experienced:
- As the older bug reports mostly describe, downloading images again is a waste of resources and time. Not great, but I'd call this a low-priority problem. In some cases it gets worse, though: a large image can take a long time to download again, and the data cap plague still pushed by some ISPs can make this really bad for some users. Also consider that adding "Open Image in New Tab" into the mix can end up with 3 downloads of the same image when caching is prevented.
- The image file may already be gone by the time the user wants to download it, even though the image is still visible. The severity of this really depends on the user. Those who keep their sessions clean will mostly brush it aside, but the way I browse, for example, I open a whole lot of tabs I want to process, and some of them may not be dealt with for a long time. This can be argued to be inherently risky anyway, due to issues like the browser crashing, or hostile websites like Twitter forcing a page refresh after a while. Still, if the image itself is shown, it's a really bad user experience that it can't be downloaded even though the browser offers that functionality; it just stops working in some cases, unpredictably.
- Some websites abuse this behavior to prevent downloading, or to prevent the image from being viewed on its own outside a (mostly malicious) frame. They exploit the network request, which gives the server the opportunity to serve different content: either a dummy image, or potentially even an HTML page instead of the image. I'd deem this a critical problem because it just shouldn't be possible, and the lack of caching is being maliciously exploited.
I've personally experienced each of these quite a few times, but the last one is what brings me here today. Given that the ancient bug reports were closed, I don't think the significance of the problem was understood earlier. After various (mostly image hosting) sites took advantage of it here and there, it's now a "mainstream" problem, with Reddit actively exploiting it.
Example abuse of the bug:
https://preview.redd.it/2vzxlupz0fab1.jpg?width=1024&auto=webp&v=enabled&s=63031fe734b8f4ef83e5eb0e193f55cfc902692e
The link leads to a page which embeds an image with the same URL the page itself has.
That seems to exploit multiple problems:
- As mentioned earlier, since opening an image in a new tab and downloading it are not purely local actions as expected, the page traps the user on itself. It doesn't work all the time, but there's a high chance that "Open Image in New Tab" just opens the same page in a new tab, with all the extra bloat and tracking, instead of just the image.
- The same URL gets reused, likely by the server distinguishing how the browser requests it: the request for the embedded <img> differs from the initial page request. This may not itself be an issue in need of fixing, but it's yet another reason why what should be a purely local action shouldn't turn into a network request, as that gets abused.
- The image in this case actually gets cached fine even though it shares the URL with the page, yet "Open Image in New Tab" still doesn't use the cached image. This is likely a combination of the earlier guess that the image cache key is more than just the URL, and the <img> src being requested differently than the page, even though the cached image is what should be used to begin with.
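To illustrate the second point, here is a minimal sketch of how a server could serve two different bodies for the same URL. It is only a guess at the general technique, not a description of what Reddit actually does. The function name `choose_response` and its return labels are hypothetical; `Sec-Fetch-Dest` and `Accept` are standard request headers whose values differ between an embedded <img> load and a top-level navigation.

```python
def choose_response(headers: dict[str, str]) -> str:
    """Return which body a hypothetical server would send for one shared URL."""
    # HTTP header names are case-insensitive, so normalize them first.
    h = {name.lower(): value for name, value in headers.items()}
    dest = h.get("sec-fetch-dest", "")
    accept = h.get("accept", "")

    if dest == "image":
        return "image"  # request came from an embedded <img>
    if dest == "document":
        return "html"   # top-level navigation, e.g. a new tab
    # Assumed fallback when Sec-Fetch-Dest is absent:
    # navigations list text/html in Accept, image loads normally do not.
    if "text/html" in accept:
        return "html"
    return "image"
```

Any action that re-requests the URL as a navigation lands in the "html" branch, which is why a purely local operation would sidestep the trap.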
This is not even the worst possible form of abuse: the other bug reports already noted that image download prevention has been seen in the wild. With exploitation of the issue apparently becoming mainstream, I hope this will be properly considered a bug, given that it allows functions to be broken intentionally.
Hopefully the selected component for the bug report is mostly right, although I believe this issue could need two different kinds of treatment:
- The image files still "in use" would need to be pinned in the cache to avoid the file disappearing while the image is still shown. The tricky part is that new requests may want an updated copy, so the cache would still need to be updated without losing the original file.
- Opening in a new tab or saving an image should be guaranteed to be local operations, unaffected by the server's caching policy and by later cache updates. The original file belonging to the shown image is what should be used, neither reaching out to the server again, nor using a potentially updated response for the same URL in the cache.
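A minimal sketch of the pinning idea in the two points above, assuming a toy in-memory model. `PinningCache`, `image_id`, and all method names are hypothetical and unrelated to Firefox's actual cache classes; the only point is that a displayed image keeps its original bytes while the per-URL entry is free to be updated.

```python
from dataclasses import dataclass, field


@dataclass
class PinningCache:
    # Latest response body per URL, replaced freely by new requests.
    entries: dict[str, bytes] = field(default_factory=dict)
    # Original bytes per displayed image, kept until the image goes away.
    pinned: dict[int, bytes] = field(default_factory=dict)

    def store(self, url: str, body: bytes) -> None:
        """Record (or overwrite) the newest response for a URL."""
        self.entries[url] = body

    def pin(self, image_id: int, url: str) -> None:
        """Remember the bytes this displayed image was decoded from."""
        self.pinned[image_id] = self.entries[url]

    def unpin(self, image_id: int) -> None:
        """Release the bytes once the image is no longer shown."""
        self.pinned.pop(image_id, None)

    def bytes_for_image(self, image_id: int) -> bytes:
        """What "Save Image As..." would write: the pinned original."""
        return self.pinned[image_id]
```

In this model a later `store()` for the same URL changes what new requests see, but not what a save or open action on the already displayed image returns.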
I wonder if this is a regression, or whether the intended behavior merely happened to work before the web got so complex and malicious, without ever being guaranteed.
In the large web of related bug reports (the linked bug report seems to have a sizable collection), I've found this comment, from a user with a rather amusing profile name, which accurately describes the general expectation of users:
https://bugzilla.mozilla.org/show_bug.cgi?id=55583#c280
Comment 1•2 years ago
Thanks for the report and detailed argument.
Of the 3 issues you raise:
- waste of resources: True, but a relatively minor point. The data cap issue is probably more of an issue than the time required (which was an issue in the days of dialup 56K or slower modems).
- Image may no longer be served by the server: true, again relatively low impact. The biggest actual issue would be a 'live' image (webcam, etc) which may change on every download. But even that's relatively small.
- Usage to prevent downloading or force it to be seen in a site-provided frame: this is the strongest argument. There are many ways to block or partially block right-click actions (and some ways, known to few users, to avoid the blocks), so sites will generally be successful in blocking. They can also return a redirect on "Open Image in New Tab" if a network request occurs, and it will if the image is marked non-cacheable (though in theory we could reuse the internal representation, even though it may have lost information).
Changing this would be a significant re-architecture of the browser's internals, especially trying to retain the original data when an image is marked non-cacheable. It has the potential to break real use cases such as webcams: special cases for "Save Image As" etc. would have to be created, along with a bunch of cache changes to ignore cached-but-not-supposed-to-be files. The lifetime of these cache entries would also have to be tied to the visibility of the image, which would interact with things like the back-forward cache. I'll also note that at least the example you gave above for Reddit acts the same in Chrome.
Comment 2•2 years ago (Reporter)
I suspected the cache change would be hard to deal with, and I'd generally expect pushback there, as it would increase storage/memory usage for a function not everyone uses. On the other hand, as described, the broken functionality is frustrating.
However, I'm still curious about how the cache is used as-is, since that part of the problem is likely to be less controversial and may not be as difficult to deal with.
In the given Reddit example the image is actually cached, so the problem is that the mentioned functions don't get a cache hit when they should. I'd break this part into two subparts:
- In theory, the cache hit should be able to happen, given that the image is in the cache. As I understand it, there's some trickery with the URL not being the sole cache key. However, casually looking into the code quickly showed me that I'm not even sure how the "origin" would differ for the 2 URLs, so I don't have a good understanding of what makes this "break"; I just determined that the image is in the cache after all.
- As mentioned earlier, I believe users of the named functions expect (as voiced by others in the other bug reports too) that using them causes no network usage at all. Even if they rely on the cache, and even if files can't be pinned (yet) or cache pinning is never going to be implemented, I believe the functions should use the cache first even if the cached content is expired, and go to the network only if there's no cache hit at all.
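The cache-first behavior asked for in the second subpart can be sketched as follows. This is a toy model under stated assumptions, not Firefox code: the cache is a plain dict mapping a key to `(body, expired)`, and `load_for_local_action` and `fetch` are hypothetical names.

```python
from typing import Callable


def load_for_local_action(
    cache: dict[str, tuple[bytes, bool]],
    key: str,
    fetch: Callable[[str], bytes],
) -> tuple[bytes, str]:
    """Return (body, source) for "Open Image in New Tab" / "Save Image As...".

    Any cache hit wins, even an expired one; the network is used
    only when there is no entry at all.
    """
    entry = cache.get(key)
    if entry is not None:
        body, _expired = entry  # expiry is deliberately ignored here
        return body, "cache"
    return fetch(key), "network"
```

The design choice is that freshness matters for new page loads, while a local action on an image that is already displayed should prefer the bytes the user is looking at.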
Going in this direction only and leaving the cache alone makes "Networking: Cache" not really fit as the component, but I do understand that a cache rework might be undesirable, so I'm just trying to see what could be a reasonable middle ground.
I believe most of the functionality needed for at least the easier requests is already present; it's just not presented to the user in a reasonable way.
For example going back to the example Reddit URL, I could dig up:
about:cache-entry?storage=memory&context=O^privateBrowsingId=1&partitionKey=%28https%2Credd.it%29,p,&eid=&uri=https://preview.redd.it/2vzxlupz0fab1.jpg?width=1024&auto=webp&v=enabled&s=63031fe734b8f4ef83e5eb0e193f55cfc902692e
(Private browsing was used for the test; as far as I can see, the exact same URL can be reused there.)
Amusingly, the "key:" URL doesn't lead to the cached content either, but loads the "trap" page, showing one of the fundamental issues here.
This page is already pretty good, and it shows that much of the desired functionality is there:
- There's already a page which shows the cache entry, even if it only shows the file as a hexdump instead of decoded media
- The URL is really good: even if only the image were shown, the original URL could be recovered by just removing the "prefix" part
- As every identifying info is in the URL, it really doesn't matter how it's opened, it will always show the same cache entry
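As a small illustration of the "prefix" point, here is a sketch that recovers the original URL from an about:cache-entry URL shaped like the one above. It assumes, as in that example, that `uri` is the last parameter and is not percent-encoded; `original_url` is a hypothetical helper name.

```python
def original_url(cache_entry_url: str) -> str:
    """Strip the about:cache-entry "prefix" and return the original URL."""
    # Everything after the first "&uri=" is the original URL, including
    # its own query string, so split only once instead of parsing pairs.
    prefix, sep, uri = cache_entry_url.partition("&uri=")
    if not prefix.startswith("about:cache-entry?") or not sep:
        raise ValueError("not an about:cache-entry URL with a uri parameter")
    return uri


entry = (
    "about:cache-entry?storage=memory&context=O^privateBrowsingId=1"
    "&partitionKey=%28https%2Credd.it%29,p,&eid=&uri="
    "https://preview.redd.it/2vzxlupz0fab1.jpg?width=1024&auto=webp"
    "&v=enabled&s=63031fe734b8f4ef83e5eb0e193f55cfc902692e"
)
print(original_url(entry))
```

A generic query-string parser would split the original URL's own `&auto=webp&v=enabled&s=...` parameters into separate fields, which is why the sketch partitions on the first `&uri=` instead.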
I'm not sure about the UI details of how this could be presented well. I do understand that not every user would be happy about having to remove a "prefix" just to get potentially fresh content, so there may be additional needs, like a context menu entry to go to the original URL, or maybe (Ctrl)-F5 getting hijacked to do that. Still, I believe that trying such an internal page first could help achieve the goal of leaving the cache alone while using what's in the cache first, and only falling back to a network request if there's nothing there.