Open Bug 1445211 Opened 6 years ago Updated 9 months ago

Download (save page as...) with some ad-blocker fails (because some subresources are blocked) and succeeds when retried

Categories

(Core :: DOM: Serializers, defect, P3)

defect

Tracking


Tracking Status
firefox71 --- affected
firefox72 --- affected
firefox73 --- affected
firefox79 --- affected

People

(Reporter: reibjerk, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [tor 32225])

Attachments

(6 files)

User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0
Build ID: 20180206200532

Steps to reproduce:

When downloading most web pages (all but the very basic ones), the download fails, probably because of ad-blocker functionality such as blocking host names with uMatrix or MVPS Hosts entries in the hosts file (etc/hosts).



Actual results:

The download status is 'Failed'.
Choosing to Retry the download always succeeds.
This has to be done for more than 95% of all web pages; only very basic pages do not fail.


Expected results:

If Firefox could download a page on the first attempt in the same manner as the 'Retry' functionality does, one would not have to Retry every single download.
Hi reibjerk,

Could you please provide a test web page to try this? Also, can you reproduce the issue without the ad-blocker? Thanks
Flags: needinfo?(reibjerk)
With the web page https://www.mozilla.org/en-US/ :
I have uMatrix installed with its default ad-block functionality (etc/hosts file dummy entries for ~90000 ad sites).
I download the above page (with Ctrl-S) and get the error:
'Internet for people, not profit -- Mozilla.htm
Failed'

After emptying the Windows hosts file, the following message appears in the downloads panel
('Display the progress of ongoing downloads (Ctrl-J)'):

'Internet for people ..etc
Completed -- 104 kB'

I can also 'Retry' the download, which always succeeds.
Flags: needinfo?(reibjerk)
I'm sorry to say that there might be a problem here - on my behalf...
When trying to 'restore' the hosts file by updating the 'hosts entries' in uMatrix, I DID NOT SEE
the expected change in the hosts file. I wonder whether uMatrix uses this file at all, or maybe some
internal 'DNS lookup' mechanism instead?
The entries in the hosts file could just be remnants from the use of the MVPS-Hosts mechanism,
which I used before uMatrix for ad-blocking purposes.

I will have to investigate this further, unless you or anyone reading this knows how uMatrix
'hosts file' functionality is implemented...

If desired, you may feel free to discard this bug report because of this uncertainty I have now discovered!

The download _does_ fail if there are dummy references to ad sites in the hosts file,
but the scenario I expected to be the base of this problem is now uncertain.
So far I've been fumbling around; since uMatrix is a browser extension it does not have its own install location, I assume, but could/should keep its data as part of the Firefox profile. In that location I do find some SQLite DB files - which ones belong to Firefox and which ones (if any) do not, I do not know...
Also there is a directory called browser-extension-data, but I only find a small .js file for uMatrix.

There are also SQLite DB files like webappstore.sqlite, but that one seems to have an overly complex DB structure for simply storing hosts entries... (Not that it's overly complex in itself, just too complex for storing hosts entries.)

As said I'm fumbling blindly trying to find where uMatrix keeps its hosts additions...
Time to ask the author, Raymond Hill ?

I'm still flabbergasted that the system's hosts file did not seem to be used...
Here is a single entry in the hosts file that will make a Firefox download of a web page fail:
0.0.0.0 www.googletagmanager.com

Put this in your hosts file and the above URL download will fail.
Could you please test this out?

There are probably more dummy entries (blocking undesired ads and tracking)
that will trigger download failures, but finding them takes time.

There seems to be some built-in dependency on the Google 'track everybody' infrastructure...
Flags: needinfo?(amasresha)
I installed the addon [uMatrix] and was able to save the complete web page. Are these the correct steps to reproduce the issue? If not, could you provide more information on how you download the page? A screen capture would be great.
Component: Untriaged → Extension Compatibility
Flags: needinfo?(amasresha)
If possible, you could try out my suggestion in my previous post. 
See the post above yours starting with 'Here is a single entry...'

Adding an ad-blocker (dummy) entry to your hosts file, like the single line I have narrowed it down to, will make my download of the aforementioned web page ( https://www.mozilla.org/en-US/ ) fail.
That single line originally comes from the MVPS Hosts utility.

Re-upping for QA reproduction. Otherwise, we should close this.

Component: Extension Compatibility → General
Flags: needinfo?(kraj)
Product: Firefox → WebExtensions

Alex, can you help reproduce this issue? Thanks!

Flags: needinfo?(alexandru.cornestean)

If I can help, I will.

It may help to know that I'm still experiencing this issue, but to a lesser extent.

It was originally reported on Windows 7, I think, but I'm now on the latest Windows 10, with the latest Firefox of course.
So this is, compared to the original report, a system installed from scratch. (So it was not a fluke on the Windows 7 machine.)

Also, I no longer run with ad-blocking through dummy entries in my hosts file, but indirectly through the uMatrix add-on.

The misbehaviour happens on some/many sites when I try to download them for local storage.
I just picked one site now; https://edition.cnn.com/
and did a Ctrl-S, save locally.
The download icon gets a yellow dot that indicates an error. And, as before, when I retry it always succeeds.

But, and this may be related, I have issues with viewing the local web copies afterwards.
They seldom work/display as desired.

It would (hopefully) be strange if I'm the only one experiencing this, but I may not be the most typical user,
using uMatrix and (totally unrelated!) having my shared ext4 partition on /dev/sdi15 :-)

Just tell me what trace/files you need.

Or, you could just try the single line in your hosts file;
0.0.0.0 www.googletagmanager.com

I was able to consistently reproduce the issue on the latest Release (67.0.3 / 20190618025334), Beta (68.0b12 / 20190619234730) and Nightly (69.0a1 / 20190619214046) under Windows 10 Pro 64-bit and macOS High Sierra 10.13.6, following the provided STR.

To be more specific regarding what I’ve done to reproduce the issue, I’ve created a new/fresh profile, installed the latest version of uMatrix (version 1.3.16 from https://addons.mozilla.org/en-US/firefox/addon/umatrix/), proceeded to the mentioned websites (and a couple more) where I attempted to download the web pages via CTRL+S.

These are the results I’ve noticed when reproducing the issue:

The page fails to download on the first attempt even with uMatrix disabled or not installed at all. Upon retrying, the page was downloaded successfully. This obviously occurs with the add-on installed as well; however, I believe the add-on does not influence the results, i.e. the page fails to download on the first attempt either way.

Opening the locally stored page did not properly display the contents (see screenshots 1 and 2). The page contents are not displayed properly regardless of uMatrix being enabled or disabled, or installed for that matter.

The page is downloaded successfully on the first try with uMatrix disabled or not installed at all.

With the add-on installed, download fails on the first attempt however, upon retrying, the page was downloaded successfully.

Opening the locally stored page did not properly display the contents either (see screenshots 3 to 6). For screenshots 5 and 6, uMatrix seems to still be blocking some content and thus the page appears as depicted. Disabling the add-on and reloading the saved page will display it properly, like the original, non-downloaded page.

Tried with https://www.facebook.com/ as well. With the add-on enabled, the page downloads only after retrying the process. The page is however correctly displayed when loaded from local storage, regardless of having the add-on enabled or disabled.

With https://www.youtube.com/, download succeeds on the first try with the add-on enabled, though the page is not correctly displayed when loaded locally, regardless of having the add-on enabled or disabled (the page initially loads correctly and immediately after it is fully loaded, it goes blank).

Regarding the alternate method of reproducing the issue (0.0.0.0 www.googletagmanager.com added to the hosts file), I am not sure how exactly to do this, so I would like to ask you to provide more detailed STR so I can attempt this as well, just in case. Thanks!

Flags: needinfo?(reibjerk)
Flags: needinfo?(kraj)
Flags: needinfo?(alexandru.cornestean)

On the topic (only) of adding
0.0.0.0 www.googletagmanager.com

or similar to the hosts file;

  1. What it does: it adds a local "DNS-like" lookup entry for a host name.
    You thereby tell the computer at which address this host can be found. This can be useful for naming server aliases, for instance.
    www.googletagmanager.com is actually at address "2a00:1450:400f:809::2008" (IPv6), so by saying it is at address 0 (0.0.0.0, IPv4)
    you are disabling access to that address.

Since you may want to do this for ad-servers, this becomes an ad-block method.

This can be done manually, as in this case, or through use of a tool, for instance mvps (http://winhelp2002.mvps.org/hosts.htm)

  2. How you do it:
    The 'hosts' file is a plain text file, located in the directory
    C:\Windows\System32\drivers\etc
    for Windows, and /etc in Linux/Unix.

Take a backup of the original file and just use an editor to add
0.0.0.0 www.googletagmanager.com

at the end of the file. You need to be admin (root) to change this file.
You may want to test this mechanism by adding
1.2.3.4 myhost
for instance in the hosts file. Then you can do a
ping myhost
at the command line to see that the dummy host entry is working. You will not get any reply to the ping, but you will see that the host name 'myhost' resolves to the IP address 1.2.3.4, and hence the mechanism is working.

Flags: needinfo?(reibjerk)

Hello,

I have configured the ‘hosts’ file as you have detailed above, for both Windows 10 Pro 64-bit and macOS High Sierra 10.13.6 and have managed to reproduce the issue (with the same results as when using uMatrix) on the latest versions of Firefox (Release - 67.0.4 / 20190619235627; Beta - 68.0b12 / 20190619234730; Nightly – 69.0a1 / 20190620220631).

The only differences I have managed to observe are that loading the locally saved https://www.mozilla.org/en-US/ will now display its contents properly and that https://www.facebook.com/ will download on the first attempt.

Also, I have tried another webpage (https://www.timesnewroman.ro) which will fail to download on the first attempt and will succeed only after retrying. Loading the saved page will not display the contents correctly.

As a conclusion, the reported issue is present and consistently reproducible, either using the uMatrix extension or by modifying the ‘hosts’ file, with a wide range of websites being affected by this.

Status: UNCONFIRMED → NEW
Ever confirmed: true

Based on the above comments it seems that this issue can be reproduced not just with an extension but also by changing the /etc/hosts file on the system, and so it doesn't seem to be an issue specific to a WebExtensions API.

I'm moving it into the "Toolkit :: Downloads API" to be re-triaged (but it could also be that the right bugzilla component is "Firefox :: File Handling", based on the component description of "Toolkit :: Downloads API").

Component: General → Downloads API
Product: WebExtensions → Toolkit

This looks like bug 1536530, but for any website where a subresource fails to load as a result of an adblocker / hosts block. I expect the retry works because (though I haven't verified this) we've network-cached the fact that the request failed, and somehow that doesn't break the webbrowserpersist code in the same way.

I'm not sure what we want to do here. Failing the download is in principle correct, as one (or more) of the requests that were part of the download failed. However, it's clearly not very helpful here. The crux is likely to be whether we can distinguish the nature of the failure in the webbrowserpersist code (from "real" network failures) and do something else. Luca, do you know how uMatrix and other such solutions reject these types of requests in the webrequest API, and what the resulting XPCOM error is?

As for how "correct" the resulting page is, that's not really related here -- saving a webpage locally is always tricky and best-effort. For instance, if you save a page without any scripts, some elements won't work. But if you save a page and include the script code, it might run differently (when it realizes it's not being served from the original http(s) address), all the more so if you save the "live" DOM instead of the as-requested-from-the-server DOM. So I wouldn't worry about that in relation to this issue.

Component: Downloads API → DOM: Core & HTML
Flags: needinfo?(lgreco)
Product: Toolkit → Core
See Also: → 1536530

(In reply to :Gijs (he/him) from comment #23)

Luca, do you know how uMatrix and other such solutions reject these types of requests in the webrequest API, and what the resulting XPCOM error is?

From a very quick look at the uMatrix sources, it looks like blocked subresources are cancelled by returning {cancel: true} from a blocking webRequest listener:

That return value is then used by WebRequest.jsm to actually cancel the request (by calling the ChannelWrapper's cancel method with Cr.NS_ERROR_ABORT as the status):
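
For reference, here is a minimal sketch of the kind of blocking listener described above. It is illustrative only (the URL filter is made up, and uMatrix's real matching logic is far more involved), but it shows the mechanism that turns a blocked subresource into a cancelled channel:

    // Requires the "webRequest" and "webRequestBlocking" permissions plus a
    // matching host permission. Returning {cancel: true} is what ultimately
    // makes WebRequest.jsm cancel the channel with NS_ERROR_ABORT, which the
    // "Save Page As..." code then counts as a failed part of the download.
    browser.webRequest.onBeforeRequest.addListener(
      details => {
        return { cancel: true };
      },
      { urls: ["*://www.googletagmanager.com/*"] },
      ["blocking"]
    );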

Flags: needinfo?(lgreco)

Looks like bug 1493599 added some info that should allow us to distinguish some of these cases, but it also looks like there's no specific blocked status for things cancelled through the webrequest listener. It looks to me like we just don't show those resources in the network inspector. Honza, does that look right? Is there some other way to distinguish these requests, given that lots of things call nsIRequest::cancel with NS_ERROR_ABORT ?

Flags: needinfo?(odvarko)

(In reply to :Gijs (he/him) from comment #25)

Looks like bug 1493599 added some info that should allow us to distinguish some of these cases, but it also looks like there's no specific blocked status for things cancelled through the webrequest listener. It looks to me like we just don't show those resources in the network inspector. Honza, does that look right? Is there some other way to distinguish these requests, given that lots of things call nsIRequest::cancel with NS_ERROR_ABORT ?

I don't know whether the changes introduced in bug 1493599 can be of any help here; they are for requests blocked by the platform (CORS, CSP, etc.), not by add-ons.

But there is another bug, bug 1555057, for requests blocked by add-ons, and one of the suggestions is to introduce cancelWithReason and use it in WebRequest:
https://searchfox.org/mozilla-central/rev/0671407b7b9e3ec1ba96676758b33316f26887a4/toolkit/components/extensions/webrequest/WebRequest.jsm#830
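
Purely as a hypothetical illustration of that suggestion (neither the method nor the reason string below existed at the time of this comment; "channel" and "addonId" are just placeholders), the idea is roughly:

    // Hypothetical: cancel the channel but record *why* it was cancelled, so
    // consumers like webbrowserpersist can tell an extension block apart from
    // a genuine network failure.
    channel.cancelWithReason(Cr.NS_ERROR_ABORT, `blocked by extension ${addonId}`);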

This API is also mentioned in bug 1556451 and it sounds like it could be useful for several things (including this bug report).

Honza

Flags: needinfo?(odvarko)
Depends on: 1555057
Summary: Download with some ad-blocker fails and succeeds when retried → Download (save page as...) with some ad-blocker fails (because some subresources are blocked) and succeeds when retried

This obviously affects the Tor Browser too (via NoScript), see https://trac.torproject.org/projects/tor/ticket/32225#comment:9

Version: 58 Branch → Trunk
Whiteboard: [tor 32225]
Status: NEW → RESOLVED
Closed: 5 years ago
No longer depends on: 1555057
QA Contact: Virtual
Resolution: --- → DUPLICATE
See Also: 1536530
Status: RESOLVED → REOPENED
QA Contact: Virtual
Resolution: DUPLICATE → ---
Whiteboard: [tor 32225]
Depends on: 1604618

Fixing the downloads issue shouldn't depend on DevTools netmonitor changes. The WebExtension functionality is ready with bug 1604618.

No longer depends on: 1555057
See Also: → 1555057

(In reply to :Gijs (he/him) from comment #34)

It'd be helpful if someone could either debug or clarify from the webext side, at what point in the channel lifetime webextensions can/do cancel URI loads right now.

That is documented on MDN [1]. Or is the query about translating webRequest events to httpchannel notifications? "http-on-modify-request", "http-on-before-connect" and http-on-examine-* are probably the most common, but any webRequest API documented as accepting the blocking param can potentially cancel. You can see the mechanisms used for the various events here [2].

Each channel's loadInfo now has a cancel reason on it, and the property bag on the channel will contain the extension id. So it is possible for the download code to be aware that an extension has cancelled some part of a page download. I'd probably check the cancel reason on loadInfo at any point where the download may be cancelled via the channel.

[1] https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/webRequest
[2] https://searchfox.org/mozilla-central/rev/cfd1cc461f1efe0d66c2fdc17c024a203d5a2fd8/toolkit/components/extensions/webrequest/WebRequest.jsm#1154
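
To make the suggestion above concrete, here is a hedged, chrome-privileged sketch of the check the download code could perform. The names requestBlockingReason, BLOCKING_REASON_EXTENSION_WEBREQUEST and the "cancelledByExtension" property-bag key are assumptions pieced together from this bug and bugs 1493599/1604618; they may not match the actual tree:

    // Sketch only: decide whether a failed subresource channel was cancelled by
    // an extension's webRequest listener rather than by a real network error.
    function wasCancelledByExtension(channel) {
      const loadInfo = channel.loadInfo;
      if (
        loadInfo.requestBlockingReason !==
        Ci.nsILoadInfo.BLOCKING_REASON_EXTENSION_WEBREQUEST
      ) {
        return false; // looks like a genuine network failure
      }
      try {
        // The channel's property bag is said to carry the id of the extension
        // that cancelled the request ("cancelledByExtension" is a guessed key).
        const addonId = channel
          .QueryInterface(Ci.nsIPropertyBag2)
          .getPropertyAsAString("cancelledByExtension");
        console.log(`Subresource blocked by extension ${addonId}`);
      } catch (e) {
        // Extension id not available; still an extension cancellation.
      }
      return true;
    }

The persist code could then treat such subresources as intentionally skipped instead of marking the whole "Save Page As..." download as failed.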

Component: DOM: Core & HTML → DOM: Serializers
Priority: -- → P3

This has been a nuisance for me for a year or more, but I've just gotten used to clicking the downloads dropdown and using Retry to save the page correctly. However, over the past month or so I've noticed that for some websites the retry will actually mess things up rather than fix the problem: for those sites I need to leave the download marked "failed" for the saved webpage to work, and if I do retry it, the saved version becomes corrupted (losing elements like formatting or some dynamically loaded components).

My system: OSX with Firefox 72.0, with Adblock Plus extension enabled. (Yes, if I disable it, webpages save correctly, but I shouldn't be required to do that when they load correctly in the browser, and when the desired blocking is unrelated to the websites that have problems saving!)

This happens on most websites I visit (I'm guessing most websites that have dependent files to load). One easy (and popular) website to try this on is Worldcat, such as http://www.worldcat.org/oclc/1028407523 (but any book will do). As you can see there, when you load the webpage, Worldcat dynamically loads a list of nearby libraries that have the item. (Adblock Plus has no effect on this when viewing!) But when I try to save the page, it says "failed" even though the save works, and if I retry, that dynamically loaded section is removed from the saved file! (As a researcher, I'm often saving pages like these specifically to remember where rare books are located. But this is just one example of many websites where this happens.)

This is becoming infuriating.

Flags: needinfo?(mbrodesser)
Flags: needinfo?(mbrodesser)

Hi, I've just set up Windows 10 version 1909 (I was on Windows 8.1 before), and I have had the bug since then with Firefox 75 and uBlock Origin 1.26.0.

Description:

When I try to save a FULL webpage (with the pictures, not just the HTML part) that has some blocked content inside, Firefox tells me that the download failed, which is wrong.

A specific URL where the issue occurs:

https://www.freenews.fr/freenews-edition-nationale-299/une-collecte-organisee-pour-fabriquer-des-visieres-de-protection-en-3d-par-nos-makers-et-pour-les-soignants

Steps to Reproduce:

1. Go to that webpage.
2. Do Ctrl+S, select FULL webpage.
3. Click on Save.

Expected behavior:

The download icon of Firefox should be blue and indicate that the download succeeded.

Actual behavior:

The download icon shows an orange circle and says it failed.

Your environment:

uBlock Origin version: 1.26.0
Browser name and version: Firefox 75.0 64-bit
Operating system and version: Windows 10 64-bit

I hope someone can fix that, because it's annoying.

Flags: needinfo?(mixedpuppy)
Flags: needinfo?(gijskruitbosch+bugs)

We're aware of this issue, we know what causes it, but it currently isn't a priority to address. I'd be happy to review a patch, though even if this was a priority it'd probably make sense to wait for the refactoring in bug 1576188 to land first.

My understanding is that there's a very easy workaround: disable uBlock temporarily when you save a webpage.

Flags: needinfo?(gijskruitbosch+bugs)
Flags: needinfo?(mixedpuppy)

(In reply to :Gijs (back Tue 14; he/him) from comment #38)

My understanding is that there's a very easy workaround: disable uBlock temporarily when you save a webpage.

Hi, for your information, disabling uBlock Origin via the extension itself doesn't solve the issue... It must be disabled in about:addons.

(In reply to Julien L. from comment #39)

(In reply to :Gijs (back Tue 14; he/him) from comment #38)

My understanding is that there's a very easy workaround: disable uBlock temporarily when you save a webpage.

Hi, for your information, disabling uBlock Origin via the extension itself doesn't solve the issue... It must be disabled in about:addons.

This may happen when the "context" of a request is not preserved. AFAIK parentFrameId and originUrl must be correctly set in the webRequest callback for uBO to know which filters to apply. If this info is lost, uBO will not know whether the request is whitelisted.

We won't know the (internal equivalent of the) parentFrameId here, at least for now. It's possible the refactor in bug 1576188 will help with that though.

disabling Ublock origin via the extension doesn't solve the issue

I can't reproduce this on my side; disabling uBO on https://www.freenews.fr/ makes the "Save As..." error disappear. Also, uBO does not use parentFrameId when tabId is -1. uBO's logger should be used to validate that uBO didn't block anything at "Save As..." time after being disabled. If the logger does not show anything being blocked, it should be investigated whether something other than uBO blocked the network requests.

Oh, sorry, I did not check this thoroughly; it turns out that on my side ClearURLs was responsible for this error. It was configured to only redirect, but somehow it was also blocking some requests. I reinstalled and reconfigured it and no longer see errors with uBO disabled for the page.

(In reply to :Gijs (he/him) from comment #38)

We're aware of this issue, we know what causes it, but it currently isn't a priority to address. I'd be happy to review a patch, though even if this was a priority it'd probably make sense to wait for the refactoring in bug 1576188 to land first.

My understanding is that there's a very easy workaround: disable uBlock temporarily when you save a webpage.

What about adding an option to disable any content-altering extensions in the page saving dialog?

(In reply to Digi from comment #45)

What about adding an option to disable any content-altering extensions in the page saving dialog?

I think the usual user expectation would be that resources blocked by the content altering extensions would also not be saved - if the page worked without them when rendered from the web, we can do the same when saved to disk, right?

Even if we added this option, some users would not use it, and the feature should Just Work in that case, and we shouldn't mark the download as failed.

(In reply to :Gijs (he/him) from comment #46)

(In reply to Digi from comment #45)

What about adding an option to disable any content-altering extensions in the page saving dialog?

I think the usual user expectation would be that resources blocked by the content altering extensions would also not be saved - if the page worked without them when rendered from the web, we can do the same when saved to disk, right?

Even if we added this option, some users would not use it, and the feature should Just Work in that case, and we shouldn't mark the download as failed.

I'd say that the usual user expectation is also that page saving doesn't fail for obscure reasons.

PS: from the user's perspective, saving a slightly different page is way better than not saving it at all.

See Also: → 1668530
See Also: → 1681642

I'm still having this problem in v89, even when the adblocker is turned off. For instance, this page: http://rc-aviation.ru/chertplosk/91-ploskf22

Blocks: 1726362

(In reply to Digi from comment #48)

PS: from the user's perspective, saving a slightly different page is way better than not saving it at all.

I'd add that failed saves are often confusing in themselves. Firefox says the save failed - but I see the page's .html file and folder present on disk. Maybe some subresource was not saved, but the overall result looks like a success - yet it is marked as failed. So I don't know whether I can stop there, or what exactly is broken in the saved copy.

Severity: normal → S3
See Also: → 964213
Duplicate of this bug: 1794727
See Also: → 1820083
Duplicate of this bug: 1853631