Firefox Nightly sporadically fails to load any website until it gets restarted when DNS over HTTPS is turned on
Categories
(Core :: Networking, defect, P3)
Tracking
()
Tracking | Status | |
---|---|---|
firefox79 | --- | wontfix |
People
(Reporter: whimboo, Unassigned)
References
(Depends on 1 open bug)
Details
(Whiteboard: [necko-triaged])
Attachments
(2 files)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:79.0) Gecko/20100101 Firefox/79.0 ID:20200623213840
Just noticed this problem the first time today after updating my Nightly from the Monday build (20200621093846). After some hours of usage Firefox was not able to load any web page anymore. I had to restart it to get rid of the problem.
As attached you can find the HTTP log from trying to load www.mozilla.org. The navigation request I aborted after around 15s.
Valentin, mind having a look at the attachment? Might this be a new regression?
Reporter | ||
Comment 1•4 years ago
|
||
And something I forgot to say is, that I have DoH enabled, but disabling it didn't make a difference.
Comment 2•4 years ago
|
||
Possibly a new regression.
Looking at the logs there's only one DNS resolution being preformed:
2020-06-24 14:22:12.910653 UTC - [Parent 96146: Main Thread]: D/nsHostResolver Resolving host [api.irccloud.com]<> type 0. [this=0x125ed1760]
2020-06-24 14:22:12.910678 UTC - [Parent 96146: Main Thread]: D/nsHostResolver Using cached record for host [api.irccloud.com].
2020-06-24 14:22:12.910707 UTC - [Parent 96146: Main Thread]: D/nsHostResolver Checking blacklist for host [api.irccloud.com], host record [0x12660f2c0].
And then there's this:
2020-06-24 14:22:16.719110 UTC - [Parent 96146: Main Thread]: V/nsHttp Creating HttpBaseChannel @0x19d53a000
2020-06-24 14:22:16.719121 UTC - [Parent 96146: Main Thread]: D/nsHttp Creating nsHttpChannel [this=0x19d53a000]
2020-06-24 14:22:16.719128 UTC - [Parent 96146: Main Thread]: E/nsHttp HttpBaseChannel::Init [this=0x19d53a000]
2020-06-24 14:22:16.719135 UTC - [Parent 96146: Main Thread]: E/nsHttp host=www.mozilla.org port=-1
2020-06-24 14:22:16.719141 UTC - [Parent 96146: Main Thread]: E/nsHttp uri=https://www.mozilla.org/
2020-06-24 14:22:16.719156 UTC - [Parent 96146: Main Thread]: E/nsHttp nsHttpChannel::Init [this=0x19d53a000]
2020-06-24 14:22:16.719231 UTC - [Parent 96146: Main Thread]: D/nsHttp Destroying nsHttpChannel [this=0x19d53a000]
2020-06-24 14:22:16.719244 UTC - [Parent 96146: Main Thread]: V/nsHttp Destroying HttpBaseChannel @0x19d53a000
2020-06-24 14:22:16.862383 UTC - [Parent 96146: Main Thread]: D/nsHttp nsHttpHandler::NewProxiedChannel [proxyInfo=0x0]
2020-06-24 14:22:16.862439 UTC - [Parent 96146: Main Thread]: V/nsHttp Creating HttpBaseChannel @0x12a0fe000
2020-06-24 14:22:16.862456 UTC - [Parent 96146: Main Thread]: D/nsHttp Creating nsHttpChannel [this=0x12a0fe000]
2020-06-24 14:22:16.862463 UTC - [Parent 96146: Main Thread]: E/nsHttp HttpBaseChannel::Init [this=0x12a0fe000]
2020-06-24 14:22:16.862470 UTC - [Parent 96146: Main Thread]: E/nsHttp host=www.mozilla.org port=-1
2020-06-24 14:22:16.862475 UTC - [Parent 96146: Main Thread]: E/nsHttp uri=https://www.mozilla.org/
2020-06-24 14:22:16.862485 UTC - [Parent 96146: Main Thread]: E/nsHttp nsHttpChannel::Init [this=0x12a0fe000]
2020-06-24 14:22:16.862519 UTC - [Parent 96146: Main Thread]: D/nsHttp Destroying nsHttpChannel [this=0x12a0fe000]
2020-06-24 14:22:16.862529 UTC - [Parent 96146: Main Thread]: V/nsHttp Destroying HttpBaseChannel @0x12a0fe000
2020-06-24 14:22:16.961212 UTC - [Parent 96146: Main Thread]: D/nsHttp nsHttpHandler::NewProxiedChannel [proxyInfo=0x0]
2020-06-24 14:22:16.961263 UTC - [Parent 96146: Main Thread]: V/nsHttp Creating HttpBaseChannel @0x1920dd000
2020-06-24 14:22:16.961273 UTC - [Parent 96146: Main Thread]: D/nsHttp Creating nsHttpChannel [this=0x1920dd000]
2020-06-24 14:22:16.961280 UTC - [Parent 96146: Main Thread]: E/nsHttp HttpBaseChannel::Init [this=0x1920dd000]
2020-06-24 14:22:16.961286 UTC - [Parent 96146: Main Thread]: E/nsHttp host=www.mozilla.org port=-1
2020-06-24 14:22:16.961291 UTC - [Parent 96146: Main Thread]: E/nsHttp uri=https://www.mozilla.org/
2020-06-24 14:22:16.961301 UTC - [Parent 96146: Main Thread]: E/nsHttp nsHttpChannel::Init [this=0x1920dd000]
2020-06-24 14:22:16.961336 UTC - [Parent 96146: Main Thread]: D/nsHttp Destroying nsHttpChannel [this=0x1920dd000]
2020-06-24 14:22:16.961345 UTC - [Parent 96146: Main Thread]: V/nsHttp Destroying HttpBaseChannel @0x1920dd000
2020-06-24 14:22:17.044993 UTC - [Parent 96146: Main Thread]: D/nsHttp nsHttpHandler::NewProxiedChannel [proxyInfo=0x0]
2020-06-24 14:22:17.045048 UTC - [Parent 96146: Main Thread]: V/nsHttp Creating HttpBaseChannel @0x19bd67000
2020-06-24 14:22:17.045058 UTC - [Parent 96146: Main Thread]: D/nsHttp Creating nsHttpChannel [this=0x19bd67000]
2020-06-24 14:22:17.045066 UTC - [Parent 96146: Main Thread]: E/nsHttp HttpBaseChannel::Init [this=0x19bd67000]
2020-06-24 14:22:17.045072 UTC - [Parent 96146: Main Thread]: E/nsHttp host=www.mozilla.org port=-1
2020-06-24 14:22:17.045078 UTC - [Parent 96146: Main Thread]: E/nsHttp uri=https://www.mozilla.org/
2020-06-24 14:22:17.045094 UTC - [Parent 96146: Main Thread]: E/nsHttp nsHttpChannel::Init [this=0x19bd67000]
2020-06-24 14:22:17.045128 UTC - [Parent 96146: Main Thread]: D/nsHttp Destroying nsHttpChannel [this=0x19bd67000]
2020-06-24 14:22:17.045137 UTC - [Parent 96146: Main Thread]: V/nsHttp Destroying HttpBaseChannel @0x19bd67000
Not sure if this is the reason for the failure but it looks odd.
Kershaw, do you know if we landed something recently that could affect this?
Comment 3•4 years ago
|
||
It looks like that the http channel is suspended twice, but only resume once. The log shows that someone assign a new listener to channel (0x1b21a0000) and perhaps forgot to call resume. I think this probably be related to service worker or document channel.
2020-06-24 14:22:17.559167 UTC - [Parent 96146: Main Thread]: V/nsHttp HttpBaseChannel::SetNewListener [this=0x1b21a0000, mListener=0x18c8f8900, newListener=0x19b9b1bb0]
2020-06-24 14:22:17.560365 UTC - [Parent 96146: Main Thread]: D/nsHttp nsHttpChannel::SuspendInternal [this=0x1b21a0000]
2020-06-24 14:22:17.560413 UTC - [Parent 96146: Main Thread]: D/nsHttp Waiting until resume OnBeforeConnect [this=0x1b21a0000]
2020-06-24 14:22:17.560424 UTC - [Parent 96146: Main Thread]: D/nsHttp nsHttpChannel::ContinueBeginConnectWithResult result [this=0x1b21a0000 rv=0 mCanceled=0]
2020-06-24 14:22:17.560432 UTC - [Parent 96146: Main Thread]: D/nsHttp nsHttpChannel [0x1b21a0000] created nsChannelClassifier [0x19d4e69a0]
2020-06-24 14:22:17.560441 UTC - [Parent 96146: Main Thread]: D/nsHttp nsHttpChannel::Starting nsChannelClassifier 0x19d4e69a0 [this=0x1b21a0000]
2020-06-24 14:22:17.560497 UTC - [Parent 96146: Main Thread]: D/nsHttp nsHttpChannel::SuspendInternal [this=0x1b21a0000]
2020-06-24 14:22:17.563151 UTC - [Parent 96146: Main Thread]: D/nsHttp nsHttpChannel::ResumeInternal [this=0x1b21a0000]
Comment 4•4 years ago
|
||
Do you think this is a regression from bug 1647133 ?
The timing fits
Comment 5•4 years ago
|
||
Henrik, do you have any addon installed?
This could be also caused by an addon.
Comment 6•4 years ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #4)
Do you think this is a regression from bug 1647133 ?
The timing fits
It's very unlikely
bug 1647133 introduced zero functional changes; it just change the declaration of a type from nsIEventTarget* to nsISerialEventTarget* because every single object used where nsISerialEventTarget* (which inherit from nsIEventTarget*).
The code compiled should be identical.
Reporter | ||
Comment 7•4 years ago
|
||
(In reply to Kershaw Chang [:kershaw] from comment #5)
Henrik, do you have any addon installed?
This could be also caused by an addon.
Yes, I have a couple of addons installed:
About Sync 0.0.52 true aboutsync@mhammond.github.com
Amazon.com 1.1 true amazondotcom@search.mozilla.org
Bing 1.1 true bing@search.mozilla.org
Certainly Something (Certificate Viewer) 1.2.3 true a2fff151f5ad0ef63cbd7e454e8907c1fa9cc32008f489178775570374f408a7@pokeinthe.io
Cookie Quick Manager 0.5rc2 true {60f82f00-9ad5-4de5-b31c-b16a47c51558}
devtools-highlighter 1.2.0 true {ff0cf743-dcf3-4097-ae4c-d872c18f7b4e}
Disable WebRTC 1.0.23 true jid1-5Fs7iTLscUaZBgwr@jetpack
Don't track me Google 4.22 true dont-track-me-google@robwu.nl
DuckDuckGo 1.0 true ddg@search.mozilla.org
Firefox DevTools ADB Extension 0.0.5 true adb@mozilla.org
Firefox Multi-Account Containers 6.2.5 true @testpilot-containers
Forget Me Not - Forget cookies & other data 2.2.8 true forget-me-not@lusito.info
Foxy Gestures 1.2.7 true {e839c3f9-298e-4cd0-99e0-464431cb7c34}
Google 1.0 true google@search.mozilla.org
HTTPS Everywhere 2020.5.20 true https-everywhere@eff.org
Media Panel 2.6 true {68d048f4-9449-4c97-8425-6dac7f743b14}
Multi-touch Zoom 0.95 true {90b5a72e-cdbd-49df-8304-5b5d6ea84a0f}
MyQOnly 0.8.3 true myqonly@mikeconley.ca
Nightly Tester Tools 4.0 true {8620c15f-30dc-4dba-a131-7c5d20cf4a29}
NoScript 11.0.32 true {73a6fe31-595d-460b-a920-fcc0f8843232}
Searchfox in Phabricator 0.2.4 true {16c9140a-394e-400b-b1ce-ae340b04e4c0}
Testcase Reducer 1.3.1 true testcase-reducer@webcompat.com
uBlock Origin 1.27.10 true uBlock0@raymondhill.net
User-Agent Switcher and Manager 0.3.5 true {a6c4a591-f1b2-4f03-b3ff-767e5bedf4e7}
Wikipedia (en) 1.0 true wikipedia@search.mozilla.org
Zoom Scheduler 2.1.10 true {bf855ead-d7c3-4c7b-9f88-9a7e75c0efdf}
Firefox Lightbeam 2.1.0 false jid1-F9UJ2thwoAm5gQ@jetpack
No Transition 1.1.8 false {8b5fde66-c64d-4a33-99f1-c7c94138d67e}
So far it didn't happen again.
Comment 8•4 years ago
|
||
It should be bug 1648132, when StreamFilterParent::Attach happens between HttpChannelChild::OnStartRequest and HttpChannelChild::DoOnStartRequest, the SetNewListener will fail.
Let's wait for the next nightly, it should be soon.
That brings another topic, the suspension should be resumed even the channel fails to set new listener
https://searchfox.org/mozilla-central/rev/cfaa250d14e344834932de4c2eed0061701654da/toolkit/components/extensions/webrequest/WebRequest.jsm#935
I file bug 1648511
Comment 9•4 years ago
•
|
||
FWIW we're going to backout bug 1648132 and land it back in the beginning of next cycle.
Edit: it's bug 1633935 actually
Comment 10•4 years ago
•
|
||
Current nightly (20200625094452) includes the change in bug 1648132. Could you have another try, Henrik? Thanks.
Comment 12•4 years ago
•
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #0)
Created attachment 9158925 [details]
log.txt-main.96146.moz_logMozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:79.0) Gecko/20100101 Firefox/79.0 ID:20200623213840
Oops, 20200623213840 doesn't include bug 1633935 (edit: corrected the bug number)
it includes bug 1629299 and Bug 1645941 which are extension related.
Reporter | ||
Comment 13•4 years ago
|
||
I had the same problem this morning again. Here are the HTTP logs. I have to note that even after a restart it didn't work. So I restarted again with extensions disabled. Here it was working, but also after a restart with extensions enabled.
Valentin, can you see something specific? Is that the same issue or a different one?
Comment 14•4 years ago
|
||
Honza, do you have time to take a look at the logs?
Comment 15•4 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #13)
Created attachment 9161204 [details]
log.txt-main.11954.moz_logI had the same problem this morning again. Here are the HTTP logs. I have to note that even after a restart it didn't work. So I restarted again with extensions disabled. Here it was working, but also after a restart with extensions enabled.
Valentin, can you see something specific? Is that the same issue or a different one?
Henrik, when I look into the log I can see NS_ERROR_NET_TIMEOUT and NS_ERROR_UNKNOWN_HOST errors of sockets. Could your internet be just off for the moment?
In the log from comment 0 I can see all channels that were not used only temporarily for sec checking in suspendcount>0 state.
Reporter | ||
Comment 16•4 years ago
|
||
Hard to say. It's been a while since I created this log. I will make sure to check general network state when it happens again.
Comment 17•4 years ago
|
||
The severity field is not set for this bug.
:junior, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 18•4 years ago
|
||
Free free to set higher priority if we found that it's not caused by network temporary network disconnect.
Updated•4 years ago
|
Reporter | ||
Comment 19•4 years ago
|
||
Note that I hit the network timeouts a couple of times today again, but each time it was indeed related to the WIFI network. That means so far I haven't seen this bug again. Maybe we should close it?
Comment 20•4 years ago
|
||
Fingers crossed.
Updated•4 years ago
|
Reporter | ||
Comment 21•4 years ago
|
||
I had the same problem on Friday. While Nightly was not able to load any web page, the WIFI connection was stable, and other instances of Firefox were successfully be able to load these same pages. Valenin, I will send you an email with the network log for inspection. I hope it contains some valuable data.
Reporter | ||
Comment 22•4 years ago
|
||
Note that while Nightly was not able to load web pages, I was able to download an update from our servers.
Comment 23•4 years ago
|
||
For channel 2020-10-30 16:33:32.130 ⁃ nsHttpChannel ⁃ 1ca239000 ⁃ cancelled ⁃ status=n/a ⁃ http-status=n/a ⁃ url=https://bugzilla.mozilla.org/show_bug.cgi?id=1671205
Channel is suspended:
2020-10-30 16:33:32.136532 UTC - [Parent 99624: Main Thread]: D/nsHttp nsHttpChannel::SuspendInternal [this=0x1ca239000]
2020-10-30 16:33:32.136556 UTC - [Parent 99624: Main Thread]: D/nsHttp 0x1ca239000 called from script: resource://gre/modules/WebRequest.jsm:927:18
Channel is cancelled:
2020-10-30 16:33:37.344001 UTC - [Parent 99624: Main Thread]: D/nsHttp nsHttpChannel::Cancel [this=0x1ca239000 status=804b0002]
2020-10-30 16:33:37.344046 UTC - [Parent 99624: Main Thread]: D/nsHttp 0x1ca239000 called from script: resource://gre/modules/RemoteWebNavigation.jsm:137:34
So it seems that:
- there's and addon that may be involved in the failure and channel is suspended
- For some reason we don't resume the channel - the fault lies either in the implementation of the addon, or of WebRequest.jsm
It would be good to have someone look at WebRequest.jsm - it's not clear to me if the suspend and resume calls are correctly matched.
Henrik, does the bug ever happen without addons?
Moving bug to Webextensions component for a check of applyChanges
Reporter | ||
Comment 24•4 years ago
|
||
I cannot run without add-ons for a long time, and that bug didn't occur at all within the last weeks until Friday last week. So at least here a list of extensions installed:
About Sync 0.0.52
Amazon.com 1.3
Bing 1.2
Certainly Something (Certificate Viewer) 1.2.3
Cookie Quick Manager 0.5rc2
Disable WebRTC 1.0.23
Don't track me Google 4.22
DuckDuckGo 1.1
Firefox DevTools ADB Extension 0.0.5
Firefox Lightbeam 2.1.0 [DISABLED]
Firefox Multi-Account Containers 7.1.0
Forget Me Not - Forget cookies & other data 2.2.8
Foxy Gestures 1.2.9
Google 1.1
HTTPS Everywhere 2020.8.13 [DISABLED]
Media Panel 2.6
MyQOnly 0.8.3
Nightly Tester Tools 4.0
No Transition 1.1.8 [DISABLED]
NoScript 11.1.4
Searchfox in Phabricator 0.2.4
Testcase Reducer 1.3.1
User-Agent Switcher and Manager 0.4.4
Vimium 1.66 [DISABLED]
Wikipedia (en) 1.1
Zoom Scheduler 2.1.11
devtools-highlighter 1.2.0
phab-conventional-comments 0.3
phab-test-policy 1.4
uBlock Origin 1.30.6
The extensions aren't unpacked in the profile, but when I search through all files only uBlock origin actually has a reference to WebRequest.
Reporter | ||
Comment 25•4 years ago
|
||
I had the same situation again today with a Google doc. In the tab the document was loaded a Trying to connect
notification was shown. I tried to reload the document, but there was no reply from docs.google.com
. As such I tried in a different container, and loading the document worked fine. After that loading the document in the original container also worked again.
Could it be that using a different container somewhat got rid of the connection problems?
Comment 26•4 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #25)
Could it be that using a different container somewhat got rid of the connection problems?
Assuming something suspends the channel/connection and fails to resume it, using a different container will create a completely different connection due to having different originAttributes.
We still need to figure out why the unbalanced suspend() is called.
Reporter | ||
Comment 27•4 years ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #26)
Assuming something suspends the channel/connection and fails to resume it, using a different container will create a completely different connection due to having different originAttributes.
Sure. But can this really have an effect on the original container's connection? As said after doing these steps, the previous not working navigation is working again.
When the original channel gets resumed would that be through code in the same area (webextension), or somewhere completely different? If it's the same maybe I could create a gecko profile or network log that might indicate where it fails?
Comment 28•4 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #25)
After that loading the document in the original container also worked again.
I completely missed this part.
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #27)
When the original channel gets resumed would that be through code in the same area (webextension), or somewhere completely different? If it's the same maybe I could create a gecko profile or network log that might indicate where it fails?
It could be that the extension holds some broken state that gets fixed by the new channel - or possibly the broken state is somewhere in the network stack? Some HTTP logs with this scenario will definitely be helpful here. Thanks!
Reporter | ||
Comment 29•4 years ago
|
||
So I ran Firefox a couple of days with all installed web extensions disabled and the problem was still persistent. Could a problem like this also be caused by just the Add-on manager or the web extensions disabled?
Yesterday I tried to reset all custom networking related preferences and noticed that I'm using DoH. After disabling it I haven't seen the problem again. I will keep observing, and as Valentin suggested I switched from NextDNS to Cloudflare for now.
Reporter | ||
Comment 30•4 years ago
|
||
Since my last comment I didn't notice that behavior again. So as next step I switched back to NextDNS
.
Valentin, I also noticed that the router had IPv6 tunneling enabled. Could it be that this might have also been a factor that some requests weren't answered anymore? I turned that off two days ago too given that seems to be recommended.
Comment 31•4 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #30)
Since my last comment I didn't notice that behavior again. So as next step I switched back to
NextDNS
.
Valentin, I also noticed that the router had IPv6 tunneling enabled. Could it be that this might have also been a factor that some requests weren't answered anymore? I turned that off two days ago too given that seems to be recommended.
That is definitely possible. Our IPv6 test coverage isn't the best, so it's possible that we either have a bug or there are some issues around IPv6 that are causing the hang. In any case, it would be good to find out which one it is.
If you don't see this with IPv6 disabled, enable it for a while and report back. Should be easier to diagnose if we confirm that's the culprit.
Reporter | ||
Comment 32•4 years ago
|
||
Oh, I didn't disable IPv6 but only tunneling via IPv4. So far it works all fine.
Comment 33•4 years ago
|
||
If it's the tunneling that is causing this then it would be good to try if setting network.connectivity-service.nat64-check
to false fixes it.
Reporter | ||
Comment 34•4 years ago
|
||
Ok, so after switching back to NextDNS earlier today everything was fine until now. It just started that pages in this case wiki.mozilla.org didn't load anymore. Also a Shift-Reload didn't help. The only solution here was to switch back to CloudFlare.
Valentin, as such I'm fairly sure it's related to NextDNS as DoH provider. Any suggestions in what kinds of logs to create that might help you for further investigation?
Comment 35•4 years ago
|
||
Good to know that we're getting closer to the cause here.
I think what would be useful is another series of HTTP logs where the bug reproduces, you turn off TRR, and load the page again.
Also useful information to gather - when the bug reproduces, go to about:networking#dns - copy all of the IP information. Turn off TRR, reload the page, and capture the IP information again. So we can see the full list of available IPs.
Thanks again for all the help tracking this down.
Reporter | ||
Comment 36•3 years ago
|
||
I somewhat forgot about this bug because it didn't happen in all the last months. Reason is that I left DoH turned off. Now I turned it on again, and will keep an eye on that particular issue. Nevertheless it seconds that it has to be a problem related to DoH.
Reporter | ||
Comment 37•3 years ago
|
||
I didn't see it again in all the last months. Lets close as incomplete for now.
Description
•