Loss of network connections when IPv6 and DoH enabled
Categories
(Core :: Networking, defect, P2)
Tracking
()
People
(Reporter: whimboo, Assigned: valentin)
References
(Regression)
Details
(Keywords: regression, Whiteboard: [necko-triaged][trr])
Attachments
(7 files)
206.77 KB, image/png
2.49 MB, text/plain
2.44 MB, text/plain
549.98 KB, application/zip
6.28 MB, application/octet-stream
47 bytes, text/x-phabricator-request (jcristau: approval-mozilla-beta+)
47 bytes, text/x-phabricator-request (jcristau: approval-mozilla-beta+)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0 ID:20200512212513
This started recently for me within a specific container tab that holds an inbox for mail.google.com. Failures look like:
08:26:51.967 Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at https://fonts.gstatic.com/s/googlesans/v14/4UabrENHsxJlGDuGo1OIlLU94YtzCwZsPF4o.woff2. (Reason: CORS request did not succeed).
Several icons (including the favicon) aren't loaded (see the attachment), which makes the UI basically unusable when it comes to setting labels or starring emails.
Interestingly, the problem is not visible in another tab with the default identity (no container). After logging in with the same credentials everything is displayed correctly.
I'm not sure if that is a container related problem, or security related.
Johann, any hint for further investigation or which logs might help here? Also who would be the right person to pick this up?
Reporter | ||
Comment 1•4 years ago
|
||
Actually while filing this bug I noticed that also Bugzilla is affected, so it might be a general problem with containers.
Reporter | ||
Comment 2•4 years ago
|
||
Also, Gmail regularly shows me "Not connected. Connecting in 82s…" messages, and Bugzilla is kinda slow. I assume both behaviors might be related because both websites run in the same container for me.
Trying to load a site like google.com completely fails with "We can’t connect to the server at www.google.com.", while with the normal identity it works.
Reporter | ||
Comment 3•4 years ago
|
||
Using Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0 ID:20200501214551 it seems to work fine so far. So it looks like a recent regression in Firefox 78. I will do a regression check later today.
Reporter | ||
Comment 4•4 years ago
|
||
Ok, so 20200501214551 is also affected and I just got the problem after waking up my MacBook Pro from sleep. Gmail immediately showed the connection problem notification, and no other websites can be loaded.
I can only test that on MacOS. So it would be good to know if someone else could reproduce on a different platform.
Reporter | ||
Comment 5•4 years ago
|
||
Sorry for flipping the summary that often; let's use one for now which should be kept.
I just noticed that it's not necessary to put the machine into sleep mode to get this loss of network connectivity reproduced. It seems to start with some resources as initially mentioned and at some point Google domains cannot be accessed anymore.
Reporter | ||
Comment 6•4 years ago
|
||
This is the HTTP log as recorded via about:networking when the connection to google.com dropped for the user context 2. Please specifically search for the following link, which is opened when clicking on a bugzilla link within Gmail:
Honza, is there anything which stands out here?
Comment 7•4 years ago
|
||
2020-05-13 14:27:17.834479 UTC - [Parent 74059: Socket Thread]: E/nsSocketTransport nsSocketTransport::SendStatus [this=0x135a50c00 status=804b0003]
804b0003 = STATUS_RESOLVING
2020-05-13 14:27:17.834581 UTC - [Parent 74059: Socket Thread]: E/nsHttp nsHttpTransaction::OnSocketStatus [this=0x135a50800 status=804b0003 progress=0]
2020-05-13 14:27:17.834599 UTC - [Parent 74059: Socket Thread]: D/nsHostResolver Resolving host [www.google.com] - bypassing cache type 0. [this=0x1296c3310]
2020-05-13 14:27:17.834612 UTC - [Parent 74059: Socket Thread]: D/nsHostResolver No usable record in cache for host [www.google.com] type 0.
2020-05-13 14:27:17.834620 UTC - [Parent 74059: Socket Thread]: D/nsHostResolver NameLookup: www.google.com effectiveTRRmode: 2
2020-05-13 14:27:17.834723 UTC - [Parent 74059: Socket Thread]: D/nsSocketTransport after event [this=0x135a50c00 cond=804b001e]
2020-05-13 14:27:17.834739 UTC - [Parent 74059: Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketDetached [this=0x135a50c00 cond=804b001e]
2020-05-13 14:27:17.834751 UTC - [Parent 74059: Socket Thread]: D/nsSocketTransport nsSocketTransport::RecoverFromError [this=0x135a50c00 state=0 cond=804b001e]
2020-05-13 14:27:17.834766 UTC - [Parent 74059: Socket Thread]: D/nsSocketTransport not in a recoverable state
Valentin, Kershaw, anything in our TRR code that relates to containers somehow? I don't see any more information in the log, just this weird sync error.
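The `status=` and `cond=` values in the log above are nsresult codes. The following is a minimal sketch for splitting them into fields, assuming the usual Gecko layout (top bit for severity, module number offset by 0x45, low 16 bits for the code). The helper name `decode_nsresult` is made up for illustration. Both values land in module 6, which I understand to be the network module; code 30 there is, to my knowledge, NS_ERROR_UNKNOWN_HOST, which would point to a failed host lookup.

```python
# Decode the hex status values printed in the log into nsresult fields.
# The bit layout and the 0x45 module offset are assumptions about Gecko's
# nsresult encoding, not something stated in this bug.

NS_ERROR_MODULE_BASE_OFFSET = 0x45  # assumed constant


def decode_nsresult(value: int) -> dict:
    """Split a 32-bit nsresult into severity bit, module number and code."""
    return {
        "severity_bit": bool(value >> 31),
        "module": ((value >> 16) & 0x1FFF) - NS_ERROR_MODULE_BASE_OFFSET,
        "code": value & 0xFFFF,
    }


# The two values that appear in the log excerpt.
for raw in ("804b0003", "804b001e"):
    print(raw, decode_nsresult(int(raw, 16)))
```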
Reporter | ||
Comment 8•4 years ago
•
|
||
Note that disabling DoH makes the problem go away and Gmail behaves correctly again. The provider I had selected was Cloudflare.
Reporter | ||
Comment 9•4 years ago
•
|
||
After turning DoH on again, the same problem returns. It happens with both Cloudflare and NextDNS.
Comment 10•4 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #7)
2020-05-13 14:27:17.834479 UTC - [Parent 74059: Socket Thread]: E/nsSocketTransport nsSocketTransport::SendStatus [this=0x135a50c00 status=804b0003]
804b0003 = STATUS_RESOLVING
2020-05-13 14:27:17.834581 UTC - [Parent 74059: Socket Thread]: E/nsHttp nsHttpTransaction::OnSocketStatus [this=0x135a50800 status=804b0003 progress=0]
2020-05-13 14:27:17.834599 UTC - [Parent 74059: Socket Thread]: D/nsHostResolver Resolving host [www.google.com] - bypassing cache type 0. [this=0x1296c3310]
2020-05-13 14:27:17.834612 UTC - [Parent 74059: Socket Thread]: D/nsHostResolver No usable record in cache for host [www.google.com] type 0.
2020-05-13 14:27:17.834620 UTC - [Parent 74059: Socket Thread]: D/nsHostResolver NameLookup: www.google.com effectiveTRRmode: 2
2020-05-13 14:27:17.834723 UTC - [Parent 74059: Socket Thread]: D/nsSocketTransport after event [this=0x135a50c00 cond=804b001e]
2020-05-13 14:27:17.834739 UTC - [Parent 74059: Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketDetached [this=0x135a50c00 cond=804b001e]
2020-05-13 14:27:17.834751 UTC - [Parent 74059: Socket Thread]: D/nsSocketTransport nsSocketTransport::RecoverFromError [this=0x135a50c00 state=0 cond=804b001e]
2020-05-13 14:27:17.834766 UTC - [Parent 74059: Socket Thread]: D/nsSocketTransport not in a recoverable state
Valentin, Kershaw, anything in our TRR code that relates to containers somehow? I don't see any more information in the log, just this weird sync error.
I am not really sure about this. I do see that OriginAttributes is used in our DNS code, but it does not seem to be used in the TRR code.
Anyway, the log is a bit short and it doesn't tell the reason why the DNS resolution failed.
Henrik, do you perhaps have a longer log? Or could you try to reproduce this again? When this happens, could you try to use about:networking#dnslookuptool to test if DNS works?
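As a rough cross-check outside of Firefox, the operating system's resolver can be queried per address family. This sketch is not a replacement for about:networking#dnslookuptool: it bypasses DoH entirely and only shows what native DNS returns for IPv4 and IPv6. The helper name `lookup` and the default host are illustrative assumptions.

```python
import socket
import sys


def lookup(host: str, family: int) -> list:
    """Return the sorted, de-duplicated addresses the OS resolver gives for one family."""
    try:
        infos = socket.getaddrinfo(host, None, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return []  # no answer for this address family
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Pass host names on the command line; "localhost" is only a placeholder.
    for host in sys.argv[1:] or ["localhost"]:
        print(host, "IPv4:", lookup(host, socket.AF_INET))
        print(host, "IPv6:", lookup(host, socket.AF_INET6))
```

If a host resolves here but fails in the browser with DoH enabled, that would hint at the DoH path rather than at the network itself.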
Reporter | ||
Comment 11•4 years ago
|
||
I enabled it again yesterday but so far I haven't gotten Firefox into that state again. Gmail and DNS resolutions still work fine. I will still keep an eye on it.
Reporter | ||
Comment 12•4 years ago
•
|
||
(In reply to Kershaw Chang [:kershaw] from comment #10)
Anyway, the log is a bit short and it doesn't tell the reason why the DNS resolution is failed.
Henrik, do you perhaps have a longer log? Or could you try to reproduce this again? When this happens, could you try to use about:networking#dnslookuptool to test if DNS works?
Note that I started network logging in about:networking before I tried to open the page in question and then waited a couple of seconds before stopping it. So what else should I do?
I just hit this problem again. It started yesterday after starting an updated version of Nightly. At first some resources in Gmail couldn't be loaded (inside the bugzilla context), and then this morning after waking up the machine from sleep I'm not able to open the Chrome bug tracker in my "default" context, while the Bugzilla context (container) can still open the page. When I do a DNS lookup in about:networking I get 216.58.208.51. So the name can be successfully resolved.
Also note that I had the problem yesterday with my other profile and Firefox 77b6 while trying to watch a movie on Amazon Prime. A click on the play button didn't start the movie at all, but showed no reaction. Only with a restart of Firefox I was able to get it working again.
I wonder how critical these problems are and if we should consider blocking the next release on it, or set the flag as long as we don't know the root cause?
Assignee | ||
Comment 13•4 years ago
|
||
The symptoms here do seem a bit similar to bug 1610691.
Could you try to flip the privacy.resistFingerprinting pref as mentioned in bug 1610691 comment 11 and let me know if you still see the problem?
Reporter | ||
Comment 14•4 years ago
|
||
I can do that whenever I see the problem again. Right after waking up the machine from sleep again, it all works fine for the moment. So I will have to wait before I'm able to try this preference.
Assignee | ||
Comment 15•4 years ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #13)
The symptoms here do seem a bit similar to bug 1610691.
Actually they don't. This is quite clearly a DNS failure.
I'll try to add some extra logging for the origin attributes.
Reporter | ||
Comment 16•4 years ago
|
||
Valentin, do you have any idea what could have caused that during the 77 release cycle? In case it would make sense to check for some regression ranges.
Assignee | ||
Comment 17•4 years ago
|
||
The only thing that is marked [trr] is bug 1626057. But it seems unlikely that caused the regression.
Maybe bug 1625151 could be related, but it seems unlikely. Bug 1625213 also landed in 77.
Reporter | ||
Comment 18•4 years ago
|
||
Note that an update of Nightly has been announced some minutes ago and I upgraded to see if it reproduces. And indeed, some minutes after the update and restart of Firefox I miss certain images and styles in Gmail again. Maybe putting the machine into sleep mode and waking it up somewhat fixes it?
Thanks for those bugs. Once I have more clear steps I could have a look, or wait for your logging patch to be landed.
Reporter | ||
Comment 19•4 years ago
|
||
Also, now I can no longer load https://www.google.com in my bugzilla user context. As such I assume that this problem gets triggered by some startup code that runs when DoH is enabled. As seen earlier, I haven't noticed it for a long time across restarts when DoH was disabled.
Reporter | ||
Comment 20•4 years ago
|
||
It looks like it's all Google domains that cannot be resolved here. In the affected tabs I can load other domains perfectly fine.
Also, with google.com loaded in the default context and a DNS entry for it shown under DNS in about:networking, the same domain cannot be resolved in the user context tab.
Reporter | ||
Comment 21•4 years ago
|
||
Valentin, here a shortened HTTP log when waiting for Bugzilla to load after clicking a link inside of Gmail. Right now it takes more than 3 minutes - maybe due to not being able to resolve the domain?
Assignee | ||
Comment 22•4 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #21)
Created attachment 9150726 [details]
HTTP log of a very slow loading Bugzilla page
Valentin, here a shortened HTTP log when waiting for Bugzilla to load after clicking a link inside of Gmail. Right now it takes more than 3 minutes - maybe due to not being able to resolve the domain?
The log shows bugzilla resolved in less than 0.2 seconds
2020-05-21 11:44:23.427741 UTC - [Parent 88502: Socket Thread]: D/nsHostResolver Resolving host [bugzilla.mozilla.org]<^userContextId=2> type 0. [this=0x1257f4760]
[...]
2020-05-21 11:44:23.516166 UTC - [Parent 88502: TRR Background]: D/nsHostResolver Processing DoH response took 15.707802 ms
But I've been seeing very long load times for bugzilla too. I always assumed it's a server side issue.
I'm wondering if the problem here might be that you're getting different IPs from cloudflare that are not for your local CDN, and those connections might fail/time-out?
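The "less than 0.2 seconds" figure can be checked directly from the two timestamps quoted above. A small sketch; the log lines are copied from the excerpt, and the helper name `log_time` is only illustrative:

```python
from datetime import datetime

# First and last log line from the excerpt above.
start_line = (
    "2020-05-21 11:44:23.427741 UTC - [Parent 88502: Socket Thread]: "
    "D/nsHostResolver Resolving host [bugzilla.mozilla.org]<^userContextId=2> type 0."
)
end_line = (
    "2020-05-21 11:44:23.516166 UTC - [Parent 88502: TRR Background]: "
    "D/nsHostResolver Processing DoH response took 15.707802 ms"
)


def log_time(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.ffffff' timestamp of a log line."""
    stamp = line.split(" UTC", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")


# Time between the start of the lookup and the processed DoH response.
elapsed = (log_time(end_line) - log_time(start_line)).total_seconds()
print(f"resolution took about {elapsed:.3f} s")
```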
Reporter | ||
Comment 23•4 years ago
|
||
Note that I hit this problem over the weekend with my other profile running Firefox 77 beta with the default context. Web pages on Amazon were displayed without any CSS, just black text on white background.
Meanwhile I wonder how important it is to get this problem fixed. If it's due to problems with DNS resolution with DoH providers (note that I have also seen this with NextDNS), what are the next steps?
Reporter | ||
Comment 24•4 years ago
|
||
As it looks like we still ship it pref'ed off, so we should be fine?
Reporter | ||
Comment 25•4 years ago
|
||
Valentin, very interesting is that I cannot get the content of https://vanilla.aslushnikov.com/ loaded when DoH is enabled, but it works just fine with DoH disabled. It's 100% reproducible for me. Attached you can find the network log.
Reporter | ||
Comment 26•4 years ago
|
||
And here the network log when DoH is turned off and the page loads perfectly fine.
Assignee | ||
Comment 27•4 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #24)
As it looks like we still ship it pref'ed off, so we should be fine?
The pref is turned on via Normandy in the locations we are rolling-out. Currently only the US.
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #25)
Created attachment 9151591 [details]
HTTP log for https://vanilla.aslushnikov.com/
Valentin, very interesting is that I cannot get the content of https://vanilla.aslushnikov.com/ loaded when DoH is enabled, but it works just fine with DoH disabled. It's 100% reproducible for me. Attached you can find the network log.
That is quite interesting. Looking at the logs, it seems vanilla.aslushnikov.com loads with no problem, but cdn.jsdelivr.net does not.
But the reason why that happens is very unclear. I'll try to land even more logging to figure it out.
Reporter | ||
Comment 28•4 years ago
|
||
Valentin, have you had the time to add some further logging to networking yet?
Assignee | ||
Comment 29•4 years ago
|
||
Hi Henrik, yes, I did it in bug 1640872.
If possible, please try to reproduce again. Thanks a lot for all your help with this bug.
Reporter | ||
Comment 30•4 years ago
|
||
Great. I will upgrade and check throughout the day if I can hit the problem again.
Reporter | ||
Comment 31•4 years ago
|
||
Valentin, here a new network log but this time for IRCCloud. Usually it starts that the websockets don't work anymore and it shows me constantly offline, and is never able to connect. Then when I reload the page, it cannot load the page (We can’t connect to the server at www.irccloud.com):
https://send.firefox.com/download/76b41783ecfae5ea/#d0LNt-PiXM4uQVoSCUhnZQ
Once I disable DoH it works again. I hope you can find some more helpful details.
Assignee | ||
Comment 32•4 years ago
|
||
Thank you Henrik. I finally managed to figure it out.
It's related to IPv6 support. When making a TRR-first request for an IPv6 address, we don't properly fall back to DNS.
I've got a fix and a test, but I'm working on more to figure out why it only started happening recently.
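To illustrate the intended behavior (not the actual Gecko patch): in TRR-first mode a failed DoH lookup should fall back to native DNS no matter why it failed, while TRR-only mode has no fallback. The following is a minimal sketch with hypothetical resolver callables; only the mode numbers 2 and 3 are taken from this bug.

```python
TRR_FIRST = 2  # try DoH first, then fall back to native DNS
TRR_ONLY = 3   # DoH only, no fallback


class ResolveError(Exception):
    """Raised when no resolver could produce an answer."""


def resolve(host, mode, doh_lookup, native_lookup):
    """Resolve host via DoH; in TRR-first mode fall back to native DNS on any failure."""
    try:
        addresses = doh_lookup(host)
        if addresses:
            return addresses
        failure = ResolveError("empty DoH answer")
    except Exception as exc:  # the reason for the failure must not matter
        failure = exc
    if mode == TRR_FIRST:
        return native_lookup(host)
    raise ResolveError(f"DoH failed for {host}: {failure}")


# Hypothetical stand-ins for the two resolvers.
def broken_doh(host):
    raise TimeoutError("no usable AAAA answer")


def native(host):
    return ["2001:db8::1"]  # documentation-only IPv6 address


print(resolve("example.org", TRR_FIRST, broken_doh, native))
```

With `TRR_ONLY` the same call raises `ResolveError`, which matches the complete loss of connectivity one would expect when the DoH server is unreachable in that mode.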
Reporter | ||
Comment 33•4 years ago
|
||
Oh good to know! The only thing I can say right now is that I enabled ipv6 a while ago via my router. If it really has to be I could check older releases and maybe find out which release actually regressed with that. But given the amount of time needed to run into this problem, it might take a while.
Assignee | ||
Comment 34•4 years ago
|
||
If you only recently turned on IPv6 it's likely the issue has existed for a while and was introduced by bug 1610836.
Reporter | ||
Comment 35•4 years ago
|
||
No, it was done a couple of months ago, sometime last year.
Comment 36•4 years ago
|
||
[Tracking Requested - why for this release]:
I'm seeing lots of chatter on Reddit from people with websites that won't load. Refreshing their profile fixes it, suggesting that the users might have been opted into DoH rollout.
Comment 37•4 years ago
|
||
I wonder if there is some relation with bug
Assignee | ||
Comment 38•4 years ago
|
||
(In reply to Aaron Klotz [:aklotz] from comment #36)
[Tracking Requested - why for this release]:
I'm seeing lots of chatter on Reddit from people with websites that won't load. Refreshing their profile fixes it, suggesting that the users might have been opted into DoH rollout.
If they are people who were previously using nextdns in trr.mode=3 they would completely lose connectivity as nextdns has turned off their TRR server due to 77 accidentally DDOSing them.
Reporter | ||
Comment 39•4 years ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #32)
It's related to IPv6 support. When making a TRR-first request for an IPv6 address, we don't properly fall back to DNS.
What does it actually mean for regression testing? If I would have clear steps to do it should be easier to find the regressor.
Assignee | ||
Comment 40•4 years ago
|
||
Assignee | ||
Comment 41•4 years ago
|
||
Depends on D78237
Comment 42•4 years ago
|
||
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/604346d2f6da
IPv6 TRR requests don't fallback to DNS properly r=dragana,necko-reviewers
https://hg.mozilla.org/integration/autoland/rev/3b999337cd40
Test that IPv6 TRR requests fallback to DNS properly r=dragana,necko-reviewers
Comment 43•4 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/604346d2f6da
https://hg.mozilla.org/mozilla-central/rev/3b999337cd40
Assignee | ||
Comment 44•4 years ago
|
||
[Tracking Requested - why for this release]:
Intermittent failure of IPv6 DNS requests when using TRR.
Comment 45•4 years ago
|
||
Henrik, would you be able to verify this fix?
Otherwise, I could help with this if you provide me clear reproduction steps.
Also, the regressor above could be confirmed if this bug could be reproduced constantly.
Thank you for your contribution!
Reporter | ||
Comment 46•4 years ago
|
||
Yes, I will keep an eye on it and reply by the middle of the week. Usually it takes a little while until I see this problem. I will come back on Wednesday.
Reporter | ||
Comment 47•4 years ago
|
||
After a couple of days using Firefox Nightly with the patch included and DoH enabled I can no longer see the network issues. It all runs fine now.
Valentin, can you start the uplift request for 78 beta?
Assignee | ||
Comment 48•4 years ago
|
||
Comment on attachment 9154116 [details]
Bug 1637512 - IPv6 TRR requests don't fallback to DNS properly r=dragana
Beta/Release Uplift Approval Request
- User impact if declined: IPv6 DNS requests might intermittently fail under TRR.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): The change is limited in scope. It makes sure we fall back to DNS regardless of the reason why the DoH request failed.
- String changes made/needed:
Assignee | ||
Comment 49•4 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #47)
After a couple of days using the Firefox Nightly with the patch included and DoH enabled I can no longer see the network issues. It runs all fine now.
Thank you for confirming the fix, Henrik, and for filing the bug in the first place. This would have been difficult to figure out without all the logs you provided. I am grateful for all your help.
Comment 50•4 years ago
|
||
Comment on attachment 9154116 [details]
Bug 1637512 - IPv6 TRR requests don't fallback to DNS properly r=dragana
approved for 78.0b6
Comment 51•4 years ago
|
||
bugherder uplift |