Closed Bug 1637512 Opened 4 years ago Closed 4 years ago

Loss of network connections when IPv6 and DoH enabled

Categories

(Core :: Networking, defect, P2)

78 Branch
defect

Tracking

()

VERIFIED FIXED
mozilla79
Tracking Status
firefox-esr68 --- unaffected
firefox76 --- wontfix
firefox77 + wontfix
firefox78 + fixed
firefox79 + verified

People

(Reporter: whimboo, Assigned: valentin)

References

(Regression)

Details

(Keywords: regression, Whiteboard: [necko-triaged][trr])

Attachments

(7 files)

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0 ID:20200512212513

This started recently for me within a specific container tab that holds an inbox for mail.google.com. Failures look like:

08:26:51.967 Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at https://fonts.gstatic.com/s/googlesans/v14/4UabrENHsxJlGDuGo1OIlLU94YtzCwZsPF4o.woff2. (Reason: CORS request did not succeed).

Several icons (including the favicon) aren't loaded (see the attachment), which makes the UI basically unusable when it comes to setting labels or starring emails.

Interesting is that the problem is not visible in another tab with the default identity (no container). After logging in with the same credentials everything is displayed correctly.

I'm not sure if that is a container related problem, or security related.

Johann, any hint for further investigation or which logs might help here? Also who would be the right person to pick this up?

Flags: needinfo?(jhofmann)

Actually while filing this bug I noticed that also Bugzilla is affected, so it might be a general problem with containers.

Summary: Firefox Nightly fails to load lots of resources for mail.google.com due to "Cross-Origin Request Blocked" → Firefox Nightly fails to load lots of resources for websites running in a container due to "Cross-Origin Request Blocked"

Also, Gmail regularly shows me "Not connected. Connecting in 82s…" messages, and Bugzilla is rather slow. I assume both behaviors might be related because both websites run in the same container for me.

Trying to load e.g. google.com fails completely with "We can’t connect to the server at www.google.com.", while with the normal identity it works.

Using Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0 ID:20200501214551 it seems to work fine so far. So it looks like a recent regression in Firefox 78. I will do a regression check later today.

Ok, so 20200501214551 is also affected and I just got the problem after waking up my MacBook Pro from sleep. Gmail immediately showed the connection problem notification, and no other websites can be loaded.

I can only test that on MacOS. So it would be good to know if someone else could reproduce on a different platform.

Summary: Firefox Nightly fails to load lots of resources for websites running in a container due to "Cross-Origin Request Blocked" → Waking up Firefox Nightly from sleep mode causes a loss of network for non-default container tabs

Sorry for flipping the summary that often; let's use one for now which should be kept.

I just noticed that it's not necessary to put the machine into sleep mode to get this loss of network connectivity reproduced. It seems to start with some resources as initially mentioned and at some point Google domains cannot be accessed anymore.

Summary: Waking up Firefox Nightly from sleep mode causes a loss of network for non-default container tabs → Loss of network connections for non-default user contexts (container) in Firefox Nightly
Attached file HTTP log

This is the HTTP log as recorded via about:networking when the connection to google.com dropped for the user context 2. Please specifically search for the following link, which is opened when clicking on a bugzilla link within Gmail:

https://www.google.com/url?q=https://bugzilla.mozilla.org/show_bug.cgi?id%3D1636333&source=gmail&ust=1589465880000000&usg=AFQjCNHWx1XyhZgp9c0Hbm30wtL8WoduRQ

Honza, is there anything which stands out here?

Flags: needinfo?(honzab.moz)
2020-05-13 14:27:17.834479 UTC - [Parent 74059: Socket Thread]: E/nsSocketTransport nsSocketTransport::SendStatus [this=0x135a50c00 status=804b0003]
  804b0003 = STATUS_RESOLVING
2020-05-13 14:27:17.834581 UTC - [Parent 74059: Socket Thread]: E/nsHttp nsHttpTransaction::OnSocketStatus [this=0x135a50800 status=804b0003 progress=0]
2020-05-13 14:27:17.834599 UTC - [Parent 74059: Socket Thread]: D/nsHostResolver Resolving host [www.google.com] - bypassing cache type 0. [this=0x1296c3310]
2020-05-13 14:27:17.834612 UTC - [Parent 74059: Socket Thread]: D/nsHostResolver   No usable record in cache for host [www.google.com] type 0.
2020-05-13 14:27:17.834620 UTC - [Parent 74059: Socket Thread]: D/nsHostResolver NameLookup: www.google.com effectiveTRRmode: 2
2020-05-13 14:27:17.834723 UTC - [Parent 74059: Socket Thread]: D/nsSocketTransport   after event [this=0x135a50c00 cond=804b001e]
2020-05-13 14:27:17.834739 UTC - [Parent 74059: Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketDetached [this=0x135a50c00 cond=804b001e]
2020-05-13 14:27:17.834751 UTC - [Parent 74059: Socket Thread]: D/nsSocketTransport nsSocketTransport::RecoverFromError [this=0x135a50c00 state=0 cond=804b001e]
2020-05-13 14:27:17.834766 UTC - [Parent 74059: Socket Thread]: D/nsSocketTransport   not in a recoverable state

Valentin, Kershaw, anything in our TRR code that relates to containers somehow? I don't see any more information in the log, just this weird sync error.
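For readers following the log, the hex status values can be split into their parts. This is a minimal sketch, assuming the nsresult bit layout from Mozilla's nsError.h (bit 31 as failure flag, module number stored with a 0x45 offset, low 16 bits as the code). The helper name `decode_nsresult` is hypothetical, and the name given for 804b001e is my reading of the network error list, not something stated in this bug.

```python
# Assumed layout (nsError.h): bit 31 = failure flag,
# bits 16-28 = module + 0x45, bits 0-15 = code within the module.
MODULE_BASE_OFFSET = 0x45
NETWORK_MODULE = 6  # assumed value of NS_ERROR_MODULE_NETWORK

# Only the two values that appear in the log excerpt above.
KNOWN_VALUES = {
    0x804B0003: "STATUS_RESOLVING (as annotated in the log)",
    0x804B001E: "NS_ERROR_UNKNOWN_HOST (assumption, not stated in the bug)",
}


def decode_nsresult(hex_text):
    """Split a hex status such as '804b001e' into failure bit, module and code."""
    value = int(hex_text, 16)
    return {
        "failure_bit": bool(value >> 31),
        "module": ((value >> 16) & 0x1FFF) - MODULE_BASE_OFFSET,
        "code": value & 0xFFFF,
        "name": KNOWN_VALUES.get(value, "unknown"),
    }


if __name__ == "__main__":
    for status in ("804b0003", "804b001e"):
        print(status, decode_nsresult(status))
```

Both values decode to the networking module, which fits the later conclusion that this is a DNS failure rather than a container or CORS problem.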

Flags: needinfo?(valentin.gosu)
Flags: needinfo?(kershaw)
Flags: needinfo?(honzab.moz)

Note that disabling DoH makes the problem go away and Gmail behaves correctly again. The provider I had selected was Cloudflare.

After turning DoH on again, the same problem returns. It happens with both Cloudflare and NextDNS.
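As background on what a DoH lookup carries, here is a minimal sketch of how a DNS question can be encoded for an RFC 8484 GET request (the `dns` query parameter). It only builds the parameter and sends nothing. The function names are hypothetical, and this is not Firefox's TRR code.

```python
import base64
import struct

QTYPE_A = 1      # IPv4 address record
QTYPE_AAAA = 28  # IPv6 address record


def build_query(host, qtype):
    """Build a single-question DNS query in wire format."""
    # Header: ID 0 (recommended by RFC 8484 for cache friendliness),
    # flags 0x0100 (recursion desired), one question, no other records.
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    labels = host.rstrip(".").encode("ascii").split(b".")
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    # Question trailer: record type, then class IN (1).
    return header + qname + struct.pack("!HH", qtype, 1)


def doh_get_parameter(host, qtype):
    """Return the unpadded base64url text used as the `dns` GET parameter."""
    wire = build_query(host, qtype)
    return base64.urlsafe_b64encode(wire).rstrip(b"=").decode("ascii")


if __name__ == "__main__":
    print(doh_get_parameter("www.example.com", QTYPE_A))
    print(doh_get_parameter("www.example.com", QTYPE_AAAA))
```

The A and AAAA questions differ only in the type field, so a resolver or client that mishandles one record type can fail for IPv6 while IPv4 lookups keep working.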

Component: Security → Networking
Summary: Loss of network connections for non-default user contexts (container) in Firefox Nightly → Loss of network connections for non-default user contexts (container) when DoH enabled
Flags: needinfo?(jhofmann)

(In reply to Honza Bambas (:mayhemer) from comment #7)

2020-05-13 14:27:17.834479 UTC - [Parent 74059: Socket Thread]: E/nsSocketTransport nsSocketTransport::SendStatus [this=0x135a50c00 status=804b0003]
  804b0003 = STATUS_RESOLVING
2020-05-13 14:27:17.834581 UTC - [Parent 74059: Socket Thread]: E/nsHttp nsHttpTransaction::OnSocketStatus [this=0x135a50800 status=804b0003 progress=0]
2020-05-13 14:27:17.834599 UTC - [Parent 74059: Socket Thread]: D/nsHostResolver Resolving host [www.google.com] - bypassing cache type 0. [this=0x1296c3310]
2020-05-13 14:27:17.834612 UTC - [Parent 74059: Socket Thread]: D/nsHostResolver   No usable record in cache for host [www.google.com] type 0.
2020-05-13 14:27:17.834620 UTC - [Parent 74059: Socket Thread]: D/nsHostResolver NameLookup: www.google.com effectiveTRRmode: 2
2020-05-13 14:27:17.834723 UTC - [Parent 74059: Socket Thread]: D/nsSocketTransport   after event [this=0x135a50c00 cond=804b001e]
2020-05-13 14:27:17.834739 UTC - [Parent 74059: Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketDetached [this=0x135a50c00 cond=804b001e]
2020-05-13 14:27:17.834751 UTC - [Parent 74059: Socket Thread]: D/nsSocketTransport nsSocketTransport::RecoverFromError [this=0x135a50c00 state=0 cond=804b001e]
2020-05-13 14:27:17.834766 UTC - [Parent 74059: Socket Thread]: D/nsSocketTransport   not in a recoverable state

Valentin, Kershaw, anything in our TRR code that relates to containers somehow? I don't see any more information in the log, just this weird sync error.

I am not really sure about this. I do see that OriginAttributes is used in our DNS code, but it seems not used in TRR code.

Anyway, the log is a bit short and it doesn't tell us why the DNS resolution failed.
Henrik, do you perhaps have a longer log? Or could you try to reproduce this again? When this happens, could you try to use about:networking#dnslookuptool to test if DNS works?

Flags: needinfo?(kershaw) → needinfo?(hskupin)
Assignee: nobody → valentin.gosu
Severity: -- → S3
Flags: needinfo?(valentin.gosu)
Priority: -- → P2
Whiteboard: [necko-triaged][trr]

I enabled it again yesterday but so far I haven't gotten Firefox into that state again. Gmail and DNS resolutions still work fine. I will still keep an eye on it.

(In reply to Kershaw Chang [:kershaw] from comment #10)

Anyway, the log is a bit short and it doesn't tell us why the DNS resolution failed.
Henrik, do you perhaps have a longer log? Or could you try to reproduce this again? When this happens, could you try to use about:networking#dnslookuptool to test if DNS works?

Note that I started network logging in `about:networking` before I tried to open the page in question and then waited a couple of seconds before stopping it. So what else should I do?

I just hit this problem again. It started yesterday after starting an updated version of Nightly. At first some resources in Gmail couldn't be loaded (inside the bugzilla context), and then this morning, after waking the machine from sleep, I'm not able to open the Chrome bug tracker in my "default" context, while the Bugzilla context (container) can still open the page. When I do a DNS lookup in about:networking I get 216.58.208.51, so the name can be resolved successfully.

Also note that I had the problem yesterday with my other profile and Firefox 77b6 while trying to watch a movie on Amazon Prime. A click on the play button didn't start the movie at all, but showed no reaction. Only with a restart of Firefox I was able to get it working again.

I wonder how critical these problems are and if we should consider blocking the next release on it, or set the flag as long as we don't know the root cause?

Flags: needinfo?(hskupin)

The symptoms here do seem a bit similar to bug 1610691.
Could you try to flip the privacy.resistFingerprinting pref as mentioned in bug 1610691 comment 11 and letting me know if you still see the problem?

Flags: needinfo?(hskupin)

I can do whenever I see the problem again. Just after waking up the machine again from sleep it works all fine for the moment. So I will have to wait again before I'm able to try this preference again.

(In reply to Valentin Gosu [:valentin] (he/him) from comment #13)

The symptoms here do seem a bit similar to bug 1610691.

Actually they don't. This is quite clearly a DNS failure.
I'll try to add some extra logging for the origin attributes.

Flags: needinfo?(hskupin)

Valentin, do you have any idea what could have caused that during the 77 release cycle? In case it would make sense to check for some regression ranges.

Depends on: 1638789

The only thing that is marked [trr] is bug 1626057. But it seems unlikely that caused the regression.
Maybe bug 1625151 could be related, but it seems unlikely. Bug 1625213 also landed in 77.

Note that an update of Nightly has been announced some minutes ago and I upgraded to see if it reproduces. And indeed, some minutes after the update and restart of Firefox I miss certain images and styles in Gmail again. Maybe putting the machine into sleep mode and waking it up somewhat fixes it?

Thanks for those bugs. Once I have more clear steps I could have a look, or wait for your logging patch to be landed.

Also now I can no longer load https://www.google.com again in my bugzilla user context. As such I assume that this problem gets triggered by some startup code that maybe runs when DoH is enabled. As seen earlier, I hadn't noticed it for a long time across restarts when DoH was disabled.

It looks like it's all Google domains that cannot be resolved here. In the affected tabs I can load other domains perfectly fine.

Also with google.com being loaded in the default context, and having a DNS entry shown under DNS in about:networking the same domain cannot be resolved in the user context tab.

Valentin, here a shortened HTTP log when waiting for Bugzilla to load after clicking a link inside of Gmail. Right now it takes more than 3 minutes - maybe due to not being able to resolve the domain?

Flags: needinfo?(valentin.gosu)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #21)

Created attachment 9150726 [details]
HTTP log of a very slow loading Bugzilla page

Valentin, here a shortened HTTP log when waiting for Bugzilla to load after clicking a link inside of Gmail. Right now it takes more than 3 minutes - maybe due to not being able to resolve the domain?

The log shows bugzilla resolved in less than 0.2 seconds

2020-05-21 11:44:23.427741 UTC - [Parent 88502: Socket Thread]: D/nsHostResolver Resolving host [bugzilla.mozilla.org]<^userContextId=2> type 0. [this=0x1257f4760]
[...]
2020-05-21 11:44:23.516166 UTC - [Parent 88502: TRR Background]: D/nsHostResolver Processing DoH response took 15.707802 ms
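The elapsed time between two such log lines can be checked mechanically. A minimal sketch, assuming every line starts with the `YYYY-MM-DD HH:MM:SS.ffffff UTC - ` prefix seen in the excerpts; the helper names are hypothetical.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the log excerpts of this bug.
PREFIX = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) UTC - ")


def log_time(line):
    """Parse the timestamp at the start of one log line."""
    match = PREFIX.match(line)
    if match is None:
        raise ValueError("line has no timestamp prefix")
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")


def elapsed_ms(first_line, last_line):
    """Milliseconds between the timestamps of two log lines."""
    return (log_time(last_line) - log_time(first_line)).total_seconds() * 1000


if __name__ == "__main__":
    start = ("2020-05-21 11:44:23.427741 UTC - [Parent 88502: Socket Thread]: "
             "D/nsHostResolver Resolving host [bugzilla.mozilla.org]")
    end = ("2020-05-21 11:44:23.516166 UTC - [Parent 88502: TRR Background]: "
           "D/nsHostResolver Processing DoH response took 15.707802 ms")
    print(f"{elapsed_ms(start, end):.3f} ms")
```

For the two lines quoted above this gives a little under 90 ms, consistent with the "less than 0.2 seconds" estimate.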

But I've been seeing very long load times for bugzilla too. I always assumed it's a server side issue.

I'm wondering if the problem here might be that you're getting different IPs from cloudflare that are not for your local CDN, and those connections might fail/time-out?

Flags: needinfo?(valentin.gosu)

Note that I hit this problem over the weekend with my other profile running Firefox 77 beta with the default context. Web pages on Amazon were displayed without any CSS, just black text on white background.

Meanwhile I wonder how important it is to get this problem fixed. If it's due to problems with DNS resolution with DoH providers (note that I have also seen this with NextDNS), what are the next steps?

Summary: Loss of network connections for non-default user contexts (container) when DoH enabled → Loss of network connections when DoH enabled

Valentin, very interesting is that I cannot get the content of https://vanilla.aslushnikov.com/ loaded when DoH is enabled, but it works just fine with DoH disabled. It's 100% reproducible for me. Attached you can find the network log.

Flags: needinfo?(valentin.gosu)

And here the network log when DoH is turned off and the page loads perfectly fine.

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #24)

As it looks like we still ship it pref'ed off, so we should be fine?

https://searchfox.org/mozilla-central/rev/501eb4718d73870892d28f31a99b46f4783efaa0/modules/libpref/init/all.js#4224

The pref is turned on via Normandy in the locations we are rolling-out. Currently only the US.

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #25)

Created attachment 9151591 [details]
HTTP log for https://vanilla.aslushnikov.com/

Valentin, very interesting is that I cannot get the content of https://vanilla.aslushnikov.com/ loaded when DoH is enabled, but it works just fine with DoH disabled. It's 100% reproducible for me. Attached you can find the network log.

That is quite interesting. Looking at the logs, it seems vanilla.aslushnikov.com loads with no problem, but cdn.jsdelivr.net does not.
But the reason why that happens is very unclear. I'll try to land even more logging to figure it out.

Flags: needinfo?(valentin.gosu)

Valentin, have you had the time to add some further logging to networking yet?

Flags: needinfo?(valentin.gosu)

Hi Henrik, yes, I did it in bug 1640872.
If possible, please try to reproduce again. Thanks a lot for all your help with this bug.

Flags: needinfo?(valentin.gosu)

Great. I will upgrade and check throughout the day if I can hit the problem again.

Depends on: 1640872

Valentin, here is a new network log, but this time for IRCCloud. Usually it starts with the websockets no longer working: it constantly shows me as offline and is never able to connect. Then when I reload the page, the page cannot be loaded (We can’t connect to the server at www.irccloud.com):

https://send.firefox.com/download/76b41783ecfae5ea/#d0LNt-PiXM4uQVoSCUhnZQ

Once I disable DoH it works again. I hope you can find some more helpful details.

Flags: needinfo?(valentin.gosu)

Thank you Henrik. I finally managed to figure it out.
It's related to IPv6 support. When making a TRR-first request for an IPv6 address, we don't properly fall back to DNS.
I've got a fix and a test, but I'm working on more to figure out why it only started happening recently.
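The intended TRR-first behaviour can be sketched as follows. This is an illustration only, not the actual patch: `doh_lookup` and `native_lookup` are hypothetical callables, and the mode numbers are taken from the `effectiveTRRmode: 2` log line and the `trr.mode=3` remark in this bug, with 0 assumed to mean "off".

```python
from enum import IntEnum


class TrrMode(IntEnum):
    OFF = 0        # assumed: native DNS only
    TRR_FIRST = 2  # DoH first, native DNS as fallback
    TRR_ONLY = 3   # DoH only, no fallback


class ResolveError(Exception):
    """Raised when no resolver produced an answer."""


def resolve(host, family, mode, doh_lookup, native_lookup):
    """Resolve host for one address family ('A' or 'AAAA').

    doh_lookup and native_lookup are hypothetical callables taking
    (host, family) and returning a list of addresses or raising.
    """
    if mode == TrrMode.OFF:
        return native_lookup(host, family)
    try:
        addresses = doh_lookup(host, family)
        if addresses:
            return addresses
        failure = ResolveError("empty DoH answer")
    except Exception as exc:  # any DoH failure reason counts
        failure = exc
    if mode == TrrMode.TRR_FIRST:
        # Fall back for every record type, including AAAA.
        return native_lookup(host, family)
    raise ResolveError(f"DoH lookup failed for {host}") from failure


if __name__ == "__main__":
    def broken_doh(host, family):
        raise TimeoutError("DoH request failed")

    def native(host, family):
        return ["2001:db8::1"] if family == "AAAA" else ["192.0.2.1"]

    print(resolve("example.org", "AAAA", TrrMode.TRR_FIRST, broken_doh, native))
```

The point of the sketch is that the fallback decision depends only on the mode, never on the record type or on why the DoH request failed.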

Flags: needinfo?(valentin.gosu)

Oh good to know! The only thing I can say right now is that I enabled ipv6 a while ago via my router. If it really has to be I could check older releases and maybe find out which release actually regressed with that. But given the amount of time needed to run into this problem, it might take a while.

Summary: Loss of network connections when DoH enabled → Loss of network connections when IPv6 and DoH enabled

If you only recently turned on IPv6 it's likely the issue has existed for a while and was introduced by bug 1610836.

No, it was done a couple of months ago; sometime last year.

[Tracking Requested - why for this release]:
I'm seeing lots of chatter on Reddit from people with websites that won't load. Refreshing their profile fixes it, suggesting that the users might have been opted into DoH rollout.

I wonder if there is some relation with bug

(In reply to Aaron Klotz [:aklotz] from comment #36)

[Tracking Requested - why for this release]:
I'm seeing lots of chatter on Reddit from people with websites that won't load. Refreshing their profile fixes it, suggesting that the users might have been opted into DoH rollout.

If they are people who were previously using nextdns in trr.mode=3 they would completely lose connectivity as nextdns has turned off their TRR server due to 77 accidentally DDOSing them.

(In reply to Valentin Gosu [:valentin] (he/him) from comment #32)

It's related to IPv6 support. When making a TRR-first request for an IPv6 address, we don't properly fall back to DNS.

What does it actually mean for regression testing? If I had clear steps to reproduce, it should be easier to find the regressor.

Flags: needinfo?(valentin.gosu)
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/604346d2f6da
IPv6 TRR requests don't fallback to DNS properly r=dragana,necko-reviewers
https://hg.mozilla.org/integration/autoland/rev/3b999337cd40
Test that IPv6 TRR requests fallback to DNS properly r=dragana,necko-reviewers
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla79

[Tracking Requested - why for this release]:
Intermittent failure of IPv6 DNS requests when using TRR.

Flags: needinfo?(valentin.gosu)
Flags: in-testsuite+
Regressed by: 1610836
Has Regression Range: --- → yes

Henrik, would you be able to verify this fix?
Otherwise, I could help with this if you provide me clear reproduction steps.

Also, the regressor above could be confirmed if this bug could be reproduced constantly.
Thank you for your contribution!

Flags: needinfo?(hskupin)

Yes, I will keep an eye on it and reply back by the middle of the week. Usually it takes a little while until I am able to see this problem. I will come back on Wednesday.

After a couple of days using Firefox Nightly with the patch included and DoH enabled I can no longer see the network issues. It all runs fine now.

Valentin, can you start the uplift request for 78 beta?

Status: RESOLVED → VERIFIED
Flags: needinfo?(hskupin) → needinfo?(valentin.gosu)

Comment on attachment 9154116 [details]
Bug 1637512 - IPv6 TRR requests don't fallback to DNS properly r=dragana

Beta/Release Uplift Approval Request

  • User impact if declined: IPv6 DNS requests might intermittently fail under TRR.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The change is limited in scope. It makes sure we fall back to DNS regardless of the reason why the DoH request failed.
  • String changes made/needed:
Flags: needinfo?(valentin.gosu)
Attachment #9154116 - Flags: approval-mozilla-beta?
Attachment #9154117 - Flags: approval-mozilla-beta?

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #47)

After a couple of days using Firefox Nightly with the patch included and DoH enabled I can no longer see the network issues. It all runs fine now.

Thank you for confirming the fix, Henrik, and for filing the bug in the first place. This would have been difficult to figure out without all the logs you provided. I am grateful for all your help.

Comment on attachment 9154116 [details]
Bug 1637512 - IPv6 TRR requests don't fallback to DNS properly r=dragana

approved for 78.0b6

Attachment #9154116 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9154117 - Flags: approval-mozilla-beta? → approval-mozilla-beta+