Closed Bug 1703934 Opened 3 years ago Closed 3 years ago

Discord Not Loading with DNS-over-HTTPS Enabled

Categories

(Core :: Networking: DNS, defect, P2)

Firefox 89
defect

Tracking

()

RESOLVED FIXED
90 Branch
Tracking Status
firefox89 --- fixed
firefox90 --- fixed

People

(Reporter: nils, Assigned: kershaw)

Details

(Whiteboard: [necko-triaged])

Attachments

(4 files, 3 obsolete files)

Attached image aJF3U3an5s.png

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0

Steps to reproduce:

  1. Enable DNS-over-HTTPS to any provider
  2. Load https://discord.com/app

Actual results:

Discord splash page sometimes works, other times the channel/server will load, but the chat window will have greyed out/unloaded content.

As soon as I disable DoH, everything loads properly. Other sites seem to load fine with DoH, except for Discord.

Expected results:

All content should load normally.

The Bugbug bot thinks this bug should belong to the 'Core::Networking: DNS' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → Networking: DNS
Product: Firefox → Core

Could you try to get an HTTP log?
Since this might have something to do with WebSockets, could you also add nsWebSocket:5 to the MOZ_LOG variable? Thanks.
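
For reference, a minimal sketch of how the extra module could be appended to a MOZ_LOG value before launching Firefox. Only nsWebSocket:5 comes from the request above. The base module list is an assumption (the list commonly suggested for Firefox HTTP logging, not something stated in this bug), and the helper names add_log_module and logging_env are hypothetical.

```python
import os

# Assumed base value: module list commonly suggested for Firefox HTTP logging.
BASE_MOZ_LOG = (
    "timestamp,rotate:200,nsHttp:5,cache2:5,"
    "nsSocketTransport:5,nsHostResolver:5"
)


def add_log_module(moz_log: str, module: str, level: int = 5) -> str:
    """Return moz_log with `module:level` appended unless already listed."""
    entries = [e for e in moz_log.split(",") if e]
    names = {e.split(":", 1)[0] for e in entries}
    if module not in names:
        entries.append(f"{module}:{level}")
    return ",".join(entries)


def logging_env(log_file: str) -> dict:
    """Build an environment dict with HTTP plus WebSocket logging enabled."""
    env = dict(os.environ)
    env["MOZ_LOG"] = add_log_module(BASE_MOZ_LOG, "nsWebSocket")
    env["MOZ_LOG_FILE"] = log_file  # path chosen by the caller
    return env
```

The returned dict could be passed as the env argument of subprocess.Popen when starting Firefox from a script.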

Flags: needinfo?(nils)

(In reply to Kershaw Chang [:kershaw] from comment #2)

I went ahead and generated a log file with the added variable. File was fairly large, I didn't have any other tabs open except Discord when the log was generated. Since it was so large, I had to use an external sharing service. URL below:

https://ufile.io/ysb2r07l

Flags: needinfo?(nils)

Thanks for the log, but I can't find anything wrong in it.
Could you try the following and see whether you can still reproduce this (please keep DNS-over-HTTPS enabled)?

  1. Go to about:config and disable network.dns.use_https_rr_as_altsvc.
  2. Try to reproduce with a clean profile.
  3. Disable network.http.spdy.websockets.

Thanks.

Flags: needinfo?(nils)

(In reply to Kershaw Chang [:kershaw] from comment #4)

I disabled network.dns.use_https_rr_as_altsvc. This allowed Discord to load normally. Using Cloudflare as the DoH provider.

Flags: needinfo?(nils)

(In reply to Nils from comment #5)

Thanks! Now I see the problem in the log.
This seems to be related to http3. Here is what happened:

  1. An HTTPS RR record is used to connect to discord.com with http3.
2021-04-12 12:59:34.267000 UTC - [Parent 16268: Socket Thread]: V/nsHttp nsHttpTransaction::OnHTTPSRRAvailable [this=142c1bde800] mActivated=0
2021-04-12 12:59:34.267000 UTC - [Parent 16268: Socket Thread]: V/nsHttp HTTPSSVC: use new routed host (discord.com) and new npnToken (h3-29)
  2. For some reason, Firefox can't establish the http3 connection, so we try to fall back to h2.
2021-04-12 12:59:34.318000 UTC - [Parent 16268: Socket Thread]: V/nsHttp nsHttpTransaction::OnFastFallbackTimer [142c1bde800] mConnected=0
2021-04-12 12:59:34.318000 UTC - [Parent 16268: Socket Thread]: V/nsHttp HTTPSSVC: use new routed host (discord.com) and new npnToken (h2)
2021-04-12 12:59:34.318000 UTC - [Parent 16268: Socket Thread]: V/nsHttp Init nsHttpConnectionInfo @142be97c200
2021-04-12 12:59:34.318000 UTC - [Parent 16268: Socket Thread]: V/nsHttp NulHttpTransaction::NullHttpTransaction() mActivityDistributor is active [this=142b37eff80, discord.com]
2021-04-12 12:59:34.318000 UTC - [Parent 16268: Socket Thread]: V/nsHttp FindCoalescableConnection .S......[tlsflags0x00000000]discord.com:443 {NPN-TOKEN h2}^partitionKey=%28https%2Cdiscord.com%29
2021-04-12 12:59:34.318000 UTC - [Parent 16268: Socket Thread]: V/nsHttp FindCoalescableConnection(.S......[tlsflags0x00000000]discord.com:443 {NPN-TOKEN h2}^partitionKey=%28https%2Cdiscord.com%29) no matching conn
2021-04-12 12:59:34.318000 UTC - [Parent 16268: Socket Thread]: V/nsHttp GetH2orH3ActiveConn() request for ent 142bfbb1e20 .S......[tlsflags0x00000000]discord.com:443 {NPN-TOKEN h2}^partitionKey=%28https%2Cdiscord.com%29 did not find an active connection
2021-04-12 12:59:34.318000 UTC - [Parent 16268: Socket Thread]: V/nsHttp Init nsHttpConnectionInfo @142be97c7a0
2021-04-12 12:59:34.318000 UTC - [Parent 16268: Socket Thread]: D/nsHttp Destroying nsHttpConnectionInfo @142be97c7a0
2021-04-12 12:59:34.318000 UTC - [Parent 16268: Socket Thread]: V/nsHttp OnMsgSpeculativeConnect Transport not created due to existing connection count
  3. The speculative connection can't be created because the connection limit has been reached (it seems network.http.speculative-parallel-limit is 0 here). As a result, the transaction stays in the pending queue forever.
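
The sequence described in the list above can be checked mechanically in a log file. The following is a rough sketch only: the marker substrings are copied from the log excerpts above, while the stage labels and the function names summarize and looks_stuck are hypothetical. It does not track individual transactions, so on a busy log it can only hint that the pattern is present.

```python
import re

# Substrings copied from the log excerpts; stage names are hypothetical labels.
MARKERS = [
    ("https_rr_used", "nsHttpTransaction::OnHTTPSRRAvailable"),
    ("fast_fallback", "nsHttpTransaction::OnFastFallbackTimer"),
    ("speculative_refused",
     "OnMsgSpeculativeConnect Transport not created due to existing connection count"),
]

NPN_RE = re.compile(r"new npnToken \(([^)]+)\)")


def summarize(lines):
    """Return (stages in order of first appearance, npn tokens in order)."""
    stages, tokens = [], []
    for line in lines:
        for stage, needle in MARKERS:
            if needle in line and stage not in stages:
                stages.append(stage)
        match = NPN_RE.search(line)
        if match:
            tokens.append(match.group(1))
    return stages, tokens


def looks_stuck(lines):
    """True when all three stages occur in the order described in the list."""
    stages, _ = summarize(lines)
    return stages == [stage for stage, _ in MARKERS]


# Shortened copies of the log lines quoted in this comment.
SAMPLE = [
    "V/nsHttp nsHttpTransaction::OnHTTPSRRAvailable [this=142c1bde800] mActivated=0",
    "V/nsHttp HTTPSSVC: use new routed host (discord.com) and new npnToken (h3-29)",
    "V/nsHttp nsHttpTransaction::OnFastFallbackTimer [142c1bde800] mConnected=0",
    "V/nsHttp HTTPSSVC: use new routed host (discord.com) and new npnToken (h2)",
    "V/nsHttp OnMsgSpeculativeConnect Transport not created due to existing connection count",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
    print(looks_stuck(SAMPLE))
```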

Dragana, do you have any idea why we failed to establish an http3 connection?
Thanks.

Flags: needinfo?(dd.mozilla)

(In reply to Nils from comment #5)

Could you make another http log with network.dns.use_https_rr_as_altsvc disabled?
I'm wondering whether the http3 connection can be established via the alt-svc header. Thanks.

Flags: needinfo?(nils)

(In reply to Kershaw Chang [:kershaw] from comment #7)

Here's the new log as requested:
https://ufile.io/036w1vg7

Flags: needinfo?(nils)

This if statement returns false:

  if (mNumDnsAndConnectSockets < parallelSpeculativeConnectLimit &&
      ((ignoreIdle &&
        (ent->IdleConnectionsLength() < parallelSpeculativeConnectLimit)) ||
       !ent->IdleConnectionsLength()) &&
      // I think we are failing on this line:
      !(keepAlive && ent->RestrictConnections()) &&
      // We never reach this call, because there is no log from it (the function has logging):
      !AtActiveConnectionLimit(ent, aTrans->Caps())) {

RestrictConnections() calls AvailableForDispatchNow, AvailableForDispatchNow calls GetH2orH3ActiveConn. GetH2orH3ActiveConn prints:

2021-04-12 12:59:34.318000 UTC - [Parent 16268: Socket Thread]: V/nsHttp FindCoalescableConnection .S......[tlsflags0x00000000]discord.com:443 {NPN-TOKEN h2}^partitionKey=%28https%2Cdiscord.com%29
2021-04-12 12:59:34.318000 UTC - [Parent 16268: Socket Thread]: V/nsHttp FindCoalescableConnection(.S......[tlsflags0x00000000]discord.com:443 {NPN-TOKEN h2}^partitionKey=%28https%2Cdiscord.com%29) no matching conn
2021-04-12 12:59:34.318000 UTC - [Parent 16268: Socket Thread]: V/nsHttp GetH2orH3ActiveConn() request for ent 142bfbb1e20 .S......[tlsflags0x00000000]discord.com:443 {NPN-TOKEN h2}^partitionKey=%28https%2Cdiscord.com%29 did not find an active connection

So we probably have a connection or an unconnected DnsAndConnectSocket. That may happen :(
We need to force the creation of a speculative connection and ignore our limits.

Flags: needinfo?(dd.mozilla)

Thanks for this log!
From the log, I see the http3 connection is made with h3-27, which is different from the previous one (h3-29).

2021-04-14 16:20:53.248000 UTC - [Parent 5852: Socket Thread]: V/nsHttp Creating DnsAndConnectSocket [this=241c11d9480 trans=241c4332050 ent=discord.com key=.S......[tlsflags0x00000000]discord.com:443 <ROUTE-via discord.com:443> {NPN-TOKEN h3-27}^partitionKey=%28https%2Cdiscord.com%29]

The reason is that the alt-svc header from discord.com is alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400, but the HTTPS RR is 1 discord.com (alpn="h3-29,h3-28,h3-27,h2" ipv4hint="162.159.128.233, 162.159.135.232, 162.159.136.232, 162.159.137.232, 162.159.138.232").
RFC 7838 says:

   When multiple values are present, the order of the values reflects
   the server's preference (with the first value being the most
   preferred alternative).

We use h3-27 when the alt-svc header is used. However, it seems the order of alpn-ids is not defined in this spec. When an HTTPS RR is available, Firefox chooses the first supported alpn-id (h3-29) to connect.
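
To illustrate why the two sources end up with different tokens, here is a simplified model that picks the first supported protocol id from each list. It is not Necko's selection code: the parser ignores quoted commas and parameters, the function names are hypothetical, and the SUPPORTED set is an assumption made for the example. The header and alpn values are the ones quoted above.

```python
def parse_alt_svc(header: str) -> list:
    """Return protocol ids from an Alt-Svc header value, in header order."""
    ids = []
    for entry in header.split(","):
        first = entry.strip().split(";", 1)[0]  # e.g. 'h3-27=":443"'
        proto, sep, _authority = first.partition("=")
        if sep and proto.strip():
            ids.append(proto.strip())
    return ids


def first_supported(candidates, supported):
    """Return the first candidate that the client supports, or None."""
    for candidate in candidates:
        if candidate in supported:
            return candidate
    return None


# Values quoted in this comment.
ALT_SVC = 'h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400'
HTTPS_RR_ALPN = "h3-29,h3-28,h3-27,h2"

# Assumption for illustration: the client supports all listed drafts and h2.
SUPPORTED = {"h3-27", "h3-28", "h3-29", "h2"}

if __name__ == "__main__":
    print(first_supported(parse_alt_svc(ALT_SVC), SUPPORTED))
    print(first_supported(HTTPS_RR_ALPN.split(","), SUPPORTED))
```

Under these assumptions the header order yields h3-27 and the HTTPS RR order yields h3-29, which matches the tokens seen in the two logs.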

Hi Nils,
Could you check the value of network.http.speculative-parallel-limit on your side?
Thanks.

Flags: needinfo?(nils)

(In reply to Kershaw Chang [:kershaw] from comment #11)

It's currently set to '0'.
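
A value of 0 matters for the condition quoted earlier in this bug. The sketch below models that condition with plain values. It assumes parallelSpeculativeConnectLimit reflects this pref, and EntryState and may_create_speculative are hypothetical stand-ins rather than Necko types. It does not settle which clause failed in the log; it only shows that with a limit of 0 the first comparison can never be true.

```python
from dataclasses import dataclass


@dataclass
class EntryState:
    """Hypothetical stand-in for the state read from the connection entry."""
    idle_connections: int = 0
    restrict_connections: bool = False
    at_active_limit: bool = False


def may_create_speculative(num_dns_and_connect_sockets: int,
                           parallel_limit: int,
                           ent: EntryState,
                           ignore_idle: bool = False,
                           keep_alive: bool = True) -> bool:
    """Mirror the structure of the quoted if statement with plain values."""
    return (
        num_dns_and_connect_sockets < parallel_limit
        and ((ignore_idle and ent.idle_connections < parallel_limit)
             or ent.idle_connections == 0)
        and not (keep_alive and ent.restrict_connections)
        and not ent.at_active_limit
    )


if __name__ == "__main__":
    idle_free = EntryState()
    # A nonzero limit (6 is just an example value) lets the connection through.
    print(may_create_speculative(0, 6, idle_free))
    # With the limit at 0, `0 < 0` is false, so nothing is ever created.
    print(may_create_speculative(0, 0, idle_free))
```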

Flags: needinfo?(nils)
Assignee: nobody → kershaw
Severity: -- → S3
Priority: -- → P2
Whiteboard: [necko-triaged]
Attachment #9216391 - Attachment is obsolete: true
Pushed by kjang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8a7e66be31e2
P1: restart pending transactions when an error happened before connect, r=dragana,necko-reviewers
https://hg.mozilla.org/integration/autoland/rev/46cd307ba50c
P2: Use another parallel limit for backup speculative connection, r=dragana,necko-reviewers

Backed out 2 changesets (Bug 1703934) as requested on irc by kershaw for causing a possible regression.
https://hg.mozilla.org/integration/autoland/rev/0751c8ab736b1967556c951435df71357f660480

Flags: needinfo?(kershaw)
Attachment #9218311 - Attachment is obsolete: true
Attachment #9218312 - Attachment is obsolete: true
Pushed by kjang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6af76624ce86
P1: restart pending transactions when an error happened before connect, r=dragana,necko-reviewers
https://hg.mozilla.org/integration/autoland/rev/2a51d2530939
P2: Use another parallel limit for backup speculative connection, r=dragana,necko-reviewers
https://hg.mozilla.org/integration/autoland/rev/9dea771db2ce
P3: Make sure we always fast fallback to a non-http3 connection, r=dragana,necko-reviewers

Backed out 3 changesets (Bug 1703934) for causing xpcshell failures in test_http3_fast_fallback.js
Backout link: https://hg.mozilla.org/integration/autoland/rev/33e6726dee20181958342c8197f76cd37818df67
Push with failures, failure log.

Pushed by kjang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8c713e2f43e7
P1: restart pending transactions when an error happened before connect, r=dragana,necko-reviewers
https://hg.mozilla.org/integration/autoland/rev/4869f5f35758
P2: Use another parallel limit for backup speculative connection, r=dragana,necko-reviewers
https://hg.mozilla.org/integration/autoland/rev/f15ea1f11770
P3: Make sure we always fast fallback to a non-http3 connection, r=dragana,necko-reviewers
Status: UNCONFIRMED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 90 Branch

Comment on attachment 9217668 [details]
Bug 1703934 - P3: Make sure we always fast fallback to a non-http3 connection, r=dragana

Beta/Release Uplift Approval Request

  • User impact if declined: The fast fallback mechanism for http3 is not working for those users who have set network.http.speculative-parallel-limit to 0.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: N/A
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch has been verified in Nightly for two days and we have tests for this.
  • String changes made/needed: N/A
Attachment #9217668 - Flags: approval-mozilla-beta?
Attachment #9216392 - Flags: approval-mozilla-beta?
Attachment #9216393 - Flags: approval-mozilla-beta?

Comment on attachment 9217668 [details]
Bug 1703934 - P3: Make sure we always fast fallback to a non-http3 connection, r=dragana

This baked in Nightly for a week and is covered by tests. We are in early beta, so this seems like a good time to uplift. Thanks.

Attachment #9217668 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9216392 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9216393 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Kershaw, it seems that you have clear steps to reproduce the bug manually, are you sure that we don't need QA to verify the fix in nightly and beta?

Flags: needinfo?(kershaw)

(In reply to Pascal Chevrel:pascalc from comment #28)

To reproduce this, an http3 connection needs to fail first, and I don't think it's easy to set up a test environment for that.
I don't think we need QA to verify this, since we already have an automated test for it.

Flags: needinfo?(kershaw)