Slack page can't load channel list from Slack's WebSocket server when using DoH and Google's DNS

VERIFIED FIXED in Firefox 61

Status

()

defect
P1
normal
VERIFIED FIXED
Last year
Last year

People

(Reporter: cpeterson, Assigned: bagder)

Tracking

unspecified
mozilla62
Unspecified
All
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox-esr60 disabled, firefox59 unaffected, firefox60 disabled, firefox61 verified, firefox62 verified)

Details

(Whiteboard: [necko-triaged][trr])

Attachments

(3 attachments, 2 obsolete attachments)

Posted image Slack_screenshot.png
STR:
1. Enable DoH prefs with Google's DNS:
network.trr.mode = 2 or 1
network.trr.uri = https://dns.google.com/experimental

2. Restart Firefox.
3. Load Slack, e.g. https://mozilla.slack.com/

RESULT:
Slack can load the page's wireframe, but it can't load the channel list or recent messages from Slack's WebSocket server. See the attached screenshot. Occasionally, Slack can load the channel list and recent messages after waiting 20 seconds (which just happens to be a multiple of the Slack SOCKET-MANAGER's 10000 ms timeout below).

This problem only seems to happen with Google's DNS. If I use Cloudflare's DNS (network.trr.uri = https://mozilla.cloudflare-dns.com/dns-query) or disable DoH (network.trr.mode = 0) and reload the Slack page, then the channel list and recent messages load correctly.

When using Google's DNS, I see WebSocket error messages in the devtools console like:

Firefox can’t establish a connection to the server at wss://mpmulti-5eri.lb.slack-msgs.com/...
Apr-7 22:49:21.398 [T027LFU12] SOCKET-MANAGER: Changing from prov_connecting to error
Apr-7 22:49:21.400 [T027LFU12] SOCKET-MANAGER: will stay in error state for up to 10000 ms
Apr-7 22:49:21.401 T027LFU12 SOCKET-MANAGER: Closing socket due to unknown error
Assignee: nobody → daniel
Priority: -- → P1
Whiteboard: [necko-triaged][trr]
This seems more like a race condition than a problem with Google's DNS. Today I can also reliably reproduce the problem with Cloudflare's DNS in mode = 1 (but mode = 2 still works).
I take that back: I can reproduce on both macOS and Windows 10. On macOS, I can make the problem start and stop just by changing the prefs. On Windows, however, it seems like I have to restart the browser for the new prefs to trigger the problem.
OS: Unspecified → All
I can reproduce on Windows 10, too. I'm suffering of that. :/
I'm also seeing this, on macOS. Fixed by flipping trr.mode back from 2 to 0.
I've managed to reproduce a long delay a few times but I can't seem to do it very reliably and I have not been able to observe exactly why this problem happens. I'm logging at nsHostResolver:5,nsHttp:5 and I can see that the slack host names are "CNAME-chased" quite a bit but I can't spot any DOH-related failures (due to CNAMEs or otherwise).

Mode 2 should cause a retry on native resole if TRR fails, but I see no fails.

Mode 1 should just go with the fastest (and use native if TRR fails).

Based on all this, it seems like TRR doesn't really (report) fail when this effect happens.
chris - can you turn on dns logging in the about:networking tab and repro for daniel? Mode 1 is the most curious.
actually you might want all of http logging on. its possible the dns data from cloud is the problem, not the dns interaction
Here is a zip with log files of mode=0, mode=1, and mode=2:

https://www.dropbox.com/s/d5x1x8cc3fk2n6e/bug-1452436-logs.zip?dl=0

I recorded these log files with the default modules: timestamp,sync,nsHttp:5,cache2:5,nsSocketTransport:5,nsHostResolver:5. If there is a different HTTP module I should specify, just let me know.

I can still reproduce this Slack connection problem very easily with mode=1 while mode=2 connects successfully.
thanks!
Flags: needinfo?(daniel)
(In reply to Chris Peterson [:cpeterson] from comment #9)
> I can still reproduce this Slack connection problem very easily with mode=1
> while mode=2 connects successfully.

Did any DoH changes land recently? Slack used to fail to load for me with mode=1 while mode=0 and mode=2 loaded successfully. But today Slack fails to load with either mode=1 or mode=2 (on Mac and Windows with both Google's and Cloudflare's DNS servers).
There's been a few smaller changes recently, and a change of how CNAME responses are handled landed just the other day (bug 1455972).

I've stared on your logs a lot and I've tried to make sense of the symptoms but failed so far - there doesn't seem to be any TRR failures logged when you get these problems!

I keep using the mozilla slack in a tab of mine all the time and I alternate between trr mode 2 and 3 (usually with cloudflare) and I experience none of these problems. (I've seen delayed loads at some point but I'm far from being able to reproduce).

mode 0 isn't using TRR at all so I ignore that.

mode 1 is racing native vs trr and goes with the fastest response, which then would indicate that sometimes you get the TRR response and for some reason it seems to give you the "wrong" IP or an IP that is slow for you.

mode 2 is trr-first and then as long as you get an IP from the DOH server (and Firefox doesn't fail to connect to it) you'll use the TRR address so why mode 2 would work and mode 1 would not is really hard to grasp.

It might offer some additional data points if you check "about:networking" on the DNS tab when things go wrong/right and see if the TRR columns says true or false for the involved host names.

Can you record some new logs with these more "absolute" failures you experience now? Maybe this situation gives a better picture.
Flags: needinfo?(daniel)
Here's a snapshot of my about:networking DNS tab with trr mode 2. After waiting about one minute, Slack finally loaded.

Hostname                        Family  TRR     Addresses                     Expires (Seconds)
ca.slack-edge.com               ipv4    false   151.101.40.102                18
mpmulti-s86e.lb.slack-msgs.com  ipv4    false   18.216.16.163                 14
mpmulti-wq3r.lb.slack-msgs.com  ipv4    true    54.149.194.173                17
a.slack-edge.com                ipv4    true    151.101.40.102                9
mozilla.cloudflare-dns.com      ipv4    false   104.16.112.25 104.16.111.25   90
lh4.googleusercontent.com       ipv4    true    172.217.0.33                  36
secure.gravatar.com             ipv4    true    192.0.73.2                    59
lh3.googleusercontent.com       ipv4    true    172.217.0.33                  36
a.slack-edge.com                ipv4    true    151.101.40.102                9
emoji.slack-edge.com            ipv4    true    151.101.40.102                9
lh6.googleusercontent.com       ipv4    true    172.217.0.33                  36
Here's a freshly recorded log file when this happens ("timestamp,sync,nsWebSocket:5,nsHostResolver:5").

This shows that TRR works fine the whole time, but the websockets fail to connect to the given addresses and it has a 20 second timeout on each attempt.

There's no fallback to native DNS resolve in the websocket logic so the next attempt (if any) may very well also fail. I think we need to add logic there i similar style that we're doing for the sockettransport:

https://searchfox.org/mozilla-central/source/netwerk/base/nsSocketTransport2.cpp#1820-1830

I can't explain why TRR returns diff/non-functional addresses and why native resolves doesn't. It surprises me that the slack web site out of all sites would show this behavior.

Michal, as somewhat familiar with the websockets code, would you agree with my conclusion and do you have a suggestion on how I'd best add the fallback/retry ?
Flags: needinfo?(michal.novotny)
(In reply to Daniel Stenberg [:bagder] from comment #14)
> Michal, as somewhat familiar with the websockets code, would you agree with
> my conclusion and do you have a suggestion on how I'd best add the
> fallback/retry ?

We use the address from resolver only in nsWSAdmissionManager to control how many websocket channels we're connecting to some host at the same time. We create http channel in WebSocketChannel::AsyncOpen with the given URI, so no fallback is necessary in websockets code.
Flags: needinfo?(michal.novotny)
daniel can you confirm that this will have no impact in shadow mode?
I can confirm that in mode 4 this issue doesn't trigger.

I haven't seen this problem (and nobody else has reported it) happen in mode 4 (nor 0) and it also follows the logic that the problem is somehow related to the particular IP addresses returned from TRR instead of native. After talking with the team this evening I got some new ideas on how to proceed my debugging from here so I hope to have some further details on this soon.
Ok Nick,

Let me bounce this with you as well. I have a 6MB gzipped logfile here: https://daniel.haxx.se/test/mozilla-slack-broken2-2018-04-27.log.gz (about 60MB unpacked)

That's a log of nsHostResolver, nsWebSocket and nsHttp when I load mozilla.slack.com from my own Firefox dev build a few days ago.

Search for "mpmulti-pzxk.lb.slack-msgs.com" and look at the WebSocket HTTP request that is sent at timestamp 2018-04-27 13:26:20.822957.

WebSockets here implies HTTP/1 while the rest of this site is served over h2. I'm asking for your help to interpret the logs here to spot why this request takes ~95 seconds(!) to fulfill. The WebSockets code itself only allows 20 seconds and times out before the actual HTTP response arrives.

The TRR name resolve is done in 22ms at "2018-04-27 13:26:20.860225". I can't see any problems with that.

After almost 7 seconds (2018-04-27 13:26:27.617449), we get "GetSpdyActiveConn() request for ent 0x7f6e42944580 .S...X.[tlsflags0x00000000]mpmulti-pzxk.lb.slack-msgs.com:443 did not find an active connection"

... which to me sounds as if we're looking for an h2 connection for this request? I can't see anything else interesting done on this request before then and it puzzles me greatly.

I can't see how this is related to TRR/name resolving at all, but right now I'm trying to figure out exactly what's going on so that I can possibly backtrack from there and figure out where the problem is introduced in this scenario. Since it doesn't happen without TRR it pretty much has to be involved somehow!
Flags: needinfo?(hurley)
I've looked at the bits specifically related to the GetSpdyActiveConn bits. It looks like we hit that code as part of ProcessPendingQForEntry, which while not strictly necessary (the conninfo's hash key has the 'X' bit set, indicating no spdy allowed), but it does the right thing (there is, of course, no spdy conn available to an entry that has conninfo with no spdy).

For reference, it looks like we're trying to ProcessPendingQForEntry due to the code at https://searchfox.org/mozilla-central/source/netwerk/protocol/http/nsHttpConnectionMgr.cpp#2543 (we were asked to process the pending queue for something with an empty pending queue, so we just go looking for something to process)

The other thing I noticed around that bit of the log is that it then tries to dispatch the transaction with no connection available (indicating at that point that we haven't successfully made a connection). We definitely *try* to make a connection (2018-04-27 13:26:20.823149 UTC - [12579:Socket Thread]: V/nsHttp nsHttpConnectionMgr::MakeNewConnection 0x7f6e65e97740 ent=0x7f6e42944580 trans=0x7f6e409f2c00).

If you look at the DispatchPendingQ log lines a bit below the GetSpdyActiveConn stuff, it looks like the connection stays in the state where it has a half-open connection (implied by the second DispatchPendingQ line) that has not yet completed. It keeps going around like that every time we ProcessPendingQForEntry. Eventually the existing halfopen goes away (timed out, I presume, no socket transport stuff in the log), and we try to MakeNewConnection again.

That all indicates to me that the connection is just taking a *really* long time to get set up. Are we certain that the TRR server is returning the same rrset as plain ol' dns? (I didn't see any mention of that in my quick scan of the bug, apologies if I missed that.)
Flags: needinfo?(hurley)
Thanks for taking a look Nick!

> Are we certain that the TRR server is returning the same rrset as plain ol' dns?

They often don't return the exact same addresses, no, but they return addresses from the same general IP ranges - which isn't a surprise since these are all Amazon things.

I have however found that in many cases when we get an A record back from TRR for this host, that also typically include two CNAMEs, I will get a different A record if we "chase" the CNAME down and get that A record. I might hint that the A record we get in the first response is actually... stale? I'm now working on trying this theory out more to see what I can learn.

I've also been thinking about TTL times, especially with responses using CNAMEs, and I intend to go back to study exact how they are supposed to work for us here and make sure that our code does what my reading will tell me.
Ok, here's a patch for testing and I'm curious how this changes the slack page for people who can repro this problem.

This patch does a few things that changes TRR behavior:

1 - it brings back CNAME "chasing" instead of relying on the address records that also comes back (reverting bug 1455972). Empirical testing shows that this makes us get a different address in a significant share of the cases.

2 - it prefers the last CNAME instead of the first, which is mostly arbitrary but in the slack case it gets the final name faster.

3 - I figured out that the code didn't correctly use the TRR provided TTL in all cases and now made sure that all addresses resolved with TRR always set the TTL correctlt for the DNS cache entry (with grace time zero).

Using this patch in my own testing, I seem to at least have reduced the likeliness of problems quite significantly. I'll continue and test more...

I'm mostly looking for some input before I try to land this.
Flags: needinfo?(cpeterson)
ni?myself to also try with this patch, since I'd been reproducing basically constantly.
Flags: needinfo?(hurley)
> Ok, here's a patch for testing and I'm curious how this changes the slack page for people who can repro this problem.

Is there some kind of performance decrease?
(In reply to Jens Hausdorf from comment #23)

> Is there some kind of performance decrease?

Yes, a small one. Performance goes back to how it was before bug 1455972 landed and I doubt anyone even noticed a performance difference (improvement) with that. Following CNAMEs means doing another HTTP request (potentially a whole series of them) to get the "actual" address, so typically another round-trip extra time for a "fresh" resolve. (We don't cache the CNAMEs, we only cache the end results in the in-memory DNS cache.)
(In reply to Daniel Stenberg [:bagder] from comment #21)
> Created attachment 8972846 [details] [diff] [review]
> 0001-bug-1452436-TRR-use-CNAME-and-TTL-better-r-mcmanus.patch
> 
> Ok, here's a patch for testing and I'm curious how this changes the slack
> page for people who can repro this problem.

In my local build, this patch seems to improve things a little, but I can still easily reproduce the websocket connection problems. I'm testing a debug build on macOS with network.trr.uri = https://dns.google.com/experimental.

The first time I load mozilla.slack.com in a browser session seems to be the most likely to reproduce the problem. I can repro with either network.trr.mode = 1 or 2.
Flags: needinfo?(cpeterson)
So, first time I go to load slack with this patch applied, I hit the bug. However, like :cpeterson, this does appear to have improved the situation (it's not *every* time I load slack, as it was previously). Only other difference with my build is that it's debug (whereas of course I've been experiencing this on release builds otherwise). Just to rule out that difference, I've pushed to try (macOS only, since that's what I'm running) to get a "real" build out of infra to make sure.

For reference, I've been using trr mode 3, and https://1.1.1.1/dns-query as my resolver
Flags: needinfo?(hurley)
We eventually figured out the ultimate reason for this problem and it turned out to not really be about websockets.

Here's how the bug happens:

1. A name resolve is done for host X (TRR asks for A and AAAA records separately)
2. A response for the A records arrives
3. The host resolver callback fires early (before the AAAA response has arrived)
4. *Another* resolve is then started for the same host X
5. The AAAA response arrives
6. The second callback is then wrongly inhibited because the code sees that a callback has been called already for the A records

The fix (just submitted for review):

Makes sure that if a new resolve is done when the existing record is "in progress", that callback gets called at once if there has been a callback called previously for this record.
Comment on attachment 8974397 [details]
bug 1452436 - callback when resolving a host already in progress

https://reviewboard.mozilla.org/r/242746/#review248846
Attachment #8974397 - Flags: review?(mcmanus)
Comment on attachment 8974397 [details]
bug 1452436 - callback when resolving a host already in progress

https://reviewboard.mozilla.org/r/242746/#review248848
A better take on fixing this issue
Attachment #8972846 - Attachment is obsolete: true
Attachment #8974397 - Attachment is obsolete: true
Attachment #8974678 - Flags: review?(mcmanus)
Attachment #8974678 - Flags: review?(mcmanus) → review+
Keywords: checkin-needed
Comment on attachment 8974678 [details] [diff] [review]
0001-bug-1452436-callback-when-resolving-a-host-already-i.patch

Approval Request Comment
[Feature/Bug causing the regression]: bug 1452436
[User impact if declined]: It causes non-obvious timeouts/hangs for users of TRR. Trigger perhaps easiest when websockets are involved.
[Is this code covered by automated tests?]: no
[Has the fix been verified in Nightly?]: it is about to
[Needs manual test from QE? If yes, steps to reproduce]: enable TRR, goto mozilla.slack.com and see if it updates fine
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]: no
[Why is the change risky/not risky?]: the change is tiny and only affects users with TRR enabled
[String changes made/needed]:
Attachment #8974678 - Flags: approval-mozilla-beta?
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9d23e0279ecd
Callback when resolving a host already in progress. r=mcmanus
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/9d23e0279ecd
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
Flags: qe-verify+
Comment on attachment 8974678 [details] [diff] [review]
0001-bug-1452436-callback-when-resolving-a-host-already-i.patch

Fixes a Slack bug for users preffing on TRR for testing. Since the change only affects users who've preffed TRR on, I'll approve for 61.0b5.
Attachment #8974678 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:62.0) Gecko/20100101 Firefox/62.0
Build ID: 20180514220126

I verified this issue on MacOS 10.13.4, Windows 10 and Ubuntu 16.04 with the latest FF Nightly 61.0a1(2018-05-14) and the issue is no longer reproducible. 
We will verify this also in beta (61.0b5) after the fix will be uplifted.
(In reply to Cristian Craciun from comment #37)
> User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:62.0)
> Gecko/20100101 Firefox/62.0
> Build ID: 20180514220126
> 
> I verified this issue on MacOS 10.13.4, Windows 10 and Ubuntu 16.04 with the
> latest FF Nightly 61.0a1(2018-05-14) and the issue is no longer
> reproducible. 
> We will verify this also in beta (61.0b5) after the fix will be uplifted.

Edit: latest FF Nightly 62.0a1(2018-05-14)
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:61.0) Gecko/20100101 Firefox/61.0
Build ID: 20180514150347

Verified as fixed on Firefox 61.0b5 on MacOS 10.13.4, Windows 10 and Ubuntu 16.04
Status: RESOLVED → VERIFIED
Flags: qe-verify+
I'm still/again seeing this on 62.0a1 (2018-05-21) (64-bit) (resetting trr.mode from 1 to 0 makes it go away) on macOS.
I double-checked on my mac right now and it works as snappy as it is supposed to on the 2018-05-21 version for me. I checked and I can't spot any difference in behavior between Windows, Mac and Linux on the slack page. I suspect what you're experiencing is not the same problem that was already fixed.

Did you change any other trr pref than mode and uri?

Can you switch to trr mode 2 and enable logging[*] with "timestamp,nsHostResolver:5" when you go to the slack page and experience this problem, and then share that log with me/us?

[*] = https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging
Hmm, can't reproduce now in either mode 1 or 2. Probably not worth logging now? Sorry for the false alarm!
No worries. If you can't reproduce it's not worth logging. Let's save that for if this ever happens again... =)
I had problems loading Slack yesterday, too. I thought it might be a DoH regression, but I think it was a Slack server problem:

https://status.slack.com/2018-05/42acc9f7860c0416
I had a similar problem again earlier. I was going to log it, but when I toggled trr.mode from 1 to 2, the problem resolved itself (and I can't reproduce it again setting it back to mode = 1).
You need to log in before you can comment on or make changes to this bug.