DNS-over-HTTPS sometimes fails after VPN is turned off
Categories
(Core :: Networking: DNS, defect, P2)
Tracking
()
People
(Reporter: ajft, Assigned: kershaw)
References
(Blocks 3 open bugs)
Details
(Whiteboard: [necko-triaged])
Attachments
(1 file, 1 obsolete file)
User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0
Steps to reproduce:
Ubuntu 20.04, Firefox 77.0.1 using DNS over HTTPS to Cloudflare (default). Start VPN (Cisco AnyConnect) to connect to corporate network. Browsing continues to work, DNS lookups working. Close the VPN connection and sometimes firefox loses the ability to resolve any network address.
Actual results:
Disconnected VPN, tried to open a website in browser (www.cnet.com) and receive the error page "Hmm. Weβre having trouble finding that site. We canβt connect to the server at www.cnet.com."
Only fix seems to be to quit firefox & restart it, the "failed" page resolves the address and automatically loads
Expected results:
DNS resolution should always continue to work after VPN is disconnected, not intermittently
Comment 1•4 years ago
|
||
Bugbug thinks this bug should belong to this component, but please revert this change in case of error.
Updated•4 years ago
|
Comment 3•3 years ago
|
||
I have reproduced this with mode3 and disconnecting from expressVPN.
It seems getaddrinfo doesn't work, and calling res_ninit doesn't fix it.
2021-06-01 14:24:04.520574 UTC - [Parent 318015: DNS Resolver #2]: E/nsHostResolver DNS lookup thread - Calling getaddrinfo for host [mozilla.cloudflare-dns.com].
2021-06-01 14:24:04.520710 UTC - [Parent 318015: Socket Thread]: D/nsHostResolver Resolving host [mozilla.cloudflare-dns.com]<> type 0. [this=7f1a08c18120]
2021-06-01 14:24:04.520728 UTC - [Parent 318015: Socket Thread]: D/nsHostResolver Host [mozilla.cloudflare-dns.com] is being resolved. Appending callback [7f19ccec0dc0].
2021-06-01 14:24:04.520941 UTC - [Parent 318015: DNS Resolver #2]: D/nsHostResolver Calling 'res_ninit'.
2021-06-01 14:24:04.521215 UTC - [Parent 318015: DNS Resolver #2]: E/nsHostResolver DNS lookup thread - lookup completed for host [mozilla.cloudflare-dns.com]: failure: unknown host.
2021-06-01 14:24:04.521225 UTC - [Parent 318015: DNS Resolver #2]: D/nsHostResolver nsHostResolver::CompleteLookup mozilla.cloudflare-dns.com 0 804B001E resolver=0 stillResolving=0
2021-06-01 14:24:04.521229 UTC - [Parent 318015: DNS Resolver #2]: D/nsHostResolver nsHostResolver record 7f19ccfbbc00 new gencnt
2021-06-01 14:24:04.521233 UTC - [Parent 318015: DNS Resolver #2]: D/nsHostResolver Caching host [mozilla.cloudflare-dns.com] negative record for 60 seconds.
2021-06-01 14:24:04.521237 UTC - [Parent 318015: DNS Resolver #2]: D/nsHostResolver CompleteLookup: mozilla.cloudflare-dns.com has NO address
2021-06-01 14:24:04.521240 UTC - [Parent 318015: DNS Resolver #2]: D/nsHostResolver nsHostResolver record 7f19ccfbbc00 calling back dns users status:804B001E
2021-06-01 14:24:04.521388 UTC - [Parent 318015: TRR Background]: D/nsHostResolver TRR::OnStartRequest 7f19c8a80890 support.mozilla.org 1
2021-06-01 14:24:04.521415 UTC - [Parent 318015: TRR Background]: D/nsHostResolver TRR:OnStopRequest 7f19c8a80890 support.mozilla.org 1 failed=0 code=804B001E
2021-06-01 14:24:04.521439 UTC - [Parent 318015: TRR Background]: D/nsHostResolver TRR:OnStopRequest 7f19c8a80890 status 804b001e mFailed 0
2021-06-01 14:24:04.521452 UTC - [Parent 318015: TRR Background]: D/nsHostResolver nsHostResolver::CompleteLookup support.mozilla.org 7f19cade1ce0 804B001E resolver=1 stillResolving=0
I'm looking into how we can fix this.
Assignee | ||
Comment 4•2 years ago
|
||
So, I've looked the code that we deal with network change and found it's problematic.
This is what happens when we detect a network change:
- The http handler calls VerifyTraffic.
- ConnectionEntry::VerifyTraffic is called for each connection entry. For h2 connections, SendPing is called.
- In
Http2Session::SendPing
, we send a ping to the server and set mPingThreshold to NetworkChangedTimeout, which is 5s. - Then, Http2Session::ReadTimeoutTick will be called to check if the time of the last read.
- If the last read (before network change) was happened within 5s, we assume this connection is alive. But, this connection maybe already dead.
For the h2 connection to the TRR server, this condition is easily fulfilled, since there should be lots of DNS request to the server. Changing the network change timeout to a small value can be still racy. So, I think the safest way would be always closing the TRR connection when detecting network change.
Assignee | ||
Comment 5•2 years ago
|
||
Comment 6•2 years ago
|
||
(In reply to Kershaw Chang [:kershaw] from comment #4)
For the h2 connection to the TRR server, this condition is easily fulfilled, since there should be lots of DNS request to the server. Changing the network change timeout to a small value can be still racy. So, I think the safest way would be always closing the TRR connection when detecting network change.
Thank you for the investigation. I have two questions about this:
- It seems that this race condition could potentially affect other connections too, right? It's just that the TRR connection is more likely to be affected and it's more obvious when it is. Could we try to fix the other cases too? Follow-up is fine.
- I would expect the connection to eventually timeout then recover, right? Do you know why that's not the case?
Also, I think this problem might not fix everything. My logs in comment 3 point to something else entirely.
I think that issue is similar to this stack overflow reproducible test case
Let's fix this one first, then maybe we can use bug 1606399 for a DNS follow-up fix.
Assignee | ||
Comment 7•2 years ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #6)
(In reply to Kershaw Chang [:kershaw] from comment #4)
For the h2 connection to the TRR server, this condition is easily fulfilled, since there should be lots of DNS request to the server. Changing the network change timeout to a small value can be still racy. So, I think the safest way would be always closing the TRR connection when detecting network change.
Thank you for the investigation. I have two questions about this:
- It seems that this race condition could potentially affect other connections too, right? It's just that the TRR connection is more likely to be affected and it's more obvious when it is. Could we try to fix the other cases too? Follow-up is fine.
Yes, this problem also affects other connections. I'll try to come up with a solution to fix all in this bug.
- I would expect the connection to eventually timeout then recover, right? Do you know why that's not the case?
Yes, the connection eventually timeout, but the timeout value can be really long (up to 58s).
Assignee | ||
Comment 8•2 years ago
|
||
My proposal to fix this bug is:
- Set
mLastReadEpoch
to a value that is bigger thanmPingThreshold
inHttp2Session::SendPing
, so we really check if we receive server's ack after network change. - Consider to call
VerifyTraiifc
only when the network id changed. - During the time of
VerifyTraiifc
, not dispatching the transactions to connections, since we don't know those connections are still alive. - Consider to use a shorter network change timeout for TRR connection (it's 5s currently).
Valentin, what do you think? Thanks.
Comment 9•2 years ago
|
||
Sounds great, Kershaw. Thanks!
During the time of VerifyTraiifc, not dispatching the transactions to connections, since we don't know those connections are still alive.
I'm not sure about this one. Feels like it could delay the transactions unnecessarily when the connection still works. Maybe we could put it behind a pref and try it out only on nightly? π€·ββοΈ
Updated•2 years ago
|
Updated•2 years ago
|
Assignee | ||
Comment 10•2 years ago
|
||
Comment 11•2 years ago
|
||
Pushed by kjang@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/23b5f3348457 Use shorter ping timeout for the connection to TRR, r=necko-reviewers,valentin
Comment 12•2 years ago
|
||
bugherder |
Comment 13•2 years ago
|
||
This seems like it might be a good uplift candidate?
Assignee | ||
Comment 14•2 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #13)
This seems like it might be a good uplift candidate?
This patch only helps users who use TRR mode 3 (not a lot), so I think we don't need to uplift this.
Updated•2 years ago
|
Updated•1 year ago
|
Comment 15•1 year ago
|
||
I couldn't reproduce this issue with the affected builds on Ubuntu 20 and Windows 10, with mode 3 (The VPN used was NordVPN, for which I have subscription). I don't see a recent activity from the reporter, but maybe, if around, and time permits, ajft can you confirm this issue is no longer reproducible on your side with the latest Firefox build? Thank you!
Reporter | ||
Comment 16•1 year ago
|
||
I'm not able to reproduce this, seems to no longer be an issue
Current PC is up to Ubuntu 23.04, VPN is still Cisco AnyConnect but is now version 4.10.0511
Comment 17•1 year ago
|
||
Thank you for your time ajft. Based on comment 16, I will mark this issue as verified.
Description
•