Closed Bug 1647985 Opened 4 years ago Closed 2 years ago

DNS-over-HTTPS sometimes fails after VPN is turned off

Categories

(Core :: Networking: DNS, defect, P2)

77 Branch
defect

Tracking

()

VERIFIED FIXED
113 Branch
Tracking Status
firefox-esr102 --- wontfix
firefox111 --- wontfix
firefox112 --- wontfix
firefox113 --- fixed

People

(Reporter: ajft, Assigned: kershaw)

References

(Blocks 3 open bugs)

Details

(Whiteboard: [necko-triaged])

Attachments

(1 file, 1 obsolete file)

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0

Steps to reproduce:

Ubuntu 20.04, Firefox 77.0.1 using DNS over HTTPS to Cloudflare (default). Start VPN (Cisco AnyConnect) to connect to corporate network. Browsing continues to work, DNS lookups working. Close the VPN connection and sometimes firefox loses the ability to resolve any network address.

Actual results:

Disconnected VPN, tried to open a website in browser (www.cnet.com) and receive the error page "Hmm. We’re having trouble finding that site. We can’t connect to the server at www.cnet.com."

Only fix seems to be to quit firefox & restart it, the "failed" page resolves the address and automatically loads

Expected results:

DNS resolution should always continue to work after VPN is disconnected, not intermittently

Bugbug thinks this bug should belong to this component, but please revert this change in case of error.

Component: Untriaged → Networking: DNS
Product: Firefox → Core
Severity: -- → S2
Priority: -- → P2
See Also: → 1638542, 1635935
Whiteboard: [necko-triaged]
Assignee: nobody → valentin.gosu
Blocks: doh

I have reproduced this with mode3 and disconnecting from expressVPN.
It seems getaddrinfo doesn't work, and calling res_ninit doesn't fix it.

2021-06-01 14:24:04.520574 UTC - [Parent 318015: DNS Resolver #2]: E/nsHostResolver DNS lookup thread - Calling getaddrinfo for host [mozilla.cloudflare-dns.com].
2021-06-01 14:24:04.520710 UTC - [Parent 318015: Socket Thread]: D/nsHostResolver Resolving host [mozilla.cloudflare-dns.com]<> type 0. [this=7f1a08c18120]
2021-06-01 14:24:04.520728 UTC - [Parent 318015: Socket Thread]: D/nsHostResolver   Host [mozilla.cloudflare-dns.com] is being resolved. Appending callback [7f19ccec0dc0].
2021-06-01 14:24:04.520941 UTC - [Parent 318015: DNS Resolver #2]: D/nsHostResolver Calling 'res_ninit'.
2021-06-01 14:24:04.521215 UTC - [Parent 318015: DNS Resolver #2]: E/nsHostResolver DNS lookup thread - lookup completed for host [mozilla.cloudflare-dns.com]: failure: unknown host.
2021-06-01 14:24:04.521225 UTC - [Parent 318015: DNS Resolver #2]: D/nsHostResolver nsHostResolver::CompleteLookup mozilla.cloudflare-dns.com 0 804B001E resolver=0 stillResolving=0
2021-06-01 14:24:04.521229 UTC - [Parent 318015: DNS Resolver #2]: D/nsHostResolver nsHostResolver record 7f19ccfbbc00 new gencnt
2021-06-01 14:24:04.521233 UTC - [Parent 318015: DNS Resolver #2]: D/nsHostResolver Caching host [mozilla.cloudflare-dns.com] negative record for 60 seconds.
2021-06-01 14:24:04.521237 UTC - [Parent 318015: DNS Resolver #2]: D/nsHostResolver CompleteLookup: mozilla.cloudflare-dns.com has NO address
2021-06-01 14:24:04.521240 UTC - [Parent 318015: DNS Resolver #2]: D/nsHostResolver nsHostResolver record 7f19ccfbbc00 calling back dns users status:804B001E
2021-06-01 14:24:04.521388 UTC - [Parent 318015: TRR Background]: D/nsHostResolver TRR::OnStartRequest 7f19c8a80890 support.mozilla.org 1
2021-06-01 14:24:04.521415 UTC - [Parent 318015: TRR Background]: D/nsHostResolver TRR:OnStopRequest 7f19c8a80890 support.mozilla.org 1 failed=0 code=804B001E
2021-06-01 14:24:04.521439 UTC - [Parent 318015: TRR Background]: D/nsHostResolver TRR:OnStopRequest 7f19c8a80890 status 804b001e mFailed 0
2021-06-01 14:24:04.521452 UTC - [Parent 318015: TRR Background]: D/nsHostResolver nsHostResolver::CompleteLookup support.mozilla.org 7f19cade1ce0 804B001E resolver=1 stillResolving=0

I'm looking into how we can fix this.

Status: UNCONFIRMED → NEW
Ever confirmed: true
Blocks: 1596839

So, I've looked the code that we deal with network change and found it's problematic.
This is what happens when we detect a network change:

  1. The http handler calls VerifyTraffic.
  2. ConnectionEntry::VerifyTraffic is called for each connection entry. For h2 connections, SendPing is called.
  3. In Http2Session::SendPing, we send a ping to the server and set mPingThreshold to NetworkChangedTimeout, which is 5s.
  4. Then, Http2Session::ReadTimeoutTick will be called to check if the time of the last read.
  5. If the last read (before network change) was happened within 5s, we assume this connection is alive. But, this connection maybe already dead.

For the h2 connection to the TRR server, this condition is easily fulfilled, since there should be lots of DNS request to the server. Changing the network change timeout to a small value can be still racy. So, I think the safest way would be always closing the TRR connection when detecting network change.

(In reply to Kershaw Chang [:kershaw] from comment #4)

For the h2 connection to the TRR server, this condition is easily fulfilled, since there should be lots of DNS request to the server. Changing the network change timeout to a small value can be still racy. So, I think the safest way would be always closing the TRR connection when detecting network change.

Thank you for the investigation. I have two questions about this:

  1. It seems that this race condition could potentially affect other connections too, right? It's just that the TRR connection is more likely to be affected and it's more obvious when it is. Could we try to fix the other cases too? Follow-up is fine.
  2. I would expect the connection to eventually timeout then recover, right? Do you know why that's not the case?

Also, I think this problem might not fix everything. My logs in comment 3 point to something else entirely.
I think that issue is similar to this stack overflow reproducible test case
Let's fix this one first, then maybe we can use bug 1606399 for a DNS follow-up fix.

(In reply to Valentin Gosu [:valentin] (he/him) from comment #6)

(In reply to Kershaw Chang [:kershaw] from comment #4)

For the h2 connection to the TRR server, this condition is easily fulfilled, since there should be lots of DNS request to the server. Changing the network change timeout to a small value can be still racy. So, I think the safest way would be always closing the TRR connection when detecting network change.

Thank you for the investigation. I have two questions about this:

  1. It seems that this race condition could potentially affect other connections too, right? It's just that the TRR connection is more likely to be affected and it's more obvious when it is. Could we try to fix the other cases too? Follow-up is fine.

Yes, this problem also affects other connections. I'll try to come up with a solution to fix all in this bug.

  1. I would expect the connection to eventually timeout then recover, right? Do you know why that's not the case?

Yes, the connection eventually timeout, but the timeout value can be really long (up to 58s).

My proposal to fix this bug is:

  1. Set mLastReadEpoch to a value that is bigger than mPingThreshold in Http2Session::SendPing, so we really check if we receive server's ack after network change.
  2. Consider to call VerifyTraiifc only when the network id changed.
  3. During the time of VerifyTraiifc, not dispatching the transactions to connections, since we don't know those connections are still alive.
  4. Consider to use a shorter network change timeout for TRR connection (it's 5s currently).

Valentin, what do you think? Thanks.

Flags: needinfo?(valentin.gosu)

Sounds great, Kershaw. Thanks!

During the time of VerifyTraiifc, not dispatching the transactions to connections, since we don't know those connections are still alive.

I'm not sure about this one. Feels like it could delay the transactions unnecessarily when the connection still works. Maybe we could put it behind a pref and try it out only on nightly? πŸ€·β€β™‚οΈ

Flags: needinfo?(valentin.gosu)
Assignee: valentin.gosu → kershaw
Attachment #9319685 - Attachment is obsolete: true
Pushed by kjang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/23b5f3348457
Use shorter ping timeout for the connection to TRR, r=necko-reviewers,valentin
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 113 Branch

This seems like it might be a good uplift candidate?

Flags: needinfo?(kershaw)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #13)

This seems like it might be a good uplift candidate?

This patch only helps users who use TRR mode 3 (not a lot), so I think we don't need to uplift this.

Flags: needinfo?(kershaw)
Flags: qe-verify+

I couldn't reproduce this issue with the affected builds on Ubuntu 20 and Windows 10, with mode 3 (The VPN used was NordVPN, for which I have subscription). I don't see a recent activity from the reporter, but maybe, if around, and time permits, ajft can you confirm this issue is no longer reproducible on your side with the latest Firefox build? Thank you!

Flags: needinfo?(ajft)

I'm not able to reproduce this, seems to no longer be an issue

Current PC is up to Ubuntu 23.04, VPN is still Cisco AnyConnect but is now version 4.10.0511

Flags: needinfo?(ajft)

Thank you for your time ajft. Based on comment 16, I will mark this issue as verified.

Status: RESOLVED → VERIFIED
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: