Closed Bug 1647985 Opened 4 years ago Closed 2 years ago

DNS-over-HTTPS sometimes fails after VPN is turned off

Tracking

()

Status:

VERIFIED FIXED

Milestone:

113 Branch

Tracking Flags:

Tracking

Status

firefox-esr102

---

wontfix

firefox111

---

wontfix

firefox112

---

wontfix

firefox113

---

fixed

People

(Reporter: ajft, Assigned: kershaw)

References

(Blocks 3 open bugs)

Details

(Whiteboard: [necko-triaged])

Attachments

(1 file, 1 obsolete file)

Bug 1647985 - Close TRR connection when network changes, r=#necko 2 years ago Kershaw Chang [:kershaw] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1647985 - Use shorter ping timeout for the connection to TRR, r=#necko 2 years ago Kershaw Chang [:kershaw] 48 bytes, text/x-phabricator-request		Details \| Review

ajft

Reporter

Description

•

4 years ago

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0

Steps to reproduce:

Ubuntu 20.04, Firefox 77.0.1 using DNS over HTTPS to Cloudflare (default). Start VPN (Cisco AnyConnect) to connect to corporate network. Browsing continues to work, DNS lookups working. Close the VPN connection and sometimes firefox loses the ability to resolve any network address.

Actual results:

Disconnected VPN, tried to open a website in browser (www.cnet.com) and receive the error page "Hmm. We’re having trouble finding that site. We can’t connect to the server at www.cnet.com."

Only fix seems to be to quit firefox & restart it, the "failed" page resolves the address and automatically loads

Expected results:

DNS resolution should always continue to work after VPN is disconnected, not intermittently

BugBot [:suhaib / :marco/ :calixte]

Comment 1

•

4 years ago

Bugbug thinks this bug should belong to this component, but please revert this change in case of error.

Component: Untriaged → Networking: DNS

Product: Firefox → Core

Michal Novotny [:michal]

Updated

•

4 years ago

Severity: -- → S2

Priority: -- → P2

Updated

•

3 years ago

Assignee: nobody → valentin.gosu

Blocks: doh

Valentin Gosu [:valentin] (he/him)

Updated

•

3 years ago

Blocks: necko-vpn

Valentin Gosu [:valentin] (he/him)

Comment 3

•

3 years ago

I have reproduced this with mode3 and disconnecting from expressVPN.
It seems getaddrinfo doesn't work, and calling res_ninit doesn't fix it.

2021-06-01 14:24:04.520574 UTC - [Parent 318015: DNS Resolver #2]: E/nsHostResolver DNS lookup thread - Calling getaddrinfo for host [mozilla.cloudflare-dns.com].
2021-06-01 14:24:04.520710 UTC - [Parent 318015: Socket Thread]: D/nsHostResolver Resolving host [mozilla.cloudflare-dns.com]<> type 0. [this=7f1a08c18120]
2021-06-01 14:24:04.520728 UTC - [Parent 318015: Socket Thread]: D/nsHostResolver   Host [mozilla.cloudflare-dns.com] is being resolved. Appending callback [7f19ccec0dc0].
2021-06-01 14:24:04.520941 UTC - [Parent 318015: DNS Resolver #2]: D/nsHostResolver Calling 'res_ninit'.
2021-06-01 14:24:04.521215 UTC - [Parent 318015: DNS Resolver #2]: E/nsHostResolver DNS lookup thread - lookup completed for host [mozilla.cloudflare-dns.com]: failure: unknown host.
2021-06-01 14:24:04.521225 UTC - [Parent 318015: DNS Resolver #2]: D/nsHostResolver nsHostResolver::CompleteLookup mozilla.cloudflare-dns.com 0 804B001E resolver=0 stillResolving=0
2021-06-01 14:24:04.521229 UTC - [Parent 318015: DNS Resolver #2]: D/nsHostResolver nsHostResolver record 7f19ccfbbc00 new gencnt
2021-06-01 14:24:04.521233 UTC - [Parent 318015: DNS Resolver #2]: D/nsHostResolver Caching host [mozilla.cloudflare-dns.com] negative record for 60 seconds.
2021-06-01 14:24:04.521237 UTC - [Parent 318015: DNS Resolver #2]: D/nsHostResolver CompleteLookup: mozilla.cloudflare-dns.com has NO address
2021-06-01 14:24:04.521240 UTC - [Parent 318015: DNS Resolver #2]: D/nsHostResolver nsHostResolver record 7f19ccfbbc00 calling back dns users status:804B001E
2021-06-01 14:24:04.521388 UTC - [Parent 318015: TRR Background]: D/nsHostResolver TRR::OnStartRequest 7f19c8a80890 support.mozilla.org 1
2021-06-01 14:24:04.521415 UTC - [Parent 318015: TRR Background]: D/nsHostResolver TRR:OnStopRequest 7f19c8a80890 support.mozilla.org 1 failed=0 code=804B001E
2021-06-01 14:24:04.521439 UTC - [Parent 318015: TRR Background]: D/nsHostResolver TRR:OnStopRequest 7f19c8a80890 status 804b001e mFailed 0
2021-06-01 14:24:04.521452 UTC - [Parent 318015: TRR Background]: D/nsHostResolver nsHostResolver::CompleteLookup support.mozilla.org 7f19cade1ce0 804B001E resolver=1 stillResolving=0

I'm looking into how we can fix this.

Status: UNCONFIRMED → NEW

Ever confirmed: true

Valentin Gosu [:valentin] (he/him)

Updated

•

3 years ago

Blocks: 1606399

Kershaw Chang [:kershaw]

Assignee

Updated

•

2 years ago

Blocks: 1596839

Kershaw Chang [:kershaw]

Assignee

Comment 4

•

2 years ago

So, I've looked the code that we deal with network change and found it's problematic.
This is what happens when we detect a network change:

The http handler calls VerifyTraffic.
ConnectionEntry::VerifyTraffic is called for each connection entry. For h2 connections, SendPing is called.
In Http2Session::SendPing, we send a ping to the server and set mPingThreshold to NetworkChangedTimeout, which is 5s.
Then, Http2Session::ReadTimeoutTick will be called to check if the time of the last read.
If the last read (before network change) was happened within 5s, we assume this connection is alive. But, this connection maybe already dead.

For the h2 connection to the TRR server, this condition is easily fulfilled, since there should be lots of DNS request to the server. Changing the network change timeout to a small value can be still racy. So, I think the safest way would be always closing the TRR connection when detecting network change.

Kershaw Chang [:kershaw]

Assignee

Comment 5

•

2 years ago

Attached file Bug 1647985 - Close TRR connection when network changes, r=#necko (obsolete) — Details

Valentin Gosu [:valentin] (he/him)

Comment 6

•

2 years ago

(In reply to Kershaw Chang [:kershaw] from comment #4)

For the h2 connection to the TRR server, this condition is easily fulfilled, since there should be lots of DNS request to the server. Changing the network change timeout to a small value can be still racy. So, I think the safest way would be always closing the TRR connection when detecting network change.

Thank you for the investigation. I have two questions about this:

It seems that this race condition could potentially affect other connections too, right? It's just that the TRR connection is more likely to be affected and it's more obvious when it is. Could we try to fix the other cases too? Follow-up is fine.
I would expect the connection to eventually timeout then recover, right? Do you know why that's not the case?

Also, I think this problem might not fix everything. My logs in comment 3 point to something else entirely.
I think that issue is similar to this stack overflow reproducible test case
Let's fix this one first, then maybe we can use bug 1606399 for a DNS follow-up fix.

Kershaw Chang [:kershaw]

Assignee

Comment 7

•

2 years ago

(In reply to Valentin Gosu [:valentin] (he/him) from comment #6)

(In reply to Kershaw Chang [:kershaw] from comment #4)

For the h2 connection to the TRR server, this condition is easily fulfilled, since there should be lots of DNS request to the server. Changing the network change timeout to a small value can be still racy. So, I think the safest way would be always closing the TRR connection when detecting network change.

Thank you for the investigation. I have two questions about this:

It seems that this race condition could potentially affect other connections too, right? It's just that the TRR connection is more likely to be affected and it's more obvious when it is. Could we try to fix the other cases too? Follow-up is fine.

Yes, this problem also affects other connections. I'll try to come up with a solution to fix all in this bug.

I would expect the connection to eventually timeout then recover, right? Do you know why that's not the case?

Yes, the connection eventually timeout, but the timeout value can be really long (up to 58s).

Kershaw Chang [:kershaw]

Assignee

Comment 8

•

2 years ago

My proposal to fix this bug is:

Set mLastReadEpoch to a value that is bigger than mPingThreshold in Http2Session::SendPing, so we really check if we receive server's ack after network change.
Consider to call VerifyTraiifc only when the network id changed.
During the time of VerifyTraiifc, not dispatching the transactions to connections, since we don't know those connections are still alive.
Consider to use a shorter network change timeout for TRR connection (it's 5s currently).

Valentin, what do you think? Thanks.

Flags: needinfo?(valentin.gosu)

Valentin Gosu [:valentin] (he/him)

Comment 9

•

2 years ago

Sounds great, Kershaw. Thanks!

During the time of VerifyTraiifc, not dispatching the transactions to connections, since we don't know those connections are still alive.

I'm not sure about this one. Feels like it could delay the transactions unnecessarily when the connection still works. Maybe we could put it behind a pref and try it out only on nightly? 🤷‍♂️

Flags: needinfo?(valentin.gosu)

Valentin Gosu [:valentin] (he/him)

Updated

•

2 years ago

Assignee: valentin.gosu → kershaw

Phabricator Automation

Updated

•

2 years ago

Attachment #9319685 - Attachment is obsolete: true

Kershaw Chang [:kershaw]

Assignee

Comment 10

•

2 years ago

Attached file Bug 1647985 - Use shorter ping timeout for the connection to TRR, r=#necko — Details

Pulsebot

Comment 11

•

2 years ago

Pushed by kjang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/23b5f3348457
Use shorter ping timeout for the connection to TRR, r=necko-reviewers,valentin

Cristina Horotan [:chorotan]

Comment 12

•

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/23b5f3348457

Status: NEW → RESOLVED

Closed: 2 years ago

status-firefox113: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 113 Branch

Ryan VanderMeulen [:RyanVM]

Comment 13

•

2 years ago

This seems like it might be a good uplift candidate?

status-firefox111: --- → wontfix

status-firefox112: --- → affected

status-firefox-esr102: --- → affected

Flags: needinfo?(kershaw)

Kershaw Chang [:kershaw]

Assignee

Comment 14

•

2 years ago

(In reply to Ryan VanderMeulen [:RyanVM] from comment #13)

This seems like it might be a good uplift candidate?

This patch only helps users who use TRR mode 3 (not a lot), so I think we don't need to uplift this.

Flags: needinfo?(kershaw)

Ryan VanderMeulen [:RyanVM]

Updated

•

2 years ago

status-firefox112: affected → wontfix

status-firefox-esr102: affected → wontfix

Mihai Boldan, Desktop QA [:mboldan]

Updated

•

1 year ago

Flags: qe-verify+

Anca Soncutean, Desktop QA

Comment 15

•

1 year ago

I couldn't reproduce this issue with the affected builds on Ubuntu 20 and Windows 10, with mode 3 (The VPN used was NordVPN, for which I have subscription). I don't see a recent activity from the reporter, but maybe, if around, and time permits, ajft can you confirm this issue is no longer reproducible on your side with the latest Firefox build? Thank you!

Flags: needinfo?(ajft)

ajft

Reporter

Comment 16

•

1 year ago

I'm not able to reproduce this, seems to no longer be an issue

Current PC is up to Ubuntu 23.04, VPN is still Cisco AnyConnect but is now version 4.10.0511

Flags: needinfo?(ajft)

Anca Soncutean, Desktop QA

Comment 17

•

1 year ago

Thank you for your time ajft. Based on comment 16, I will mark this issue as verified.

Status: RESOLVED → VERIFIED

Flags: qe-verify+

You need to log in before you can comment on or make changes to this bug.