Closed Bug 1495523 Opened 7 years ago Closed 7 years ago

TRR: Disable TRR for a while if we detect poorly performing/crashed nameserver

Status: RESOLVED FIXED

Milestone: mozilla64

Tracking Flags:

            Tracking   Status
firefox63   ---        fixed
firefox64   ---        fixed

People

(Reporter: jduell.mcbugs, Assigned: daniel)

Details

(Whiteboard: [necko-triaged][trr])

Attachments

(1 file)

bug 1495523 - disable TRR after max-fails number of failed requests (7 years ago, 46 bytes, text/x-phabricator-request; pascalc: approval-mozilla-beta+)

Jason Duell

Reporter

Description

7 years ago

Right now our failure mode for TRR is to wait for TRR to time out (3 seconds), then try regular DNS. If the TRR server gets into a bad state, this could slow down all our channels by 3 seconds, which is Bad. We're going to drop the limit from 3 seconds to something lower for now, but the real fix is to launch a regular DNS request if the TRR reply hasn't come back after some period. We're seeing a median response time of around 40 ms for TRR with Cloudflare, so maybe a 100 ms timeout would work well.
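For illustration only, the racing idea described above could be sketched as follows. This is not the Firefox implementation: `resolve_racing`, `trr_lookup`, `native_lookup` and `head_start` are hypothetical names, and the two lookups are assumed to be plain coroutines that return an address or raise on failure.

```python
import asyncio


async def resolve_racing(trr_lookup, native_lookup, head_start=0.1):
    """Give TRR a head start, then race it against regular DNS.

    trr_lookup and native_lookup are hypothetical zero-argument coroutine
    functions; head_start is the delay in seconds (0.1 = the 100 ms above).
    Returns a (source, answer) tuple for the first lookup that succeeds.
    """
    trr_task = asyncio.ensure_future(trr_lookup())
    done, pending = await asyncio.wait({trr_task}, timeout=head_start)
    if trr_task in done and trr_task.exception() is None:
        return ("trr", trr_task.result())

    # TRR is slow (still pending) or has already failed: start regular DNS.
    sources = {asyncio.ensure_future(native_lookup()): "native"}
    if trr_task in pending:
        sources[trr_task] = "trr"

    remaining = set(sources)
    try:
        while remaining:
            done, remaining = await asyncio.wait(
                remaining, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return (sources[task], task.result())
        raise OSError("both the TRR and the regular DNS lookup failed")
    finally:
        # Cancel whichever lookup lost the race and is still running.
        for task in remaining:
            task.cancel()
```

With a healthy TRR server the regular DNS request is never sent, so the extra cost only appears once TRR exceeds the head start.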

Jason Duell

Reporter

Comment 1

7 years ago

Even 100 ms of extra wait per channel (if TRR is performing poorly or broken) is a significant hit, so I'm wondering if there are other things we can do here. If the TRR server is down, I assume our current code will try to connect to it every time, and max out the 3 second timeout every time? We'd be better off in that case disabling TRR for a period (and trying it again at some point to see if things are OK again). It's probably worth doing this if we hit the timeout consistently for any reason above some threshold (like 50% or more of requests timing out; I'm not sure of the right % cutoff).
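A minimal sketch of the threshold idea above, assuming a sliding window over the most recent requests. The class name, the window size and the 0.5 cutoff are illustrative assumptions; the comment itself leaves the right cutoff open.

```python
from collections import deque


class TimeoutRateMonitor:
    """Tracks recent TRR outcomes and flags a high timeout rate."""

    def __init__(self, window=20, threshold=0.5):
        # window and threshold are illustrative values, not Firefox prefs.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, timed_out):
        """Record one finished TRR request (True if it timed out)."""
        self.outcomes.append(bool(timed_out))

    def should_disable(self):
        """True once the window is full and the timeout rate hits the threshold."""
        # Wait for a full window so a single early timeout does not trip it.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return sum(self.outcomes) / len(self.outcomes) >= self.threshold
```

A caller would invoke `record()` from the request completion path and stop using TRR for a while when `should_disable()` returns true.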

Summary: TRR: race regular DNS request after N milliseconds → TRR: Handle poorly performing/crashed servers more performantly

Assignee

Updated

7 years ago

Priority: -- → P2

Whiteboard: [necko-triaged][trr]

Jason Duell

Reporter

Comment 2

7 years ago

OK, I'm going to split off the "race regular DNS after 100ms" work to a different bug. For now I think we want the comment 1 algorithm: disable TRR for a while if we detect a poorly performing/crashed nameserver.

Note: if the TRR server is down when the browser session starts, we'll be covered by our initial "does TRR work?" check. This bug is to improve our handling of the case where we start off with a working TRR server which then crashes or becomes unresponsive. The "race regular DNS after 100ms" bug will handle the case where the TRR server stays up but becomes slow.

Would be nice to have this bug fixed for 63 if possible (the racing with regular DNS can wait for later).

Summary: TRR: Handle poorly performing/crashed servers more performantly → TRR: Disable TRR for a while if we detect poorly performing/crashed nameserver

Assignee

Updated

7 years ago

Assignee: nobody → daniel

Status: NEW → ASSIGNED

Priority: P2 → P1

Assignee

Comment 3

7 years ago

I suggest this functionality:

If NUM_FAILS TRR requests in a row fail (OnStopRequest is called with an error or we hit the time-out), mark TRR as non-functional and go back to the need-to-verify-NS state. This is the same state that TRR starts up in, which makes it attempt a plain NS lookup of example.com just to make sure the DoH server works.

If TRR fails to resolve NS using the specified DoH server, it currently just stops trying and does not use TRR for the rest of the session, until the browser is restarted or the TRR URI is updated. I suggest we instead put it on a timer so that it retries the NS lookup after TIME_RETRY minutes/seconds.

Rationale: it will fix this bug, but it will also fix the case where we start Firefox with a TRR URI while the server is dead/gone and it comes back during the session. This approach would be a way to eventually notice that the server is back and start using it again.

I propose that NUM_FAILS defaults to 5 and TIME_RETRY to 180 seconds, both exposed as prefs.
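The proposal above can be read as a small state machine. The sketch below is one possible reading, not the patch that landed: `TrrHealth` and its method names are hypothetical, and it assumes the NS check is attempted immediately after the failure limit is reached and then every TIME_RETRY seconds while it keeps failing.

```python
import time


class TrrHealth:
    """Consecutive-failure counter with a fixed-interval NS re-check."""

    def __init__(self, num_fails=5, time_retry=180.0, clock=time.monotonic):
        self.num_fails = num_fails      # NUM_FAILS from the comment
        self.time_retry = time_retry    # TIME_RETRY, in seconds
        self.clock = clock              # injectable for testing
        self.confirmed = True           # assumes the startup NS check passed
        self.fails_in_a_row = 0
        self.next_check_at = None

    def on_request_done(self, ok):
        """Call when a TRR request finishes; ok is False on error or timeout."""
        if ok:
            self.fails_in_a_row = 0
            return
        self.fails_in_a_row += 1
        if self.confirmed and self.fails_in_a_row >= self.num_fails:
            # Back to the need-to-verify-NS state; check right away.
            self.confirmed = False
            self.next_check_at = self.clock()

    def ns_check_due(self):
        """True when TRR is unconfirmed and an NS lookup should be sent."""
        return not self.confirmed and self.clock() >= self.next_check_at

    def on_ns_check_done(self, ok):
        """Call with the result of the NS verification lookup."""
        if ok:
            self.confirmed = True
            self.fails_in_a_row = 0
            self.next_check_at = None
        else:
            self.next_check_at = self.clock() + self.time_retry
```

While `confirmed` is false, callers would skip TRR and resolve through regular DNS only.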

Assignee

Comment 4

7 years ago

Revised take, to better re-use existing code:

If NUM_FAILS TRR requests in a row fail (OnStopRequest is called with an error or we hit the time-out), go back to the need-to-verify-NS state. This is the same state that TRR starts up in, which makes it attempt a plain NS lookup of example.com just to make sure the DoH server works. Until the NS lookup works, TRR will not be used.

If TRR fails to resolve NS using the specified DoH server, make it retry using an increasing interval, starting at 1000 milliseconds and doubling for every attempt, up to 64 seconds. This logic is already used for TRR-ONLY.

NUM_FAILS is a pref that defaults to 5.
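As a rough illustration of the retry schedule described above (the function name is hypothetical, and it is an assumption here that the interval stays at 64 seconds once the cap is reached, which the comment does not spell out):

```python
from itertools import islice


def ns_retry_intervals_ms(start_ms=1000, cap_ms=64000):
    """Yield the wait before each NS retry: start_ms, doubling up to cap_ms."""
    interval = start_ms
    while True:
        yield interval
        # Assumption: once the cap is reached, keep retrying at the cap.
        interval = min(interval * 2, cap_ms)


# First eight waits, in milliseconds.
print(list(islice(ns_retry_intervals_ms(), 8)))
# [1000, 2000, 4000, 8000, 16000, 32000, 64000, 64000]
```

Under these assumptions the cap is reached on the seventh attempt, roughly two minutes after the first failed NS check.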

Assignee

Comment 5

7 years ago

Attached file: bug 1495523 - disable TRR after max-fails number of failed requests

MozReview-Commit-ID: 2dSEY6DuP2A

Pulsebot

Comment 6

7 years ago

Pushed by daniel@haxx.se: https://hg.mozilla.org/integration/autoland/rev/f062d23be181 disable TRR after max-fails number of failed requests r=valentin

Andrei Ciure[:aciure]

Comment 7

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/f062d23be181

Status: ASSIGNED → RESOLVED

Closed: 7 years ago

status-firefox64: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla64

Assignee

Comment 8

7 years ago

Comment on attachment 9013967: bug 1495523 - disable TRR after max-fails number of failed requests

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: Bug 1495523
User impact if declined: When a DoH server suddenly dies or goes bad, or the user ventures into a really bad network (relative to the DoH server), getting "stuck" on DoH might contribute to a worse user experience. This patch helps detect such badness and takes active counter-measures.
Is this code covered by automated tests?: No
Has the fix been verified in Nightly?: Yes
Needs manual test from QE?: No
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): It's not really risky, but it adds a timer and exponential back-off retry attempts when the DoH server has failed, which means new code paths are traveled.
String changes made/needed:

Attachment #9013967 - Flags: approval-mozilla-beta?

Pascal Chevrel:pascalc

Comment 9

7 years ago

Comment on attachment 9013967: bug 1495523 - disable TRR after max-fails number of failed requests

Looks reasonably safe; uplift approved for 63 beta 13, thanks.

Attachment #9013967 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Ryan VanderMeulen [:RyanVM]

Comment 10

7 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/b75cfee22e1e

status-firefox63: --- → fixed

Categories

(Core :: Networking: DNS, enhancement, P1)

Security

(public)
