Closed Bug 1495523 Opened Last year Closed Last year

TRR: Disable TRR for a while if we detect poorly performing/crashed nameserver

Categories

(Core :: Networking: DNS, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
mozilla64
Tracking Status
firefox63 --- fixed
firefox64 --- fixed

People

(Reporter: jduell.mcbugs, Assigned: bagder)

Details

(Whiteboard: [necko-triaged][trr])

Attachments

(1 file)

Right now our failure modality for TRR is to wait for TRR to timeout (3 seconds), then try regular DNS.

If the TRR servers gets in a bad state, this could slow down all our channels by 3 sec, which is Bad.

We're going to drop the limit from 3 Secs to something lower for now, but the real fix is to launch a regular DNS requests if the TRR reply hasn't come back after some period.  We're seeing a median response time of around 40ms for TRR with Cloudflare, so maybe a 100 ms timeout would work well.
Even 100ms extra wait per channel (if TRR is performing poorly or broken) is a significant hit--I'm wondering if there are other things we can do here.

If the TRR server is down, I assume our current code with try to connect to it every time, and max out the 3 sec timeout every time?  We'd be better off in that case disabling TRR for a period (and try it again at some point to see if things are OK again).   It's probably worth doing this if we hit the timeout consistently for any reason above some threshold (like 50% or more of requests are timing out, etc--I'm not sure of the right % cutoff).
Summary: TRR: race regular DNS request after N milliseconds → TRR: Handle poorly performing/crashed servers more performantly
Priority: -- → P2
Whiteboard: [necko-triaged][trr]
OK, I'm going to split off the "race regular DNS after 100ms:" work to a different bug.  

For now I think we want the comment 1 algorithm--Disable TRR for a while if we detect poorly performing/crashed nameserver.

Note: if the TRR server is down when the browser session starts, we'll be covered by our initial "does TRR work?" check.  This bug is to improve our handling of the case when we start off with a working TRR server which then crashes or becomes unresponsive.  The "race regular DNS after 100ms" bug will handle the case where the TRR server stays up but becomes slow.

Would be nice to have this bug for 63 if possible.  (the racing with regular DNS can wait for later)
Summary: TRR: Handle poorly performing/crashed servers more performantly → TRR: Disable TRR for a while if we detect poorly performing/crashed nameserver
Assignee: nobody → daniel
Status: NEW → ASSIGNED
Priority: P2 → P1
I suggest this functionality:

If NUM_FAILS TRR requests in a row fail (OnStopRequest is called with error or we hit the time-out), mark TRR as non-functional and go back to need-to-verify-NS state. The same state that TRR starts up in, which makes it attempt a plain NS lookup of example.com just to make sure the DoH server works.

If TRR fails to resolve NS using the specified DoH server, it currently will just stop trying and not use TRR for the rest of the session until the browser is restarted or the TRR URI is updated. I suggest we instead put it on a timer so that it retries the NS again after TIME_RETRY minutes/seconds.

Rationale: it will fix this bug but it will also fix the case when we start Firefox with a TRR URI while the server is dead/gone and it comes back during the session. This approach would be a way to eventually notice that the server is back and start using it again.

I propose that NUM_FAILS defaults to 5 and TIME_RETRY to 180 seconds, both made as prefs.
Revised take, to better re-use existing code:

If NUM_FAILS TRR requests in a row fail (OnStopRequest is called with error or we hit the time-out), go back to need-to-verify-NS state. The same state that TRR starts up in, which makes it attempt a plain NS lookup of example.com just to make sure the DoH server works. Until the NS lookup works, TRR will not be used.

If TRR fails to resolve NS using the specified DoH server, make it retry using an increasing interval. Starting with 1000 milliseconds and double for every attempt up to 64 seconds. This logic is already used for TRR-ONLY.

NUM_FAILS is a pref that defaults to 5.
MozReview-Commit-ID: 2dSEY6DuP2A
Pushed by daniel@haxx.se:
https://hg.mozilla.org/integration/autoland/rev/f062d23be181
disable TRR after max-fails number of failed requests r=valentin
https://hg.mozilla.org/mozilla-central/rev/f062d23be181
Status: ASSIGNED → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
Comment on attachment 9013967 [details]
bug 1495523 - disable TRR after max-fails number of failed requests

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: Bug 1495523

User impact if declined: When a DoH suddenly dies, goes bad or the user ventures into a really bad network (to the DoH server), getting "stuck" on DoH might contribute to a worse user experience. This patch helps detect such badness and takes active counter-measures.

Is this code covered by automated tests?: No

Has the fix been verified in Nightly?: Yes

Needs manual test from QE?: No

If yes, steps to reproduce: 

List of other uplifts needed: None

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): It's not really risky, but it adds a timer and exponential backing off retry attempts when the DoH server has failed which means new code paths are traveled.

String changes made/needed:
Attachment #9013967 - Flags: approval-mozilla-beta?
Comment on attachment 9013967 [details]
bug 1495523 - disable TRR after max-fails number of failed requests

Looks reasonably safe and, uplift approved for 63 beta 13, thanks.
Attachment #9013967 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.