Closed Bug 1656862 Opened 4 years ago Closed 4 years ago

Improve TRR_SKIP_REASON telemetry

Categories

(Core :: Networking: DNS, defect, P2)

defect

Tracking

()

RESOLVED FIXED
81 Branch
Tracking Status
firefox80 - fixed
firefox81 --- fixed

People

(Reporter: valentin, Assigned: valentin)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged])

Attachments

(1 file)

Followup to bug 1649143 comment 11:

Going forward we should:

  • Fix TRR_DISABLED_FLAG never being used.
  • See why we have TRR_OFF_EXPLICIT and TRR_MODE_NOT_ENABLED results
  • Figure out why TRR_NOT_CONFIRMED is so large, and how to make confirmation work better
  • See if the TRR_EXCLUDED reports are accurate. They may involve resolving "localhost" (even from the browser), or maybe domains on the local network?

(In reply to Valentin Gosu [:valentin] (he/him) from comment #0)

  • See why we have TRR_OFF_EXPLICIT and TRR_MODE_NOT_ENABLED results

So, I've looked at the code that controls the TRRService mode

I think the mismatch is caused by the TRR mode flipping between mode 2 and mode0.

TRR_MODE_NOT_ENABLED 6 1.78M (3.88%) - this is larger than expected. TRR_SKIP_REASON_TRR_FIRST is only reported in mode2, but the code seems to think it's in mode0

This means we record the skip reason when in mode0, then at the end we are in mode2 and report the reason.
This is similar to TRR_NOT_CONFIRMED, but it's a race against the heuristics, not the confirmation process.

I assume the few TRR_OFF_EXPLICIT reports that we do have are caused by user activity, changing from mode5 to mode2? AFAICT doh-rollout.mode never goes from 5 to 2.

Blocks: doh

I think the mismatch is caused by the TRR mode flipping between mode 2 and mode0.

TRR_MODE_NOT_ENABLED 6 1.78M (3.88%) - this is larger than expected. TRR_SKIP_REASON_TRR_FIRST is only reported in mode2, but the code seems to think it's in mode0

I find this too high. What do you think?
Can we add a check, add timestamps when we set the "reason" and when we change the mode and compare them and report if our assumption was right. This may be an indicator that we are changing the trr mode too often.
I think we will need to measure how often we change the trr mode (I think we can get that from addon telemetry). As I remember the old stats we will need to improve this.

Pushed by valentin.gosu@gmail.com: https://hg.mozilla.org/integration/autoland/rev/c61c4ac846f6 Record telemetry for TRR_DISABLED_FLAG r=dragana,necko-reviewers

[Tracking Requested - why for this release]:
We want to have accurate telemetry for this probe once it gets to release.

Keywords: leave-open

(In reply to Dragana Damjanovic [:dragana] from comment #3)

I think the mismatch is caused by the TRR mode flipping between mode 2 and mode0.

TRR_MODE_NOT_ENABLED 6 1.78M (3.88%) - this is larger than expected. TRR_SKIP_REASON_TRR_FIRST is only reported in mode2, but the code seems to think it's in mode0

I find this too high. What do you think?
Can we add a check, add timestamps when we set the "reason" and when we change the mode and compare them and report if our assumption was right. This may be an indicator that we are changing the trr mode too often.

I agree this is higher than expected. Normally we shouldn't be changing TRR mode at all.
I suspect what's happening is that the computer comes out of suspension, connectivity changes (and we debounce) we run heuristics, but connectivity is down, so we disable. Connectivity comes back up, we go into mode 2.

Nihanth, do you think that's possible?
Should we try using navigator.onLine instead of gNetworkLinkService.isLinkUp ?

Flags: needinfo?(nhnt11)

(In reply to Valentin Gosu [:valentin] (he/him) from comment #0)

TRR_EXCLUDED 26 6.12M (13.36%) - This is a fair chunk of results.

  • See if the TRR_EXCLUDED reports are accurate. They may involve resolving "localhost" (even from the browser), or maybe domains on the local network?

I've figured out why this is so large.
This is caused by the split horizon mitigations we have in place on Windows. Specifically, if the user has a VPN, proxy or NRPT we will treat all resolutions as "excluded from TRR"

Nightly 81:
Windows: 14.42M (13.58%)
Linux: 654.95k (3.76%)
OSX: 268.98k (2.03%)

Beta 80:
Windows: 16.93M (6.06%)
Linux: 122.27k (1.83%)
OSX: 128.03k (0.74%)

I think this confirms the cause of the issue. The remaining question is: do we think the split horizon mitigations are important enough to be kept in place, even though they affect a fair number of our DoH population? Should we move it to the doh-rollout heuristics, so it doesn't show up as "TRR enabled" even when exclude everything? And more importantly maybe, so we don't skip TRR when this is enabled by the user.

Flags: needinfo?(nhnguyen)

I think the split horizon mitigations should be moved to the doh-rollout heuristics (they are indeed heuristics).

The change should be documented in the release notes, as users who in TRR mode 2 may see breakage in split horizon cases.

I don't see this change as high priority, though. Do we currently have some signals to not count those resolutions when calculating TRR fallback rate?

Flags: needinfo?(nhnguyen) → needinfo?(valentin.gosu)

(In reply to Nhi Nguyen (:nhi) from comment #9)

I think the split horizon mitigations should be moved to the doh-rollout heuristics (they are indeed heuristics).

The change should be documented in the release notes, as users who in TRR mode 2 may see breakage in split horizon cases.

I don't see this change as high priority, though. Do we currently have some signals to not count those resolutions when calculating TRR fallback rate?

I don't think we have a way to do that at the moment. The conversion should be easy, and it would us to see how often we trigger the probe too.

Flags: needinfo?(valentin.gosu)
Depends on: 1657897

Is there anything that needs to happen for 80 here? We're got 2 beta builds left before release candidates.

Flags: needinfo?(valentin.gosu)

Comment on attachment 9167650 [details]
Bug 1656862 - Record telemetry for TRR_DISABLED_FLAG r=dragana

Beta/Release Uplift Approval Request

  • User impact if declined: TRR_SKIP_REASON telemetry will be slightly off.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): We only make sure to set the recorded variable to a different value when the flag is present.
  • String changes made/needed:
Flags: needinfo?(valentin.gosu)
Attachment #9167650 - Flags: approval-mozilla-beta?
Depends on: 1658277
Depends on: 1658278
Status: NEW → RESOLVED
Closed: 4 years ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → 81 Branch

Comment on attachment 9167650 [details]
Bug 1656862 - Record telemetry for TRR_DISABLED_FLAG r=dragana

approved for 80.0b7

Attachment #9167650 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: needinfo?(nhnt11)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: