Improve TRR_SKIP_REASON telemetry
Categories
(Core :: Networking: DNS, defect, P2)
Tracking
()
People
(Reporter: valentin, Assigned: valentin)
References
(Blocks 1 open bug)
Details
(Whiteboard: [necko-triaged])
Attachments
(1 file)
47 bytes,
text/x-phabricator-request
|
jcristau
:
approval-mozilla-beta+
|
Details | Review |
Followup to bug 1649143 comment 11:
Going forward we should:
- Fix TRR_DISABLED_FLAG never being used.
- See why we have TRR_OFF_EXPLICIT and TRR_MODE_NOT_ENABLED results
- Figure out why TRR_NOT_CONFIRMED is so large, and how to make confirmation work better
- See if the TRR_EXCLUDED reports are accurate. They may involve resolving "localhost" (even from the browser), or maybe domains on the local network?
Assignee | ||
Comment 1•4 years ago
|
||
Assignee | ||
Comment 2•4 years ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #0)
- See why we have TRR_OFF_EXPLICIT and TRR_MODE_NOT_ENABLED results
So, I've looked at the code that controls the TRRService mode
I think the mismatch is caused by the TRR mode flipping between mode 2 and mode0.
TRR_MODE_NOT_ENABLED 6 1.78M (3.88%) - this is larger than expected. TRR_SKIP_REASON_TRR_FIRST is only reported in mode2, but the code seems to think it's in mode0
This means we record the skip reason when in mode0, then at the end we are in mode2 and report the reason.
This is similar to TRR_NOT_CONFIRMED, but it's a race against the heuristics, not the confirmation process.
I assume the few TRR_OFF_EXPLICIT reports that we do have are caused by user activity, changing from mode5 to mode2? AFAICT doh-rollout.mode never goes from 5 to 2.
Comment 3•4 years ago
|
||
I think the mismatch is caused by the TRR mode flipping between mode 2 and mode0.
TRR_MODE_NOT_ENABLED 6 1.78M (3.88%) - this is larger than expected. TRR_SKIP_REASON_TRR_FIRST is only reported in mode2, but the code seems to think it's in mode0
I find this too high. What do you think?
Can we add a check, add timestamps when we set the "reason" and when we change the mode and compare them and report if our assumption was right. This may be an indicator that we are changing the trr mode too often.
I think we will need to measure how often we change the trr mode (I think we can get that from addon telemetry). As I remember the old stats we will need to improve this.
Assignee | ||
Comment 5•4 years ago
|
||
[Tracking Requested - why for this release]:
We want to have accurate telemetry for this probe once it gets to release.
Comment 6•4 years ago
|
||
bugherder |
Assignee | ||
Comment 7•4 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #3)
I think the mismatch is caused by the TRR mode flipping between mode 2 and mode0.
TRR_MODE_NOT_ENABLED 6 1.78M (3.88%) - this is larger than expected. TRR_SKIP_REASON_TRR_FIRST is only reported in mode2, but the code seems to think it's in mode0
I find this too high. What do you think?
Can we add a check, add timestamps when we set the "reason" and when we change the mode and compare them and report if our assumption was right. This may be an indicator that we are changing the trr mode too often.
I agree this is higher than expected. Normally we shouldn't be changing TRR mode at all.
I suspect what's happening is that the computer comes out of suspension, connectivity changes (and we debounce) we run heuristics, but connectivity is down, so we disable. Connectivity comes back up, we go into mode 2.
Nihanth, do you think that's possible?
Should we try using navigator.onLine
instead of gNetworkLinkService.isLinkUp
?
Assignee | ||
Comment 8•4 years ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #0)
TRR_EXCLUDED 26 6.12M (13.36%) - This is a fair chunk of results.
- See if the TRR_EXCLUDED reports are accurate. They may involve resolving "localhost" (even from the browser), or maybe domains on the local network?
I've figured out why this is so large.
This is caused by the split horizon mitigations we have in place on Windows. Specifically, if the user has a VPN, proxy or NRPT we will treat all resolutions as "excluded from TRR"
Nightly 81:
Windows: 14.42M (13.58%)
Linux: 654.95k (3.76%)
OSX: 268.98k (2.03%)
Beta 80:
Windows: 16.93M (6.06%)
Linux: 122.27k (1.83%)
OSX: 128.03k (0.74%)
I think this confirms the cause of the issue. The remaining question is: do we think the split horizon mitigations are important enough to be kept in place, even though they affect a fair number of our DoH population? Should we move it to the doh-rollout heuristics, so it doesn't show up as "TRR enabled" even when exclude everything? And more importantly maybe, so we don't skip TRR when this is enabled by the user.
Comment 9•4 years ago
|
||
I think the split horizon mitigations should be moved to the doh-rollout heuristics (they are indeed heuristics).
The change should be documented in the release notes, as users who in TRR mode 2 may see breakage in split horizon cases.
I don't see this change as high priority, though. Do we currently have some signals to not count those resolutions when calculating TRR fallback rate?
Assignee | ||
Comment 10•4 years ago
|
||
(In reply to Nhi Nguyen (:nhi) from comment #9)
I think the split horizon mitigations should be moved to the doh-rollout heuristics (they are indeed heuristics).
The change should be documented in the release notes, as users who in TRR mode 2 may see breakage in split horizon cases.
I don't see this change as high priority, though. Do we currently have some signals to not count those resolutions when calculating TRR fallback rate?
I don't think we have a way to do that at the moment. The conversion should be easy, and it would us to see how often we trigger the probe too.
Comment 11•4 years ago
|
||
Is there anything that needs to happen for 80 here? We're got 2 beta builds left before release candidates.
Assignee | ||
Comment 12•4 years ago
|
||
Comment on attachment 9167650 [details]
Bug 1656862 - Record telemetry for TRR_DISABLED_FLAG r=dragana
Beta/Release Uplift Approval Request
- User impact if declined: TRR_SKIP_REASON telemetry will be slightly off.
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): We only make sure to set the recorded variable to a different value when the flag is present.
- String changes made/needed:
Assignee | ||
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Comment 13•4 years ago
|
||
Comment on attachment 9167650 [details]
Bug 1656862 - Record telemetry for TRR_DISABLED_FLAG r=dragana
approved for 80.0b7
Comment 14•4 years ago
|
||
bugherder uplift |
Updated•4 years ago
|
Description
•