Closed Bug 1656862 Opened 4 years ago Closed 4 years ago

Improve TRR_SKIP_REASON telemetry

Tracking

()

Status:

RESOLVED FIXED

Milestone:

81 Branch

Tracking Flags:

Tracking

Status

firefox80

fixed

firefox81

---

fixed

People

(Reporter: valentin, Assigned: valentin)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged])

Attachments

(1 file)

Bug 1656862 - Record telemetry for TRR_DISABLED_FLAG r=dragana 4 years ago Valentin Gosu [:valentin] (he/him) 47 bytes, text/x-phabricator-request	jcristau : approval-mozilla-beta+	Details \| Review

Valentin Gosu [:valentin] (he/him)

Assignee

Description

•

4 years ago

Followup to bug 1649143 comment 11:

Going forward we should:

Fix TRR_DISABLED_FLAG never being used.

See why we have TRR_OFF_EXPLICIT and TRR_MODE_NOT_ENABLED results

Figure out why TRR_NOT_CONFIRMED is so large, and how to make confirmation work better

See if the TRR_EXCLUDED reports are accurate. They may involve resolving "localhost" (even from the browser), or maybe domains on the local network?

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 1

•

4 years ago

Attached file Bug 1656862 - Record telemetry for TRR_DISABLED_FLAG r=dragana — Details

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 2

•

4 years ago

(In reply to Valentin Gosu [:valentin] (he/him) from comment #0)

See why we have TRR_OFF_EXPLICIT and TRR_MODE_NOT_ENABLED results

So, I've looked at the code that controls the TRRService mode

I think the mismatch is caused by the TRR mode flipping between mode 2 and mode0.

TRR_MODE_NOT_ENABLED 6 1.78M (3.88%) - this is larger than expected. TRR_SKIP_REASON_TRR_FIRST is only reported in mode2, but the code seems to think it's in mode0

This means we record the skip reason when in mode0, then at the end we are in mode2 and report the reason.
This is similar to TRR_NOT_CONFIRMED, but it's a race against the heuristics, not the confirmation process.

I assume the few TRR_OFF_EXPLICIT reports that we do have are caused by user activity, changing from mode5 to mode2? AFAICT doh-rollout.mode never goes from 5 to 2.

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

4 years ago

Blocks: doh

Dragana Damjanovic [:dragana]

Comment 3

•

4 years ago

I think the mismatch is caused by the TRR mode flipping between mode 2 and mode0.

TRR_MODE_NOT_ENABLED 6 1.78M (3.88%) - this is larger than expected. TRR_SKIP_REASON_TRR_FIRST is only reported in mode2, but the code seems to think it's in mode0

I find this too high. What do you think?
Can we add a check, add timestamps when we set the "reason" and when we change the mode and compare them and report if our assumption was right. This may be an indicator that we are changing the trr mode too often.
I think we will need to measure how often we change the trr mode (I think we can get that from addon telemetry). As I remember the old stats we will need to improve this.

Pulsebot

Comment 4

•

4 years ago

Pushed by valentin.gosu@gmail.com: https://hg.mozilla.org/integration/autoland/rev/c61c4ac846f6 Record telemetry for TRR_DISABLED_FLAG r=dragana,necko-reviewers

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 5

•

4 years ago

[Tracking Requested - why for this release]:
We want to have accurate telemetry for this probe once it gets to release.

tracking-firefox80: --- → ?

Keywords: leave-open

Natalia Csoregi [:nataliaCs]

Comment 6

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/c61c4ac846f6

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 7

•

4 years ago

(In reply to Dragana Damjanovic [:dragana] from comment #3)

I think the mismatch is caused by the TRR mode flipping between mode 2 and mode0.

TRR_MODE_NOT_ENABLED 6 1.78M (3.88%) - this is larger than expected. TRR_SKIP_REASON_TRR_FIRST is only reported in mode2, but the code seems to think it's in mode0

I find this too high. What do you think?
Can we add a check, add timestamps when we set the "reason" and when we change the mode and compare them and report if our assumption was right. This may be an indicator that we are changing the trr mode too often.

I agree this is higher than expected. Normally we shouldn't be changing TRR mode at all.
I suspect what's happening is that the computer comes out of suspension, connectivity changes (and we debounce) we run heuristics, but connectivity is down, so we disable. Connectivity comes back up, we go into mode 2.

Nihanth, do you think that's possible?
Should we try using navigator.onLine instead of gNetworkLinkService.isLinkUp ?

Flags: needinfo?(nhnt11)

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 8

•

4 years ago

(In reply to Valentin Gosu [:valentin] (he/him) from comment #0)

TRR_EXCLUDED 26 6.12M (13.36%) - This is a fair chunk of results.

See if the TRR_EXCLUDED reports are accurate. They may involve resolving "localhost" (even from the browser), or maybe domains on the local network?

I've figured out why this is so large.
This is caused by the split horizon mitigations we have in place on Windows. Specifically, if the user has a VPN, proxy or NRPT we will treat all resolutions as "excluded from TRR"

Nightly 81:
Windows: 14.42M (13.58%)
Linux: 654.95k (3.76%)
OSX: 268.98k (2.03%)

Beta 80:
Windows: 16.93M (6.06%)
Linux: 122.27k (1.83%)
OSX: 128.03k (0.74%)

I think this confirms the cause of the issue. The remaining question is: do we think the split horizon mitigations are important enough to be kept in place, even though they affect a fair number of our DoH population? Should we move it to the doh-rollout heuristics, so it doesn't show up as "TRR enabled" even when exclude everything? And more importantly maybe, so we don't skip TRR when this is enabled by the user.

Flags: needinfo?(nhnguyen)

Nhi Nguyen (:nhi)

Comment 9

•

4 years ago

I think the split horizon mitigations should be moved to the doh-rollout heuristics (they are indeed heuristics).

The change should be documented in the release notes, as users who in TRR mode 2 may see breakage in split horizon cases.

I don't see this change as high priority, though. Do we currently have some signals to not count those resolutions when calculating TRR fallback rate?

Flags: needinfo?(nhnguyen) → needinfo?(valentin.gosu)

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 10

•

4 years ago

(In reply to Nhi Nguyen (:nhi) from comment #9)

I think the split horizon mitigations should be moved to the doh-rollout heuristics (they are indeed heuristics).

The change should be documented in the release notes, as users who in TRR mode 2 may see breakage in split horizon cases.

I don't see this change as high priority, though. Do we currently have some signals to not count those resolutions when calculating TRR fallback rate?

I don't think we have a way to do that at the moment. The conversion should be easy, and it would us to see how often we trigger the probe too.

Flags: needinfo?(valentin.gosu)

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

4 years ago

Depends on: 1657897

Julien Cristau [:jcristau]

Comment 11

•

4 years ago

Is there anything that needs to happen for 80 here? We're got 2 beta builds left before release candidates.

Flags: needinfo?(valentin.gosu)

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 12

•

4 years ago

Comment on attachment 9167650 [details]
Bug 1656862 - Record telemetry for TRR_DISABLED_FLAG r=dragana

Beta/Release Uplift Approval Request

User impact if declined: TRR_SKIP_REASON telemetry will be slightly off.
Is this code covered by automated tests?: No
Has the fix been verified in Nightly?: Yes
Needs manual test from QE?: No
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): We only make sure to set the recorded variable to a different value when the flag is present.
String changes made/needed:

Flags: needinfo?(valentin.gosu)

Attachment #9167650 - Flags: approval-mozilla-beta?

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

4 years ago

Depends on: 1658277

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

4 years ago

Depends on: 1658278

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

4 years ago

Status: NEW → RESOLVED

Closed: 4 years ago

Keywords: leave-open

Resolution: --- → FIXED

Julien Cristau [:jcristau]

Updated

•

4 years ago

status-firefox81: --- → fixed

Target Milestone: --- → 81 Branch

Julien Cristau [:jcristau]

Updated

•

4 years ago

status-firefox80: --- → fix-optional

tracking-firefox80: ? → -

Julien Cristau [:jcristau]

Comment 13

•

4 years ago

Comment on attachment 9167650 [details]
Bug 1656862 - Record telemetry for TRR_DISABLED_FLAG r=dragana

approved for 80.0b7

Attachment #9167650 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Julien Cristau [:jcristau]

Comment 14

•

4 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/fba85087a129

status-firefox80: fix-optional → fixed

(inactive) Nihanth Subramanya [:nhnt11]

Updated

•

4 years ago

Flags: needinfo?(nhnt11)

You need to log in before you can comment on or make changes to this bug.