Record telemetry with reason we fell back to Do53
Categories
(Core :: Networking: DNS, task, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox80 | --- | fixed |
People
(Reporter: valentin, Assigned: valentin)
References
(Blocks 1 open bug)
Details
(Whiteboard: [necko-triaged][trr])
Attachments
(3 files, 3 obsolete files)
47 bytes,
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-beta-
|
Details | Review |
47 bytes,
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-beta-
|
Details | Review |
2.51 KB,
text/plain
|
mmccorquodale
:
data-review+
|
Details |
There are currently a bunch of reasons for falling back to Do53, but we don't record them in telemetry. Here's an incomplete list of reasons:
- TRR service hasn't yet finished the confirmation stage
- TRR timer has expired
- TRR request returned no IPs or returned an error code
- RES_DISABLE_TRR flag was passed
- domain was present in exclusion list
- domain was blocklisted because of earlier failure
- etc?
Assignee | ||
Comment 1•4 years ago
|
||
Assignee | ||
Comment 2•4 years ago
|
||
There is done in order to make sure we don't go though NameLookup again
and try to figure out why skipped TRR again.
Depends on D82168
Comment 3•4 years ago
|
||
Comment 4•4 years ago
|
||
Whoops, forgot to mention the category.
Comment 5•4 years ago
|
||
Comment 7•4 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/8f0602cfcc33
https://hg.mozilla.org/mozilla-central/rev/84477e4c8237
Comment 8•4 years ago
|
||
Comment on attachment 9161209 [details]
Bug 1649143 - Record telemetry with reason we fell back to Do53 r=dragana
Beta/Release Uplift Approval Request
- User impact if declined: TRR is not used as much as we expected. This bug add detail telemetry probe to help us understand why it is not used.
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): It only adds 2 new telemetry probes.
- String changes made/needed:
Updated•4 years ago
|
Comment 9•4 years ago
|
||
Have we verified that the incoming Telemetry data from Nightly looks sane so far?
Comment 10•4 years ago
|
||
I did look at it, and I am planning to spend some more time today. I will let you know by the late afternoon my time. (I need to extract certain data to be able to meaningfully look at them)
Assignee | ||
Comment 11•4 years ago
|
||
Looking at TRR_SKIP_REASON_TRR_FIRST in nightly I see the following distribution:
TRR_UNSET 0 1.86M (4.06%) - indicates there's a code path we don't report
TRR_OK 1 28.55M (62.36%) - realistic success rate for TRR requests
TRR_PARENTAL_CONTROL 3 5.86k (0.01%) - very small
TRR_OFF_EXPLICIT 4 128 (0%) - extremely small, but similar to TRR_MODE_NOT_ENABLED, should be 0
TRR_REQ_MODE_DISABLED 5 2.14M (4.68%) - we can probably ignore these. They're made for connectivity service checks, or by TRR channel resolves.
TRR_MODE_NOT_ENABLED 6 1.78M (3.88%) - this is larger than expected. TRR_SKIP_REASON_TRR_FIRST is only reported in mode2, but the code seems to think it's in mode0
TRR_FAILED 7 23.41k (0.05%) - unknown failure
TRR_DISABLED_FLAG 10 0 (0%) - this is a bug. We never set this in the code
TRR_TIMEOUT 11 342.79k (0.75%) - quite small
TRR_CHANNEL_DNS_FAIL 12 4.58k (0.01%) - fairly small
TRR_IS_OFFLINE 13 19.16k (0.04%) - very small
TRR_NOT_CONFIRMED 14 4.25M (9.29%) - this is larger than I'd like. This means we either make a lot of requests at startup before we have a TRR connection, or the confirmation status changes too often
TRR_DID_NOT_MAKE_QUERY 15 25.87k (0.06%) - small
TRR_UNKNOWN_CHANNEL_FAILURE 16 21.57k (0.05%) - small
TRR_HOST_BLOCKED_TEMPORARY 17 278.77k (0.61%) - small (this is good, because it means we don't hit blocklisted hostnames that often)
TRR_DECODE_FAILED 25 356.41k (0.78%) - small
TRR_EXCLUDED 26 6.12M (13.36%) - This is a fair chunk of results. We should look into how often it happens.
Going forward we should:
- Fix TRR_DISABLED_FLAG never being used.
- See why we have TRR_OFF_EXPLICIT and TRR_MODE_NOT_ENABLED results
- Figure out why TRR_NOT_CONFIRMED is so large, and how to make confirmation work better
- See if the TRR_EXCLUDED reports are accurate. They may involve resolving "localhost" (even from the browser), or maybe domains on the local network?
Comment 12•4 years ago
|
||
It's not clear to me based on the recent comments whether we're feeling good about the quality of the incoming Telemetry data or not (and attempts to discuss on Slack haven't gotten very far). Also, at least the first patch needs rebasing for uplift. This is going to miss RC1 but I'm leaving this set to affected as a possible RC2 ride-along. Please attached rebased patches and request release approval when ready.
Updated•4 years ago
|
Updated•4 years ago
|
Comment 13•4 years ago
|
||
Comment 14•4 years ago
|
||
Comment 15•4 years ago
|
||
Comment on attachment 9164893 [details]
Bug 1649143 - Record telemetry with reason we fell back to Do53; rebased on beta; r=dragana
Beta/Release Uplift Approval Request
- User impact if declined: TRR is not used as much as we expected. This bug add detail telemetry probe to help us understand why it is not used.
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: none
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): It only adds 2 new telemetry probes.
- String changes made/needed:
Updated•4 years ago
|
Comment 17•4 years ago
|
||
We would like to get this patch into 79 to get data to debug TRR fallback rate, which is currently one of the blockers for DoH rollout. Please let me know if you still have concerns about this patch that haven't been addressed.
Comment 18•4 years ago
|
||
See comment 12. Right now, there's no driver for an RC2, however. We'll leave this on the radar for 79.0.1.
Comment 19•4 years ago
|
||
Comment on attachment 9164893 [details]
Bug 1649143 - Record telemetry with reason we fell back to Do53; rebased on beta; r=dragana
we're a week from shipping 80, there are no plans for a 79 dot release
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Description
•