Closed Bug 1649143 Opened 4 years ago Closed 4 years ago

Record telemetry with reason we fell back to Do53

Categories

(Core :: Networking: DNS, task, P2)

task

Tracking

()

RESOLVED FIXED
mozilla80
Tracking Status
firefox80 --- fixed

People

(Reporter: valentin, Assigned: valentin)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged][trr])

Attachments

(3 files, 3 obsolete files)

There are currently a bunch of reasons for falling back to Do53, but we don't record them in telemetry. Here's an incomplete list of reasons:

  • TRR service hasn't yet finished the confirmation stage
  • TRR timer has expired
  • TRR request returned no IPs or returned an error code
  • RES_DISABLE_TRR flag was passed
  • domain was present in exclusion list
  • domain was blocklisted because of earlier failure
  • etc?
Blocks: doh

There is done in order to make sure we don't go though NameLookup again
and try to figure out why skipped TRR again.

Depends on D82168

Attached file Data collection review request (obsolete) —
Attachment #9161347 - Flags: data-review?(mmccorquodale)

Whoops, forgot to mention the category.

Attachment #9161347 - Attachment is obsolete: true
Attachment #9161347 - Flags: data-review?(mmccorquodale)
Attachment #9161348 - Flags: data-review?(mmccorquodale)
Comment on attachment 9161348 [details] Data collection review request 1. Is there or will there be documentation that describes the schema for the ultimate data set in a public, complete, and accurate way? This will be documented in the probe dictionary. 2. Is there a control mechanism that allows the user to turn the data collection on and off? Yes, users can opt out of telemetry collection. 3. If the request is for permanent data collection, is there someone who will monitor the data over time? Valentin will monitor this data. 4. Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under? Category 1, Technical data. 5. Is the data collection request for default-on or default-off? Default on. 6. Does the instrumentation include the addition of any new identifiers? No new identifiers. 7. Is the data collection covered by the existing Firefox privacy notice? Yes. 8. Does there need to be a check-in in the future to determine whether to renew the data? No. 9. Does the data collection use a third-party collection tool? No. ----- data-review +
Attachment #9161348 - Flags: data-review?(mmccorquodale) → data-review+
Pushed by valentin.gosu@gmail.com: https://hg.mozilla.org/integration/autoland/rev/8f0602cfcc33 Record telemetry with reason we fell back to Do53 r=dragana,necko-reviewers https://hg.mozilla.org/integration/autoland/rev/84477e4c8237 Just call NativeLookup when retrying for TTL r=dragana,necko-reviewers
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla80

Comment on attachment 9161209 [details]
Bug 1649143 - Record telemetry with reason we fell back to Do53 r=dragana

Beta/Release Uplift Approval Request

  • User impact if declined: TRR is not used as much as we expected. This bug add detail telemetry probe to help us understand why it is not used.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): It only adds 2 new telemetry probes.
  • String changes made/needed:
Attachment #9161209 - Flags: approval-mozilla-beta?
Attachment #9161210 - Flags: approval-mozilla-beta?

Have we verified that the incoming Telemetry data from Nightly looks sane so far?

Flags: needinfo?(valentin.gosu)

I did look at it, and I am planning to spend some more time today. I will let you know by the late afternoon my time. (I need to extract certain data to be able to meaningfully look at them)

Looking at TRR_SKIP_REASON_TRR_FIRST in nightly I see the following distribution:

TRR_UNSET 0 1.86M (4.06%) - indicates there's a code path we don't report
TRR_OK 1 28.55M (62.36%) - realistic success rate for TRR requests
TRR_PARENTAL_CONTROL 3 5.86k (0.01%) - very small
TRR_OFF_EXPLICIT 4 128 (0%) - extremely small, but similar to TRR_MODE_NOT_ENABLED, should be 0
TRR_REQ_MODE_DISABLED 5 2.14M (4.68%) - we can probably ignore these. They're made for connectivity service checks, or by TRR channel resolves.
TRR_MODE_NOT_ENABLED 6 1.78M (3.88%) - this is larger than expected. TRR_SKIP_REASON_TRR_FIRST is only reported in mode2, but the code seems to think it's in mode0
TRR_FAILED 7 23.41k (0.05%) - unknown failure
TRR_DISABLED_FLAG 10 0 (0%) - this is a bug. We never set this in the code
TRR_TIMEOUT 11 342.79k (0.75%) - quite small
TRR_CHANNEL_DNS_FAIL 12 4.58k (0.01%) - fairly small
TRR_IS_OFFLINE 13 19.16k (0.04%) - very small
TRR_NOT_CONFIRMED 14 4.25M (9.29%) - this is larger than I'd like. This means we either make a lot of requests at startup before we have a TRR connection, or the confirmation status changes too often
TRR_DID_NOT_MAKE_QUERY 15 25.87k (0.06%) - small
TRR_UNKNOWN_CHANNEL_FAILURE 16 21.57k (0.05%) - small
TRR_HOST_BLOCKED_TEMPORARY 17 278.77k (0.61%) - small (this is good, because it means we don't hit blocklisted hostnames that often)
TRR_DECODE_FAILED 25 356.41k (0.78%) - small
TRR_EXCLUDED 26 6.12M (13.36%) - This is a fair chunk of results. We should look into how often it happens.

Going forward we should:

  • Fix TRR_DISABLED_FLAG never being used.
  • See why we have TRR_OFF_EXPLICIT and TRR_MODE_NOT_ENABLED results
  • Figure out why TRR_NOT_CONFIRMED is so large, and how to make confirmation work better
  • See if the TRR_EXCLUDED reports are accurate. They may involve resolving "localhost" (even from the browser), or maybe domains on the local network?
Flags: needinfo?(valentin.gosu)

It's not clear to me based on the recent comments whether we're feeling good about the quality of the incoming Telemetry data or not (and attempts to discuss on Slack haven't gotten very far). Also, at least the first patch needs rebasing for uplift. This is going to miss RC1 but I'm leaving this set to affected as a possible RC2 ride-along. Please attached rebased patches and request release approval when ready.

Flags: needinfo?(valentin.gosu)
Attachment #9161209 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Attachment #9161210 - Flags: approval-mozilla-beta? → approval-mozilla-beta-

Comment on attachment 9164893 [details]
Bug 1649143 - Record telemetry with reason we fell back to Do53; rebased on beta; r=dragana

Beta/Release Uplift Approval Request

  • User impact if declined: TRR is not used as much as we expected. This bug add detail telemetry probe to help us understand why it is not used.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: none
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): It only adds 2 new telemetry probes.
  • String changes made/needed:
Attachment #9164893 - Flags: approval-mozilla-release?
Attachment #9164896 - Flags: approval-mozilla-release?

Thanks for the request, Dragana!

Flags: needinfo?(valentin.gosu)

We would like to get this patch into 79 to get data to debug TRR fallback rate, which is currently one of the blockers for DoH rollout. Please let me know if you still have concerns about this patch that haven't been addressed.

Flags: needinfo?(ryanvm)

See comment 12. Right now, there's no driver for an RC2, however. We'll leave this on the radar for 79.0.1.

Flags: needinfo?(ryanvm)
Blocks: 1656862

Comment on attachment 9164893 [details]
Bug 1649143 - Record telemetry with reason we fell back to Do53; rebased on beta; r=dragana

we're a week from shipping 80, there are no plans for a 79 dot release

Attachment #9164893 - Flags: approval-mozilla-release? → approval-mozilla-release-
Attachment #9164896 - Flags: approval-mozilla-release? → approval-mozilla-release-
Attachment #9164893 - Attachment is obsolete: true
Attachment #9164896 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: