Closed Bug 1674088 Opened 4 years ago Closed 4 years ago

[Experiment] The "TRR_DISABLED_FLAG" and "TRR_NOT_CONFIRMED" buckets of the "TRR_SKIP_REASON_TRR_FIRST" histogram are larger than expected

Categories

(Core :: Networking: DNS, defect, P1)

Desktop
All
defect

Tracking

VERIFIED FIXED
85 Branch
Tracking Status
firefox82 --- wontfix
firefox83 --- wontfix
firefox84 --- verified
firefox85 --- verified

People

(Reporter: mheres, Assigned: valentin)

References

Details

(Whiteboard: [necko-triaged])

Attachments

(3 files)

Attached file log.txt.moz_log

[Affected versions]:

  • Firefox Release 82.0 (Build ID: 20201014125134)
  • Firefox Release 82.0.1 (Build ID: 20201026153733)
  • Firefox Release 82.0.2 (Build ID: 20201027185343)

[Affected Platforms]:

  • Windows 10
  • Linux Mint 20 (intermittent)

[Prerequisites]:

  • See this document for the environment variable needed.
  • Have Firefox Release installed and open.
  • Have the Remote Settings environment set to Stage.
  • Have doh-rollout.enabled set to true.
  • Be enrolled in the "Don't reset doh-rollout.mode at shutdown" branch of the "DNS-over-HTTPS usage rate study".

[Steps to reproduce]:

  1. Navigate to various webpages.
  2. Restart the browser.
  3. Navigate to various webpages.
  4. Navigate to "about:telemetry#histograms-tab" and observe the values of the TRR_SKIP_REASON_TRR_FIRST histogram.

[Expected result]:

  • The numbers of items in the 10 and 14 buckets are not large.

[Actual result]:

  • The numbers of items in the 10 and 14 buckets are larger than expected.
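One way to make "larger than expected" easier to compare across sessions is to look at the share of samples that land in the 10 and 14 buckets rather than at raw counts. The helper below is only a sketch: the function name is hypothetical, it assumes the histogram values have been copied by hand from about:telemetry into a bucket-to-count mapping, and the counts in the usage comment are placeholders, not measurements from this bug.

```python
def bucket_shares(histogram, buckets=(10, 14)):
    """Return each requested bucket's share of all recorded samples.

    `histogram` maps bucket number -> sample count, as copied by hand
    from about:telemetry (assumption: no particular export format).
    """
    total = sum(histogram.values())
    if total == 0:
        # No samples recorded yet: report a zero share for every bucket.
        return {bucket: 0.0 for bucket in buckets}
    return {bucket: histogram.get(bucket, 0) / total for bucket in buckets}


# Placeholder counts for illustration only (not taken from this bug):
# bucket_shares({1: 80, 10: 15, 14: 5}) -> {10: 0.15, 14: 0.05}
```

Comparing these shares before and after the restart in step 2, or between experiment branches, avoids being misled by sessions that simply resolved more names.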

[Notes]:

  • Attached are a recording of the issue and the log for the session displayed in the recording.
  • I am also attaching a log for a session using the "Both prefs" branch.
  • This issue also seems reproducible for the "Both prefs" branch, though the 14 bucket seems to have fewer items on that branch.
  • The issue does not seem reproducible for the "Control" or "Skip confirmation" branches.
  • The issue seems to only be partially reproducible for Linux (the 14 bucket is larger sometimes, but the 10 bucket remains small) and does not seem to be reproducible for macOS 10.15.

Updated the issue with the behavior observed on macOS and Linux.

Assignee: nobody → valentin.gosu
Priority: -- → P1
Whiteboard: [necko-triaged]

So, the cause of this is that the confirmation mechanism is also used to account for failures.
When a certain number of TRR failures occur (5 by default), we go into confirmation-failed mode and set a timer to recheck a second later.

Setting the "confirmationNS" pref to "skip" only makes confirmation instantaneous when it is attempted. That should improve things at startup, since we don't wait for a TRR request before making others.
But in cases where the TRR connection fails too often (because it is too slow, because it is blocked, because the server is down, etc.), it is probably better to keep this behaviour.
Improving it is definitely the goal, but it's hard to do that without incurring the 1.5-second penalty from waiting for each TRR request to time out.
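The failure accounting described above can be sketched roughly as follows. This is a simplified model, not the actual Necko code: the class, method, and field names are hypothetical, "CONFIRM_OK" is just a label chosen here for the working state, and both the "a success resets the counter" rule and the "reaching the limit trips the state" comparison are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConfirmationSketch:
    """Toy model of TRR failure accounting (all names are hypothetical)."""

    max_fails: int = 5          # "5 by default", per the comment above
    retry_delay_ms: int = 1000  # "recheck a second later"
    fails: int = 0
    state: str = "CONFIRM_OK"   # label chosen for this sketch
    pending_retry_ms: Optional[int] = None

    def record_trr_result(self, ok: bool) -> None:
        if self.state != "CONFIRM_OK":
            return  # already waiting for the retry timer
        if ok:
            self.fails = 0  # assumption: a success resets the counter
            return
        self.fails += 1
        if self.fails >= self.max_fails:  # assumption: >= rather than >
            # Too many failures: stop using TRR and schedule a recheck.
            self.state = "CONFIRM_FAILED"
            self.pending_retry_ms = self.retry_delay_ms

    def should_use_trr(self) -> bool:
        return self.state == "CONFIRM_OK"
```

In this model, a short burst of five failed requests is enough to flip the state, which is the kind of situation that would show up as skipped TRR lookups in the histogram; constructing the object with a larger `max_fails` makes it tolerate longer bursts.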

The TRR failure threshold (5) seems to be too low currently, which sometimes
leads to dropping confirmation on unexpected network failures or changes.

For various reasons (network change, temporary network congestion, etc.) we
may exceed the limit of TRR failures, going into the CONFIRM_FAILED state
and setting the timer for automatic retry.
When confirmation is not "skip" we want to reduce the time spent in that state
as much as possible - so if the failures have a transient cause, we should
retry as early as possible.

This patch reduces the initial timer to 125 ms (down from 1000 ms).
Exponential backoff is still in effect, so the only effect should be retrying
earlier. We also turn it into a pref, so it's easy to experiment with it to
find the perfect value.

Depends on D96822
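To illustrate the effect of lowering the initial timer while keeping exponential backoff, here is a small sketch of the retry schedule. It assumes the delay doubles after every failed retry and applies no upper cap unless one is passed in; the patch description does not state the growth factor or any cap, and the function name and parameters are hypothetical.

```python
def retry_schedule(initial_ms, attempts, factor=2, cap_ms=None):
    """Return the delays (in ms) before each of the first `attempts` retries.

    Assumption: each delay is `factor` times the previous one, optionally
    limited to `cap_ms`.
    """
    delays = []
    delay = initial_ms
    for _ in range(attempts):
        delays.append(delay)
        delay *= factor
        if cap_ms is not None:
            delay = min(delay, cap_ms)
    return delays


new = retry_schedule(125, 4)   # [125, 250, 500, 1000]
old = retry_schedule(1000, 4)  # [1000, 2000, 4000, 8000]
print(sum(new), sum(old))      # prints "1875 15000"
```

Under these assumptions, four retries complete within about 1.9 seconds with the 125 ms start, compared with 15 seconds when starting at 1000 ms, so a transient failure is recovered from much sooner while a persistent one still backs off.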

Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/2c8d4e805399
Increase TRR max fails from 5 to 15 r=necko-reviewers,dragana
https://hg.mozilla.org/integration/autoland/rev/f59bfe8ee981
Add pref for minimum delay time for TRR confirmation timer r=dragana,necko-reviewers
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 85 Branch

I have verified that, for the "Don't reset doh-rollout.mode at shutdown" and "Both prefs" branches of the "DNS-over-HTTPS usage rate study" experiment, the 10 bucket is now either no longer displayed both before and after a restart or, in one case, displayed with a small value before a restart and no longer displayed after it. The 14 bucket is comparable to the non-experiment version: sometimes slightly larger and sometimes slightly smaller, with most results showing it smaller.
Tested on Firefox Nightly 85.0a1 (Build ID: 20201118041908) using Windows 10 and Linux Mint 20.

Status: RESOLVED → VERIFIED

== Change summary for alert #27724 (as of Wed, 18 Nov 2020 18:52:04 GMT) ==

Improvements:

Ratio Suite Test Platform Options Absolute values (old vs new)
2% Base Content JS macosx1014-64-shippable-qr 2,837,974.00 -> 2,773,715.67
2% Base Content JS windows10-64-shippable-qr 2,824,399.00 -> 2,779,943.33
2% Base Content JS linux1804-64-shippable-qr 2,813,143.83 -> 2,770,475.67
1% Base Content JS linux1804-64-shippable 2,806,557.67 -> 2,771,861.33

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=27724

The patch landed in nightly and beta is affected.
:valentin, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(valentin.gosu)

Comment on attachment 9187351 [details]
Bug 1674088 - Add pref for minimum delay time for TRR confirmation timer r=dragana

Beta/Release Uplift Approval Request

  • User impact if declined: Slightly more DNS requests will be resolved using unencrypted DNS rather than DNS over HTTPS
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: Optional: see comment 0
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The patches increase the limit of allowed TRR failures/timeouts and decrease the timeout until we attempt to reenable the TRR service after the threshold is reached.
    This is not a logic change, but a tweaking of some parameters - at worst it could have a performance impact if more or fewer requests are being made, but experimental testing and telemetry didn't reveal any changes.
  • String changes made/needed:
Flags: needinfo?(valentin.gosu)
Attachment #9187351 - Flags: approval-mozilla-beta?
Attachment #9187350 - Flags: approval-mozilla-beta?
Flags: qe-verify+
QA Whiteboard: [qa-triaged]

Comment on attachment 9187350 [details]
Bug 1674088 - Increase TRR max fails from 5 to 15 r=#necko

Approved for 84.0b7.

Attachment #9187350 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9187351 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

I have verified that the 10 bucket is no longer displayed for the "Don't reset doh-rollout.mode at shutdown" and "Both prefs" branches of the "DNS-over-HTTPS usage rate study" experiment. The 14 bucket is comparable to the non-experiment version: sometimes slightly larger and sometimes slightly smaller.
Tested on Firefox Beta 84.0b7 (Build ID: 20201201213706) using Windows 10 and Linux Mint 20.

Flags: qe-verify+