Closed Bug 1526041 Opened 5 years ago Closed 5 years ago

Size WebRender experiment for release 66

Categories

(Data Science :: Investigation, task, P1)

task
Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: tdsmith, Assigned: tdsmith)

References

(Blocks 1 open bug)

Details

Brief description of the request: Estimate the population size for an experiment to validate WebRender performance against the MVP ship criteria in release 66, based on the performance of the current experiment in beta (Bug 1492568).

About 2.5% of beta users and 5% of release users have gfx_features_wrqualified_status == "available".

https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/79144/command/79156
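For reference, the shape of that estimate looks roughly like this; a minimal PySpark sketch, assuming the Databricks notebook's implicit `spark` session and a main_summary-style table that exposes gfx_features_wrqualified_status per ping (the table name, date window, and filters here are illustrative assumptions, not the exact ones in the linked notebook):

```python
# Sketch only: estimate the per-channel share of clients whose WebRender
# "qualified" status is "available". Table name, date window, and filters
# are assumptions; the linked notebook is authoritative.
from pyspark.sql import functions as F

ms = (
    spark.table("telemetry.main_summary")                    # assumed dataset
         .where(F.col("submission_date_s3") >= "20190101")   # assumed window
         .where(F.col("normalized_channel").isin("beta", "release"))
)

qualified_share = (
    ms.groupBy("normalized_channel", "client_id")
      # a client counts as qualified if any ping in the window reports "available"
      .agg(F.max(F.when(F.col("gfx_features_wrqualified_status") == "available", 1)
                  .otherwise(0)).alias("is_qualified"))
      .groupBy("normalized_channel")
      .agg(F.avg("is_qualified").alias("fraction_qualified"))
)

qualified_share.show()
```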

Jeff and telin, I'm looking at how we powered/proposed the v1 experiment and I think we're looking at the probes a little differently now. Can you confirm/comment on the targets for this round?

This is what I think we're looking at:

Probe | v1 success condition | New success condition
overall crash reports | ≤ 5% increase in crash rate | same
OOM crash reports | ≤ 5% increase in crash rate | same
shutdown crashes | ≤ 5% increase in crash rate | same
CANVAS_WEBGL_SUCCESS | ≤ 5% regression in median of fraction "True" per user | same
COMPOSITE_FRAME_ROUNDTRIP_TIME | expect to see an improvement here | Abandon?
COMPOSITE_TIME | ≤ 10% regression in median of per-user means | Median per-user fraction of slow frames < 0.5% (absolute)
CONTENT_FRAME_TIME | ≤ 10% regression in median of per-user means | Replaced by CONTENT_FRAME_TIME_VSYNC; ≤ 5% regression in median of per-user fraction of slow events
CONTENT_LARGE_PAINT_PHASE_WEIGHT | ≤ 5% regression in submission rates | Abandon?
CONTENT_PAINT_TIME | ≤ 5% regression in median of per-user means | Replaced by CONTENT_FULL_PAINT_TIME; ≤ 5% regression in fraction of slow paints (> 16 ms), ≤ 5% regression in median of per-user means
DEVICE_RESET_REASON | ≤ 5% increase in reset rate | same
FX_PAGE_LOAD_MS(_2) | ≤ 5% regression in median of per-user means | same
FX_TAB_CLICK_MS | ≤ 5% regression in median of per-user means | Replaced by FX_TAB_SWITCH_COMPOSITE_E10S_MS; same target

I'll also need to figure out what to do with CHECKERBOARD_{DURATION, PEAK, SEVERITY}.

Crash, submission, and device-reset rates will be measured as events per 1,000 usage hours. A "slow event" means a CONTENT_FULL_PAINT_TIME paint longer than 16 ms or a CONTENT_FRAME_TIME_VSYNC frame longer than 200% of the vsync interval.
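To make the slow-fraction statistic concrete, here is a minimal pure-Python sketch (assuming each client's accumulated CONTENT_FRAME_TIME_VSYNC histogram is available as a dict mapping bucket lower bound, in percent of vsync, to count; the client data below is entirely hypothetical, and the real analysis runs over the telemetry histograms in Spark):

```python
from statistics import median

SLOW_VSYNC_PCT = 200  # CONTENT_FRAME_TIME_VSYNC buckets are % of the vsync interval

def slow_fraction(histogram):
    """Fraction of a client's frames that took longer than 200% of a vsync."""
    total = sum(histogram.values())
    slow = sum(count for bucket, count in histogram.items() if bucket > SLOW_VSYNC_PCT)
    return slow / total if total else None

# Hypothetical per-client histograms for illustration only.
clients = {
    "a": {50: 900, 150: 80, 250: 20},
    "b": {50: 950, 150: 40, 400: 10},
}
fractions = [f for f in (slow_fraction(h) for h in clients.values()) if f is not None]

# The statistic compared between the WebRender and Gecko arms.
print(median(fractions))
```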

To summarize, the things I think might be different are:

  • Abandon CONTENT_LARGE_PAINT_PHASE_WEIGHT and COMPOSITE_FRAME_ROUNDTRIP_TIME (it isn't necessary to let them go, but we haven't been looking at them afaik)
  • Apply an absolute limit to slow COMPOSITE_TIME event rates based on the beta study, instead of a percentage difference vs Gecko -- it's many times "worse" than Gecko but we don't expect that to have a proportionally negative impact on users
  • Use CONTENT_FRAME_TIME_VSYNC instead of CONTENT_FRAME_TIME
  • For CONTENT_FRAME_TIME_VSYNC, look at fraction slow instead of the mean value
  • Use CONTENT_FULL_PAINT_TIME instead of CONTENT_PAINT_TIME
  • Additionally consider the median per-user fraction of slow CONTENT_FULL_PAINT_TIME events

Please let me know what you think; thanks!

Flags: needinfo?(jmuizelaar)

Yes. That looks great.

Flags: needinfo?(jmuizelaar)

To have 90% confidence that we can detect a 5% degradation in the per-user means and the per-user "slow fractions" of the performance metrics with a 5% family-wise type 1 error rate (adjusted for multiple comparisons), we should actually enroll about 0.06% of our release WAU.

This will wind up being 4-5x the number of users we're currently observing in beta, so the confidence intervals should be quite crisp.

Since only 5% of release users have wrQualified status "available", we need to "screen" 20-fold more users for eligibility, so we need Normandy to expose the recipe to 1.24% or more of release channel users. Let's round up to 1.5%.

(If we settle for lower confidence that we can detect each change, we can turn that number down -- e.g., at 80% confidence, we can expose the recipe to just 0.9% of release channel client_ids, but I don't think it's that critical to avoid the 1% threshold, so the larger sample is better.)
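The enrollment figure falls out of a standard power calculation; the shape of it looks roughly like this (a sketch assuming a two-sample t-test approximation on per-user means, an assumed coefficient of variation, and a Bonferroni correction across the performance comparisons -- the number of comparisons and the spread are illustrative inputs, and the linked notebook is authoritative):

```python
from statsmodels.stats.power import TTestIndPower

n_comparisons = 10              # assumed number of comparisons under the 5% FWER
alpha = 0.05 / n_comparisons    # Bonferroni-adjusted per-test alpha
power = 0.9
relative_change = 0.05          # 5% regression we want to detect
cv = 1.0                        # assumed per-user coefficient of variation (sd / mean)

# Cohen's d for a 5% shift in the mean, given the assumed spread.
effect_size = relative_change / cv

n_per_arm = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
)
# Clients needed per arm; dividing by release WAU gives the enrollment percentage.
print(round(n_per_arm))
```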

Detecting a 5% change in the crash rates is harder, because crashes are rare; we would need a larger sample. I think we aren't especially worried about crash rates, so we can slip that a little. At 0.06% of users enrolled for 21 days, we're powered (at 90% confidence) to detect a 7% change in the total crash rate (all processes) and a 12% change in the OOM crash rate.
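A companion sketch for the crash-rate side, using a normal approximation on the log of the rate ratio between two Poisson processes (the baseline rate and usage-hour inputs below are illustrative assumptions, not the values from the notebook):

```python
from math import log, sqrt
from scipy.stats import norm

def poisson_ratio_power(rate_per_1k_hours, usage_hours_per_arm, relative_change, alpha=0.05):
    """Approximate power to detect a relative change in a Poisson event rate,
    via a Wald test on the log rate ratio (var ~= 1/count1 + 1/count2)."""
    expected_base = rate_per_1k_hours * usage_hours_per_arm / 1000
    expected_alt = expected_base * (1 + relative_change)
    se = sqrt(1 / expected_base + 1 / expected_alt)
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(abs(log(1 + relative_change)) / se - z_crit)

# Illustrative only: 1 crash per 1,000 usage hours, 4,000,000 usage hours per arm.
print(poisson_ratio_power(1.0, 4_000_000, 0.07))
```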

I'll throw those numbers in Experimenter; please let me know if you have any followups or if you'd like tighter bounds on the crash rate.

The supporting notebook is here: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/81426/command/81504

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED