Closed Bug 1526041 Opened 5 years ago Closed 5 years ago

Size WebRender experiment for release 66

Categories

(Data Science :: Investigation, task, P1)

task
Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: tdsmith, Assigned: tdsmith)

References

(Blocks 1 open bug)

Details

Brief description of the request: Estimate the population size for an experiment to validate WebRender performance against the MVP ship criteria in release 66, based on the performance of the current experiment in beta (Bug 1492568).

About 2.5% of beta users and 5% of release users have gfx_features_wrqualified_status == "available".

https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/79144/command/79156
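For reference, the shape of that estimate looks roughly like this; a minimal PySpark sketch, assuming the Databricks notebook's implicit `spark` session and a main_summary-style table that exposes gfx_features_wrqualified_status per ping (the table name, date window, and filters here are illustrative assumptions, not the exact ones in the linked notebook):

```python
# Sketch only: estimate the per-channel share of clients whose WebRender
# "qualified" status is "available". Table name, date window, and filters
# are assumptions; the linked notebook is authoritative.
from pyspark.sql import functions as F

ms = (
    spark.table("telemetry.main_summary")                    # assumed dataset
         .where(F.col("submission_date_s3") >= "20190101")   # assumed window
         .where(F.col("normalized_channel").isin("beta", "release"))
)

qualified_share = (
    ms.groupBy("normalized_channel", "client_id")
      # a client counts as qualified if any ping in the window reports "available"
      .agg(F.max(F.when(F.col("gfx_features_wrqualified_status") == "available", 1)
                  .otherwise(0)).alias("is_qualified"))
      .groupBy("normalized_channel")
      .agg(F.avg("is_qualified").alias("fraction_qualified"))
)

qualified_share.show()
```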

Jeff and telin, I'm looking at how we powered/proposed the v1 experiment and I think we're looking at the probes a little differently now. Can you confirm/comment on the targets for this round?

This is what I think we're looking at:

Probe | v1 success condition | New success condition
overall crash reports | ≤ 5% increase in crash rate | same
OOM crash reports | ≤ 5% increase in crash rate | same
shutdown crashes | ≤ 5% increase in crash rate | same
CANVAS_WEBGL_SUCCESS | ≤ 5% regression in median of fraction "True" per user | same
COMPOSITE_FRAME_ROUNDTRIP_TIME | expect to see an improvement here | Abandon?
COMPOSITE_TIME | ≤ 10% regression in median of per-user means | Median per-user fraction of slow frames < 0.5% (absolute)
CONTENT_FRAME_TIME | ≤ 10% regression in median of per-user means | Replaced by CONTENT_FRAME_TIME_VSYNC; ≤ 5% regression in median of per-user fraction of slow events
CONTENT_LARGE_PAINT_PHASE_WEIGHT | ≤ 5% regression in submission rates | Abandon?
CONTENT_PAINT_TIME | ≤ 5% regression in median of per-user means | Replaced by CONTENT_FULL_PAINT_TIME; ≤ 5% regression in fraction of slow paints (> 16 ms), ≤ 5% regression in median of per-user means
DEVICE_RESET_REASON | ≤ 5% increase in reset rate | same
FX_PAGE_LOAD_MS(_2) | ≤ 5% regression in median of per-user means | same
FX_TAB_CLICK_MS | ≤ 5% regression in median of per-user means | Replaced by FX_TAB_SWITCH_COMPOSITE_E10S_MS; same target

I'll also need to figure out what to do with CHECKERBOARD_{DURATION, PEAK, SEVERITY}.

Crash, submission, and device-reset rates will be measured as events per 1,000 usage hours. A "slow event" means a CONTENT_FULL_PAINT_TIME paint longer than 16 ms or a CONTENT_FRAME_TIME_VSYNC frame longer than 200% of the vsync interval.
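To make the slow-fraction statistic concrete, here is a minimal pure-Python sketch (assuming each client's accumulated CONTENT_FRAME_TIME_VSYNC histogram is available as a dict mapping bucket lower bound, in percent of vsync, to count; the client data below is entirely hypothetical, and the real analysis runs over the telemetry histograms in Spark):

```python
from statistics import median

SLOW_VSYNC_PCT = 200  # CONTENT_FRAME_TIME_VSYNC buckets are % of the vsync interval

def slow_fraction(histogram):
    """Fraction of a client's frames that took longer than 200% of a vsync."""
    total = sum(histogram.values())
    slow = sum(count for bucket, count in histogram.items() if bucket > SLOW_VSYNC_PCT)
    return slow / total if total else None

# Hypothetical per-client histograms for illustration only.
clients = {
    "a": {50: 900, 150: 80, 250: 20},
    "b": {50: 950, 150: 40, 400: 10},
}
fractions = [f for f in (slow_fraction(h) for h in clients.values()) if f is not None]

# The statistic compared between the WebRender and Gecko arms.
print(median(fractions))
```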

To summarize, the things I think might be different are:

  • Abandon CONTENT_LARGE_PAINT_PHASE_WEIGHT and COMPOSITE_FRAME_ROUNDTRIP_TIME (it isn't necessary to let them go, but we haven't been looking at them afaik)
  • Apply an absolute limit to slow COMPOSITE_TIME event rates based on the beta study, instead of a percentage difference vs Gecko -- it's many times "worse" than Gecko but we don't expect that to have a proportionally negative impact on users
  • Use CONTENT_FRAME_TIME_VSYNC instead of CONTENT_FRAME_TIME
  • For CONTENT_FRAME_TIME_VSYNC, look at fraction slow instead of the mean value
  • Use CONTENT_FULL_PAINT_TIME instead of CONTENT_PAINT_TIME
  • Additionally consider the median per-user fraction of slow CONTENT_FULL_PAINT_TIME events

Please let me know what you think; thanks!

Flags: needinfo?(jmuizelaar)

Yes. That looks great.

Flags: needinfo?(jmuizelaar)

To have 90% confidence that we can detect a 5% degradation in the per-user means and the per-user "slow fractions" of the performance metrics with a 5% family-wise type 1 error rate (adjusted for multiple comparisons), we should actually enroll about 0.06% of our release WAU.

This will wind up being 4-5x the number of users we're currently observing in beta, so the confidence intervals should be quite crisp.

Since only 5% of release users have wrQualified status "available", we need to "screen" 20-fold more users for eligibility, so we need Normandy to expose the recipe to 1.24% or more of release channel users. Let's round up to 1.5%.

(If we settle for lower confidence that we can detect each change, we can turn that number down -- e.g., at 80% confidence, we can expose the recipe to just 0.9% of release channel client_ids, but I don't think it's that critical to avoid the 1% threshold, so the larger sample is better.)
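The enrollment figure falls out of a standard power calculation; the shape of it looks roughly like this (a sketch assuming a two-sample t-test approximation on per-user means, an assumed coefficient of variation, and a Bonferroni correction across the performance comparisons -- the number of comparisons and the spread are illustrative inputs, and the linked notebook is authoritative):

```python
from statsmodels.stats.power import TTestIndPower

n_comparisons = 10              # assumed number of comparisons under the 5% FWER
alpha = 0.05 / n_comparisons    # Bonferroni-adjusted per-test alpha
power = 0.9
relative_change = 0.05          # 5% regression we want to detect
cv = 1.0                        # assumed per-user coefficient of variation (sd / mean)

# Cohen's d for a 5% shift in the mean, given the assumed spread.
effect_size = relative_change / cv

n_per_arm = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
)
# Clients needed per arm; dividing by release WAU gives the enrollment percentage.
print(round(n_per_arm))
```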

Detecting a 5% change in the crash rates is harder, because crashes are rare; we would need a larger sample. I think we aren't especially worried about crash rates, so we can slip that a little. At 0.06% of users enrolled for 21 days, we're powered (at 90% confidence) to detect a 7% change in the total crash rate (all processes) and a 12% change in the OOM crash rate.
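A companion sketch for the crash-rate side, using a normal approximation on the log of the rate ratio between two Poisson processes (the baseline rate and usage-hour inputs below are illustrative assumptions, not the values from the notebook):

```python
from math import log, sqrt
from scipy.stats import norm

def poisson_ratio_power(rate_per_1k_hours, usage_hours_per_arm, relative_change, alpha=0.05):
    """Approximate power to detect a relative change in a Poisson event rate,
    via a Wald test on the log rate ratio (var ~= 1/count1 + 1/count2)."""
    expected_base = rate_per_1k_hours * usage_hours_per_arm / 1000
    expected_alt = expected_base * (1 + relative_change)
    se = sqrt(1 / expected_base + 1 / expected_alt)
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(abs(log(1 + relative_change)) / se - z_crit)

# Illustrative only: 1 crash per 1,000 usage hours, 4,000,000 usage hours per arm.
print(poisson_ratio_power(1.0, 4_000_000, 0.07))
```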

I'll throw those numbers in Experimenter; please let me know if you have any followups or if you'd like tighter bounds on the crash rate.

The supporting notebook is here: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/81426/command/81504

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED