Open Bug 1526094 Opened 9 months ago Updated 4 months ago

Validate WebRender performance in Release 67

Categories

(Data Science :: Experiment Collaboration, task, P1)

Points:
3

Tracking

(data-science-status Evaluation & interpretation)

ASSIGNED

People

(Reporter: tdsmith, Assigned: tdsmith)

References

(Blocks 1 open bug)

Details

Brief Description of the request: The WR team expects to turn on WR by default (for supported hardware) in release 67. We're interested in holding back a set of users in release 67 for a few weeks to validate the performance wins.

Any timelines for the request or how this fits into roadmaps: This should run during the 67 cycle.

Links to any assets (e.g. start of a PHD, BRD; any document that helps describe the project):

Assignee: nobody → tdsmith
Status: NEW → ASSIGNED

Note that the plan for release 67 is to do a gradual rollout, see bug 1541488. Not sure if that would impact this bug at all.

See Also: → 1541488

Corey, can I trouble you for an experiment design review?

The structure is essentially the same as the 66 experiment that you reviewed in Bug 1521626. The only difference is the context; WR was off by default in 66 and it will be on by default for eligible users in 67. We'd like to hold back a group in order to measure and document WR performance.

The population calculation for this study and the endpoints (described in Bug 1526041) are the same as for 66. Based on our experience with the deployment of the 66 experiment, I'll ask for the larger 5% sample that we ended up requiring in the last study.

Flags: needinfo?(cdowhygelund)

Tim, sure thing. I will review Monday/Tuesday of next week, if that works.

Experiment design review checklist:

What is the goal of the effort the experiment is supporting?

The use of WebRender as a rendering solution for Firefox. WebRender has many desirable qualities and has been validated by two previous experiments. This experiment adds further validation as the feature is rolled out in 67.

Is an experiment a useful next step towards this goal?

Yes, because it estimates, ahead of the feature rollout, whether WebRender is performant and stable relative to the existing Firefox rendering solution.

What is the hypothesis or research question? Are the consequences for the top-level goal clear if the hypothesis is confirmed or rejected?

  • To validate the results, found in two previous experiments on the Release and Beta channels, that WebRender is a stable and performant rendering solution.
  • Yes, the consequences are clear: the experiment will either validate or invalidate WebRender as having acceptable performance and stability for the feature rollout.

Which measurements will be taken, and how do they support the hypothesis and goal? Are these measurements available in the targeted release channels? Has there been data steward review of the collection?

  • The measurements being taken are as follows:
    No more than a 5% increase in overall crash reports
    No more than a 5% increase in OOM crash reports
    No more than a 5% increase in shutdown crashes
    Telemetry probes:
    CANVAS_WEBGL_SUCCESS - no more than 5% regression in "True" value
    DEVICE_RESET_REASON - no more than 5% regression in number of submissions
    CHECKERBOARD_DURATION - no more than 5% regression in distribution
    CHECKERBOARD_PEAK - no more than 5% regression in distribution
    CHECKERBOARD_SEVERITY - no more than 5% regression in distribution
    CONTENT_LARGE_PAINT_PHASE_WEIGHT - no more than 5% regression in number of submissions
    CONTENT_PAINT_TIME - no more than 5% regression in distribution
    FX_PAGE_LOAD_MS - no more than 5% regression in distribution
    FX_TAB_CLICK_MS - no more than 5% regression in distribution
    COMPOSITE_TIME - no more than 10% regression in distribution
    CONTENT_FRAME_TIME - no more than 10% regression in distribution
    COMPOSITE_FRAME_ROUNDTRIP_TIME - expect to see an improvement here
  • These metrics measure rendering performance and stability, thereby supporting the hypothesis.
  • These measurements are all believed to be available in the release channel (though verifying each one individually would be too much effort).
  • At the time of this review, there has not been a data steward review of the collection (such a review is optional).
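
To illustrate how the acceptance criteria above work, here is a minimal sketch of checking a "no more than a 5% increase" criterion for a rate metric such as overall crash reports. All numbers and function names below are invented for illustration; the real endpoint definitions and analysis live in bug 1526041.

```python
# Hypothetical sketch: checking the "no more than a 5% increase" criterion
# for a rate metric (e.g. crash reports per 1,000 usage hours). The values
# below are made up; the actual analysis is described in bug 1526041.

def relative_increase(control_rate: float, treatment_rate: float) -> float:
    """Relative change of the treatment (WR enabled) rate vs. control (holdback)."""
    return (treatment_rate - control_rate) / control_rate

def within_threshold(control_rate: float, treatment_rate: float,
                     threshold: float = 0.05) -> bool:
    """True if the treatment regresses by no more than `threshold` (5% default)."""
    return relative_increase(control_rate, treatment_rate) <= threshold

# Example: invented crash rates per 1,000 usage hours in each branch.
holdback_rate = 1.20
enabled_rate = 1.23   # a 2.5% relative increase

print(within_threshold(holdback_rate, enabled_rate))  # → True (passes the 5% criterion)
```

A real analysis would of course also account for sampling variability (confidence intervals on the rate difference) rather than comparing point estimates directly.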

Is the experiment design supported by an analysis plan? Is it adequate to answer the experimental questions?

Yes, the experiment plan follows the previous experiments. In addition, the plan increases the sample size relative to the previous Release experiment, based on the observed deployment behavior and the sample sizes actually acquired.

Is the requested sample size supported by a power analysis that includes the core product metrics?

Yes. The approach is the same as in the previous study, except that statistics acquired from that study are now used to calculate the requisite sample size.
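
As a rough illustration of the kind of power analysis this checklist item refers to, here is a standard two-proportion sample-size calculation. The baseline rate and effect size below are invented; the actual calculation used statistics from the 66 study.

```python
# Hypothetical sketch of a two-proportion sample-size (power) calculation.
# The proportions below are invented for illustration only.

from math import sqrt, ceil
from statistics import NormalDist

def sample_size_two_proportions(p1: float, p2: float,
                                alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-branch sample size to detect a difference between proportions p1 and p2
    with a two-sided test at significance level `alpha` and the given power."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for significance
    z_b = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Example: detecting a relative 5% increase in a 2% baseline rate
# (0.02 -> 0.021) requires a large branch, which is why small regression
# thresholds drive sample sizes like the 5% enrollment requested here.
print(sample_size_two_proportions(0.02, 0.021))
```

Note how small the detectable effect is (a tenth of a percentage point in absolute terms), which is why validating tight regression thresholds on rare events like crashes requires hundreds of thousands of clients per branch.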

If the experiment is deployed to channels other than release, is it acceptable that the results will not be representative of the release population?

Not applicable - experiment is deployed on release.

Flags: needinfo?(cdowhygelund)

:tdsmith All looks good. I am curious as to the mention of FX_PAGE_LOAD_MS versus FX_PAGE_LOAD_MS_2. Is this a typo or intended?

Unintended! That's old text; thanks for flagging it. The "real" list of endpoints is the table in https://bugzilla.mozilla.org/show_bug.cgi?id=1526041#c2.

Thanks for the review!

data-science-status: --- → Data Acquisition
Priority: P3 → P1

Experiment has concluded and I'm drafting the report. This was very similar to the 66 experiment (and results were similar).

data-science-status: Data Acquisition → Evaluation & interpretation