Closed Bug 1521626 Opened 6 years ago Closed 5 years ago

Validate WebRender performance in Release 66

Categories

(Data Science :: Experiment Collaboration, task, P1)

Points:
3

Tracking

Status: RESOLVED FIXED
data-science-status: Peer Review

People

(Reporter: tdsmith, Assigned: tdsmith)


Details

Attachments

(1 file)

Brief Description of the request: Jessie Bonisteel let me know that the WR team is interested in enabling WebRender for a set of users in Release 66 to validate the results of the nightly/beta experiment.

Any timelines for the request or how this fits into roadmaps: This should run during the 66 cycle.

Links to any assets (e.g. start of a PHD, BRD, or any other document that helps describe the project):

Assignee: nobody → tdsmith
Status: NEW → ASSIGNED
Priority: -- → P3
Priority: P3 → P1

Experiment design review checklist:
What is the goal of the effort the experiment is supporting?

Is an experiment a useful next step towards this goal?

  • Yes, because it determines whether WebRender is performant and stable relative to the existing Firefox rendering solution.

What is the hypothesis or research question? Are the consequences for the top-level goal clear if the hypothesis is confirmed or rejected?

  • The hypothesis is that WebRender is a stable and performant rendering solution on the Release channel, validating the results found in a previous Beta channel experiment.
  • Yes, the consequences are clear: the experiment will either validate or invalidate WebRender as having acceptable performance and stability.

Which measurements will be taken, and how do they support the hypothesis and goal? Are these measurements available in the targeted release channels? Has there been data steward review of the collection?

  • The measurements being taken are as follows (a sketch of how a couple of these criteria might be computed appears after this answer block):
    overall crash reports: ≤ 5% increase in crash rate
    OOM crash reports: ≤ 5% increase in crash rate
    shutdown crashes: ≤ 5% increase in crash rate
    CANVAS_WEBGL_SUCCESS: ≤ 5% regression in median of fraction "True" per user
    COMPOSITE_TIME: median per-user fraction of slow frames < 0.5% (absolute)
    CONTENT_FRAME_TIME_VSYNC: ≤ 5% regression in median of per-user fraction of slow events
    CONTENT_FULL_PAINT_TIME: ≤ 5% regression in fraction of slow paints (> 16 ms), ≤ 5% regression in median of per-user means
    DEVICE_RESET_REASON: ≤ 5% increase in reset rate
    FX_PAGE_LOAD_MS_2: ≤ 5% regression in median of per-user means
    FX_TAB_SWITCH_COMPOSITE_E10S_MS: ≤ 5% regression in median of per-user means
    Crash (/submission/reset) rates will be measured as events per 1,000 usage hours. A "slow" frame or event is one taking longer than 16 ms or 200% of vsync.
  • These metrics measure rendering performance and stability, thereby supporting the hypothesis.
  • These measurements are all believed to be available in the release channel (verifying each one exhaustively was not considered worth the effort).
  • At the time of writing, there has not been an optional data steward review of the collection.
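
A minimal illustrative sketch (in Python) of how a couple of these acceptance criteria might be computed from per-user aggregates. This is not taken from the experiment's ETL; the column names (usage_hours, crashes, slow_frames, total_frames) are hypothetical stand-ins:

    # Illustrative only; column names are hypothetical stand-ins for the
    # per-user aggregates produced by the experiment ETL.
    import pandas as pd

    def crash_rate_per_1000_hours(branch: pd.DataFrame) -> float:
        """Crash reports per 1,000 usage hours, pooled over a branch."""
        return 1000.0 * branch["crashes"].sum() / branch["usage_hours"].sum()

    def median_user_slow_fraction(branch: pd.DataFrame) -> float:
        """Median over users of each user's fraction of slow frames
        (frames taking longer than 16 ms / 200% of vsync)."""
        return (branch["slow_frames"] / branch["total_frames"]).median()

    # The "<= 5% regression" criteria then compare the two branches, e.g.:
    #   (metric(webrender) - metric(control)) / metric(control) <= 0.05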

Is the experiment design supported by an analysis plan? Is it adequate to answer the experimental questions?

  • Yes, the analysis plan follows the previous experiment performed on Beta.

Is the requested sample size supported by a power analysis that includes the core product metrics?

  • Yes, a power analysis was performed on the results of the previous study, which include all relevant metrics.
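
As a purely illustrative sketch of this kind of power analysis (not the actual calculation performed), a sample-size estimate for detecting a 5% relative change in a proportion-style metric could look like the following; the baseline fraction is a hypothetical placeholder:

    # Hypothetical inputs; shows the shape of a power analysis for a proportion metric.
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    baseline = 0.010              # assumed fraction of slow events in the control branch
    mde = 0.05 * baseline         # minimum detectable effect: a 5% relative regression
    effect = proportion_effectsize(baseline, baseline + mde)

    n_per_branch = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
    )
    print(f"~{n_per_branch:,.0f} users per branch")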

If the experiment is deployed to channels other than release, is it acceptable that the results will not be representative of the release population?

  • Not applicable; the experiment is deployed on release.
data-science-status: Planning → Data Acquisition
data-science-status: Data Acquisition → Evaluation & interpretation
Attached file: Draft report

Felix, could you review?

Notebook source at: https://metrics.mozilla.com/protected/tdsmith/20190422-wr66_release.Rmd

The ETL notebooks are linked from the bottom of the report.

Attachment #9059915 - Flags: review?(flawrence)
data-science-status: Evaluation & interpretation → Peer Review

Nice analysis; here are my comments on things that could plausibly change the results or the reader's interpretation of the results.

> The first telemetry session after a user enrolled was dropped for users in both branches

Did you just drop one ping (i.e. one subsession) or did you drop the whole session? My suspicions were raised because you then talk about profile_subsession_counter.

The graph in Sec 2.1.2 appears to be mislabeled; the text on the graph should read "0.5% slow events", not "0.05% slow events".

> Overall crash reports etc

Normalizing by "per 1000 usage hours" makes sense as a metric outside of an experiment context, but does it really make sense here? Do you really have a rigorous statistical method to model the uncertainties here? Seems like a hard problem.

Also it's important to measure "distinct users with crash" as opposed to "average crashes per user"; is the main difference between the two metrics going to be measuring "when someone gets into a crash loop, how many times do they persist before giving up and switching browser?" It would be good to add this metric if you have time, particularly given you saw some movement in crashes.

> Active time may have decreased slightly for WebRender branch users among less avid users, which could reflect either a more efficient browsing experience or less browsing.

This is not saving the user multiple minutes: page rendering time is surely a few orders of magnitude less than time spent on the page?

> Retention

Would be nice to see relative changes; I suspect that there's a small but significant drop at x=1? What's the CI on the relative change there and is it acceptable?

Thanks for this! Generally agreed; a few responses:

(In reply to Felix Lawrence from comment #3)

> The first telemetry session after a user enrolled was dropped for users in both branches

> Did you just drop one ping (i.e. one subsession) or did you drop the whole session? My suspicions were raised because you then talk about profile_subsession_counter.

The whole session (there's no counter for sessions).

> Overall crash reports etc

> Normalizing by "per 1000 usage hours" makes sense as a metric outside of an experiment context, but does it really make sense here? Do you really have a rigorous statistical method to model the uncertainties here? Seems like a hard problem.

No, we cargo-culted this from Mission Control. The confidence intervals are based on an assumption of crashes-as-Poisson-events. What would be a good way to address this? Add caveats, find a different display, something else?
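
For reference, a sketch of what the crashes-as-Poisson-events interval described above amounts to; this is illustrative only, not the actual Mission Control code, and the counts and hours are placeholders:

    # Illustrative only: an exact (Garwood) 95% interval for a Poisson count,
    # scaled to crashes per 1,000 usage hours. Assumes crashes are independent
    # events, which is exactly the assumption under discussion.
    from scipy import stats

    crashes = 420           # placeholder: total crash reports in a branch
    usage_hours = 900_000   # placeholder: total usage hours in that branch

    rate = 1000.0 * crashes / usage_hours
    lo = stats.chi2.ppf(0.025, 2 * crashes) / 2
    hi = stats.chi2.ppf(0.975, 2 * (crashes + 1)) / 2
    print(rate, 1000.0 * lo / usage_hours, 1000.0 * hi / usage_hours)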

> Also it's important to measure "distinct users with crash" as opposed to "average crashes per user"; is the main difference between the two metrics going to be measuring "when someone gets into a crash loop, how many times do they persist before giving up and switching browser?" It would be good to add this metric if you have time, particularly given you saw some movement in crashes.

Definitely agree that I should add a distinct-users presentation.

> Retention

> Would be nice to see relative changes; I suspect that there's a small but significant drop at x=1? What's the CI on the relative change there and is it acceptable?

The concordance at two weeks made me reluctant to dig into the one-week number, but I'm pretty sure you're right; I'll display that.
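
As a sketch of the kind of display being requested here (not the report's actual method), the relative change in one-week retention with a rough confidence interval could be computed like this; the retention counts are hypothetical:

    # Hypothetical numbers; relative change in 1-week retention with a
    # normal-approximation CI on the log of the risk ratio.
    import math

    def retention_relative_change(retained_c, n_c, retained_t, n_t, z=1.96):
        p_c, p_t = retained_c / n_c, retained_t / n_t
        ratio = p_t / p_c
        # standard error of log(ratio) for two independent binomial proportions
        se = math.sqrt((1 - p_c) / retained_c + (1 - p_t) / retained_t)
        lo, hi = ratio * math.exp(-z * se), ratio * math.exp(z * se)
        return ratio - 1, lo - 1, hi - 1   # expressed as relative change

    print(retention_relative_change(41_000, 50_000, 40_600, 50_000))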

> The confidence intervals are based on an assumption of crashes-as-Poisson-events. What would be a good way to address this? Add caveats, find a different display, something else?

I think you face a choice between:

  1. coming up with a model that captures what is going on in reality, and analysing the data to see what the intervention is doing to the model parameters (or some quantity derived from the model parameters)
  2. defining a crude metric that captures a directional change and is robust to effects you don't care about.

If you're defining a Poisson model then you're currently in camp 1. Some concerns immediately come to mind:

  • Do users ever get into crash loops? If so, that would seriously distort the model, because "one" event actually appears as a cluster of many.
  • Is there a different rate parameter for each user, or is there one shared across all users? If it's shared across all users, you're making an approximation - some users' systems are way crashier than others. Is this a good approximation or is it problematic?

If you or someone else has thought (or will think) through these issues, and your model is trustworthy, then great - this approach will give more insight into what's really going on.

If not, then I would be tempted not to normalize for hours of usage, and just look at the distinct-users data. If you're worried about increasing the number of crashes per user rather than just the number of users who experience a crash, then you could threshold, or bootstrap quantiles, or something else.
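
A minimal sketch of the "crude but robust" option described above: compare the fraction of distinct users with at least one crash between branches, with a bootstrap interval on the difference. The per-user crash-count arrays are hypothetical inputs:

    # Illustrative sketch of the distinct-users-with-crash comparison (camp 2 above).
    import numpy as np

    rng = np.random.default_rng(0)

    def crashed_fraction(crash_counts: np.ndarray) -> float:
        """Fraction of distinct users with at least one crash."""
        return float(np.mean(crash_counts > 0))

    def bootstrap_diff(control: np.ndarray, treatment: np.ndarray, n_boot: int = 5000):
        """95% bootstrap CI for the difference in crashed-user fractions (treatment - control)."""
        diffs = [
            crashed_fraction(rng.choice(treatment, treatment.size, replace=True))
            - crashed_fraction(rng.choice(control, control.size, replace=True))
            for _ in range(n_boot)
        ]
        return np.percentile(diffs, [2.5, 97.5])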

Opened a pull request against mozilla-reports: https://github.com/mozilla/mozilla-reports/pull/106

The good news is that all of the ways to look at crashes were broadly consistent.

You may have been alluding to this intentionally, but just to spell it out, when you say:

> Is there a different rate parameter for each user, or is there one shared across all users? If it's shared across all users, you're making an approximation - some users' systems are way crashier than others. Is this a good approximation or is it problematic?

Eh, why not both? A problem is that, since crashes are rare, it's very hard to estimate a rate parameter for users who aren't crash-looping, which is almost all of them. There's probably some clever imputation method we could use. Hopefully bug 1533444 starts to guide us somewhere more concrete, but I think it's helpful to provide both for now.

Could you point me to an easy place or method to read the report? Something easier than setting up your fork as a remote so I can check out the branch? If your PR included the .md, then I'd just read that.

> The good news is that all of the ways to look at crashes were broadly consistent.

So you investigated the crude "at least one crash per user" metric and it supports your conclusions? Then that's likely good enough for me; would still prefer to read the revised report before giving a formal r+, but if there's a rush then just press on.

Comment on attachment 9059915 [details]
Draft report

> The approach may overestimate the actual effect of WebRender on the population if a non-random set of users (e.g. users with poor performance) were more likely to unenroll from the experiment

"Overestimate" is the wrong word here - I read it as "effects will appear larger in our measurements than in reality" when I think you mean "the WebRender branch will appear better in our measurements than in reality". This has no material impact on this report because the sentence ended with "but this is unlikely because unenrollments were rare, and balanced between the experiment and control branches" - but it's worth correcting in reports on future experiments, when you may not have the luxury of rare and balanced unenrollments.

Otherwise the rendered preview looks good to me, so r+.
Attachment #9059915 - Flags: review?(flawrence) → review+
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED