Closed Bug 1251052 Opened 8 years ago Closed 8 years ago

[checkerboard-experiment] Make a telemetry experiment to reduce displayport size

Categories: Core :: Panning and Zooming, defect
Platform: Unspecified / All
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Tracking Status: firefox47 --- fixed

People

(Reporter: kats, Assigned: kats)

References

(Blocks 1 open bug)

Details

(Keywords: perf, Whiteboard: [gfx-noted])

Attachments

(3 files, 1 obsolete file)

We can use the telemetry experiments machinery [1] to try different displayport sizes and see what the impact is on checkerboarding. We already have various telemetry probes in the tree to measure checkerboarding metrics, so the experiment would just need to flip the displayport size prefs and see what the result is. Ideally we want to reduce the displayport to as small as possible without increasing checkerboarding.

[1] https://wiki.mozilla.org/Telemetry/Experiments
bsmedberg, I'd like to put together a telemetry experiment that aims to find the smallest displayport size that doesn't significantly increase checkerboarding. Reducing the displayport size will save memory and reduce main-thread paint time with APZ enabled. In particular this should help reduce the tp5o_scroll regression seen in e10s vs non-e10s talos testing. Reducing the displayport past a certain point, however, will also increase checkerboarding when scrolling rapidly. The experiment aims to find a better "sweet spot" than the current values, which were mostly a guess based on anecdotal data.

We already have telemetry probes landed (CHECKERBOARD_SEVERITY, CHECKERBOARD_PEAK, CHECKERBOARD_DURATION, CHECKERBOARD_POTENTIAL_DURATION) that record various aspects of checkerboard events. The experiment code just needs to divide the population into groups and set different displayport-sizing prefs for each group (see the sketch after the group list below). All the prefs that need to be modified are "live" and don't require a browser restart. The groups I'd like to make are:

Group 1: Control group
- no pref changes

Group 2: Small horizontal displayport, unchanged vertical displayport
- apz.x_skate_size_multiplier set to 0.5
- apz.x_stationary_size_multiplier set to 0.5

Group 3: Unchanged horizontal displayport, smaller vertical displayport
- apz.y_skate_size_multiplier set to 2.0
- apz.y_stationary_size_multiplier set to 2.0

Group 4: Unchanged horizontal displayport, even smaller vertical displayport
- apz.y_skate_size_multiplier set to 1.0
- apz.y_stationary_size_multiplier set to 1.0

Group 5: Unchanged horizontal displayport, smaller vertical displayport (skate only)
- apz.y_skate_size_multiplier set to 2.0

Group 6: Unchanged horizontal displayport, even smaller vertical displayport (skate only)
- apz.y_skate_size_multiplier set to 1.0
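
As a rough sketch of what the experiment's bootstrap.js needs to do per group (illustrative only: it assumes Preferences.jsm, made-up group names, and that these multiplier prefs are float prefs stored as string values):

const { Preferences } = Components.utils.import(
  "resource://gre/modules/Preferences.jsm", {});
// Write to the default branch so nothing persists in the user's profile.
const defaultPrefs = new Preferences({ defaultBranch: true });

// Prefs to set per group; the group names here are illustrative.
const GROUP_PREFS = {
  group1: {}, // control: no pref changes
  group2: { "apz.x_skate_size_multiplier": "0.5",
            "apz.x_stationary_size_multiplier": "0.5" },
  group3: { "apz.y_skate_size_multiplier": "2.0",
            "apz.y_stationary_size_multiplier": "2.0" },
  group4: { "apz.y_skate_size_multiplier": "1.0",
            "apz.y_stationary_size_multiplier": "1.0" },
  group5: { "apz.y_skate_size_multiplier": "2.0" },
  group6: { "apz.y_skate_size_multiplier": "1.0" },
};

function applyGroup(group) {
  for (let pref of Object.keys(GROUP_PREFS[group])) {
    defaultPrefs.set(pref, GROUP_PREFS[group][pref]);
  }
}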

Based on recent historical telemetry data, we get reports of about 1 million checkerboard events per 7 days' worth of builds. Even split into 6 groups, I'm hoping that running this experiment on Nightly for 7 days will be sufficient to give us some useful data.

Please let me know if you have any questions or concerns. ni? also to :botond if he has any comments or improvements to suggest.
Flags: needinfo?(botond)
Flags: needinfo?(benjamin)
I'm also in the process of running tp5o_scroll locally on OS X and Windows for each of these groups to see what kind of improvement it yields. I'm expecting more improvement on Windows than on OS X, because Windows doesn't have tiling enabled, so the displayport isn't getting rounded up beyond the specified multipliers.
What information do you need from me? https://wiki.mozilla.org/Telemetry/Experiments has guidelines: if you want somebody to review the basic framework, please ask felipe.

An example of an experiment which just sets prefs can be found at http://hg.mozilla.org/webtools/telemetry-experiment-server/file/bb8b48cf0418/experiments/flash-protectedmode-beta

You should do some testing to make sure that the prefs set by the experiment take effect properly. A lot of graphics prefs are only read at startup before experiments are launched, which kinda sucks. Are all these prefs "live"?
Flags: needinfo?(benjamin)
I was doing the "Before building an experiment" step at https://wiki.mozilla.org/Telemetry/Experiments#Data_Collection, but maybe that's not needed in this case. I'll just go ahead and build it.

And yes, all the prefs are live.
Ah ok, if this is all telemetry that already exists there's really nothing additional I need to review.
Attached patch Experiment (obsolete) — Splinter Review
I'm assuming that the XPI doesn't need signing since it's a nightly-only experiment - please correct me if I'm wrong. Here's the patch to the telemetry-experiment-server repo that adds the experiment.
Attachment #8723811 - Flags: review?(felipc)
Comment on attachment 8723811 [details] [diff] [review]
Experiment

Review of attachment 8723811 [details] [diff] [review]:
-----------------------------------------------------------------

Yeah, I don't think you need to sign if this is gonna stay nightly-only.

::: experiments/displayport-tuning-nightly/code/bootstrap.js
@@ +68,5 @@
> +      Cu.reportError("Got error during bootstrap startup: " + e);
> +    });
> +}
> +
> +function shutdown() {

shutdown can receive a reason parameter, and you can condition this to 
reason != APP_SHUTDOWN

so that the clean-up only happens when the add-on is being disabled/uninstalled, but not during the normal shutdown process

@@ +74,5 @@
> +  defaultPrefs.reset("apz.x_skate_size_multiplier");
> +  defaultPrefs.reset("apz.y_skate_size_multiplier");
> +  defaultPrefs.reset("apz.x_stationary_size_multiplier");
> +  defaultPrefs.reset("apz.y_stationary_size_multiplier");
> +}

can you define empty install() and uninstall() functions? Otherwise the add-ons manager will spew warnings about them being undefined
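
For example, a minimal sketch of both suggestions (assuming the same defaultPrefs object used in the patch):

// Standard bootstrapped add-on entry points; APP_SHUTDOWN is provided by
// the bootstrap scope.
function shutdown(data, reason) {
  // Only undo the pref changes when the experiment is disabled or
  // uninstalled, not during normal application shutdown.
  if (reason == APP_SHUTDOWN) {
    return;
  }
  defaultPrefs.reset("apz.x_skate_size_multiplier");
  defaultPrefs.reset("apz.y_skate_size_multiplier");
  defaultPrefs.reset("apz.x_stationary_size_multiplier");
  defaultPrefs.reset("apz.y_stationary_size_multiplier");
}

// Empty stubs so the add-ons manager doesn't warn about them being undefined.
function install() {}
function uninstall() {}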

::: experiments/displayport-tuning-nightly/manifest.json
@@ +7,5 @@
> +  "manifest"    : {
> +    "id"               : "displayport-tuning-nightly@experiments.mozilla.org",
> +    "startTime"        : 1456365600,
> +    "endTime"          : 1458007200,
> +    "maxActiveSeconds" : 6048000,

70 days is too many. Please use a more realistic value (since this ends on Mar 14 anyway). maxActiveSeconds limits the maximum active time for users who are enrolled early on (i.e., so they can finish earlier than endTime).

@@ +12,5 @@
> +    "appName"          : ["Firefox"],
> +    "channel"          : ["nightly"],
> +    "minVersion"       : "47.0a1",
> +    "maxVersion"       : "48.*",
> +    "sample"           : 1.0

Figure out with bsmedberg which sample you want here. We have avoided 100% in the past
Attachment #8723811 - Flags: review?(felipc) → feedback+
(In reply to :Felipe Gomes (needinfo me!) from comment #7)
> 70 days is too many. please use a more realistic value (since this ends in
> Mar 14 anyway).  This can be used to limit a maximum time for users who are
> activated early on (i.e., to finish earlier than endTime).

Whoops, I meant to make it 7 days. Got an extra 0 in there. I'll update the rest as you suggested. I might also extend the end date by a week to give enough time for QA and deployment; I'm not sure how long that process usually takes.

> > +    "sample"           : 1.0
> 
> Figure out with bsmedberg which sample you want here. We have avoided 100%
> in the past

Given that we're splitting the sample population into 6 branches I'd like to start with as large a sample as I can get away with, to maximize the amount of data for each branch. (Note that one of the branches is the control group, so if you prefer I can use an 83% sample rate and just 5 branches.) Is there a reason to avoid 100% sampling, though? I saw a few experiments listed at https://wiki.mozilla.org/QA/Telemetry that did it.
ni? to bsmedberg for guidance on the sample rate (see last paragraphs of comment 7 and comment 8)
Flags: needinfo?(benjamin)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #2)
> I'm also in the process of running tp5o_scroll locally on OS X and Windows
> for each of these groups to see what kind of improvement it yields. I'm
> expecting more improvement on Windows than on OS X, because Windows doesn't
> have tiling enabled, so the displayport isn't getting rounded up beyond the
> specified multipliers.

tp5o_scroll results from my local runs:

            OS X            Windows
No e10s     4.5274102       8.8103926
            (baseline)      (baseline)

e10s        6.0060741       13.0563724
            (+32.66%)       (+48.19%)

Group 2     5.9652480       12.5385882
            (+31.76%)       (+42.32%)

Group 3     5.4796194       9.7264680
            (+21.03%)       (+10.40%)

Group 4     4.7459634       8.7287164
            (+4.82%)        (-0.93%)

Group 5     5.7147330       12.6219963
            (+26.22%)       (+43.26%)

Group 6     5.9374061       12.5258583
            (+31.14%)       (+42.17%)

Group 4 did the best, significantly better than all the other groups. Groups 2 and 6 barely moved the needle. These results are more or less expected since reducing the y_stationary multiplier produces the greatest impact. The telemetry experiment will give us a view on how much checkerboarding results from these as well, and then we can pick some multipliers that make sense.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #1)
> ni? also to :botond if he has any comments or improvements to suggest.

Sounds like a good idea to me!

One question: would it make sense to include a group that has the displayport size reduced in both dimensions at once?
Flags: needinfo?(botond)
release-drivers needs a clean (no experiments) beta population to validate releases, so we must never use more than 50% of the beta population. We're also running into the fact that we're splitting beta in half between e10s and non-e10s starting in 46 (probably). The couple of experiments that have run on a 100% sample were run on very restrictive populations (only the Polish locale, for example).

What is the size of the population you need to have a representative measurement? The goal is to have the sample be as small as will produce the statistical power you need. If you're not sure, please talk to rvitillo about the statistics.
Flags: needinfo?(benjamin)
Also, do you want to run this experiment on both e10s and non-e10s users?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #12)
> release-drivers needs a clean (no experiments) beta population to validate
> releases, so we must never use more than 50% of the beta population. We're
> also running into the fact that we're splitting beta in half e10s/non-e10s
> starting in 46 (probably). The couple of experiments that have run on a 100%
> sample were run on very restrictive populations (only Polish locale, for
> example).

This experiment is nightly-only, so the beta concerns shouldn't apply.

> Also, do you want to run this experiment on both e10s and non-e10s users?

e10s is the relevant population; the non-e10s users won't be generating the telemetry data I care about.
ah, ok. Please do 50% of nightly.

I'm concerned that nightly won't produce interesting or representative data, because the hardware is so skewed compared to the rest of our populations. So I will recommend that you repeat this experiment on aurora/beta in order to validate/tune the defaults for the largest population.

You may want to use an experiment filter to explicitly exclude people from the experiment who are not running e10s: you can write a filter on environment.settings.e10sEnabled
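
For illustration, the filter might look roughly like this (the "jsfilter" manifest field and the context shape are assumptions; check them against the experiment server docs):

// Assumed to be embedded as a string in the manifest's "jsfilter" field.
function filter(c) {
  let settings = c.telemetryEnvironment.settings;
  // Only enroll profiles that have e10s (and telemetry submission) enabled.
  return settings.e10sEnabled && settings.telemetryEnabled;
}
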
(In reply to Botond Ballo [:botond] from comment #11)
> One question: would it make sense to include a group that has the
> displayport size reduced in both dimensions at once?

I thought about that, but I figured that we could probably derive an approximation of that by combining the values from the individual dimensions. I also don't want to increase the number of groups too much. Perhaps when we run the experiment again in the future we can try combinations of both dimensions.

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #15)
> ah, ok. Please do 50% of nightly.

Sounds good.

> I'm concerned that nightly won't produce interesting or representative data,
> because the hardware is so skewed compared to the rest of our populations.
> So I will recommend that you repeat this experiment on aurora/beta in order
> to validate/tune the defaults for the largest population.

Fair enough. For now I'm happy with a first-approximation on this data, and we can run further experiments in the future to tune it more.

> You may want to use an experiment filter to explicitly exclude people from
> the experiment who are not running e10s: you can write a filter on
> environment.settings.e10sEnabled

Will do. I'll also exclude people who don't have telemetry enabled since we won't get data from them either.
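
For reference, with the fixes above the relevant manifest fields would look something like this (7 days is 7 * 86400 = 604800 seconds; the other fields stay as in the patch):

"maxActiveSeconds" : 604800,
"sample"           : 0.5
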
When setting the initial branch, you might also want to first check for pre-existing non-default values and put users in that condition into a "not part of experiment" group.
Attached patch Experiment (v2) — Splinter Review
With all the comments addressed. I retested it in a fresh Nightly profile and it seems to be working as expected.
Attachment #8724184 - Flags: review?(felipc)
(In reply to :Felipe Gomes (needinfo me!) from comment #17)
> When setting the initial branch, you might also want to first check for
> pre-existing non-default values and put users in that condition into a
> "not part of experiment" group.

Oh, sorry. Didn't see this. Would I just create a new branch to indicate the "not part of experiment" group?
Actually in practice I think the number of people who have manually set these prefs is going to be ~0 so I'm not sure if it's worth worrying about.
Comment on attachment 8724184 [details] [diff] [review]
Experiment (v2)

Review of attachment 8724184 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #19)
> (In reply to :Felipe Gomes (needinfo me!) from comment #17)
> > When setting the initial branch, you might also want to first check for
> > pre-existing non-default values and put users in that condition into a
> > "not part of experiment" group.
> 
> Oh, sorry. Didn't see this. Would I just create a new branch to indicate the
> "not part of experiment" group?

Yeah. Just as a way to have them not skew your data. Up to you to figure out if that would be important or not
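
If it does turn out to be worth handling, a minimal sketch of the check (hypothetical helper for bootstrap.js; branch names are illustrative):

const { Services } = Components.utils.import(
  "resource://gre/modules/Services.jsm", {});

const EXPERIMENT_PREFS = [
  "apz.x_skate_size_multiplier",
  "apz.y_skate_size_multiplier",
  "apz.x_stationary_size_multiplier",
  "apz.y_stationary_size_multiplier",
];

// If the user has manually overridden any of the prefs the experiment
// touches, assign them to a branch that changes nothing so they don't
// skew the data.
function pickBranch(assignedBranch) {
  if (EXPERIMENT_PREFS.some(p => Services.prefs.prefHasUserValue(p))) {
    return "notInExperiment";
  }
  return assignedBranch;
}
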
Attachment #8724184 - Flags: review?(felipc) → review+
I wrote up a test plan at https://public.etherpad-mozilla.org/p/bug_1251052_experiment - it's pretty basic since there's not much that the add-on really does. ni? to RyanVM for QA testing on it as per IRC discussion.
Flags: needinfo?(ryanvm)
After reading a recent dev-platform thread I realized I didn't get the xpi signed which I probably need for installation on default-configuration Nightly builds. Looks like I should be able to sign it with jpm, will do that tomorrow or Monday morning and push the updated xpi to the repo.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #23)
> I wrote up a test plan at
> https://public.etherpad-mozilla.org/p/bug_1251052_experiment - it's pretty
> basic since there's not much that the add-on really does. ni? to RyanVM for
> QA testing on it as per IRC discussion.
Flags: needinfo?(ryanvm) → needinfo?(rares.bologa)
Attachment #8723811 - Attachment is obsolete: true
Attached file experiment.xpi
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #24)
> After reading a recent dev-platform thread I realized I didn't get the xpi
> signed which I probably need for installation on default-configuration
> Nightly builds. Looks like I should be able to sign it with jpm, will do
> that tomorrow or Monday morning and push the updated xpi to the repo.

Looks like I can't use the automation signing key to sign experiment xpi's (I get a "You cannot submit this type of add-on" error). On IRC :felipe said I should ask :jason to get it signed - Jason, could you sign this XPI for a telemetry experiment? Thanks!
Flags: needinfo?(jthomas)
Flags: needinfo?(rares.bologa) → needinfo?(mfunches)
Preliminary Detailed Test Cases for Checkerboard Experiment
https://docs.google.com/spreadsheets/d/1NADaSY5pqFc8EBYk2hOAzoh4-OuAjQK-XCID7mnukLk/edit#gid=280492042
a) will await Signed XPI
b) will I need to adjust the "experiments.manifest.uri" pref?
(In reply to Michelle Funches - QA from comment #27)
> Preliminary Detailed Test Cases for Checkerboard Experiment
> a) will await Signed XPI
> b) will I need to adjust the "experiments.manifest.uri" pref?

The experiments.manifest.uri that you should use is "https://telemetry-experiment-dev.allizom.org/firefox-manifest.json"

I also suggest creating a new String pref named "experiments.force-sample-value" with a value of "0.1"
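
In other words, something like this in the test profile's user.js (values taken from this comment; adjust as needed):

// Point the Experiments service at the dev manifest and force this profile
// into the sample.
user_pref("experiments.manifest.uri", "https://telemetry-experiment-dev.allizom.org/firefox-manifest.json");
user_pref("experiments.force-sample-value", "0.1");
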
Attached file signed experiment.xpi
Please see signed experiment.xpi
Flags: needinfo?(jthomas)
Thanks!

I landed the signed experiment.xpi into the repo at https://hg.mozilla.org/webtools/telemetry-experiment-server/rev/db30cb1c7ff5 and verified it's updated at the telemetry-experiment-dev.allizom.org URL.
Testing complete; sign-off sent, wiki pages have been updated.
Flags: needinfo?(mfunches)
I sent an email to release-drivers requesting approval to ship the experiment.
In the next 7 days, the Nightly channel will transition from 47 -> 48, and I am told that this should not have an impact on the effectiveness of this telemetry experiment. With that, it's a GO!
Thanks! I updated the release_tag in the repo, and will file a bug for pushing it to production.

https://hg.mozilla.org/webtools/telemetry-experiment-server/rev/b00d24d4799d
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
It's only been a few days of the experiment being live, but I ran an analysis on the telemetry data that's accumulated so far:

https://github.com/staktrace/telemetry-analysis/blob/b31ba0e8e21c2e3d11740da7abbeed255320167a/checkerboarding.ipynb

It looks like groups 4 and 6 have a significantly higher number of checkerboarding events. I need to slice and dice the data some more to see if we just ended up replacing a few large checkerboarding events with a lot of small ones, or something along those lines. Just eyeballing the graphs though it looks like groups 4 and 6 are worse across the board which is bad news :(
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #35)
> Just eyeballing the graphs
> though it looks like groups 4 and 6 are worse across the board which is bad
> news :(

Isn't it expected that reducing one or both of the vertical multipliers to 1.0 will cause more checkerboarding?
(In reply to Botond Ballo [:botond] from comment #36)
> Isn't it expected that reducing one or both of the vertical multipliers to
> 1.0 will cause more checkerboarding?

Not really - given that we can paint much faster with a smaller displayport, it wasn't clear whether this would be a net win or loss for checkerboarding. And even assuming it would cause more checkerboarding, it wasn't clear by how much.
Also, to be clear, the reason I said it's "bad news" is that it makes it much less likely that we can use "shrink the displayport" as a magic bullet for all of our talos woes.
I updated the ipynb with some more stuff. The new version is at https://github.com/staktrace/telemetry-analysis/blob/0c1af2aab4a268b21e296346d8568ed6da2eeeb8/checkerboarding.ipynb

Roberto, do you mind taking a quick look at this notebook and reviewing the bits where I'm pulling the data out of telemetry? Just as a sanity check to make sure I'm not doing anything stupid. Thanks!
Flags: needinfo?(rvitillo)
It looks generally good. To compute the sample sum you are using the lower bound of the buckets, which yields an approximation; the raw histogram representation has a "sum" field which is accurate, but to access it you would have to retrieve that data without "get_pings_properties", which as you know returns only a pandas series for a histogram, e.g.:

pings.map(lambda p: p["payload"]["histograms"].get("CHECKERBOARD_DURATION", {}).get("sum", 0)).sum()

That said, I wouldn't expect big surprises using the accurate sums.
Flags: needinfo?(rvitillo)
Updated the analysis now that the experiment has been running a while longer and there's more data. I also looked at some other relevant probes (composite/paint time probes) to see if there was a significant difference there. Nothing too unexpected there.

As far as I can tell the control group is the best of the different experiment groups in terms of checkerboarding. group2 is also decent but the sample sum of severity is still (very slightly) worse.

https://github.com/staktrace/telemetry-analysis/blob/5552e29600ea6603eb6368f4a3fba03e5ea1a022/checkerboarding.ipynb
What's interesting is that in the experiment, almost all the groups were worse than the control group. So you'd expect that over the entire population, the net effect of the experiment would be to make things worse. However, looking at the telemetry evolution dashboard I see a marked improvement over the time that the experiment was running [1] (starting March 4th/5th, going for a week from there). I'm not sure what's going on here...

[1] http://mzl.la/1VdeVnF