Open Bug 922874 Opened 11 years ago Updated 2 years ago

Analyze power usage impact of most-used add-ons

Categories: Mozilla Metrics :: Frontend Reports (task)
Severity: normal
Tracking: Not tracked
People: Reporter: kmag; Assignee: Unassigned
Keywords: power
Attachments: 1 file (7.95 KB, application/vnd.ms-excel)
We've been running automated tests to collect data on the impact of add-ons on system power usage, as a proxy for performance impact. The initial data are promising, but we need help analyzing them for strong correlations, and in particular mitigating the impact of interference from other processes running on the system during tests.

This blog post gives an overview of the procedures: http://blog.5digits.org/2013/09/20/add-on-performance-testing-via-power-usage/ and this chart gives a broad representation of the latest results: https://people.mozilla.org/~kmaglione/power-tests/results/boxplot.svg

Since the blog post was written, we've done a few things to tighten the error bars:

• Killed, where possible, any process or service that we logged as taking significant CPU time during the test run.
• Loaded each test URL on the initialization run so that the cache would be in a similar state for each run.
• Interleaved the tests so that each run tested every add-on in sequence before proceeding to the next run, rather than testing each add-on 5 times and then moving on to the next add-on, to mitigate time-of-day effects.
• Logged the activity of other processes during each run so they could be accounted for in post-analysis.
• Logged garbage collection activity.

The data for each run are grouped into a single JSON file with the following fields:

    "power_usage": Power usage slices from Intel's Power Gadget, with timestamps in miliseconds, power usage for the current slice, and cumulative power usage.

    "process_cpu_samples": A per-process list of timestamps, in μs, for xperf samples when the process was running.

    "process_cpu_percentage": A per-process list of one-second slices where the process was active, and the percentage of samples during that slice during which the process was active.

    "pid": The process ID of the Firefox process, as it appears in the process CPU usage data above.

    "times": Timestamps, in ms, for events in the test process. URL loads include a start time when the load was triggered, and a stop time when the load event fired. GC slices include a start and a stop time indicating when the GC was active.

    "memory": Snapshot memory usage data for the start and end of the process. Should not be necessary for this analysis.

The raw data is available publicly here: https://people.mozilla.org/~kmaglione/power-tests/addon-power-data-20131001.tbz
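To make the layout concrete, here is a rough Python sketch of loading one run and pulling out the totals referred to below. The field names follow the description above, but the exact nesting of the "power_usage" entries (a dict per slice with "timestamp", "power", and "cumulative" keys) is illustrative and may need adjusting against the real files:

    import json

    def load_run(path):
        # Load one run's JSON blob from disk.
        with open(path) as f:
            return json.load(f)

    def run_summary(run):
        # Assumed layout: each "power_usage" entry carries a timestamp (ms),
        # the power used in that slice, and the cumulative total so far.
        slices = run["power_usage"]
        total_power = slices[-1]["cumulative"]
        session_ms = slices[-1]["timestamp"] - slices[0]["timestamp"]
        return total_power, session_ms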

I'd like to use this data to rank the add-ons in question by their impact on power consumption, accounting for other processes as much as possible, with an estimate of the magnitude of overall impact and a rough degree of certainty.

Let me know if anything above isn't clear. Thanks!
Group: metrics-private
Here are a few suggestions on comparing power consumption between addons (with more details below).

* use the Mann-Whitney test
* increase sample size to improve power and check basic assumptions
* model power consumption against other collected data if appropriate (need larger samples or domain knowledge to check appropriateness)

First I wanted to clarify how exactly you are comparing the addons. One way is to compare all of them against a common baseline addon, and the other is to order them by mean/median as in the boxplots and compare consecutive addons. Because many of the addons are so close, the second approach is not really feasible - I found that you'd need sample sizes in the 100s to detect some of the differences as significant. I took the first approach, selecting a baseline addon and comparing the others against it.

Expanding on suggestions above:

* With samples of size 5, I would recommend using the Mann-Whitney test (also known as the Wilcoxon rank-sum test) to compare groups. This is a nonparametric test, comparing entire distributions rather than just means. It can be thought of as detecting shift differences, so if the test rejects the null, that would mean that the median and quantiles of one distribution tend to all be higher or lower than those of the other. The test gives valid (i.e. believable) results for samples of any size and distribution.

The t test is the standard test for comparing groups by comparing their means. However, for small samples (e.g. less than 30) it relies quite heavily on the assumption that the groups are both normally distributed. If this assumption is not met, the results are not reliable. We can't really check this assumption with 5 data points per group, but I think it's unlikely to hold here. For larger samples, the normal approximation kicks in regardless of the group distribution, so the t test is viable.
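For concreteness, here is a minimal Python sketch of the baseline comparison (assuming SciPy is available); baseline_watts and addon_watts are hypothetical lists of per-run power totals for the baseline add-on and the add-on under test:

    from scipy.stats import mannwhitneyu, ttest_ind

    def compare_to_baseline(baseline_watts, addon_watts):
        # Nonparametric comparison: valid even with only 5 runs per group.
        u_stat, p_mw = mannwhitneyu(baseline_watts, addon_watts,
                                    alternative="two-sided")
        # Welch t test shown for contrast; with n=5 it leans heavily on the
        # normality assumption discussed above.
        t_stat, p_t = ttest_ind(baseline_watts, addon_watts, equal_var=False)
        return p_mw, p_t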

Regarding reducing variability, the ways this can be done are by (1) changing the design to control more factors, (2) collecting additional data and using it to explain away some of the variability in the main response observations (power consumption), and (3) increasing sample size. If you want to replicate "in the wild" conditions, you don't want to control external factors like network speed, in which case you can't really avoid this variability. You have already collected additional data, and this can be incorporated into a model along with the response. Aside from this, increasing sample size is always better, as it makes tests more powerful (more precise at detecting smaller differences).

* So, I would recommend increasing sample size as much as is feasible - perhaps 20 runs per addon? The other thing is that with only 5 observations per group, you can't really check any assumptions about the group distributions. You can't really tell whether they're normal or not, and unless you know that something went wrong in testing, it doesn't really make sense to treat any of them as outliers and discard them. Outliers are unusual observations, and if something occurred once out of 5 observations, you can't really say it's unusual.

* Finally, I would make use of the additional data as much as possible. I haven't looked at doing anything really sophisticated, but a simple thing I noticed is that the session lengths vary across sessions. I would expect longer sessions to consume more power just because of having more uptime. So, if you divide power consumption by session length, you get power consumption per unit time. Doing this does seem to improve the power of the tests. (However, this only makes sense if power consumption is *linearly* related to session length, which I am assuming, but which is difficult to check with such small samples.)
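As a tiny sketch of that normalization (argument names are hypothetical; the cumulative figure and session length would come from the logged data):

    def power_per_second(cumulative_power, session_ms):
        # Divide the run's cumulative power figure by its session length to
        # get a per-unit-time figure. Only meaningful if consumption scales
        # roughly linearly with session length, as noted above.
        return cumulative_power / (session_ms / 1000.0)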
Thanks for looking into this.

(In reply to dzeber from comment #1)
> Regarding reducing variability, the ways this can be done are by (1)
> changing the design to control more factors, (2) collecting additional data
> and using it to explain away some of the variability in the main response
> observations (power consumption), and (3) increasing sample size. If you
> want to replicate "in the wild" conditions, you don't want to control
> external factors like network speed, in which case you can't really avoid
> this variability. You have already collected additional data, and this can
> be incorporated into a model along with the response. Aside from this,
> increasing sample size is always better, as it makes tests more powerful
> (more precise at detecting smaller differences).

I agree that we want to replicate real-world conditions as much as
possible. I think it would still be useful to try to use sites with
less variability in performance and content. Someone in my office
suggested using a transparent caching proxy located at a different
datacenter, which I'm considering testing.

I also talked to someone who works on power measurement at Intel,
and it seems that they're working on a tool that measures power
consumption and correlates it with which processes are currently
active, which should be able to cut down some noise.

> * So, I would recommend increasing sample size as much as is feasible -
> perhaps 20 runs per addon? The other thing is that with only 5 observations
> per group, you can't really check any assumptions about the group
> distributions. You can't really tell whether they're normal or not, and
> unless you know that something went wrong in testing, it doesn't really make
> sense to treat any of them as outliers and discard them. Outliers are
> unusual observations, and if something occurred once out of 5 observations,
> you can't really say it's unusual.

I think I can manage 20 runs per add-on.

> * Finally, I would make use of the additional data as much as possible. I
> haven't looked at doing anything really sophisticated, but a simple thing I
> noticed is that the session lengths vary across sessions. I would expect
> longer sessions to consume more power just because of having more uptime.
> So, if you divide power consumption by session length, you get power
> consumption per unit time. Doing this does seem to improve the power of the
> tests. (However, this only makes sense if power consumption is *linearly*
> related to session length, which I am assuming, but which is difficult to
> check with such small samples.)

Longer sessions do correlate with higher overall power usage.
However, I don't want to compare power usage as a function of
session length, because longer sessions may be the result of the
performance impact of an add-on.
> I also talked to someone who works on power measurement at Intel,
> and it seems that they're working on a tool that measures power
> consumption and correlates it with which processes are currently
> active, which should be able to cut down some noise.

It sounds like this will be useful. 

> I think I can manage 20 runs per add-on.

Aside from increasing the power of the tests, this will give you a better idea of how spread out the distributions are, and what constitutes an outlier. 

> Longer sessions do correlate with higher overall power usage.
> However, I don't want to compare power usage as a function of
> session length, because longer sessions may be the result of the
> performance impact of an add-on.

Using some of the other information in the dataset might also turn out to be helpful, for example the events in the "times" subtree. Eg. maybe unusually long sessions (consuming extra power) can be explained by the occurrence of some extra GC work. 
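As a rough illustration (the exact shape of the "times" subtree isn't spelled out here, so the "type", "start" and "stop" keys below are assumptions), total GC time per run could be pulled out with something like:

    def total_gc_ms(run):
        # Sum the durations of the GC slices recorded in the "times" data.
        gc_events = [e for e in run["times"] if e.get("type") == "GC"]
        return sum(e["stop"] - e["start"] for e in gc_events)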

Let me know if you'd like to work together on any further analysis for this.
(In reply to David Rajchenbach Teller [:Yoric] <needinfo? me> from comment #4)
> Could that tool be
> http://software.intel.com/en-us/articles/intel-power-gadget-20 ?

No, that's the tool that we're already using. The tool that we talked about uses the same API, but correlates the power usage data with process activation.

(In reply to dzeber from comment #3)
> Using some of the other information in the dataset might also turn out to be
> helpful, for example the events in the "times" subtree. Eg. maybe unusually
> long sessions (consuming extra power) can be explained by the occurrence of
> some extra GC work. 
> 
> Let me know if you'd like to work together on any further analysis for this.

Sounds good. I'm running a set of tests with 20 runs per add-on now.
I'll get back to you when it's done.
Attached file add-on-results.csv
Adding the data set from the most recent collection.
I think with the variance from run-to-run, it's worth doing an investigation where not only do we correlate against CPU and GPU utilization, but we also grab some profiling data at the same time to see what is going on in the outlier cases.
(In reply to Joe Olivas from comment #7)
> I think with the variance from run-to-run, it's worth doing an investigation
> where not only do we correlate against CPU and GPU utilization, but we also
> grab some profiling data at the same time to see what is going on in the
> outlier cases.

http://vps.glek.net/power/add-on-results.html is a graph showing the variance in the CSV above. The new tool makes better graphs, but there is still excessive variability.
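As a starting point for that kind of correlation, here is a hedged Python sketch using SciPy's Spearman rank correlation between per-run power totals and per-run Firefox CPU activity; both input lists are hypothetical and would be derived from the logged data:

    from scipy.stats import spearmanr

    def power_vs_cpu(power_totals, cpu_sample_counts):
        # Rank correlation is robust to outliers and to nonlinear but
        # monotonic relationships between the two measures.
        rho, p_value = spearmanr(power_totals, cpu_sample_counts)
        return rho, p_value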
Blocks: power
Type: defect → task
Keywords: power