Experimenting around with "mochiperf" tests

RESOLVED FIXED

Status

defect
RESOLVED FIXED
6 years ago
5 years ago

People

(Reporter: jimm, Assigned: jimm)

Tracking

Trunk
x86_64
Windows 8.1
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(4 attachments, 6 obsolete attachments)

Posted patch mochiperf tests wip (obsolete) — Splinter Review
This is an experiment revolving around using mochitest to micro-benchmark the metro browser. The wip here is the mochitest side with a few test, plus the beginnings of a PerfTest object to facilitate measuring performance and reporting test data out to mochitest logs. 

The other side of this that I haven't put together yet is the graph server, which will likely be a simple python script that will monitor checkins on inbound and parse out mochitest-metro-chrome logs for test results. We don't have metro mochitest automated yet so I'll lok at this part after we do. 

The output from the perf tests currently looks like this:

TEST-INFO | chrome://mochitests/content/metro/browser/metro/base/tests/perf/browser_tabs_01.js | PERF-TEST | DECLARE {"id":"FBD7A532-D63A-44B5-9744-5CB07CFD131A","name":"Tab Open","category":"Jim's Tests","subcategory":"UX","description":"Open twenty tabs in succession, closing each before the next is opened. Gives the browser time to settle in between. Lets the ui react however it is designed to. Strips outliers."}

TEST-INFO | chrome://mochitests/content/metro/browser/metro/base/tests/perf/browser_tabs_01.js | PERF-TEST | RESULTS {"r1":201.31578947368422}

The idea here is that all the data about the test displayed on the graph server is contained in the logs. Tests are identified by a unique id. Everything else is updated every time a log is parsed, so it's easy to change test names, categories, etc.. The web interface will render using a current snapshot of static json data dumped by the log parser.

For maintenance, I figure the graph server will keep the historical data around (and append new data points to it) as long as the test is reporting. If any particular test stops reporting for a period of time (~1 month say) all the data is deleted and the test is dropped.
Posted patch mochiperf tests wip (obsolete) — Splinter Review
Attachment #734411 - Attachment is obsolete: true
some changes to our existing harness which I think will make these tests more successful:
* remove as many addons as possible (including specialpowers)
* fresh start of the browser for each test

I looked at doing this with mochitest plain+chrome, but allowing browser-chrome style tests would give us the best flexibility for writing test cases.  The danger in hacking the harness to remove unecessary profile bits and to start the browser fresh for each test is that we end up with a confusing harness.  I don't mind doing that up front to get tests going.

Another option we have is to upload data from the test/testharness directly to datazilla instead of scraping.  Adding into datazilla doesn't require a lot of work.  To me that makes the most sense, but I would stand behind other decisions if they make as much or more sense.
(In reply to Joel Maher (:jmaher) from comment #2)
> some changes to our existing harness which I think will make these tests
> more successful:
> * remove as many addons as possible (including specialpowers)
> * fresh start of the browser for each test
> 
> I looked at doing this with mochitest plain+chrome, but allowing
> browser-chrome style tests would give us the best flexibility for writing
> test cases.  The danger in hacking the harness to remove unecessary profile
> bits and to start the browser fresh for each test is that we end up with a
> confusing harness.  I don't mind doing that up front to get tests going.

Sure, lots we can do here. Maybe a mochitest variant would run these types of tests. Being able to control the profile that gets loaded might be really interesting too. Not worth worrying about while experimenting though.

> Another option we have is to upload data from the test/testharness directly
> to datazilla instead of scraping.  Adding into datazilla doesn't require a
> lot of work.  To me that makes the most sense, but I would stand behind
> other decisions if they make as much or more sense.

I'm not at all familiar with datazilla. How does it work? Is it open ended enough such that checking in new tests / removing tests / serving up graphs could be 'automatic'? To make something like 'mochipref' really useful I think you want to avoid the *heavy* releng overhead we have in talos.

Again though, this is still just an experiment, so I'm not looking to hook up a production system today. :) What I am interested in finding is a way to run a background log parser / web server someplace. I was wondering if people.moz supported that. Any ideas on where I might be able to hook something like that up?
Datazilla allows us to post arbitrary data without predefined test names, machine names, etc.  We would need to create a project.  The good thing is this data can easily be uploaded to from the test machines  and accessible via an api.  We can then write any front end to manage the data (i.e. your own stuff, chart.io, excel, custom datazilla pages, etc...).  My goal is to make this as open an hackable to developers instead of the tightly closed system we have for Talos.  So any decisions we would make here, that is how I will approach it.

I don't know about running via people.m.o, but there might be a way to do it.  If it was all javascript I could see that working.
(In reply to Joel Maher (:jmaher) from comment #4)
> Datazilla allows us to post arbitrary data without predefined test names,
> machine names, etc.  We would need to create a project.  The good thing is
> this data can easily be uploaded to from the test machines  and accessible
> via an api.  We can then write any front end to manage the data (i.e. your
> own stuff, chart.io, excel, custom datazilla pages, etc...).  My goal is to
> make this as open an hackable to developers instead of the tightly closed
> system we have for Talos.  So any decisions we would make here, that is how
> I will approach it.

Hmm, this would solve the problem of the server middle man. Is this open and available now such that I could experiment around with it, or is it still under development? If it's ready for experimenting, how should I go about creating a project and sending some test data to it?
I found this, curious if it's up to date - 

http://datazilla.readthedocs.org/en/latest/
Posted patch latest tests wip (obsolete) — Splinter Review
Attachment #734650 - Attachment is obsolete: true
Posted patch latest tests wip (obsolete) — Splinter Review
I was planning on emails you about this to figure out next steps with datazilla.

FWIW, what I have thus far is:

1) this set of tests that generate about 5 different metrics in the metro browser
2) a basic log parser that gets the data and writes it out into a nice json format suitable for reading into a web page.
3) a graphs page that turns the json into flot graphs

What I'm unsure of going forward is how datazilla best fits into this picture.

I think maybe it initially fits in well as a replacement for the json data file, with the python log parser writing to datazilla and the web page reading from datazilla. But maybe taking it one step further and writing data directly to datazilla from tests is possible without a lot of work.

I can post the python log parser and graphs page if you're curious.
Attachment #735950 - Attachment is obsolete: true
Posted patch latest tests wip (obsolete) — Splinter Review
Attachment #736900 - Attachment is obsolete: true
Duplicate of this bug: 847341
Blocks: metro-omtc
Blocks: 866852
No longer blocks: metro-omtc
Posted patch tests v.1 (obsolete) — Splinter Review
Assignee: nobody → jmathies
Attachment #740820 - Attachment is obsolete: true
Now that we have mochitests going, I'd like to land these tests to collect some data for a few weeks to see how reliable the data sets are.
Attachment #760853 - Attachment is obsolete: true
Attachment #760856 - Flags: review?(mbrubeck)
Posted patch test set v.1Splinter Review
This is a set of 12 starter perf tests that measure the following:

- browser deck shift times
- new tab open performance
- basic canvas rendering
- composition perf (for omtc)
- scroll perf / jank
- basic video playback
- message manager perf

ping me on irc if you have any questions.
Attachment #760857 - Flags: review?(mbrubeck)
This is relevant to bug 867742 (for unifying test harnesses) and for Chris Manchester's summer project, which is to add machine readable output from Automation.
Comment on attachment 760857 [details] [diff] [review]
test set v.1

Review of attachment 760857 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/metro/base/tests/mochiperf/Makefile.in
@@ +33,5 @@
> +libs:: $(PERF_TESTS)
> +	$(INSTALL) $(foreach f,$^,"$f") $(DEPTH)/_tests/testing/mochitest/metro/$(relativesrcdir)
> +
> +libs:: $(BROWSER_PERF_RESOURCES)
> +	$(INSTALL) $(foreach f,$^,"$f") $(DEPTH)/_tests/testing/mochitest/metro/$(relativesrcdir)/res

Use INSTALL_TARGETS (don't write your own one-off rules).

If this is a common pattern for Metro, we should add a moz.build variable for defining Metro mochitests.
(In reply to Gregory Szorc [:gps] from comment #17)
> 
> Use INSTALL_TARGETS (don't write your own one-off rules).

Always freaking forget that exists. Sorry, will them up.
Comment on attachment 760856 [details] [diff] [review]
helper functions .v1

Review of attachment 760856 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/metro/base/tests/mochiperf/perfhelpers.js
@@ +27,5 @@
> +   * Declare a test which the graph server will pick up and track.
> +   * Must be called by every test on startup. Graph server will
> +   * search for result data between this declaration and the next.
> +   *
> +   * @param UID string for this particular test, case sensitive.

naming nit: should this be more specific and say "UUID"?

@@ +135,5 @@
> +
> +  computeMedian: function computeMedian(aArray, aOptions) {
> +    aArray.sort(function (a, b) {
> +      return a - b;
> +    });

Isn't this equivalent to "aArray.sort();"?

@@ +228,5 @@
> +   * Returns the total time ellapsed in milliseconds. If the stopwatch
> +   * has not been stopped, stops it. Returns zero if no time has been
> +   * accumulated.
> +   */
> +  time: function time() {

"If the stopwatch has not been stopped, stops it" -- this appears to be false.
Attachment #760856 - Flags: review?(mbrubeck) → review+
Comment on attachment 760857 [details] [diff] [review]
test set v.1

Review of attachment 760857 [details] [diff] [review]:
-----------------------------------------------------------------

These look good to me.  I'd love to have a second opinion from Sam on these patches just to get a few more eyes on the code.

I agree with gps that we should add a special build variable for mochitest-metro-chrome files, similar to the ones for mochitest-browser-chrome et al.

::: browser/metro/base/tests/mochiperf/browser_miscgfx_01.js
@@ +139,5 @@
> +                         "heavy graphics demo using canvas.");
> +    let stopwatch = new StopWatch(true);
> +    let event = yield waitForEvent(win, "test", 20000);
> +    let msec = stopwatch.stop();
> +    PerfTest.declareNumericalResult((event.detail.frames / msec) * 1000.0, "fps");

There are a bunch of "fps" benchmarks in this patch; should they use PerfTest.declareFrameRateResult?
Attachment #760857 - Flags: review?(sfoster)
Attachment #760857 - Flags: review?(mbrubeck)
Attachment #760857 - Flags: review+
ping - sfoster for review. bbondy would like to turn omtc on by default so before he does that I'd like to land these to collect some prelim data.
Comment on attachment 760857 [details] [diff] [review]
test set v.1

I dont feel like I (yet) have all the context I need to give a useful review here, so please accept some drive-by comments. I understand that this gets us an easy way to measure and track some metro-specific performance metrics over time with (vs. talos which know little about tbh). To that end, minus the grumbling below, this seems direct and pragmatic and we should start hooking something up asap to see how it plays out. Provided we evaluate each perf test to assure ourselves that it will provide meaningful insights (not just data for its own sake), I tend to think some data is better than none and its easier to iterate on a thing that's working. 

That said, I find the nesting of test syntaxes odd. We have test() and gTest.push({... }) doing one thing, with the ok/is/etc. assertions we know and love, then the PerfTest.declareTest apparently declaring a test inside a test. Its just a naming oddness, but maybe one we can tuck away in our helpers so we can at least interact with a more intuitive API, and start establishing cow-paths. 

Are there perf assertions we want to make, or is gathering this data just a side-effect for later perusal? There are some basic, default thresholds we could plug in that would fail a test if it took too long. E.g. feedback on a button click (toggling :active) should take < 100ms. Similarly with immediate consequence of any user action - like focusing a field, dragging to highlight text and so on. More than .1 second (roughly) tends to break the illusion of direct manipulation; we should indicate something, even if we can't always complete the outcome of the action in so short a time.

Those thresholds would expect to change over time as our and our users' expectations are raised and as we go from being acceptable to slick/snappy/shit-of-shovel fast.

For perf data, we may want to log std deviation, to know if the average is meaningful at all? 

Out of the corner of my eye, I'm watching bug 867742, which is a proposal for a new testsuite module. Its not relevant here while we are experimenting, just something to keep in mind as our own automating testing needs take shape.
Attachment #760857 - Flags: review?(sfoster) → feedback+
(In reply to Sam Foster [:sfoster] from comment #22)
> Comment on attachment 760857 [details] [diff] [review]
> test set v.1
> 
> I dont feel like I (yet) have all the context I need to give a useful review
> here, so please accept some drive-by comments. I understand that this gets
> us an easy way to measure and track some metro-specific performance metrics
> over time with (vs. talos which know little about tbh). To that end, minus
> the grumbling below, this seems direct and pragmatic and we should start
> hooking something up asap to see how it plays out. Provided we evaluate each
> perf test to assure ourselves that it will provide meaningful insights (not
> just data for its own sake), I tend to think some data is better than none
> and its easier to iterate on a thing that's working. 
> 
> That said, I find the nesting of test syntaxes odd. We have test() and
> gTest.push({... }) doing one thing, with the ok/is/etc. assertions we know
> and love, then the PerfTest.declareTest apparently declaring a test inside a
> test. Its just a naming oddness, but maybe one we can tuck away in our
> helpers so we can at least interact with a more intuitive API, and start
> establishing cow-paths. 

Yes it does seem somewhat hackish. I can ingrain these tests into our testing logic in head.js more. We have gTests that have methods on them that do specific things, we could have a gBenchmarks that have a similar execution model. Test info/output would be available via methods on the test object and spit out by the logic that controls test runs. Results could be returned in a tear down type method. head.js could also stand to be broken up into a shared library of test helpers and two shims that execute two different types of tests. 

Am I headed in the right direction here? 

I'd still like to get these landed as is and work on something like this in a follow up when I have time over the next week or so. 

> Are there perf assertions we want to make, or is gathering this data just a
> side-effect for later perusal? There are some basic, default thresholds we
> could plug in that would fail a test if it took too long. E.g. feedback on a
> button click (toggling :active) should take < 100ms. Similarly with
> immediate consequence of any user action - like focusing a field, dragging
> to highlight text and so on. More than .1 second (roughly) tends to break
> the illusion of direct manipulation; we should indicate something, even if
> we can't always complete the outcome of the action in so short a time.

Yep, if we find something like that useful. Although we probably wouldn't tie test failures to something as specific as a button state change, I think we'd end up with more random orange than we could handle. But I can see us setting a min/max on the composition test in here, so that we could pick up serious regressions in gfx. 

> For perf data, we may want to log std deviation, to know if the average is
> meaningful at all? 

Don't see why not. This could be spit out with the other data and graphed along side other data points.

> Out of the corner of my eye, I'm watching bug 867742, which is a proposal
> for a new testsuite module. Its not relevant here while we are
> experimenting, just something to keep in mind as our own automating testing
> needs take shape.

If and when there's a push to better standardize mozilla in-tree testing we'll be a part of it.
(summarizing #windev discussion)
Yes something like a gBenchmarks sounds like the right direction to me.
The distinction is testing and making assertions about the results, vs. running benchmarks and gathering data over time. So they share some requirements from the harness but have very different goals otherwise and the code should reflect that.

In either case, lets land this and follow up with issues/enhancements.
Blocks: 885503
Depends on: 885536
experimental graphs page: http://www.mathies.com/mozilla/mochiperf.html

note this only loads in fx (nightly?) currently.
OS: Windows 8 Metro → Windows 8.1
You need to log in before you can comment on or make changes to this bug.