Implement a harness for regression testing responsiveness of individual actions

RESOLVED FIXED

Status

defect
RESOLVED FIXED
8 years ago
2 years ago

People

(Reporter: ted, Assigned: ahal)

Tracking

(Depends on 1 bug)

Trunk
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

For the ongoing e10s work, improved responsiveness is one of the key goals. bsmedberg said that one thing they'd like to have is regression tests for particular things in web content. The goal here would be to allow developers to write test cases for things that web content can do and ensure that they do not cause the chrome process to become unresponsive.
Target Milestone: --- → mozilla8
Version: unspecified → Trunk
So what constitutes a "regression"?

For example, let's say we run a test and it reports that while opening a page, the threshold was breached twice, with values of 60ms and 120ms. We then run the test a second time and it reports two breaches again, with the values reversed: 120ms and 60ms.

Is this a regression?  How do we know that these are even the same two events that are exceeding the threshold?  Do we care?  One simple solution would be to just take the sum of all breaches as the value to use to check for regressions. 
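
A minimal sketch of the "sum of all breaches" idea, assuming each run is reduced to a plain list of breach durations in milliseconds. The function names and the `tolerance_ms` parameter are hypothetical and are not part of any existing harness.

```python
def total_breach_ms(breaches_ms):
    """Collapse one run's threshold breaches into a single comparable number."""
    return sum(breaches_ms)


def is_regression(baseline_ms, candidate_ms, tolerance_ms=0):
    """Flag a regression only when the candidate total exceeds the baseline.

    tolerance_ms is a hypothetical knob for absorbing run-to-run noise.
    """
    return total_breach_ms(candidate_ms) > total_breach_ms(baseline_ms) + tolerance_ms


first_run = [60, 120]
second_run = [120, 60]  # same breaches, reversed order

# The order of the breaches does not matter, only their total.
print(is_regression(first_run, second_run))
```

Because only the total is compared, the harness does not need to know whether the two breaches come from the same underlying events.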

I can also imagine these tests producing fairly varied results from one run to the next, which could lead to a lot of random oranges if we aren't careful.
Depends on: 631571
If zero runs before the check-in breached the threshold, it looks like a regression either way. Maybe I'm misunderstanding the question?
Target Milestone: mozilla8 → ---
I was thinking more about:

> If we haven't yet fixed the problem, we measure the latency and make sure it doesn't get
> worse.

from https://bugzilla.mozilla.org/show_bug.cgi?id=631571#c20

But it turns out that alice, jlebar, ted and others have been working on a way to quantify these results into a metric.
Target Milestone: --- → mozilla8
Target Milestone: mozilla8 → ---
I've started some basic work on this bug, but the requirements are still a bit murky. The big question I have is: Do these tests need to run in chrome scope?  Or is running in content with a SpecialPowers API good enough?

If the answer is the former, then I think that this harness should be implemented as a part of Mozmill. Mozmill already allows us to automate any part of the UI including dialogs.  If we went this route I'd also be making Mozmill e10s ready.

If the answer is the latter, maybe writing a lighter-weight harness is the better way to go. Such a harness would be simpler and faster than Mozmill.
Assignee: nobody → ahalberstadt
Status: NEW → ASSIGNED
As may be clear from the description, the requirements for this *are* murky and require thought and design.

Note that the metric that jlebar and alice are working on is very general and probably won't be directly useful for this. In particular, there are going to be responsiveness problems which aren't related to particular test actions, specifically GC/CC pauses. In these cases, we don't want to fail a particular responsiveness test because a general problem happened to occur during the running of that test.

Some ways around this are:
* disable bad actions entirely while running the responsiveness tests
* log bad actions and discard responsiveness issues that they cause

There are perhaps other ways around this.
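
The "log and discard" option could be sketched roughly as follows, assuming both the unresponsive periods and the known bad actions (for example GC/CC pauses) are available as start/end timestamps on the same clock. The `Interval` type and function names are hypothetical, and how the pause intervals would actually be obtained is left open here.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    start_ms: float
    end_ms: float


def overlaps(a: Interval, b: Interval) -> bool:
    """True when the two half-open intervals share any time."""
    return a.start_ms < b.end_ms and b.start_ms < a.end_ms


def discard_known_pauses(events: List[Interval], pauses: List[Interval]) -> List[Interval]:
    """Keep only the unresponsive periods that no logged bad action explains."""
    return [e for e in events if not any(overlaps(e, p) for p in pauses)]


events = [Interval(0, 60), Interval(100, 220), Interval(300, 380)]
gc_pauses = [Interval(150, 200)]  # assumed to come from a separate log
print(discard_known_pauses(events, gc_pauses))
```

Only the remaining events would then count against the test.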

I can think of at least the following testcases to start out with:

* open a page containing a test plugin for the first time (which is a proxy measurement for startup time of the plugin process and plugin)
* open a new tab to about:blank
* open a new window to about:blank
* do a session restore with a small or large set of tabs
* open the bookmarks menu for the first time (before history is loaded?)

In terms of mozmill, I don't particularly care about whether you adapt an existing test harness, but the test harness needs to avoid introducing its own noise into the results, and the tests will need to run on a perf-stable set of machines, and we should be reporting the results on TBPL: my impression of the external mozmill harness is that it has been very hard to achieve these goals, which is why many developers are anti-mozmill in general.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #5)
> I can think of at least the following testcases to start out with:
> 
> * open a page containing a test plugin for the first time (which is a proxy
> measurement for startup time of the plugin process and plugin)
> * open a new tab to about:blank
> * open a new window to about:blank
> * do a session restore with a small or large set of tabs
> * open the bookmarks menu for the first time (before history is loaded?)

So I'm thinking it's safe to bet that we will want the tests running in chrome scope.
 
> In terms of mozmill, I don't particularly care about whether you adapt an
> existing test harness, but the test harness needs to avoid introducing its
> own noise into the results, and the tests will need to run on a perf-stable
> set of machines, and we should be reporting the results on TBPL: my
> impression of the external mozmill harness is that it has been very hard to
> achieve these goals, which is why many developers are anti-mozmill in
> general.

This is true. I had a brief discussion with Clint this morning about this. One of our goals for some time has been to isolate the driver aspect of mozmill from the test harness / reporting infrastructure, which would basically need to happen if I did go down this route. The main reason I'm considering mozmill is that I worry devs will find a need for more and more features until I basically find myself rebuilding mozmill from scratch. I have a feeling that isolating and using mozmill's driver might not be as bad as it looks; I'll investigate a bit.
I've documented some of my work over here: https://wiki.mozilla.org/Auto-tools/Projects/peptest
Code lives here: https://github.com/ahal/peptest

As it currently stands, you can write simple JS tests which can optionally use Mozmill's driver. Tests can call a 'performAction' method and pass a function to it. Only EventTracer logs that are generated while the test is inside a 'performAction' call are evaluated (to get rid of noise generated during setup, teardown, etc.).
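
To illustrate the windowing idea only (this is not peptest's actual code), here is a small Python sketch. It assumes the harness records a start and end timestamp around each action and that each EventTracer entry can be reduced to a (timestamp, duration_ms) pair. `ActionRecorder` and its methods are hypothetical names.

```python
import time
from contextlib import contextmanager


class ActionRecorder:
    """Records the time window of each action so other events can be ignored."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.windows = []  # list of (name, start, end)

    @contextmanager
    def perform_action(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the window even if the action raised.
            self.windows.append((name, start, self._clock()))

    def events_in_actions(self, events):
        """Filter (timestamp, duration_ms) pairs to those inside any window."""
        return [
            (ts, dur)
            for ts, dur in events
            if any(start <= ts <= end for _, start, end in self.windows)
        ]


recorder = ActionRecorder()
with recorder.perform_action("open_page"):
    pass  # the action under test would run here
```

Events logged during setup or teardown fall outside every recorded window and are dropped before evaluation.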

Currently I'm starting to work with releng to get this harness into buildbot staging.

As for discarding bad actions, I might be able to use some of the work that Dietrich has done (http://etherpad.mozilla.com:9000/responsiveness-profiling) though I still haven't really looked into this very much.

One problem that will need to be addressed (though isn't my priority right now) is that it is pretty common for an action to generate at least one responsiveness measurement over 50ms. This means that as it stands, pretty much every test that does something meaningful will fail. We could try doing something like uploading the test results to a server and then only fail a test if it regressed according to those results. Or we could simply adjust the threshold/interval to a 'reasonable' level.
See Also: → 692091
Do we know *why* every test would do something over 50ms? Assuming that we're not talking about GC/CC pauses which we should be blacklisting, we should probably understand what's taking so long and either customize the threshold per-test or fix the unexpected pause.
When I say *every test*, I actually mean every test that I've written for the purpose of testing the harness. There are a few reasons that most of them fail for me:

1) The tests I've written tend to be fairly long (writing shorter tests results in fewer actions and more passes)

2) Certain actions are less responsive. For example, opening a page always seems to generate a few events over 50ms, whereas other things (like searching on Google) never seem to have a problem.

3) I haven't looked into disabling GC/CC pauses yet, so it is possible that some of these messages should be discarded. I'm not quite sure yet how I would tell whether an EventTracer event comes from GC or not.

Another possible route we could take is to use the metric jlebar et al have been working on to calculate a single number from all the responsiveness times that occur during an action. Rather than failing a test if there is a single event over 50ms, we could fail the test if its responsiveness 'score' is above a certain number.

Regardless of whether we use a metric or stick to a simple threshold, per-test thresholds are definitely something I want to implement.
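
A rough sketch of how a score plus per-test thresholds might fit together. The scoring formula below (sum of squared excess over a 50ms floor, divided by 1000) is a placeholder chosen for illustration and is not the metric jlebar et al. are working on. The names `responsiveness_score`, `evaluate` and `per_test_limits` are hypothetical.

```python
def responsiveness_score(delays_ms, floor_ms=50):
    """Placeholder metric: penalise long delays more heavily than short ones."""
    return sum((d - floor_ms) ** 2 for d in delays_ms if d > floor_ms) / 1000.0


def evaluate(test_name, delays_ms, per_test_limits, default_limit=0.0):
    """Return (passed, score), using the test's own limit when one is set."""
    limit = per_test_limits.get(test_name, default_limit)
    score = responsiveness_score(delays_ms)
    return score <= limit, score


# Hypothetical configuration: page loads are allowed a higher score.
limits = {"open_page": 10.0}
print(evaluate("open_page", [60, 120, 40], limits))
print(evaluate("search", [60, 120, 40], limits))
```

With a default limit of 0.0, any test without its own entry behaves like the simple threshold: a single delay over the floor fails it.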
Depends on: 705342
Peptest has landed in m-c along with a make target (just run 'make peptest').
make target: https://hg.mozilla.org/mozilla-central/rev/65c05ff60e47

Resolving this fixed. Further bugs should be filed in the testing/peptest component.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: New Frameworks → General