Closed Bug 631571 Opened 13 years ago Closed 13 years ago

Measure browser responsiveness in Talos

Categories

(Testing :: Talos, defect)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ted, Assigned: jmaher)

References

Details

Attachments

(3 files, 3 obsolete files)

We'd like to measure browser responsiveness so we can find spots where we're currently not very responsive, and ensure that we improve the metric. In addition, we want to make sure that large architectural changes like e10s don't harm responsiveness of common tasks.

The scope of this work will be to get some measure of responsiveness out of the browser (an estimate of how long we're taking to service events), and have Talos exercise various functionality while sampling that measurement, and then have this all running on Tinderbox and reporting something useful.

We discussed this work with Blizzard a few months ago, I think Sheila took some notes. I'll have to dig through my email to see if I can find them.
Depends on: 606574
The patches in bug 606574 are currently written to fire a tracing event every 50ms (so 20/s), and report how long that event took to get processed. Right now the output just looks like:
MOZ_EVENT_INTERVAL <duration>

where duration is in integer milliseconds. We may need to tweak that to be more useful. The output goes to stdout by default, but can be redirected to a file by setting MOZ_INSTRUMENT_EVENT_LOOP_OUTPUT to a filename.
I should also have this in my email, I'll go hunting.
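For anyone post-processing this output, a minimal sketch of pulling the durations out of the `MOZ_EVENT_INTERVAL` lines described above (the parser helper is hypothetical; in practice the text would come from stdout or the file named by MOZ_INSTRUMENT_EVENT_LOOP_OUTPUT):

```python
# Sketch: parse "MOZ_EVENT_INTERVAL <duration>" lines from the
# instrumentation output. Non-matching lines are ignored, since the
# trace is interleaved with other browser output on stdout.

def parse_intervals(log_text):
    """Return the list of event-servicing durations (integer ms) in order."""
    durations = []
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "MOZ_EVENT_INTERVAL":
            durations.append(int(parts[1]))
    return durations

sample = """MOZ_EVENT_INTERVAL 12
MOZ_EVENT_INTERVAL 87
unrelated output
MOZ_EVENT_INTERVAL 240"""

print(parse_intervals(sample))  # -> [12, 87, 240]
```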
From the summary of the last meeting, here are the top 7 items requested:

1. Startup
   a. Time between startup and a visible window
   b. Time between the visible window appearing and the browser being completely responsive (able to type in a URL)
2. Switching tabs
3. Opening a new window
4. Quitting the browser
5. Awesome bar search results (how fast they return)
6. Right-click menu
7. Responsiveness in the foreground tab during page load
Good list, but mobile needs at least two more:

1) What about scrolling time?
2) What about zoom-in zoom-out time?

I think everyone would benefit if we started measuring these.
The instrumentation I'm adding in bug 606574 won't let us measure "how long does it take to do X?", I figured the existing Talos harness was good enough for that (Tpan, etc). What it will let us measure is "how responsive is the browser while doing X?", which we currently have no data on.

Specifically, it will tell you how long events take to be serviced at 20Hz samples, so we can perform an action (load a page), and determine roughly how responsive we are until the action finishes.
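The technique described above — post a probe to the event loop at a fixed interval and record how late each probe actually runs — can be sketched in plain Python. This only illustrates the idea; the real instrumentation in bug 606574 is C++ inside Gecko, and the 200 ms stall here is simulated:

```python
import queue
import threading
import time

# Sketch of event-loop lag sampling: a sampler thread posts a probe
# every 50 ms onto a work queue; the "event loop" services probes, and
# each probe records how long it sat in the queue before running.

def run_sampler(work_queue, lags, interval=0.05, samples=10):
    for _ in range(samples):
        posted = time.monotonic()
        # The probe measures its own servicing delay when it finally runs.
        work_queue.put(lambda t=posted: lags.append(time.monotonic() - t))
        time.sleep(interval)

lags = []
q = queue.Queue()
sampler = threading.Thread(target=run_sampler, args=(q, lags))
sampler.start()

# "Event loop": service tasks; one deliberately slow task blocks it once.
deadline = time.monotonic() + 1.0
blocked = False
while time.monotonic() < deadline:
    try:
        task = q.get(timeout=0.01)
    except queue.Empty:
        continue
    if not blocked:
        time.sleep(0.2)  # simulate a 200 ms unresponsive stretch
        blocked = True
    task()

sampler.join()
# Probes that queued up behind the 200 ms task show elevated lag.
print(max(lags) >= 0.1)  # -> True
```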
The UX team is helping to come up with a list of UI events that we should prioritize the measurement and optimization of. I'll post it here once it's been gathered.

Re comment #5: We should do some manual testing of common events (new tab, scroll on large page) to verify that attempting to correlate responsiveness via the event loop logging is useful for these higher-level user scenarios.
I've run through a few non-scientific tests while developing the patch, and it's definitely producing measurable data. You can watch the event servicing time spike when you load a new page or do other intensive tasks. I've got somewhat-outdated try server builds here that you can try out if you'd like to see what it produces:
http://people.mozilla.com/~tmielczarek/event-loop-instrument/
That's fantastic, thanks Ted!

The next bit would be to find a way to make the correlation, either in the harness, or maybe by doing post-processing of the event loop logs and a log we spit out from the high-level tests, possibly via timestamp?
Yeah, I hadn't quite worked that out yet. I was worried about getting timestamps that actually line up, given how finicky system clocks can be. I'm using mozilla::TimeStamp internally for interval measurement, but it doesn't actually let you get a wallclock time, only intervals between points in time.
As a note, we're punting on Talos measurement for this quarter. We're going to figure out how to measure something useful on developer machines first, then we'll figure out what we want to automate.
Assignee: ted.mielczarek → anodelman
From discussions in email/IRC, the initial tests we are going to measure responsiveness during are:

- window open
- tab open
- tp5
As a proof of concept, I flipped on Ted's responsiveness measurement while running a single cycle of tdhtml (a test that cycles through a few DHTML-heavy web pages).

Here's the output from the instrumentation:

MOZ_EVENT_TRACE start 1311194309792
MOZ_EVENT_TRACE sample 1311194310055 109
MOZ_EVENT_TRACE sample 1311194310158 51
MOZ_EVENT_TRACE sample 1311194311994 152
MOZ_EVENT_TRACE sample 1311194315275 67
MOZ_EVENT_TRACE sample 1311194320416 244
MOZ_EVENT_TRACE sample 1311194322645 87
MOZ_EVENT_TRACE stop 1311194326678

We can see that we exceeded the responsiveness measure of the instrumentation several times - but I don't know how to correlate that to exactly what the browser was doing at the time.  Ted - do you have any advice here?
If we're just cycling through pages, maybe we can get load start times for each page? Then developers would be able to see which pages were the cause of the unresponsive periods.
I was going to ask the same question w.r.t. tp5 as mentioned in comment 11. How are you going to measure responsiveness during the Tp5 test? Is it the time before we load the page, i.e. from "starting page n load" to "page n load finished"? Or the time from the "page n loaded" event to "page n+1 loaded"?
The numbers in comment 12 are simply samples from when the event loop took longer than 50ms to respond to an event. You can correlate them to whatever you want, provided you have timestamps; the third field is a unix timestamp in milliseconds.

But most of our Talos tests don't try to drill down into too much detail, do they? They simply give you a number.
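Since the third field is a unix timestamp in milliseconds, extracting the slow samples for correlation can be done entirely in post-processing — a sketch, with the line format taken from the output in comment 12 and the helper name hypothetical:

```python
def parse_trace_samples(log_text):
    """Extract (timestamp_ms, duration_ms) pairs from MOZ_EVENT_TRACE
    "sample" lines; "start"/"stop" lines carry only a timestamp and are
    skipped here."""
    samples = []
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[:2] == ["MOZ_EVENT_TRACE", "sample"]:
            samples.append((int(parts[2]), int(parts[3])))
    return samples

log = """MOZ_EVENT_TRACE start 1311194309792
MOZ_EVENT_TRACE sample 1311194310055 109
MOZ_EVENT_TRACE sample 1311194320416 244
MOZ_EVENT_TRACE stop 1311194326678"""

print(parse_trace_samples(log))
# -> [(1311194310055, 109), (1311194320416, 244)]
```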
Here's the mixed together results for tdhtml for comparison:

NOISE: MOZ_EVENT_TRACE start 1311278181023
NOISE: MOZ_EVENT_TRACE sample 1311278181283 106
NOISE: MOZ_EVENT_TRACE sample 1311278181453 67
NOISE: Cycle 1: loaded http://localhost/page_load_test/dhtml/colorfade.html (next: http://localhost/page_load_test/dhtml/diagball.html)
NOISE: MOZ_EVENT_TRACE sample 1311278183256 219
NOISE: Cycle 1: loaded http://localhost/page_load_test/dhtml/diagball.html (next: http://localhost/page_load_test/dhtml/fadespacing.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dhtml/fadespacing.html (next: http://localhost/page_load_test/dhtml/imageslide.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dhtml/imageslide.html (next: http://localhost/page_load_test/dhtml/layers1.html)
NOISE: MOZ_EVENT_TRACE sample 1311278186559 90
NOISE: Cycle 1: loaded http://localhost/page_load_test/dhtml/layers1.html (next: http://localhost/page_load_test/dhtml/layers2.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dhtml/layers2.html (next: http://localhost/page_load_test/dhtml/layers4.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dhtml/layers4.html (next: http://localhost/page_load_test/dhtml/layers5.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dhtml/layers5.html (next: http://localhost/page_load_test/dhtml/layers6.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dhtml/layers6.html (next: http://localhost/page_load_test/dhtml/meter.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dhtml/meter.html (next: http://localhost/page_load_test/dhtml/movingtext.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dhtml/movingtext.html (next: http://localhost/page_load_test/dhtml/mozilla.html)
NOISE: MOZ_EVENT_TRACE sample 1311278191699 240
NOISE: Cycle 1: loaded http://localhost/page_load_test/dhtml/mozilla.html (next: http://localhost/page_load_test/dhtml/replaceimages.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dhtml/replaceimages.html (next: http://localhost/page_load_test/dhtml/scrolling.html)
NOISE: MOZ_EVENT_TRACE sample 1311278193913 71
NOISE: Cycle 1: loaded http://localhost/page_load_test/dhtml/scrolling.html (next: http://localhost/page_load_test/dhtml/slidein.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dhtml/slidein.html (next: http://localhost/page_load_test/dhtml/slidingballs.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dhtml/slidingballs.html (next: http://localhost/page_load_test/dhtml/textslide.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dhtml/textslide.html (next: http://localhost/page_load_test/dhtml/zoom.html)
NOISE: Cycle 1: loaded http://localhost/page_load_test/dhtml/zoom.html (next: http://localhost/page_load_test/dhtml/colorfade.html)


I think that if I added a timestamp to the pageload information we might be able to do a proper correlation.
Still I'm not sure what metric we want to be monitoring - should we do something like "amount of unresponsive events during test"?  Then you'd still need a human to look over the results to see where things got worse.
And again with more timestamps:

NOISE: MOZ_EVENT_TRACE start 1311279480976
NOISE: MOZ_EVENT_TRACE sample 1311279481244 114
NOISE: 1311279482448: Cycle 1: loaded http://localhost/page_load_test/dhtml/colorfade.html (next: http://localhost/page_load_test/dhtml/diagball.html)
NOISE: MOZ_EVENT_TRACE sample 1311279483176 147
NOISE: 1311279483773: Cycle 1: loaded http://localhost/page_load_test/dhtml/diagball.html (next: http://localhost/page_load_test/dhtml/fadespacing.html)
NOISE: 1311279485150: Cycle 1: loaded http://localhost/page_load_test/dhtml/fadespacing.html (next: http://localhost/page_load_test/dhtml/imageslide.html)
NOISE: 1311279485852: Cycle 1: loaded http://localhost/page_load_test/dhtml/imageslide.html (next: http://localhost/page_load_test/dhtml/layers1.html)
NOISE: MOZ_EVENT_TRACE sample 1311279486457 68
NOISE: 1311279486567: Cycle 1: loaded http://localhost/page_load_test/dhtml/layers1.html (next: http://localhost/page_load_test/dhtml/layers2.html)
NOISE: 1311279487112: Cycle 1: loaded http://localhost/page_load_test/dhtml/layers2.html (next: http://localhost/page_load_test/dhtml/layers4.html)
NOISE: 1311279487657: Cycle 1: loaded http://localhost/page_load_test/dhtml/layers4.html (next: http://localhost/page_load_test/dhtml/layers5.html)
NOISE: 1311279488342: Cycle 1: loaded http://localhost/page_load_test/dhtml/layers5.html (next: http://localhost/page_load_test/dhtml/layers6.html)
NOISE: 1311279488885: Cycle 1: loaded http://localhost/page_load_test/dhtml/layers6.html (next: http://localhost/page_load_test/dhtml/meter.html)
NOISE: 1311279489824: Cycle 1: loaded http://localhost/page_load_test/dhtml/meter.html (next: http://localhost/page_load_test/dhtml/movingtext.html)
NOISE: 1311279490634: Cycle 1: loaded http://localhost/page_load_test/dhtml/movingtext.html (next: http://localhost/page_load_test/dhtml/mozilla.html)
NOISE: MOZ_EVENT_TRACE sample 1311279491566 212
NOISE: 1311279492340: Cycle 1: loaded http://localhost/page_load_test/dhtml/mozilla.html (next: http://localhost/page_load_test/dhtml/replaceimages.html)
NOISE: 1311279493135: Cycle 1: loaded http://localhost/page_load_test/dhtml/replaceimages.html (next: http://localhost/page_load_test/dhtml/scrolling.html)
NOISE: MOZ_EVENT_TRACE sample 1311279493730 64
NOISE: 1311279493743: Cycle 1: loaded http://localhost/page_load_test/dhtml/scrolling.html (next: http://localhost/page_load_test/dhtml/slidein.html)
NOISE: 1311279495274: Cycle 1: loaded http://localhost/page_load_test/dhtml/slidein.html (next: http://localhost/page_load_test/dhtml/slidingballs.html)
NOISE: 1311279495934: Cycle 1: loaded http://localhost/page_load_test/dhtml/slidingballs.html (next: http://localhost/page_load_test/dhtml/textslide.html)
NOISE: 1311279496723: Cycle 1: loaded http://localhost/page_load_test/dhtml/textslide.html (next: http://localhost/page_load_test/dhtml/zoom.html)
NOISE: 1311279497428: Cycle 1: loaded http://localhost/page_load_test/dhtml/zoom.html (next: http://localhost/page_load_test/dhtml/colorfade.html)
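With timestamps on both record types, each slow sample can be attributed to the page that was loading at the time. A rough sketch, assuming records pre-parsed from logs shaped like the one above; the attribution rule used here (blame the next page to finish loading) is just one plausible choice:

```python
import bisect

# Hypothetical pre-parsed records, shaped like the log above:
# slow event-loop samples as (timestamp_ms, duration_ms), and page
# loads as (finish_timestamp_ms, url), sorted by timestamp.
samples = [(1311279483176, 147), (1311279491566, 212)]
loads = [
    (1311279482448, "colorfade.html"),
    (1311279483773, "diagball.html"),
    (1311279492340, "mozilla.html"),
]

def attribute(samples, loads):
    """Blame each slow sample on the next page to finish loading after it."""
    finish_times = [t for t, _ in loads]
    blamed = []
    for ts, duration in samples:
        i = bisect.bisect_left(finish_times, ts)
        page = loads[i][1] if i < len(loads) else "(after last load)"
        blamed.append((page, duration))
    return blamed

print(attribute(samples, loads))
# -> [('diagball.html', 147), ('mozilla.html', 212)]
```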
(In reply to comment #17)
> Still I'm not sure what metric we want to be monitoring - should we do
> something like "amount of unresponsive events during test"?  Then you'd
> still need a human to look over the results to see where things got worse.

This is the crux of the issue. bsmedberg at one point suggested reporting something like the max/median/mean values, but the latter two probably aren't quite right since we're reporting only samples >50ms. I could change the code to report all samples again, but it's a lot (20 samples per second at most).
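For concreteness, the candidate summaries mentioned (max/median/mean) are trivial to compute over the reported samples — with the caveat from the comment above that median and mean are biased while only samples over 50ms are reported:

```python
import statistics

# Durations (ms) of the slow samples from comment 12.
slow_samples = [109, 51, 152, 67, 244, 87]

summary = {
    "max": max(slow_samples),
    "median": statistics.median(slow_samples),
    "mean": round(statistics.mean(slow_samples), 1),
}
print(summary)
# -> {'max': 244, 'median': 98.0, 'mean': 118.3}
```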
I don't think we should treat responsiveness the same way we treat speed performance numbers: there isn't a single number to report, in most cases. 

Instead, I suspect it almost needs to be like "responsiveness mochitesting": we have a series of actions that we know do or could or used to cause the event loop to go away for more than 50ms. If we've fixed the problem, any latency over 50ms would cause the test to fail/go orange/whatever. If we haven't yet fixed the problem, we measure the latency and make sure it doesn't get worse.

There are some issues with this, mostly the random pauses which we can't control yet (GC/CC), so perhaps we need a way to blacklist/measure those separately from the main test item.
I filed bug 674606 on bsmedberg's "responsiveness mochitest" idea, after discussion we're going to continue to work on this as well, since it should produce useful data in the short term.
Depends on: 679427
Blocks: 674606
Graph server patch already applied to graphs-stage for testing.
Attachment #560439 - Flags: review?(jmaher)
Currently missing:
- ability to parametrize time between pings to event loop/threshold for recording slow pings as already described in bug 679427 (waiting on checkin, deployment)
- determination of the 'alpha' variable to be used in calculating the responsiveness metric
- further testing to ensure that the responsiveness test option doesn't wobble the regular tp5 results (this would make roll out a lot simpler)
Attachment #560439 - Flags: review?(jmaher) → review+
(In reply to alice nodelman [:alice] [:anode] from comment #24)
> Currently missing:
> - ability to parametrize time between pings to event loop/threshold for
> recording slow pings as already described in bug 679427 (waiting on checkin,
> deployment)
> - determination of the 'alpha' variable to be used in calculating the
> responsiveness metric
> - further testing to ensure that the responsiveness test option doesn't
> wobble the regular tp5 results (this would make roll out a lot simpler)

Is this the full set of things you see standing between the talos user responsiveness test and rolling it out to staging?

How do you determine the alpha variable?  I'm not clear on what that step entails. (sorry if you explained it already and I missed it).
Version: unspecified → Trunk
There's currently a discussion over email between jlebar and bz about how to calculate a useful metric from the responsiveness data we are collecting.

Ted has checked in the ability to set the time between event loop pings/threshold for recording pings to moz-central, so that is done.

I will be working to complete the further testing to determine if adding the responsiveness test will perturb the general tp5 results - that will decide if we can roll the test out as part of standard tp5 or if it needs to be split off into a separate test run (which would be expensive, as tp has our longest test run time).
Depends on: 690203
Adding correct names to graph server db, now that mozAfterPaint tests are standard.  Already applied to graphs-stage.
Attachment #563464 - Flags: review?(jmaher)
Attachment #563464 - Flags: review?(jmaher) → review+
As a note, the SQL pages haven't landed on the graph server. I am not sure if this bug is ready for those. Please let me know whether we need to add more to the server or can just land these, and I can land them in the graphs repository.
Depends on: 693679
Updated patch with the latest responsiveness calculation, and unbitrotted.
Assignee: anodelman → jmaher
Attachment #563467 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Depends on: 696059
Updated patch; it passes on Talos staging. The numbers for tp5 do not change, so we don't need to stage this side by side.
Attachment #567451 - Attachment is obsolete: true
Attachment #568714 - Flags: review?(wlachance)
Comment on attachment 568714 [details] [diff] [review]
have talos support 'responsiveness' test option (1.0)

Overall looks like a nice straightforward addition to Talos. My only comment:

+def responsiveness_Metric(val_list):

Should probably be responsiveness_metric, going by PEP8, which is endorsed by the Mozilla style guidelines (https://developer.mozilla.org/En/Mozilla_Coding_Style_Guide#Python_practices). I know we have 3(!) different styles in this file but we might as well settle on the "right" one for new code.
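For illustration, the rename the review asks for would look like the sketch below. The body shown is an assumption based on the squared-penalty idea from the metric discussion earlier in this bug (penalize one long pause more than several short ones), not necessarily the exact code in the reviewed patch:

```python
def responsiveness_metric(val_list):
    """PEP8-style rename of responsiveness_Metric.

    Illustrative body (assumed, not the reviewed patch): sum of squared
    event delays in ms, scaled down, so a single long pause scores much
    worse than many short pauses with the same total duration.
    """
    return sum(float(x) * float(x) / 1000000.0 for x in val_list)

# One 200 ms pause outscores four 50 ms pauses (same 200 ms total).
print(responsiveness_metric([200]) > responsiveness_metric([50] * 4))
# -> True
```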
Attachment #568714 - Flags: review?(wlachance) → review+
Comment on attachment 568714 [details] [diff] [review]
have talos support 'responsiveness' test option (1.0)

Oh, one other minor comment: do we actually need to specify responsiveness: False for most of the tests in the config file? It looks like we just assume the value is false if it's not set...
Depends on: 696810
Glad to see that this has been resolved. Has this update been deployed or when is it expected to be deployed?
this is deployed, so when builds start running again, we will see results :)  I will post a link in this bug to a graph server view to show the responsiveness data.
We had to back this out due to OSX 10.5.8 and 10.6.2 failing to terminate the browser and Fedora crashing.
Depends on: 697555
Depends on: 697769
Depends on: 702351
ok, ready to deploy again in bug 702351
Thanks Joel. I added a link to the project Snappy wiki [1]. We'll review during next week's call.

[1]https://wiki.mozilla.org/Performance/Snappy#Current_Infrastructure
Depends on: 904642