Last Comment Bug 845943 - Talos tscroll (incorrectly) depends on timing
: Talos tscroll (incorrectly) depends on timing
Status: RESOLVED FIXED
[SfN]
:
Product: Datazilla
Classification: Other
Component: Metrics (show other bugs)
: unspecified
: x86_64 Windows 7
: -- normal
: ---
Assigned To: Avi Halachmi (:avih)
:
Mentors:
: 872803 (view as bug list)
Depends on: 873180 888899
Blocks: 590422 771326 897054
  Show dependency treegraph
 
Reported: 2013-02-27 11:30 PST by Avi Halachmi (:avih)
Modified: 2014-07-17 09:13 PDT (History)
16 users (show)
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
Unified scrolling, report raw instead of average, use requestAnimationFrame, use performance.now (15.72 KB, patch)
2013-03-07 13:28 PST, Avi Halachmi (:avih)
jmaher: review+
jmaher: feedback+
Details | Diff | Splinter Review
v2: cleanup and remove debug code (14.18 KB, patch)
2013-03-09 12:45 PST, Avi Halachmi (:avih)
no flags Details | Diff | Splinter Review
v3 - restored fallback requestAnimationFrame (Opera desktop needs it) (14.73 KB, patch)
2013-03-10 09:52 PDT, Avi Halachmi (:avih)
jmaher: review+
Details | Diff | Splinter Review

Description Avi Halachmi (:avih) 2013-02-27 11:30:08 PST
All the Talos tscroll tests currently do:

1. Load a scrollable page (different "real-world" content for each sub-test).
2. Scroll NN pixels every 10ms - until reaching the bottom.
3. Measure and report average interval in microseconds (supposedly, hopefully as close as possible to 10000).

I think the test is not measuring a well defined parameter and thus providing meaningless results:

- If the system is fast enough (per sub-test) to complete a single scroll step in under ~10ms, all the results would be 10000 (depending on timer accuracy) + <some average delay due to refresh driver paint collision with the 10ms intervals> -> the results depend mostly on these paint collisions and timer accuracy and are mostly unrelated to the different scrolled content types.

- If the system is not fast enough, the number of scroll steps which would actually be rendered (and thus the major contribution to the reported result on this case) depend almost exclusively on the refresh driver average interval (since only those will trigger layout flush + rendering of the changes) -> the results would converge to a multiple of 16.67ms (depending on intervals accuracy), since the refresh driver tries to fire on vsync (or equivalent) which is an independent fixed interval. Again, almost regardless of the content type.

- If the system is marginally fast enough (per content), the result would flip randomly between the above two clusters.

-> So the content type would only split the results into one of the above categories, with probably meaningless variance and absolute value.


After some discussion with jmaher and jrmuizel, my plan is to instead measure the following:

1. Smoothness/jitter with few, equally light, contents:
- This would be std-dev of the delta from the optimal interval (16.667ms intervals now, possibly real vsync, etc).
- Not taking into account jank (exsessivly long intervals). This should have a tjank of its own. 

2. The speed at which we're able to scroll (and render each frame), per few pre-defined "real-world" contents.
- By setting layout.frame_rate=1000, and scrolling using rAF as fast as possible. This will probably have a 1ms resolution, and will depend on consistent performance of the test machines.

We could combine these tests into one with normal refresh interval, making #1 quantitative (of rAF consistency/accuracy), and #2 boolean (threshold which we don't wanna cross).
Comment 1 Joel Maher ( :jmaher) 2013-02-28 03:55:10 PST
if you could provide a patch to the talos repository (http://hg.mozilla.org/build/talos) to implement this, I can work on making sure we get this turned on in automation as well as verify the numbers over time.
Comment 2 Avi Halachmi (:avih) 2013-03-06 02:42:53 PST
I have a suspicion that currently tscroll measures jank more than anything else.

When testing locally using the tscroll webpage in a browser (not via the talos framework), result seems mostly related to the number of small hangs which happened during the scroll. Not sure if this extends to platforms other than Windows 7 (or if it also happens within the talos framework), but I'm guessing yes for both.
Comment 3 Avi Halachmi (:avih) 2013-03-07 13:28:09 PST
Created attachment 722456 [details] [diff] [review]
Unified scrolling, report raw instead of average, use requestAnimationFrame, use performance.now

Changes:
- Replaced individual scroll functions (for a window, iframe and arbitrady DOM element) with a unified parametric one which can handle all scrolls and all reports.
- Replaced 10ms intervals with requestAnimationFrame.
- Replaced Date.now() intervals measurements with performance.now() (microseconds resolution instead of ms when possible).
- Replaced a single report per test (of the average interval) with all raw data points (intervals).
- Increased the svg scroll test length by scrolling in smaller steps (all other tests use 5px steps, the svg test used 30px and took about 300ms to complete with ~20 intervals, now using 10px).

Notes:
- The first few scroll intervals are typically longer (to much longer) and can greatly affect the average and stddev. Ignoring the first 5 data points reduces noise greatly (e.g. in one case, stddev is 40ms for all data points, 2-3ms if ignoring the first few).
- All the tscroll files are using tabs for indentation. If this gets r+, I'll post an updated patch with tabs->spaces (till then it'll just make the diff noisier).
Comment 4 Avi Halachmi (:avih) 2013-03-07 13:46:46 PST
Also, this patch makes all the tests work on all browsers. When ignoring the first 5 data points, looks like we're doing similar to Chrome (stddev ~0.8ms), better than Opera (stddev 1-3ms), but worse than IE10 (stddev ~0.15ms). Looks like IE is properly using vsync.
Comment 5 Joel Maher ( :jmaher) 2013-03-07 14:27:20 PST
Comment on attachment 722456 [details] [diff] [review]
Unified scrolling, report raw instead of average, use requestAnimationFrame, use performance.now

Review of attachment 722456 [details] [diff] [review]:
-----------------------------------------------------------------

this is shaping up well.

::: talos/page_load_test/scroll/scroll-test.js
@@ +18,5 @@
> +
> +      return Math.sqrt(
> +        values.map(function (v) { return Math.pow(v - avg, 2); })
> +              .reduce(function (a, b) { return a + b; }) / (values.length - 1));
> +    }

I would like to remove these sum/average/stddev functions, but we could use them for reporting in the webpage; I just know we won't use them in talos proper.  Would you like these functions to remove the first 5 replicates from the data set?

@@ +105,5 @@
> +        durations.pop(); // Last step was 0.
> +        durations.pop(); // and the prev one was shorter and with end-of-page logic, ignore both.
> +
> +        for (var i in durations) {
> +          tpRecordTime(durations[i]);

this should be a comma separated list of values, for example: http://hg.mozilla.org/build/talos/file/880a267bfacc/talos/page_load_test/kraken/driver.html#l163
Comment 6 Joel Maher ( :jmaher) 2013-03-07 14:27:48 PST
Comment on attachment 722456 [details] [diff] [review]
Unified scrolling, report raw instead of average, use requestAnimationFrame, use performance.now

Review of attachment 722456 [details] [diff] [review]:
-----------------------------------------------------------------

this is shaping up well.

::: talos/page_load_test/scroll/scroll-test.js
@@ +18,5 @@
> +
> +      return Math.sqrt(
> +        values.map(function (v) { return Math.pow(v - avg, 2); })
> +              .reduce(function (a, b) { return a + b; }) / (values.length - 1));
> +    }

I would like to remove these sum/average/stddev functions, but we could use them for reporting in the webpage; I just know we won't use them in talos proper.  Would you like these functions to remove the first 5 replicates from the data set?

@@ +105,5 @@
> +        durations.pop(); // Last step was 0.
> +        durations.pop(); // and the prev one was shorter and with end-of-page logic, ignore both.
> +
> +        for (var i in durations) {
> +          tpRecordTime(durations[i]);

this should be a comma separated list of values, for example: http://hg.mozilla.org/build/talos/file/880a267bfacc/talos/page_load_test/kraken/driver.html#l163
Comment 7 Avi Halachmi (:avih) 2013-03-07 23:03:23 PST
(In reply to Joel Maher (:jmaher) from comment #6)
> I would like to remove these sum/average/stddev functions, but we could use
> them for reporting in the webpage; I just know we won't use them in talos
> proper.

Yes, they definitely don't belong to talos proper. I'll remove them.

That being said, I found them very useful as a debugging tool which lets me test within the browser by just reloading the page. I also noticed that other talos tests (including tscroll) have similar facilities, commented out alerts, averages, etc. So it seems like a common need.

How about we put them instead in a stand alone JS include file (close to the top of the tree) which analyzes the data set when calling tpReportTime? (are there other supported similar report functions?). It will display some common statistical values about the set, possibly with some less obvious ones, such as warm-up iterations (see next). And it can be included in any talos test file within the repository, since it will only add its alternative tpRecordTime if the talos one isn't there (as my patch does). talos-debug.js ?


> Would you like these functions to remove the first 5 replicates
> from the data set?

Not sure what "these functions" are.

As for removing the first 5 items, I'm ambivalent with that. On one hand, they obviously pollute the real value we're after (e.g. the first 2 frame shouldn't really be a part of a scroll smoothness report). On the other hand, the number of warm-up iterations which is required before the values stabilize carries a weight of its own. If this number increases, then it's probably a real regression, and vice verse.

Since it might not be practical to manually continuously adjust the warmup iterations values per test, I was thinking of estimating this value systematically, per test, from the data set itself.

The talos-debug.js can do that (seems feasible to me), and the talos framework (DB etc) can calculate and hold this value as well, and report it.

If this works out as well as I hope, it will reduce the overall noise level within talos reports, and will provide another value which can be tracked orthogonally to the warmed-up test results.
Comment 8 Joel Maher ( :jmaher) 2013-03-08 07:54:22 PST
I like the idea of a talos-debug.js that we could include in various tests while running locally. I would recommend in the talos/scripts directory (which also holds MozillaFileLogger.js)

It would be nice to have these functions take a parameter for the number of replicates to avoid:
function avg(dataset, numIgnore)...
Comment 9 Avi Halachmi (:avih) 2013-03-09 12:45:59 PST
Created attachment 723109 [details] [diff] [review]
v2: cleanup and remove debug code

Instead of the debug functions, I've left a small piece of code which loads talos-debug.js and uses if it exists.

After 5 data points we should have relatively stable data. The code sends all the raw data, so the initialization-bias should be removed from outside the test itself.
Comment 10 Avi Halachmi (:avih) 2013-03-09 13:23:50 PST
(In reply to Avi Halachmi (:avih) from comment #9)
> ... so the initialization-bias should be removed from outside
> the test itself.

s/removed/handled/

Also, I filed bug 849558 for talos-debug.js, and included an initial patch.
Comment 11 Avi Halachmi (:avih) 2013-03-10 09:52:17 PDT
Created attachment 723210 [details] [diff] [review]
v3 - restored fallback requestAnimationFrame (Opera desktop needs it)
Comment 12 Joel Maher ( :jmaher) 2013-03-11 07:21:46 PDT
Comment on attachment 723210 [details] [diff] [review]
v3 - restored fallback requestAnimationFrame (Opera desktop needs it)

Review of attachment 723210 [details] [diff] [review]:
-----------------------------------------------------------------

I will run this on try server and compare this to production.

::: talos/page_load_test/scroll/scroll-test.js
@@ +41,5 @@
> +            window.mozRequestAnimationFrame    ||
> +            window.webkitRequestAnimationFrame ||
> +            window.oRequestAnimationFrame      ||
> +            window.msRequestAnimationFrame     ||
> +            rAFFallback;

I don't know what rAR stands for, could we make this more descriptive?
Comment 13 Avi Halachmi (:avih) 2013-03-11 07:47:05 PDT
(In reply to Joel Maher (:jmaher) from comment #12)
> I will run this on try server and compare this to production.

The values will not be comparable to historic results because:
- It uses requestAnimationFrame (16.67ms intervals) instead of 10ms.
- It reports all raw values instead of the average only.

> > +            window.mozRequestAnimationFrame    ||
> > +            window.webkitRequestAnimationFrame ||
> > +            window.oRequestAnimationFrame      ||
> > +            window.msRequestAnimationFrame     ||
> > +            rAFFallback;
> 
> I don't know what rAR stands for, could we make this more descriptive?

rAF commonly refers to requestAnimationFrame, as the patch title suggests, and as appears on all the method names which you quoted...

Also, since it now reports all the raw values (including the first few noisy ones), will it be handled at the python script which processes it?
Comment 14 Matt Woodrow (:mattwoodrow) 2013-05-15 16:54:54 PDT
*** Bug 872803 has been marked as a duplicate of this bug. ***
Comment 15 Matt Woodrow (:mattwoodrow) 2013-05-15 16:57:27 PDT
This bug is required for enabling OMTC on mac, as we get a performance regression without it.

OMTC changes the amount of time it takes (on the main thread) to paint, and so we end up dropping less frames and take longer.

I filed bug 872803 for this, which I've now closed as a duplicate. It creates a small merge conflict with this patch which can be ignored, or we can back the other patch out.
Comment 16 Matt Woodrow (:mattwoodrow) 2013-05-15 17:04:40 PDT
(In reply to Avi Halachmi (:avih) from comment #3)
 
> Notes:
> - The first few scroll intervals are typically longer (to much longer) and
> can greatly affect the average and stddev. Ignoring the first 5 data points
> reduces noise greatly (e.g. in one case, stddev is 40ms for all data points,
> 2-3ms if ignoring the first few).

For the tests that are scrolling a frame, this is expected. We initially have the frame marked as 'inactive' and once we start scrolling we switch to 'active' we need to repaint the entire frame area. Scrolls after this point should only need to repaint the changed area.

It's potentially interesting data at least, though it makes sense to separate it from the remainder of the data points.
Comment 17 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2013-05-15 17:22:02 PDT
Joel, what do we need to do to get this landed?
Comment 18 Avi Halachmi (:avih) 2013-05-16 03:20:08 PDT
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #17)
> Joel, what do we need to do to get this landed?

AFAIK a script to process the different output, and possibly a decision on what should that processing be (if we want to change/enhance the current type of processing).

The "old" tscroll only reported average frame interval per test (taking 100% of the intervals into account within the test itself). The new test reports all frame intervals per test.

Also, it should be noted that while the bug title mentions that tscroll was depending on timing and we'd like to improve that dependency (whatever that means), in practice, the patch essentially changes two things:
- Changes scroll intervals from 10ms to requestAnimationFrame (16.67ms by default).
- Reports all frame intervals instead of average only (per sub-test).

I'd say that the stddev per sub-test is not less important than the average interval. The average interval is basically a cliff, since it'll start dropping (from 16.67ms by default) only from a certain level of regression, while stddev represents the stability of the intervals, which is quite important while scrolling, and probably more sensitive to smoothness regressions.

So few options for post-processing script:
1. To keep the same type of output as the current tscroll (but a bit better): single score - Drop the first few interval and report average of the rest.
2. Two scores per test: average and stddev after dropping the first few intervals (5).
3. Four scores per test: same as #2 but add stddev (WRT avg from #2) and average of the first few frames.
4. Somehow provide a single score which weights in the discrete scores from #2 or #3 (3 is better probably).
Comment 19 Joel Maher ( :jmaher) 2013-05-16 07:13:59 PDT
Right now with the patch here we pass on all platforms and report the same single number to graph server.  Our overall goal is to use datazilla so we can look at >1 number easily (i.e. trend mean, stddev, mean, etc...) for each page.

I will work on deploying this change soon so we can enjoy the fixes to the tests, and we need to continue work on actually putting the numbers to real use!
Comment 20 Joel Maher ( :jmaher) 2013-05-16 07:23:49 PDT
landed on talos:
http://hg.mozilla.org/build/talos/rev/e1611c2d088b

will work on deploying later today or first thing tomorrow.
Comment 21 Robert O'Callahan (:roc) (Exited; email my personal email if necessary) 2013-05-16 15:08:52 PDT
Thanks!!!
Comment 22 Avi Halachmi (:avih) 2013-05-20 00:29:45 PDT
Will you post some links once it's operational such that we could assess it's usefulness? Thanks.
Comment 23 Joel Maher ( :jmaher) 2013-05-21 12:27:52 PDT
Got this landed after a false start last week, the new values post 16.6x whereas the previous test ran at 100xx:
* linux 32: http://graphs.mozilla.org/graph.html#tests=[[222,131,33]]
* linux 64: http://graphs.mozilla.org/graph.html#tests=[[222,131,35]]
* osx 10.6: http://graphs.mozilla.org/graph.html#tests=[[222,63,21]]
* osx 10.7: http://graphs.mozilla.org/graph.html#tests=[[222,63,22]]
* osx 10.8: http://graphs.mozilla.org/graph.html#tests=[[222,63,24]]
* winxp: http://graphs.mozilla.org/graph.html#tests=[[222,131,1]]
* win7: http://graphs.mozilla.org/graph.html#tests=[[222,131,25]]
* win8: http://graphs.mozilla.org/graph.html#tests=[[222,131,31]]

Give this a couple days to see a real trend of the data.
Comment 24 Avi Halachmi (:avih) 2013-05-21 12:59:53 PDT
Nice. So how is the output processed? averaging the intervals except the first few? this is a similar cliff to the original tscroll test, just with 60% further threshold, and more correct.

We could enhance it by:

1. Also measuring stddev of the reported intervals (instead of stddev of average interval between runs).

2. Add a proper performance regression test (not a cliff) by running it also with the preference layout.frame_rate=1000. This will make it run as fast as possible, and will detect performance regressions, without any modification to the test itself (except the prior change of that pref).
Comment 25 Avi Halachmi (:avih) 2013-05-30 07:01:13 PDT
Resolved: FIXED?

We could open a new bug for improvements.

Also, the graphs from comment 23 show the wrong scale even if looking at the last 7 days and all the results are similar.
Comment 26 Boris Zbarsky [:bz] 2013-05-30 07:37:08 PDT
> Resolved: FIXED?

So what is the new number showing, exactly?  There were the mails about "99.8%" improvements to tree-management, and dholbert asked what the story was a week ago, but no response from anyone involved in this bug...
Comment 27 Avi Halachmi (:avih) 2013-05-30 07:45:02 PDT
(In reply to Boris Zbarsky (:bz) from comment #26)
> So what is the new number showing, exactly?  There were the mails about
> "99.8%" improvements to tree-management, and dholbert asked what the story
> was a week ago, but no response from anyone involved in this bug...

Missed those emails.

The scale and value of the results changed.

The original tests were reporting average interval in microseconds, and the iterations were set to 10 milliseconds using setTimeout, so the results were around 10000.

The modified tests iterate using rAF, and report all intervals, which are later processed in a script which drops the first few and averages the rest. This produces results around 16.67 (ms).

10000 -> 16.67 is probably that 98% "improvement".
Comment 28 Boris Zbarsky [:bz] 2013-05-30 07:53:27 PDT
I see.  Worth answering dholbert's email on the list then (and maybe an ahead-of-time heads-up next time so people know to ignore the notifications).
Comment 29 Avi Halachmi (:avih) 2014-07-17 09:13:38 PDT
We're done here.

ASAP mode solves timing dependency by iterating as fast as possible regardless of vsync or other "60hz" timers.

ASAP mode is the equivalent (and sometimes literally also) turning vsync synchronization off.

Note You need to log in before you can comment on or make changes to this bug.