Closed Bug 845943 Opened 12 years ago Closed 11 years ago

Talos tscroll (incorrectly) depends on timing

Categories

(Datazilla Graveyard :: Metrics, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: avih, Assigned: avih)

References

Details

(Whiteboard: [SfN])

Attachments

(1 file, 2 obsolete files)

All the Talos tscroll tests currently do: 1. Load a scrollable page (different "real-world" content for each sub-test). 2. Scroll NN pixels every 10ms - until reaching the bottom. 3. Measure and report average interval in microseconds (supposedly, hopefully as close as possible to 10000). I think the test is not measuring a well defined parameter and thus providing meaningless results: - If the system is fast enough (per sub-test) to complete a single scroll step in under ~10ms, all the results would be 10000 (depending on timer accuracy) + <some average delay due to refresh driver paint collision with the 10ms intervals> -> the results depend mostly on these paint collisions and timer accuracy and are mostly unrelated to the different scrolled content types. - If the system is not fast enough, the number of scroll steps which would actually be rendered (and thus the major contribution to the reported result on this case) depend almost exclusively on the refresh driver average interval (since only those will trigger layout flush + rendering of the changes) -> the results would converge to a multiple of 16.67ms (depending on intervals accuracy), since the refresh driver tries to fire on vsync (or equivalent) which is an independent fixed interval. Again, almost regardless of the content type. - If the system is marginally fast enough (per content), the result would flip randomly between the above two clusters. -> So the content type would only split the results into one of the above categories, with probably meaningless variance and absolute value. After some discussion with jmaher and jrmuizel, my plan is to instead measure the following: 1. Smoothness/jitter with few, equally light, contents: - This would be std-dev of the delta from the optimal interval (16.667ms intervals now, possibly real vsync, etc). - Not taking into account jank (exsessivly long intervals). This should have a tjank of its own. 2. The speed at which we're able to scroll (and render each frame), per few pre-defined "real-world" contents. - By setting layout.frame_rate=1000, and scrolling using rAF as fast as possible. This will probably have a 1ms resolution, and will depend on consistent performance of the test machines. We could combine these tests into one with normal refresh interval, making #1 quantitative (of rAF consistency/accuracy), and #2 boolean (threshold which we don't wanna cross).
Blocks: 590422
if you could provide a patch to the talos repository (http://hg.mozilla.org/build/talos) to implement this, I can work on making sure we get this turned on in automation as well as verify the numbers over time.
I have a suspicion that currently tscroll measures jank more than anything else. When testing locally using the tscroll webpage in a browser (not via the talos framework), result seems mostly related to the number of small hangs which happened during the scroll. Not sure if this extends to platforms other than Windows 7 (or if it also happens within the talos framework), but I'm guessing yes for both.
Changes: - Replaced individual scroll functions (for a window, iframe and arbitrady DOM element) with a unified parametric one which can handle all scrolls and all reports. - Replaced 10ms intervals with requestAnimationFrame. - Replaced Date.now() intervals measurements with performance.now() (microseconds resolution instead of ms when possible). - Replaced a single report per test (of the average interval) with all raw data points (intervals). - Increased the svg scroll test length by scrolling in smaller steps (all other tests use 5px steps, the svg test used 30px and took about 300ms to complete with ~20 intervals, now using 10px). Notes: - The first few scroll intervals are typically longer (to much longer) and can greatly affect the average and stddev. Ignoring the first 5 data points reduces noise greatly (e.g. in one case, stddev is 40ms for all data points, 2-3ms if ignoring the first few). - All the tscroll files are using tabs for indentation. If this gets r+, I'll post an updated patch with tabs->spaces (till then it'll just make the diff noisier).
Attachment #722456 - Flags: review?
Attachment #722456 - Flags: feedback?(jmaher)
Also, this patch makes all the tests work on all browsers. When ignoring the first 5 data points, looks like we're doing similar to Chrome (stddev ~0.8ms), better than Opera (stddev 1-3ms), but worse than IE10 (stddev ~0.15ms). Looks like IE is properly using vsync.
Comment on attachment 722456 [details] [diff] [review] Unified scrolling, report raw instead of average, use requestAnimationFrame, use performance.now Review of attachment 722456 [details] [diff] [review]: ----------------------------------------------------------------- this is shaping up well. ::: talos/page_load_test/scroll/scroll-test.js @@ +18,5 @@ > + > + return Math.sqrt( > + values.map(function (v) { return Math.pow(v - avg, 2); }) > + .reduce(function (a, b) { return a + b; }) / (values.length - 1)); > + } I would like to remove these sum/average/stddev functions, but we could use them for reporting in the webpage; I just know we won't use them in talos proper. Would you like these functions to remove the first 5 replicates from the data set? @@ +105,5 @@ > + durations.pop(); // Last step was 0. > + durations.pop(); // and the prev one was shorter and with end-of-page logic, ignore both. > + > + for (var i in durations) { > + tpRecordTime(durations[i]); this should be a comma separated list of values, for example: http://hg.mozilla.org/build/talos/file/880a267bfacc/talos/page_load_test/kraken/driver.html#l163
Attachment #722456 - Flags: feedback?(jmaher) → feedback+
Comment on attachment 722456 [details] [diff] [review] Unified scrolling, report raw instead of average, use requestAnimationFrame, use performance.now Review of attachment 722456 [details] [diff] [review]: ----------------------------------------------------------------- this is shaping up well. ::: talos/page_load_test/scroll/scroll-test.js @@ +18,5 @@ > + > + return Math.sqrt( > + values.map(function (v) { return Math.pow(v - avg, 2); }) > + .reduce(function (a, b) { return a + b; }) / (values.length - 1)); > + } I would like to remove these sum/average/stddev functions, but we could use them for reporting in the webpage; I just know we won't use them in talos proper. Would you like these functions to remove the first 5 replicates from the data set? @@ +105,5 @@ > + durations.pop(); // Last step was 0. > + durations.pop(); // and the prev one was shorter and with end-of-page logic, ignore both. > + > + for (var i in durations) { > + tpRecordTime(durations[i]); this should be a comma separated list of values, for example: http://hg.mozilla.org/build/talos/file/880a267bfacc/talos/page_load_test/kraken/driver.html#l163
Attachment #722456 - Flags: review? → review+
(In reply to Joel Maher (:jmaher) from comment #6) > I would like to remove these sum/average/stddev functions, but we could use > them for reporting in the webpage; I just know we won't use them in talos > proper. Yes, they definitely don't belong to talos proper. I'll remove them. That being said, I found them very useful as a debugging tool which lets me test within the browser by just reloading the page. I also noticed that other talos tests (including tscroll) have similar facilities, commented out alerts, averages, etc. So it seems like a common need. How about we put them instead in a stand alone JS include file (close to the top of the tree) which analyzes the data set when calling tpReportTime? (are there other supported similar report functions?). It will display some common statistical values about the set, possibly with some less obvious ones, such as warm-up iterations (see next). And it can be included in any talos test file within the repository, since it will only add its alternative tpRecordTime if the talos one isn't there (as my patch does). talos-debug.js ? > Would you like these functions to remove the first 5 replicates > from the data set? Not sure what "these functions" are. As for removing the first 5 items, I'm ambivalent with that. On one hand, they obviously pollute the real value we're after (e.g. the first 2 frame shouldn't really be a part of a scroll smoothness report). On the other hand, the number of warm-up iterations which is required before the values stabilize carries a weight of its own. If this number increases, then it's probably a real regression, and vice verse. Since it might not be practical to manually continuously adjust the warmup iterations values per test, I was thinking of estimating this value systematically, per test, from the data set itself. The talos-debug.js can do that (seems feasible to me), and the talos framework (DB etc) can calculate and hold this value as well, and report it. If this works out as well as I hope, it will reduce the overall noise level within talos reports, and will provide another value which can be tracked orthogonally to the warmed-up test results.
I like the idea of a talos-debug.js that we could include in various tests while running locally. I would recommend in the talos/scripts directory (which also holds MozillaFileLogger.js) It would be nice to have these functions take a parameter for the number of replicates to avoid: function avg(dataset, numIgnore)...
Instead of the debug functions, I've left a small piece of code which loads talos-debug.js and uses if it exists. After 5 data points we should have relatively stable data. The code sends all the raw data, so the initialization-bias should be removed from outside the test itself.
Attachment #722456 - Attachment is obsolete: true
Attachment #723109 - Flags: review?(jmaher)
(In reply to Avi Halachmi (:avih) from comment #9) > ... so the initialization-bias should be removed from outside > the test itself. s/removed/handled/ Also, I filed bug 849558 for talos-debug.js, and included an initial patch.
Attachment #723109 - Attachment is obsolete: true
Attachment #723109 - Flags: review?(jmaher)
Attachment #723210 - Flags: review?(jmaher)
Comment on attachment 723210 [details] [diff] [review] v3 - restored fallback requestAnimationFrame (Opera desktop needs it) Review of attachment 723210 [details] [diff] [review]: ----------------------------------------------------------------- I will run this on try server and compare this to production. ::: talos/page_load_test/scroll/scroll-test.js @@ +41,5 @@ > + window.mozRequestAnimationFrame || > + window.webkitRequestAnimationFrame || > + window.oRequestAnimationFrame || > + window.msRequestAnimationFrame || > + rAFFallback; I don't know what rAR stands for, could we make this more descriptive?
Attachment #723210 - Flags: review?(jmaher) → review+
(In reply to Joel Maher (:jmaher) from comment #12) > I will run this on try server and compare this to production. The values will not be comparable to historic results because: - It uses requestAnimationFrame (16.67ms intervals) instead of 10ms. - It reports all raw values instead of the average only. > > + window.mozRequestAnimationFrame || > > + window.webkitRequestAnimationFrame || > > + window.oRequestAnimationFrame || > > + window.msRequestAnimationFrame || > > + rAFFallback; > > I don't know what rAR stands for, could we make this more descriptive? rAF commonly refers to requestAnimationFrame, as the patch title suggests, and as appears on all the method names which you quoted... Also, since it now reports all the raw values (including the first few noisy ones), will it be handled at the python script which processes it?
See Also: → 854746
This bug is required for enabling OMTC on mac, as we get a performance regression without it. OMTC changes the amount of time it takes (on the main thread) to paint, and so we end up dropping less frames and take longer. I filed bug 872803 for this, which I've now closed as a duplicate. It creates a small merge conflict with this patch which can be ignored, or we can back the other patch out.
(In reply to Avi Halachmi (:avih) from comment #3) > Notes: > - The first few scroll intervals are typically longer (to much longer) and > can greatly affect the average and stddev. Ignoring the first 5 data points > reduces noise greatly (e.g. in one case, stddev is 40ms for all data points, > 2-3ms if ignoring the first few). For the tests that are scrolling a frame, this is expected. We initially have the frame marked as 'inactive' and once we start scrolling we switch to 'active' we need to repaint the entire frame area. Scrolls after this point should only need to repaint the changed area. It's potentially interesting data at least, though it makes sense to separate it from the remainder of the data points.
Joel, what do we need to do to get this landed?
Flags: needinfo?(jmaher)
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #17) > Joel, what do we need to do to get this landed? AFAIK a script to process the different output, and possibly a decision on what should that processing be (if we want to change/enhance the current type of processing). The "old" tscroll only reported average frame interval per test (taking 100% of the intervals into account within the test itself). The new test reports all frame intervals per test. Also, it should be noted that while the bug title mentions that tscroll was depending on timing and we'd like to improve that dependency (whatever that means), in practice, the patch essentially changes two things: - Changes scroll intervals from 10ms to requestAnimationFrame (16.67ms by default). - Reports all frame intervals instead of average only (per sub-test). I'd say that the stddev per sub-test is not less important than the average interval. The average interval is basically a cliff, since it'll start dropping (from 16.67ms by default) only from a certain level of regression, while stddev represents the stability of the intervals, which is quite important while scrolling, and probably more sensitive to smoothness regressions. So few options for post-processing script: 1. To keep the same type of output as the current tscroll (but a bit better): single score - Drop the first few interval and report average of the rest. 2. Two scores per test: average and stddev after dropping the first few intervals (5). 3. Four scores per test: same as #2 but add stddev (WRT avg from #2) and average of the first few frames. 4. Somehow provide a single score which weights in the discrete scores from #2 or #3 (3 is better probably).
Right now with the patch here we pass on all platforms and report the same single number to graph server. Our overall goal is to use datazilla so we can look at >1 number easily (i.e. trend mean, stddev, mean, etc...) for each page. I will work on deploying this change soon so we can enjoy the fixes to the tests, and we need to continue work on actually putting the numbers to real use!
Flags: needinfo?(jmaher)
landed on talos: http://hg.mozilla.org/build/talos/rev/e1611c2d088b will work on deploying later today or first thing tomorrow.
Depends on: 873180
Will you post some links once it's operational such that we could assess it's usefulness? Thanks.
Nice. So how is the output processed? averaging the intervals except the first few? this is a similar cliff to the original tscroll test, just with 60% further threshold, and more correct. We could enhance it by: 1. Also measuring stddev of the reported intervals (instead of stddev of average interval between runs). 2. Add a proper performance regression test (not a cliff) by running it also with the preference layout.frame_rate=1000. This will make it run as fast as possible, and will detect performance regressions, without any modification to the test itself (except the prior change of that pref).
Resolved: FIXED? We could open a new bug for improvements. Also, the graphs from comment 23 show the wrong scale even if looking at the last 7 days and all the results are similar.
> Resolved: FIXED? So what is the new number showing, exactly? There were the mails about "99.8%" improvements to tree-management, and dholbert asked what the story was a week ago, but no response from anyone involved in this bug...
Flags: needinfo?(avihpit)
(In reply to Boris Zbarsky (:bz) from comment #26) > So what is the new number showing, exactly? There were the mails about > "99.8%" improvements to tree-management, and dholbert asked what the story > was a week ago, but no response from anyone involved in this bug... Missed those emails. The scale and value of the results changed. The original tests were reporting average interval in microseconds, and the iterations were set to 10 milliseconds using setTimeout, so the results were around 10000. The modified tests iterate using rAF, and report all intervals, which are later processed in a script which drops the first few and averages the rest. This produces results around 16.67 (ms). 10000 -> 16.67 is probably that 98% "improvement".
Flags: needinfo?(avihpit)
I see. Worth answering dholbert's email on the list then (and maybe an ahead-of-time heads-up next time so people know to ignore the notifications).
No longer blocks: 756601
Depends on: 888899
Blocks: 897054
Blocks: 771326
We're done here. ASAP mode solves timing dependency by iterating as fast as possible regardless of vsync or other "60hz" timers. ASAP mode is the equivalent (and sometimes literally also) turning vsync synchronization off.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: