Closed Bug 1020663 Opened 10 years ago Closed 8 years ago

Add WebGL to talos

Categories

(Testing :: Talos, defect)

RESOLVED FIXED

People

(Reporter: avih, Assigned: avih)

Details

(Keywords: ateam-talos-task)

Attachments

(4 files, 5 obsolete files)

This discussion started at bug 899785 comments 22-27.

In a nutshell, we care a lot about WebGL performance and we don't have any automated tool to measure our WebGL performance.

On top of this, WebGL regressed considerably with OMTC, and we should really be able to track this specific regression. 

I asked whether bjacob's WebGL tutorial demos ( http://bjacob.github.io/webgl-tutorial/ which show this OMTC regression clearly) could be considered reasonable/valid test cases. He replied yes, but suggested that we could instead use existing WebGL tests from https://github.com/KhronosGroup/WebGLPerf

Continuing the discussion here.

(In reply to Benoit Jacob [:bjacob] from comment #24)
> We do have WebGL performance tests, no need to write new ones....

Yes, but maybe it's not enough.

Indeed we don't run them, but as their readme says, they're regression tests rather than benchmarks (benchmarks tend to test realistic workloads; regression tests may not). I.e. each of its 67 pages tries to test strictly a single WebGL performance aspect (and none of them renders "proper" 3D content to screen).

While such a regression test is potentially very useful, we could still also use a more realistic workload which combines the various bits into a single "overall" performance number, to be able to notice shifts in these overall "real world" numbers.

I think I'll try to add 2 things then:

1. The WebGL regression tests suite recommended by bjacob.
2. Some benchmark.

I'm unfamiliar with existing WebGL benchmarks, so if anyone would like to recommend one or a few (or even using bjacob's simple tutorial demos as benchmarks), let me know what you'd recommend.

One thing to consider, though, is that we don't want it to run for _too_ long (hopefully not more than a few minutes per generated test result). Typically talos tests are repeated several times for each number they report, but maybe here we could use far fewer repetitions, or maybe even just run once and report. It depends on how stable the numbers turn out to be.
Depends on: 1020677
You're right, ideally we would have both micro regression tests like the WebGLPerfTests, and more macro tests, starting with things like these demos, but also heavier, more real-world examples such as some emscripten'd game demos. (Could start with BananaBread for something that we can use without asking for permission).
(In reply to Benoit Jacob [:bjacob] from comment #1)
> ...(Could start with BananaBread for something that we can use
> without asking for permission).

Right, which is why I was asking for recommendations on such benchmarks to use.

Another thing which we might need to consider is that maybe we don't want the tests to be very big (disk space wise), and some of these (I was thinking Epic Citadel) are dozens of megabytes.

I'm not sure yet how important this factor is, but maybe Joel could help here: what size threshold would we prefer not to go over?
Flags: needinfo?(jmaher)
Size is not of utmost importance, but the bigger it is, the longer it takes to clone for every job, and the more restrictive it is for users/contributors on lower bandwidth.

BenWa: how much does this overlap with the work that the games group has been doing?  They want GL benchmarks across desktop, android, b2g.  I haven't been able to get a solid list from them of what is a good benchmark, just aquarium.
Flags: needinfo?(jmaher) → needinfo?(bgirard)
I don't know what the games team is really doing. From what I've seen they were running tests outside our infrastructure. Maybe Alan knows?

Even if we do have tests outside of talos, we should really aim to have them, or at least something, in talos.

If we do deploy our own WebGL tests, I think we need both microbenchmarks and real-world scenarios. Microbenchmarks are important to catch small and specific regressions; for example, we often catch regressions in the fixed overhead of flipping simple colors.
Flags: needinfo?(bgirard) → needinfo?(akligman)
bjacob and I just discussed it. He thinks that the actual underlying "3D stuff" is not changing or regressing much, nor expected to.

He says that Firefox has 2 main degrees of freedom as far as our WebGL performance is concerned:
- Usage of alpha channel.
- Antialiasing.

So if we take one of his WebGL tutorial demos (e.g. 12. textures), make it parametric with alpha and AA (which he kindly just did), and then create the 4* possible variations of these, it would give us good enough coverage of the factors by which Firefox is most affected.

So this and the Khronos group regression suite should give us nice overall coverage. If we want, we could add a "real" benchmark at a later time (e.g. bananabread, epic citadel, etc).

* 4 and not 2, because their effect on performance is not necessarily orthogonal. Theoretically there could be a case where flipping each of those by itself would result in a 10% regression, but together they might end up as a 100% regression, etc.

So, this is going to be phase one of the talos WebGL thingy:
1. the Khronos regression test with 67 sub-results.
2. bjacob's textured terrain demo in 4 combinations of alpha/AA, where we measure FPS, probably in ASAP mode.
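
As a rough illustration of item 2 (a sketch, not the actual test code): the four variants come from the two boolean WebGL context attributes. A canvas keeps the context it was first given, so each combination needs a fresh canvas.

[false, true].forEach(function (alpha) {
  [false, true].forEach(function (antialias) {
    var canvas = document.createElement("canvas");
    // 'alpha' and 'antialias' are standard WebGL context attributes,
    // both defaulting to true.
    var gl = canvas.getContext("webgl", { alpha: alpha, antialias: antialias });
    // ... render the terrain scene and measure frame intervals here ...
  });
});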


Oh, and one more thing: should these 2 tests work with b2g? Is this a test we want to also run on b2g platforms?
Flags: needinfo?(dhuseby)
Flags: needinfo?(bjacob)
Yes, we care a lot about running such tests on b2g specifically, because these compositing paths tend to be very platform-specific, and even in some ways device-specific. (Availability of GL fences, etc). We have had OS-specific WebGL compositing performance regressions, including B2G-specific.

Like I said, the internal WebGL implementation stuff --- the part that depends on all the details of the WebGL application's rendering --- doesn't regress that much. It is important to have some testing for that, but IMHO it's enough to cover that with 1 or 2 big macro benchmarks (some emscripten'd game) representing a real-world workload.

The part that does tend to regress a lot is WebGL _compositing_ i.e. the interfacing with Layers. Like we discussed (and like you say in comment 5) this doesn't depend much on the rendering details, but does depend on some of the WebGL context's attributes: 'alpha' and 'antialias'. These both are booleans defaulting to true. So, 4 possibilities.
Flags: needinfo?(bjacob)
To set the stage: we don't have talos running on b2g, just android and all desktop platforms.  b2g has make test-perf, but I am not sure that running a benchmark there would be a drag-and-drop affair.  What :ack has worked on and built for the games team allows benchmarks to run on all platforms.  This isn't inside a CI system like talos, but it does allow for running tests on a given platform/browser.

We should consider how to run these on b2g and if we can make it work with the existing make test-perf system:
https://developer.mozilla.org/en-US/Firefox_OS/Platform/Automated_testing/Gaia_performance_tests
(In reply to Benoit Jacob [:bjacob] from comment #6)
> Yes, we care a lot about running such tests on b2g specifically...

Good ;) because the WebGL demo at http://bjacob.github.io/webgl-tutorial/12-texture.html (which is the one we're using for this) shows a blank page on FxOS (thanks to mchang for testing it).

If you can make it work there as well, ping me if/when it does?

I talked to the FxOS perf guys and they can't add this test right now, but it would still be nice if they could use it out of the box once they have time to get to it.

FWIW, the Khronos regression suite does work on FxOS (also according to mchang).

I'm still working on making a talos test of the current WebGL demo code, but I expect/hope that the changes to fix it for FxOS as well would integrate trivially.
Flags: needinfo?(dhuseby) → needinfo?(bjacob)
Does it fail to run on all b2g devices or only specific ones?

Is there a JS warning? (logcat might have those in debug builds; not sure).
Flags: needinfo?(bjacob)
(In reply to Benoit Jacob [:bjacob] from comment #9)
> Does it fail to run on all b2g devices or only specific ones?

I haven't tried any myself.

> Is there a JS warning? (logcat might have those in debug builds; not sure).

Mason, could you please help with this?
Flags: needinfo?(mchang)
Depends on: 1021910
bjacob doesn't have time to debug the FxOS issue, and neither do I.

To prevent stalling this for desktop platforms, I filed bug 1021910 to track coverage of these tests on b2g and android.

If anyone feels like providing info (trying it on those platforms), or actually making it work where required, your help would be appreciated.
Flags: needinfo?(mchang)
TL;DR:
------

- The Khronos suite (67 micro-benchmarks) is either somewhat broken IMO or otherwise produces results which I can't make sense of. Unless a WebGL expert can examine all of it and vouch that it produces results which are useful for us on all the platforms which we care about (desktop would be enough for now), it raises too many question marks for me to trust it, and therefore I think we shouldn't use it.

- The terrain demo looks useful as a test and clearly reflects the performance of (a few) various WebGL and compositor bits, but it has a slight issue with occasional short (~50ms) pauses on 1-2% of the frames (about once a second, but random) - on both OS X and Windows. We need to decide how to handle this noise, but otherwise it's good. As a (known OMTC) note: this WebGL terrain demo sucks terribly on Windows with OMTC (4-8 FPS), but works fine++ without OMTC (80-120 FPS). On OS X everything is fine only with OMTC.

- The ~50ms hangs at the terrain demo should be somewhat worrying IMO. They're probably GC/CC (though I didn't try to correlate them yet), but the JS code doesn't seem to do any notable allocations during the animation, and the WebGL scene is simple and static (except for lighting). Furthermore, they're more frequent in a "dirty" profile (i.e. with open tabs, history etc). Assuming that the terrain demo is reasonably efficient, it probably means that WebGL devs can't expect smooth WebGL animation from Firefox. I know that asm.js shouldn't have GC, but this demo is plain JS/WebGL and not asm.js.


The long version:
-----------------

These patches add the Khronos regression test suite and the terrain demo as talos tests (after applying them they can be executed locally in talos, though they need a few more steps before they can be used in production - such as adding the test names to the graphserver/datazilla databases, etc).

This (last) attachment contains result logs from experimenting with these tests on two systems:
- Macbook air 2010 (OS X 10.9.3, 4G ram, nvidia 320m)
- Asus T100 laptop (windows 8.1 32bit, 2G ram, Bay Trail Atom z3740 - reasonable cpu/gpu).

On each system I experimented with all 4 combinations of OMTC yes/no and vsync-off (ASAP) yes/no (even though on OS X OMTC has been on by default for a long time now). Here are my findings:


Khronos regressions test suite (pretty much unmodified from: https://github.com/KhronosGroup/WebGLPerf )
------------------------------

Of the 67 micro-benchmarks, I experimented with the first 4 (see the full results log in the attachment of this post, including a full run of all 67 tests with default settings). The tests were executed in column-major order (each test was executed once to produce the first results column, then the 2nd column was produced, etc):

OS X:
-----

normal (default) 60hz, with OMTC either on or off:

clear-10x-default;                     17.11;  16.76;  16.92;  16.77;  16.83
clear-10x-nondefault-constant-color;   16.75;  16.95;  16.93;  16.96;  16.81
clear-default;                         16.94;  16.89;  16.95;  16.84;  16.88
clear-default-preserveDrawingBuffer;   16.69;  16.83;  16.83;  16.83;  16.84

It looks likely that it bumps into a 60hz (16.7 ms) wall and doesn't really reflect any performance/regressions.


OMTC: yes, Vsync: off (ASAP):

clear-10x-default;                     29.19;  18.21;  24.58;  21.67;  27.85
clear-10x-nondefault-constant-color;   28.22;  22.05;  22.16;  24.67;  25.77
clear-default;                         21.31;  25.03;  23.73;  24.24;  26.14
clear-default-preserveDrawingBuffer;   19.82;  27.57;  21.84;  30.76;  26.32

This is the first time I see that turning vsync off produces longer run times. Not absolutely weird, but still a first for me.


OMTC: off, Vsync: off (ASAP):

clear-10x-default;                     48.17;  47.44;  48.19;  47.53;  48.07
clear-10x-nondefault-constant-color;   47.95;  47.47;  47.65;  47.70;  47.27
clear-default;                         48.46;  48.69;  48.22;  48.57;  48.51
clear-default-preserveDrawingBuffer;   48.49;  48.86;  48.65;  46.76;  48.53

It's somewhat weird that the results are even worse, but since OMTC has been on by default on OS X for a long time now, we can probably disregard results without OMTC.

So far, on OSX, there's no useful combination of OMTC and vsync to make these 4 tests produce different results. It _is_ possible that they just happen to perform identically, but keep reading...


Windows:
--------

All combinations _except_ omtc: off + vsync: normal (the default before OMTC was turned on for windows) look like this with the same kinds of numbers and noise (see attachment for all those runs):

clear-10x-default;                   174.70;  182.22;  179.01;  180.32;  176.11
clear-10x-nondefault-constant-color; 176.80;  180.11;  174.54;  181.27;  174.86
clear-default;                       176.47;  180.02;  174.70;  173.36;  173.34
clear-default-preserveDrawingBuffer; 172.31;  176.25;  173.66;  177.74;  178.74

Again, all 4 tests produce pretty much identical values.


omtc: off, vsync: normal (60hz):

clear-10x-default;                   17.47;  19.22;  19.39;  18.18;  17.29
clear-10x-nondefault-constant-color; 18.18;  19.38;  21.03;  20.22;  17.84
clear-default;                       22.95;  18.43;  23.11;  22.79;  23.08
clear-default-preserveDrawingBuffer; 89.92;  88.37;  87.96;  93.71;  88.61

Lo and behold, this is the only combination of OMTC and vsync on windows or osx which produces what seems like "real" results. There's a clear (even if mostly small) increase in average duration between the tests.


Overall, for this Khronos regression suite, I can't make sense of the numbers in these combinations, and the full runs of 67 tests (see attachment) look suspiciously clustered to me, especially around (low) multiples of 16-17 ms. I'm no WebGL expert, but my experience tells me that I can't trust these numbers to reflect regressions properly.

If someone is capable and has the time to examine this suite carefully and be convinced that it does produce useful numbers (or make it so otherwise), then we might reconsider it. Until then, however, I don't think we should use it.




Terrain demo (originally: http://bjacob.github.io/webgl-tutorial/12-texture.html )
----------------------------------------------------------------------------------

To make a test of it, I modified it as follows:
- Kept the same setup, scene and (lighting) animation.
- It renders the first frame to screen.
- Waits a bit (500ms).
- Renders 130 frames of the animation.
- Measures the last 100 intervals (so the first 30 are "warmup" which aren't measured).
- Reports to talos the average of those last 100 intervals.

- The whole sequence above is repeated 4 times for each combination of alpha and antialias.
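
In code, the measure-and-report part looks roughly like this (a hedged sketch matching the description above, not the reviewed patch; the first-frame render is elided, and reportResult() is a hypothetical reporting hook):

var WARMUP_FRAMES = 30, MEASURED_FRAMES = 100;
var gTimestamps = [];

function average(arr) {
  return arr.reduce(function (a, b) { return a + b; }) / arr.length;
}

function draw() {
  gTimestamps.push(performance.now());
  // N intervals require N+1 timestamps, hence the +1.
  if (gTimestamps.length < WARMUP_FRAMES + MEASURED_FRAMES + 1) {
    // ... render one frame of the (lighting) animation here ...
    requestAnimationFrame(draw);
    return;
  }
  var intervals = [];
  for (var i = WARMUP_FRAMES + 1; i < gTimestamps.length; i++)
    intervals.push(gTimestamps[i] - gTimestamps[i - 1]);
  reportResult(average(intervals)); // hypothetical reporting hook
}

// Kicked off after the first frame is rendered and the 500ms delay elapses:
setTimeout(function () { requestAnimationFrame(draw); }, 500);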

OS X:
-----

- Seems somewhat broken without OMTC (normal vsync results in ~70ms/frame, and alpha results in a pink sky; vsync off results in ~0.1ms/frame). Again, we probably don't care much about OMTC:off on OS X.


OMTC + vsync: normal (default settings)

Representative frame interval sequence:

1.WebGL-terrain-alpha-no-AA-yes - Average: 16.90  stddev: 5.3 (31.3%)
Intervals: 17.4  15.5  17.1  17.2  15.6  17.8  16.7  16.9  16.9  15.0  16.5  18.0  17.7  14.4  16.1  18.1  18.0  15.9  15.1  18.2  17.4  14.4  16.9  16.9  18.3  15.2  16.1  17.0  18.4  15.6  16.0  16.6  16.4  17.4  18.3  14.2  17.1  16.5  16.4  17.4  16.6   [_31.5_]   2.2  16.0  16.5  16.5  17.4  16.8   [_58.2_]   0.7  15.3  16.0  16.7  17.6  15.7  16.5   [_29.0_]   4.8  17.1  16.2  17.0  16.0  16.4  16.9  16.4  17.7  15.1  17.7  17.1  18.5  15.0  16.1  16.9  17.2  15.6  16.7  16.3  18.1  15.8  17.0  17.1  15.8  16.9  16.8  17.2  15.7  17.0  16.9  17.4  15.8  16.5  16.8  16.8  16.7  16.4  16.7  17.0  16.0  16.9  16.4

This bumps into a 60hz wall, and can't really detect regressions. Notice the occasional longer intervals.

Here are all the results on default settings; again, this won't detect perf changes:
0.WebGL-terrain-alpha-no-AA-no   - Average: 16.67  stddev: 0.7 (4.4%)
1.WebGL-terrain-alpha-no-AA-yes  - Average: 16.90  stddev: 5.3 (31.3%)
2.WebGL-terrain-alpha-yes-AA-no  - Average: 16.83  stddev: 2.2 (13.0%)
3.WebGL-terrain-alpha-yes-AA-yes - Average: 16.66  stddev: 1.4 (8.2%)


OMTC + vsync:off (ASAP)

Representative frame interval sequence:

1.WebGL-terrain-alpha-no-AA-yes - Average: 8.42  stddev: 2.8 (33.0%)
Intervals: 6.0  9.6  10.5  10.6  0.5  7.8  11.2  8.1  8.0  8.2  8.0  9.9  7.7  7.5  8.0  8.2   [_18.6_]   6.2  9.4  9.5  9.7  9.4  9.7  9.7  9.5  9.6  9.5  9.5  9.5  9.7  10.0  11.0  0.7  7.5  10.9  8.2  8.0  8.0  8.4  8.2  7.9  8.2  8.2  7.9  8.2  8.1  8.1  8.3  8.0  8.1  8.2  8.0  8.1  8.2  8.1  7.9  8.3  8.0  8.1  12.1  5.9  8.4  7.9  7.9  8.0  8.3  8.1  8.0   [_22.5_]   4.5  1.0  5.0  10.5  8.2  8.1  8.2  8.1   [_17.1_]   4.6  0.5  7.6  11.4  8.0  8.2  8.3  8.1  7.9  8.2  8.0  8.1  8.2  8.3  7.9  8.2  8.1  8.1  8.4  8.0  8.0  8.3


And here are all the results for this combination (see attachment for the full intervals):
0.WebGL-terrain-alpha-no-AA-no   - Average: 7.03  stddev: 3.7 (52.9%)
1.WebGL-terrain-alpha-no-AA-yes  - Average: 8.42  stddev: 2.8 (33.0%)
2.WebGL-terrain-alpha-yes-AA-no  - Average: 8.31  stddev: 3.6 (42.7%)
3.WebGL-terrain-alpha-yes-AA-yes - Average: 9.78  stddev: 4.8 (49.3%)

This is useful. It clearly reflects the performance. It's somewhat more noisy than with normal vsync (maybe 50% more hangs), but that's expected and can be managed.


Windows
-------

I'll start with OMTC:off because it behaves nicely:

OMTC:off, vsync: normal (the default before OMTC was turned on) - pretty much a 60hz wall, which is no good:

1.WebGL-terrain-alpha-no-AA-yes - Average: 17.61  stddev: 4.7 (26.6%)
Intervals:  [_35.0_]   12.8  23.9  22.3  22.6  24.2  25.6  24.3  23.7   [_39.3_]    [_46.1_]   15.3  14.9  14.9  16.3  17.9  15.0  17.0  16.1  18.0  15.5  15.6  17.9  15.0  17.1  17.0  16.9  16.1  16.1  16.5  15.3  17.2  17.9  15.0  17.0  15.3  17.7  16.1  17.0  15.0  18.8  15.1  16.0  17.1  15.9  16.1  17.0  16.0  17.0  17.4  16.5  15.1  17.0  17.0  17.0  16.1  16.0  18.0  15.0  17.0  16.0  16.9  16.8  15.2  17.0  16.0  16.9  16.0  17.0  17.1  14.9  17.0  17.1  16.9  16.0  17.1  14.8  18.1  16.0  17.0  24.0  14.4  13.0  13.6  15.9  16.5  17.4  15.0  16.0  17.0  16.0  18.0  15.1  17.0  15.9  18.0  15.0  17.0  16.0  17.0

0.WebGL-terrain-alpha-no-AA-no   - Average: 16.82  stddev: 5.6 (33.0%)
1.WebGL-terrain-alpha-no-AA-yes  - Average: 17.61  stddev: 4.7 (26.6%)
2.WebGL-terrain-alpha-yes-AA-no  - Average: 16.81  stddev: 5.1 (30.3%)
3.WebGL-terrain-alpha-yes-AA-yes - Average: 17.45  stddev: 4.9 (28.0%)


OMTC: off, vsync: off (ASAP) - nice and useful and even (much) less noisy than on OSX:

0.WebGL-terrain-alpha-no-AA-no - Average: 6.55  stddev: 2.0 (30.6%)
Intervals:  [_21.3_]   5.1  5.5  6.0  6.5  6.2  5.4  6.2  6.2  5.7  7.0  5.9  5.5  6.7  5.7  5.7  6.3  5.6  7.0  5.6  5.6  6.9  5.5  6.4  6.2  5.6  6.0  6.3  5.6  6.8   [_14.3_]   5.0  6.0  6.9  5.3  5.5  6.9  6.5   [_14.0_]   4.9  7.7  5.6  5.7  6.8  5.5  5.6  6.5  5.6  7.0  5.6  5.5  6.9  5.8  5.6  6.4  5.6  5.9  6.2  5.5  6.9  5.7  5.4  7.0  5.5  5.6  6.6  5.5  5.7  6.5  6.6  6.1  6.2  6.3  7.1  6.2  8.3  6.1  6.4  6.7  6.5  7.1  6.2  6.5  6.4  6.1  7.3  6.0  6.2  7.2  6.4  7.5  6.1  6.5  6.8  6.2  7.0  6.1  6.1  9.1  8.8

0.WebGL-terrain-alpha-no-AA-no   - Average: 6.55   stddev: 2.0 (30.6%)
1.WebGL-terrain-alpha-no-AA-yes  - Average: 11.53  stddev: 0.9 (7.4%)
2.WebGL-terrain-alpha-yes-AA-no  - Average: 6.11   stddev: 0.6 (9.9%)
3.WebGL-terrain-alpha-yes-AA-yes - Average: 11.92  stddev: 1.2 (10.2%)


OMTC: yes, vsync: both off and normal (default windows nightly setup) - pretty terrible (4/8 fps):

0.WebGL-terrain-alpha-no-AA-no   - Average: 254.94  stddev: 8.2 (3.2%)
1.WebGL-terrain-alpha-no-AA-yes  - Average: 253.49  stddev: 5.5 (2.2%)
2.WebGL-terrain-alpha-yes-AA-no  - Average: 123.81  stddev: 3.0 (2.4%)
3.WebGL-terrain-alpha-yes-AA-yes - Average: 123.15  stddev: 7.2 (5.9%)


Overall, the terrain demo is useful on both OS X and Windows and can reflect performance nicely when vsync is turned off. Windows performance with OMTC is bad, but the test will surely detect it once it improves.

The issue to consider here is the noise/hangs. On one hand, noise hurts our automatic detection because we only automatically detect a change if it's (IIRC) at least twice the noise level.

We can filter the noise, e.g.:
- By taking the median instead of the average.
- By averaging only the values between the 25th and 75th percentiles (sketched below).
- By repeating the test more times (though it might take considerably longer to complete).
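
A minimal sketch of the percentiles approach (assuming simple index-based percentile bounds are good enough; the helper name is mine):

function interquartileAverage(intervals) {
  var sorted = intervals.slice().sort(function (a, b) { return a - b; });
  // Keep only the values between the 25th and 75th percentiles.
  var kept = sorted.slice(Math.floor(sorted.length * 0.25),
                          Math.ceil(sorted.length * 0.75));
  return kept.reduce(function (a, b) { return a + b; }) / kept.length;
}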

However, I'm not sure we want to disregard this noise. While it's _probably_ unrelated to WebGL performance, it still affects it, and it might still be somehow related to or caused by it. We do want to know about these hangs if they exist.

---- summary ----

Overall, currently I'm thinking we should:

- Not use the Khronos suite.

- Filter the noise in the terrain demo (probably the percentiles approach) such that we can detect regressions at a much higher resolution - noise excepted.

- Add a new test which only measures this noise. This will involve more considerations (such as whether this sequence is representative, defining thresholds to measure, etc.), but those are probably out of the scope of this bug.

Thoughts?
Attachment #8436417 - Flags: review?(jgilbert)
Attachment #8436417 - Flags: review?(bjacob)
I was adapting WebGL benchmarks from the Tom's Hardware browser showdown, but that work was interrupted by another project and no longer has an owner.
Flags: needinfo?(akligman)
Comment on attachment 8436417 [details] [diff] [review]
Part 2: Make a talos test of the terrain demo

Review of attachment 8436417 [details] [diff] [review]:
-----------------------------------------------------------------

::: talos/page_load_test/webgl/benchmarks/terrain/perftest.html
@@ +4,5 @@
> + * License, v. 2.0. If a copy of the MPL was not distributed with this
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +
> + Version 1.1 - Adapted for talos with with 4 runs on combinations of alpha/antialias
> + Version 1.0 - Benwa Jacob's WebGL tutorial demos: http://bjacob.github.io/webgl-tutorial/12-texture.html

s/Benwa/Benoit/

BenWa is the nick of Benoit Girard ;-)

@@ +193,5 @@
>    }
>  
>    function draw(timestamp) {
> +    //gResultTimestamps[gCurrentFrame++] = timestamp;
> +    gResultTimestamps[gCurrentFrame++] = performance.now();

Is it intentional that you use performance.now() here instead of the rAF timestamp? Please explain the rationale in a comment.
Attachment #8436417 - Flags: review?(bjacob) → review+
Alan Kligman already collected WebGL tests for the game benchmarking initiative. Let's try to learn from their experience if possible https://wiki.mozilla.org/Platform/Games/GameFocusedBenchmarking
(In reply to Benoit Jacob [:bjacob] from comment #18)
> s/Benwa/Benoit/
> BenWa is the nick of Benoit Girard ;-)

Ouch :) apologies, and noted.

> > +    //gResultTimestamps[gCurrentFrame++] = timestamp;
> > +    gResultTimestamps[gCurrentFrame++] = performance.now();
> 
> Is it intentional that you use performance.now() here instead of the rAF
> timestamp? Please explain the rationale in a comment.

I was experimenting with both. They _shouldn't_ provide different results, and indeed they don't from my testing.

I forgot to remove/revert the code. As far as I'm able to tell, their usefulness is identical, though they might at some stage produce slightly different results, depending on internal implementations.


(In reply to Vladan Djeric (:vladan) from comment #19)
> Alan Kligman already collected WebGL tests for the game benchmarking
> initiative. Let's try to learn from their experience if possible
> https://wiki.mozilla.org/Platform/Games/GameFocusedBenchmarking

Thanks. Sounds very useful, and I'll follow up on it once I get something solid landed which starts providing continuous feedback. Though meanwhile I won't mind hearing about it and how it relates to this bug.
(In reply to Avi Halachmi (:avih) from comment #16)
> Overall, for this Khronos regression suite ... looks suspiciously clustered
> to me, especially around (low) multiples of 16-17 ms. I'm no WebGL expert,
> but my experience tells me that I can't trust these numbers to reflect
> regressions properly.

Just to clarify why this behavior is not good for a regression test, since bjacob commented (correctly) on IRC that it can still help catch regressions of high magnitude (like the ones caused by OMTC on Windows).

While it's true that it will detect big regressions (with magnitude higher than ~17 ms), the issue is with smaller regressions which don't "move the results to the next bucket".

Those small regressions will be mostly invisible to this test. OTOH, at some stage, after several such small regressions accumulate, the result will suddenly "jump bucket", supposedly pointing at an offending patch which caused a "100% regression" (e.g. if it jumped from the 16.7 ms bucket to the 33 ms bucket), while in fact that specific patch only regressed performance by a much smaller percentage, and just happened to appear as a 100% regression due to the clustering of the results.

For this reason, we want a regression test's results to be as little clustered as possible.
(In reply to Avi Halachmi (:avih) from comment #20)
> (In reply to Benoit Jacob [:bjacob] from comment #18)
> > Is it intentional that you use performance.now() here instead of the rAF
> > timestamp? Please explain the rationale in a comment.
> 
> I was experimenting with both. They _shouldn't_ provide different results,
> and indeed they don't from my testing.

Also a clarification of why I ended up with performance.now() rather than the rAF callback timestamp, even though both perform the same:

It's possible that under some implementations (even if not our current one), the rAF callback arg will be in some way "optimized", e.g. always pointing to the estimated next vsync timestamp, in order to allow the callee to have less jitter in its time-dependent positioning/processing.

While that would be great under normal conditions for smoother animations etc., such behavior would harm the effectiveness of our measurements, especially since these results are only useful when we turn vsync off (ASAP), which prevents the clustering noted in comment 21.

So performance.now() will not have this potential issue on one hand, and doesn't have lower resolution than the rAF callback arg on the other, so overall it's better.
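
In code terms (a sketch, using the patch's variable names):

function draw(rafTimestamp) {
  // rafTimestamp may be aligned by the implementation to an (estimated)
  // vsync tick; performance.now() reflects when the callback actually ran,
  // which is what we want to measure, especially with vsync off.
  gResultTimestamps[gCurrentFrame++] = performance.now();
  // ...
}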
Flags: needinfo?(bjacob)
OK, sounds reasonable.
Flags: needinfo?(bjacob)
Comment on attachment 8436417 [details] [diff] [review]
Part 2: Make a talos test of the terrain demo

Review of attachment 8436417 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for doing this!
However, let's keep this code more obvious.

R- for what looks like an unnecessary O(N^2) component.

::: talos/page_load_test/webgl/benchmarks/terrain/perftest.html
@@ +172,5 @@
> +    var avg = average(arr);
> +
> +    return Math.sqrt(
> +      arr.map(function (v) { return Math.pow(v - avg, 2); })
> +         .reduce(function (a, b) { return a + b; }) / (arr.length - 1));

This is a lot of logic to have on one line.
var squareDiffArr = arr.map(...)
var sum = squareDiffArr.reduce()
var sampleStdDev = sqrt(sum / (count-1))

Also this is the 'sample standard deviation', vs the 'population standard deviation'. I agree that this is what we should be using, but we should call it out explicitly, because these two formulas are different.

You might even mention that we're effectively only sampling the population (of all frames), given that we don't have time to run the test forever, but we're interested in (an approximation of) the stddev as if the test was run forever. (though these values do converge as |sample| approaches infinity, which is our |population|)
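
Spelled out, the suggested refactor would look something like this (illustrative only; the n-1 divisor is what makes it the sample variant):

function sampleStdDev(arr) {
  var avg = arr.reduce(function (a, b) { return a + b; }) / arr.length;
  var squareDiffArr = arr.map(function (v) { return Math.pow(v - avg, 2); });
  var sum = squareDiffArr.reduce(function (a, b) { return a + b; });
  return Math.sqrt(sum / (arr.length - 1)); // sample, not population, stddev
}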

@@ +175,5 @@
> +      arr.map(function (v) { return Math.pow(v - avg, 2); })
> +         .reduce(function (a, b) { return a + b; }) / (arr.length - 1));
> +  }
> +
> +  const PRETEST_DELAY_MS = 500;

It's not 'pretest' per se, but rather more like POST_FIRST_FRAME_DELAY.

@@ +194,5 @@
>  
>    function draw(timestamp) {
> +    //gResultTimestamps[gCurrentFrame++] = timestamp;
> +    gResultTimestamps[gCurrentFrame++] = performance.now();
> +    if (gCurrentFrame >= WARMUP_FRAMES + MEASURED_FRAMES) {

Should probably be > not >=.

If WARMUP_FRAMES+MEASURED_FRAMES is 1, then this happens on the first frame, when we haven't measured any frames, though we have recorded a timestamp for one. To measure a frame's length, you need two frame timestamp records.

@@ +205,5 @@
> +      return;
> +    }
> +
> +    // Used for rendering reproducible frames which are independent of actual performance (timestamps).
> +    var simulatedTimestamp = gCurrentFrame * 1000 / 60;

Yay!

@@ +274,5 @@
>      gl.clear(gl.COLOR_BUFFER_BIT | gl.DEPTH_BUFFER_BIT);
>      gl.drawElements(gl.TRIANGLES, (terrainSize - 1) * (terrainSize - 1) * 6, gl.UNSIGNED_SHORT, 0);
>  
> +    if (gCurrentFrame == 1) {
> +      // First frame only - wait a bit (after rendering the scene once)

Explain why.

@@ +305,5 @@
> +
> +    // Trigger the run with specified args, with callback function to process the result
> +    startTest(alpha, antialias, function(intervals) {
> +        gResults.raw.push(intervals);
> +        gResults.values.push(average(intervals));

Why do this here, and not do stddev as well? We should be doing both of these in the same place, so it's very clear they're operating on the same data.

I think we should just report the raw data here, and do the processing of the data (mean and stddev) at result reporting time.

@@ +323,5 @@
> +           + "\nIntervals: " + results.raw[i].map(function(v, i, arr) {
> +                var avg = average(arr);
> +                var x = v > avg * 1.5 ? [" [_", "_] "] : ["", ""]; // 50% above average - make it noticeable
> +                return x[0] + v.toFixed(1) + x[1];
> +             }).join("  ")

This whole 'Intervals' bit seems unnecessarily tricky. 

var curData = results.raw[i];
var unusuallyLargeThreshold = avg * 1.5;
var intervalStrArr = curData.map(function(v) {
  var str = v.toFixed(1);
  if (v > unusuallyLargeThreshold)
    str = " [_" + str + "_] ";
  return str;
});

It also seems to calculate the average of arr |arr| times. (Unnecessary O(N^2) is r-)

@@ +354,5 @@
> +    // talos unfortunately trims common prefixes, so we prefix with a running number to keep the full name
> +    setupAndRun(false, false, "0.WebGL-terrain-alpha-no-AA-no", function() {
> +      setupAndRun(false, true,  "1.WebGL-terrain-alpha-no-AA-yes", function() {
> +        setupAndRun(true,  false, "2.WebGL-terrain-alpha-yes-AA-no", function() {
> +          setupAndRun(true,  true,  "3.WebGL-terrain-alpha-yes-AA-yes", function() {

This scope tower is gross, but ok for only four tests.

@@ +365,3 @@
>  </script>
> +</head>
> +<body onload="test(false, false);" style="overflow:hidden; margin:0">

test() takes (void) not (bool, bool).
Attachment #8436417 - Flags: review?(jgilbert) → review-
(In reply to Jeff Gilbert [:jgilbert] from comment #24)
> This is a lot of logic to have on one line.

Yup. Note taken.

> You might even mention that we're effectively only sampling the population
> (of all frames) ...

True, and I'll add an appropriate comment. However, note that in this code, the stddev value is only used as a debug aid (in the alert box and logs) which helps in seeing the big picture - it's not part of the "official" result/report to our servers. As a debug aid, it does not matter which variant of stddev we use here.

Even further, the distribution of the intervals is anything but normal (though it's probably close to normal if you ignore the outliers - but stddev here is mostly useful as an estimation of the outliers), so the stddev calculation, population- or sample-based, is not meaningful as stddev per se, and the difference matters very little, probably not at all in practice.

However, as I noted at the very bottom of comment 16, the noise level (outliers which are probably GC/CC) is meaningful on the platforms which I tested, and my intention is both to filter the outliers on one hand, and to add a metric for the noise level on the other hand (as part of the test itself and as yet another metric which it reports).

For this noise metric, I might use stddev. Not sure yet, as I'll have to find an estimation to the noise which is not noisy itself, and the stddev value (population or samples) might be too noisy here between runs. I'm still experimenting with that.

In general, my "style" is to pay a lot of attention where it matters (edge cases/conditions, reproducible test case, low noise level, hot code without excessive allocations, etc), and just enough attention where it matters little (the variant of stddev to use as a debug aid, the complexity of a calculation when I know it won't matter anyway, etc).

But of course, more correct code is always better, especially if it doesn't cost us much, and in this case it indeed doesn't cost us anything.


> It's not 'pretest' per se, but rather more like POST_FIRST_FRAME_DELAY.

Technically both are correct (it's also before the test), but semantically the name in the patch represents its "true" intention.

In stress-testing systems (which is what we do here when we turn vsync off), it's common practice to let the system "breathe" a bit before starting the stress loop, and even once the stress has started, it's not measured right from the get-go, but only after some stress "warm up" period - hopefully to start measuring only after the system's response to the stress has reached its stable state.

This is in order to avoid measuring the transition phase (possible ramp up of the response to the stress), and focus on the stable state of the system.

Of course, sometimes it's also desirable to explicitly measure the response to transitions, but in this test I intended to test the stable state and not the transition.

I'll add appropriate comments to the code regarding the above.


> > +    gResultTimestamps[gCurrentFrame++] = performance.now();
> > +    if (gCurrentFrame >= WARMUP_FRAMES + MEASURED_FRAMES) {
> 
> Should probably be > not >=.

No. This condition is met once we've collected warmup+measured timestamps: after the first timestamp is collected into gResultTimestamps[0], the counter is already 1, since it's incremented before the condition is tested.

> If WARMUP_FRAMES+MEASURED_FRAMES is 1 ...

This should not happen. There's a comment at the definition of WARMUP_FRAMES that it must be at least 2.


> > +    var simulatedTimestamp = gCurrentFrame * 1000 / 60;
> 
> Yay!

:)

> > +      // First frame only - wait a bit (after rendering the scene once)
> 
> Explain why.

See the above bit about "common practices in stress test systems".


> Why do this here, and not do stddev as well? We should be doing both of
> these in the same place, so it's very clear they're operating on the same
> data.

> I think we should just report the raw data here, and do the processing of
> the data (mean and stddev) at result reporting time.

Aye. I've already modified it, as I felt the same :) Note again that stddev is not used for reporting in the code which you reviewed.

> This whole 'Intervals' bit seems unnecessarily tricky. 
> ...
> It also seems to calculate the average of arr |arr| times. (Unnecessary
> O(N^2) is r-)

I'll clean it up and reduce to O(N).


> This scope tower is gross, but ok for only four tests.

My thought exactly.

Thanks for a thoughtful review!
(In reply to Avi Halachmi (:avih) from comment #25)
> (In reply to Jeff Gilbert [:jgilbert] from comment #24)
> > This is a lot of logic to have on one line.
> 
> Yup. Note taken.
> 
> > You might even mention that we're effectively only sampling the population
> > (of all frames) ...
> 
> True, and I'll add an appropriate comment. However, note that at this code,
> the stddev value is only used as a debug aid (at the alert box and logs)
> which helps seeing the big picture better - it's not part of the "official"
> result/report to our servers. As a debug aid, it does not matter which
> variant of stddev we use here.
> 
> Even further, the distribution of the intervals is anything but normal
> (though it's probably close to normal if you ignore the outliers - but
> stddev here is mostly useful as an estimation for the outliers), so the
> stddev calculation, population or sample based, is not meaningful as stddev
> per-se and the difference matter very little, probably not at all in
> practice.
> 
> However, as I noted at the very bottom of comment 16, the noise level
> (outliers which are probably GC/CC) is meaningful on the platforms which I
> tested, and my intention is both to filter the outliers on one hand, and to
> add a metric for the noise level on the other hand (as part of the test
> itself and as yet another metric which it reports).
> 
> For this noise metric, I might use stddev. Not sure yet, as I'll have to
> find an estimation to the noise which is not noisy itself, and the stddev
> value (population or samples) might be too noisy here between runs. I'm
> still experimenting with that.
> 
> In general, my "style" is to pay a lot of attention where it matters (edge
> cases/conditions, reproducible test case, low noise level, hot code without
> excessive allocations, etc), and enough attentions at places where it
> matters little (the variant of stddev to use as a debug aid, complexity of
> the calculation when I know it won't matter anyway, etc).
> 
> But of course, more correct code is always better and especially if it
> doesn't cost us much, and in this case it indeed doesn't cost us anything.

I'll agree with the intent here, but we need to make it obvious what we did
and did not try to get right. Without any other indication, the assumption is
that code is like it is for the right reasons. If the reason is "we don't want
to bother doing it right here", we should state that, because otherwise
I would think that code is trying to do the right thing. We should leave
comments that state that "this is wrong but we don't care for this case for
reasons X and Y.". This way, we're not left scratching our heads a month from
now wondering why we did A but not B. :)

> 
> 
> > It's not 'pretest' per se, but rather more like POST_FIRST_FRAME_DELAY.
> 
> Technically both are correct (it's also before the test), but semantically
> the name at the patch represents its "true" intention.
> 
> In stress testing systems (which is what we do here when we turn vsync off),
> it's common practice to let the system "breath" a bit before starting the
> stress loop, and even once the stress has started, it's not being measured
> right from the get go, but only after some stress "warm up" period -
> hopefully to start measuring only after the system's response to the stress
> has reached its stable state.
> 
> This is in order to avoid measuring the transition phase (possible ramp up
> of the response to the stress), and focus on the stable state of the system.
> 
> Of course, sometimes it's also desirable to explicitly measure the response
> to transitions, but in this test I intended to test the stable state and not
> the transition.
> 
> I'll add appropriate comments to the code regarding the above.

Great, thanks.

> 
> > > +    gResultTimestamps[gCurrentFrame++] = performance.now();
> > > +    if (gCurrentFrame >= WARMUP_FRAMES + MEASURED_FRAMES) {
> > 
> > Should probably be > not >=.
> 
> No. This condition is met once we've collected warmup+measured timestamps,
> as after the first timestamp is collected to gResultTimestamps[0], the
> counter is already 1 as it's increased before the condition is tested.

And this logic is either wrong, or the names should change. See below.

> 
> > If WARMUP_FRAMES+MEASURED_FRAMES is 1 ...
> 
> This should not happen. There's a comment at the definition of WARMUP_FRAMES
> that it must be at least 2.
I was not saying this could happen, but rather using this simpler case to point out that I believe > is more correct than >=.
Taking one timestamp doesn't constitute 'measuring a frame', since measuring a frame meaningfully (getting a frame interval) requires two timestamps.
If WARMUP_FRAMES is 2, and MEASURED_FRAMES is one, we can extract the frame length of the two warmup frames, but not the length of a non-warmup frame, thus our actual 'measured non-warmup frame lengths count' is zero.

The current logic is correct iff MEASURED_FRAMES is really BENCHMARK_FRAMES_TO_TIMESTAMP. We're not interested in the frame timestamps though, we're interested in the frame intervals, so we should phrase things in terms of intervals.

Short of a logic bug elsewhere which compensates for this, if MEASURED_FRAMES is 30, we'll get 29 measured frame intervals, which is IMO unexpected. I do not want to have to explain in perpetuity that if you want to measure 100 frames, you'll need to set this to 101.

Also, since this is the first of this sort of test (hopefully more to follow!), any mistake we leave here will likely spread into subsequent tests.

It's probably a good idea to assert that the number of frames reported matches MEASURED_FRAMES, as well. Better safe than sorry/confused. :)

> 
> > This whole 'Intervals' bit seems unnecessarily tricky. 
> > ...
> > It also seems to calculate the average of arr |arr| times. (Unnecessary
> > O(N^2) is r-)
> 
> I'll clean it up and reduce to O(N).

Thanks!
Attachment #8436416 - Attachment is obsolete: true
Attachment #8447753 - Flags: review?(jmaher)
These are mostly minor edits per the review, with 2 exceptions:

1. Added a very simple viewport rotation to make the test do just a bit more than only a moving light source (maybe to also prevent possible AA/Alpha optimization paths of a static viewport + scene - which bjacob doesn't think Gecko uses right now).

2. WRT the naming of frames/intervals, I edited it slightly, but I'm still not sure if that's what you meant. I do think the logic/counts/etc. was and is solid (unmodified). If you still prefer different naming, let me know what you'd like them called, and I'll modify appropriately. I really don't feel strongly about it at all.

The current changes here are:
- The timestamps counter gCurrentTimestamp instead of gCurrentFrame.
- WARMUP_FRAMES changed to WARMUP_TIMESTAMPS.
- MEASURED_FRAMES unmodified.

The reason is that timestamps and intervals are the same as far as recording timestamps goes, but when processing the data, an interval is the time from a specific timestamp to the previous one. Hence the array size is warmup timestamps + measured frames, etc.
Attachment #8436417 - Attachment is obsolete: true
Attachment #8447755 - Flags: feedback?(jgilbert)
Flags: needinfo?(jgilbert)
Attachment #8436418 - Attachment is obsolete: true
Attachment #8436419 - Attachment is obsolete: true
Attachment #8447756 - Flags: review?(jmaher)
Comment on attachment 8447753 [details] [diff] [review]
part 1 v2: import only the terrain demo

Review of attachment 8447753 [details] [diff] [review]:
-----------------------------------------------------------------

Nothing here stands out to me as a problem - I assume this is just getting some of the tools in place.  Can you explain how you plan to use perftest.html?  Will we load it directly, or reference it?  I don't see how we will report results given this.
Attachment #8447753 - Flags: review?(jmaher) → review+
(In reply to Joel Maher (:jmaher) from comment #30)
> nothing here stands out to me as a problem- I assume this is just getting
> some of the tools in place.  Can you explain how you plan to use
> perftest.html?  Will we load this directly, or reference it?  I don't see
> how we will report results given this.

Yeah, it's just importing the demo into talos. Part two modifies it to measure the WebGL stuff we care about (already r+), and part 3 integrates it into talos and is also awaiting your review ;)
Comment on attachment 8447756 [details] [diff] [review]
part 3 v2: add the test page as a talos test

Review of attachment 8447756 [details] [diff] [review]:
-----------------------------------------------------------------

is there a need for mozafterpaint here?  This is a % test (the test does its own timing), so that means it calls tpRecordTime() from the page? Also, we need documentation on the talos wiki: https://wiki.mozilla.org/Buildbot/Talos/Tests

::: talos/test.py
@@ +314,5 @@
> +    Each of these 4 runs is reported as a different test name.
> +    """
> +    tpmanifest = '${talos}/page_load_test/webgl/glterrain.manifest'
> +    tpcycles = 1
> +    tppagecycles = 25

do you know if we need 25 cycles for this?
Attachment #8447756 - Flags: review?(jmaher) → review+
(In reply to Joel Maher (:jmaher) from comment #32)
> do you know if we need 25 cycles for this?

No one knows if anyone needs 25 cycles for any of the other tests which use 25 ;) It's a mostly arbitrary number, used in most tests where it doesn't make the run take too long and usually provides decent stability.

In this case, it takes just over 2 minutes to complete 25 cycles on my system, and I don't expect it to take much longer on talos machines.

Also, you can give feedback on part two - which is what sends the results with tpRecordTime. I just tried to add feedback? for you but for some reason it didn't let me. Would still be appreciated. Thanks.
Comment on attachment 8447755 [details] [diff] [review]
part 2 v2: address review comments (turn the demo into a test)

Review of attachment 8447755 [details] [diff] [review]:
-----------------------------------------------------------------

glad to see how this all fits together.

::: talos/page_load_test/webgl/benchmarks/terrain/perftest.html
@@ +379,5 @@
> +    setupAndRun(false, false, "0.WebGL-terrain-alpha-no-AA-no", function() {
> +      setupAndRun(false, true,  "1.WebGL-terrain-alpha-no-AA-yes", function() {
> +        setupAndRun(true,  false, "2.WebGL-terrain-alpha-yes-AA-no", function() {
> +          setupAndRun(true,  true,  "3.WebGL-terrain-alpha-yes-AA-yes", function() {
> +            reportResults(gResults);

I understand the common prefix stuff; does this need to be nested as a series of function callbacks?
Thanks for going over it.

(In reply to Joel Maher (:jmaher) from comment #34)
> understand the common prefix stuff, does this need to be nested a series of
> function callbacks?

Does it need to be? I'm pretty sure it could be done differently, e.g. using promises and chaining rather than nesting, as the comment just above this piece of code mentions.

However, it's only 4 lines, the scope is limited, it's still readable, and using promises would limit the compatibility of this code for aesthetic reasons only, which I find an inappropriate tradeoff.

It's probably still possible to do this without promises and without the nesting, but TBH, I don't think it's worth the time, unless someone insists.

Also, it was addressed in comment 25 and replied to in comment 26. In a nutshell: not the best-looking lines of code, but acceptable in this case.
Comment on attachment 8447755 [details] [diff] [review]
part 2 v2: address review comments (turn the demo into a test)

Review of attachment 8447755 [details] [diff] [review]:
-----------------------------------------------------------------

::: talos/page_load_test/webgl/benchmarks/terrain/perftest.html
@@ +206,5 @@
> +    // jitter in its time-dependent positioning/processing.
> +    // Such behaviour would harm our measurements, especially with vsync off.
> +    // performance.now() will not have this potential issue and is high-resolution.
> +    gResultTimestamps[gCurrentTimestamp++] = performance.now();
> +    if (gCurrentTimestamp >= WARMUP_TIMESTAMPS + MEASURED_FRAMES) {

`gCurrentTimestamp` sounds like it should be `gCurrentTimestampNum`.
More common would be `gTimestampCount`, or to just use gResultTimestamps.length directly.

@@ +208,5 @@
> +    // performance.now() will not have this potential issue and is high-resolution.
> +    gResultTimestamps[gCurrentTimestamp++] = performance.now();
> +    if (gCurrentTimestamp >= WARMUP_TIMESTAMPS + MEASURED_FRAMES) {
> +      var intervals = [];
> +      for (var i = WARMUP_TIMESTAMPS; i < gResultTimestamps.length; i++) {

`i = WARMUP_TIMESTAMPS` is true, but it is technically `WARMUP_TIMESTAMPS+1-1`.
Consider:
`var lastWarmupTimestampId = WARMUP_TIMESTAMPS-1;`
`for (var i = lastWarmupTimestampId+1; [...]`
Attachment #8447755 - Flags: feedback?(jgilbert) → feedback+
(In reply to Jeff Gilbert [:jgilbert] from comment #36)
> ... or to just use gResultTimestamps.length directly.

This will not work, since gResultTimestamps is created with the desired size before the animation starts, to avoid unnecessary allocations during the animation. So with this suggestion, the condition would be met the first time it's evaluated.

I'll make it a bit clearer regardless. Thanks.
Addressed comment 36; carrying r+.

Joel, are we good to land the patches? If yes, I thought of pushing part 1 as is (for the record, as the initial import), and folding parts 2/3 into one. Are you good with this?

Also, after we land, could you please file followup bugs to take it on the road to a production test?
Attachment #8447755 - Attachment is obsolete: true
Attachment #8448628 - Flags: review+
Flags: needinfo?(jgilbert) → needinfo?(jmaher)
Attachment #8448628 - Attachment description: Part 2 - make a test of the terrain demo → Part 2 v3 - make a test of the terrain demo
Avi, please land these as 2 patches (v1, and v2+3). Once we get that done we can start looking at this on try server for stability, times, and what talos job to run this as part of.
Flags: needinfo?(jmaher)
Joel did a preliminary run of the new test on windows 7 on a talos machine (it should still take at least 2 weeks until it makes it into production though - even without any bumps along the way).

The test was executed 25 times, and this is an example of the intervals on one of those runs (they all show a relatively similar pattern) https://tbpl.mozilla.org/php/getParsedLog.php?id=42859174&tree=Try&full=1#error0 :

0.WebGL-terrain-alpha-no-AA-no= Average: 4.09  stddev: 2.0 (50.0%)
07:46:53     INFO -  Intervals: 2.5  2.5  2.5  7.6  6.1  2.6  2.3  3.9  7.2  2.7  6.4  2.6  2.5  5.9  2.6  4.2  2.6  6.0  3.5  4.1  2.6  2.7  7.2  4.1  2.6  7.3  2.7  6.4  2.8  2.9  3.5  3.9  2.6  3.6   [_11.1_]   3.0  3.5  2.7  2.5  4.8  6.1  2.8  2.4  7.3  2.6  2.5  5.9  4.3  2.3  2.8  6.0  4.7  7.6  2.5  2.5  6.5  2.2  4.3  2.6  2.4  7.4  2.7  2.6  7.4  2.7  2.5  6.0  7.4  2.1  2.9  4.2  5.0  4.4  2.1  5.1  2.2  4.7  6.0  2.3  2.4  3.9   [_9.5_]   2.1  2.3  7.2  2.8  2.7  2.5   [_8.2_]   2.3  2.4  7.9  2.2  3.5  2.5  2.6  6.5  2.6  2.7  7.0

07:46:53     INFO -  1.WebGL-terrain-alpha-no-AA-yes= Average: 5.39  stddev: 1.8 (34.0%)
07:46:53     INFO -  Intervals: 6.8  8.5  4.3  8.1  4.2  5.0  3.7  8.3  3.3  5.4  5.0  7.1  4.9  6.8  4.2  4.4  5.9  6.3  3.4  7.6  3.7  6.2  3.4  5.8  3.9  9.2  6.8  3.3  3.2  5.8  3.8  7.1  5.0  7.1  3.8  5.3  3.9  6.8  8.5  5.0  3.5  5.7  3.9  7.0  5.2  6.9  3.8  5.4  3.8  7.7  5.1  6.1  3.4  7.3  3.7  6.2  3.3  5.8  3.7  6.6  5.9  3.6  10.0  3.8  3.5  3.8  10.4  4.0  4.5  5.0  7.0  3.2  5.7  3.8  7.8  5.0  6.2  5.7  5.0  3.6  7.1  5.5  3.6  3.8  9.7  3.6  3.5  3.7  7.4  3.7  9.9  3.4  3.6  3.7  7.5  3.7  7.9  5.1  6.3  3.3

07:46:53     INFO -  2.WebGL-terrain-alpha-yes-AA-no= Average: 4.36  stddev: 2.5 (57.3%)
07:46:53     INFO -  Intervals: 2.7  7.5  2.2  2.8  2.6  6.0  7.8  2.1  2.8  2.5   [_10.0_]   2.5  2.6  2.5  6.0  2.8  2.6   [_11.4_]   2.7  2.8  8.1  2.7  2.7  6.7  5.3  2.7  2.8  2.6  2.8  7.8  2.7  2.8  6.0  6.8  2.9  2.6  2.8   [_11.0_]   2.7  2.6  4.8  6.8  2.7  2.8  6.9  2.8  2.6   [_9.0_]   2.7  2.9  6.6  4.0  2.9  2.6  7.7  4.0  2.8  2.7  7.8  4.0  2.9  6.8  2.1  5.0  2.6  2.9   [_11.1_]   2.4  2.8  6.7  3.9  2.9  2.7  7.7  4.3  2.5  2.7   [_9.9_]   2.2  2.8  2.7  2.8   [_10.8_]   2.3  2.8  2.6  4.9  7.9  2.6  2.8  6.9  2.9  2.7  7.0  2.5  2.8  7.8  4.0  2.6  6.8

07:46:53     INFO -  3.WebGL-terrain-alpha-yes-AA-yes= Average: 5.69  stddev: 2.0 (35.3%)
07:46:53     INFO -  Intervals: 3.2  6.1  3.7  6.9  6.1  7.8  7.4  5.6  3.6  8.0  3.8  7.1  5.1  6.8  3.3  6.1  3.6  8.8  5.1  4.0  3.7  8.3  3.5   [_11.4_]   3.3  6.7  3.2  6.1  3.9  7.8  5.0  8.1  3.2  9.1  3.2  4.8  6.0  7.8  3.2  9.0  3.3  4.9  6.0  7.8  5.4  6.9  3.3  7.0  3.9  7.9  5.0  3.9  7.6  5.1  6.8  3.4  6.0  3.6  8.7  5.0  6.8  3.4  6.0  3.7  11.1  4.2  6.3  3.3  6.0  3.7  7.7  5.0  7.7  3.8  8.5  3.3  4.7  6.1  7.6  6.0  6.5  4.1  4.5  6.0  7.7  6.0  6.4  3.3  8.1  3.8  6.9  5.1  7.0  3.2  6.2  3.7  11.1  3.3  7.0  3.2


Jeff, what do you think of the stddev values, and the intervals distribution in general? On all my local runs on Windows and OS X it was considerably more stable (typically around 10% stddev). Not sure what to make of it. What do you think?
Flags: needinfo?(jgilbert)
Depends on: 1032868
Depends on: 1032945
Joel pushed a test run of the new test on all platforms: https://tbpl-dev.allizom.org/?tree=Try&rev=60236049e355

I looked at one full log for each platform, and all look very reasonable.

One interesting thing I noticed - while all the Windows machines are supposedly the same spec - https://wiki.mozilla.org/Buildbot/Talos/Misc#Hardware_Profile_of_machines_used_in_automation - the XP results are about 5x slower than the win7/8 results.

On XP it's 20-40ms/frame, while on windows 7/8 it's about 6-9 ms/frame.

Jeff, is this a known issue? should we file a bug on this?
(In reply to Avi Halachmi (:avih) from comment #42)
> Joel pushed a test run of the new test on all platforms:
> https://tbpl-dev.allizom.org/?tree=Try&rev=60236049e355
> 
> I looked at one full log for each platform, and all look very reasonable.
> 
> One interesting thing I noticed - while all the Windows machines are
> supposedly the same spec -
> https://wiki.mozilla.org/Buildbot/Talos/
> Misc#Hardware_Profile_of_machines_used_in_automation , the XP results are
> about 5x slower than the win7/8 results.
> 
> On XP it's 20-40ms/frame, while on windows 7/8 it's about 6-9 ms/frame.
> 
> Jeff, is this a known issue? should we file a bug on this?

Yep, we know that XP has significant overhead per-frame because it does Readback instead of sharing a texture.
Flags: needinfo?(jgilbert)
(In reply to Jeff Gilbert [:jgilbert] from comment #43)
> Yep, we know that XP has significant overhead per-frame because it does
> Readback instead of sharing a texture.

Excellent! ermm.. well.. excellent that the test was able to reflect this difference between win7/8 and xp ;)
Joel added the template and I added the content at https://wiki.mozilla.org/Buildbot/Talos/Tests#glterrain

If the need for more docs arises, please set this keyword again.
Keywords: dev-doc-needed
Interesting thing which this test just "caught".

On Aug 16th bug 862563 landed, which displays a one-line notification at the bottom of the browser soon after it starts. This notification takes some space in the window, so glterrain has to render to a slightly smaller area.

Rendering to a smaller area improves the rendering speed. The test caught it as a 6% regression.

http://graphs.mozilla.org/graph.html#tests=[[325,131,35]]&sel=1407565889696.2334,1409840455335,0.491803278688522,22.62295081967213&displayrange=90&datatype=running

On Sep 2nd a patch to remove the notification on talos tests landed (bug 1060460) and so the numbers returned to their values prior to Aug 16th.

w00t!
(In reply to Avi Halachmi (:avih) from comment #46)
> Rendering to a smaller area improves the rendering speed. The test caught it
> as a 6% regression.

Ermm.. adding the notification was the "improvement". Removing the notification was caught as a 6% regression, which we dismissed once we understood it.
Performance for configurations which do Readback for compositing WebGL (Linux, WinXP) is strongly influenced by the size of the canvas. Cool that the harness picked up on this.
(In reply to Jeff Gilbert [:jgilbert] from comment #48)
> Performance for configurations which do Readback for compositing WebGL
> (Linux, WinXP) is strongly influenced by the size of the canvas. Cool that
> the harness picked up on this.

I haven't tested on Linux or XP myself, but on Windows 8.1 with Intel HD4400 (Haswell) it's also clearly affected by the window size, almost linearly (e.g. on half of a full HD screen I got 5.3 ms per frame on a specific subtest, and on full screen 9.5 ms per frame on average).

You can try it out yourself:
- Disable vsync (pref layout.frame_rate = 0) and restart.
- Visit http://hg.mozilla.org/build/talos/raw-file/tip/talos/page_load_test/webgl/benchmarks/terrain/perftest.html
Can we close this bug?
Flags: needinfo?(avihpit)
Yes. The test has proven very effective.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(avihpit)
Resolution: --- → FIXED