902024 - (australis-tart) Investigate Australis tab animation performance compared to m-c

Reporter

Description

•

11 years ago

Summary of previous events:

Australis tab animation performance was improved greatly at bug 837885, up to a level which seemed quite close to m-c on a slow ATOM system (IIRC <10% and sometimes even slightly better than m-c).

Later, a talos tab animation regression test was developed (TART - bug 848358). This test provides higher resolution results than the instrumentation which was used previously.

Preliminary comparison using TART shows that on pure animation throughput, Australis is worse by 20-40% compared to m-c. These results are after testing on a single slow Win7 AMD E-350 system with blacklisted D2D and D3D HW.

The main differences of TART compared to the previous instrumentation are:

1. TART is running in ASAP mode (i.e. unlimited frame rate), which exposes perf differences even if under normal conditions both contenders would do 60fps.

2. TART reports the overall intervals average, but also provides another result: the average interval over the 2nd half of the animation. The main difference between those is that the latter is mostly unaffected by overheads which are unrelated to Australis (tab initialization etc which typically affects the first few frame intervals negatively, regardless of Australis).

The estimation of 20-40% regression is based on this "half duration" throughput on the test system.

Note that under normal conditions, the performance difference between Australis and m-c is much less acute. e.g. in absolute frame intervals values, Australis did, on average, still better than 60FPS throughput during the second half of the animation.

This bug is about investigating Australis regressions on more than one test system, maybe trying to find causes for these differences, and hopefully providing enough info to allow a more informed decision on how far do we want to go with improving Australis performance.

CC'ing everyone which was on the mail thread. I won't be offended if one un-CC herself ;)

Vladan Djeric (:vladan)

Updated

•

11 years ago

Depends on: 902678

Avi Halachmi (:avih)

Reporter

Comment 1

•

11 years ago

Attached file TART-bench-2013-08-06-data.txt — Details

I compared tab animation performance of m-c vs UX (both 26.a1 2013-08-06) using TART on 4 systems.

TL;DR:

- On the fast windows system and on OS X (MBA), UX performance is roughly similar to m-c (give or take on specific subtests) and has potential for well above 60fps.

- On both of the slow windows systems, UX performed generally worse than m-c by ~20%-100% depending on context and noise (100% = twice as long intervals = half the throughput). FPS ranges from around 60hz to ~30hz with noticeable occasional jank (on UX more than on m-c, on Atom much more than on the AMD system).

- DPI scaling noticeably regresses performance on both UX and m-c, but UX appears to suffer from it more.


Overall: Australis is great on fast systems, and not terrible IMO on slow systems, but there's definitely room for improvement on the slower systems, where m-c does better even if also not perfect.


---- The long version ----

- The TART XPI which I used is from bug 848358 comment 15 (v1.3 WIP) (navigate to chome://tart/content/tart.html once the addon is installed).

TART default configuration was used except:
- Configured to run 5 times in a row (so it also provides average and stddev of the runs).
- Only the icon-DPI1 and icon-Dpi2 tests were selected.

I ran each test both in default refresh rate, and in ASAP mode (layout.refresh-rate=0, but on windows 10000 will work as well).

I ran each configuration few times to make sure the results are as consistent as they can get (they're more noisy on the slower systems).

I also watched each run and can confirm that the results correlate reasonably well to my subjective smoothness assessment.

I also triggered some random cases manually (not using TART) and also confirm that I haven't noticed any notable subjective differences between the performance within TART and the manual triggers.

As for the results of ASAP mode vs default refresh rate, it's fairly expected, i.e:
- If it performs well above 60fps in ASAP, then it does pretty stable 60hz at default rate.
- If it performs worse than 60hz in ASAP, then that's also roughly the performance at the default rate.
- If it performs around or marginally better than 60hz in ASAP, then typically it performs somewhat worse than 60hz in default rate, but not by a lot.

You can examine the full results (averages) of the runs from each config at the attached file, but here I'll focus on ASAP mode only.

Also, I'm comparing the "half" results only - which are the average frame intervals over the last 125ms of the animation, when it's hopefully already stable.


Here are the numbers:

--------------------
Fast system:
(Win7 laptop, i7-3630qm+iGPU, cores: 4+HT, ram: 16G, win-accel: D3D10, d2d: yes, sunspider 1.0: 120ms)

m-c, DPI 1.0: 5ms/iteration on tab open/close.
m-c, DPI 2.0: 10ms/iteration on tab open, 8ms on tab close.

ux, DPI 1.0: 5-6ms/iteration on open/close.
ux, DPI 2.0: 7ms on open, 10ms on close.

----------------
Slow System 1:
(Win7 laptop, AMD E350+iGPU, cores: 2, ram: 4G, win-accel: no, d2d: no, sunspider 1.0: 450ms)

m-c, DPI 1: 7ms open, 6ms close.
m-c, DPI 2: 9ms open, 6ms close.

ux, DPI 1: 11ms open, 14ms close.
ux, DPI 2: 18ms open, 26ms close.

Forcing hw composition using layers.acceleration.force-enabled=true didn't change the results too meaningfully.

--------------------
Slow system 2:
(Win8 tablet, Atom z2760+iGPU, Cores: 2+HT, ram: 2G, win-accel: D3D9, d2d: no, sunspider 1.0: 620ms)

m-c, DPI 1: 12ms open, 15ms close
m-c, DPI 2: 25ms open, 19ms close

ux, DPI 1: 16ms open, 21ms close.
ux, DPI 2: 21ms open, 28ms close

--------------------
OS X, medium speed (OMTC disabled because it currently breaks intervals recording):
(MBA 13" late 2010 2.13GHz, M.Lion 10.8.4, cores: 2, ram: 4G, sunspider1.0: 230ms)

m-c, DPI 1: 13ms open, 10ms close.
m-c, DPI 2: 13ms open, 11ms close.

ux, DPI 1: 11ms open, 9ms close.
ux, DPI 2: 14ms open, 8ms close.

--------------------

The conclusion is at the TL;DR section above.

Please note that TART is still young, and it possibly makes incorrect assumptions/setups/etc. To help improve TART, followup at bug 848358.

However, overall I feel that the results are indicative enough of the rough state of affairs wrt tab animation performance in general, and also wrt to the differences between m-c and ux.

:Gijs (he/him)

Comment 2

•

11 years ago

I am confused by the results. Some questions:

(In reply to Avi Halachmi (:avih) from comment #1)
> TART default configuration was used except:
> - Configured to run 5 times in a row (so it also provides average and stddev
> of the runs).

What was the stddev for these runs?

> --------------------
> Fast system:
> (Win7 laptop, i7-3630qm+iGPU, cores: 4+HT, ram: 16G, win-accel: D3D10, d2d:
> yes, sunspider 1.0: 120ms)
> 
> m-c, DPI 1.0: 5ms/iteration on tab open/close.
> m-c, DPI 2.0: 10ms/iteration on tab open, 8ms on tab close.
> 
> ux, DPI 1.0: 5-6ms/iteration on open/close.
> ux, DPI 2.0: 7ms on open, 10ms on close.
> 
> ----------------
> Slow System 1:
> (Win7 laptop, AMD E350+iGPU, cores: 2, ram: 4G, win-accel: no, d2d: no,
> sunspider 1.0: 450ms)
> 
> m-c, DPI 1: 7ms open, 6ms close.
> m-c, DPI 2: 9ms open, 6ms close.
> 
> ux, DPI 1: 11ms open, 14ms close.
> ux, DPI 2: 18ms open, 26ms close.
> 
> Forcing hw composition using layers.acceleration.force-enabled=true didn't
> change the results too meaningfully.
> 
> --------------------
> Slow system 2:
> (Win8 tablet, Atom z2760+iGPU, Cores: 2+HT, ram: 2G, win-accel: D3D9, d2d:
> no, sunspider 1.0: 620ms)
> 
> m-c, DPI 1: 12ms open, 15ms close
> m-c, DPI 2: 25ms open, 19ms close
> 
> ux, DPI 1: 16ms open, 21ms close.
> ux, DPI 2: 21ms open, 28ms close
> 
> --------------------
> OS X, medium speed (OMTC disabled because it currently breaks intervals
> recording):
> (MBA 13" late 2010 2.13GHz, M.Lion 10.8.4, cores: 2, ram: 4G, sunspider1.0:
> 230ms)
> 
> m-c, DPI 1: 13ms open, 10ms close.
> m-c, DPI 2: 13ms open, 11ms close.
> 
> ux, DPI 1: 11ms open, 9ms close.
> ux, DPI 2: 14ms open, 8ms close.
> 
> --------------------

This is surprising. On OS X, we're faster than m-c everywhere but in double-DPI tabopening (+1ms). The effect is most pronounced when looking at double-DPI tabclosing, where m-c is 38% slower (3ms).

The opposite happens on the fast Windows machine, where we're faster in tabopening but slower at tabclosing.

The slow Windows machines also have some strange results. For one, spec-wise, it makes no sense that we're performing roughly identically on the two slower devices when it comes to double-DPI tabclose, but m-c performs 3 times as well on the AMD device (and 4 times as fast as UX on that machine, too).

For another, we're slower on all tests on both slow machines *except* on double-DPI tab-opening on the Atom, where m-c is ~20% slower.

I'd really like to see the spread on these numbers both between runs and between frames. Did you control the CPU stepping/speed for the slower devices? As they're somewhat newer laptops/tablets, they probably do highly magical things in determining how much CPU you're getting at one point, and that could influence the results rather dramatically...

Jared Wein [:jaws] (please needinfo? me)

Updated

•

11 years ago

Depends on: 902924

Avi Halachmi (:avih)

Reporter

Comment 3

•

11 years ago

(In reply to :Gijs Kruitbosch (PTO Aug 8-Aug 18) from comment #2)
> What was the stddev for these runs?

Stddev of the _averages_ (i.e. of several <name>.half values - and not of specific frame intervals) can be found at the attachment of comment 1. Note that it's over 5 runs only, so consider it possibly noisy. I haven't saved the individual frame intervals from each run, though TART displays them once it completes its run.

If you think some of the results are questionable, I can repeat these runs and post data including individual intervals from each run (a lot of data). Or you can run TART yourself and collect data on platforms with "unexpected" results.


> This is surprising. On OS X, we're faster than m-c everywhere but in
> double-DPI tabopening (+1ms). The effect is most pronounced when looking at
> double-DPI tabclosing, where m-c is 38% slower (3ms).
> ...

True about the differences. Here are the exact numbers (from the attachment above) on OS X:

Nightly ASAP:
TART.icon-open-DPI1.half     Average (5): 12.67 stddev: 3.84
TART.icon-close-DPI1.half    Average (5): 10.13 stddev: 4.57

TART.icon-open-DPI2.half     Average (5): 12.85 stddev: 2.99
TART.icon-close-DPI2.half    Average (5): 11.13 stddev: 3.42

UX ASAP:
TART.icon-open-DPI1.half     Average (5): 10.89 stddev: 1.22
TART.icon-close-DPI1.half    Average (5): 8.63  stddev: 4.29

TART.icon-open-DPI2.half     Average (5): 13.80 stddev: 5.35
TART.icon-close-DPI2.half    Average (5): 7.83  stddev: 2.56

Note that stddev here is not negligible. 


> The slow Windows machines also have some strange results...

We have a lot of performance differences between different HW. For instance, the AMD system which I tested is roughly 50% faster per core than the (terribly slow) Atom system. The AMD's iGPU is also vastly superior to the atom's (it actually runs Half life 2 in > 30fps at 720p, where the atom struggles to run HL1 without jerking terribly). However, the AMD machine also has GPU composition blacklisted in Firefox, while the atom machine has it enabled by default (D9D9).

So considering the many different code paths, optimizations, advantages and disadvantages of each system wrt to different computational and graphics performance, I'd find it hard to define results as "strange". They're just different.


> Did you control the CPU stepping/speed for the slower
> devices?... that could influence the results rather dramatically...

I didn't touch stepping/turbo, etc, however, both machine step up to the highest performance level as soon as some load starts, and stay at that level until the load is gone (I do have a habit of keeping various CPU monitors running). Both systems don't throttle (the AMD never throttles, and the atom is with passive cooling, yet I never saw it throttling. It's 3W TDP).

I don't think throttling is a factor on my results.

Avi Halachmi (:avih)

Reporter

Comment 4

•

11 years ago

Also, in general, unless there's a specific perf difference that really doesn't make sense at the numbers I've posted, and which is important to understand if this difference is real, I think the looking at specific values misses the big picture.

First of all, the results were noisy, and at the slower system they're quite more noisy and more janky. Individual "wrong" numbers might creep in.

But if you take a step back and consider all the results, I think that the trends are relatively clear.

When there's a specific difference or trend which we think is important, by all means it warrants running it more times and looking at the collected values with more scrutiny and in higher resolution.

Again, I'm willing to re-run whatever tests we need, under any special setup which we think is important, and collect as much data as deemed required. And also anyone else could do this on his/her machine. Using TART is pretty straightforward.

Jared Wein [:jaws] (please needinfo? me)

Updated

•

11 years ago

No longer depends on: 902924

Matthew N. [:MattN]

Updated

•

11 years ago

Depends on: 904924

Matthew N. [:MattN]

Comment 5

•

11 years ago

Hey Bas,

I did Moz2D recordings and SPS profiles of the basic TART test on both m-c and UX nightlies on a Windows 7 netbook. Could you help analyze them to identify what makes the tab animations slower on UX?

TART Results without recording/profiles on Nightly 20130820, Aero Glass, restored window, ASAP, HWA blocked (Half / All for 20 iterations)
m-c: open: 11.34 / 17.50  close:  8.79 / 12.13
UX:  open: 15.15 / 25.65  close: 18.61 / 22.64

Profiles for ~8 iterations:
m-c: http://people.mozilla.com/~bgirard/cleopatra/#report=721799032cf4e6ea1d86e61dfbeb50f449e8d203
ux:  http://people.mozilla.com/~bgirard/cleopatra/#report=21510942e438e7bbcb09b757a08a6dda2dfc0d60

Moz2D Recordings:
m-c: https://people.mozilla.com/~mnoorenberghe/gfx_recordings/mc-nightly-tart-basic-20130820.aer
UX:  https://people.mozilla.com/~mnoorenberghe/gfx_recordings/ux-nightly-tart-basic-20130820.aer

Flags: needinfo?(bas)

Keywords: meta

Hardware: x86_64 → All

Avi Halachmi (:avih)

Reporter

Comment 6

•

11 years ago

(In reply to Matthew N. [:MattN] from comment #5)
> 
> I did Moz2D recordings and SPS profiles of the basic TART test ...

Next time, when you profile, uncheck "Accurate first recorded frame" at TART.

If you keep it checked, then TART will change the opacity of the Back button several times before starting the tab animation. This is quite an ugly workaround to make sure that the 
(TART) recording of the first interval is accurate. However, if you profile then it adds irrelevant noise before animation which are measured by TART. When it's unchecked, then one should regard the first frame interval which TART measures/reports as inaccurate (measures longer than it actually is).

I'll add a note to the TART UI on this.

@bas, when you look at these profiles, select a single animation range (several frames) at the profiler to exclude this initial noise. The intervals are typically consistently longer on UX compared to m-c, so theoretically, selecting just few frames range should be enough to notice the difference in performance.

Matthew N. [:MattN]

Comment 7

•

11 years ago

(In reply to Avi Halachmi (:avih) from comment #6)
> (In reply to Matthew N. [:MattN] from comment #5)
> > 
> > I did Moz2D recordings and SPS profiles of the basic TART test ...
> 
> Next time, when you profile, uncheck "Accurate first recorded frame" at TART.

I did that because I thought of the issue you mentioned.

The recordings didn't seem to work anyways. It only recorded the content area for a short duration. I'm not sure of the cause yet but I wonder if it has to do with HWA being blocked or the new e!0s stuff.

Matthew N. [:MattN]

Comment 8

•

11 years ago

(In reply to Matthew N. [:MattN] from comment #7)
> The recordings didn't seem to work anyways. 

To clarify, I mean the Moz2d recordings didn't work, everything else is fine.

TART-bench-2013-08-06-data.txt 11 years ago Avi Halachmi (:avih) 16.03 KB, text/plain		Details
mc-merged-profile-tart-winxp-af36680248ef.txt.zip 11 years ago Mike Conley (:mconley) (:⚙️) 2.72 MB, application/zip		Details
ux-merged-profile-tart-winxp-c7f2f6ee6139.txt.zip 11 years ago Mike Conley (:mconley) (:⚙️) 2.61 MB, application/zip		Details
ux-merged-profile-tart-winxp-355582f77d71.txt.zip 11 years ago Mike Conley (:mconley) (:⚙️) 2.73 MB, application/zip		Details
ux-profile-tart-winxp-351d311e107c.txt.zip 11 years ago Mike Conley (:mconley) (:⚙️) 2.68 MB, application/zip		Details
mc-profile-tart-winxp-579bfedd91f6.txt.zip 11 years ago Mike Conley (:mconley) (:⚙️) 2.65 MB, application/zip		Details
mc-merged-profile-tart-snowleopard-e9b79f481864.txt.zip 11 years ago Mike Conley (:mconley) (:⚙️) 2.48 MB, application/zip		Details
ux-merged-profile-tart-snowleopard-34cbdba2e0de.txt.zip 11 years ago Mike Conley (:mconley) (:⚙️) 2.55 MB, application/zip		Details
ux-merged-profile-tart-lion-34cbdba2e0de.txt.zip 11 years ago Mike Conley (:mconley) (:⚙️) 2.65 MB, application/zip		Details
mc-merged-profile-tart-lion-e9b79f481864.txt.zip 11 years ago Mike Conley (:mconley) (:⚙️) 2.66 MB, application/zip		Details
ux-merged-profile-tart-mountainlion-34cbdba2e0de.txt.zip 11 years ago Mike Conley (:mconley) (:⚙️) 3.55 MB, application/zip		Details
mc-merged-profile-tart-mountainlion-e9b79f481864.txt.zip 11 years ago Mike Conley (:mconley) (:⚙️) 3.82 MB, application/zip		Details
m-c vs UX reflow comparison profile - Lion 11 years ago Mike Conley (:mconley) (:⚙️) 3.80 MB, application/zip		Details
compare-snowleopard-03c32421e063-4a4ea81339a7.txt.zip 11 years ago Mike Conley (:mconley) (:⚙️) 3.63 MB, application/zip		Details