Investigate Talos regressions for Windows OMTC

NEW
Unassigned

Status

()

defect
6 years ago
4 years ago

People

(Reporter: nrc, Unassigned)

Tracking

Trunk
x86_64
Windows 7
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

Landing Windows OMTC regressed Talos on pretty much every measure from 4% to 400%. We'll try and fix that before turning it on again.
Depends on: 933549
We improved tscroll-ASAP on Windows 8 by 36.3%

We regressed CanvasMark, SVG-ASAP, Tab Animation Test, TResize on all versions.
We regressed tscroll-ASAP, Ts paint on XP
We regressed Paint on Win 7 
We regressed SVG-Opacity Row Major on Win 7 and 8
We regressed various flavours of Tp5 on various versions, but mostly all of them.
(In reply to Nick Cameron [:nrc] from comment #1)
> We improved tscroll-ASAP on Windows 8 by 36.3%

This might (but not necessarily) be false, as this test currently measures a different thing with OMTC.

tscroll-ASAP and SVG-ASAP use requestAnimationFrame to advance to the next layout step, so they measure the refresh driver's frequency, and in ASAP - its throughput. Without OMTC, this was also rendering throughput, but with OMTC, since the refresh driver and compositor are not in sync, it might not be (depending if compositions are faster than refresh driver iterations or not).

The solution with those would be either to not overproduce (i.e. not trigger the refresh driver before its previous output was composed), or measure both throughputs and somehow take both into account.

Preventing overproduction is desirable regardless of tests, but last time I've heard mattw and jrmuizel discuss it, it wasn't prioritized.

Modifying those tests to measure both throughputs is possible.
A lot of these tests are tied to MozAfterPaint events. What does MozAfterPaint measure with omtc turn on? Maybe some of these regressions are due to measuring the wrong metric, or maybe MozAfterPaint is broken in some way with omtc.
(In reply to Avi Halachmi (:avih) from comment #2)
> Modifying those tests to measure both throughputs is possible.

To declare whether or not we regressed, however, assuming that these tests' goals is to measure how quickly we can update the screen with layout changes, might not be so easy.

For tscroll-ASAP, most of the updates are of similar intervals (they all move the content by a fixed distance), so taking the higher average interval of the two would be comparable to non-OMTC results, and probably good indication if we regressed or not.

But for SVG-ASAP, it renders different things with quite different complexities. _If_ the refresh driver iterates faster than the compositor, then it's possible that some layouts will never get composed and displayed, which makes it harder to compare to current results. Still not sure what's the best approach here. Suggestions are welcome.

Still though, preventing overproduction should solve all issues, and the tests could stay unmodified while still providing comparable results.


(In reply to Jim Mathies [:jimm] from comment #3)
> A lot of these tests are tied to MozAfterPaint events. What does
> MozAfterPaint measure with omtc turn on? Maybe some of these regressions are
> due to measuring the wrong metric, or maybe MozAfterPaint is broken in some
> way with omtc.

mozAfterPaint is only relevant for tests which don't report their own results. E.g. the TP5 tests where the pageloader addon measures load + first-paint duration instead of only the onload event.

tscroll-ASAP and SVG-ASAP measure the results on their own, using rAF, and therefore should be completely unaffected by mozAfterPaint.

TP5 and paint should be affected by mozAfterPaint, and I haven't looked at SVG-Opacity yet.
mstange, jmaher and myself discussed this on IRC. We have few problems here, and some solutions:

The problems:

1. We need to make the talos tests meaningful also with OMTC, i.e. represent, hopefully to a good degree, how the user experiences the relevant perf. This is true for OMTC in general and not only for windows, and AFAIK we didn't explicitly take care of it so far.

We're aware of at least two tests where the results don't represent the users' perspective with OMTC: tscroll and tsvg, since they currently measure refresh driver perf without taking into account compositions perf, while the users perception correlates with the slower of the two.


2. For the original goal of this bug: make sure we can compare OMTC on/off performance from a user's perspective. We need this in order to understand if/what we've regressed by moving windows to OMTC, and by how much, before deciding how to handle it (fix/accept/in-between) (wouldn't hurt to know that for OS X as well, but we're probably past the stage where it matters). Also, to be able to measure progress in this regard.

Solving 2 might also solve 1 (because the assumption is that non-OMTC talos results do reflect perf well), but not necessarily. Still, this bug should probably focus on 2.


As for specific tests:

A. All the OMTC off test runs (e.g. on XP, Linux) are perfectly comparable and any regressions there are real (due to OMTC refactoring rather than due to enabling OMTC).

B. Canvasmark or pageloader tests which don't use mozAfterPaint (not sure we have those though) are still valid with OMTC and comparable to OMTC off. They measure how long it takes to load a page (the load event). OMTC doesn't "pollute" these results, and any regressions here are real.

C*. TART is still valid and the OMTC results are comparable to OMTC-off results. TART (now) measures composition intervals, which is what matters on tab animations as far as the user is concerned, so we're covered here - any changes in results are "real".

D*. tscroll-ASAP is _probably_ NOT comparable right now. Currently it measures rAF throughput - which could be faster than compositions, but changing it to composition throughput is easy and would make the results comparable, as this is what the user sees while scrolling.

E. Everything which uses mozAfterPaint is NOT comparable right now (TP5 mozAfterPaint, ts_paint, paint, etc). According to mstange, the mozAfterPaint event is fired after Present without OMTC, but before the the compositor composes with OMTC. Therefore, all the mozAfterPaint results should show "better" results with OMTC, but only because they won't take composition into account. _If_ we still think we want to measure actual duration until Present, then mozAfterPaint would need to be modified for OMTC, and then the results would be comparable. 

F. SVG-ASAP results are probably not comparable with OMTC on/off, and we also don't currently have a good approach to make them comparable. The root problem here is that this test renders frames with varying complexity, which invalidates the approach of worst average between refresh-driver/compositions throughput (since the averages don't represent the whole set due to the variation).

If, however, we could make sure that every refresh driver iteration is composed and presented, the problem is solved.

G. The last point might be a bigger issue of OMTC in its current implementation AFAIK: it doesn't assure that every layout iteration is painted. I'm not sure what the spec says of this, but if a code makes a layout change which should bring the screen to state A, then waits for rAF callback, then makes another change to state B, AFAIK it might never paint A with OMTC if the compositor was busy and only became available to paint state B.


* When measuring compositions intervals, the results are valid and comparable to OMTC-off:

- Assuming that compositions can't happen more frequently than refresh driver iterations (so we don't compose the same layout twice and get false rate), and that there's no other APZC or async animations activity - which AFAIK is indeed the case during tscroll, TART and tsvg.

- However, as long as composition is the bottleneck, it will mask any refresh driver throughput regressions. This is not critical for 2 above, but is useful for 1. However, this is manageable by also measuring rAF and reporting it as well.


My suggestions are as follows:
- Start discussing A/B/C regressions now (examine/fix/accept/etc).
- I'll fix D, and then it could be handled as well.
- Someone will fix E (roc?) or decide that we don't care about mozAfterPaint anymore, or decide that we accept that it measures a slightly different thing now (making the results incomparable with OMTC on/off).
- Someone will check the spec for G (bz? roc?). If the spec says that we must paint every layout change on rAF, it probably means we should fix overproduction (mattw? mstange?), and it would also solve F, otherwise, not sure yet how to handle it.

Thoughts?
I think we should definitely fix overproduction ASAP. It should be easy to have the compositor send an async reply to the main thread when it's done a composition, and after sending a layer tree update the main thread should not do another refresh driver tick until that reply has been received.
Marcus points out that that could reduce paralellism. And in fact, it's really a half-baked version of vsync control. So why don't we do our vsync plan which we've already worked out fairly well:
https://groups.google.com/forum/#!msg/mozilla.dev.platform/dve8ZqaMMiU/yV9agk-Dp2gJ
(In reply to Avi Halachmi (:avih) from comment #5)
> - Someone will fix E (roc?) or decide that we don't care about mozAfterPaint
> anymore, or decide that we accept that it measures a slightly different
> thing now (making the results incomparable with OMTC on/off).

Note that MozAfterPaint is used in production code, not just tests.
(In reply to Dão Gottwald [:dao] from comment #8)
> (In reply to Avi Halachmi (:avih) from comment #5)
> > - Someone will fix E (roc?) or decide that we don't care about mozAfterPaint
> > anymore, or decide that we accept that it measures a slightly different
> > thing now (making the results incomparable with OMTC on/off).
> 
> Note that MozAfterPaint is used in production code, not just tests.

Indeed, I meant in the context of talos, but didn't express it clearly enough.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #7)
> ... So why don't we do our vsync plan which we've already worked out fairly well ..

The problem is that right now, some talos tests, specifically tscroll and tsvg, can't help us to measure regressions of the move to OMTC.

We can do:

1. Fix overproduction. If we wait to fix overproduction, then no further modification should be required IMO on talos, as rAF intervals would also reflect composition intervals (and in ASAP...).

2. Modify tscroll and tsvg, for instance to report average between rAF intervals and composition intervals. For tsvg it might also require to modify some tests slightly to produce more consistent intervals, such that the average will be meaningful even when compositions are possibly skipped (there's at least one tsvg test where it's relevant).

3. Both: do 2 now, and maybe revert it once 1 lands.


I can do 2 relatively easily, and I can't do 1. What are our priorities on this?
Flags: needinfo?(roc)
The way you've framed it, 3 is the obvious answer :-)
Flags: needinfo?(roc)
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #11)
> The way you've framed it, 3 is the obvious answer :-)

Yup. I already got a local talos tscroll version which measures both rAF intervals and composition intervals, and then averages them.

While I only tested on OS X, and the intervals arrays are clearly different, their averages are pretty much the same. This suggests that compositions are not the bottleneck on the cases which I tested and that we don't over produce on these cases, which makes the current rAF based measurements still valid and indicative.

I'll test on windows as well, and do the same with svg. If all the cases will show that rAF alone is as good as rAF+compositions, then it means the current regressions/improvements numbers from turning OMTC on are valid, and can be worked on.

Still, if, due to some future bug, composition will take considerably longer, rAF based numbers will not expose the regression.

I'll post my conclusions tomorrow.
Profiling scrolling on d3d9 I observe a lot of time used where Present is calling WaitForSingleObject (on the compositor thread, Present is 16% of the time, and WaitForSingleObject takes 13%). I don't see this when we have single threading. I have tried passing D3DPRESENT_INTERVAL_IMMEDIATE to the device (we already pass it to the swap chain), and D3DPRESENT_INTERVAL_ONE. Also passing D3DPRESENT_DONOTWAIT to Present. None of those seem to affect it. Any ideas how to address this? Or any other ideas for optimising our d3d9 use - I don't see anything interesting other than this from the profiles (and we are ~400% worse on tscroll, which I do not understand at all).
Flags: needinfo?(dglastonbury)
Flags: needinfo?(bas)
(In reply to Nick Cameron [:nrc] from comment #13)
> Profiling scrolling on d3d9 I observe a lot of time used where Present is
> calling WaitForSingleObject (on the compositor thread, Present is 16% of the
> time, and WaitForSingleObject takes 13%). I don't see this when we have
> single threading. I have tried passing D3DPRESENT_INTERVAL_IMMEDIATE to the
> device (we already pass it to the swap chain), and D3DPRESENT_INTERVAL_ONE.
> Also passing D3DPRESENT_DONOTWAIT to Present. None of those seem to affect
> it. Any ideas how to address this? Or any other ideas for optimising our
> d3d9 use - I don't see anything interesting other than this from the
> profiles (and we are ~400% worse on tscroll, which I do not understand at
> all).

Interestingly, using D3DPRESENT_INTERVAL_ONE changes use from 13% to 85% waiting there, plus adds a whole lot more waiting to the main thread.
Depends on: 951554
Depends on: 951556
No longer depends on: 933549
Adding for reference the talos tscroll OMTC modification:
- Adds a framesRecordingApi addon which proxies [start/]stopFrameTimeRecording privileged API on dom utils, and include it when running the tscroll tests.
- Modified tscroll: instead of only measuring rAF intervals, it now also measures composition intervals, and then reports the average of the two.

Notes:
- To run tscroll pages outside of talos, the addon is required with this patch.
- On my tests on OSX and windows, the results didn't change meaningfully with this patch, indicating that we're not overproducing on these cases. That's good info, but it also makes this patch redundant if we assume that we're not overproducing.
(In reply to Nick Cameron [:nrc] from comment #14)
> (In reply to Nick Cameron [:nrc] from comment #13)
> > Profiling scrolling on d3d9 I observe a lot of time used where Present is
> > calling WaitForSingleObject (on the compositor thread, Present is 16% of the
> > time, and WaitForSingleObject takes 13%). I don't see this when we have
> > single threading. I have tried passing D3DPRESENT_INTERVAL_IMMEDIATE to the
> > device (we already pass it to the swap chain), and D3DPRESENT_INTERVAL_ONE.
> > Also passing D3DPRESENT_DONOTWAIT to Present. None of those seem to affect
> > it. Any ideas how to address this? Or any other ideas for optimising our
> > d3d9 use - I don't see anything interesting other than this from the
> > profiles (and we are ~400% worse on tscroll, which I do not understand at
> > all).
> 
> Interestingly, using D3DPRESENT_INTERVAL_ONE changes use from 13% to 85%
> waiting there, plus adds a whole lot more waiting to the main thread.

Try using a 'single threaded' device instead of a multithreaded device. This should be perfectly possible although it might lead to some artifacts.
Flags: needinfo?(bas)
(In reply to Bas Schouten (:bas.schouten) from comment #16)
> (In reply to Nick Cameron [:nrc] from comment #14)
> > (In reply to Nick Cameron [:nrc] from comment #13)
> > > Profiling scrolling on d3d9 I observe a lot of time used where Present is
> > > calling WaitForSingleObject (on the compositor thread, Present is 16% of the
> > > time, and WaitForSingleObject takes 13%). I don't see this when we have
> > > single threading. I have tried passing D3DPRESENT_INTERVAL_IMMEDIATE to the
> > > device (we already pass it to the swap chain), and D3DPRESENT_INTERVAL_ONE.
> > > Also passing D3DPRESENT_DONOTWAIT to Present. None of those seem to affect
> > > it. Any ideas how to address this? Or any other ideas for optimising our
> > > d3d9 use - I don't see anything interesting other than this from the
> > > profiles (and we are ~400% worse on tscroll, which I do not understand at
> > > all).
> > 
> > Interestingly, using D3DPRESENT_INTERVAL_ONE changes use from 13% to 85%
> > waiting there, plus adds a whole lot more waiting to the main thread.
> 
> Try using a 'single threaded' device instead of a multithreaded device. This
> should be perfectly possible although it might lead to some artifacts.

This led pretty quickly into deadlock. On the brightside, there were no rendering artefacts :-)
I discovered we are unintentionally doing double buffering with d3d9. I did a try/talos (https://tbpl.mozilla.org/?tree=Try&rev=95f8ef58faf1) push with single buffering and found that it basically makes no difference where it matters. However, for Windows 7/8 it vastly improves our drawing heavy tests (SVG/canvasmark) (more than the work I did in bug 951554). However, we lose the improvement in scrolling. I hope we can make double buffering better and so get the benefit from both, but we might need to make a tradeoff here.

Given the name, I assume canvasmark does canvas stuff, so I would hope it would not be too affected by how we compose content layers. That is a bit mysterious.
(In reply to Avi Halachmi (:avih) from comment #15)
> Created attachment 8349394 [details] [diff] [review]
> talos tscroll - OMTC
> 
> Adding for reference the talos tscroll OMTC modification:
> - Adds a framesRecordingApi addon which proxies
> [start/]stopFrameTimeRecording privileged API on dom utils, and include it
> when running the tscroll tests.
> - Modified tscroll: instead of only measuring rAF intervals, it now also
> measures composition intervals, and then reports the average of the two.
> 
> Notes:
> - To run tscroll pages outside of talos, the addon is required with this
> patch.
> - On my tests on OSX and windows, the results didn't change meaningfully
> with this patch, indicating that we're not overproducing on these cases.
> That's good info, but it also makes this patch redundant if we assume that
> we're not overproducing.

Did you try on Windows XP? Or using d3d9 and no d2d (which is almost equivalent)? TART and TScroll regressed by ~400% there and I really don't understand that. I see some bad waits in the profile and I don't know how to fix those, but I still don't think they should cause such a regression. Also using the browser, scrolling feels about the same with MT/OMTC, so 400% regression seems dubious. I guess it could just be because I am testing on a fast machine.
Flags: needinfo?(avihpit)
I didn't try on XP. Following chat with Nick on IRC, I'll try to reproduce the XP regression on windows 7 by forcing d3d9 and disabling d2d. I'll post some results soon.
Flags: needinfo?(avihpit)
Pastes of emails:

Regression: Mozilla-Inbound-Non-PGO - tscroll-ASAP - WINNT 5.1 (ix) - 387% increase
-----------------------------------------------------------------------------------
    Previous: avg 3.641 stddev 0.080 of 12 runs up to revision dda95b5cb528
    New     : avg 17.739 stddev 0.296 of 12 runs since revision ba41e330fc7a
    Change  : +14.099 (387% / z=175.939)
    Graph   : http://mzl.la/18DMdqt

Changeset range: http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=dda95b5cb528&tochange=ba41e330fc7a

Changesets:
  * http://hg.mozilla.org/integration/mozilla-inbound/rev/ba41e330fc7a
    : Nicholas Cameron <ncameron@mozilla.com> - Bug 899785. Turn on Windows OMTC on nightly. r=mattwoodrow
    : http://bugzilla.mozilla.org/show_bug.cgi?id=899785

Bugs:
  * http://bugzilla.mozilla.org/show_bug.cgi?id=899785 - Turn on OMTC Direct3D11 by default



Regression: Mozilla-Inbound-Non-PGO - Tab Animation Test - WINNT 5.1 (ix) - 394% increase
-----------------------------------------------------------------------------------------
    Previous: avg 4.830 stddev 0.048 of 12 runs up to revision dda95b5cb528
    New     : avg 23.849 stddev 0.451 of 12 runs since revision ba41e330fc7a
    Change  : +19.018 (394% / z=399.816)
    Graph   : http://mzl.la/18DMdH1

Changeset range: http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=dda95b5cb528&tochange=ba41e330fc7a

Changesets:
  * http://hg.mozilla.org/integration/mozilla-inbound/rev/ba41e330fc7a
    : Nicholas Cameron <ncameron@mozilla.com> - Bug 899785. Turn on Windows OMTC on nightly. r=mattwoodrow
    : http://bugzilla.mozilla.org/show_bug.cgi?id=899785

Bugs:
  * http://bugzilla.mozilla.org/show_bug.cgi?id=899785 - Turn on OMTC Direct3D11 by default


I see similar numbers with every talos run, I don't have any links to hand though, sorry. (oh actually, the one in comment 18).
System: Windows 7, i7-3630qm (HD4000), optimus nvidia gt650m, recent drivers.
"xp mode" (nrc): layers.prefer-d3d9 = true,  gfx.direct2d.disabled = true
OMTC on/off using: layers.offmainthreadcomposition.enabled

tscroll: tested only the pages: tiled-fixed.html and tiled-fixed-downscale.html
TART: simple-fade-dpi-current only (purest GFX test), numbers are for open/close tab animation.

The tests were performed outside of talos, as follows:
1. ASAP mode for both (needs restart): layout.frame_rate=0, docshell.event_starvation_delay_hint=1 (int)
2. For tscroll: just load the pages from the talos repo in the browser.
3. For TART: install the addon xpi from bug 848358, Enable only the "simpleFadeDpiCurrent" test, set repeat to 5, and click the start button.

- Lower numbers are better (represent average iteration interval). On all the tscroll runs, I used the patched version from comment 15 which also measure composition intervals, but they were not meaningfully different than the rAF numbers (less than 2% diff probably).

- intel, omtc off, xp mode off
tscroll tiled-fixed.html:   6.4
tiled-fixed-downscale.html: 6.9
TART: 2.8/1.9

- intel omtc on, xp mode off
tscroll tiled-fixed.html:   5.6
tiled-fixed-downscale.html: 5.8
TART: 4.2/2.2

- intel, omtc off, xp mode on
tscroll tiled-fixed.html:   5.2
tiled-fixed-downscale.html: 6.2
TART: 2.2/1.4

- intel, omtc on, xp mode on
tscroll tiled-fixed.html:    7.7
tiled-fixed-downscale.html : 8
TART: 11.2/10


- nvidia, omtc off, xp mode off
tscroll tiled-fixed.html:    3
tiled-fixed-downscale.html : 3
TART: 2.2/1.4

- nvidia, omtc on, xp mode off
tscroll tiled-fixed.html:    3.6
tiled-fixed-downscale.html : 3.6
TART: 2.4/1.6

- nvidia, omtc off, xp mode on
tscroll tiled-fixed.html:   2.1
tiled-fixed-downscale.html: 2.1
TART: 2.1/1.3

- nvidia, omtc on, xp mode on
tscroll tiled-fixed.html:   2.5
tiled-fixed-downscale.html: 2.5
TART: 5.9/3.8


Conclusions (rough numbers when moving from OMTC off -> on):

Intel HD4000, Windows 7:
- tscroll: 20% improvement
- TART:    50% regression

Intel HD4000, XP mode:
- tscroll: 40% regression
- TART:    400% regression

Nvidia GT650m, Windows 7:
- tscroll: 16% regression
- TART:    10% regression

Nvidia GT650m, XP mode:
- tscroll: 25% regression
- TART     300% regression
Flags: needinfo?(dglastonbury)
BenWa: this is the only bug for the OMTC regressions - there is not a d3d9 specific one.
Assignee: nobody → bgirard
Assignee: bgirard → nobody
You need to log in before you can comment on or make changes to this bug.