Closed Bug 938395 Opened 6 years ago Closed 6 years ago

Optimize our DrawThebesLayer invalid region for Direct2D

Categories

(Core :: Layout, defect, P3)

x86
macOS
defect

Tracking

()

RESOLVED FIXED
mozilla32
Tracking Status
firefox31 --- fixed
firefox32 --- fixed

People

(Reporter: mattwoodrow, Assigned: mattwoodrow)

References

Details

(Keywords: perf, Whiteboard: [qa-])

Attachments

(2 files, 3 obsolete files)

Direct2D is fairly slow at drawing when we have a complex (anything other than a single rect) clip applied. It needs to push a temporary layer, draw to that, and then use the clip geometry as a mask when copying back.

Bug 934860 changed our ThebesLayer drawing behaviour to instead draw each rect of the invalid region separately to avoid this slow path. Unfortunately it appears that the cost of drawing display items is somewhat proportional to the number of items we draw, and not just the number of pixels filled (which makes sense). This resulted in a considerable regression for TART on the UX branch.

One possible solution is to modify the invalid region into something simpler that will result in optimal performance for direct2d.

A simple example of a bad case is:

Display item 1: 0,0,100,20
Display item 2: 100,0,10,19

We then invalidate both items and up with a region that contains:

Rect 1: 0,0,110,19
Rect 2: 0,19,100,1

We then have to paint item 1 twice, and if the overhead of doing so is considerable, then we lost.

Jeff and I discussed 3 possible options for improving this:

1)

Come up with a new version of SimplifyOutward for nsRegion that optimizes for some other metric of 'simplicity' rather than number of rects.

Ideally this would combine our two rects into a single one, but two small rects at opposite ends of the page would remain separate. I haven't really thought much about how to define this exactly.

2)

Rather than trying to reconstruct a smart representation during painting, have information similar to 'this display item is entirely invalidated and should be painted in one go' passed down from DLBI. I really have no idea what this representation would look like.

There's also some quite (edge?) cases where this approach would come up with awful results.

If we had two display items, one very tall and narrow, the other very wide and short and they overlap each other at the top-left corner. In this case the only way to respect the request that both items get painted at once is to expand the region to the entire bounds, potentially dragging in a bunch of new display items.

3)

Avoid clipping entirely whenever possible, and just let Direct2D overdraw. If we cull display items that don't intersect the invalid region, then the extra drawing shouldn't matter much since it's on the GPU.

However display items that aren't in the invalid region may occlude parts of display items that are. We'd need to make sure we include these items in the list to be painted. Computing this, along with a potentially large number of extra items being included could make this approach worse in some cases.


Does anyone have any thoughts on this?
How about for D2D we just fluff out a complex region to its bounding box if the area of the bounding box is not much more (say 2x?) than the area of the region?

If you want to be fancier we could count the number of display items that intersect the bounding box vs just the region and fluff out to the bounding box if there aren't too many of them. You could even use any cost metric for display items that you like.
What do you think about this?

Seems to work and not regress performance (this is with single rect painting enabled) - https://tbpl.mozilla.org/?tree=Try&rev=70dd9b3af195 (winxp failures are unrelated, and fixed locally)

We could go slightly further with this approach and try weight display items more if we think they have a high overhead. One simple options would be to add 2 for items like nsDisplayBoxShadowOuter, anything that paints via an inactive layer etc.

I don't know if we have good enough talos tests to really tune this though.
Attachment #8333718 - Flags: feedback?(jmuizelaar)
If we're going to simplify the draw region out, then we need to also clear the pixels for the entire draw region.

This code takes all the surface clears and moves them into DrawThebesLayer, so that we can do it after we decide what to actually draw.

Bas: For this to work, we need a way to initialize an area of a DrawTargetDual with black/white. I've hacked up one possibility in this patch, what do you think? I'm totally open to other suggestions.
Attachment #8333720 - Flags: feedback?(bas)
So I had a look at how IE solves this problem. Here are some things that I've discovered:
- IE never seems to PushLayer for clipping
- IE will double draw items that intersect multiple rects in the invalid region
- Multiple rectangles are only used if there is a reasonable large difference from the rectangles and the bounds.
So with these patches, we barely regress TART on win7, but regress win8 TART quite considerably.

https://tbpl.mozilla.org/?tree=Try&rev=ab6bdef8e52f

Interestingly, the first 8 cycles of TART seem to be getting scores very similar to that of trunk, and then we start regressing considerably.

For example:

Cycle 1  - newtab-open-preload-yes.half.TART: 2.3
Cycle 11 - newtab-open-preload-yes.half.TART: 38.3

Unfortunately, we were only profiling for a single cycle, so this didn't show anything.

I tried upping the cycles to 10, and getting profiles for that.

https://tbpl.mozilla.org/?tree=Try&rev=d4fe9977da5f

This doesn't seem quite as bad as the earlier runs, but some of the results are dropping off badly.

Cycle 1  - icon-open-DPI2.half.TART: 3.3
Cycle 10 - icon-open-DPI2.half.TART: 8.6

Profile for this run with markers left in: http://people.mozilla.org/~bgirard/cleopatra/#report=a259d5c3bf34cb38c6618f5a2e2832346910cd0d

The range [12765, 12897] covers this test for cycle 10. (Only cycles 9/10 are in the profile, the rest got dropped out of the circular buffer).

We can see in this range that we spend 30% of our profile time in NtFlushBuffersFile, more time than we spent painting.

It looks like this is causing some frames to take huge amounts longer than the others.

But we also appear to just get slower for almost every frame in later cycles, I haven't seen any explanation for that yet.
Oh, I just realised that this might be because of the profiler running, and completely unrelated to the slowdown.
When using TART, besides the actual animation, most of the tests also open or close an actual browser tab. This has some overhead which affects the results (typically the first few frames). While the animation is hopefully stabilized at the second half of the animation (the .half values), it's not always the case, and also, there could be some GC/CC residues due to the tab close/open, which may also affect the results.

When testing graphics efficiency of tab animations, the best tests are the fade tests. These tests only measures css fade in/out (the actual css animation) of an existing newly opened tab. The fade tests are typically very uniform in interval durations through the entire Animation, including the first frames, and also typically between repetitions of the same animation (the tests which open/close tabs have higher variance than the fade tests).

So I'd suggest to stick to the fade tests results for better consistency and representation of gfx perf.
Good run: http://people.mozilla.org/~bgirard/cleopatra/#report=68053466bac32fa50d2fee04fabc8484a580fe5b [9799, 9915]

Bad run: http://people.mozilla.org/~bgirard/cleopatra/#report=8238fab39cb01fb8277b1d3bd3afe646b95bf452 [14159, 14278]

All the extra time is being spent inside compositing with LayerManagerD3D10, inside CDXGISwapChain::Present.
I tried using IDXGIDevice1::SetMaximumFrameLatency(1) in case we were producing marginally faster than the GPU could composite, and it was taking a while before it caught up.

No change in results.

I'm kind of out of ideas here. Maybe d3d11+OMTC won't be affected by this.
Keywords: perf
Priority: -- → P3
Depends on: 943048
Same issue with d3d11+OMTC. Trying to get a win8 talos machine loaned to me to investigate this further.
Blocks: 939826
I was able to get an apitrace of this happening. Replaying it on the talos machines produces present times shown in the attached graph. The graph suggests that the problem is being reproduced in the trace. I should be able to bisect the trace to figure out what's going wrong.
Attachment #8349824 - Attachment is patch: false
Attachment #8349824 - Attachment mime type: text/plain → image/png
I also just noticed that during Present() the nvidia driver is calling Sleep(). This is not a good thing for it to be doing.
Assignee: nobody → jmuizelaar
Nvidia's most recent driver (334.89) fixes this problem.
Depends on: 974684
Did we have any involvement, or was this fixed purely by luck?
(In reply to Matt Woodrow (:mattwoodrow) from comment #14)
> Did we have any involvement, or was this fixed purely by luck?

I believe we had some involvement. We reached out to Nvidia, they confirmed there was a bug and said that it would be fixed in the driver released on the 18th. It's possible that it was coincidence that it was fixed in their most recent driver release, I don't know either way.
I rebased the patches from here now that the drivers have been updated.

Try run with these patches: https://tbpl.mozilla.org/?tree=Try&rev=bee492a8dae3
Compare talos: http://compare-talos.mattn.ca/?oldRevs=95613a41c9de,a2d71850d783,43ba36bc10f2,727238c13756,db8043427ac2&newRev=bee492a8dae3&server=graphs.mozilla.org&submit=true


I also tried using SimplifyRegionByArea(100*100) in ThebesLayerD3D10.cpp (from bug 945079) instead of these patches.
Try run: https://tbpl.mozilla.org/?tree=Try&rev=911ff5c77e99
Compare Talos: http://compare-talos.mattn.ca/?oldRevs=95613a41c9de,a2d71850d783,43ba36bc10f2,727238c13756,db8043427ac2&newRev=911ff5c77e99&server=graphs.mozilla.org&submit=true

The latter avoids a win7 TART regression, and also improves tsvgx considerably. It doesn't improve win8 TART quite as much though.


I think we should try get bug 945079 landed and just go with that, since it's simpler and faster most of the time.

What do you think Jeff?
Flags: needinfo?(jmuizelaar)
Depends on: 945079
Attachment #8333718 - Attachment is obsolete: true
Attachment #8333718 - Flags: feedback?(jmuizelaar)
Attachment #8333720 - Attachment is obsolete: true
Attachment #8333720 - Flags: feedback?(bas)
Attached patch Enable single rects (obsolete) — Splinter Review
Attachment #8419041 - Flags: review?(bas)
Flags: needinfo?(jmuizelaar)
Comment on attachment 8419041 [details] [diff] [review]
Enable single rects

Review of attachment 8419041 [details] [diff] [review]:
-----------------------------------------------------------------

Very soon we'll be switching to OMTC with D3D11, we should take care to fix it there too.

::: gfx/layers/d3d10/ThebesLayerD3D10.cpp
@@ +411,5 @@
>    } else {
>      destinationSurface = mD2DSurface;
>    }
>  
> +  aRegion.SimplifyOutwardByArea(100*100);

nit: spaces around operator

::: layout/base/FrameLayerBuilder.cpp
@@ +3698,5 @@
>  {
>    static bool sPaintRectsSeparately;
>    static bool sPaintRectsSeparatelyPrefCached = false;
>    if (!sPaintRectsSeparatelyPrefCached) {
> +    mozilla::Preferences::AddBoolVarCache(&sPaintRectsSeparately, "layout.paint_rects_separately", true);

Please just add this to the gfxPrefs code.
Attachment #8419041 - Flags: review?(bas) → review-
Assignee: jmuizelaar → matt.woodrow
Attachment #8419041 - Attachment is obsolete: true
Attachment #8419065 - Flags: review?(bas)
Attachment #8419065 - Flags: review?(bas) → review+
https://hg.mozilla.org/mozilla-central/rev/d1155f994a3a
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
(In reply to Matt Woodrow (:mattwoodrow) from comment #19)
> Created attachment 8419065 [details] [diff] [review]
> Enable single rects v2

We should make sure we do the region simplification for OMTC as well.
Depends on: 1013536
Depends on: 1013543
Depends on: 1013759
(In reply to Jeff Muizelaar [:jrmuizel] from comment #22)
> (In reply to Matt Woodrow (:mattwoodrow) from comment #19)
> > Created attachment 8419065 [details] [diff] [review]
> > Enable single rects v2
> 
> We should make sure we do the region simplification for OMTC as well.

Filed bug 1013759 for this.
Comment on attachment 8419065 [details] [diff] [review]
Enable single rects v2

[Approval Request Comment]
Bug caused by (feature/regressing bug #): IE 10 update
User impact if declined: Bug 812695 remains
Testing completed (on m-c, etc.): Has been on m-c for quite a while
Risk to taking this patch (and alternatives if risky): Some risk but it's easy to back out.
Attachment #8419065 - Flags: approval-mozilla-aurora?
Attachment #8419065 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Matt, is this something that needs further testing from QA, or does it have automated tests? I'm just looking at this because it came up as fixed and I'm not sure how to test it. Thanks!
Flags: needinfo?(matt.woodrow)
(In reply to Liz Henry :lizzard from comment #26)
> Matt, is this something that needs further testing from QA, or does it have
> automated tests? I'm just looking at this because it came up as fixed and
> I'm not sure how to test it. Thanks!

I don't think it needs further testing. It's a very broad change to how we paint, so bugs in it will show up anywhere/everywhere. I can't think of any specific ways to test the changes beyond all the testing we do already.
Flags: needinfo?(matt.woodrow)
Whiteboard: [qa-]
You need to log in before you can comment on or make changes to this bug.