Closed Bug 629866 Opened 9 years ago Closed 9 years ago

Investigate regressions in wmode="opaque" GUIMark3 testcase

Categories

(Core :: Plug-ins, defect)

x86
Windows 7
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla2.0b11
Tracking Status
blocking2.0 --- final+

People

(Reporter: roc, Assigned: roc)

References

(Blocks 1 open bug)

Details

(Keywords: regression, Whiteboard: [softblocker])

Attachments

(11 files)

2.99 KB, patch
jrmuizel
: review+
Details | Diff | Splinter Review
13.02 KB, patch
cjones
: review+
Details | Diff | Splinter Review
4.80 KB, patch
cjones
: review+
Details | Diff | Splinter Review
126.63 KB, image/png
Details
2.03 KB, patch
jrmuizel
: review+
Details | Diff | Splinter Review
2.87 KB, patch
jrmuizel
: review+
Details | Diff | Splinter Review
829 bytes, patch
Details | Diff | Splinter Review
12.22 KB, patch
Details | Diff | Splinter Review
7.15 KB, patch
Details | Diff | Splinter Review
1.39 KB, patch
roc
: review+
Details | Diff | Splinter Review
705 bytes, patch
jrmuizel
: review+
Details | Diff | Splinter Review
In the benchmark in bug 626602, we see that FF4 trunk rendering of wmode="opaque" Flash has slowed down a bit vs 3.6, due to some combination of layers and async rendering. This case should be relatively straightforward to optimize since there are no plugin alpha values to worry about, and the optimizations are likely to help us with wmode="transparent" (much more common) once 626602 is fixed.

Running xperf with BasicLayers on Windows 7, I get around 20% of the samples in firefox.exe and the rest in plugin-container.exe. plugin-container is hard to analyze --- almost entirely in Flash code, probably not much we can do to speed it up as long as we only ask it to paint once per frame.

In firefox.exe though, I see some fixable issues. All of the time is spent doing empty transactions, just recompositing the layer tree. That's good. The relevant part of the layer tree consists of the plugin ImageLayer over a ThebesLayer with the page background. Because we have overlapping layers, we trigger manual backbuffering. So for each frame:
1) We composite the ThebesLayer to the backbuffer
2) We composite the ImageLayer to the backbuffer
3) We blit the backbuffer to the window's HDC

The main issue here is that since the ImageLayer completely fills the dirty rect, we actually don't have to draw the ThebesLayer at all. An occlusion culling analysis could note that we don't need to paint the ThebesLayer here, and then MayHaveOverlappingOrTransparentLayers could return false so we don't bother with a manual backbuffer. So, we'd just be copying the plugin buffer to the window DC for each paint --- optimal.

The thing is, we actually have occlusion culling logic for BasicLayers: MarkLeafLayersCoveredByOpaque. It's not working here only because it doesn't take the dirty region into account, and if you look at the whole window, the plugin layer does not cover the ThebesLayer. So this should be relatively easy to fix.

A secondary issue is that step 1 takes twice as long as step 2 for some reason. For some reason _cairo_win32_surface_composite calls _cairo_image_surface_composite which does two heavyweight operations: _cairo_pattern_acquire_surfaces and pixman_image_composite. The former makes a copy of the source win32_surface. This seems very wrong, we're just compositing a subrectangle of one win32 surface into another, it should be doable with ::BitBlt! I think we should fix this even though the other optimization will bypass this code.
For the second issue, when we composite the ThebesLayer we're drawing an RGB24 win32 surface into an ARGB32 win32 surface (we've made the backbuffer transparent because the window is transparent). This fails all the Win32 fast paths (expected) and falls through to the code

	return dst->image->backend->composite (op, pattern, mask_pattern, ...);

But 'pattern' still contains our source win32 surface, which the image backend doesn't know has image bits. The image backend calls surface->backend->clone_similar to try to make a image-like surface out of the source source, but win32 surfaces don't implement clone_similar. So then we try acquire_source_image, which makes a copy since our RGB24 win32 surface's 'image' is null (!).
(In reply to comment #1)
> (we've made the backbuffer transparent because the window is transparent).

Which window?  I have a local fix to avoid creating unnecessary ARGB24 plugin surfaces when we have a background.  Is that what we're hitting here?  That fix isn't in my mq yet.
Sorry, disregard that.  Need sleep :).
Simple enough fix.

This would be affecting BasicLayers on Windows when glass is enabled *a lot*. Too bad we didn't catch this earlier. Good thing we caught it now!
Attachment #508104 - Flags: review?(jmuizelaar)
OK, I've got the occlusion culling working, and it does help.

Avoiding double-buffering requires something like https://bugzilla.mozilla.org/show_bug.cgi?id=588402#c33.
Attachment #508104 - Flags: review?(jmuizelaar) → review+
Keywords: checkin-needed
Whiteboard: [needs landing]
http://hg.mozilla.org/mozilla-central/rev/c7290e6e795b

I just realized that I landed this without it being approved/blocking.  Shame on me.  It would be great for someone to mark it as either blocking/approved to soothe my grief...
Blocks: 588402
Status: NEW → RESOLVED
Closed: 9 years ago
No longer depends on: 588402
Keywords: checkin-needed
Resolution: --- → FIXED
Whiteboard: [needs landing]
Target Milestone: --- → mozilla2.0b11
It's blocking, but I'm also planning to attach another couple of patches here.
Status: RESOLVED → REOPENED
blocking2.0: --- → final+
Resolution: FIXED → ---
Hmm, the benchmark gets a lot faster if you wiggle the mouse over the page. We should investigate that.
This may help bug 612113 as well.
Whiteboard: [softblocker]
With these patches, the benchmark spends almost all the browser-process time simply doing fills to copy the plugin Image surface to the window's HDC. No double-buffering, no ThebesLayer compositing.

However, the copy to the window's HDC is still not as fast as it should be. The reason is that the window's HDC is not (known to be) a DIB. cairo decides that the best way to composite our source image surface to the window is to grab the destination pixels with cairo_surface_acquire_dest_image (which ends up doing a BitBlt into a temporary DIB), composite over the top (which is just a memcpy I guess), and write the results back with a BitBlt from the temporary DIB to the window.

It seems like there should be a way to just take those bits and composite them to the window DC.
So I'm thinking of adding to gfxImageSurface:
  /**
   * Pixels outside this rect have alpha values all 0xFF.
   * Initially this is the entire surface.
   */
  nsIntRect mInvalidAlphaValues;

  /**
   * Returns an image surface with the same contents as this surface,
   * but ARGB32.
   * Returns "this" if this surface is ARGB32. Otherwise
   * sets the alpha values in the data to 0xFF within
   * mInvalidAlphaValues, clears mInvalidAlphaValus, and
   * returns a temporary ARGB32 gfxImageSurface wrapping the
   * same data.
   */
  already_AddRefed<gfxImageSurface> GetSurfaceWithAlpha();

  /**
   * Call this to indicate that alpha values may have been changed/
   * destroyed. This needs to be called, passing in the bounding-rect for
   * what's changed in the surface, between calls to GetSurfaceWithAlpha().
   */
  void InvalidateAlphaValues(const nsIntRect& aRect)
  { mInvalidAlphaValues.UnionRect(mInvalidAlphaValues, aRect); }

How does that sound?
Sounds good.
The first three patches in this bug get us up to about 52fps, about 2 more than on trunk. Not a huge improvement.
Comment on attachment 509310 [details] [diff] [review]
Part 2: fix occlusion culling in BasicLayers

Shadow layers would benefit from this too, but I can add that.

>+MarkLeafLayersHidden(Layer* aLayer, const nsIntRect& aClipRect,
>+                     const nsIntRect& aDirtyRect,

I didn't understand what this param was for initially and it wasn't
commented.  A note above would be useful.

>+                     nsIntRegion& aRegion)

While you're at it, mind renaming this to |aOpaqueRegion|?

Looks good.
Attachment #509310 - Flags: review?(jones.chris.g) → review+
Comment on attachment 509312 [details] [diff] [review]
Part 3: make double-buffering-avoidance logic a bit smarter

>+    // If we have a transparent leaf layer in the dirty rect, that doesn't
>+    // overlap another leaf layer, it must be in a transparent part of the
>+    // window.

But a transparent leaf layer could also overlap another leaf layer,
but still partially cover a transparent part of the window that we'd
need to clear, right?  The implementation of
MayHaveOverlappingOrTransparentLayers gets this right, but this
comment throws me a bit.

r=me with a comment fix or pointing out what I'm missing
Attachment #509312 - Flags: review?(jones.chris.g) → review+
(In reply to comment #17)
> Comment on attachment 509312 [details] [diff] [review]
> Part 3: make double-buffering-avoidance logic a bit smarter
> 
> >+    // If we have a transparent leaf layer in the dirty rect, that doesn't
> >+    // overlap another leaf layer, it must be in a transparent part of the
> >+    // window.
> 
> But a transparent leaf layer could also overlap another leaf layer,
> but still partially cover a transparent part of the window that we'd
> need to clear, right?

If it overlaps a leaf layer, then we need to double-buffer because of the previous (unquoted) sentence. I'm not sure what to make clearer here.
What I've realized is that although I can reduce firefox.exe CPU usage quite a lot with my patches, it doesn't improve the benchmark framerate on my machine because plugin-container.exe's main thread is nearly maxing one CPU core on its own. I guess it's still worth shrinking firefox.exe CPU usage though, so I'll get some solid numbers on CPU usage reduction and post my patches.
Right now, with my patches I have Firefox using 2.15% and plugin-container.exe using 16.24% of the my 8 virtual CPU cores (4 cores, 2 hyperthreads per core). I guess Flash is doing some parallel processing to use more than one core (12.5%).

In plugin-container.exe, we're almost entirely in Flash as far as I can tell. The stacks are a bit messed up presumably because Flash does not preserve frame pointers. Certainly the time spent in stacks identifiably in our code is negligible.

In firefox.exe, 1.08% is entirely StretchDIBits drawing the plugin buffer to the HDC for the window. 0.52% is in SetAlphaValuesTo255 (a simple C implementation). The rest is miscellaneous overhead like firing MozAfterPaint and IPC processing. I'll try an SSE2 implementation of SetAlphaValuesTo255 to try to get that down a bit, and that's about as far as I think I can or should go.

I should also try to profile Chrome to see how it's doing.
Currently I'm getting 53fps.

I think there might be something funny going on with event scheduling, because I'm still seeing if that if I move the mouse over the window, the framerate increases.
Is that 53fps reported by GUIMark or by your mozPaintCount instrumentation?
Actually, I can't see that effect now.

This is reported by Flash. I've taken out the mozPaintCount instrumentation to minimize perturbation. Previously it was showing numbers consistent with the Flash numbers --- we're able to get all the Flash frames onto the screen. That's not surprising given such low CPU usage in the browser process. We're totally gated on plugin-container.exe on this laptop.

In FF3.6 Flash reports ~60FPS (the browser isn't very responsive though) so you'd think we could get the same here, since the Flash code hasn't changed. I'm seeing the plugin main thread at 11.21%, i.e. not quite a full core (12.5%). I'll see if I can use xperf to figure out what might be happening.
It looks like the main Flash thread pauses while it synchronizes with other Flash threads. It pauses, several other Flash threads suddenly jump to high CPU utilization, then the main Flash thread resumes *and* the browser main thread starts running again for a little while in parallel. It looks like maybe Flash does some kind of parallel render, then the browser gets the resulting frame. It doesn't look like plugin-container.exe blocks on the Firefox main thread, although there is a very short dip in Flash main thread CPU usage just after the Firefox thread goes back to sleep.

I need to reprofile with FF3.6 and Chrome to see what the Flash profile looks like there.
plugin-container blocks on firefox-bin during PPluginInstance:Show, which is synchronous.  The overhead of the IPC alone should be O(10us) though, unless firefox-bin happens to be busy when the message is delivered.  Looks like we do a fair amount of stuff from PluginInstanceParent::RecvShow().  Will the chain of InvalidateLayer()s end up painting in a later task?  I forget where that eventually ends up.  We definitely don't want to be repainting while plugins are blocked on us.  Hmm, I can check this.
There are several other ways firefox-bin could be blocking on plugin-container and vice versa, but you wouldn't be hitting them in this benchmark I don't believe if you're not moving the mouse over the plugin frame.
Yeah, invalidate looks good.
Attached image screenshot of CPU usage
This shows what's going on between frames on my machine. In the top graph, the small humps are firefox.exe and the rest is plugin-container.exe. In the bottom graph, the red line is the main thread of plugin-container.exe, and the purplish line is the main thread of firefox.exe. The other lines are other plugin-container threads.

At 3.16 seconds plugin-container goes to sleep completely while firefox.exe is busy painting, but that's an anomaly, it doesn't happen elsewhere in the trace.
I wonder if QUIRK_FLASH_THROTTLE_WMUSER_EVENTS could be relevant here...
Disappointingly, I can't really see any significant improvement in firefox.exe CPU usage with patches 2-8. I'm not sure why, maybe StretchDIBits is slow. Anyway, I need to work on other things for now, so putting the patches here for posterity.
I believe I know why the D3D9 case is slow. We do nothing to avoid an intermediate copy when we can get the image bits directly. I'm working on verifying and testing this suspicion.
I can't really reproduce slow D3D9, but this patch should save us a copy when uploading, roc can maybe see if this helps him!
Comment on attachment 511270 [details] [diff] [review]
Improve ImageLayerD3D9 uploading code

We should take this --- it's clearly the right thing to do.

I still see D3D9 being 5-10fps slower than BasicLayers though.
Attachment #511270 - Flags: review+
Attachment #511270 - Flags: approval2.0+
Bas has a patch that fixes D3D9 that he'll attach here :-)
It turns out roc had a good idea about why this is slower on D3D9 layers. On D3D9 layers we copy the data into a texture during the SetData call. This call blocks the plugin process. We do an early copy of the data so that the GPU can execute it when it isn't busy (i.e. outside of the render loop).

However currently our main usage for image layers is for plugins and in that case this is having a negative performance impact, as well as potentially uploading frames that are never drawn.

This is the lowest-risk fix to the problem which will just cause us to upload during the GetOrCreateTexture call when rendering. I've not been able to find any negative performance impact of this change.
Attachment #511727 - Flags: review?(jmuizelaar)
Attachment #511727 - Flags: review?(jmuizelaar) → review+
http://hg.mozilla.org/mozilla-central/rev/30fe9aba82ee
http://hg.mozilla.org/mozilla-central/rev/db495f146676

I believe these are enough to completely resolve this issue, please re-open if I'm wrong.
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
Did parts 1-8 ever land? In comment #4 roc suggested part 1 would be important for performance with Aero Glass, is that still true?
Part 1 landed, the rest didn't. I think the patches are good, I just need to find a testcase where they provide a significant win.
Roc pushed part 2 in cedar but I did backout it because of an orange on Windows Mochitest-4:
http://tinderbox.mozilla.org/showlog.cgi?log=Cedar/1305623848.1305624333.18347.gz&fulltext=1

There are two candidates, this bug and bug 647560
I backed it out as well :-) but that was not the cause of the orange, the orange persisted after backout (and the patch passed those tests on try multiple times).
The orange is still there so this patch should be safe to be pushed again in cedar later.
Comment on attachment 510915 [details] [diff] [review]
Part 4: Allow StretchDIBits to be used to draw ARGB32 surfaces with OPERATOR_SOURCE

I think we're going to want this due to bug 647560 making us do more ARGB32->ARGB32 blits.
Attachment #510915 - Flags: review?(jmuizelaar)
Comment on attachment 510917 [details] [diff] [review]
Part 5: Allow Win32 "fast paths" to work with EXTEND_PAD

I don't have a testcase that benefits from this but I think it's clearly a good thing.
Attachment #510917 - Flags: review?(jmuizelaar)
Do these patches make a noticeable difference? I would be surprised if there are situations on modern hardware where using GDI is faster.
You mean for patch 4? Without this patch, an ARGB->ARGB32 composite with a non-DIB destination requires us to do some temporary surface stuff. Avoiding that seems like a win, no?
(In reply to comment #53)
> You mean for patch 4? Without this patch, an ARGB->ARGB32 composite with a
> non-DIB destination requires us to do some temporary surface stuff. Avoiding
> that seems like a win, no?

But do we run into the non-DIB destination in practice?
We do with bug 647560, yes. If a repaint occurs in the chrome UI ThebesLayer, we draw to its ARGB ThebesLayer surface (which has a DIB), and then we do an OPERATOR_SOURCE blit of that layer directly to the ARGB non-DIB HDC that BeginPaint gave us.
Comment on attachment 510915 [details] [diff] [review]
Part 4: Allow StretchDIBits to be used to draw ARGB32 surfaces with OPERATOR_SOURCE

Provided this preserves alpha it looks fine to me.
Attachment #510915 - Flags: review?(jmuizelaar) → review+
Attachment #510917 - Flags: review?(jmuizelaar) → review+
You need to log in before you can comment on or make changes to this bug.