Closed Bug 1135411 Opened 9 years ago Closed 3 years ago

MSIE chalkboard benchmark chokes nightly

Categories

(Core :: Graphics: ImageLib, defect, P3)

37 Branch
x86_64
Windows 7
defect

Tracking

()

RESOLVED FIXED
Tracking Status
firefox37 - ---
firefox38 - ---
firefox39 - ---

People

(Reporter: mark, Unassigned)

References

()

Details

(Keywords: perf, regression, Whiteboard: gfx-noted)

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0
Build ID: 20150221030208

Steps to reproduce:

1. Go to http://ie.microsoft.com/testdrive/Performance/Chalkboard/ in Nightly
2. Try to run the benchmark


Actual results:

The benchmark tab chokes up the entire system (very high cpu usage), and it looks like the test is going to take 15 minutes or so.
Closing the tab immediately restores system responsiveness to normal.


Expected results:

The test should run to its finish without causing system responsiveness issues.
Troubleshooting info, graphics section:
Graphics
Adapter Description	AMD Radeon HD 6800 Series
Adapter Drivers	aticfx64 aticfx64 aticfx64 aticfx32 aticfx32 aticfx32 atiumd64 atidxx64 atidxx64 atiumdag atidxx32 atidxx32 atiumdva atiumd6a atitmm64
Adapter RAM	1024
ClearType Parameters	Gamma: 2200 Pixel Structure: R ClearType Level: 0 Enhanced Contrast: 200
Device ID	0x6738
Direct2D Enabled	true
DirectWrite Enabled	true (6.2.9200.16492)
Driver Date	11-20-2014
Driver Version	14.501.1003.0
GPU #2 Active	false
GPU Accelerated Windows	1/1 Direct3D 11 (OMTC)
Subsys ID	174b174b
Vendor ID	0x1002
WebGL Renderer	Google Inc. -- ANGLE (AMD Radeon HD 6800 Series Direct3D11 vs_5_0 ps_5_0)
windowLayerManagerRemote	true
AzureCanvasBackend	direct2d 1.1
AzureContentBackend	direct2d 1.1
AzureFallbackCanvasBackend	cairo
AzureSkiaAccelerated	0
Duplication of Bug 786064 ?
(In reply to Alice0775 White from comment #2)
> Duplication of Bug 786064 ?

No, I think something else is going on here. This isn't just bad performance, it completely kills system responsiveness. Of note: FF35.0.1 does not display this behavior - it's still very slow (which would be bug 786064) but it's not crippling and completes in about 60-70 seconds without issues. Nightly however takes a very long time for each frame update with very high CPU usage.

An additional note is that after the test run on nightly, I also got my first BSOD in quite a few years - whatever had happened with the demo in Nightly, it apparently caused my video driver to become unstable and it caused a BCCode 0x3b (SYSTEM_SERVICE_EXCEPTION). My system is in perfect health otherwise. My guess is there is something quite wrong with the d2d1.1/DX11 code.
[Tracking Requested - why for this release]:

I found two major regressions after bug 786064.

#1 regression
Firefox 27: 68.65 sec
Firefox 28: 141.42 sec

#2 regression
Firefox 36: 178.27 sec
Firefox 37: 340.77 sec

I will file a new bug about #1 regression.


Here, #2 regression should be this bug
Pushlog:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=4553524f671f&tochange=7ee7e774e19f
via local build
Last good: 6bacacfaf339
First bad: 1cbca597b025
Regressed by: 1cbca597b025	Seth Fowler — Bug 1060869 (Part 4) - Update SurfaceCache prefs to increase the cache size. r=dholbert,tn
Blocks: 1060869
Status: UNCONFIRMED → NEW
Component: Untriaged → ImageLib
Ever confirmed: true
Flags: needinfo?(seth)
Keywords: perf, regression
Product: Firefox → Core
Version: Trunk → 37 Branch
See Also: → 1135548
What's happening here is that we're trying to cache the blackboard image, which is an SVG, but the benchmark is rapidly zooming in and out and so we keep trying to cache it at different scales. Caching is expensive because it makes us allocate a bunch of memory for textures (which gets freed rapidly, but since we're constantly allocating more that's a small comfort) and because drawing with caching is more expensive than drawing without caching (we need to draw at least twice at the same scale to benefit from the caching).

This is a known performance pitfall that we so far haven't gotten around to fixing, but we should. A first pass at a fix would be to remember recent scales we've drawn at and never cache unless we draw at the same scale twice.
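
The "draw twice before caching" idea above can be sketched roughly as follows. This is only an illustration of the heuristic, not ImageLib code: the class name `ScaleGatedCache`, the `rasterize` callable and the `recent_limit` parameter are all hypothetical, and eviction of already-cached surfaces is left out.

```python
from collections import OrderedDict


class ScaleGatedCache:
    """Cache a rasterized surface only once the same scale is drawn twice."""

    def __init__(self, rasterize, recent_limit=8):
        self._rasterize = rasterize      # hypothetical callable: scale -> surface
        self._recent = OrderedDict()     # scales drawn once so far, oldest first
        self._recent_limit = recent_limit
        self._cached = {}                # scale -> cached surface

    def draw(self, scale):
        if scale in self._cached:
            return self._cached[scale]   # cache hit: no rasterization needed
        surface = self._rasterize(scale)
        if scale in self._recent:
            # Second draw at this scale: caching is now likely to pay off.
            del self._recent[scale]
            self._cached[scale] = surface
        else:
            # First draw at this scale: remember the scale, but do not cache.
            self._recent[scale] = True
            if len(self._recent) > self._recent_limit:
                self._recent.popitem(last=False)  # forget the oldest scale
        return surface
```

With a benchmark that zooms continuously, every frame uses a new scale, so nothing is ever cached and no texture memory is held between frames; an image drawn repeatedly at one scale is cached from its second draw onward.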
No longer blocks: 1060869
I made this not block bug 1060869 because it has no relationship to bug 1060869 at all, other than bug 1060869 changed the parameters of the formula that determines the SurfaceCache's size, which means that some systems which did not cache the blackboard image before do now. However, the bug definitely existed for far longer than that; whether a user would hit it is just a matter of the size of their SurfaceCache, which is a function of their main memory size.
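
To illustrate why a change of parameters alone can expose the problem, here is a minimal sketch that assumes the cache budget is a fraction of main memory clamped to an upper bound. The comment above does not spell out the formula, so the function, its parameter names and every number below are hypothetical.

```python
def surface_cache_size_kb(main_memory_kb, size_factor, max_size_kb):
    """Assumed formula: a fraction of main memory, clamped to a maximum."""
    return min(main_memory_kb // size_factor, max_size_kb)


main_memory_kb = 8 * 1024 * 1024   # hypothetical machine with 8 GB of RAM
surface_kb = 200 * 1024            # hypothetical 200 MB rasterized surface

# Hypothetical "before" and "after" parameter sets.
before = surface_cache_size_kb(main_memory_kb, size_factor=64, max_size_kb=100 * 1024)
after = surface_cache_size_kb(main_memory_kb, size_factor=4, max_size_kb=1024 * 1024)

fits_before = surface_kb <= before   # too large for the smaller budget
fits_after = surface_kb <= after     # fits once the budget grows
```

Under these assumed numbers the same surface is skipped by the smaller budget and cached by the larger one, which matches the description that some systems only started caching the image after the parameters changed.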
Seth, the end of comment 6 suggests to me that this is a) known and b) not a huge priority, that we could look for a fix and it could either ride trains or request uplift if safe but it's not something that would block release or benefit from release management cycling on it.  However I see this is new to 37 which hasn't shipped yet so the main question then is how do we expect users to be impacted by this regression and is that a risk we can take or should this become a higher priority so we don't ship this perf hit?
(In reply to Lukas Blakk [:lsblakk] use ?needinfo from comment #8)
> Seth, the end of comment 6 suggests to me that this is a) known and b) not a
> huge priority, that we could look for a fix and it could either ride trains
> or request uplift if safe but it's not something that would block release or
> benefit from release management cycling on it.

Right now I think that this is something that isn't so important that it can't just ride the trains. We can reevaluate if we start getting reports of issues on real websites, as opposed to benchmarks.

> However I see this is new to
> 37 which hasn't shipped yet so the main question then is how do we expect
> users to be impacted by this regression and is that a risk we can take or
> should this become a higher priority so we don't ship this perf hit?

Well, it's new in the sense that we tweaked some prefs that made it easier to hit the bad case. The problem is, though, that we can't easily just switch those prefs back without introducing performance problems elsewhere. As usual with caches it's a tradeoff.

As I mentioned above, I think the best bet is to wait and see if we're hitting this on real sites. It'd be possible to put a hack in to mitigate this if the real fix turns out to not be upliftable, but it will harm some other sites, so it's not an easy call.
I agree with comment 8 and comment 9. This doesn't sound like a case that we're likely to hit with real Web content. I would certainly like to see us fix this but I don't think we need relman to track this bug.
Whiteboard: gfx-noted
Flags: needinfo?(seth)
Mark, does this still reproduce in the latest Nightly?
Anthony, this is absolutely still an issue in Nightly-52.

Tested with 
Name 	Firefox
Version 	52.0a1
Build ID 	20160927030200

on an archived version of the test at

https://web.archive.org/web/20130311151859/http://ie.microsoft.com/TESTdrive/Performance/Chalkboard/

Improved hardware on my end (better CPU, better GFX) than when reporting this bug, still struggling and choking and >90s to run through in a clean profile.
Flags: needinfo?(mark)
As an aside, it seems what Seth suggested should be fairly safe and easy to implement with a good win?

> A first pass at a fix would be to remember recent scales we've drawn at and never cache unless we draw at the same scale twice.
I'm definitely reproducing the poor performance (bug 786064) and while we've improved over the last couple years (~3x faster) we still lag severely behind the competition (~20x slower). However I am not reproducing the loss of system responsiveness while the benchmark runs (this bug). In fact, with e10s the poor performance doesn't even seem to affect other tabs in Firefox.

For reference here are my benchmark results on my system
> 633.33s on Firefox 36.0
> 283.41s on Nightly 52.0a1 20160927030200
>  14.70s on Chrome 53.0.2785.116
>  35.43s on Safari 10.0.1 (12602.2.11)

Can you please confirm that you're still seeing the loss of system responsiveness you originally described or is it just the time to complete the benchmark that you're seeing?
I'm definitely seeing quite the loss of system responsiveness, too, which is likely related to memory allocation. Although not allocated to the Firefox process, my free ram rapidly declines during this test and recovers during "quieter" parts of the test, as well as at the end of the test. We're talking about a few GB of ram. This causes memory pressure for my system which in turn kills responsiveness.

Process Explorer also shows GPU dedicated memory use of almost the entire 4GB of VRAM I have available during the test, dropping back to an expected ~80MB or so after it completes, which undoubtedly doesn't help.
I am also seeing a pretty severe loss of system responsiveness when running this demo. 

If I recall correctly, this demo performed a lot better (although it still performed pretty poorly) when I tested with the patch from bug 1078994, but the code that the patch enables has long since been defunct. Based on bug 1078994 comment 3, they have no plans on moving forwards with that code anytime soon.

OS: Windows 10
CPU: Core i7-930
RAM: 6 GB
GPU: GTX 660
Running latest Nightly (e10s enabled)
(In reply to Trevor Rowbotham from comment #16)
> If I recall correctly, this demo performed a lot better (although it still
> performed pretty poorly) when I tested with the patch from bug 1078994, but

I don't think trying to address this issue with tiling is a good idea. You won't really address the root cause (caching all these copies of the image), and would likely be opening up a whole different can of worms.

Maybe another approach is to not cache particularly large images, e.g. add a pref to restrict use of the surface cache to images below certain dimensions/bitsize?
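
A size limit like the one suggested above could look something like this minimal sketch. The function name, the `max_cached_bytes` threshold (standing in for the proposed pref) and the 4 bytes-per-pixel assumption are all hypothetical.

```python
def should_cache(width_px, height_px, max_cached_bytes, bytes_per_pixel=4):
    """Return True only if the rasterized surface is within the size limit."""
    return width_px * height_px * bytes_per_pixel <= max_cached_bytes


limit = 64 * 1024 * 1024                    # hypothetical 64 MiB limit
small = should_cache(1024, 1024, limit)     # 4 MiB surface: within the limit
large = should_cache(8192, 8192, limit)     # 256 MiB surface: over the limit
```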

The latest nightly seems to be performing quite well with this test case for me now. Webrender also seems to have a large, positive, impact on performance here. GPU memory allocation also seems to be significantly less when Webrender is enabled. I no longer see a noticeable loss in system responsiveness. Great work!

System 1
CPU: Intel Core i7-9700K
GPU: RTX 2080
Resolution: 2560x1440
OS: Windows 10 x64 1809

System 2
CPU: Intel Core i7-2500K
GPU: GTX 770
Resolution: 1280x720
OS: Windows 10 x64 1809

| System | Nightly 2019-04-24 (Webrender) | Nightly 2019-04-24 (D3D11)* | Chrome 74 | Edge** |
|---|---|---|---|---|
| System 1 | ~3.6 seconds | ~22 seconds | ~2.7 seconds | ~26.5 seconds |
| System 2 | ~7 seconds | ~20 seconds | ~9.7 seconds | ~51.2 seconds |

* Nightly with D3D11 still has a few hiccups when it comes to the smoothness of the animation, but performs well otherwise.
** Edge seems to fail to display the chalkboard properly as the image just blinks/flickers the entire time.

Depends on: fixed-by-webrender
Depends on: 1543584
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED