Last Comment Bug 752254 - Massive Performance Problem for users with B2A CLUT display profiles
: Massive Performance Problem for users with B2A CLUT display profiles
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: GFX: Color Management (show other bugs)
: 13 Branch
: x86_64 Windows 7
: -- normal (vote)
: mozilla15
Assigned To: Benoit Girard (:BenWa)
:
: Milan Sreckovic [:milan]
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2012-05-05 12:40 PDT by malch
Modified: 2012-05-23 16:12 PDT (History)
7 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
patch (992 bytes, patch)
2012-05-22 11:31 PDT, Benoit Girard (:BenWa)
jmuizelaar: review+
lukasblakk+bugs: approval‑mozilla‑aurora+
lukasblakk+bugs: approval‑mozilla‑beta-
Details | Diff | Splinter Review

Description malch 2012-05-05 12:40:12 PDT
User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20100101 Firefox/13.0
Build ID: 20120501201020

Steps to reproduce:

Displayed this image on a Q6600 system running Windows 7 64-bit:

http://www.malch.com/nikon/DSD_1314ProPhoto.jpg


Actual results:

Firefox consumed approx 25 seconds of CPU rendering this image. 



Expected results:

It should have completed the task in 2-4 seconds of CPU (and apparently does so on some CPU's).

The problem has been present since FF8 through 14.0a2 (Aurora).

Some have suggested that the problem arises only on CPUs that do NOT support SSE4.1 instruction sets. Apparently, the problem also arises on Linux systems.

However, it has also been reported that the Firefox 12 on Mepis 11 resolves the problem for reasons that are unknown but presumably related to the build parameters.
Comment 1 malch 2012-05-15 18:13:53 PDT
Additional test results:

The following tests were carried out on a Q6600 CPU running 64-bit Windows 7. We used Firefox 13.0 but we have seen very similar performance with all recent versions of Firefox.

Recorded times are in CPU seconds as reported by Windows task manager. These include the time to launch Firefox.

We tested two versions of the same image:

A progressive JPEG available here:

http://www.malch.com/nikon/DSD_1314ProPhoto.jpg

A non-progressive JPEG available here:

http://www.malch.com/nikon/DSD_1314ProPhotoNP.JPG


Test 1
======

Color management off (gfx.color_management.mode=0)

Images loaded from local hard drive

Non-progressive:       3 seconds

Progressive:           4 seconds


Test 2
======

Color management off (gfx.color_management.mode=0)

Images loaded from network

Non-progressive:       11 seconds

Progressive:           11 seconds


Test 3
======

Color management on (gfx.color_management.mode=2)

Images loaded from local hard drive

Non-progressive:       6 seconds

Progressive:           15 seconds


Test 4
======

Color management on (gfx.color_management.mode=2)

Images loaded from network

Non-progressive:       14 seconds

Progressive:           26 seconds


We did experiment with the "Use Hardware Acceleration when Available" option but it did not appear to make any significant difference. 

These experiments appear to suggest two different issues:

1. The CPU times consumed when loading the images over a network appear excessive compared with images loaded from a local hard drive. 7 to 8 seconds of CPU cycles represents an enormous delta.

2. There appears to be a performance issue when applying color management to progressive JPEGs.
Comment 2 malch 2012-05-16 16:33:28 PDT
More data that may explain the slowness rendering progressive JPEG's...

In nsJPEGDecoder::WriteInternal() we monitored the inner loops for progressive and  non-progressive images where calls are made to OutputScanlines(). We also looked at the values of mInfo.output_scanline.

In the non-progressive case, the image appears to be processed in a single pass, as expected. In the progressive case, the image appears to be processed in 7 passes!

This seems high. We expected to 3 (or maybe 4) passes. The additional passes do appear to account for the slowness with progressive images. The overhead is considerable, especially when full color management via qcms is enabled. Others more familiar with the code may need to determine whether the number of passes can be reduced.

Of course, these results do NOT explain:

* The apparent slowness of the image rendering on the Q6600 compared to that reported with other CPU's.

* The large amount of additional CPU consumed when the images are fetched via HTTP compared with time to process the same image fetched from the local file system.

We will continue our efforts to gather more data on these issues and post any significant findings here.

Overall, we think there is significant scope for improved image handling performance and we hope this information is useful.
Comment 3 malch 2012-05-16 22:50:49 PDT
We started making some comparative timings between a Q6600 and an E7500 system. That's a newer CPU of broadly equivalent power and with SSE4 support. When (and only when) color management was enabled, it rendered the test images significantly faster. It seemed as if the qcms color management software was running dramatically faster on the CPU with SSE4. That was surprising because, as far as we know, qcms does not use SSE4.

However, the Q6600 and E7500 environments were not identical. The Q6600 was running 64-bit Windows 7 whereas the E7500 was running 32-bit Windows 7. But then we realized that the Q6600 had a calibrated display whereas the E7500 did not. Hey presto, when we installed a calibration profile on the E7500 it slowed down dramatically. It was still a little faster but maybe around 20%. That's certainly something that could be attributable to the hardware.

Then, of course, we deinstalled the calibration profile from the Q6600. We found that it was necessary to delete the gfx.color_management.display_profile value and remove the profile from Windows control panel. But when we did that, the Q6600 system rendered the image dramatically faster.

We think this means Firefox is applying two transforms to the image data. One to address the ProPhotoRGB ICC embedded in the image, converting it to sRGB (we guess). And a second transform to apply the display profile.

This too is a big surprise. Our own software which uses lcms versus qcms applies a single transform. During setup, it creates a transform to map ProPhotoRGB directly to the display profile. And then it can do the whole job in a single pass. An image editor would normally need to apply both transforms because it needs to maintain a copy of the image in the editors native color space as well as pushing a copy with the display profile applied out to the monitor. Does a browser need to do that?

In any event, Firefox seems to be using 2 passes/transforms. We don't know if that's an oversight, a design decision, or a qcms limitation (although we think the latter is unlikely). 

We will investigate further and also record/post some updated timings for the various scenarios over the next day or so.

We apologize for length of these posts. However, we do think it's justified by the potential rewards which seem to us to be quite considerable.
Comment 4 Robert Kaiser 2012-05-17 05:30:20 PDT
Moving to color management as per recent comments.
Comment 5 malch 2012-05-18 10:37:16 PDT
Gentlemen:

After reviewing many other discussions in bugzilla and reflecting on the issues raised here and elsewhere, we feel:

1. A review of the whole JPEG life cycle (data source, decoding, col mgt, and rendering) would be in order. This is probably a long term issue/project.

2. In the short term, there are a couple of things that would bring FF image performance into the same general ballpark as the competing browsers.

First, drop support for the progressive rendering of progressive JPEG's. Firefox would still handle progressive JPEG's but it would not attempt to render them progressively. Chrome/Safari/IE do this and they render the complete image faster than Firefox can render the first pass. Furthermore, with col mgt and a display profile active, the Firefox progressive rendering stalls because there's not enough CPU cycles available for it to keep up. The current implementation just isn't working.

Progressive rendering with full color management is very CPU intensive. It simply may not be viable until more of these functions have migrated to the GPU (which seems likely in the long term). Meanwhile, emulate the other browsers and render progressive JPEG's non-progressively.

Second, look at combining the two qcms transformations. We think it should be possible to create a transform that maps the embedded ICC directly to the display profile. Thus col mgt can be executed with a single transform of the decompressed image versus the two passes used currently.

3. Absent any comment/feedback from others more actively involved in the code, there's probably not a lot more we can contribute. However, we'll will continue to monitor this thread and respond if/when appropriate.
Comment 6 Benoit Girard (:BenWa) 2012-05-18 13:10:17 PDT
Here's a performance profile:
http://varium.fantasytalesonline.com/cleopatra/?report=AMIfv96xW5AwHpbp3ugsP6aRb9alUwc-4zxWm_cE61rs3cqXwcy3MogixLaerYfku5ux7kyviGR5xZ4Ylf4EMbopDv9_kb_68zKs0Sa1NlkteovXKEgdBMpCYkZ7QRXHXIKZZ7eUevHKGUMDG4VlUdVJ4vAnMRNdEw

The image took less then a second to decode of which 20% of the time was spent in qcms_transform_data_rgb_out_lut_sse2.
Comment 7 malch 2012-05-18 14:13:41 PDT
What CPU and OS?

What image was being loaded?

Was it a progressive JPEG? Our tests suggest that qcms time will be 7X with a progressive image.

And do you have a display profile being applied? Again, we think the time will double if you have display profile active.

We would love to see a profile made using this image:

http://www.malch.com/nikon/DSD_1314ProPhoto.jpg

and, for comparison, its non-progressive equivalent:

http://www.malch.com/nikon/DSD_1314ProPhotoNP.JPG
Comment 8 Benoit Girard (:BenWa) 2012-05-18 16:04:58 PDT
I was interrupted while posting so I didn't add much details:

Computer: OSX 10.7, 2.3 Ghz Quad i7
Image: http://www.malch.com/nikon/DSD_1314ProPhoto.jpg

It would be great if you could send me a profile of your configuration. Progressive decode shouldn't make color management much slower but it could be if the SSE2 paths are being taken.
Comment 9 malch 2012-05-18 17:09:36 PDT
Machine is Q6600 CPU running Windows 7 64-bit with SP1 etc.

There is absolutely no question that progressive JPEG's burn hugely more cycles than their non-progressive equivalents, at least on some machines. We added some extra logging to nsJPEGDecoder::OutputScanlines right after:

      /* Request one scanline.  Returns 0 or 1 scanlines. */    
      if (jpeg_read_scanlines(&mInfo, &sampleRow, 1) != 1) {
        *suspend = true; /* suspend */
        break;
      }

      int malch = mInfo.output_scanline;
      PR_LOG(gJPEGDecoderAccountingLog, PR_LOG_DEBUG, ("malchprocessline = %d", malch));

It logged each scan line in the image and repeated the process seven times.

When processing the non-progressive image, it logged one entry per scanline as expected.

We also confirmed that enabling/disabling the display calibration profile on the system had a dramatic impact on the CPU consumed. Basically it appeared to double the qcms overhead. 

It takes us more than 20 seconds of CPU (and more than 20 seconds elapsed) to render the progressive image with a display profile active. If we remove the display profile it's reduced considerably. 

The Q6600 supports SSE2 and we believe that libjpeg-turbo and qcms are using the SSE2 paths. The Q6600 does not support SSE4. But it does look like your Mac is using those to help with the image -- bzero() and memmove().

We initially thought that this SSE2/SSE4 factor might be significant. But later, we tested a E7500 with SSE4 support and found it was almost as slow, but only when configured with a display calibration profile.

We are able to compile from source using MozBuild and Visual Studio 2008. Our VS2008 edition does not appear to support profiling. If you can point us at a document/tutorial that can help us generate a profile like the one you posted we'd be thrilled and happy to post the results. TIA for any help with that.
Comment 10 malch 2012-05-19 14:22:10 PDT
We have since found:

https://developer.mozilla.org/en/Performance/Profiling_with_the_Built-in_Profiler

http://blog.mozilla.org/vdjeric/2012/03/28/setting-up-snappy-symbolication-server-locally/ 

We have the Profiler and Symbolication Server almost working. However, we cannot get the Profiler to display the symbol names (just the hex addresses). It makes one empty request to the Symbolication Server and gets one empty response. 

The Symbolication Server appears to be working correctly. It checks out with a force hang on About:telemetry.
Comment 11 Benoit Girard (:BenWa) 2012-05-19 17:17:04 PDT
I just forwarded your question to someone that may be able to help. Profiling with a symbol server is still being develop so it may not work yet. We're waiting on a production server to be deployed which will make profiling on windows much simpler.
Comment 12 malch 2012-05-19 19:38:51 PDT
Thanks for the info. We decided to make a clean build starting from:

http://hg.mozilla.org/mozilla-central

Our .mozconfig was:

ac_add_options --enable-application=browser
ac_add_options --enable-optimize="-O2 -GAL -arch:SSE2"
ac_add_options --enable-debug-symbols
ac_add_options --enable-profiling
ac_add_options --enable-libjpeg-turbo
ac_add_options --enable-jemalloc 
ac_add_options --disable-activex
ac_add_options --disable-activex-scripting
ac_add_options --disable-debug
ac_add_options --disable-mailnews
ac_add_options --disable-composer
ac_add_options --disable-ldap
ac_add_options --disable-accessibility
ac_add_options --disable-windows-mobile-components

Once built, we made the symbols and fed them into a local Symbolication Server.

This time we got a lot closer.

We had gfx.color_management.mode=2 together with a display profile configured in Windows. And we tested on this progressive image:

http://www.malch.com/nikon/DSD_1314ProPhoto.jpg

The Profiler results are odd to say the least. We're not sure whether to believe them but since we're seeing 30 seconds to render the image versus your one second, something very weird is going on. Maybe the profile is revealing?

In any event, qcms is implicated (no surprise). However, it looks like the video drivers could play a role too. So, we probably should note that this system has a Nvidia 8800GT video card with the latest WHQL drivers.

Anyway, here it is:

http://varium.fantasytalesonline.com/cleopatra/?report=AMIfv95ee0DRwt3Z2dbRq5MBPwWE3wfBZ5_jhTsnTIwq20lpLwdHY5puCT3Dzyh3R88OfVbmmJdGxlC4SyfOVLtw1MWlRmPtuscOVT0_bLKtR5gSextfz7VoCjWw6D_4668O11b9wNAmH5ZmwixZfbXbmGTUo-l1cg
Comment 13 Benoit Girard (:BenWa) 2012-05-19 19:50:35 PDT
Thanks for the profile. That's very useful :D.

If you invert focus on the period that's decoding images and invert the callstack you'll see that you're spending at least 70% of your time in qcms_transform_data. 50% of which is spend within qcms_transform_data_tetra_clut.

Do you have 'gfx.color_management.enablev4' set? I believe we should only be hitting this function if people are enabling the experimental ICC v4 support.
Comment 14 malch 2012-05-19 22:19:05 PDT
Yes, gfx.color_management.enablev4 was set. It does slow things down a tad but not by much. Here's an updated profile with enablev4 turned off:

http://varium.fantasytalesonline.com/cleopatra/?report=AMIfv96WsQEXrFIPwVeBduIw9ctMpoRVQJ6Hb-SslXmJjjQ-OrRdtJpxkmntfvn1XrJOdoxg_kpvYnrMtz60g7cxAyVi_Nt7CVVFkxgCVhzJN2J9r9yLasLT5H917uMWrJFvwrojXdypraDJxLJIl7Cj-tdKrpi-qA

Our immediate questions are:

1. Why qcms_transform_data_rgb_out_lut() versus qcms_transform_data_rgb_out_lut_sse2()? It's using other SSE2 routines in qcms! 

2. And what's with all the time spent in OnProcessNextEvent()? Is that related to the profiler itself or does it suggest a problem in qcms?
Comment 15 Benoit Girard (:BenWa) 2012-05-19 22:43:08 PDT
(In reply to malch from comment #14)
> 2. And what's with all the time spent in OnProcessNextEvent()? Is that
> related to the profiler itself or does it suggest a problem in qcms?

Spending time there is normal since on windows I believe it includes waiting for events and process them I believe? Particularly the ones calling into USER32.dll where we poll for events. Sadly you don't have symbols for there. This should become easier once we have a shared symbol server deployed.

For best profile you should select the region where we were decoding the image and click on it in the breadcrums to focus on the relevant time window.
Comment 16 malch 2012-05-20 07:27:47 PDT
Thanks for the info. Keep in mind that it was decoding/drawing the image for the vast majority of the profiled period. This isn't hard to accomplish when it takes 20-30 seconds to process the image :-(

In any event, we confirmed that qcms is not using qcms_transform_data_rgb_out_lut_sse2(). And that is because it's not using precache. Hence qcms_transform_create() is calling for the non-SSE transforms.

If we remove our display profile from Windows, qcms uses the precache and then qcms_transform_create() calls for the qcms_transform_data_rgb_out_lut_sse2() transform. This is much faster and explain why performance is so much worse with a display profile configured.

We confirmed that our display profile is ICCv2 and not ICCv4. We can post a copy if that would be helpful.
Comment 17 Benoit Girard (:BenWa) 2012-05-20 13:23:31 PDT
Yes please. I'll extract the profile from the image and run some tests on tuesday when I get back and see if we can patch qcms to use see2 for your case.
Comment 18 malch 2012-05-20 14:27:38 PDT
We have placed the display profile here:

http://www.malch.com/nikon/malch.icc

However... the fact that qcms is not using precache seems to be intentional.

In qcms_profile_precache_output_transform() we have:

	/* don't precache since we will use the B2A LUT */
	if (profile->B2A0)
		return;

The malch.icc profile does indeed contain a B2A table and I gather that this is recommended for device profiles. 

We're inclined to think the current behavior is by design and not a defect -- quite possibly for precision and fidelity reasons.

FYI: we did try commenting out the B2A test and qcms did use the SSE2 functions.

We're not absolutely convinced the overall performance problem lies within qcms itself. What we know for a fact is:

* We see generally very poor image performance, at least on the Q6600 CPU.
* It's made worse by progressive images which seem to be subject to 7 passes.
* It's made worse still if a display device profile is applied as well.

In addition, this image apparently renders very rapidly on your Mac. We understand that qcms has produced some excellent benchmark results when tested standalone. We further understand that Chrome uses an essentially unmodified copy of qcms to apply a display profile (when executed with --enable-monitor-profile). And yet it is much faster than Firefox. Chrome also uses libjpeg/libjpeg-turbo.

So we do still worry about the time spent in the message handling. We understand your earlier comments on that but still worry about the possibility that something (most likely associated with screen painting) is generating a huge number of message events that are consuming gobs of CPU cycles. Unfortunately, delving into that code is something we're unlikely to be able to do. But we've certainly seen saturated message loops cause plenty of performance problems in other applications! The might also explain the discrepancy in rendering times depending on whether the image is loaded from disk or the network (see Comment 1 at the top of this thread).
Comment 19 Benoit Girard (:BenWa) 2012-05-20 14:36:46 PDT
(In reply to malch from comment #18)
> We have placed the display profile here:
> 
> http://www.malch.com/nikon/malch.icc
> 
> However... the fact that qcms is not using precache seems to be intentional.
> 
> In qcms_profile_precache_output_transform() we have:
> 
> 	/* don't precache since we will use the B2A LUT */
> 	if (profile->B2A0)
> 		return;
> 
> The malch.icc profile does indeed contain a B2A table and I gather that this
> is recommended for device profiles. 
> 

I wrote that comment. It was my intention that this would only apply if iccv4 mode was selected. As you see the performance of that code path is poor since it doesn't have sse2. That we hit this without iccv4 mode is a bug IMO. I'll see if I can fix the iccv4 mode checks.

> 
> In addition, this image apparently renders very rapidly on your Mac. 

That would be because my display profile doesn't contain a B2A.

> We understand that qcms has produced some excellent benchmark results when
> tested standalone. 

That was for the gamma curve paths. The CLUT paths have never been optimized and have poor performance and is part of the reason they are optional.

> We further understand that Chrome uses an essentially
> unmodified copy of qcms to apply a display profile (when executed with
> --enable-monitor-profile). And yet it is much faster than Firefox. Chrome
> also uses libjpeg/libjpeg-turbo.
> 

I didn't think Chrome had started using qcms quite yet.
Comment 20 malch 2012-05-20 15:27:17 PDT
Interesting. Thank you for the additional background and your efforts with qcms.

If you post any patches here we'll certainly try them out before they land.

As for Chrome... --enable-monitor-profile has been working with the Chromium dev builds for quite a while (> 1 year). Others have said it's in Chrome too. We assume they're using qcms -- it's certainly in their source tree.
Comment 21 malch 2012-05-21 10:04:15 PDT
We made some usability/convenience tweaks to our profiler setup and ran a lengthy series of additional tests with some interesting findings:

1. The profiler reports significant time consumed by nsBaseAppShell::OnProcessNextEvent() when decoding/rendering the test image. However, it produces similar results when profiling other tasks where Firefox performance is excellent. We satisfied ourselves that this is simply an artifact of the way the profiler works on windows. i.e. no real issue to address.

2. We confirmed that the qcms times are really very good when the display profile is disabled (0.67).

3. The qcms times with a display profile enabled that does NOT include a B2A table are also excellent (0.73).

4. The qcms times with a display profile enabled that includes a B2A table are horrible (7.76). We think that many/most device profiles will include a B2A table since this seems to be recommended. Thus any optimizations/gains you can make here would be most welcome.

5. We applied a display profile including a B2A table and compared the qcms times for progressive and non-progressive versions of the image. The respective times were 7.76 and 1.30. It's clear that progressive images greatly amplify the qcms overhead. We next plan to investigate the progressive issues in more detail.
Comment 22 Benoit Girard (:BenWa) 2012-05-22 11:19:30 PDT
I think this is a regression from the support for qcmsv4 experimental support. It appears that we prevent precache if a LUT is found. The expected behaviors is that having qcmsv4 turned off (default) shouldn't have any impact.

I'm working on a patch. If it works I'll nomiate it for uplift into aurora/beta.
Comment 23 Benoit Girard (:BenWa) 2012-05-22 11:31:40 PDT
Created attachment 626102 [details] [diff] [review]
patch

I think we should uplift this to aurora/beta since the risk is very low.
Comment 25 Benoit Girard (:BenWa) 2012-05-22 12:03:19 PDT
Comment on attachment 626102 [details] [diff] [review]
patch

https://hg.mozilla.org/integration/mozilla-inbound/rev/959548388f6e

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 674230
User impact if declined: Performance regression for loading image if the user as a ICCv4 profile. I don't have data but this shouldn't be rare.
Testing completed (on m-c, etc.): Fixes regression locally. Will be on m-c in 24 hours.
Risk to taking this patch (and alternatives if risky): Low risk, we restore precache logic to pre bug 674230 landing.
String or UUID changes made by this patch: none
Comment 26 malch 2012-05-22 21:02:51 PDT
The folks who filed this bug report would like to express their sincere thanks to Benoit Girard for picking up this ball. The excellent work is greatly appreciated.

We applied the patch to a nightly build and tested it this evening. It works! And we have not observed any unwanted side effects at all.

Pushing it to Aurora seems completely appropriate.

Thanks again.
Comment 27 Ed Morley [:emorley] 2012-05-23 05:24:55 PDT
https://hg.mozilla.org/mozilla-central/rev/959548388f6e
Comment 28 Lukas Blakk [:lsblakk] use ?needinfo 2012-05-23 16:01:02 PDT
Comment on attachment 626102 [details] [diff] [review]
patch

Approving for Aurora, denying for Beta as we're now to close to release to take this.

Note You need to log in before you can comment on or make changes to this bug.