Last Comment Bug 500079 - QCMS slows graphic handling, even on systems that can't use it
: QCMS slows graphic handling, even on systems that can't use it
Status: RESOLVED FIXED
: perf
Product: Core
Classification: Components
Component: GFX: Color Management (show other bugs)
: 1.9.1 Branch
: x86 Windows 2000
: -- normal with 3 votes (vote)
: ---
Assigned To: Jeff Muizelaar [:jrmuizel]
:
: Milan Sreckovic [:milan]
Mentors:
http://imagebin.ca/img/wdxkwC.jpg
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2009-06-23 16:41 PDT by Steve Snyder
Modified: 2011-10-07 10:08 PDT (History)
18 users (show)
mbeltzner: blocking1.9.2-
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
?


Attachments
work in progress improvement to sse code. (4.21 KB, patch)
2009-06-24 08:39 PDT, Jeff Muizelaar [:jrmuizel]
no flags Details | Diff | Splinter Review

Description Steve Snyder 2009-06-23 16:41:34 PDT
QCMS, the color managament system in Firefox 3.5, significantly slows the handling of graphics in the browser, even in environments that cannot derive any benefit from the extra processing.

Moreover, this performance burden is imposed by the default value of configuration item gfx.color_management.mode, a config setting which is not readily available to the casual user.

Rationale to follow.
Comment 1 Steve Snyder 2009-06-23 17:30:33 PDT
The machine used for testing is a Pentium4/2.4Ghz with 1.0GB of RAM.  It is
running the Win2K/SP4 operating system with all Microsfoft updates and fixes
applied.

See the URL above for a JPEG file which demonstrates my assertion that QCMS
imposes a severe burden on the displaying of graphics in Firefix v3.5.  The
testing was done by storing the JPEG file to the local hard disk and simply
loading from there into Firefox.  A couple loads prior to tests was done to
warm up the disk cache with the file.

I came to Firefox 3.5RC2 from Firefox v3.0.11.  Both browser versions used the
the default color management settings.  Color management is not supported in my
environment and I have no ICC profile defined, so I've never had need to adjust
any of the relevant settings.

I was struck in FF 3.5RC by the poor graphics performance.  Running the browser
in a profiler, I discovered that roughly 30% of the total execution time is
spend in function qcms_transform_data_rgb_out_lut_sse().

This is the sequence of action that yields this statistic:

1. Open Firefox
2. Load image from local hard disk
3. Close Firefox

So the overall time measured is not just that of loading the image but includes
the Firefox init and shutdown as well.

This is the breakdown of the top items with gfx.color_management.mode set to
its default value of 2:

26.97% qcms_transform_data_rgb_out_lut_sse()
 2.41% jpeg_idct_islow_sse2()
 1.50% decode_mcu()

And this is the breakdown on the same test with gfx.color_management.mode set
to zero:

 4.45% ycc_rbd_convert_argb()
 3.95% jpeg_idct_islow_sse2()
 2.10% decode_mcu()

The results of both tests runs are shown here:
http://imagebin.ca/img/4SPzzy.jpg

So qcms_transform_data_rgb_out_lut_sse() burns an inordinate amount of CPU time
(11 times that of its non-QCMS competitor above) for an environment that can
derive no benefit from its transformation of pixel data.

But what does it do that's so CPU intensive?  Stay tuned for more!
Comment 2 Steve Snyder 2009-06-23 17:53:50 PDT
The guts of qcms_transform_data_rgb_out_lut_sse() consists of 2 blocks of x86 assembly code, one GCC-specific, the other in Intel/Microsoft format.  As Firefox was built with VS2005, I'll describe the 2nd ASM block of code.

This is it:

__asm {                                                                                         
  mov      eax, mat                                                                            
   mov      ecx, clampMax                                                                       
   mov      edx, floatScaleAddr                                                                 
   mov      ebx, input                                                                          

   movaps   xmm1, [eax]
   movaps   xmm2, [eax + 16]
   movaps   xmm3, [eax + 32]
   movaps   xmm0, [ebx]

   movaps   xmm4, xmm0
   shufps   xmm4, xmm4, 0
   mulps    xmm1, xmm4
   movaps   xmm5, xmm0
   shufps   xmm5, xmm5, 0x55
   mulps    xmm2, xmm5
   movaps   xmm6, xmm0
   shufps   xmm6, xmm6, 0xAA
   mulps    xmm3, xmm6

   addps    xmm2, xmm3
   addps    xmm1, xmm2

   movss    xmm7, [ecx]
   shufps   xmm7, xmm7, 0
   minps    xmm1, xmm7
   xorps    xmm6, xmm6
   maxps    xmm1, xmm6
   movss    xmm5, [edx]
   shufps   xmm5, xmm5, 0
   mulps    xmm1, xmm5
   cvtps2dq xmm1, xmm1
   movdqa   [ebx], xmm1
}

A more informative view can be found here: http://imagebin.ca/img/u2dnMEN.jpg

The lion's share of the execution time can be found at one specific address, where we see the back-to-back register load/store:

   movaps   xmm1, [eax]
   movaps   xmm2, [eax + 16]
   movaps   xmm3, [eax + 32]
   movaps   xmm0, [ebx]
; --> pipeline stall here
   movaps   xmm4, xmm0
   shufps   xmm4, xmm4, 0
   mulps    xmm1, xmm4

In the QCMS-enabled test run above, this "movaps xmm4, xmm0" instruction accounts for 4.52% of the entire test run, not just the execution time of this particular function.
Comment 3 Steve Snyder 2009-06-23 18:04:33 PDT
So that's my observations of the color management system in Firefox v3.5RC2.

For all I know, QCMS may very well be a godsend to the professional photographer.  To the casual user of a web browser, though, it imposes a very high overhead by default, a cost that may not provide any compensating benefits.

The user will not find that gfx.color_management.mode variable amongst the Options dialog boxes.  Should the user learn that graphics performance can be doubled by changing a single setting, s/he will have to click past a scary notice to get to it.

There has to be a better way.
Comment 4 Boris Zbarsky [:bz] (still a bit busy) 2009-06-23 21:41:17 PDT
So offhand, there seem to be two issues:

1) Pipeline stall.  Would be good to fix that.
2) Running qcms in an environment where it provides no benefit.  Jeff, is the
   above really such an environment, and if so, can we detect that somehow?
Comment 5 Steve Snyder 2009-06-24 07:44:53 PDT
Much more than I ever wanted to know about Mozilla's CMS:

This code is called at browser start-up:


gfxWindowsPlatform::GetPlatformCMSOutputProfile()
{
#ifndef MOZ_FT2_FONTS
    WCHAR str[1024+1];
    DWORD size = 1024;

    HDC dc = GetDC(nsnull);
    GetICMProfileW(dc, &size, (LPWSTR)&str);
    ReleaseDC(nsnull, dc);

    qcms_profile* profile =
        qcms_profile_from_path(NS_ConvertUTF16toUTF8(str).get());


That call to GetICMProfileW() returns a profile size of zero bytes.  Despite this, qcms_profile_from_path() is called anyway:


qcms_profile* qcms_profile_from_file(FILE *file)
{
	uint32_t length, remaining_length;
	qcms_profile *profile;
	size_t read_length;
	__be32 length_be;
	void *data;

	fread(&length_be, sizeof(length), 1, file);
	length = be32_to_cpu(length_be);
	if (length > MAX_PROFILE_SIZE)
		return BAD_VALUE_PROFILE;

Above we see that the profile size is checked again - and again we are told that the size is zero bytes.  The too-large condition is checked, but not the nonsensically-small case.  The code then goes on to build an in-memory profile without input from an external set of values (ICC profile).

This code clearly can opt *not* to transform pixel data because it knows that there is no ICC profile info available.  I don't know, maybe the generic in-memory profile is an improvement over sRGB.  If it's not an improvement, then QCMS can be disregard on systems on which no profile data is found.
Comment 6 Jeff Muizelaar [:jrmuizel] 2009-06-24 08:39:35 PDT
Created attachment 384877 [details] [diff] [review]
work in progress improvement to sse code.

This change (gcc only right now) has a noticeable performance improvement. It comes from rescheduling the instructions and using pshufd instead of shufps. You can try porting it to windows (should be pretty straight forward) and see if that helps.
Comment 7 Boris Zbarsky [:bz] (still a bit busy) 2009-06-24 08:45:39 PDT
We might want to split this bug into two separate bugs per comment 4.  Presumably move issue 2 to a different bug, since we have a patch in progress for issue 1 here...

Someone who understands what the color management stuff is doing should evaluate what the deal with issue 2 is, though.  That wouldn't be me.  ;)
Comment 8 Jeff Muizelaar [:jrmuizel] 2009-06-24 09:31:16 PDT
(In reply to comment #1)
> But what does it do that's so CPU intensive?  Stay tuned for more!

The reason it's so CPU intensive is because it's a lot of data. Processing a 6000x6000 pixel image on 2.4Ghz machine gives only 67 cycles per pixel per second which isn't really that many. The image is also ~144MB so it will totally blow out the cache which means we might also be hurt by memory bandwidth. I'm not saying we can't do better, just that there isn't going to be a huge improvement here.

(In reply to comment #5)
> gfxWindowsPlatform::GetPlatformCMSOutputProfile()
> {
> #ifndef MOZ_FT2_FONTS
>     WCHAR str[1024+1];
>     DWORD size = 1024;
> 
>     HDC dc = GetDC(nsnull);
>     GetICMProfileW(dc, &size, (LPWSTR)&str);
>     ReleaseDC(nsnull, dc);
> 
>     qcms_profile* profile =
>         qcms_profile_from_path(NS_ConvertUTF16toUTF8(str).get());
> 
> 
> That call to GetICMProfileW() returns a profile size of zero bytes.  Despite
> this, qcms_profile_from_path() is called anyway:

qcms_profile_from_path() should return without calling qcms_profile_from_file() because fopen will fail on the 0 length path.

> qcms_profile* qcms_profile_from_file(FILE *file)
> {
>     uint32_t length, remaining_length;
>     qcms_profile *profile;

[snip]

> This code clearly can opt *not* to transform pixel data because it knows that
> there is no ICC profile info available.  I don't know, maybe the generic
> in-memory profile is an improvement over sRGB.  If it's not an improvement,
> then QCMS can be disregard on systems on which no profile data is found.

QCMS is still useful on systems that have no profile data. Imagine you have two images of the same thing but encoded in different colorspaces. Both images should still appear the same on your monitor even if we don't have a profile for your monitor. To do this we need to convert the images to a common color space and sRGB is as good as any for that.

However, it would be possible to detect when both the image colorspace and the destination colorspace are sRGB and avoid a conversion in that case. However, that would not help the image you linked to because it's an Adobe RGB image. This is bug 449826.
Comment 9 Steve Snyder 2009-06-24 09:53:14 PDT
(In reply to comment #8)
> (In reply to comment #1)
> > But what does it do that's so CPU intensive?  Stay tuned for more!
> 
> The reason it's so CPU intensive is because it's a lot of data. Processing a
> 6000x6000 pixel image on 2.4Ghz machine gives only 67 cycles per pixel per
> second which isn't really that many. The image is also ~144MB so it will
> totally blow out the cache which means we might also be hurt by memory
> bandwidth. I'm not saying we can't do better, just that there isn't going to be
> a huge improvement here.

That atypically large JPEG file was used to inflate the absolute numbers reported by the profiler, to make it obvious where the work was done.  Still, the relative performance comparison should be sound.

An off-topic question, since we have the code's author here:  why test for SSE2 instead of for SSE?  I did a test compile of transform.c with SSE-only enabled (VS2005: "-arch:SSE") just to see what the compiler would complain about.  It didn't seem to have any problem with the instructions used.
Comment 10 Jeff Muizelaar [:jrmuizel] 2009-06-24 10:37:00 PDT
(In reply to comment #9)
> (In reply to comment #8)
> > (In reply to comment #1)
> > > But what does it do that's so CPU intensive?  Stay tuned for more!
> > 
> > The reason it's so CPU intensive is because it's a lot of data. Processing a
> > 6000x6000 pixel image on 2.4Ghz machine gives only 67 cycles per pixel per
> > second which isn't really that many. The image is also ~144MB so it will
> > totally blow out the cache which means we might also be hurt by memory
> > bandwidth. I'm not saying we can't do better, just that there isn't going to be
> > a huge improvement here.
> 
> That atypically large JPEG file was used to inflate the absolute numbers
> reported by the profiler, to make it obvious where the work was done.  Still,
> the relative performance comparison should be sound.

My guess as to one of the reasons that the actual jpeg decoding is faster is because the raw amount of data that it has to deal with is less. 

> An off-topic question, since we have the code's author here:  why test for SSE2
> instead of for SSE?  I did a test compile of transform.c with SSE-only enabled
> (VS2005: "-arch:SSE") just to see what the compiler would complain about.  It
> didn't seem to have any problem with the instructions used.

cvtps2dq is an SSE2 instruction.
Comment 11 Samuel Sidler (old account; do not CC) 2009-07-23 16:38:31 PDT
Jeff: Are you still working on this?
Comment 12 Jeff Muizelaar [:jrmuizel] 2009-07-24 11:16:32 PDT
(In reply to comment #11)
> Jeff: Are you still working on this?

I am.
Comment 13 Steve Snyder 2009-08-26 18:36:28 PDT
Mitigated by bug 512865.
Comment 14 Bas Schouten (:bas.schouten) 2011-10-07 09:35:03 PDT
It looks like this was superseded by bug 512865. Re-open if I'm wrong.
Comment 15 Steve Snyder 2011-10-07 10:08:50 PDT
Yes, this bug is dead.

Note You need to log in before you can comment on or make changes to this bug.