Closed
Bug 500079
Opened 15 years ago
Closed 13 years ago
QCMS slows graphic handling, even on systems that can't use it
Categories
(Core :: Graphics: Color Management, defect)
Tracking
()
RESOLVED
FIXED
Tracking | Status | |
---|---|---|
status1.9.1 | --- | ? |
People
(Reporter: swsnyder, Assigned: jrmuizel)
References
()
Details
(Keywords: perf)
Attachments
(1 file)
4.21 KB,
patch
|
Details | Diff | Splinter Review |
QCMS, the color managament system in Firefox 3.5, significantly slows the handling of graphics in the browser, even in environments that cannot derive any benefit from the extra processing. Moreover, this performance burden is imposed by the default value of configuration item gfx.color_management.mode, a config setting which is not readily available to the casual user. Rationale to follow.
Updated•15 years ago
|
Component: General → GFX: Color Management
Product: Firefox → Core
QA Contact: general → color-management
Version: 3.5 Branch → 1.9.1 Branch
Reporter | ||
Comment 1•15 years ago
|
||
The machine used for testing is a Pentium4/2.4Ghz with 1.0GB of RAM. It is running the Win2K/SP4 operating system with all Microsfoft updates and fixes applied. See the URL above for a JPEG file which demonstrates my assertion that QCMS imposes a severe burden on the displaying of graphics in Firefix v3.5. The testing was done by storing the JPEG file to the local hard disk and simply loading from there into Firefox. A couple loads prior to tests was done to warm up the disk cache with the file. I came to Firefox 3.5RC2 from Firefox v3.0.11. Both browser versions used the the default color management settings. Color management is not supported in my environment and I have no ICC profile defined, so I've never had need to adjust any of the relevant settings. I was struck in FF 3.5RC by the poor graphics performance. Running the browser in a profiler, I discovered that roughly 30% of the total execution time is spend in function qcms_transform_data_rgb_out_lut_sse(). This is the sequence of action that yields this statistic: 1. Open Firefox 2. Load image from local hard disk 3. Close Firefox So the overall time measured is not just that of loading the image but includes the Firefox init and shutdown as well. This is the breakdown of the top items with gfx.color_management.mode set to its default value of 2: 26.97% qcms_transform_data_rgb_out_lut_sse() 2.41% jpeg_idct_islow_sse2() 1.50% decode_mcu() And this is the breakdown on the same test with gfx.color_management.mode set to zero: 4.45% ycc_rbd_convert_argb() 3.95% jpeg_idct_islow_sse2() 2.10% decode_mcu() The results of both tests runs are shown here: http://imagebin.ca/img/4SPzzy.jpg So qcms_transform_data_rgb_out_lut_sse() burns an inordinate amount of CPU time (11 times that of its non-QCMS competitor above) for an environment that can derive no benefit from its transformation of pixel data. But what does it do that's so CPU intensive? Stay tuned for more!
Reporter | ||
Comment 2•15 years ago
|
||
The guts of qcms_transform_data_rgb_out_lut_sse() consists of 2 blocks of x86 assembly code, one GCC-specific, the other in Intel/Microsoft format. As Firefox was built with VS2005, I'll describe the 2nd ASM block of code. This is it: __asm { mov eax, mat mov ecx, clampMax mov edx, floatScaleAddr mov ebx, input movaps xmm1, [eax] movaps xmm2, [eax + 16] movaps xmm3, [eax + 32] movaps xmm0, [ebx] movaps xmm4, xmm0 shufps xmm4, xmm4, 0 mulps xmm1, xmm4 movaps xmm5, xmm0 shufps xmm5, xmm5, 0x55 mulps xmm2, xmm5 movaps xmm6, xmm0 shufps xmm6, xmm6, 0xAA mulps xmm3, xmm6 addps xmm2, xmm3 addps xmm1, xmm2 movss xmm7, [ecx] shufps xmm7, xmm7, 0 minps xmm1, xmm7 xorps xmm6, xmm6 maxps xmm1, xmm6 movss xmm5, [edx] shufps xmm5, xmm5, 0 mulps xmm1, xmm5 cvtps2dq xmm1, xmm1 movdqa [ebx], xmm1 } A more informative view can be found here: http://imagebin.ca/img/u2dnMEN.jpg The lion's share of the execution time can be found at one specific address, where we see the back-to-back register load/store: movaps xmm1, [eax] movaps xmm2, [eax + 16] movaps xmm3, [eax + 32] movaps xmm0, [ebx] ; --> pipeline stall here movaps xmm4, xmm0 shufps xmm4, xmm4, 0 mulps xmm1, xmm4 In the QCMS-enabled test run above, this "movaps xmm4, xmm0" instruction accounts for 4.52% of the entire test run, not just the execution time of this particular function.
Reporter | ||
Comment 3•15 years ago
|
||
So that's my observations of the color management system in Firefox v3.5RC2. For all I know, QCMS may very well be a godsend to the professional photographer. To the casual user of a web browser, though, it imposes a very high overhead by default, a cost that may not provide any compensating benefits. The user will not find that gfx.color_management.mode variable amongst the Options dialog boxes. Should the user learn that graphics performance can be doubled by changing a single setting, s/he will have to click past a scary notice to get to it. There has to be a better way.
Comment 4•15 years ago
|
||
So offhand, there seem to be two issues: 1) Pipeline stall. Would be good to fix that. 2) Running qcms in an environment where it provides no benefit. Jeff, is the above really such an environment, and if so, can we detect that somehow?
Flags: wanted1.9.1.x?
Flags: blocking1.9.2?
Reporter | ||
Comment 5•15 years ago
|
||
Much more than I ever wanted to know about Mozilla's CMS: This code is called at browser start-up: gfxWindowsPlatform::GetPlatformCMSOutputProfile() { #ifndef MOZ_FT2_FONTS WCHAR str[1024+1]; DWORD size = 1024; HDC dc = GetDC(nsnull); GetICMProfileW(dc, &size, (LPWSTR)&str); ReleaseDC(nsnull, dc); qcms_profile* profile = qcms_profile_from_path(NS_ConvertUTF16toUTF8(str).get()); That call to GetICMProfileW() returns a profile size of zero bytes. Despite this, qcms_profile_from_path() is called anyway: qcms_profile* qcms_profile_from_file(FILE *file) { uint32_t length, remaining_length; qcms_profile *profile; size_t read_length; __be32 length_be; void *data; fread(&length_be, sizeof(length), 1, file); length = be32_to_cpu(length_be); if (length > MAX_PROFILE_SIZE) return BAD_VALUE_PROFILE; Above we see that the profile size is checked again - and again we are told that the size is zero bytes. The too-large condition is checked, but not the nonsensically-small case. The code then goes on to build an in-memory profile without input from an external set of values (ICC profile). This code clearly can opt *not* to transform pixel data because it knows that there is no ICC profile info available. I don't know, maybe the generic in-memory profile is an improvement over sRGB. If it's not an improvement, then QCMS can be disregard on systems on which no profile data is found.
Assignee | ||
Comment 6•15 years ago
|
||
This change (gcc only right now) has a noticeable performance improvement. It comes from rescheduling the instructions and using pshufd instead of shufps. You can try porting it to windows (should be pretty straight forward) and see if that helps.
Comment 7•15 years ago
|
||
We might want to split this bug into two separate bugs per comment 4. Presumably move issue 2 to a different bug, since we have a patch in progress for issue 1 here... Someone who understands what the color management stuff is doing should evaluate what the deal with issue 2 is, though. That wouldn't be me. ;)
Assignee | ||
Comment 8•15 years ago
|
||
(In reply to comment #1) > But what does it do that's so CPU intensive? Stay tuned for more! The reason it's so CPU intensive is because it's a lot of data. Processing a 6000x6000 pixel image on 2.4Ghz machine gives only 67 cycles per pixel per second which isn't really that many. The image is also ~144MB so it will totally blow out the cache which means we might also be hurt by memory bandwidth. I'm not saying we can't do better, just that there isn't going to be a huge improvement here. (In reply to comment #5) > gfxWindowsPlatform::GetPlatformCMSOutputProfile() > { > #ifndef MOZ_FT2_FONTS > WCHAR str[1024+1]; > DWORD size = 1024; > > HDC dc = GetDC(nsnull); > GetICMProfileW(dc, &size, (LPWSTR)&str); > ReleaseDC(nsnull, dc); > > qcms_profile* profile = > qcms_profile_from_path(NS_ConvertUTF16toUTF8(str).get()); > > > That call to GetICMProfileW() returns a profile size of zero bytes. Despite > this, qcms_profile_from_path() is called anyway: qcms_profile_from_path() should return without calling qcms_profile_from_file() because fopen will fail on the 0 length path. > qcms_profile* qcms_profile_from_file(FILE *file) > { > uint32_t length, remaining_length; > qcms_profile *profile; [snip] > This code clearly can opt *not* to transform pixel data because it knows that > there is no ICC profile info available. I don't know, maybe the generic > in-memory profile is an improvement over sRGB. If it's not an improvement, > then QCMS can be disregard on systems on which no profile data is found. QCMS is still useful on systems that have no profile data. Imagine you have two images of the same thing but encoded in different colorspaces. Both images should still appear the same on your monitor even if we don't have a profile for your monitor. To do this we need to convert the images to a common color space and sRGB is as good as any for that. However, it would be possible to detect when both the image colorspace and the destination colorspace are sRGB and avoid a conversion in that case. However, that would not help the image you linked to because it's an Adobe RGB image. This is bug 449826.
Reporter | ||
Comment 9•15 years ago
|
||
(In reply to comment #8) > (In reply to comment #1) > > But what does it do that's so CPU intensive? Stay tuned for more! > > The reason it's so CPU intensive is because it's a lot of data. Processing a > 6000x6000 pixel image on 2.4Ghz machine gives only 67 cycles per pixel per > second which isn't really that many. The image is also ~144MB so it will > totally blow out the cache which means we might also be hurt by memory > bandwidth. I'm not saying we can't do better, just that there isn't going to be > a huge improvement here. That atypically large JPEG file was used to inflate the absolute numbers reported by the profiler, to make it obvious where the work was done. Still, the relative performance comparison should be sound. An off-topic question, since we have the code's author here: why test for SSE2 instead of for SSE? I did a test compile of transform.c with SSE-only enabled (VS2005: "-arch:SSE") just to see what the compiler would complain about. It didn't seem to have any problem with the instructions used.
Assignee | ||
Comment 10•15 years ago
|
||
(In reply to comment #9) > (In reply to comment #8) > > (In reply to comment #1) > > > But what does it do that's so CPU intensive? Stay tuned for more! > > > > The reason it's so CPU intensive is because it's a lot of data. Processing a > > 6000x6000 pixel image on 2.4Ghz machine gives only 67 cycles per pixel per > > second which isn't really that many. The image is also ~144MB so it will > > totally blow out the cache which means we might also be hurt by memory > > bandwidth. I'm not saying we can't do better, just that there isn't going to be > > a huge improvement here. > > That atypically large JPEG file was used to inflate the absolute numbers > reported by the profiler, to make it obvious where the work was done. Still, > the relative performance comparison should be sound. My guess as to one of the reasons that the actual jpeg decoding is faster is because the raw amount of data that it has to deal with is less. > An off-topic question, since we have the code's author here: why test for SSE2 > instead of for SSE? I did a test compile of transform.c with SSE-only enabled > (VS2005: "-arch:SSE") just to see what the compiler would complain about. It > didn't seem to have any problem with the instructions used. cvtps2dq is an SSE2 instruction.
Comment 11•15 years ago
|
||
Jeff: Are you still working on this?
Assignee | ||
Comment 12•15 years ago
|
||
(In reply to comment #11) > Jeff: Are you still working on this? I am.
Reporter | ||
Comment 13•15 years ago
|
||
Mitigated by bug 512865.
Updated•15 years ago
|
Flags: blocking1.9.2? → blocking1.9.2-
Comment 14•13 years ago
|
||
It looks like this was superseded by bug 512865. Re-open if I'm wrong.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 15•13 years ago
|
||
Yes, this bug is dead.
You need to log in
before you can comment on or make changes to this bug.
Description
•