Bug 444661 (Closed)
Opened 16 years ago · Closed 16 years ago
Decrease Performance Penalty of color management linear interpolations
Categories: Core :: Graphics: Color Management, defect
Status: RESOLVED FIXED
People
(Reporter: bholley, Assigned: bholley)
Attachments
(1 file, 4 obsolete files)
35.67 KB, patch (vlad: review+, vlad: superreview+)
Initial performance analysis (viewing a multi-megabyte JPEG with an embedded color profile from the local disk) of Firefox color management on my Mac indicated that roughly 8.2% of Firefox's CPU time was spent in cmsLinearInterpLUT16. As such, I figured it would be a good candidate for optimization.

Closer inspection revealed that the function does in fact have hand-tuned inline assembly, but that it's for Windows (MASM) only. I spent some time porting the assembly to GAS syntax (slightly non-trivial due to register allocation issues relating to the use of ebx as a PIC register by gcc), and finally had analogous working assembly for Mac and Linux. Running it through Shark, I was surprised to find that the performance was almost twice as bad (16.1% as compared to 8.2%). I had a feeling this had to do with the div instructions, and in fact it did: roughly 80% of the samples recorded by Shark landed on the instructions immediately following the divs, which makes sense since div apparently isn't pipelined. Commenting out the divs brings performance to a nice 7.1%, and results in pretty trippy color rendering. ;-)

This means that the assembly is actually to blame for poor performance on Windows, not on Mac and Linux. That makes sense, because the performance numbers Stuart was talking about were a bit worse than the ones I was seeing.

I've been doing some reading (http://blogs.msdn.com/devdev/archive/2005/12/12/502980.aspx) on heuristics for doing division by constants with shifts and adds, and I hope to be able to replace the divs with a hand-tuned solution. After that, I want to look into MMX/SSE SIMD to see if I can speed things up that way. The function is currently called once for each 2-byte value, so hopefully we can get performance speedups there (I'd imagine I'll have to rearchitect the call structure a bit).
One concern is that the code currently consults a LUT for each value, which may limit the amount of SIMD optimizations we can do (we still need N arbitrary memory accesses for N values no matter what instruction set we use).
Comment 1 • 16 years ago (Assignee)
Comment 2 • 16 years ago (Assignee)
CC-ing schrep regarding the assembly ninjas
Comment 3 • 16 years ago (Assignee)
Michael and Steve - Schrep mentioned that you guys were good at assembly optimization. I'm working on optimizing this function for color management at the moment, so I was hoping that one or both of you could review it once it's done and give suggestions for things I might have missed. I'll let you know when it's done.
Comment 4 • 16 years ago
There is code for two programs to provide the assembler code for constant integer division via MUL + SHR on page 136 of the AMD K10 Software Optimization Guide. The code for unsigned division by 65535 is:

    MOV EAX, 080008001h
    MUL dividend
    SHR EDX, 15

And for signed division:

    MOV EAX, 080008001h
    IMUL dividend
    MOV EAX, dividend
    ADD EDX, EAX
    SAR EDX, 15
    SHR EAX, 31
    ADD EDX, EAX

How is the LUT table generated? Can we precompute (y1 - y0) efficiently?
Comment 5 • 16 years ago (Assignee)
Having finished porting their crappy assembly, I've been focusing more on the intent of the function. I _think_ the division can actually (and should actually) just be done with a right-shift by 16, and indeed it looks right when I do that. However, since color management is all about the fine points, if it's off by even a little bit it's no good. I'll dig deeper into that one tomorrow.

One question, though - how does the performance of MUL compare to the performance of single and double-precision shifts? (I know multipliers are slow in Verilog, but I don't know how they work out on x86.) If I can do something with one or two shifts as opposed to a multiplication, is it worth it, or is div the only major culprit?

Good idea on precomputing the difference in the LUT. I'll look into that tomorrow.
Comment 6 • 16 years ago
It should be interesting to examine the generated code from the build. If I compile this snippet of code using VS2005 and -O2:

    main() {
        int a;
        scanf("%d", &a);
        printf("%d\n", a/65535);
    }

I get:

    ; 5  :     int a;
    ; 6  :
    ; 7  :     scanf("%d", &a);
        lea  eax, DWORD PTR _a$[esp+4]
        push eax
        push OFFSET ??_C@_02DPKJAMEF@?$CFd?$AA@
        call _scanf
    ; 8  :     printf("%d\n", a/65535);
        mov  ecx, DWORD PTR _a$[esp+12]
        mov  eax, -2147450879        ; 80008001H
        imul ecx
        add  edx, ecx
        sar  edx, 15                 ; 0000000fH
        mov  ecx, edx
        shr  ecx, 31                 ; 0000001fH
        add  ecx, edx
        push ecx
        push OFFSET ??_C@_03PMGGPEJJ@?$CFd?6?$AA@
        call _printf

so VS2005 at -O2 does the multiply, add and double shift to replace the division. I'd guess that GCC does the same thing at -O2 but someone would have to test that.
Comment 7 • 16 years ago

Yep:

    ...
    call  L_scanf$stub
    movl  -12(%ebp), %ecx
    movl  $-2147450879, %edx
    movl  %ecx, %eax
    imull %edx
    leal  LC1-"L00000000001$pb"(%ebx), %eax
    movl  %eax, (%esp)
    addl  %ecx, %edx
    sarl  $15, %edx
    sarl  $31, %ecx
    subl  %ecx, %edx
    movl  %edx, 4(%esp)
    call  L_printf$stub
    ...
Comment 8 • 16 years ago

(Also, in the patch, don't blow away the MSC_VER version! But do fix it, along with a gcc version :)
Comment 9 • 16 years ago
(In reply to comment #5)
> Having finished porting their crappy assembly, I've been focusing more on the
> intent of the function. I _think_ the division can actually (and should
> actually) just be done with a right-shift by 16, and indeed it looks right
> when I do that. However, since color management is all about the fine points,
> if it's off by even a little bit it's no good. I'll dig deeper into that one
> tomorrow.
>
> One question though - How does the performance of MUL compare to the
> performance of single and double-precision shifts (I know multipliers are
> slow in verilog, but I don't know about how they work on an x86)? If I can do
> something with one or two shifts as opposed to a multiplication, is it worth
> it, or is div the only major culprit?

On the Pentium 4, imul takes from 10 to 14 cycles. Shifts take from 1 to 4 cycles. On Core, imul takes 3 to 4 cycles and shifts take one cycle. On AMD processors, imul takes 3 cycles on K10 (maybe a cycle or two more on K8), and shifts take one cycle. I don't have the SOGs for processors older than K8.

If the arguments to the three ToFixedDomain calls can be guaranteed to be positive, then passing an unsigned argument instead of the current signed argument would improve performance (compare the generated code for synthetic signed and unsigned division in comment #4).

It may be better to make some C code improvements with -O2 optimization instead of going with inline assembler. -GL optimization would allow for inter-routine optimization, and PGO could optimize the two conditionals. If you want to go SIMD, using intrinsics would make maintenance easier and allow the compiler to do the instruction scheduling.
Comment 10 • 16 years ago
I ran a program to look for off-by-one differences when dividing by 2^16 instead of 2^16 - 1, and found a bunch:

    #include <stdio.h>

    main() {
        int a;
        for (a = -20000000; a < 20000000; a++)
            if ( ((a + 32767)/65535) != ((a + 32767)/65536) )
                printf("%d %d %d\n", a, (a + 32767)/65535, (a + 32767)/65536);
    }
Comment 11 • 16 years ago
One last comment for the evening: doing SIMD code is frequently difficult when you have table lookups as is done with the LutTable because there are no table lookup SIMD instructions. Sometimes it is faster to compute table results on the fly in parallel to implement an SIMD solution.
Comment 12 • 16 years ago (Assignee)
Vlad - Don't worry, it's just a working patch to test the idea.

Michael - Unfortunately, the LUTs are read straight out of the ICC profile, so computing them on the fly is impossible.

I've started to think the assembly optimization isn't going to take us very far. Fundamentally, the problem is that we're running this function a lot, not that it's particularly inefficient. SIMD would probably be a good boost, but the need to look stuff up in the table makes that pretty difficult.

As a result, I've been looking into caching. My first attempt involved setting up a 256-entry cache table for each LUT to store the results for different input values (proof-of-concept patch attached). This doubled the performance on one of my test images (a NASA photo with a lot of black space and a bright yellow sun), but had more or less no benefit on a more realistic picture (mountains and rivers). I then tried making the cache bigger and covering the entire range of input values (rather than mapping 256->1). This doubled performance on the mountain photo, and almost tripled it on the space photo.

Given that the LUTs are per-color-profile, I'm now looking into simply precomputing the entire range of values for the system profile and for sRGB. This should take care of most common cases, and will hopefully prevent color management from having any noticeable performance hit for day-to-day browsing. Code to follow.
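The full-range precache idea can be sketched like this. The names (lerp16, precache_lut16) and the simplified interpolation are invented for illustration; this is not the lcms API, just the shape of the strategy: interpolate each of the 65536 possible 16-bit inputs once per profile, so the per-pixel work collapses to a single array load.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified linear interpolation over an n-entry 16-bit table whose
   domain spans 0..65535 (mirrors the shape of cmsLinearInterpLUT16,
   but is not the real fixed-point implementation). */
static uint16_t lerp16(const uint16_t *lut, int n, uint16_t in)
{
    uint32_t pos  = (uint32_t)in * (uint32_t)(n - 1);
    uint32_t cell = pos / 65535;
    uint32_t rest = pos % 65535;

    if (cell >= (uint32_t)(n - 1))     /* top of the domain: no cell above */
        return lut[n - 1];

    int32_t y0  = lut[cell];
    int32_t dif = (int32_t)lut[cell + 1] - y0;
    /* 64-bit intermediate: dif * rest can exceed 32 bits */
    return (uint16_t)(y0 + (int32_t)(((int64_t)dif * rest) / 65535));
}

/* Precompute results for every possible input, once per profile. */
static uint16_t *precache_lut16(const uint16_t *lut, int n)
{
    uint16_t *cache = malloc(65536 * sizeof *cache);
    if (!cache)
        return NULL;
    for (uint32_t in = 0; in <= 0xFFFF; in++)
        cache[in] = lerp16(lut, n, (uint16_t)in);
    return cache;
}
```

The 128 KB-per-channel cost is why this only makes sense for long-lived profiles (the system output profile and sRGB), not for per-image embedded profiles.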
Comment 13 • 16 years ago
Where do the ICC profiles come from or where do they live?
Comment 14 • 16 years ago (Assignee)
The destination color profile is read in either from a file (if it's specified in about:config) or from memory (queried from the windowing system). The source color profile is either created from known parameters (if it's sRGB) or created from the image data of images that have embedded source profiles.
Comment 15 • 16 years ago
Perhaps you could try changing the following:

    a1 = ToFixedDomain(dif * rest);
    a1 = ToFixedDomain((- dif) * rest);

to:

    a1 = ToFixedDomain((unsigned int) (dif * rest));
    a1 = ToFixedDomain((unsigned int) ((- dif) * rest));

as "rest" is positive and "dif" is positive based on the logic of the if statement.
Comment 16 • 16 years ago (Assignee)
renaming the bug to be more general (since we're considering caching mechanisms also).
Summary: Optimize Inline Assembly For cmsLinearInterpLUT16 → Decrease Performance Penalty of cmsLinearInterpLUT16
Comment 17 • 16 years ago (Assignee)
I've written a patch that does precaching for the system output profile and sRGB (as an input profile), and it seems to have pretty good results. I've thrown it up on the tryserver and will hopefully post it tomorrow morning once I see the results and fix some whitespace issues.

I also realized that I did something dumb that hid another big perf issue for color management. When I was making my test image, I grabbed a high-res photo off the internet and slapped a random profile from my system folder onto it as its profile. Unfortunately, I now realize that the profile I used was in fact my primary display profile. The composition of the two matrices yielded an identity matrix, which LCMS was smart enough to optimize away. It now looks like there's an equally significant performance hit in MAT3evalW. This might be harder to optimize (oh how I wish we had a GPU), but hopefully the perf gains from this will get us into reasonable territory. I'll look more into that tomorrow.
Comment 18 • 16 years ago (Assignee)
again changing the title of the bug to reflect its more general nature.
Summary: Decrease Performance Penalty of cmsLinearInterpLUT16 → Decrease Performance Penalty of color management linear interpolations
Comment 19 • 16 years ago (Assignee)
I just realized that the matrix multiplications are probably a perfect candidate for optimization with SIMD intrinsics. I'll look into that tomorrow.
Comment 20 • 16 years ago
For best performance with SSE2, you'd want to change the vectors from three doublewords to four doublewords and then try to double quadword align the vectors, whether malloc'd or allocated on the stack. The "W" flavor of the matrix arithmetic uses 64-bit scaled integer arithmetic and there should be enough support in SSE2 for this using the packed quadword multiply, add and shift instructions. Double-precision floating point instructions could be used but then there would be conversion overhead. Can these operations be streamed or are they strictly executed one at a time?
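The pad-to-four-lanes and 16-byte-alignment points can be illustrated compactly with float SSE (the type names here are invented, and the actual "W" fixed-point path would need the SSE2 packed 64-bit integer instructions instead; the aligned attribute is GCC/Clang syntax):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Vectors padded from 3 to 4 lanes and 16-byte aligned, so each matrix
   row and the input vector can be fetched with one aligned load. */
typedef struct { float n[4] __attribute__((aligned(16))); } VEC4;
typedef struct { VEC4 row[3]; } MAT3;

static void mat3_eval(VEC4 *out, const MAT3 *m, const VEC4 *in)
{
    __m128 v = _mm_load_ps(in->n);               /* single aligned load */
    for (int i = 0; i < 3; i++) {
        __m128 p = _mm_mul_ps(_mm_load_ps(m->row[i].n), v);
        /* horizontal sum of the 4 lanes; the pad lane is zero */
        __m128 t = _mm_add_ps(p, _mm_movehl_ps(p, p));
        t = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));
        _mm_store_ss(&out->n[i], t);
    }
    out->n[3] = 0.0f;
}
```

The point is only the data layout: with three-doubleword vectors every row load straddles the natural 16-byte boundary, while the padded, aligned layout makes each load a single instruction.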
Comment 21 • 16 years ago (Assignee)
I've made a separate bug for the matrix optimization. See bug 445552.
Comment 22 • 16 years ago (Assignee)
Adding a proposed patch for the precaching strategy. I did my best to stick to lcms' whitespace style, but it's pretty much impossible since lcms is horribly inconsistent. This patch must be applied on top of the patch for bug 444829, which hasn't landed but has review and will be landed soon. Flagging joe for review.
Attachment #328992 - Attachment is obsolete: true
Attachment #329122 - Attachment is obsolete: true
Attachment #329898 - Flags: review?(joe)
Comment 23 • 16 years ago
What's our test coverage like for lcms?
Comment 24 • 16 years ago (Assignee)
Joe - no idea. Vlad, Any insight on this?
Comment 25 • 16 years ago
I ran a build with and without this change and it's about 6% faster on 84 MB of random jpeg images. This is off the 3.0 source code base. I had to make a minor change to a file as it wouldn't apply on 3.0. Hardware is a 2.5 Ghz Penryn DC. OS is Windows XP.
Comment 26 • 16 years ago
Two comments:

1) I'd suggest building cmsintrp.c with -O2 on Windows. I think that this is built with -O1 by default. -O2 gets the multiply/shift optimization for the divisions and also inlines the multiply/shift instead of calling a function.

2) I'd suggest using an unsigned int cast in the last two calls to ToFixedDomain in cmsLinearInterpLUT1.
Comment 27 • 16 years ago
(In reply to comment #26)
> 2) I'd suggest using an unsigned int cast in the last two calls to
> ToFixedDomain in cmsLinearInterpLUT1.

I assume you mean cmsLinearInterpLUT16, but further -- how come?
Comment 28 • 16 years ago
(In reply to comment #27)
> (In reply to comment #26)
> > 2) I'd suggest using an unsigned int cast in the last two calls to
> > ToFixedDomain in cmsLinearInterpLUT1.
>
> I assume you mean cmsLinearInterpLUT16, but further -- how come?

See comment #4.
Comment 29 • 16 years ago
Comment on attachment 329898 [details] [diff] [review]
proposed precache patch

First comment: This looks good in principle, but needs some tweaking in implementation.

>+    LCMSBOOL status =
>+        cmsPrecacheProfile(gCMSOutputProfile, CMS_PRECACHE_LILUT16_REVERSE);

>+    LCMSBOOL status =
>+        cmsPrecacheProfile(gCMSsRGBProfile, CMS_PRECACHE_LIF_FORWARD);

It'd be nice if, given that precaching the transformations is such a win, this happened automatically instead of having to be opted-in.

>+// Type specifier for precaches
>+typedef enum {
>+
>+    // Precache the results of cmsLinearIntrpLUT16 on inverse (output) gamma tables
>+    CMS_PRECACHE_LILUT16_REVERSE,
>+
>+    // Precache the results of cmsLinearIntrpFixed on forward (input) gamma tables
>+    CMS_PRECACHE_LIF_FORWARD
>+
>+    } LCMSPRECACHETYPE;

>+#define PRECACHE_TYPE_COUNT ((CMS_PRECACHE_LIF_FORWARD - CMS_PRECACHE_LILUT16_REVERSE) + 1)

Please make PRECACHE_TYPE_COUNT the last member of the enum, and specify the values (0, 1) for _REVERSE and _FORWARD. However, I'm not sure if this type will be necessary. See below.

>+    // Different types of precaches require different structures. We use a union
>+    // to handle them with the same code when we can.
>+    union {
>+        LCMSPRECACHELILUT16IMPL LILUT16;
>+        LCMSPRECACHELIFIMPL LIF;
>+    } Impl;

Having this union seems to just make actually referring to the precached data difficult. I'd prefer to see two different reference-counted precache structures, one for input and one for output. They're never mixed, from what I can tell, so having two separate types should only make things cleaner. And, so long as you name things nicely, you can still use your PRECACHE_ADDREF and PRECACHE_RELEASE macros. This is going to require a lot of code changes, which is the main reason for the r-.

>+// Mozilla found this function to be almost twice as fast as the
>+// hand-optimized assembly, mostly due to the slowness of the div
>+// instructions in the assembler version. The assembly has thus been removed.

This comment should be in the checkin comment or ChangeLog, not the source.

>+    InVect.n[VX] = MatShaper->L2_LIF_Precache->Impl.LIF.Cache[0][In[0]];
>+    InVect.n[VY] = MatShaper->L2_LIF_Precache->Impl.LIF.Cache[1][In[1]];
>+    InVect.n[VZ] = MatShaper->L2_LIF_Precache->Impl.LIF.Cache[2][In[2]];

This function does a lot of this mixing of VX, VY, VZ and 0, 1, 2. I know you're just following convention, but it's kind of ugly. If you can find a way to use named indices instead of hardcoded numbers, that'd be nice. :)

>+++ b/modules/lcms/src/cmsprecache.c
>@@ -0,0 +1,178 @@
>+// Little cms
>+// Copyright (C) 2008 Mozilla Foundation

>+#define CMS_DEBUG 1

Do we want this in checked-in code?

Also, please address mmoy's comments in comment 26. And is LittleCMS actively maintained upstream? Is it possible to get the upstream developers to look this over? (Ideally we could get it pushed upstream.)

Finally, before any of your LCMS patches are applied, I want some form of CMS testing added. If we're going to enable it by default, especially. At minimum, I'd like tests that see whether CMS does anything (i.e., does this jpeg with a profile embedded look different than this same jpeg without a profile), but ideally it'd test the quality of the CM transformation. That will require a file with a non-sRGB profile, and an equivalent file in sRGB, which we then test for equality. Since we're using the fixed-point LCMS implementation, this should be reasonably solid, but we'll have to see what happens with all the vagaries of compiler optimization and CPU type.

The rest of the code looks pretty good. I've checked it over (keeping in mind I don't know LCMS code too well) and it seems to handle failure etc. fairly well. And I've tested it too -- seems to work nicely!
Attachment #329898 - Flags: review?(joe) → review-
Comment 30 • 16 years ago (Assignee)
> It'd be nice if, given that precaching the transformations is such a win,
> this happened automatically instead of having to be opted-in.

I don't think we want to precache just any profile that comes through LCMS. Embedded profiles are read from images at render time, and it would be wasteful in both time and space to generate a precache for a profile with a short lifetime. Moreover, LCMS doesn't know whether a profile is an input profile or an output profile (or both) at creation time, so it wouldn't know the appropriate precache to create. Finally, other lcms apps (GIMP, Inkscape, etc.) would likely object to such a big increase in their memory footprint when color management is not a bottleneck for them.

> Please make PRECACHE_TYPE_COUNT the last member of the enum

Any particular reason to do this? I was under the impression that both were accepted conventions, and it doesn't look like lcms sets a precedent in either direction.

> Having this union seems to just make actually referring to the precached data
> difficult. I'd prefer to see two different reference-counted precache
> structures, one for input and one for output. They're never mixed, from what
> I can tell, so having two separate types should only make things cleaner.
> And, so long as you name things nicely, you can still use your
> PRECACHE_ADDREF and PRECACHE_RELEASE macros.

I disagree. The union does add the annoyance of having to type things like Impl.LIF in data structure accesses. However, it adds a generality to the precaching code that makes it more likely to be accepted upstream. Right now, we're precaching the results of two very specific functions (cmsLinearInterpLUT16 and cmsLinearInterpFixed) because they happen to lie in the critical path for Gecko (one of them is used for our input transform and the other for our output transform). However, lcms is a library with dozens of other functions that can be called instead depending on function parameters.

As such, I wanted to make the precaching code as general as possible, such that most of it would serve as a framework for other precaching operations. This means that we don't want to pollute the profile structure with a line for each type of precache, and we want adding a new precache to involve adding as few hooks as possible. For example, right now the profile closing code is agnostic to the number and type of precaches attached to the profile. This is only possible when we store the different types of caches in an array, which is in turn possible only with unions, wasted space, or dangerous type casting.
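The design being argued for here can be sketched roughly as follows. The type and field names are invented for illustration and do not match the actual patch:

```c
#include <stdlib.h>
#include <stdint.h>

/* Illustrative per-type payloads (3 channels, full 16-bit input range). */
typedef struct { uint16_t Cache[3][65536]; } LILUT16IMPL;
typedef struct { int32_t  Cache[3][65536]; } LIFIMPL;

typedef enum {
    PRECACHE_LILUT16_REVERSE = 0,
    PRECACHE_LIF_FORWARD     = 1,
    PRECACHE_TYPE_COUNT              /* always last, so it counts members */
} PRECACHETYPE;

/* One reference-counted wrapper; the union holds whichever payload this
   cache carries, so generic code can handle any type uniformly. */
typedef struct {
    PRECACHETYPE Type;
    unsigned     RefCount;
    union {
        LILUT16IMPL LILUT16;
        LIFIMPL     LIF;
    } Impl;
} PRECACHE;

#define PRECACHE_ADDREF(p)  ((p)->RefCount++)
#define PRECACHE_RELEASE(p) \
    do { if (--(p)->RefCount == 0) free(p); } while (0)

/* Type-agnostic teardown: the profile keeps one slot per precache type
   and can release them all without knowing what each one holds. */
static void release_all(PRECACHE *slots[PRECACHE_TYPE_COUNT])
{
    for (int i = 0; i < PRECACHE_TYPE_COUNT; i++)
        if (slots[i])
            PRECACHE_RELEASE(slots[i]);
}
```

The trade-off is exactly the one debated above: the union buys a uniform slot array (and type-agnostic teardown) at the cost of the larger payload's size for every cache and the Impl.LIF-style access paths.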
Comment 31 • 16 years ago (Assignee)
I finally got around to updating the precache patch. This should allay some of joe's concerns, and it also adopts a more structured naming convention that helps for later patches. Flagging joe for review. LCMS still lacks a testing framework, and I promise I'll write one before any of this code gets checked in. I'm just hoping to stabilize this patch to ease things for the patches I have on top of it.
Attachment #329898 - Attachment is obsolete: true
Attachment #332296 - Flags: review?(joe)
Comment 32 • 16 years ago (Assignee)
oh as a side note - mmoy's first concern is being addressed in bug 449130
Comment 33 • 16 years ago
Comment on attachment 332296 [details] [diff] [review]
updated precache patch

r+ as long as we get some testing in. Also, please check upstream to see if they have any requests.
Attachment #332296 - Flags: review?(joe) → review+
Comment 34 • 16 years ago (Assignee)
roger that - pinged the upstream maintainer
Comment 35 • 16 years ago (Assignee)
Comment on attachment 332296 [details] [diff] [review]
updated precache patch

Flagging vlad for sr, since I think he already read the patch.
Attachment #332296 - Flags: superreview?(vladimir)
Comment 36 • 16 years ago (Assignee)
I found a bug in the precaching patch where it fails in a crappy way when there aren't any TRCs in the profile to precache. This should make it fail silently and still work, just slower. Flagging joe for r and vlad for sr. This one should be quick - just diff it against the previous patch.
Attachment #332296 - Attachment is obsolete: true
Attachment #333655 - Flags: superreview?(vladimir)
Attachment #333655 - Flags: review?(joe)
Attachment #332296 - Flags: superreview?(vladimir)
Comment 37 • 16 years ago

Comment on attachment 333655 [details] [diff] [review]
updated patch to fix a bug I found with the test suite

Looks fine.
Attachment #333655 - Flags: superreview?(vladimir) → superreview+
Attachment #333655 - Flags: review?(joe) → review+
Comment 38 • 16 years ago (Assignee)
pushed in ba60f155cc44
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Updated • 15 years ago
Product: Core → Core Graveyard
Updated • 15 years ago (Assignee)
Component: GFX → GFX: Color Management
Product: Core Graveyard → Core
QA Contact: general → color-management