Closed Bug 444661 Opened 11 years ago Closed 11 years ago

Decrease Performance Penalty of color management linear interpolations

Categories

(Core :: GFX: Color Management, defect)

Tracking

RESOLVED FIXED

People

(Reporter: bholley, Assigned: bholley)

Attachments

(1 file, 4 obsolete files)

Initial performance analysis (viewing a jpg of several megs with an embedded color profile from the local disk) for firefox color management on my mac indicated that roughly 8.2% of firefox's CPU time was spent in cmsLinearInterpLUT16. As such, I figured it would be a good candidate for optimization.

Closer inspection revealed that the function does in fact have hand-tuned inline assembly, but only for windows (masm). I spent some time porting it to GAS syntax (slightly non-trivial due to register allocation issues relating to gcc's use of ebx as the PIC register), and finally had analogous working assembly for mac and linux. Running it through Shark, I was surprised to find that the performance was almost twice as bad (16.1% as compared to 8.2%). I had a feeling this had to do with the div instructions, and in fact it did: roughly 80% of the samples Shark recorded landed in the instructions immediately following the divs, which makes sense since div apparently isn't pipelined. Commenting out the divs brings performance to a nice 7.1% and results in pretty trippy color rendering. ;-)

This means that the assembly is actually to blame for poor performance on windows, not on mac and linux. This makes sense, because the performance numbers stuart was talking about were a bit worse than the ones I was seeing.

I've been doing some reading (http://blogs.msdn.com/devdev/archive/2005/12/12/502980.aspx) on some heuristics to do division by constants with shifts and adds. I hope to be able to replace the divs with a hand-tuned solution. 

After that, I want to look into MMX/SSE/SIMD to see if I can speed things up that way. The function is currently called once for each 2-byte value, so hopefully we can get performance speedups that way (I would imagine I'd have to rearchitect the call structure a bit). One concern is that the code currently consults a LUT for each value, which may limit the amount of SIMD optimizations we can do (we still need N arbitrary memory accesses for N values no matter what instruction set we use).
CC-ing schrep regarding the assembly ninjas
Michael and Steve - Schrep mentioned that you guys were good at assembly optimization. I'm working on optimizing this function for color management at the moment, so I was hoping that one or both of you could review it once it's done and give suggestions for things I might have missed. I'll let you know when it's done.
Page 136 of the AMD K10 Software Optimization Guide has code for two programs that derive the assembler sequences for constant integer division via MUL + SHR. The code for unsigned division by 65535 is:

MOV EAX, 080008001h
MUL dividend
SHR EDX, 15

And for signed division:

MOV EAX, 080008001h
IMUL dividend
MOV EAX, dividend
ADD EDX, EAX
SAR EDX, 15
SHR EAX, 31
ADD EDX, EAX
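
For reference, the unsigned variant in portable C (a sketch, not part of any patch here; the magic constant and shift count come straight from the recipe above), checked against real division over a sample range:

#include <stdio.h>

/* MUL leaves the high 32 bits of the product in EDX, so the SHR EDX, 15
   makes the total shift 32 + 15 = 47 bits. */
static unsigned int div65535(unsigned int x)
{
    return (unsigned int) (((unsigned long long) x * 0x80008001ULL) >> 47);
}

int main(void)
{
    unsigned int x;
    for (x = 0; x < 20000000u; x++)
        if (div65535(x) != x / 65535)
            printf("mismatch at %u\n", x);
    return 0;
}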

How is the LUT table generated? Can we precompute (y1 - y0) efficiently?
Having finished porting their crappy assembly, I've been focusing more on the intent of the function. I _think_ the division can actually (and should actually) just be done with a right-shift by 16, and indeed it looks right when I do that. However, since color management is all about the fine points, if it's off by even a little bit it's no good. I'll dig deeper into that one tomorrow.

One question though - how does the performance of MUL compare to the performance of single and double-precision shifts (I know multipliers are slow in Verilog, but I don't know how they behave on x86)? If I can do something with one or two shifts as opposed to a multiplication, is it worth it, or is div the only major culprit?

Good idea on precomputing the difference in the LUT. I'll look into that tomorrow.
It should be interesting to examine the generated code from the build. If I compile this snippet of code using VS2005 and -O2,

#include <stdio.h>

int main()
{
  int a;

  scanf("%d", &a);
  printf("%d\n", a/65535);

  return 0;
}

I get:

; 5    :   int a;
; 6    :
; 7    :   scanf("%d", &a);

        lea     eax, DWORD PTR _a$[esp+4]
        push    eax
        push    OFFSET ??_C@_02DPKJAMEF@?$CFd?$AA@
        call    _scanf

; 8    :   printf("%d\n", a/65535);

        mov     ecx, DWORD PTR _a$[esp+12]
        mov     eax, -2147450879                        ; 80008001H
        imul    ecx
        add     edx, ecx
        sar     edx, 15                                 ; 0000000fH
        mov     ecx, edx
        shr     ecx, 31                                 ; 0000001fH
        add     ecx, edx
        push    ecx
        push    OFFSET ??_C@_03PMGGPEJJ@?$CFd?6?$AA@
        call    _printf

so VS2005 at -O2 does the multiply, add and double shift to replace the division. I'd guess that GCC does the same thing at -O2 but someone would have to test that.
Yep:

...
        call    L_scanf$stub
        movl    -12(%ebp), %ecx
        movl    $-2147450879, %edx
        movl    %ecx, %eax
        imull   %edx
        leal    LC1-"L00000000001$pb"(%ebx), %eax
        movl    %eax, (%esp)
        addl    %ecx, %edx
        sarl    $15, %edx
        sarl    $31, %ecx
        subl    %ecx, %edx
        movl    %edx, 4(%esp)
        call    L_printf$stub
...
(Also, in the patch, don't blow away the MSC_VER version!  But do fix it, along with a gcc version :)
(In reply to comment #5)
> Having finished porting their crappy assembly, I've been focusing more on the
> intent of the function. I _think_ the division can actually (and should
> actually) just be done with a right-shift by 16, and indeed it looks right when
> I do that. However, since color management is all about the fine points, if
> it's off by even a little bit it's no good. I'll dig deeper into that one
> tomorrow.
> 
> One question though - How does the performance of MUL compare to the
> performance of single and double-precision shifts (I know multipliers are slow
> in verilog, but I don't know about how they work on an x86)? If I can do
> something with one or two shifts as opposed to a multiplication, is it worth
> it, or is div the only major culprit?

On the Pentium 4, imul takes from 10 to 14 cycles. Shifts take from 1 to 4 cycles. On Core, imul takes 3 to 4 cycles and shifts take one cycle. On AMD processors, imul takes 3 on k10 (maybe a cycle or two more on k8), and shifts take one cycle. I don't have the SOGs for processors older than k8.

If the arguments to the three ToFixedDomain calls can be guaranteed to be positive, then passing an unsigned argument instead of the current signed argument would improve performance (compare the generated code for the synthetic signed and unsigned division in comment #4).

It may be better to make some C code improvements with -O2 optimization instead of going with inline assembler. -GL optimization would allow for inter-routine optimization and -PGO could optimize the two conditionals.
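
For reference, the suggested -GL / PGO workflow on VS2005 looks roughly like this (a sketch from memory, not a tested build recipe; adapt the file lists to the real build):

    cl /O2 /GL cmsintrp.c ...        (whole-program, cross-module optimization)
    link /LTCG:PGINSTRUMENT ...      (produce an instrumented build)
    ... run a representative image-decoding workload ...
    link /LTCG:PGOPTIMIZE ...        (rebuild using the collected profile)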

If you want to go SIMD, using intrinsics would make maintenance easier and allow the compiler to do the instruction scheduling.
I ran a program to look for differences in dividing by 2^16 instead of 2^16 - 1 and found a bunch of off-by-one differences.

#include "stdio.h"

main()
{
  int a;

  for (a = -20000000; a < 20000000; a++)
    if ( ((a + 32767)/65535) != ((a + 32767)/65536))
      printf("%d %d %d\n", a, (a + 32767)/65535, ((a + 32767)/65536));
}
One last comment for the evening: doing SIMD code is frequently difficult when you have table lookups as is done with the LutTable because there are no table lookup SIMD instructions. Sometimes it is faster to compute table results on the fly in parallel to implement an SIMD solution.
Vlad - Don't worry, just a working patch to test the idea.

Michael - Unfortunately, the LUTs are read straight out of the ICC profile, so computing them is impossible.

I've started to think the assembly optimization isn't going to take us very far. Fundamentally, the problem is that we're running this function a lot, not that it's particularly inefficient. SIMD would probably be a good boost, but the need to look stuff up in the table makes that pretty difficult.

As a result, I've been looking into caching. My first attempt involved setting up a 256-entry cache table for each LUT to store the results for different input values (proof of concept patch attached). This doubled the performance on one of my test images (a NASA photo with a lot of black space and a bright yellow sun), but had more or less no benefit on a more realistic picture (mountains and rivers). I then tried making the cache bigger and covering the entire range of input values (rather than mapping 256->1). This doubled performance on the mountain photo, and almost tripled it on the space photo.

Given that the LUTs are per-color profile, I'm now looking into simply precomputing the entire range of values for the system profile and for sRGB. This should take care of most common cases, and will hopefully prevent color management from having any noticeable performance hit for day to day browsing. Code to follow.
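
To make the full-range idea concrete, here's a minimal sketch (hypothetical names and a simplified stand-in interpolator, not the actual patch): precompute one output per possible 16-bit input so the per-pixel work collapses to a single array read.

#include <stdlib.h>

typedef unsigned short WORD;

/* Simplified stand-in for lcms' per-value linear interpolation over an
   npoints-entry gamma table (the real routine lives in cmsintrp.c). */
static WORD Interp16(WORD In, const WORD *Lut, int npoints)
{
    unsigned long long p = ((unsigned long long) In * (npoints - 1) << 16) / 0xFFFF;
    unsigned int cell = (unsigned int) (p >> 16);
    unsigned int rest = (unsigned int) (p & 0xFFFF);
    int y0 = Lut[cell];
    int y1 = Lut[cell + 1 < (unsigned int) npoints ? cell + 1 : cell];
    return (WORD) (y0 + (int) (((long long) (y1 - y0) * rest) >> 16));
}

typedef struct {
    WORD Cache[0x10000];    /* one precomputed output per possible input */
} Precache16;

/* Built once per long-lived profile (system output profile, sRGB)... */
static Precache16 *BuildPrecache16(const WORD *Lut, int npoints)
{
    Precache16 *p = (Precache16 *) malloc(sizeof(*p));
    unsigned int v;
    if (p == NULL) return NULL;
    for (v = 0; v <= 0xFFFF; v++)
        p->Cache[v] = Interp16((WORD) v, Lut, npoints);
    return p;
}

/* ...after which each per-pixel evaluation is just: out = p->Cache[in]; */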
Where do the ICC profiles come from or where do they live?
The destination color profile is read in either from a file (if it's specified in about:config) or from memory (queried from the windowing system).

The source color profile is either created from known parameters (if it's sRGB) or created from the image data of images that have embedded source profiles.
Perhaps you could try changing the following:

a1 = ToFixedDomain(dif * rest);
a1 = ToFixedDomain((- dif) * rest);

to:

a1 = ToFixedDomain((unsigned int) (dif * rest));
a1 = ToFixedDomain((unsigned int) ((- dif) * rest));

as "rest" is positive and dif is positive based on the logic of the if statement.
renaming the bug to be more general (since we're considering caching mechanisms also).
Summary: Optimize Inline Assembly For cmsLinearInterpLUT16 → Decrease Performance Penalty of cmsLinearInterpLUT16
Depends on: 444829
I've written a patch that does precaching for the system output profile and sRGB (as an input profile) that seems to have pretty good results. I've thrown it up on the tryserver and will hopefully post it tomorrow morning once I see the results and fix some whitespace issues.

I realized that I did something dumb that hid another big perf issue for color management. When I was making my test image, I grabbed a high-res photo off the internet and slapped a random profile from my system folder onto it as its profile. Unfortunately, I now realize that the profile I used was in fact my primary display profile. The composition of the two matrices yielded an identity matrix, which LCMS was smart enough to optimize away. It now looks like there's an equally significant performance hit in MAT3evalW. This might be harder to optimize (oh how i wish we had a GPU), but hopefully the perf gains from this will get us into reasonable territory. I'll look more into that tomorrow.
again changing the title of the bug to reflect its more general nature.
Summary: Decrease Performance Penalty of cmsLinearInterpLUT16 → Decrease Performance Penalty of color management linear interpolations
I just realized that the matrix multiplications are probably a perfect candidate for optimization with SIMD intrinsics. I'll look into that tomorrow.
For best performance with SSE2, you'd want to change the vectors from three doublewords to four and align them on a double-quadword (16-byte) boundary, whether they're malloc'd or allocated on the stack.

The "W" flavor of the matrix arithmetic uses 64-bit scaled integer arithmetic and there should be enough support in SSE2 for this using the packed quadword multiply, add and shift instructions. Double-precision floating point instructions could be used but then there would be conversion overhead.

Can these operations be streamed or are they strictly executed one at a time?
I've made a separate bug for the matrix optimization. See bug 445552.
Attached patch proposed precache patch (obsolete) — Splinter Review
Adding a proposed patch for the precaching strategy. I did my best to stick to lcms' whitespace style, but it's pretty much impossible since lcms is horribly inconsistent.

This patch must be applied on top of the patch for bug 444829, which hasn't landed but has review and will be landed soon.

Flagging joe for review.
Attachment #328992 - Attachment is obsolete: true
Attachment #329122 - Attachment is obsolete: true
Attachment #329898 - Flags: review?(joe)
What's our test coverage like for lcms?
Joe - no idea. Vlad, Any insight on this?
I ran a build with and without this change and it's about 6% faster on 84 MB of random jpeg images. This is off the 3.0 source code base. I had to make a minor change to a file as it wouldn't apply on 3.0. Hardware is a 2.5 GHz Penryn DC. OS is Windows XP.
Two comments:

1) I'd suggest building cmsintrp.c with -O2 on Windows. I think that this is built with -O1 by default. -O2 gets the multiply/shift optimization for the divisions and also inlines the multiply/shift instead of calling a function.

2) I'd suggest using an unsigned int cast in the last two calls to ToFixedDomain in cmsLinearInterpLUT1.
(In reply to comment #26)

> 2) I'd suggest using an unsigned int cast in the last two calls to
> ToFixedDomain in cmsLinearInterpLUT1.

I assume you mean cmsLinearInterpLUT16, but further -- how come?
See comment #4.

(In reply to comment #27)
> (In reply to comment #26)
> 
> > 2) I'd suggest using an unsigned int cast in the last two calls to
> > ToFixedDomain in cmsLinearInterpLUT1.
> 
> I assume you mean cmsLinearInterpLUT16, but further -- how come?


Comment on attachment 329898 [details] [diff] [review]
proposed precache patch

First comment: This looks good in principle, but needs some tweaking in implementation.


>+        LCMSBOOL status = 
>+            cmsPrecacheProfile(gCMSOutputProfile, CMS_PRECACHE_LILUT16_REVERSE);

>+        LCMSBOOL status = 
>+            cmsPrecacheProfile(gCMSsRGBProfile, CMS_PRECACHE_LIF_FORWARD);

It'd be nice if, given that precaching the transformations is such a win, this happened automatically instead of having to be opted-in.

>+// Type specifier for precaches
>+typedef enum {
>+
>+             // Precache the results of cmsLinearIntrpLUT16 on inverse (output) gamma tables
>+             CMS_PRECACHE_LILUT16_REVERSE,   
>+
>+             // Precache the results of cmsLinearIntrpFixed on forward (input) gamma tables
>+             CMS_PRECACHE_LIF_FORWARD 
>+
>+             } LCMSPRECACHETYPE;
>+#define PRECACHE_TYPE_COUNT ((CMS_PRECACHE_LIF_FORWARD - CMS_PRECACHE_LILUT16_REVERSE) + 1)

Please make PRECACHE_TYPE_COUNT the last member of the enum, and specify the values (0, 1) for _REVERSE and _FORWARD.

However, I'm not sure if this type will be necessary. See below.

>+               // Different types of precaches require different structures. We use a union
>+               // to handle them with the same code when we can.
>+               union {
>+                     LCMSPRECACHELILUT16IMPL LILUT16;
>+                     LCMSPRECACHELIFIMPL     LIF;
>+                     } Impl;

Having this union seems to just make actually referring to the precached data difficult. I'd prefer to see two different reference-counted precache structures, one for input and one for output. They're never mixed, from what I can tell, so having two separate types should only make things cleaner. And, so long as you name things nicely, you can still use your PRECACHE_ADDREF and PRECACHE_RELEASE macros.

This is going to require a lot of code changes, which is the main reason for the r-.

>+// Mozilla found this function to be almost twice as fast as the
>+// hand-optimized assembly, mostly due to the slowness of the div
>+// instructions in the assembler version. The assembly has thus been removed.

This comment should be in the checkin comment or ChangeLog, not the source.
 
>+              InVect.n[VX] = MatShaper->L2_LIF_Precache->Impl.LIF.Cache[0][In[0]];
>+              InVect.n[VY] = MatShaper->L2_LIF_Precache->Impl.LIF.Cache[1][In[1]];
>+              InVect.n[VZ] = MatShaper->L2_LIF_Precache->Impl.LIF.Cache[2][In[2]];

This function does a lot of this mixing of VX, VY, VZ and 0, 1, 2. I know you're just following convention, but it's kind of ugly. If you can find a way to use named indices instead of hardcoded numbers, that'd be nice. :)

>+++ b/modules/lcms/src/cmsprecache.c
>@@ -0,0 +1,178 @@
>+//  Little cms
>+//  Copyright (C) 2008 Mozilla Foundation

>+#define CMS_DEBUG 1

Do we want this in checked-in code?

Also, please address mmoy's comments in comment 26. And is LittleCMS actively maintained upstream? Is it possible to get the upstream developers to look this over? (Ideally we could get it pushed upstream.)

Finally, before any of your LCMS patches are applied, I want some form of CMS testing added, especially if we're going to enable it by default. At minimum, I'd like tests that see whether CMS does anything (i.e., does this jpeg with a profile embedded look different than the same jpeg without a profile), but ideally it'd test the quality of the CM transformation. That will require a file with a non-sRGB profile, and an equivalent file in sRGB, which we then test for equality. Since we're using the fixed-point LCMS implementation, this should be reasonably solid, but we'll have to see what happens with all the vagaries of compiler optimization and CPU type.

The rest of the code looks pretty good. I've checked it over (keeping in mind I don't know LCMS code too well) and it seems to handle failure/etc fairly well. And I've tested it too -- seems to work nicely!
Attachment #329898 - Flags: review?(joe) → review-
>It'd be nice if, given that precaching the transformations is such a win, this
>happened automatically instead of having to be opted-in.

I don't think we want to precache just any profile that comes through LCMS. Embedded profiles are read from images at render time, and it would be wasteful in both time and space to generate a precache for a profile with a short lifetime. Moreover, LCMS doesn't know whether a profile is an input profile or an output profile (or both) at creation time, so it wouldn't know the appropriate precache to create. Finally, other lcms apps (GIMP, inkscape, etc) would likely object to such a big increase in their memory footprint when color management is not a bottleneck for them.

>Please make PRECACHE_TYPE_COUNT the last member of the enum

Any particular reason to do this? I was under the impression that both were accepted conventions and it doesn't look like lcms sets a precedent in either direction.

>Having this union seems to just make actually referring to the precached data
>difficult. I'd prefer to see two different reference-counted precache
>structures, one for input and one for output. They're never mixed, from what I
>can tell, so having two separate types should only make things cleaner. And, so
>long as you name things nicely, you can still use your PRECACHE_ADDREF and
>PRECACHE_RELEASE macros.

I disagree. The union does add the annoyance of having to type things like Impl.LIF. in data structure access. However, it adds a generality to the precaching code that makes it more likely to be accepted upstream. Right now, we're precaching the results of two very specific functions (cmsLinearInterpLUT16 and cmsLinearInterpFixed) because they happen to lie in the critical path for gecko (one of them is used for our input transform and the other is used for our output transform). However, lcms is a library with dozens of other functions which can be called instead depending on function parameters. As such, I wanted to make the precaching code as general as possible such that most of it would serve as a framework for other precaching operations. This means that we don't want to pollute the profile structure with a line for each type of precache, and we want adding a new precache to involve adding as few hooks as possible. For example, right now the profile closing code is agnostic to the number and type of precaches attached to the profile. This is only possible when we store the different types of caches in an array, which is in turn possible only with unions, wasted space, or dangerous type casting.
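
To illustrate that last point, here's a rough, self-contained sketch of the array-of-unions layout (stand-in struct bodies and a hypothetical release helper; the real definitions and the PRECACHE_RELEASE macro are in the patch):

#include <stdlib.h>

typedef struct { int placeholder; } LCMSPRECACHELILUT16IMPL;   /* stand-in body */
typedef struct { int placeholder; } LCMSPRECACHELIFIMPL;       /* stand-in body */

typedef enum { CMS_PRECACHE_LILUT16_REVERSE, CMS_PRECACHE_LIF_FORWARD } LCMSPRECACHETYPE;
#define PRECACHE_TYPE_COUNT ((CMS_PRECACHE_LIF_FORWARD - CMS_PRECACHE_LILUT16_REVERSE) + 1)

typedef struct {
               LCMSPRECACHETYPE Type;
               int              RefCount;
               union {
                     LCMSPRECACHELILUT16IMPL LILUT16;
                     LCMSPRECACHELIFIMPL     LIF;
                     } Impl;
               } LCMSPRECACHE;

/* The profile keeps one slot per precache type, so the close path can loop
   over the array without knowing which caches exist or what they contain. */
static void ReleaseAllPrecaches(LCMSPRECACHE *Precache[PRECACHE_TYPE_COUNT])
{
       int i;
       for (i = 0; i < PRECACHE_TYPE_COUNT; i++) {
           if (Precache[i] != NULL && --Precache[i]->RefCount == 0)
               free(Precache[i]);
           Precache[i] = NULL;
       }
}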
Attached patch updated precache patch (obsolete) — Splinter Review
I finally got around to updating the precache patch. This should allay some of joe's concerns, and it also adopts a more structured naming convention that helps for later patches.

Flagging joe for review. LCMS still lacks a testing framework, and I promise I'll write one before any of this code gets checked in. I'm just hoping to stabilize this patch to ease things for the patches I have on top of it.
Attachment #329898 - Attachment is obsolete: true
Attachment #332296 - Flags: review?(joe)
oh as a side note - mmoy's first concern is being addressed in bug 449130
Comment on attachment 332296 [details] [diff] [review]
updated precache patch

r+ as long as we get some testing in. Also, please check upstream to see if they have any requests.
Attachment #332296 - Flags: review?(joe) → review+
roger that - pinged the upstream maintainer
Comment on attachment 332296 [details] [diff] [review]
updated precache patch

flagging vlad for sr, since I think he already read the patch
Attachment #332296 - Flags: superreview?(vladimir)
I found a bug in the precaching patch where it fails in a crappy way when there aren't any TRCs in the profile to precache. This should make it fail silently and still work, just slower.

Flagging joe for r and vlad for sr. This one should be quick - just diff it against the previous patch.
Attachment #332296 - Attachment is obsolete: true
Attachment #333655 - Flags: superreview?(vladimir)
Attachment #333655 - Flags: review?(joe)
Attachment #332296 - Flags: superreview?(vladimir)
Comment on attachment 333655 [details] [diff] [review]
updated patch to fix a bug I found with the test suite

Looks fine.
Attachment #333655 - Flags: superreview?(vladimir)
Attachment #333655 - Flags: superreview+
Attachment #333655 - Flags: review?(joe)
Attachment #333655 - Flags: review+
pushed in ba60f155cc44
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Core → Core Graveyard
Component: GFX → GFX: Color Management
Product: Core Graveyard → Core
QA Contact: general → color-management