Closed Bug 445552 Opened 14 years ago Closed 14 years ago

Optimize Matrix Multiplication and General xform path For Color Management

Categories

(Core :: GFX: Color Management, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: bholley, Assigned: bholley)

References

Details

Attachments

(6 files, 6 obsolete files)

11.26 KB, patch
Details | Diff | Splinter Review
50.07 KB, patch
Details | Diff | Splinter Review
6.58 KB, patch
vlad
: review+
Details | Diff | Splinter Review
3.28 KB, patch
vlad
: review+
Details | Diff | Splinter Review
23.68 KB, patch
vlad
: review+
Details | Diff | Splinter Review
1.38 KB, patch
vlad
: review+
Details | Diff | Splinter Review
After implementing the precache strategy for LUT interpolation (see bug 444661), I discovered that there's also about a 10% performance hit coming from MAT3evalW. We'd like to get this number down significantly, hopefully with some sort of SIMD/SSE2 optimization.
Assignee: nobody → bholley
Status: NEW → ASSIGNED
CCing mmoy and snyder
(continued from the discussion on bug 444661)

>For best performance with SSE2, you'd want to change the vectors from three
>doublewords to four doublewords and then try to double quadword align the
>vectors, whether malloc'd or allocated on the stack.

Do you mean putting zeros in the fourth slot? I'm a bit confused as to what you mean by increasing the dimension of the vector.

>Can these operations be streamed or are they strictly executed one at a time?

This may be possible with some re-architecting of the code. What would the advantage to doing so? Given that the matrix multiplication involves multiplying 3 32 bit numbers, we're already at 96 bits of somewhat parallel processing in the function. How would processing multiple multiplications at once help things?
I don't think that you have to put zeros in the fourth slot - just allocate enough space. You'd need to do this for alignment anyways unless you're reading smaller chunks. You can read a doubleword with MOVD, a quadword with MOVQ or a double quadword with MOVDQA or MOVDQU. MOVDQA is the aligned version and MOVDQU is the unaligned version. MOVDQA is the better performer. Both read 16 bytes so 16 bytes should be allocated.

The reason for streaming is so that you can do other work while an SIMD instruction is waiting to complete. It would probably be easiest to try this without streaming first.

The operations aren't quite as easy if you're going to follow the scalar code because the integer arithmetic is done in quadwords so you can't just multiply and add the three doublewords in SIMD registers - you have to do the multiply, add and shift in quadwords which means that you'll need two sets of registers and a little shuffling of doublewords.
I wrote an SSE2 implementation of the fixed-point code, which brought the synthetic benchmarks from 20% to 14% (this is with precaching enabled so the interpolation is free). This isn't as good as I was hoping for, so I've been working on a solution that uses 32-bit floating point numbers. This is tricky since LCMS doesn't really have the necessary framework to do this efficiently. Thus, I spent some time to implement a floating point path through lcms. It still needs a bit of debugging, but it looks like it alone brings the synthetic benchmarks down to 13% without sse2 optimization and with some inefficient code (I think I tested it with asserts on). Once I get that debugged, I'll post the patch and then work an sse2 implementation of the matrix arithmetic and conversion.
renaming the bug to represent the stuff I've been working on
Summary: Optimize Matrix Multiplication For Color Management (MAT3evalW) → Optimize Matrix Multiplication and General xform path For Color Management
Status update: I've made good progress in getting the performance hit down on color management in various ways. The path is pretty strongly optimized in SSE2 assembly, which is represented in the above patches. The patches must be applied in order.

With everything but the top patch applied, talos regressions are down to about 4% on linux and 7% on windows. I'm not sure why windows is slower. The build for that is here (https://build.mozilla.org/tryserver-builds/2008-08-07_14:03-bholley@mozilla.com-bholleyuberpatch/), and I've been meaning to take a look at the generated code but haven't had time.

The last patch was somewhat troublesome. It should make things faster since it removes two memory loads, but for some reason it makes things about 10 talos points slower on both platforms. The only thing I can think of might be the re-ordering of the shuffle instructions having some impact on reordering, but I'm not sure. MMoy, can you weigh in on this?

I have some prototype patches (not posted here) that suggest other ways to improve performance and hopefully get the windows numbers under 5%. For one, right now RGBs with alpha aren't fast-pathed simply because I didn't have time to deal with the special case of 32-bit pixels instead of 24-bit pixels. The numbers suggest that this may increase performance by close to 10 points. Furthermore, right now we don't do any detection to see if an embedded profile is actually just sRGB and, if it is, to use the precache data. Numbers suggest that this may increase performance as well. I hope to get full versions of those patches running early this weekend or early next week.

I'd appreciate any feedback on the patches and how to make them faster. I haven't gone through and made them super-clean yet, but they should be passable.
I was unable to apply the first patch and was wondering if you could
diff your changes into one patch.

I do have some comments on your fourth patch:

1) You have a code path for NON_WINDOWS and Windows (the #else path). Using
   SSE2 is a little more complex.

   SSE2 on Windows is supported on Pentium 4 processors and up and on AMD K8
   processors and up. So a runtime check on the processor to determine if it
   supports SSE2 is needed. If runtime support isn't available, then scaler
   code has to be used. The C Runtime has a global that can be used to check
   runtime capabilities and it is __sse2_available. SSE2 support is available
   if this variable is set to 1. It's used by the C Runtime to do SSE2 optimization
   in memcpy. Note that Windows has been supported on other platforms in the past
   including Alpha and PowerPC. It is best to check for Win32 and x86.
   (I'm using __sse2_available in one of my patches and it works in the positive
   way - I want to check it in the negative way on a Pentium 3 that doesn't have
   SSE2).

   SSE2 on Mac OSX Intel is always available.

   SSE2 on Linux processors is available on Pentium 4 and AMD K8 processors
   and above. Processor capabilities have to be determined at runtime by
   sniffing capabilities.

   Other platforms (Alpha, SPARC, AIX, Itanium, ARM, etc.) may or may not
   support SSE2. It's probably safest to look for Windows on x86, Mac OSX
   on x86 and Linux on x86 and use scaler code for everything else.

2) In general, it is better to use intrinsics instead of inline assembler
   unless you absolutely need complete control of the instructions and
   their order and even then, you don't have complete control.

   The advantages to intrinsics are:

   A) Portability. You can use common code for Linux, Mac OSX and Windows.
      Win32 supports inline assembler but Windows x64 does not support
      inline assembler. Intrinsics are usually easier to read for those
      unfamiliar with inline assembler and they're easier to read than
      two sets of code, one for the MSVC format and one for the GCC format.

   B) The compiler has the flexibility to schedule instructions. This can
      result in optimizations that aren't available with inline assembler.
      In general, GCC is more flexible in compiler scheduling with inline
      assembler. MSVC is pretty rigid. C variables are always read from
      memory so when you load a C variable, the value is written from a
      register to memory (if it's in a register) and then read back from
      memory. With intrinsics, the C values in registers can be used
      directly.

   C) Type checking by the compiler.

   D) Maintenance costs - it's easier to maintain one code set.

3) Most SSE2 and SSE instructions can be broadly categorized as integer
   or floating point. It is generally best to use floating point
   instructions on floating point datatypes and integer instructions
   on integer datatypes if at all possible.

   So instead of using MOVDQA, use MOVAPS for single-precision floating
   point. Instead of PSHUFD, use SHUFPS if possible. XORPS or ANDNPS
   can be used instead of PXOR.

4) In the comment: Extract the floating point representation of 65535/65536 
   and load it into 4 slots, the comment seems to indicate that you are
   loading constants but the comment on the load of xmm1 is: Move the first
   matrix column to xmm1. xmm2 is similar.

5) To load a vector of constants, it's usually more efficient to load a
   vector of constants than it is to load the first vector element and
   then propagate the first element to the other three.
mmoy - I'll look into your suggestions in more detail in a bit, but I wanted to quickly let you know that I'd forgotten to mention that these patches must be applied on top of the precaching patch in bug 444661. Let me know if it still gives you trouble (it might depending on when the patch for bug 449681 lands, after which point these patches will need to be rebased).
Attached patch updated float path patch (obsolete) — Splinter Review
Added an update float path patch. This isn't so much of an optimization itself as it is a platform for other optimizations. Flagging vlad for review. If all looks good, I'll check it in tomorrow morning.
Attachment #333039 - Attachment is obsolete: true
Attachment #334056 - Flags: review?(vladimir)
Comment on attachment 334056 [details] [diff] [review]
updated float path patch

Looks fine; "FLOATPATH" feels a little weird to me (I would've called it FLOATMATH or something), but looks good.
Attachment #334056 - Flags: review?(vladimir) → review+
picked a more appropriate flag. This is a trivial change so not flagging it for re-review. I'll be landing this shortly
Attachment #334056 - Attachment is obsolete: true
float patch pushed in 6d3fca48dd13
another infrastructure patch - flagging vlad for review. After this the assembly stuff should apply.
Attachment #333040 - Attachment is obsolete: true
Attachment #334126 - Flags: review?(vladimir)
Alignment on 16 bytes on the stack is usually done with __declspec(align(16)) for MSVC and __attribute__ ((aligned (16))) for gcc.
mmoy - nothing similar on the heap though right? I know it's possible with memalign, but dealing with the different allocators proved to be more of a pain than simply hacking it manually.
also, mmoy - given the weird semantics of shufps, is it better to have two shufps instructions or one pshufd instruction?
I think that you'd just do a shufps xmmn, xmmn to get similar semantics to pshufd. If you need it in another register, you'd just do a movdqa to a spare register and then do a shufps xmmn, xmmn to move things to where you want them.

On modern intel processors, movdqa takes one cycle. The pshufd and shufps instructions take one to two cycles depending on the processor.
is there a reason you'd suggest movdqa over movaps?
Old age.

You're right, it's movaps.
aligned space patch pushed in fd996aff506b
Attached patch update floating point sse2 patch (obsolete) — Splinter Review
Adding an updated patch for floating point sse2 stuff. Flagging vlad for review. If you get a chance to review this in the next couple of hours vlad that would be sweet.

A few comments:
1-mmoy - I agree that the portability of intrinsics makes them a desirable option. However, at this point it's an extra variable that I don't have time to deal with before a2, since there are several issues involved (compiler support, whether it's as fast as the straight assembler, and me personally learning the intrinsic syntax). I hope to investigate it after we get alpha 2 out the door with color management on.

2-there are several optimizations not in the patch. For one, there are a couple of mutually exclusive ways to reduce memory accesses (shoving the vector into the free matrix slots, shoving the constants into the free matrix slots), and also loading 4 copies of the constants at a time. I want to land this first so that it's easier to obtain perf numbers for each other different options. Additionally, this still doesn't include rgba fastpathing and srgb detection. Those are in the pipe.
Attachment #333041 - Attachment is obsolete: true
Attachment #334195 - Flags: review?(vladimir)
added the patch to fastpath rgba. This must be applied on top of the sse2 patch. Flagging vlad for review.
Attachment #334197 - Flags: review?(vladimir)
there was some unexpected windows bustage as a result of using a macro for the CPUID assembler (the assembler gets spit out as one line, but MASM wants newlines between instructions). Obsoleting the old patch and flagging vlad for review on this one.
Attachment #334195 - Attachment is obsolete: true
Attachment #334201 - Flags: review?(vladimir)
Attachment #334195 - Flags: review?(vladimir)
fixed some silliness on my part after switching from a macro to a function.
Attachment #334201 - Attachment is obsolete: true
Attachment #334203 - Flags: review?(vladimir)
Attachment #334201 - Flags: review?(vladimir)
That assembly doesn't quite do what it should for windows.
Attachment #334206 - Flags: review?(vladimir)
Comment on attachment 334203 [details] [diff] [review]
updated sse2 patch (again)


Ok, with comments:

+// Determine if we can build with SSE2 (this was partly copied from jmorecfg.h in
+// mozilla/jpeg)
 // -------------------------------------------------------------------------
+#if defined(XP_WIN32) && defined(_M_IX86) && !defined(__GNUC__)

No XP_WIN32 -- that's a mozilla-only #define.

>+#define SSE2_EDX_MASK (1UL << 26)
>+static LCMSBOOL SSE2Available() {
>+       DWORD a, b, c, d;
>+       DWORD function = 0x00000001;
>+
>+// If we don't have compile-time support, we don't have runtime support
>+#ifndef HAVE_SSE2_INTEL_MNEMONICS
>+       return FALSE;
>+#else
>+       /* We have CPUID macros defined if we have sse2 mnemonics. */
>+       LCMSCPUID(function, &a, &b, &c, &d);
>+       if (d & SSE2_EDX_MASK)
>+              return TRUE;
>+       return FALSE;
>+#endif
>+
>+}

This isn't used critically in the code, but you should still cache the results in a local static var and avoid running CPUID again if you've already ran it once.

>+#ifdef NON_WINDOWS

#ifdef __GNUC__ please
Attachment #334203 - Flags: review?(vladimir) → review+
this has all since landed in 88831bfc0a4d, fd996aff506b, and 6d3fca48dd13.

Resolving as fixed.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment on attachment 334109 [details] [diff] [review]
floating point patch renaming flag to cmsFLAGS_FLOATSHAPER

In lcms.h, #ifdef USE_INLINE:
>+LCMS_INLINE int     FromFloatDomain(FLOAT a)  { return (int) (a * 65536.0f + 0.5f); }   

In cmsmtrx.c, #ifndef USE_INLINE:
>+int FromFloatDomain(Float a)
>+{
>+    return (int) (a * 65536f); 
>+}   

Shouldn't these two versions of FromFloatDomain be the same?
(In reply to comment #32)
> Shouldn't these two versions of FromFloatDomain be the same?


Probably. Not sure if this is the source of the issues in bug 465088 though, since LCMS_INLINE is always defined AFAIK.
(In reply to comment #33)
> Probably. Not sure if this is the source of the issues in bug 465088 though,
> since LCMS_INLINE is always defined AFAIK.

I presumed it was unrelated; I just happened to notice while looking into that.
filed bug 465111 for this issue
Product: Core → Core Graveyard
Component: GFX → GFX: Color Management
Product: Core Graveyard → Core
QA Contact: general → color-management
You need to log in before you can comment on or make changes to this bug.