Closed Bug 445552 Opened 17 years ago Closed 17 years ago

Optimize Matrix Multiplication and General xform path For Color Management

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: bholley, Assigned: bholley)

References

Details

Attachments

(6 files, 6 obsolete files)

patch to add a floating point path through the cms transform 17 years ago Bobby Holley (:bholley) 39.58 KB, patch		Details \| Diff \| Splinter Review
patch adding some optimizations on top of the float path 17 years ago Bobby Holley (:bholley) 6.04 KB, patch		Details \| Diff \| Splinter Review
patch optimizing the matrix multiplication and clamping in SSE2 assembly 17 years ago Bobby Holley (:bholley) 17.06 KB, patch		Details \| Diff \| Splinter Review
patch attempting to avoid extra loads for the floating point constants by stuffing them into unused matrix slots. This made things slower. :( 17 years ago Bobby Holley (:bholley) 11.26 KB, patch		Details \| Diff \| Splinter Review
updated float path patch 17 years ago Bobby Holley (:bholley) 50.05 KB, patch	vlad : review+	Details \| Diff \| Splinter Review
floating point patch renaming flag to cmsFLAGS_FLOATSHAPER 17 years ago Bobby Holley (:bholley) 50.07 KB, patch		Details \| Diff \| Splinter Review
patch to set up aligned space for assembly 17 years ago Bobby Holley (:bholley) 6.58 KB, patch	vlad : review+	Details \| Diff \| Splinter Review
update floating point sse2 patch 17 years ago Bobby Holley (:bholley) 23.45 KB, patch		Details \| Diff \| Splinter Review
updated patch to let rgba pngs go through the sse2 code 17 years ago Bobby Holley (:bholley) 3.28 KB, patch	vlad : review+	Details \| Diff \| Splinter Review
small fix from previous sse2 patch 17 years ago Bobby Holley (:bholley) 23.53 KB, patch		Details \| Diff \| Splinter Review
updated sse2 patch (again) 17 years ago Bobby Holley (:bholley) 23.68 KB, patch	vlad : review+	Details \| Diff \| Splinter Review
msvc fix for the updated sse2 patch 17 years ago Rob Arnold [:robarnold] 1.38 KB, patch	vlad : review+	Details \| Diff \| Splinter Review

Bobby Holley (:bholley)

Assignee

Description

•

17 years ago

After implementing the precache strategy for LUT interpolation (see bug 444661), I discovered that there's also about a 10% performance hit coming from MAT3evalW. We'd like to get this number down significantly, hopefully with some sort of SIMD/SSE2 optimization.

Bobby Holley (:bholley)

Assignee

Updated

•

17 years ago

Assignee: nobody → bholley

Bobby Holley (:bholley)

Assignee

Updated

•

17 years ago

Status: NEW → ASSIGNED

Bobby Holley (:bholley)

Assignee

Comment 1

•

17 years ago

CCing mmoy and snyder

Bobby Holley (:bholley)

Assignee

Comment 2

•

17 years ago

(continued from the discussion on bug 444661) >For best performance with SSE2, you'd want to change the vectors from three >doublewords to four doublewords and then try to double quadword align the >vectors, whether malloc'd or allocated on the stack. Do you mean putting zeros in the fourth slot? I'm a bit confused as to what you mean by increasing the dimension of the vector. >Can these operations be streamed or are they strictly executed one at a time? This may be possible with some re-architecting of the code. What would the advantage to doing so? Given that the matrix multiplication involves multiplying 3 32 bit numbers, we're already at 96 bits of somewhat parallel processing in the function. How would processing multiple multiplications at once help things?

Michael Moy

Comment 3

•

17 years ago

I don't think that you have to put zeros in the fourth slot - just allocate enough space. You'd need to do this for alignment anyways unless you're reading smaller chunks. You can read a doubleword with MOVD, a quadword with MOVQ or a double quadword with MOVDQA or MOVDQU. MOVDQA is the aligned version and MOVDQU is the unaligned version. MOVDQA is the better performer. Both read 16 bytes so 16 bytes should be allocated. The reason for streaming is so that you can do other work while an SIMD instruction is waiting to complete. It would probably be easiest to try this without streaming first. The operations aren't quite as easy if you're going to follow the scalar code because the integer arithmetic is done in quadwords so you can't just multiply and add the three doublewords in SIMD registers - you have to do the multiply, add and shift in quadwords which means that you'll need two sets of registers and a little shuffling of doublewords.

Bobby Holley (:bholley)

Assignee

Comment 4

•

17 years ago

I wrote an SSE2 implementation of the fixed-point code, which brought the synthetic benchmarks from 20% to 14% (this is with precaching enabled so the interpolation is free). This isn't as good as I was hoping for, so I've been working on a solution that uses 32-bit floating point numbers. This is tricky since LCMS doesn't really have the necessary framework to do this efficiently. Thus, I spent some time to implement a floating point path through lcms. It still needs a bit of debugging, but it looks like it alone brings the synthetic benchmarks down to 13% without sse2 optimization and with some inefficient code (I think I tested it with asserts on). Once I get that debugged, I'll post the patch and then work an sse2 implementation of the matrix arithmetic and conversion.

Bobby Holley (:bholley)

Assignee

Comment 5

•

17 years ago

renaming the bug to represent the stuff I've been working on

Summary: Optimize Matrix Multiplication For Color Management (MAT3evalW) → Optimize Matrix Multiplication and General xform path For Color Management

Bobby Holley (:bholley)

Assignee

Comment 6

•

17 years ago

Attached patch patch to add a floating point path through the cms transform (obsolete) — Details — Splinter Review

Bobby Holley (:bholley)

Assignee

Comment 7

•

17 years ago

Attached patch patch adding some optimizations on top of the float path (obsolete) — Details — Splinter Review

Bobby Holley (:bholley)

Assignee

Comment 8

•

17 years ago

Attached patch patch optimizing the matrix multiplication and clamping in SSE2 assembly (obsolete) — Details — Splinter Review

Bobby Holley (:bholley)

Assignee

Comment 9

•

17 years ago

Attached patch patch attempting to avoid extra loads for the floating point constants by stuffing them into unused matrix slots. This made things slower. :( — Details — Splinter Review

Bobby Holley (:bholley)

Assignee

Comment 10

•

17 years ago

Status update: I've made good progress in getting the performance hit down on color management in various ways. The path is pretty strongly optimized in SSE2 assembly, which is represented in the above patches. The patches must be applied in order. With everything but the top patch applied, talos regressions are down to about 4% on linux and 7% on windows. I'm not sure why windows is slower. The build for that is here (https://build.mozilla.org/tryserver-builds/2008-08-07_14:03-bholley@mozilla.com-bholleyuberpatch/), and I've been meaning to take a look at the generated code but haven't had time. The last patch was somewhat troublesome. It should make things faster since it removes two memory loads, but for some reason it makes things about 10 talos points slower on both platforms. The only thing I can think of might be the re-ordering of the shuffle instructions having some impact on reordering, but I'm not sure. MMoy, can you weigh in on this? I have some prototype patches (not posted here) that suggest other ways to improve performance and hopefully get the windows numbers under 5%. For one, right now RGBs with alpha aren't fast-pathed simply because I didn't have time to deal with the special case of 32-bit pixels instead of 24-bit pixels. The numbers suggest that this may increase performance by close to 10 points. Furthermore, right now we don't do any detection to see if an embedded profile is actually just sRGB and, if it is, to use the precache data. Numbers suggest that this may increase performance as well. I hope to get full versions of those patches running early this weekend or early next week. I'd appreciate any feedback on the patches and how to make them faster. I haven't gone through and made them super-clean yet, but they should be passable.

Michael Moy

Comment 11

•

17 years ago

I was unable to apply the first patch and was wondering if you could diff your changes into one patch. I do have some comments on your fourth patch: 1) You have a code path for NON_WINDOWS and Windows (the #else path). Using SSE2 is a little more complex. SSE2 on Windows is supported on Pentium 4 processors and up and on AMD K8 processors and up. So a runtime check on the processor to determine if it supports SSE2 is needed. If runtime support isn't available, then scaler code has to be used. The C Runtime has a global that can be used to check runtime capabilities and it is __sse2_available. SSE2 support is available if this variable is set to 1. It's used by the C Runtime to do SSE2 optimization in memcpy. Note that Windows has been supported on other platforms in the past including Alpha and PowerPC. It is best to check for Win32 and x86. (I'm using __sse2_available in one of my patches and it works in the positive way - I want to check it in the negative way on a Pentium 3 that doesn't have SSE2). SSE2 on Mac OSX Intel is always available. SSE2 on Linux processors is available on Pentium 4 and AMD K8 processors and above. Processor capabilities have to be determined at runtime by sniffing capabilities. Other platforms (Alpha, SPARC, AIX, Itanium, ARM, etc.) may or may not support SSE2. It's probably safest to look for Windows on x86, Mac OSX on x86 and Linux on x86 and use scaler code for everything else. 2) In general, it is better to use intrinsics instead of inline assembler unless you absolutely need complete control of the instructions and their order and even then, you don't have complete control. The advantages to intrinsics are: A) Portability. You can use common code for Linux, Mac OSX and Windows. Win32 supports inline assembler but Windows x64 does not support inline assembler. Intrinsics are usually easier to read for those unfamiliar with inline assembler and they're easier to read than two sets of code, one for the MSVC format and one for the GCC format. B) The compiler has the flexibility to schedule instructions. This can result in optimizations that aren't available with inline assembler. In general, GCC is more flexible in compiler scheduling with inline assembler. MSVC is pretty rigid. C variables are always read from memory so when you load a C variable, the value is written from a register to memory (if it's in a register) and then read back from memory. With intrinsics, the C values in registers can be used directly. C) Type checking by the compiler. D) Maintenance costs - it's easier to maintain one code set. 3) Most SSE2 and SSE instructions can be broadly categorized as integer or floating point. It is generally best to use floating point instructions on floating point datatypes and integer instructions on integer datatypes if at all possible. So instead of using MOVDQA, use MOVAPS for single-precision floating point. Instead of PSHUFD, use SHUFPS if possible. XORPS or ANDNPS can be used instead of PXOR. 4) In the comment: Extract the floating point representation of 65535/65536 and load it into 4 slots, the comment seems to indicate that you are loading constants but the comment on the load of xmm1 is: Move the first matrix column to xmm1. xmm2 is similar. 5) To load a vector of constants, it's usually more efficient to load a vector of constants than it is to load the first vector element and then propagate the first element to the other three.

Bobby Holley (:bholley)

Assignee

Comment 12

•

17 years ago

mmoy - I'll look into your suggestions in more detail in a bit, but I wanted to quickly let you know that I'd forgotten to mention that these patches must be applied on top of the precaching patch in bug 444661. Let me know if it still gives you trouble (it might depending on when the patch for bug 449681 lands, after which point these patches will need to be rebased).

Bobby Holley (:bholley)

Assignee

Comment 13

•

17 years ago

Attached patch updated float path patch (obsolete) — Details — Splinter Review

Added an update float path patch. This isn't so much of an optimization itself as it is a platform for other optimizations. Flagging vlad for review. If all looks good, I'll check it in tomorrow morning.

Attachment #333039 - Attachment is obsolete: true

Attachment #334056 - Flags: review?(vladimir)

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Comment 14

•

17 years ago

Comment on attachment 334056 [details] [diff] [review] updated float path patch Looks fine; "FLOATPATH" feels a little weird to me (I would've called it FLOATMATH or something), but looks good.

Attachment #334056 - Flags: review?(vladimir) → review+

Bobby Holley (:bholley)

Assignee

Comment 15

•

17 years ago

Attached patch floating point patch renaming flag to cmsFLAGS_FLOATSHAPER — Details — Splinter Review

picked a more appropriate flag. This is a trivial change so not flagging it for re-review. I'll be landing this shortly

Attachment #334056 - Attachment is obsolete: true

Bobby Holley (:bholley)

Assignee

Comment 16

•

17 years ago

float patch pushed in 6d3fca48dd13

Bobby Holley (:bholley)

Assignee

Comment 17

•

17 years ago

Attached patch patch to set up aligned space for assembly — Details — Splinter Review

another infrastructure patch - flagging vlad for review. After this the assembly stuff should apply.

Attachment #333040 - Attachment is obsolete: true

Attachment #334126 - Flags: review?(vladimir)

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Updated

•

17 years ago

Attachment #334126 - Flags: review?(vladimir) → review+

Michael Moy

Comment 18

•

17 years ago

Alignment on 16 bytes on the stack is usually done with __declspec(align(16)) for MSVC and __attribute__ ((aligned (16))) for gcc.

Bobby Holley (:bholley)

Assignee

Comment 19

•

17 years ago

mmoy - nothing similar on the heap though right? I know it's possible with memalign, but dealing with the different allocators proved to be more of a pain than simply hacking it manually.

Bobby Holley (:bholley)

Assignee

Comment 20

•

17 years ago

also, mmoy - given the weird semantics of shufps, is it better to have two shufps instructions or one pshufd instruction?

Michael Moy

Comment 21

•

17 years ago

I think that you'd just do a shufps xmmn, xmmn to get similar semantics to pshufd. If you need it in another register, you'd just do a movdqa to a spare register and then do a shufps xmmn, xmmn to move things to where you want them. On modern intel processors, movdqa takes one cycle. The pshufd and shufps instructions take one to two cycles depending on the processor.

Bobby Holley (:bholley)

Assignee

Comment 22

•

17 years ago

is there a reason you'd suggest movdqa over movaps?

Michael Moy

Comment 23

•

17 years ago

Old age. You're right, it's movaps.

Bobby Holley (:bholley)

Assignee

Comment 24

•

17 years ago

aligned space patch pushed in fd996aff506b

Bobby Holley (:bholley)

Assignee

Comment 25

•

17 years ago

Attached patch update floating point sse2 patch (obsolete) — Details — Splinter Review

Adding an updated patch for floating point sse2 stuff. Flagging vlad for review. If you get a chance to review this in the next couple of hours vlad that would be sweet. A few comments: 1-mmoy - I agree that the portability of intrinsics makes them a desirable option. However, at this point it's an extra variable that I don't have time to deal with before a2, since there are several issues involved (compiler support, whether it's as fast as the straight assembler, and me personally learning the intrinsic syntax). I hope to investigate it after we get alpha 2 out the door with color management on. 2-there are several optimizations not in the patch. For one, there are a couple of mutually exclusive ways to reduce memory accesses (shoving the vector into the free matrix slots, shoving the constants into the free matrix slots), and also loading 4 copies of the constants at a time. I want to land this first so that it's easier to obtain perf numbers for each other different options. Additionally, this still doesn't include rgba fastpathing and srgb detection. Those are in the pipe.

Attachment #333041 - Attachment is obsolete: true

Attachment #334195 - Flags: review?(vladimir)

Bobby Holley (:bholley)

Assignee

Comment 26

•

17 years ago

Attached patch updated patch to let rgba pngs go through the sse2 code — Details — Splinter Review

added the patch to fastpath rgba. This must be applied on top of the sse2 patch. Flagging vlad for review.

Attachment #334197 - Flags: review?(vladimir)

Bobby Holley (:bholley)

Assignee

Comment 27

•

17 years ago

Attached patch small fix from previous sse2 patch (obsolete) — Details — Splinter Review

there was some unexpected windows bustage as a result of using a macro for the CPUID assembler (the assembler gets spit out as one line, but MASM wants newlines between instructions). Obsoleting the old patch and flagging vlad for review on this one.

Attachment #334195 - Attachment is obsolete: true

Attachment #334201 - Flags: review?(vladimir)

Attachment #334195 - Flags: review?(vladimir)

Bobby Holley (:bholley)

Assignee

Comment 28

•

17 years ago

Attached patch updated sse2 patch (again) — Details — Splinter Review

fixed some silliness on my part after switching from a macro to a function.

Attachment #334201 - Attachment is obsolete: true

Attachment #334203 - Flags: review?(vladimir)

Attachment #334201 - Flags: review?(vladimir)

Rob Arnold [:robarnold]

Comment 29

•

17 years ago

Attached patch msvc fix for the updated sse2 patch — Details — Splinter Review

That assembly doesn't quite do what it should for windows.

Attachment #334206 - Flags: review?(vladimir)

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Comment 30

•

17 years ago

Comment on attachment 334203 [details] [diff] [review] updated sse2 patch (again) Ok, with comments: +// Determine if we can build with SSE2 (this was partly copied from jmorecfg.h in +// mozilla/jpeg) // ------------------------------------------------------------------------- +#if defined(XP_WIN32) && defined(_M_IX86) && !defined(__GNUC__) No XP_WIN32 -- that's a mozilla-only #define. >+#define SSE2_EDX_MASK (1UL << 26) >+static LCMSBOOL SSE2Available() { >+ DWORD a, b, c, d; >+ DWORD function = 0x00000001; >+ >+// If we don't have compile-time support, we don't have runtime support >+#ifndef HAVE_SSE2_INTEL_MNEMONICS >+ return FALSE; >+#else >+ /* We have CPUID macros defined if we have sse2 mnemonics. */ >+ LCMSCPUID(function, &a, &b, &c, &d); >+ if (d & SSE2_EDX_MASK) >+ return TRUE; >+ return FALSE; >+#endif >+ >+} This isn't used critically in the code, but you should still cache the results in a local static var and avoid running CPUID again if you've already ran it once. >+#ifdef NON_WINDOWS #ifdef __GNUC__ please

Attachment #334203 - Flags: review?(vladimir) → review+

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Updated

•

17 years ago

Attachment #334206 - Flags: review?(vladimir) → review+

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Updated

•

17 years ago

Attachment #334197 - Flags: review?(vladimir) → review+

Bobby Holley (:bholley)

Assignee

Comment 31

•

17 years ago

this has all since landed in 88831bfc0a4d, fd996aff506b, and 6d3fca48dd13. Resolving as fixed.

Bobby Holley (:bholley)

Assignee

Updated

•

17 years ago

Status: ASSIGNED → RESOLVED

Closed: 17 years ago

Resolution: --- → FIXED

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 32

•

17 years ago

Comment on attachment 334109 [details] [diff] [review] floating point patch renaming flag to cmsFLAGS_FLOATSHAPER In lcms.h, #ifdef USE_INLINE: >+LCMS_INLINE int FromFloatDomain(FLOAT a) { return (int) (a * 65536.0f + 0.5f); } In cmsmtrx.c, #ifndef USE_INLINE: >+int FromFloatDomain(Float a) >+{ >+ return (int) (a * 65536f); >+} Shouldn't these two versions of FromFloatDomain be the same?

Bobby Holley (:bholley)

Assignee

Comment 33

•

17 years ago

(In reply to comment #32) > Shouldn't these two versions of FromFloatDomain be the same? Probably. Not sure if this is the source of the issues in bug 465088 though, since LCMS_INLINE is always defined AFAIK.

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 34

•

17 years ago

(In reply to comment #33) > Probably. Not sure if this is the source of the issues in bug 465088 though, > since LCMS_INLINE is always defined AFAIK. I presumed it was unrelated; I just happened to notice while looking into that.

Bobby Holley (:bholley)

Assignee

Comment 35

•

17 years ago

filed bug 465111 for this issue

Nobody; OK to take it and work on it

Updated

•

16 years ago

Product: Core → Core Graveyard

Bobby Holley (:bholley)

Assignee

Updated

•

16 years ago

Component: GFX → GFX: Color Management

Product: Core Graveyard → Core

QA Contact: general → color-management

You need to log in before you can comment on or make changes to this bug.