Closed Bug 512865 Opened 15 years ago Closed 15 years ago

QCMS: improve SSE2 performance, add SSE support

Categories

(Core :: Graphics: Color Management, defect)

x86
All
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: swsnyder, Assigned: swsnyder)

References

Details

(Keywords: perf)

Attachments

(3 files, 5 obsolete files)

This patch greatly improves the performance of QCMS transformations on x86 & x86_64 systems.  Some notes:

0. On 32-bit x86 systems it does runtime selection between non-SIMD, SSE, and SSE2 code paths.

1. On x86_64 systems the SSE2 code path is always taken.  The non-SIMD and SSE code paths are left intact, but contemporary versions of the GCC and MSVC compilers will see that they cannot be reached and optimize them away.

2. The execution of the SSE2 code path is reduced by 67%, relative to the original Intel/Microsoft formatted ASM code.  The relative performance is seen on a Pentium4 (Northwood) 2.4GHz CPU with DDR1 RAM.

3. The SSE code path provides a 80% reduction in execution time, relative to the non-SIMD code path.  The relative performance is seen on a Pentium3 (Coppermine) 1.26GHz CPU with SDRAM.

4. The patch includes a GCC-specific modification to the Makefile.  This is to enable the generation of SSE/SSE2 instructions on systems where those instructions are not supported natively.  At optimization levels below -O3 GCC does not initiate the generation of SIMD instructions itself, leaving the SIMD code confined to the SIMD code paths.  MSVC is much more accomodating about generating code that will not run on the build machine.

5. All code is via intrinsics common to the GCC, MSVC and Intel compilers.  Versions tested GCC=4.3.2, MSVC=2005/SP1, Intel=10.1.

The Mercurial-generated patch isn't pretty to look at, as it interleaves the added and subtracted lines. Let me know if you want a much more readable plain-old-fashioned diff patch.
Attachment #396934 - Flags: review?(jmuizelaar)
Applies SIMD performance improvements to the 1.9.1 branch.  Patched against Firefox 3.5.3 /Seamonkey 2.0b1 source.
Assignee: nobody → swsnyder
Attachment #396937 - Flags: review?(jmuizelaar)
Steve,

Although you add -msse2 to CFALGS, I think that this change causes SSE2 code is generated for all QCMS source code.  So, the generated binary won't work on non-SSE2 CPU such as Pentium 3.

So, you should add a condition for MacOS X intel and Unix x86_64 in Makefile to add -msse2 to CFLAGS.
(In reply to comment #2)
> Steve,
> 
> Although you add -msse2 to CFALGS, I think that this change causes SSE2 code is
> generated for all QCMS source code.  So, the generated binary won't work on
> non-SSE2 CPU such as Pentium 3.

I concede that this might be  problem for future versions of of GCC, but for versions through v4.3.2 it is not.  The automatic generation of vector instructions requires optimization level -O3 (actually, it requires -ftree-vectorize which is one of the few differences between -O2 and -O3).  Now that I think of it, adding -fno-tree-vectorize to the make file rather that -O3 would have been cleaner.  Still, my point is that the -O2 precludes the automatic generation of SSE2 instructions where they are not explicitly asked for via compiler intrinsics.

I can't test any Mac-related modifications.
> to the make file rather that -O3

edit: "to the make file rather than -O2"
(In reply to comment #0)
> Created an attachment (id=396934) [details]
> Improves SSE/SSE2 transforms - trunk
> 
> This patch greatly improves the performance of QCMS transformations on x86 &
> x86_64 systems.  Some notes:
> 
> 0. On 32-bit x86 systems it does runtime selection between non-SIMD, SSE, and
> SSE2 code paths.
> 
> 1. On x86_64 systems the SSE2 code path is always taken.  The non-SIMD and SSE
> code paths are left intact, but contemporary versions of the GCC and MSVC
> compilers will see that they cannot be reached and optimize them away.
> 
> 2. The execution of the SSE2 code path is reduced by 67%, relative to the
> original Intel/Microsoft formatted ASM code.  The relative performance is seen
> on a Pentium4 (Northwood) 2.4GHz CPU with DDR1 RAM.
> 

Why is it faster? i.e. where do the performance improvements come from?

I'm worried about the negative performance impact of using intrinsics instead of gcc/msvc inline asm. gcc's code generation seems to vary from good to really poor when compiling intrinsics.
> Why is it faster? i.e. where do the performance improvements come from?

1. For SSE systems (Pentium3) it is going from non-SIMD to SIMD.  Even with MSVC's less-than-steller MMX code gen it is still a no-brainer to go the SIMD route.

2. Generally, there is much less work done within the loops.  No shuffling of max and scaling values on each pixel; reduced pointer dereferencing.

3. XMM registers can retain values across individual pixels. (The original code treats every pixel as a unique, stand-alone item.  Any set-up must be done over and over again.)

4. The compiler can optimize the entire function, taking all register usage into account.  This as opposed to having to skip over the ASM sections when considering register allocation.

I strongly recommend a quick benchmarking session if you have doubts.  The relative performance improvement is not ambiguous.
I've split the transforms out into separate files to fix the -msse/-msse2 problems.

I've also tested this on OS X and get a great speed up there too.
Attachment #401959 - Flags: review?(jmuizelaar)
Rewrites the the cpu detection to be more consistent with the previous style. Most importantly it uses sse_version_available instead simd_available which makes more sense for a function returning a number.
Attachment #401959 - Attachment is obsolete: true
Attachment #402138 - Flags: review?(jmuizelaar)
Attachment #401959 - Flags: review?(jmuizelaar)
Comment on attachment 402138 [details] [diff] [review]
change to cpu detection to match current style

>+void qcms_transform_data_rgb_out_lut_sse1(qcms_transform *transform,
>+                                          unsigned char *src,
>+                                          unsigned char *dest,
>+                                          size_t length)
>+{

[snip]

>+
>+    /* deref *transform now to avoid it in loop */
>+    const uint32_t *igtbl_r = (uint32_t*)transform->input_gamma_table_r;
>+    const uint32_t *igtbl_g = (uint32_t*)transform->input_gamma_table_g;
>+    const uint32_t *igtbl_b = (uint32_t*)transform->input_gamma_table_b;

Why is the type igtlb_[rgb] uint32_t * instead of float *?

>+
>+    /* deref *transform now to avoid it in loop */
>+    const uint8_t *otdata_r = &transform->output_table_r->data[0];
>+    const uint8_t *otdata_g = &transform->output_table_g->data[0];
>+    const uint8_t *otdata_b = &transform->output_table_b->data[0];
>+

[snip]

>+
>+    /* CYA */
>+    if (!length)
>+        return;
>+
>+    /* one pixel is handled outside of the loop */
>+    length--;
>+
>+    /* setup for transforming 1st pixel */
>+    vec_r = _mm_load_ss( (float *)&igtbl_r[src[0]] );
>+    vec_g = _mm_load_ss( (float *)&igtbl_g[src[1]] );
>+    vec_b = _mm_load_ss( (float *)&igtbl_b[src[2]] );

These function call shouldn't have the extra spaces around the arguments.


>+    src += 3;
>+
>+    /* transform all but final pixel */
>+
>+    for (i=0; i<length; i++)
>+    {
>+        /* position values from gamma tables */
>+        vec_r = _mm_shuffle_ps(vec_r, vec_r, 0);

[snip]

>+
>+        /* load for next loop while store completes */
>+        vec_r = _mm_load_ss( (float *)&igtbl_r[src[0]] );
>+        vec_g = _mm_load_ss( (float *)&igtbl_g[src[1]] );
>+        vec_b = _mm_load_ss( (float *)&igtbl_b[src[2]] );
>+        src += 3;

Same here and all other places like this.
Attachment #402138 - Flags: review?(jmuizelaar) → review-
> Why is the type igtlb_[rgb] uint32_t * instead of float *?

I don't have a justification for this, probably a relic of (my) prior versions of the code.  Certainly, it should be float, which also allows for the removing of the "(float *)" casts when the values are actually used.

> These function call shouldn't have the extra spaces around the arguments.

OK.  Seemed more readable to me, but I don't have a commitment to any particular style.

Should I resubmit the patches with these nits picked?  If so, where does that leave the patches you've previously posted?
Attached patch Version with nits picked (obsolete) — Splinter Review
(In reply to comment #10)
> Should I resubmit the patches with these nits picked?  If so, where does that
> leave the patches you've previously posted?

There were some build problems with last patch I posted, so I just picked the nits while I was at it. Does this patch look good to you?
Attachment #396934 - Attachment is obsolete: true
Attachment #402138 - Attachment is obsolete: true
Attachment #402370 - Flags: review?
Attachment #396934 - Flags: review?(jmuizelaar)
Attachment #402370 - Flags: review? → review?(swsnyder)
(In reply to comment #11)
> Created an attachment (id=402370) [details]
> Version with nits picked
> 
> (In reply to comment #10)
> > Should I resubmit the patches with these nits picked?  If so, where does that
> > leave the patches you've previously posted?
> 
> There were some build problems with last patch I posted, so I just picked the
> nits while I was at it. Does this patch look good to you?

Yes, it does look good.  I've never formally approved a review request, and don't know what the procedure is.  Informally, though, I do approve of it.
(In reply to comment #12)
> (In reply to comment #11)
> > Created an attachment (id=402370) [details] [details]
> > Version with nits picked
> > 
> > (In reply to comment #10)
> > > Should I resubmit the patches with these nits picked?  If so, where does that
> > > leave the patches you've previously posted?
> > 
> > There were some build problems with last patch I posted, so I just picked the
> > nits while I was at it. Does this patch look good to you?
> 
> Yes, it does look good.  I've never formally approved a review request, and
> don't know what the procedure is.  Informally, though, I do approve of it.

Just change the '?' to a '+' on the details page of the patch. I don't think there is much formal procedure beyond that. The informal approval is probably sufficient anyways.
Attachment #402370 - Flags: review?(swsnyder) → review+
http://hg.mozilla.org/mozilla-central/rev/cb4f078cc8cb
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
and thanks for the patch Steve.
Flags: wanted1.9.2?
Linux debug builds are failing to complete -CreateProfile. Requires bug 518634 to get symbols in a stack.
Depends on: 518634
Looks like a crash on exit, fwiw:

python leaktest.py -- -CreateProfile default
...
nsStringStats
 => mAllocCount:          21661
 => mReallocCount:         3839
 => mFreeCount:           14382  --  LEAKED 7279 !!!
 => mShareCount:          18663
 => mAdoptCount:           2111
 => mAdoptFreeCount:       2108  --  LEAKED 3 !!!
TEST-UNEXPECTED-FAIL | automation.py | Exited with code 1 during test run
INFO | automation.py | Application ran for: 0:00:11.722302
TEST-UNEXPECTED-FAIL | automation.py | application crashed (minidump found)
program finished with exit code 255

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1253821487.1253825353.14401.gz&fulltext=1
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Anyone have any idea why we would be crashing on the leak test boxes and not the unit test boxes?
Leak test boxes are debug builds, FWIW.
I can't seem to reproduce this locally...
It just occurred to me that this could likely be caused by unaligned stack loads. I'll investigate more when I get the chance.
This might fix the problem, I've pushed it to see.

http://hg.mozilla.org/mozilla-central/rev/8a43f01f1d64
Attachment #402370 - Attachment is obsolete: true
Looks like it did.
So this is fixed, right?
Status: REOPENED → RESOLVED
Closed: 15 years ago15 years ago
Resolution: --- → FIXED
This patch has broken cross compilation for me. To test target CPU we should check TARGET_CPU instead of OS_TEST.
Attachment #405492 - Flags: review?(swsnyder)
Having the perf improvement on SSE systems is obviously great, but does this code still work on older non-SSE systems?

(I'm not sure which branches still support non-SSE, or even which branches these changes will be applied to... hence the question.)
(In reply to comment #28)
> Having the perf improvement on SSE systems is obviously great, but does this
> code still work on older non-SSE systems?
> 
> (I'm not sure which branches still support non-SSE, or even which branches
> these changes will be applied to... hence the question.)

There is a plain C version that will work on all platforms and CPUs
(In reply to comment #29)
> (In reply to comment #28)
> > Having the perf improvement on SSE systems is obviously great, but does this
> > code still work on older non-SSE systems?
> > 
> > (I'm not sure which branches still support non-SSE, or even which branches
> > these changes will be applied to... hence the question.)
> 
> There is a plain C version that will work on all platforms and CPUs

Great! Thanks for the info, Jeff.
(In reply to comment #27)
> Created an attachment (id=405492) [details]
> Fixed crosscompilation on Linux using mingw.
> 
> This patch has broken cross compilation for me. To test target CPU we should
> check TARGET_CPU instead of OS_TEST.

What are the differing values of these macros in mingw?

I get the same value in x86 Windows (VS2005) of "i686" for both macros.

In x86_64 Linux (GCC v4.3.1) I get "x86_64" for both macros.

So... what does TARGET_CPU convey that OS_TEST does not?
Depends on: 521549
I'm crosscompiling on Linux 64-bit to Windows 32-bit using mingw. My TARGET_CPU is "i486", but OS_TEST is "AMD Phenom(tm) 9550 Quad-Core Processor", just as `uname -p` says.
(In reply to comment #32)
> I'm crosscompiling on Linux 64-bit to Windows 32-bit using mingw. My TARGET_CPU
> is "i486", but OS_TEST is "AMD Phenom(tm) 9550 Quad-Core Processor", just as
> `uname -p` says.

Jacek, you should file new bug to fix configure.in.  The build config expects that OS_TEST is i?86 when x86 target.
Depends on: 522440
Follow-up patch posted in Bug 523584.
Attachment #396937 - Flags: review?(jmuizelaar)
(In reply to comment #33)
> Jacek, you should file new bug to fix configure.in.  The build config expects
> that OS_TEST is i?86 when x86 target.

I've filled bug 526302. Thanks.
(In reply to comment #24)
> Created an attachment (id=404079) [details]
> Add __force_align_arg_pointer__ to align the stack
> 
> This might fix the problem, I've pushed it to see.
> 
> http://hg.mozilla.org/mozilla-central/rev/8a43f01f1d64

FYI, __force_align_arg_pointer__ is not valid for GCC versions prior to v4.2.
> FYI, __force_align_arg_pointer__ is not valid for GCC versions prior to v4.2.

I filed bug 528131.  This check is needed to support SSE2 on pixman. (Bug 488851)
Attachment #405492 - Attachment is obsolete: true
Attachment #405492 - Flags: review?(swsnyder)
Flags: wanted1.9.2?
You need to log in before you can comment on or make changes to this bug.