Note: There are a few cases of duplicates in user autocompletion which are being worked on.

QCMS: improve SSE2 performance, add SSE support

RESOLVED FIXED

Status

()

Core
GFX: Color Management
RESOLVED FIXED
8 years ago
5 years ago

People

(Reporter: Steve Snyder, Assigned: Steve Snyder)

Tracking

({perf})

Trunk
x86
All
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 5 obsolete attachments)

(Assignee)

Description

8 years ago
Created attachment 396934 [details] [diff] [review]
Improves SSE/SSE2 transforms - trunk

This patch greatly improves the performance of QCMS transformations on x86 & x86_64 systems.  Some notes:

0. On 32-bit x86 systems it does runtime selection between non-SIMD, SSE, and SSE2 code paths.

1. On x86_64 systems the SSE2 code path is always taken.  The non-SIMD and SSE code paths are left intact, but contemporary versions of the GCC and MSVC compilers will see that they cannot be reached and optimize them away.

2. The execution of the SSE2 code path is reduced by 67%, relative to the original Intel/Microsoft formatted ASM code.  The relative performance is seen on a Pentium4 (Northwood) 2.4GHz CPU with DDR1 RAM.

3. The SSE code path provides a 80% reduction in execution time, relative to the non-SIMD code path.  The relative performance is seen on a Pentium3 (Coppermine) 1.26GHz CPU with SDRAM.

4. The patch includes a GCC-specific modification to the Makefile.  This is to enable the generation of SSE/SSE2 instructions on systems where those instructions are not supported natively.  At optimization levels below -O3 GCC does not initiate the generation of SIMD instructions itself, leaving the SIMD code confined to the SIMD code paths.  MSVC is much more accomodating about generating code that will not run on the build machine.

5. All code is via intrinsics common to the GCC, MSVC and Intel compilers.  Versions tested GCC=4.3.2, MSVC=2005/SP1, Intel=10.1.

The Mercurial-generated patch isn't pretty to look at, as it interleaves the added and subtracted lines. Let me know if you want a much more readable plain-old-fashioned diff patch.
Attachment #396934 - Flags: review?(jmuizelaar)
(Assignee)

Comment 1

8 years ago
Created attachment 396937 [details] [diff] [review]
Nearly the same as above, but applies to 1.9.1 branch

Applies SIMD performance improvements to the 1.9.1 branch.  Patched against Firefox 3.5.3 /Seamonkey 2.0b1 source.
Assignee: nobody → swsnyder
Attachment #396937 - Flags: review?(jmuizelaar)
Steve,

Although you add -msse2 to CFALGS, I think that this change causes SSE2 code is generated for all QCMS source code.  So, the generated binary won't work on non-SSE2 CPU such as Pentium 3.

So, you should add a condition for MacOS X intel and Unix x86_64 in Makefile to add -msse2 to CFLAGS.
(Assignee)

Comment 3

8 years ago
(In reply to comment #2)
> Steve,
> 
> Although you add -msse2 to CFALGS, I think that this change causes SSE2 code is
> generated for all QCMS source code.  So, the generated binary won't work on
> non-SSE2 CPU such as Pentium 3.

I concede that this might be  problem for future versions of of GCC, but for versions through v4.3.2 it is not.  The automatic generation of vector instructions requires optimization level -O3 (actually, it requires -ftree-vectorize which is one of the few differences between -O2 and -O3).  Now that I think of it, adding -fno-tree-vectorize to the make file rather that -O3 would have been cleaner.  Still, my point is that the -O2 precludes the automatic generation of SSE2 instructions where they are not explicitly asked for via compiler intrinsics.

I can't test any Mac-related modifications.
(Assignee)

Comment 4

8 years ago
> to the make file rather that -O3

edit: "to the make file rather than -O2"
(In reply to comment #0)
> Created an attachment (id=396934) [details]
> Improves SSE/SSE2 transforms - trunk
> 
> This patch greatly improves the performance of QCMS transformations on x86 &
> x86_64 systems.  Some notes:
> 
> 0. On 32-bit x86 systems it does runtime selection between non-SIMD, SSE, and
> SSE2 code paths.
> 
> 1. On x86_64 systems the SSE2 code path is always taken.  The non-SIMD and SSE
> code paths are left intact, but contemporary versions of the GCC and MSVC
> compilers will see that they cannot be reached and optimize them away.
> 
> 2. The execution of the SSE2 code path is reduced by 67%, relative to the
> original Intel/Microsoft formatted ASM code.  The relative performance is seen
> on a Pentium4 (Northwood) 2.4GHz CPU with DDR1 RAM.
> 

Why is it faster? i.e. where do the performance improvements come from?

I'm worried about the negative performance impact of using intrinsics instead of gcc/msvc inline asm. gcc's code generation seems to vary from good to really poor when compiling intrinsics.
(Assignee)

Comment 6

8 years ago
> Why is it faster? i.e. where do the performance improvements come from?

1. For SSE systems (Pentium3) it is going from non-SIMD to SIMD.  Even with MSVC's less-than-steller MMX code gen it is still a no-brainer to go the SIMD route.

2. Generally, there is much less work done within the loops.  No shuffling of max and scaling values on each pixel; reduced pointer dereferencing.

3. XMM registers can retain values across individual pixels. (The original code treats every pixel as a unique, stand-alone item.  Any set-up must be done over and over again.)

4. The compiler can optimize the entire function, taking all register usage into account.  This as opposed to having to skip over the ASM sections when considering register allocation.

I strongly recommend a quick benchmarking session if you have doubts.  The relative performance improvement is not ambiguous.
Created attachment 401959 [details] [diff] [review]
split improved sse/sse2 transforms out into separate files

I've split the transforms out into separate files to fix the -msse/-msse2 problems.

I've also tested this on OS X and get a great speed up there too.
Attachment #401959 - Flags: review?(jmuizelaar)
Created attachment 402138 [details] [diff] [review]
change to cpu detection to match current style

Rewrites the the cpu detection to be more consistent with the previous style. Most importantly it uses sse_version_available instead simd_available which makes more sense for a function returning a number.
Attachment #401959 - Attachment is obsolete: true
Attachment #402138 - Flags: review?(jmuizelaar)
Attachment #401959 - Flags: review?(jmuizelaar)
Comment on attachment 402138 [details] [diff] [review]
change to cpu detection to match current style

>+void qcms_transform_data_rgb_out_lut_sse1(qcms_transform *transform,
>+                                          unsigned char *src,
>+                                          unsigned char *dest,
>+                                          size_t length)
>+{

[snip]

>+
>+    /* deref *transform now to avoid it in loop */
>+    const uint32_t *igtbl_r = (uint32_t*)transform->input_gamma_table_r;
>+    const uint32_t *igtbl_g = (uint32_t*)transform->input_gamma_table_g;
>+    const uint32_t *igtbl_b = (uint32_t*)transform->input_gamma_table_b;

Why is the type igtlb_[rgb] uint32_t * instead of float *?

>+
>+    /* deref *transform now to avoid it in loop */
>+    const uint8_t *otdata_r = &transform->output_table_r->data[0];
>+    const uint8_t *otdata_g = &transform->output_table_g->data[0];
>+    const uint8_t *otdata_b = &transform->output_table_b->data[0];
>+

[snip]

>+
>+    /* CYA */
>+    if (!length)
>+        return;
>+
>+    /* one pixel is handled outside of the loop */
>+    length--;
>+
>+    /* setup for transforming 1st pixel */
>+    vec_r = _mm_load_ss( (float *)&igtbl_r[src[0]] );
>+    vec_g = _mm_load_ss( (float *)&igtbl_g[src[1]] );
>+    vec_b = _mm_load_ss( (float *)&igtbl_b[src[2]] );

These function call shouldn't have the extra spaces around the arguments.


>+    src += 3;
>+
>+    /* transform all but final pixel */
>+
>+    for (i=0; i<length; i++)
>+    {
>+        /* position values from gamma tables */
>+        vec_r = _mm_shuffle_ps(vec_r, vec_r, 0);

[snip]

>+
>+        /* load for next loop while store completes */
>+        vec_r = _mm_load_ss( (float *)&igtbl_r[src[0]] );
>+        vec_g = _mm_load_ss( (float *)&igtbl_g[src[1]] );
>+        vec_b = _mm_load_ss( (float *)&igtbl_b[src[2]] );
>+        src += 3;

Same here and all other places like this.
Attachment #402138 - Flags: review?(jmuizelaar) → review-
(Assignee)

Comment 10

8 years ago
> Why is the type igtlb_[rgb] uint32_t * instead of float *?

I don't have a justification for this, probably a relic of (my) prior versions of the code.  Certainly, it should be float, which also allows for the removing of the "(float *)" casts when the values are actually used.

> These function call shouldn't have the extra spaces around the arguments.

OK.  Seemed more readable to me, but I don't have a commitment to any particular style.

Should I resubmit the patches with these nits picked?  If so, where does that leave the patches you've previously posted?
Created attachment 402370 [details] [diff] [review]
Version with nits picked

(In reply to comment #10)
> Should I resubmit the patches with these nits picked?  If so, where does that
> leave the patches you've previously posted?

There were some build problems with last patch I posted, so I just picked the nits while I was at it. Does this patch look good to you?
Attachment #396934 - Attachment is obsolete: true
Attachment #402138 - Attachment is obsolete: true
Attachment #402370 - Flags: review?
Attachment #396934 - Flags: review?(jmuizelaar)
Attachment #402370 - Flags: review? → review?(swsnyder)
(Assignee)

Comment 12

8 years ago
(In reply to comment #11)
> Created an attachment (id=402370) [details]
> Version with nits picked
> 
> (In reply to comment #10)
> > Should I resubmit the patches with these nits picked?  If so, where does that
> > leave the patches you've previously posted?
> 
> There were some build problems with last patch I posted, so I just picked the
> nits while I was at it. Does this patch look good to you?

Yes, it does look good.  I've never formally approved a review request, and don't know what the procedure is.  Informally, though, I do approve of it.
(In reply to comment #12)
> (In reply to comment #11)
> > Created an attachment (id=402370) [details] [details]
> > Version with nits picked
> > 
> > (In reply to comment #10)
> > > Should I resubmit the patches with these nits picked?  If so, where does that
> > > leave the patches you've previously posted?
> > 
> > There were some build problems with last patch I posted, so I just picked the
> > nits while I was at it. Does this patch look good to you?
> 
> Yes, it does look good.  I've never formally approved a review request, and
> don't know what the procedure is.  Informally, though, I do approve of it.

Just change the '?' to a '+' on the details page of the patch. I don't think there is much formal procedure beyond that. The informal approval is probably sufficient anyways.
(Assignee)

Updated

8 years ago
Attachment #402370 - Flags: review?(swsnyder) → review+
http://hg.mozilla.org/mozilla-central/rev/cb4f078cc8cb
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
and thanks for the patch Steve.
Flags: wanted1.9.2?
Linux debug builds are failing to complete -CreateProfile. Requires bug 518634 to get symbols in a stack.
Depends on: 518634
Looks like a crash on exit, fwiw:

python leaktest.py -- -CreateProfile default
...
nsStringStats
 => mAllocCount:          21661
 => mReallocCount:         3839
 => mFreeCount:           14382  --  LEAKED 7279 !!!
 => mShareCount:          18663
 => mAdoptCount:           2111
 => mAdoptFreeCount:       2108  --  LEAKED 3 !!!
TEST-UNEXPECTED-FAIL | automation.py | Exited with code 1 during test run
INFO | automation.py | Application ran for: 0:00:11.722302
TEST-UNEXPECTED-FAIL | automation.py | application crashed (minidump found)
program finished with exit code 255

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1253821487.1253825353.14401.gz&fulltext=1
Created attachment 402718 [details]
Stack for Linux crash

Updated

8 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I've backed this out now to investigate:
http://hg.mozilla.org/mozilla-central/rev/42ccb2cdeac0
Anyone have any idea why we would be crashing on the leak test boxes and not the unit test boxes?
Leak test boxes are debug builds, FWIW.
I can't seem to reproduce this locally...
It just occurred to me that this could likely be caused by unaligned stack loads. I'll investigate more when I get the chance.
Created attachment 404079 [details] [diff] [review]
Add __force_align_arg_pointer__ to align the stack

This might fix the problem, I've pushed it to see.

http://hg.mozilla.org/mozilla-central/rev/8a43f01f1d64
Attachment #402370 - Attachment is obsolete: true
Looks like it did.
So this is fixed, right?
Status: REOPENED → RESOLVED
Last Resolved: 8 years ago8 years ago
Resolution: --- → FIXED

Comment 27

8 years ago
Created attachment 405492 [details] [diff] [review]
Fixed crosscompilation on Linux using mingw.

This patch has broken cross compilation for me. To test target CPU we should check TARGET_CPU instead of OS_TEST.

Updated

8 years ago
Attachment #405492 - Flags: review?(swsnyder)
Having the perf improvement on SSE systems is obviously great, but does this code still work on older non-SSE systems?

(I'm not sure which branches still support non-SSE, or even which branches these changes will be applied to... hence the question.)
(In reply to comment #28)
> Having the perf improvement on SSE systems is obviously great, but does this
> code still work on older non-SSE systems?
> 
> (I'm not sure which branches still support non-SSE, or even which branches
> these changes will be applied to... hence the question.)

There is a plain C version that will work on all platforms and CPUs
(In reply to comment #29)
> (In reply to comment #28)
> > Having the perf improvement on SSE systems is obviously great, but does this
> > code still work on older non-SSE systems?
> > 
> > (I'm not sure which branches still support non-SSE, or even which branches
> > these changes will be applied to... hence the question.)
> 
> There is a plain C version that will work on all platforms and CPUs

Great! Thanks for the info, Jeff.
(Assignee)

Comment 31

8 years ago
(In reply to comment #27)
> Created an attachment (id=405492) [details]
> Fixed crosscompilation on Linux using mingw.
> 
> This patch has broken cross compilation for me. To test target CPU we should
> check TARGET_CPU instead of OS_TEST.

What are the differing values of these macros in mingw?

I get the same value in x86 Windows (VS2005) of "i686" for both macros.

In x86_64 Linux (GCC v4.3.1) I get "x86_64" for both macros.

So... what does TARGET_CPU convey that OS_TEST does not?

Updated

8 years ago
Depends on: 521549

Comment 32

8 years ago
I'm crosscompiling on Linux 64-bit to Windows 32-bit using mingw. My TARGET_CPU is "i486", but OS_TEST is "AMD Phenom(tm) 9550 Quad-Core Processor", just as `uname -p` says.
(In reply to comment #32)
> I'm crosscompiling on Linux 64-bit to Windows 32-bit using mingw. My TARGET_CPU
> is "i486", but OS_TEST is "AMD Phenom(tm) 9550 Quad-Core Processor", just as
> `uname -p` says.

Jacek, you should file new bug to fix configure.in.  The build config expects that OS_TEST is i?86 when x86 target.

Updated

8 years ago
Depends on: 522440
(Assignee)

Comment 34

8 years ago
Follow-up patch posted in Bug 523584.
Attachment #396937 - Flags: review?(jmuizelaar)

Comment 35

8 years ago
(In reply to comment #33)
> Jacek, you should file new bug to fix configure.in.  The build config expects
> that OS_TEST is i?86 when x86 target.

I've filled bug 526302. Thanks.
(Assignee)

Comment 36

8 years ago
(In reply to comment #24)
> Created an attachment (id=404079) [details]
> Add __force_align_arg_pointer__ to align the stack
> 
> This might fix the problem, I've pushed it to see.
> 
> http://hg.mozilla.org/mozilla-central/rev/8a43f01f1d64

FYI, __force_align_arg_pointer__ is not valid for GCC versions prior to v4.2.
> FYI, __force_align_arg_pointer__ is not valid for GCC versions prior to v4.2.

I filed bug 528131.  This check is needed to support SSE2 on pixman. (Bug 488851)

Updated

8 years ago
Attachment #405492 - Attachment is obsolete: true
Attachment #405492 - Flags: review?(swsnyder)
Flags: wanted1.9.2?
You need to log in before you can comment on or make changes to this bug.