Closed Bug 512865 Opened 15 years ago Closed 15 years ago

QCMS: improve SSE2 performance, add SSE support

Tracking

Status:

RESOLVED FIXED

People

(Reporter: swsnyder, Assigned: swsnyder)

References

Details

(Keywords: perf)

Attachments

(3 files, 5 obsolete files)

Improves SSE/SSE2 transforms - trunk (15 years ago, Steve Snyder, 39.81 KB, patch)
Nearly the same as above, but applies to 1.9.1 branch (15 years ago, Steve Snyder, 36.76 KB, patch)
split improved sse/sse2 transforms out into separate files (15 years ago, Jeff Muizelaar [:jrmuizel], 40.77 KB, patch)
change to cpu detection to match current style (15 years ago, Jeff Muizelaar [:jrmuizel], 40.39 KB, patch, jrmuizel: review-)
Version with nits picked (15 years ago, Jeff Muizelaar [:jrmuizel], 40.23 KB, patch, swsnyder: review+)
Stack for Linux crash (15 years ago, Nick Thomas [:nthomas] (UTC+12), 27.51 KB, text/plain)
Add __force_align_arg_pointer__ to align the stack (15 years ago, Jeff Muizelaar [:jrmuizel], 42.06 KB, patch)
Fixed crosscompilation on Linux using mingw. (15 years ago, Jacek Caban, 575 bytes, patch)

Steve Snyder

Assignee

Description

15 years ago

Attached patch: Improves SSE/SSE2 transforms - trunk (obsolete)

This patch greatly improves the performance of QCMS transformations on x86 and x86_64 systems. Some notes:

0. On 32-bit x86 systems it does runtime selection between non-SIMD, SSE, and SSE2 code paths.
1. On x86_64 systems the SSE2 code path is always taken. The non-SIMD and SSE code paths are left intact, but contemporary versions of the GCC and MSVC compilers will see that they cannot be reached and optimize them away.
2. The execution time of the SSE2 code path is reduced by 67%, relative to the original Intel/Microsoft formatted ASM code. This was measured on a Pentium4 (Northwood) 2.4GHz CPU with DDR1 RAM.
3. The SSE code path provides an 80% reduction in execution time, relative to the non-SIMD code path. This was measured on a Pentium3 (Coppermine) 1.26GHz CPU with SDRAM.
4. The patch includes a GCC-specific modification to the Makefile. This is to enable the generation of SSE/SSE2 instructions on build systems where those instructions are not supported natively. At optimization levels below -O3 GCC does not initiate the generation of SIMD instructions itself, leaving the SIMD code confined to the SIMD code paths. MSVC is much more accommodating about generating code that will not run on the build machine.
5. All code is written with intrinsics common to the GCC, MSVC and Intel compilers. Versions tested: GCC=4.3.2, MSVC=2005/SP1, Intel=10.1.

The Mercurial-generated patch isn't pretty to look at, as it interleaves the added and subtracted lines. Let me know if you want a much more readable plain-old-fashioned diff patch.
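For readers unfamiliar with the runtime selection described in notes 0 and 1, the following is a minimal sketch of the idea, not the code from the patch. The names `simd_path`, `select_simd_path` and `simd_path_name` are hypothetical, and the sketch assumes a GCC (4.8 or later) or Clang compiler so that it can use `__builtin_cpu_supports`; the patch itself targets older compilers and would have to detect CPU features by other means.

```c
/* Hypothetical names for illustration; not identifiers from the QCMS patch. */
typedef enum { PATH_SCALAR = 0, PATH_SSE = 1, PATH_SSE2 = 2 } simd_path;

/* Pick the best available code path for the CPU we are running on. */
static simd_path select_simd_path(void)
{
#if defined(__x86_64__) || defined(_M_X64)
    /* SSE2 is part of the x86_64 baseline, so no runtime check is needed
     * and the compiler can discard the other paths as unreachable. */
    return PATH_SSE2;
#elif defined(__i386__) && (defined(__GNUC__) || defined(__clang__))
    /* Assumption: GCC >= 4.8 or Clang, which provide these builtins. */
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return PATH_SSE2;
    if (__builtin_cpu_supports("sse"))
        return PATH_SSE;
    return PATH_SCALAR;
#else
    /* Unknown architecture or compiler: stay on the portable path. */
    return PATH_SCALAR;
#endif
}

/* Human-readable name for a code path, e.g. for logging. */
static const char *simd_path_name(simd_path p)
{
    switch (p) {
    case PATH_SSE2: return "SSE2";
    case PATH_SSE:  return "SSE";
    default:        return "non-SIMD";
    }
}
```

A caller would typically run `select_simd_path()` once when a transform is created and store a function pointer to the matching implementation, so the check is not repeated per pixel.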

Attachment #396934 - Flags: review?(jmuizelaar)

Steve Snyder

Assignee

Comment 1

15 years ago

Attached patch: Nearly the same as above, but applies to 1.9.1 branch

Applies SIMD performance improvements to the 1.9.1 branch. Patched against Firefox 3.5.3 / SeaMonkey 2.0b1 source.

Assignee: nobody → swsnyder

Attachment #396937 - Flags: review?(jmuizelaar)

Makoto Kato [:m_kato]

Comment 2

15 years ago

Steve,

Although you add -msse2 to CFLAGS, I think that this change causes SSE2 code to be generated for all QCMS source code. So, the generated binary won't work on a non-SSE2 CPU such as the Pentium 3.

So, you should add a condition for Mac OS X Intel and Unix x86_64 in the Makefile to add -msse2 to CFLAGS.

Steve Snyder

Assignee

Comment 3

15 years ago

(In reply to comment #2)
> Steve,
>
> Although you add -msse2 to CFLAGS, I think that this change causes SSE2 code to be
> generated for all QCMS source code. So, the generated binary won't work on
> non-SSE2 CPU such as Pentium 3.

I concede that this might be a problem for future versions of GCC, but for versions through v4.3.2 it is not. The automatic generation of vector instructions requires optimization level -O3 (actually, it requires -ftree-vectorize, which is one of the few differences between -O2 and -O3).

Now that I think of it, adding -fno-tree-vectorize to the make file rather that -O3 would have been cleaner. Still, my point is that -O2 precludes the automatic generation of SSE2 instructions where they are not explicitly asked for via compiler intrinsics.

I can't test any Mac-related modifications.
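To illustrate what "explicitly asked for via compiler intrinsics" means in the comment above, here is a small sketch in which SSE2 instructions are requested by name in a single function. It is not taken from the patch: `scale_clamp4` is a hypothetical helper, and the `__SSE2__` guard is an assumption about how one might keep the file compiling when -msse2 is not passed.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics; pulls in the SSE ones as well */
#endif

/* Hypothetical helper: scale four floats from [0,1] to [0,255], clamp,
 * and convert to integers. */
static void scale_clamp4(const float in[4], int out[4])
{
#if defined(__SSE2__)
    /* Each intrinsic below maps to a specific SSE/SSE2 instruction. */
    __m128 v = _mm_loadu_ps(in);
    v = _mm_mul_ps(v, _mm_set1_ps(255.0f));
    v = _mm_max_ps(v, _mm_setzero_ps());
    v = _mm_min_ps(v, _mm_set1_ps(255.0f));
    /* _mm_cvtps_epi32 is the SSE2 step: packed float to packed int32,
     * rounded according to the current rounding mode. */
    _mm_storeu_si128((__m128i *)out, _mm_cvtps_epi32(v));
#else
    /* Portable fallback. Note: this rounds halves up, so exact .5 ties
     * may differ from the SSE2 path's default round-to-nearest-even. */
    for (size_t i = 0; i < 4; i++) {
        float x = in[i] * 255.0f;
        if (x < 0.0f) x = 0.0f;
        if (x > 255.0f) x = 255.0f;
        out[i] = (int)(x + 0.5f);
    }
#endif
}
```

Whether the rest of a translation unit built with -msse2 stays free of SSE2 instructions depends on the compiler version and flags, which is the point under discussion in this thread.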

Steve Snyder

Assignee

Comment 4

15 years ago

> to the make file rather that -O3

edit: "to the make file rather than -O2"