Open Bug 523584 Opened 15 years ago Updated 2 years ago

QCMS: further reduction in execution time

Tracking

()

Status:

NEW

People

(Reporter: swsnyder, Unassigned)

Details

Attachments

(4 files, 6 obsolete files)

Transform 2 pixels per loop 15 years ago Steve Snyder 32.57 KB, patch		Details \| Diff \| Splinter Review
Peripheral changes in prep for following patch 15 years ago Steve Snyder 13.31 KB, patch	jrmuizel : review+	Details \| Diff \| Splinter Review
Transform 2 pixels per loop 15 years ago Steve Snyder 20.81 KB, patch	jrmuizel : review+	Details \| Diff \| Splinter Review
Remove the 3 shuffle instructions from processing 14 years ago Steve Snyder 24.73 KB, patch		Details \| Diff \| Splinter Review
1/3: preparation for subsequent patches 14 years ago Steve Snyder 17.49 KB, patch	jrmuizel : review+	Details \| Diff \| Splinter Review
2/3: transform 2 pixels per loop 14 years ago Steve Snyder 25.16 KB, patch	jrmuizel : review+	Details \| Diff \| Splinter Review
3/3: remove need for shuffle instructions 14 years ago Steve Snyder 38.18 KB, patch		Details \| Diff \| Splinter Review
3/3: remove need for shuffle instructions (bug fix) 14 years ago Steve Snyder 38.54 KB, patch	swsnyder : review? jrmuizel	Details \| Diff \| Splinter Review
3/3: Update no-shuffle-instructions patch for Gecko 8+ 13 years ago Steve Snyder 43.03 KB, patch		Details \| Diff \| Splinter Review
Update no-shuffle-instructions patch for Gecko 8+ 13 years ago Steve Snyder 43.25 KB, patch	swsnyder : review? jrmuizel	Details \| Diff \| Splinter Review

Steve Snyder

Reporter

Description

•

15 years ago

Attached patch Transform 2 pixels per loop (obsolete) — Details — Splinter Review

[This is a follow-up patch to bug 512865.]

The attached patch reduces execution time on the SIMD code paths.  It does this by transforming 2 pixels per loop (as opposed to the single pixel/loop in the trunk code).  This allows for pointers in registers to be used longer before being forced out for other needs.

Execution times seen, relative to trunk code:

0.961 = Win32 (VS2005/SP1), Pentium3 (Coppermine), SDRAM
0.773 = Win32 (VS2005/SP1), Pentium4 (Northwood), DDR1
0.806 = Linux x86 (GCC v4.3.1), Pentium4 (Northwood), DDR1
0.957 = Linux x86_64 (GCC v4.3.1), Core2 (Penryn), DDR2

Not much benefit seen on Pentium3 (due to smaller trace cache?).  Also not much benefit on x86_64, where values could already be kept long-term in the larger register set.

However, a ~20% improvement is seen in the Pentium4 test cases.  I expect that this improvement will hold true for all post-Pentium4 processors running in a 32-bit environment.

Also, a few peripheral changes, not strictly related to the 2pixel/loop enhancement:

0. Replaced the "input" and "output" pointers with the single "indices" pointer, the better to communicate its actual purpose.

1. Added a few "const" declarations.  This is to better convey to the compiler that the relevant input values will not be changing during processing.

2. For RGBA functions, removed unneeded imtermediate storage of alpha byte (changed src-->temp-->dest to src-->dest).

Attachment #407533 - Flags: review?(jmuizelaar)

Steve Snyder

Reporter

Updated

•

15 years ago

Assignee: nobody → swsnyder

Steve Snyder

Reporter

Comment 1

•

15 years ago

Jeff: One more thing, a housekeeping note.

There is now so much duplicate code within and between the transform-sse*.c files that I wonder if maybe putting the code into macros would be justified.  The bulk of the SIMD code could be moved into a header file for inclusion by both source files.  Doing so would sacrifice some clarity to gain maintainability.

Just a thought.

Jeff Muizelaar [:jrmuizel]

Comment 2

•

15 years ago

(In reply to comment #0)
> Also, a few peripheral changes, not strictly related to the 2pixel/loop
> enhancement:
> 
> 0. Replaced the "input" and "output" pointers with the single "indices"
> pointer, the better to communicate its actual purpose.
> 
> 1. Added a few "const" declarations.  This is to better convey to the compiler
> that the relevant input values will not be changing during processing.
> 
> 2. For RGBA functions, removed unneeded imtermediate storage of alpha byte
> (changed src-->temp-->dest to src-->dest).

It would be better if these changes were split out into separate patches. It makes review and bisection easier.

Steve Snyder

Reporter

Comment 3

•

15 years ago

Attached patch Peripheral changes in prep for following patch (obsolete) — Details — Splinter Review

Some housekeeping in preparation for the following 2-pixels-per-loop patch:

1. Consolidate "input" and "output" pointers into pointer "indices".

2. Add several "const" declarations to better communicate to compilers the unchanging nature of the values being pointed to.

Attachment #407533 - Attachment is obsolete: true

Attachment #409543 - Flags: review?(jmuizelaar)

Attachment #407533 - Flags: review?(jmuizelaar)

Steve Snyder

Reporter

Comment 4

•

15 years ago

Attached patch Transform 2 pixels per loop (obsolete) — Details — Splinter Review

Processes 2 pixels per loop instead of the current single pixel.  

The ~20% reduction in execution time seen with the Pentium4 is likely common to all post-Pentium3 x86 CPUs running in a 32-bit environment.

The "32-bit" weasle-word is because the larger register count of x86 processors in 64-bit mode removes the register pressure that this patch addresses.

Attachment #409544 - Flags: review?(jmuizelaar)

Jeff Muizelaar [:jrmuizel]

Comment 5

•

15 years ago

(In reply to comment #3)
> Created an attachment (id=409543) [details]
> Peripheral changes in prep for following patch
> 
> Some housekeeping in preparation for the following 2-pixels-per-loop patch:
> 
> 1. Consolidate "input" and "output" pointers into pointer "indices".

I don't understand what indices means in this context. Maybe we can come up with a better name?

Steve Snyder

Reporter

Comment 6

•

15 years ago

(In reply to comment #5)
> I don't understand what indices means in this context. Maybe we can come up
> with a better name?

Well, the calculated values are indices into the output arrays.  I'm not committed to the name.

Steve Snyder

Reporter

Comment 7

•

14 years ago

Attached patch Remove the 3 shuffle instructions from processing (obsolete) — Details — Splinter Review

This patch builds on the earlier two-pixels-per-loop processing.

The change here is to remove the 3 shuffle instructions used in calc'ing the RGB output values, resulting in faster processing of each pixel.

Steve Snyder

Reporter

Comment 8

•

14 years ago

Performance improvement from removing the shuffle instructions from per-pixel processings.  These times are relative to the double-pixel reductions shown above:

0.873 = Pentium3/x86, SDR/133MHz, VS2005/SP1, WinXP
0.895 = Pentium4/x86, DDR1/533MHz, GCC/4.4.2, Fedora12
0.940 = Pentium4/x86, DDR1/533MHz, VS2005/SP1, WinXP
0.956 = Core2/x86, DDR2/800MHz, VS2005/SP1, Win2K(VMWare)
0.000 = Core2/x86_64, DDR2/800MHz, GCC/4.3.2, Fedora10

Obviously this patch benefits oldest hardware the most, where the shufps instruction is the most expensive.  On the other end of the spectrum, it seems that those instructions are essentially "free" in a contemporary 64-bit environment.

The combination of two-pixels-per-loop processing and the removal of the shuffle instructions from those loops will be welcomed by those running the QCMS code on Pentium3/Pentium4 CPUs.

The performance improvement comes at the cost of 9KB of additional RAM use on x86/x86_64 platforms.

Jeff Muizelaar [:jrmuizel]

Comment 9

•

14 years ago

(In reply to comment #8)
> Performance improvement from removing the shuffle instructions from per-pixel
> processings.  These times are relative to the double-pixel reductions shown
> above:
> 
> 0.873 = Pentium3/x86, SDR/133MHz, VS2005/SP1, WinXP
> 0.895 = Pentium4/x86, DDR1/533MHz, GCC/4.4.2, Fedora12
> 0.940 = Pentium4/x86, DDR1/533MHz, VS2005/SP1, WinXP
> 0.956 = Core2/x86, DDR2/800MHz, VS2005/SP1, Win2K(VMWare)
> 0.000 = Core2/x86_64, DDR2/800MHz, GCC/4.3.2, Fedora10
> 
> Obviously this patch benefits oldest hardware the most, where the shufps
> instruction is the most expensive.  On the other end of the spectrum, it seems
> that those instructions are essentially "free" in a contemporary 64-bit
> environment.
> 
> The combination of two-pixels-per-loop processing and the removal of the
> shuffle instructions from those loops will be welcomed by those running the
> QCMS code on Pentium3/Pentium4 CPUs.
> 
> The performance improvement comes at the cost of 9KB of additional RAM use on
> x86/x86_64 platforms.

I'm not thrilled by the ram overhead but I'll think about it. I wonder if it's worth specializing the code for older processors.

Another optimization that I thought of is to premultiply the 65535. into the matrix or the lookup table to remove an extra multiply in the hotloop.

Jeff Muizelaar [:jrmuizel]

Comment 10

•

14 years ago

Sorry the review here is taking so long. I'll try to get to it next week.

Steve Snyder

Reporter

Comment 11

•

14 years ago

(In reply to comment #10)
> Sorry the review here is taking so long. I'll try to get to it next week.

That's OK; I'm not sitting on the Inbox waiting for news.

Jeff Muizelaar [:jrmuizel]

Comment 12

•

14 years ago

Comment on attachment 409544 [details] [diff] [review]
Transform 2 pixels per loop

indices isn't used for input anymore so how about just calling it output? or perhaps better output_linear? Other than that this looks fine.

Attachment #409544 - Flags: review?(jmuizelaar) → review+

Jeff Muizelaar [:jrmuizel]

Updated

•

14 years ago

Attachment #409543 - Flags: review?(jmuizelaar) → review+

Jeff Muizelaar [:jrmuizel]

Comment 13

•

14 years ago

(In reply to comment #12)
> (From update of attachment 409544 [details] [diff] [review])
> indices isn't used for input anymore so how about just calling it output? or
> perhaps better output_linear? Other than that this looks fine.

This comment was meant for attachment 409543 [details] [diff] [review].

Jeff Muizelaar [:jrmuizel]

Comment 14

•

14 years ago

Comment on attachment 409544 [details] [diff] [review]
Transform 2 pixels per loop

>+    /* 2 pixels are transformed in each loop */
>+    even = length >> 1;
>+    odd  = length  & 1;

I don't think we need even and odd variables.
You should just be able to use i<length/2 in the loop header. The compiler should produce the same code and then just use length % 2
in the 'odd' condition.

Steve Snyder

Reporter

Comment 15

•

14 years ago

Attached patch 1/3: preparation for subsequent patches — Details — Splinter Review

Modified per Comments #12 & #13.

Attachment #409543 - Attachment is obsolete: true

Attachment #432451 - Flags: review?(jmuizelaar)

Steve Snyder

Reporter

Comment 16

•

14 years ago

Attached patch 2/3: transform 2 pixels per loop — Details — Splinter Review

Modified per Comment #14.

> I don't think we need even and odd variables.

Probably not.  The "even=length>>1" business was an instinctive aversion to doing anything inside a loop that can be done outside it.  I.e. a fear of having the compiler divide length by 2 on each iteration.  Likely that's not an issue with contemporary compilers.

Anyway, it's gone.

Attachment #409544 - Attachment is obsolete: true

Attachment #432452 - Flags: review?(jmuizelaar)

Steve Snyder

Reporter

Comment 17

•

14 years ago

Attached patch 3/3: remove need for shuffle instructions (obsolete) — Details — Splinter Review

No changes in functionality or behavior from previous patch.  Re-diffed to patch cleanly after modifying earlier patches per comments.

Attachment #426822 - Attachment is obsolete: true

Attachment #432453 - Flags: review?(jmuizelaar)

Jeff Muizelaar [:jrmuizel]

Updated

•

14 years ago

Attachment #432452 - Attachment is patch: true

Attachment #432452 - Attachment mime type: application/octet-stream → text/plain

Jeff Muizelaar [:jrmuizel]

Updated

•

14 years ago

Attachment #432451 - Flags: review?(jmuizelaar) → review+

Jeff Muizelaar [:jrmuizel]

Updated

•

14 years ago

Attachment #432452 - Flags: review?(jmuizelaar) → review+

Jeff Walden [:Waldo]

Comment 18

•

14 years ago

Is the current step here to finish reviews, or is it to push patches?  This bug feels a bit neglected.

Jeff Muizelaar [:jrmuizel]

Comment 19

•

14 years ago

(In reply to comment #18)
> Is the current step here to finish reviews, or is it to push patches?  This bug
> feels a bit neglected.

The next step is to decide whether the performance improvement is worth the increase in memory usage. I haven't really heard any complaints about qcms's performance lately, so I'm not sure how to make that call.

Steve Snyder

Reporter

Comment 20

•

14 years ago

Attached patch 3/3: remove need for shuffle instructions (bug fix) — Details — Splinter Review

Updated patch fixes protection faults on 64-bit builds.

The bug was that the header file was #ifdef'd to 32-bit x86 while the code in transform-sse2.c is unconditional.  The result was that the SSE2 transform code expected an array of floats at run time that was not defined at build time.

I added 64-bit symbols defined by the Microsoft and Gnu compilers.  The analogous definition for Sun's compiler is left as an exercise to the user.

Attachment #432453 - Attachment is obsolete: true

Attachment #432453 - Flags: review?(jmuizelaar)

Steve Snyder

Reporter

Updated

•

14 years ago

Attachment #484046 - Flags: review+

Steve Snyder

Reporter

Updated

•

14 years ago

Attachment #484046 - Flags: review+ → review?(jmuizelaar)

Steve Snyder

Reporter

Comment 21

•

13 years ago

Attached patch 3/3: Update no-shuffle-instructions patch for Gecko 8+ (obsolete) — Details — Splinter Review

This has the same functionality as the previous no-shuffle patch (attachment 484046 [details] [diff] [review]), but patches cleanly against the new QCMS code in Gecko v8.0.

The older patch is retained for versions of Gecko prior to 8.0.

Benoit Girard (:BenWa)

Comment 22

•

13 years ago

These patches are worthwhile. Are we simply waiting on reviews for them?

Steve Snyder

Reporter

Comment 23

•

13 years ago

(In reply to Benoit Girard (:BenWa) from comment #22)
> These patches are worthwhile. Are we simply waiting on reviews for them?

Well, comment #19 pretty much reflects the status of these patches.  While the double-pixel patch is cost-free, the elimination of the shuffle instructions comes at a cost of 9KB of RAM used, which Jeff seemed ambivalent about.  (A cost that has since risen to 12KB with the addition of the greyscale input gamma table in the latest code - sorry, Jeff!)

Regarding recent complaints about QCMS performance (or lack thereof): I'm in no position to gauge the satisfaction of the Mozilla-using public.  I can say, though, that the introduction of libjpeg-turbo's faster decoding makes the QCMS code the bottleneck in JPEG display on x86/x86_64 systems.

Steve Snyder

Reporter

Comment 24

•

13 years ago

Attached patch Update no-shuffle-instructions patch for Gecko 8+ — Details — Splinter Review

In copying the aligned memory allocation routines I also copied a bug in them (Bug 692922).  This updated patch applies that bug fix.

Attachment #565582 - Attachment is obsolete: true

Attachment #565741 - Flags: review?(jmuizelaar)

Benoit Girard (:BenWa)

Updated

•

12 years ago

QA Contact: color-management → jmuizelaar

Jeff Muizelaar [:jrmuizel]

Comment 25

•

12 years ago

I've put a prototype of a fixed point transform patch up at bug 802109

BugBot [:suhaib / :marco/ :calixte]

Comment 26

•

2 years ago

The bug assignee didn't login in Bugzilla in the last 7 months.
:aosmond, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: swsnyder → nobody

Flags: needinfo?(aosmond)

Andrew Osmond [:aosmond] (he/him)

Updated

•

2 years ago

Severity: normal → N/A

Type: defect → enhancement

Flags: needinfo?(aosmond)

You need to log in before you can comment on or make changes to this bug.