Open Bug 523584 Opened 15 years ago Updated 2 years ago

QCMS: further reduction in execution time

Categories

(Core :: Graphics: Color Management, enhancement)

x86
All
enhancement

Tracking

()

People

(Reporter: swsnyder, Unassigned)

Details

Attachments

(4 files, 6 obsolete files)

Attached patch Transform 2 pixels per loop (obsolete) — Splinter Review
[This is a follow-up patch to bug 512865.]

The attached patch reduces execution time on the SIMD code paths.  It does this by transforming 2 pixels per loop (as opposed to the single pixel/loop in the trunk code).  This allows for pointers in registers to be used longer before being forced out for other needs.

Execution times seen, relative to trunk code:

0.961 = Win32 (VS2005/SP1), Pentium3 (Coppermine), SDRAM
0.773 = Win32 (VS2005/SP1), Pentium4 (Northwood), DDR1
0.806 = Linux x86 (GCC v4.3.1), Pentium4 (Northwood), DDR1
0.957 = Linux x86_64 (GCC v4.3.1), Core2 (Penryn), DDR2

Not much benefit seen on Pentium3 (due to smaller trace cache?).  Also not much benefit on x86_64, where values could already be kept long-term in the larger register set.

However, a ~20% improvement is seen in the Pentium4 test cases.  I expect that this improvement will hold true for all post-Pentium4 processors running in a 32-bit environment.

Also, a few peripheral changes, not strictly related to the 2pixel/loop enhancement:

0. Replaced the "input" and "output" pointers with the single "indices" pointer, the better to communicate its actual purpose.

1. Added a few "const" declarations.  This is to better convey to the compiler that the relevant input values will not be changing during processing.

2. For RGBA functions, removed unneeded imtermediate storage of alpha byte (changed src-->temp-->dest to src-->dest).
Attachment #407533 - Flags: review?(jmuizelaar)
Assignee: nobody → swsnyder
Jeff: One more thing, a housekeeping note.

There is now so much duplicate code within and between the transform-sse*.c files that I wonder if maybe putting the code into macros would be justified.  The bulk of the SIMD code could be moved into a header file for inclusion by both source files.  Doing so would sacrifice some clarity to gain maintainability.

Just a thought.
(In reply to comment #0)
> Also, a few peripheral changes, not strictly related to the 2pixel/loop
> enhancement:
> 
> 0. Replaced the "input" and "output" pointers with the single "indices"
> pointer, the better to communicate its actual purpose.
> 
> 1. Added a few "const" declarations.  This is to better convey to the compiler
> that the relevant input values will not be changing during processing.
> 
> 2. For RGBA functions, removed unneeded imtermediate storage of alpha byte
> (changed src-->temp-->dest to src-->dest).

It would be better if these changes were split out into separate patches. It makes review and bisection easier.
Some housekeeping in preparation for the following 2-pixels-per-loop patch:

1. Consolidate "input" and "output" pointers into pointer "indices".

2. Add several "const" declarations to better communicate to compilers the unchanging nature of the values being pointed to.
Attachment #407533 - Attachment is obsolete: true
Attachment #409543 - Flags: review?(jmuizelaar)
Attachment #407533 - Flags: review?(jmuizelaar)
Attached patch Transform 2 pixels per loop (obsolete) — Splinter Review
Processes 2 pixels per loop instead of the current single pixel.  

The ~20% reduction in execution time seen with the Pentium4 is likely common to all post-Pentium3 x86 CPUs running in a 32-bit environment.

The "32-bit" weasle-word is because the larger register count of x86 processors in 64-bit mode removes the register pressure that this patch addresses.
Attachment #409544 - Flags: review?(jmuizelaar)
(In reply to comment #3)
> Created an attachment (id=409543) [details]
> Peripheral changes in prep for following patch
> 
> Some housekeeping in preparation for the following 2-pixels-per-loop patch:
> 
> 1. Consolidate "input" and "output" pointers into pointer "indices".

I don't understand what indices means in this context. Maybe we can come up with a better name?
(In reply to comment #5)
> I don't understand what indices means in this context. Maybe we can come up
> with a better name?

Well, the calculated values are indices into the output arrays.  I'm not committed to the name.
This patch builds on the earlier two-pixels-per-loop processing.

The change here is to remove the 3 shuffle instructions used in calc'ing the RGB output values, resulting in faster processing of each pixel.
Performance improvement from removing the shuffle instructions from per-pixel processings.  These times are relative to the double-pixel reductions shown above:

0.873 = Pentium3/x86, SDR/133MHz, VS2005/SP1, WinXP
0.895 = Pentium4/x86, DDR1/533MHz, GCC/4.4.2, Fedora12
0.940 = Pentium4/x86, DDR1/533MHz, VS2005/SP1, WinXP
0.956 = Core2/x86, DDR2/800MHz, VS2005/SP1, Win2K(VMWare)
0.000 = Core2/x86_64, DDR2/800MHz, GCC/4.3.2, Fedora10

Obviously this patch benefits oldest hardware the most, where the shufps instruction is the most expensive.  On the other end of the spectrum, it seems that those instructions are essentially "free" in a contemporary 64-bit environment.

The combination of two-pixels-per-loop processing and the removal of the shuffle instructions from those loops will be welcomed by those running the QCMS code on Pentium3/Pentium4 CPUs.

The performance improvement comes at the cost of 9KB of additional RAM use on x86/x86_64 platforms.
(In reply to comment #8)
> Performance improvement from removing the shuffle instructions from per-pixel
> processings.  These times are relative to the double-pixel reductions shown
> above:
> 
> 0.873 = Pentium3/x86, SDR/133MHz, VS2005/SP1, WinXP
> 0.895 = Pentium4/x86, DDR1/533MHz, GCC/4.4.2, Fedora12
> 0.940 = Pentium4/x86, DDR1/533MHz, VS2005/SP1, WinXP
> 0.956 = Core2/x86, DDR2/800MHz, VS2005/SP1, Win2K(VMWare)
> 0.000 = Core2/x86_64, DDR2/800MHz, GCC/4.3.2, Fedora10
> 
> Obviously this patch benefits oldest hardware the most, where the shufps
> instruction is the most expensive.  On the other end of the spectrum, it seems
> that those instructions are essentially "free" in a contemporary 64-bit
> environment.
> 
> The combination of two-pixels-per-loop processing and the removal of the
> shuffle instructions from those loops will be welcomed by those running the
> QCMS code on Pentium3/Pentium4 CPUs.
> 
> The performance improvement comes at the cost of 9KB of additional RAM use on
> x86/x86_64 platforms.

I'm not thrilled by the ram overhead but I'll think about it. I wonder if it's worth specializing the code for older processors.

Another optimization that I thought of is to premultiply the 65535. into the matrix or the lookup table to remove an extra multiply in the hotloop.
Sorry the review here is taking so long. I'll try to get to it next week.
(In reply to comment #10)
> Sorry the review here is taking so long. I'll try to get to it next week.

That's OK; I'm not sitting on the Inbox waiting for news.
Comment on attachment 409544 [details] [diff] [review]
Transform 2 pixels per loop

indices isn't used for input anymore so how about just calling it output? or perhaps better output_linear? Other than that this looks fine.
Attachment #409544 - Flags: review?(jmuizelaar) → review+
Attachment #409543 - Flags: review?(jmuizelaar) → review+
(In reply to comment #12)
> (From update of attachment 409544 [details] [diff] [review])
> indices isn't used for input anymore so how about just calling it output? or
> perhaps better output_linear? Other than that this looks fine.

This comment was meant for attachment 409543 [details] [diff] [review].
Comment on attachment 409544 [details] [diff] [review]
Transform 2 pixels per loop

>+    /* 2 pixels are transformed in each loop */
>+    even = length >> 1;
>+    odd  = length  & 1;

I don't think we need even and odd variables.
You should just be able to use i<length/2 in the loop header. The compiler should produce the same code and then just use length % 2
in the 'odd' condition.
Modified per Comments #12 & #13.
Attachment #409543 - Attachment is obsolete: true
Attachment #432451 - Flags: review?(jmuizelaar)
Modified per Comment #14.

> I don't think we need even and odd variables.

Probably not.  The "even=length>>1" business was an instinctive aversion to doing anything inside a loop that can be done outside it.  I.e. a fear of having the compiler divide length by 2 on each iteration.  Likely that's not an issue with contemporary compilers.

Anyway, it's gone.
Attachment #409544 - Attachment is obsolete: true
Attachment #432452 - Flags: review?(jmuizelaar)
No changes in functionality or behavior from previous patch.  Re-diffed to patch cleanly after modifying earlier patches per comments.
Attachment #426822 - Attachment is obsolete: true
Attachment #432453 - Flags: review?(jmuizelaar)
Attachment #432452 - Attachment is patch: true
Attachment #432452 - Attachment mime type: application/octet-stream → text/plain
Attachment #432451 - Flags: review?(jmuizelaar) → review+
Attachment #432452 - Flags: review?(jmuizelaar) → review+
Is the current step here to finish reviews, or is it to push patches?  This bug feels a bit neglected.
(In reply to comment #18)
> Is the current step here to finish reviews, or is it to push patches?  This bug
> feels a bit neglected.

The next step is to decide whether the performance improvement is worth the increase in memory usage. I haven't really heard any complaints about qcms's performance lately, so I'm not sure how to make that call.
Updated patch fixes protection faults on 64-bit builds.

The bug was that the header file was #ifdef'd to 32-bit x86 while the code in transform-sse2.c is unconditional.  The result was that the SSE2 transform code expected an array of floats at run time that was not defined at build time.

I added 64-bit symbols defined by the Microsoft and Gnu compilers.  The analogous definition for Sun's compiler is left as an exercise to the user.
Attachment #432453 - Attachment is obsolete: true
Attachment #432453 - Flags: review?(jmuizelaar)
Attachment #484046 - Flags: review+
Attachment #484046 - Flags: review+ → review?(jmuizelaar)
This has the same functionality as the previous no-shuffle patch (attachment 484046 [details] [diff] [review]), but patches cleanly against the new QCMS code in Gecko v8.0.

The older patch is retained for versions of Gecko prior to 8.0.
These patches are worthwhile. Are we simply waiting on reviews for them?
(In reply to Benoit Girard (:BenWa) from comment #22)
> These patches are worthwhile. Are we simply waiting on reviews for them?

Well, comment #19 pretty much reflects the status of these patches.  While the double-pixel patch is cost-free, the elimination of the shuffle instructions comes at a cost of 9KB of RAM used, which Jeff seemed ambivalent about.  (A cost that has since risen to 12KB with the addition of the greyscale input gamma table in the latest code - sorry, Jeff!)

Regarding recent complaints about QCMS performance (or lack thereof): I'm in no position to gauge the satisfaction of the Mozilla-using public.  I can say, though, that the introduction of libjpeg-turbo's faster decoding makes the QCMS code the bottleneck in JPEG display on x86/x86_64 systems.
In copying the aligned memory allocation routines I also copied a bug in them (Bug 692922).  This updated patch applies that bug fix.
Attachment #565582 - Attachment is obsolete: true
Attachment #565741 - Flags: review?(jmuizelaar)
QA Contact: color-management → jmuizelaar
I've put a prototype of a fixed point transform patch up at bug 802109

The bug assignee didn't login in Bugzilla in the last 7 months.
:aosmond, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: swsnyder → nobody
Flags: needinfo?(aosmond)
Severity: normal → N/A
Type: defect → enhancement
Flags: needinfo?(aosmond)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: