Closed Bug 416157 Opened 16 years ago Closed 14 years ago

Add JPEG SSE2 color processing for ycc_rgb_convert_argb

Categories

(Core :: Graphics, defect, P2)

x86
Windows XP
defect

Tracking

()

RESOLVED DUPLICATE of bug 573948

People

(Reporter: mmoy, Assigned: mmoy)

References

(Blocks 1 open bug)

Details

(Whiteboard: [needs tests updated])

Attachments

(1 file, 6 obsolete files)

Add SSE2 processing for this routine for the Mac OSX Intel and Windows (32-bit) platforms. I've tested the code on Mac OSX and need to test it on Windows.
Attached patch First pass at the patch (obsolete) — Splinter Review
Works on Mac OSX. I need to test it on Windows.
Attached patch SSE2 JPEG color optimization (obsolete) — Splinter Review
Added a few Windows fixes.
Attachment #301932 - Attachment is obsolete: true
To make the code a bit easier to read, you could put something like this in gfxPlatform.h:

#if defined(GFX_HAVE_SSE2_MNEMONICS)
extern int SSE2Present;
#elif defined(GFX_HAVE_SSE2_INTRINSICS)
#define SSE2Present 1
#endif

That would let you get rid of a whole bunch of guards. I would add a comment along the lines of "// always true for GFX_HAVE_SSE2_INTRINCICS" on the |== 1| test.

Also, you could add:

#if defined(GFX_HAVE_SSE2_MNEMONICS) && defined(GFX_HAVE_SSE2_INTRINSICS)
#error Can't have both GFX_HAVE_SSE2_MNEMONICS and GFX_HAVE_SSE2_INTRINSICS
#endif

In gfxPlatform.cpp, shouldn't the |int SSE2Present = 0;| be outside the guard for #include <intrin.h>?
This patch adds the SSE2 IFAST optimization for Mac/OSX. I will write some comments on the code later this morning. It doesn't include the comments posted a little while ago.
Attachment #302057 - Attachment is obsolete: true
I removed SSE2Present in the 2008-02-11 04:35 PST attachment and did some cleanup in my development environment. I also added your suggestion regarding generating the error message.

On your first suggested change in gfxPlatform.h, how do you use this without having the extern int included in gfxPlatform.cpp where it will be defined without an extern?

(In reply to comment #3)
> To make the code a bit easier to read, you could put something like this in
> gfxPlatform.h:
> 
> #if defined(GFX_HAVE_SSE2_MNEMONICS)
> extern int SSE2Present;
> #elif defined(GFX_HAVE_SSE2_INTRINSICS)
> #define SSE2Present 1
> #endif
> 
> That would let you get rid of a whole bunch of guards. I would add a comment
> along the lines of "// always true for GFX_HAVE_SSE2_INTRINCICS" on the |== 1|
> test.
> 
> Also, you could add:
> 
> #if defined(GFX_HAVE_SSE2_MNEMONICS) && defined(GFX_HAVE_SSE2_INTRINSICS)
> #error Can't have both GFX_HAVE_SSE2_MNEMONICS and GFX_HAVE_SSE2_INTRINSICS
> #endif
> 
> In gfxPlatform.cpp, shouldn't the |int SSE2Present = 0;| be outside the guard
> for #include <intrin.h>?
> 

I found a way to get around the self-include problem.
(In reply to comment #5)
> On your first suggested change in gfxPlatform.h, how do you use this without
> having the extern int included in gfxPlatform.cpp where it will be defined
> without an extern?

Ah, good point. So I guess for _INTRINSICS you could #define SSE2Present 1 in gfxPlatform.h, and for _MNEMONICS you put extern int SSE2Present in nsJPEGDecoder.cpp like you're currently doing. Or you could put the #define in the .cpp file too if you don't want to pollute gfxPlatform.h.

With that, you can get rid of the _MNEMONICS guards around |if (SSE2Present == 1) {| ... |} else {| ... |}|. The compiler should optimize out the test and else case for _INTRINSICS and it will make your code a bit easier to read.



I would also make |__attribute__ ((aligned (16)))| a conditional #define.
It's just easier to read and maintain this (with a better name?):

  JBLOCKROW MCU_buffer[D_MAX_BLOCKS_IN_MCU] SSE2_ALIGNED;
 
#ifdef D_MULTISCAN_FILES_SUPPORTED
  /* In multi-pass modes, we need a virtual block array for each component. */
  jvirt_barray_ptr whole_image[MAX_COMPONENTS] SSE2_ALIGNED;
#endif ! D_MULTISCAN_FILES_SUPPORTED



Index: jpeg/jddctmgr.c
===================================================================

@@ -142,31 +152,23 @@ start_pass (j_decompress_ptr cinfo)

-#ifdef HAVE_SSE2_INTEL_MNEMONICS
-		if (SSE2Available==1) 

Why are you taking out the case for HAVE_SSE2_INTEL_MNEMONICS? I could be wrong, but on Windows this change will cause us to not use SSE2 anymore.
If you can use the #define SSE2Available 1 trick for MacOSX, I think we'd want something like this:

#if defined(HAVE_SSE2_INTRINSICS) || defined(HAVE_SSE2_INTEL_MNEMONICS)
  if (SSE2Available == 1)
  {
    method_ptr = jpeg_idct_ifast_sse2;
    method = JDCT_IFAST;
  }
  else
#else
  {
    method_ptr = jpeg_idct_ifast;
    method = JDCT_IFAST;
  }
#endif

Hrm, that file contains tabs. I don't know what our policy is for mozilla/jpeg/ but if you don't mind, could you expand those tabs (to two spaces each, it seems) for the code you touch?


+    buffer_shift = (PRUint32)
+      (*cinfo->mem->alloc_large) ((j_common_ptr) cinfo, JPOOL_IMAGE,
+				  D_MAX_BLOCKS_IN_MCU * SIZEOF(JBLOCK) + 16);
+    buffer_shift = ((buffer_shift + 16) >> 4) << 4;
+    buffer = (JBLOCKROW) buffer_shift;

I think you meant + 15 there (so if you're already aligned nothing changes). And use & ~15 here too.

Too bad we don't have NS_PTR_ALIGN(p, a).

Also, if you use PRUptrdiff for both of these you should be 64-bit safe.
If it turns out you did want to keep _MNEMONICS above, make sure to put it back here.



Index: jpeg/jdcoefct.c
===================================================================

@@ -718,19 +730,32 @@ jinit_d_coef_controller (j_decompress_pt

+    compptr->dct_table = (void *) ((buffer_pointer + 16) & ~15);

Same + 15 here.



Index: modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp
===================================================================
@@ -512,17 +520,21 @@ nsresult nsJPEGDecoder::ProcessData(cons

+#ifdef GFX_HAVE_SSE2_INTRINSICS
+    mInfo.dct_method =  JDCT_IFAST;
+#else
     mInfo.dct_method =  JDCT_ISLOW;
+#endif ! GFX_HAVE_SSE2_INTRINSICS

Is the jpeg_idct_ifast_sse2 function as accurate as the jpeg_idct_islow one? Or are we ok with a loss of a bit of accuracy here?



To nit-pick some more, could you make GFX_HAVE_SSE2_MNEMONICS be ..._INTEL_MNEMONICS, or change the one in jpeg to drop the _INTEL?
(In reply to comment #6)
> I found a way to get around the self-include problem. 

How did you solve it?
(In reply to comment #8)
> (In reply to comment #6)
> > I found a way to get around the self-include problem. 
> 
> How did you solve it?

I was going to put a #define in the .cpp file before the #include gfxPlatform.h to indicate that this is the source and not the extern so that the .h file.
> I would also make |__attribute__ ((aligned (16)))| a conditional
> #define.  It's just easier to read and maintain this (with a better
> name?):

I'll take a look at putting in a #define for alignment. It's somewhat of a
sticky problem as the formats are different for MSVC and GCC but we're not
worrying about MSVC for now.

> Why are you taking out the case for HAVE_SSE2_INTEL_MNEMONICS? I could
> be wrong, but on Windows this change will cause us to not use SSE2
> anymore.  If you can use the #define SSE2Available 1 trick for MacOSX,
> I think we'd want something like this:

I think that code branch was for IFAST processing. On Windows, we
currently use ISLOW processing which goes to the SSE2 or scalar code
depending on the processor capabilities. Just a note that Intel
provided IFAST MMX support for MSVC using inline assembler. It's never
been used in Firefox to my knowledge but the code is there. It's
faster than the ISLOW SSE2 code on Pentium 4s and AMD K8s as they
mostly implement SSE2 instructions as pairs of 64-bit instructions.

> Hrm, that file contains tabs. I don't know what our policy is for
> mozilla/jpeg/ but if you don't mind, could you expand those tabs (to
> two spaces each, it seems) for the code you touch?

This was a sticky issue for the previous change and the reviewers
gave me a pass on untabifying the whole modules. I'll try to use
spaces for my code though.

> I think you meant + 15 there (so if you're already aligned nothing
> changes).  And use & ~15 here too.

I had 15 in there for a long time and changed it to 16 recently but I
don't recall why. After thinking about it, 15 should work. I think
that GCC on the Mac aligns memory allocations to 16 bytes but that
probably isn't true for Windows and Linux and there may be some
non-library memory allocation routines.

> Also, if you use PRUptrdiff for both of these you should be 64-bit
> safe.  If it turns out you did want to keep _MNEMONICS above, make
> sure to put it back here.

I'll look into that. I wasn't aware of that type.

> Is the jpeg_idct_ifast_sse2 function as accurate as the
> jpeg_idct_islow one? Or are we ok with a loss of a bit of accuracy
> here?

JPEG provides for three approaches: IFAST, ISLOW and FLOAT. It's
pretty hard to get good performance from FLOAT on x86 processors as
you can only do four operations at a time and float operations are
generally slower than integer operations. With integer operations
8 operations can be done at a time.

The scalar ISLOW code operates in 32-bit integer arithmetic. The
vector ISLOW code operates in 16-bit integer arithmetic so that it can
fit an eight by eight matrix row in a vector. This code has been used
for some time. The IFAST code uses a faster, less accurate integer
implementation and there's some information in the header of
jidctfst.c. The output of these algorithms is eight-byte integers so I
don't think that using 16-bit integer arithmetic is much of a problem.

The code here (and in inline assembler form) has been used for about
two years in Win32, Windows x64 and Mac OSX builds and I haven't
received any idct-related complaints.
(In reply to comment #10)
> > I would also make |__attribute__ ((aligned (16)))| a conditional
> > #define.  It's just easier to read and maintain this (with a better
> > name?):
> 
> I'll take a look at putting in a #define for alignment. It's somewhat of a
> sticky problem as the formats are different for MSVC and GCC but we're not
> worrying about MSVC for now.

As long as they go in the same place, making this a define would make your life even simpler. If they go in between the return type and the function name, or before the return type, you could always wrap the relevant part in a macro, e.g.

SSE2_ALIGNED(JBLOCKROW MCU_buffer[D_MAX_BLOCKS_IN_MCU])

or

jvirt_barray_ptr SSE2_ALIGNED(whole_image[MAX_COMPONENTS])

which can then place the annotation the the right place.


> > Why are you taking out the case for HAVE_SSE2_INTEL_MNEMONICS? I could
> > be wrong, but on Windows this change will cause us to not use SSE2
> > anymore.  If you can use the #define SSE2Available 1 trick for MacOSX,
> > I think we'd want something like this:
> 
> I think that code branch was for IFAST processing. On Windows, we
> currently use ISLOW processing which goes to the SSE2 or scalar code
> depending on the processor capabilities. Just a note that Intel
> provided IFAST MMX support for MSVC using inline assembler. It's never
> been used in Firefox to my knowledge but the code is there. It's
> faster than the ISLOW SSE2 code on Pentium 4s and AMD K8s as they
> mostly implement SSE2 instructions as pairs of 64-bit instructions.

Right, so you'd want to use the IFAST processing on Windows, but since on Windows _INTRINSICS isn't defined you'll fall back to the generic code.
Am I missing something here, or is that what you mean to do?


> > I think you meant + 15 there (so if you're already aligned nothing
> > changes).  And use & ~15 here too.
> 
> I had 15 in there for a long time and changed it to 16 recently but I
> don't recall why. After thinking about it, 15 should work. I think
> that GCC on the Mac aligns memory allocations to 16 bytes but that
> probably isn't true for Windows and Linux and there may be some
> non-library memory allocation routines.

It sounds like you're talking about the + 16 for the allocation, which, btw, can also be changed to + 15, but I meant the + 16 for the pointer:

+    buffer_shift = (PRUint32)
+      (*cinfo->mem->alloc_large) ((j_common_ptr) cinfo, JPOOL_IMAGE,
+                                 D_MAX_BLOCKS_IN_MCU * SIZEOF(JBLOCK) + 0xF);
+    buffer_shift = (buffer_shift + 0xF) & ~0xF;

See also what we do at http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/nsprpub/lib/ds/plarena.h&rev=3.6&root=/cvsroot#99 though I think the _MASK + 1 in the init call at line 103 really only needs to be _MASK.



> > Is the jpeg_idct_ifast_sse2 function as accurate as the
> > jpeg_idct_islow one? Or are we ok with a loss of a bit of accuracy
> > here?
> 
> JPEG provides for three approaches: IFAST, ISLOW and FLOAT. It's
> pretty hard to get good performance from FLOAT on x86 processors as
> you can only do four operations at a time and float operations are
> generally slower than integer operations. With integer operations
> 8 operations can be done at a time.
> 
> The scalar ISLOW code operates in 32-bit integer arithmetic. The
> vector ISLOW code operates in 16-bit integer arithmetic so that it can
> fit an eight by eight matrix row in a vector. This code has been used
> for some time. The IFAST code uses a faster, less accurate integer
> implementation and there's some information in the header of
> jidctfst.c. The output of these algorithms is eight-byte integers so I
> don't think that using 16-bit integer arithmetic is much of a problem.
> 
> The code here (and in inline assembler form) has been used for about
> two years in Win32, Windows x64 and Mac OSX builds and I haven't
> received any idct-related complaints.

Ah, that's good to know. jpeglib.h explicitly mentions that JDCT_IFAST is faster but less accurate than JDCT_ISLOW, which I assume they do for a reason. So, since nsJPEGDecoder picked JDCT_ISLOW, what was their reasoning? Maybe they were just playing it safe? Note that I am relatively ignorant on the specifics of jpeg decoding so I'm asking this more for my edification than as a critique on the change.
Btw, see bug 416906 about using ! after #endif. /* ... */ is the way to go (or just leave it off where it's trivial to connect the #endif with the #ifdef/#if).
I modified the checking to #define SSE2Present = 1 if intrinsics are there, to 0 if they aren't so that the SSE2 code doesn't get generated if there MNEMONICS and INTRINSICS aren't set. This got rid of several #ifdefs in nsJPEGDecoder.cpp. I'm still working on the other stuff.
This is the latest version of the patch, tested on Win32 and Mac/OSX. I will do a writeup on the changes in the form of responses to the review comments later tonight.
Attachment #302573 - Attachment is obsolete: true
(Comment #7 jag (Peter Annema)

> I would also make |__attribute__ ((aligned (16)))| a conditional #define.
> It's just easier to read and maintain this (with a better name?):

>   JBLOCKROW MCU_buffer[D_MAX_BLOCKS_IN_MCU] SSE2_ALIGNED;

> #ifdef D_MULTISCAN_FILES_SUPPORTED
>   /* In multi-pass modes, we need a virtual block array for each component. */
>   jvirt_barray_ptr whole_image[MAX_COMPONENTS] SSE2_ALIGNED;
> #endif ! D_MULTISCAN_FILES_SUPPORTED

I used the SSE2_ALIGNED() form. For GFX, it handles MSVC and GCC. For JPEG, it just
handles GCC for now.

> Hrm, that file contains tabs. I don't know what our policy is for mozilla/jpeg/
> but if you don't mind, could you expand those tabs (to two spaces each, it
> seems) for the code you touch?

I took the tabs out of the changes for this patch.

> +    buffer_shift = (PRUint32)
> +      (*cinfo->mem->alloc_large) ((j_common_ptr) cinfo, JPOOL_IMAGE,
> +                                 D_MAX_BLOCKS_IN_MCU * SIZEOF(JBLOCK) + 16);
> +    buffer_shift = ((buffer_shift + 16) >> 4) << 4;
> +    buffer = (JBLOCKROW) buffer_shift;

> I think you meant + 15 there (so if you're already aligned nothing changes).
> And use & ~15 here too.

> Too bad we don't have NS_PTR_ALIGN(p, a).

> Also, if you use PRUptrdiff for both of these you should be 64-bit safe.
> If it turns out you did want to keep _MNEMONICS above, make sure to put it back
> here.

Made the changes from 16 to 15 and am using PRUptrdiff now here, in jdcoeffct.c and
jmemmgr.c.

Comment 11, jag (Peter Annema):

> Right, so you'd want to use the IFAST processing on Windows, but since on
> Windows _INTRINSICS isn't defined you'll fall back to the generic code.
> Am I missing something here, or is that what you mean to do?

I was originally planning to leave the ISLOW code on Windows and add the IFAST
code for OSX as time was short and I wanted to get in a minimal code change.
The code right now leaves the Windows code as it is and adds IFAST for Mac OSX
only.

The INTEL_MNEMONICS is set for Win32 so it will use ISLOW (either SSE2 or scalar)
as chosen in jddctmgr.c. The code there already exists which is why it isn't seen
in the patch.

> Ah, that's good to know. jpeglib.h explicitly mentions that JDCT_IFAST is
> faster but less accurate than JDCT_ISLOW, which I assume they do for a reason.
> So, since nsJPEGDecoder picked JDCT_ISLOW, what was their reasoning? Maybe they
> were just playing it safe? Note that I am relatively ignorant on the specifics
> of jpeg decoding so I'm asking this more for my edification than as a critique
> on the change.

I don't have the history on the choice but I think that your assumption is reasonable.

Comment 12 jag (Peter Annema):

> Btw, see bug 416906 about using ! after #endif. /* ... */ is the way to go (or
> just leave it off where it's trivial to connect the #endif with the
> #ifdef/#if).

Done.

Attachment #306803 - Flags: review?(pavlov)
I think you can actually just do this:

+#ifdef GFX_HAVE_SSE2_INTRINSICS
+#define GFX_SSE2_ALIGN __attribute__ ((aligned (16)))
+#endif
+
+#ifdef GFX_HAVE_SSE2_INTEL_MNEMONICS
+#define GFX_SSE2_ALIGN __declspec(align(16))
+#endif

So you can drop the extra parentheses for (IMO) increased readability:

+GFX_SSE2_ALIGN const short color_constants_1[8] = {128, 128, 128, 128, 128, 128, 128, 128};

+  SSE2_ALIGN JBLOCKROW MCU_buffer[D_MAX_BLOCKS_IN_MCU];

+  SSE2_ALIGN jvirt_barray_ptr whole_image[MAX_COMPONENTS];

+SSE2_ALIGN unsigned short SSE2_ADD_128[8] =
+  {4096, 4096, 4096, 4096, 4096, 4096, 4096, 4096};
Comment on attachment 306803 [details] [diff] [review]
Color patch with the SSE2 IFAST intrinsics optimizations for OSX

this patch looks OK to me.  Should we address jag's comments?
Attachment #306803 - Flags: review?(pavlov) → review+
Modified the SSE2_ALIGN macros as per jag's comments. I had to use two patches to build as I ran into the Parental Controls Vista SDK issue. I recreated the patch and had to strip the Parental Controls stuff.
Attachment #306803 - Attachment is obsolete: true
Attachment #313882 - Flags: review?(pavlov)
Flags: wanted-next+
This change exposes a problem with VS2005 SP1 and the SDK (I think). I just upgraded to SP1 and the SDK and the include "intrin.h" results in a dual definition failure. I looked it up on Google and it's a common problem. The recommendation was to comment out the declaration in intrin.h which works.
I'd like to continue to make progress on this for 1.9.1..
Flags: wanted1.9.1+
Is it just VS2005's bugs that's holding this patch up? Or is there anything else needed?
I don't think that the updated patch was rereviewed.
Attachment #313882 - Flags: review?(pavlov) → review+
Attached patch Updated patch for 3.1 (obsolete) — Splinter Review
I had to modify the patch as there was an extra or missing line from one of the modules that I had to fix. The patch has been tested on Windows and Mac OSX on the 3.1 code base.
Attachment #313882 - Attachment is obsolete: true
Using a name as a variable or #define didn't work in getting the compiler to ignore the SIMD code that wouldn't compile on unsupported platforms. So I changed the plumbing to use GFX_SSE2_POSSIBLE (meaning that SSE2 support is possible on the hardware/software platform) and GFX_SSE2_AVAILABLE as a variable or #define that determines at compile-time (for Mac/OSX Intel) or run-time (for Win32) which path to take. The SIMD code is only active when GFX_SSE2_POSSIBLE is defined so that there won't be compilation errors on unsupported platforms.

I also modified the gfx code to use __sse2_available on Win32. This flag is used by the memcpy CRT routine (and others) to determine SSE2 support at runtime and is present in VS2003, VS2005 and VS2008. This removes the need to use __cpuid and therefore, intrin.h which caused multiply defined symbols.
Attachment #332163 - Attachment is obsolete: true
Comment on attachment 332500 [details] [diff] [review]
Updated patch for tinderbox

Some additional changes to the SSE2 infrastructure to get it to work on the tryserver. The patch passed the tryserver build on Linux, Mac OSX and Windows.
Attachment #332500 - Flags: review?(pavlov)
Attachment #332500 - Flags: review?(pavlov) → review+
Who should I request for Superreview?
Comment on attachment 332500 [details] [diff] [review]
Updated patch for tinderbox

Looks good to me.
Attachment #332500 - Flags: superreview+
Checkin requested (unless there's another testing phase required).
mmoy, I'm adding the checkin-needed flag to the whiteboard for you. It would help whoever checks it in if you provide commit message you want to show in the Hg log when they check the patch in.
Keywords: checkin-needed
One other comment - there may be resulting changes in JPEG test results as JPEG results are not lossless and there can be small differences due to computational differences. Do these tests get assigned to me?
Pushed as 17112:6eec92f9276a.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9.1a2
Yeah, the tests need to be updated... :(
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Althouhg changing JPEG decoder setting to IFAST, it need update test codes like commet #32.
But, I believe that SSE2 code of ycc_rgb_convert_argb() functin is usefull for all platforms without test regression.

Michael, could you separate this into IFAST issue and SSE2 code issue of ycc_rgb_convert_argb() function?
How do you think this?
I wanted to use IFAST for Mac OSX because of the performance regression in JPEG from 2.* to 3.0 but leave ISLOW SSE2 for Win32. The Mac OSX IFAST code generally runs fine on Windows x64. It probably runs fine on Linux but I can't test that.

I will look into separating out the argb code from the idct code.

I can also just port the ISLOW SSE2 code to intrinsics. Ideally, there would be a preference so that the user could select the transform code based on their needs.
What needs to be done to update the tests? Would we end up with different test results on different platforms?
Product: Core → Core Graveyard
Component: GFX → Graphics
Product: Core Graveyard → Core
QA Contact: general → thebes
Target Milestone: mozilla1.9.1a2 → ---
Flags: wanted1.9.2?
Whiteboard: [needs tests updated]
Blocks: jpeg-perf
I think this whole patch will be unnecessary once we land libjpeg-turbo (bug 573948).  Anyone mind if I WONTFIX?
It is only open for the updating of tests, so that part can/should remain open.
(In reply to comment #37)
> It is only open for the updating of tests, so that part can/should remain open.

I don't follow the discussion above about the tests.  What about this bug is remaining to be done?
(In reply to comment #38)
> (In reply to comment #37)
> > It is only open for the updating of tests, so that part can/should remain open.
> 
> I don't follow the discussion above about the tests.  What about this bug is
> remaining to be done?

Michel changed jpeg decompression algorithm from ISLOW to IFAST.  So reftest is failed.

If we can remove ycc_rgb_convert_argb by merging libjpeg-turbo, we should dup of bug 573948.
libjpeg-turbo has a vectorized ycc_rgb_convert_argb function.  Mind if I dupe with bug 573948?
(In reply to comment #40)
> libjpeg-turbo has a vectorized ycc_rgb_convert_argb function.  Mind if I dupe
> with bug 573948?

Seems sane.
Status: REOPENED → RESOLVED
Closed: 16 years ago14 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.