Add SSE2 processing for JPEG color, use static instead of dynamic arrays

RESOLVED FIXED in mozilla1.9beta3

Status

()

Core
ImageLib
P1
enhancement
RESOLVED FIXED
10 years ago
6 years ago

People

(Reporter: Michael Moy, Assigned: Michael Moy)

Tracking

({perf})

Trunk
mozilla1.9beta3
x86
Windows XP
Points:
---
Dependency tree / graph
Bug Flags:
blocking1.9 +
in-testsuite ?

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(4 attachments, 6 obsolete attachments)

(Assignee)

Description

10 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20080101 BonEcho/2.0.0.11 (mmoy CE K8C-X03)
Build Identifier: Trunk

Add SSE2 optimization for ycc_rgb_convert and turn five dynamic tables into static tables. ycc_rgb_convert is a common color conversion and using SSE2 code instead of scalar code results in about a 25% performance improvement in my testing.

There are four tables that are generated in the jdcolor.c module every time an image is processed. The change here uses a static table instead of going through malloc, computation and free each time.

There is a table called range_limit that's created for every image and this is done in jdmaster.c. The change here changes that to a static table instead of creating it for every image.

The SSE2 code and the table code are only varianted on Win32 builds built with MSVC. The SSE2 code will only run on processors that support it but that should be most processors in the last six years.


Reproducible: Always

Steps to Reproduce:
1.
2.
3.



This code piggybacks on the HAVE_SSE2_INTEL_MNEMONICS define variable and the SSE2Available global that's in the existing SSE2 JPEG code for the Inverse Discrete Cosign Transform.

This code was ported from my Mac OSX code by changing the alignment declarations from GNU style to MSVC style. Most of the changes in this patch should work on Mac OSX Intel and Linux with minor modifications.
(Assignee)

Comment 1

10 years ago
Created attachment 296054 [details] [diff] [review]
JPEG Color SSE2 and static tables changes

Changes are to jdcolor.c and jdmaster.c.
Attachment #296054 - Flags: approval1.9?
(Assignee)

Comment 2

10 years ago
I've done a fair amount of testing on this code and it's been around for about two years. I haven't tried building with this particular patch on OSX or Linux though I did do builds with various switches turned on and off. I also eyeballed the image results on the edges and corners and where images meet other images.

I have a bunch of other JPEG optimization but this is probably the safest with the
biggest return that should provide enough time for comments and testing before Beta 3.
Version: unspecified → Trunk
Comment on attachment 296054 [details] [diff] [review]
JPEG Color SSE2 and static tables changes

mmoy, you need to get review+ before seeking approval.

Maybe ask akayser?
Attachment #296054 - Flags: approval1.9?
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: blocking1.9?
(Assignee)

Comment 4

10 years ago
I need to make a few small changes to the patch after looking at the dif. Will take me 30 minutes.
Attachment #296054 - Flags: superreview?(pavlov)
Attachment #296054 - Flags: review?(pavlov)
(Assignee)

Comment 5

10 years ago
Created attachment 296062 [details] [diff] [review]
SSE2 JPEG Color optimization and use static instead of dynamic tables

Changes to jdcolor.c and jdmaster.c
Attachment #296054 - Attachment is obsolete: true
Attachment #296062 - Flags: approval1.9?
Attachment #296054 - Flags: superreview?(pavlov)
Attachment #296054 - Flags: review?(pavlov)
Comment on attachment 296062 [details] [diff] [review]
SSE2 JPEG Color optimization and use static instead of dynamic tables

You need review before approval.
Attachment #296062 - Flags: superreview?(pavlov)
Attachment #296062 - Flags: review?(pavlov)
Attachment #296062 - Flags: approval1.9?
(Assignee)

Comment 7

10 years ago
I just sent email to Alfred at his gmail account requesting a review.
Attachment #296062 - Flags: review?(alfredkayser)
Assignee: nobody → mmoy

Updated

10 years ago
Keywords: perf

Comment 8

10 years ago
Michael - this is um, awesome.

Can we make sure to use the SSE2 detection code already in:

http://lxr.mozilla.org/mozilla/source/jpeg/jdapimin.c#481

So this doesn't break on non SSE2 machines?

Anyone else want to take this to GCC linux/mac?

Comment 9

10 years ago
Just a note that adapting some of bug 386651 to cover JPEG might be a great way to get automated test coverage for these sorts of changes.  Dolske may take a look.

Comment 10

10 years ago
(In reply to comment #8)
> Michael - this is um, awesome.
> 
> Can we make sure to use the SSE2 detection code already in:
> 
> http://lxr.mozilla.org/mozilla/source/jpeg/jdapimin.c#481
> 
> So this doesn't break on non SSE2 machines?
> 

Um, nevermind.  You've already got that covered :-) Sorry for the noise.

Comment 11

10 years ago
Comment on attachment 296062 [details] [diff] [review]
SSE2 JPEG Color optimization and use static instead of dynamic tables

After I have R'ed the (updated) patch, then pavlov only needs to do the SR.
Attachment #296062 - Flags: review?(pavlov)
Attachment #296062 - Flags: review?(alfredkayser)
Attachment #296062 - Flags: review-

Comment 12

10 years ago
A few minor comments:
1. Is __declspec(align(16)) really needed? It may cause issues with other compilers...

2. No tabs please (that is why the diff view show no indenting...)

3. Instead of:
    #ifndef HAVE_SSE2_INTEL_MNEMONICS
     oldcode
    #else
     if (SSE2Available==1) 
       ssecode
     else 
       oldcode
    #endif
do
    #ifdef HAVE_SSE2_INTEL_MNEMONICS
     if (SSE2Available==1) 
       ssecode
     else 
    #else
       oldcode
    #endif
to prevent code duplication

4. The tables should be made 'const' so that they can be placed in shared/readonly memory.

5. The static table optimization is valid for all platforms, not just when SSE2 is enabled.

6. In  jpeg_calc_output_dimensions replace:
      JSAMPLE * table;
      int i;
    #ifdef HAVE_SSE2_INTEL_MNEMONICS
      /* Use a static table for Win32/MSVC */
      table = (JSAMPLE *) static_range_table;
      table += (MAXJSAMPLE+1);	/* allow negative subscripts of simple table */
      cinfo->sample_range_limit = table;
    #else
      ...
with	
    #ifdef HAVE_SSE2_INTEL_MNEMONICS
      /* Use a static table for Win32/MSVC */
      /* allow negative subscripts of simple table */
      cinfo->sample_range_limit = ((JSAMPLE *) static_range_table) + MAXJSAMPLE+1;
    #else
      JSAMPLE * table;
      int i;
      ...	
As, the 'i' will be warned as 'unused', and 'table' is not really used in the static code variant.

7. Last but not least: When static tables are used 'cconvert' is now now needed anymore, which saves another alloc.

Comment 13

10 years ago
Just for grins, I added this patch and turned on HAVE_SSE2_INTEL_MNEMONICS for OS X, but that drags in some exisiting inline asm, which needs -fasm-blocks and some other tweaking just to get to compile. Not clear if it's worth enabling only this patch for OS X, or if we should try to get all the SSE2 bits going.

(Assignee)

Comment 14

10 years ago
Created attachment 296196 [details] [diff] [review]
Mac OSX patch for 2.0.0.6

The Windows patch above was derived from this Mac OSX patch which applies against 2.0.0.6 and has the idct, color, sample and static table code in it. The builds are at http://www.vector64.com/elliott.html

I believe that the dct_method has to be changed from JDCT_ISLOW to JDCT_IFAST in nsJPEGDecoder.cpp and you need to do it by hand. I don't know whether or not the patch will apply to FF 3.0.
(Assignee)

Comment 15

10 years ago
Created attachment 296230 [details]
Image benchmark framework

This html file should be placed in a directory with jpeg files and modified to point to those jpeg files. To run the benchmark, drag the file to the browser. Exit the browser, reopen it and then drag it again. The first time will load the files into system disk cache so that the benchmark will better reflect image rendering time. You need to get enough images to be able to see a difference easily and the reasonable size will depend on your system. I like to shutdown as many background processes as possible when I run benchmarks.

Google Images, a digital camera, digital scanners or other websites can provide images. One should try to get a variety of images, maybe some black and whites.

This test doesn't work on Opera and Safari as they apparently indicate that the page is done after the first page is displayed.

Comment 16

10 years ago
I'd like to get this for b3..
Flags: blocking1.9? → blocking1.9+
Priority: -- → P1

Comment 17

10 years ago
I've got another blocker to spank first, then can help get this going on Linux and Mac if needed.
(Assignee)

Comment 18

10 years ago
Thanks to Alfred for doing the review.

I cleaned up the code today and added support for Mac OSX for the color routines and tested those out and they look good from a quick perusal. I need to do a little more testing on images that I normally test JPEG changes with - I just don't store my stock images on the Mac.

The Mac support provides a template for adding Linux support. Someone just needs to do the code for turning on intrinsics based on the combination of hardware, OS and compiler and adding code to check the processor type to determine SSE2 support and set SSE2Available accordingly.

I found a few surprising things on the Mac OSX platform related to performance using a bunch of JPEGS.

mmoy 2.0.0.6 build:   9 seconds
2.0.0.11 Official:   18 seconds
3.0b2                37 seconds
Trunk build          37 seconds
Trunk with SSE2 code 34.5 seconds

So it appears to me that there has been a major regression from the 2.0.0.* line to 3.0.

So I ran some tests on Windows (the image files are different):

Trunk with SSE2:   3.6  seconds
mmoy 2.0.0.11:     2.35 seconds
3.0b2:             5.29 seconds
Trunk:             4.79 seconds (improvements between beta2 and today?)
Official 2.0.0.11: 6.02 seconds

The Windows results look reasonable. The Mac OSX results look very unattractive.
(Assignee)

Comment 19

10 years ago
Created attachment 296901 [details] [diff] [review]
Changes after Alfred's comments

A few minor comments:

1. Is __declspec(align(16)) really needed? It may cause issues with
   other compilers...

   It isn't needed in the tables. It is needed in the constants as
   there are instructions used which require alignment. Note that
   it is currently used in jidctint.c.

   I've removed it from the tables in jdmaster.c and jdcolor.c

   I put some code in jconfig.h to make the declarations support
   the alignment attributes in MSVC and gcc for the two datatypes
   that I used.

2. No tabs please (that is why the diff view show no indenting...)

   I think that the modules are clean now. I used M-x untabify to get
   rid of the tabs.

3. Instead of:
    #ifndef HAVE_SSE2_INTEL_MNEMONICS
     oldcode
    #else
     if (SSE2Available==1) 
       ssecode
     else 
       oldcode
    #endif
do
    #ifdef HAVE_SSE2_INTEL_MNEMONICS
     if (SSE2Available==1) 
       ssecode
     else 
    #else
       oldcode
    #endif
to prevent code duplication

    The oldcodes weren't the same. I added a small optimization to the
    second version to factor out an addition so that the addition
    operation is only done once instead of three times.

    But I made the modifications and added the range_scale_y optimization.
    And took out the register declarations. In Windows, they actually
    generate slower code as variables are needlessly pushed to the stack
    to keep the latest variables in registers.

    Fixed in jdcolor.c

4. The tables should be made 'const' so that they can be placed in
   shared/readonly memory.

   Fixed in jdcolor.c.

5. The static table optimization is valid for all platforms, not just
   when SSE2 is enabled.

   Fixed in jdcolor.c and jdmaster.c.

6. In  jpeg_calc_output_dimensions replace:
      JSAMPLE * table;
      int i;
    #ifdef HAVE_SSE2_INTEL_MNEMONICS
      /* Use a static table for Win32/MSVC */
      table = (JSAMPLE *) static_range_table;
      table += (MAXJSAMPLE+1);  /* allow negative subscripts of simple table */
      cinfo->sample_range_limit = table;
    #else
      ...
with    
    #ifdef HAVE_SSE2_INTEL_MNEMONICS
      /* Use a static table for Win32/MSVC */
      /* allow negative subscripts of simple table */
      cinfo->sample_range_limit = ((JSAMPLE *) static_range_table) +
MAXJSAMPLE+1;
    #else
      JSAMPLE * table;
      int i;
      ...       
As, the 'i' will be warned as 'unused', and 'table' is not really used in the
static code variant.

   Fixed in jdmaster.c

7. Last but not least: When static tables are used 'cconvert' is now
   now needed anymore, which saves another alloc.

   There is at least one other field in cconvert that are still used. This is
   in routine jinit_color_deconverter (jdcolor.c):

     cconvert->pub.start_pass = start_pass_dcolor

-------------

- Added #ifdefs around #include emmtrin.h in jdcolor.c.
- Added HAVE_SSE2_INTRINSICS. HAVE_SSE2_INTEL_MNEMONICS currently means
  MSVC SSE2 support.
- Added Mac OSX support.
- Protected SSE2Available in jddctmgr.c (not sure why this compiled on non-MSVC
  platforms before as it's only defined in jdapimin for MSVC).
- Added templates for other platforms in jdapimin.c and jmorecfg.h.
Attachment #296062 - Attachment is obsolete: true
Attachment #296062 - Flags: superreview?(pavlov)
(Assignee)

Updated

10 years ago
Attachment #296901 - Flags: review?(alfredkayser)
Attachment #296901 - Flags: superreview?(pavlov)
(Assignee)

Comment 20

10 years ago
Out of curiosity, does the superreview build the code changes or do others build and test changes or does code generally go in after review?

Comment 21

10 years ago
(In reply to comment #18)

> So I ran some tests on Windows (the image files are different):
> 
> Trunk with SSE2:   3.6  seconds
> mmoy 2.0.0.11:     2.35 seconds
> 3.0b2:             5.29 seconds
> Trunk:             4.79 seconds (improvements between beta2 and today?)
> Official 2.0.0.11: 6.02 seconds
> 
> The Windows results look reasonable. The Mac OSX results look very
> unattractive.
> 

Bug 411718 related?

What else in your build accounts for the difference between this patch and mmoy builds?
Super-reviewers, and reviewers for that matter, are not obligated to apply, build or test the patch. You should ask if you want that to happen as part of the code review process.

/be
http://www.mozilla.org/hacking/reviewers.html gives a description of how the super-review process works.
(Assignee)

Comment 24

10 years ago
> What else in your build accounts for the difference between this patch 
> and mmoy builds?

On the build side, there is -O2, -GL and PGO. I think that I can make the case that we should be building with -O2 on Windows though perhaps that should be done elsewhere. My guess is that -GL provides a 10% boost and that PGO provides about an 8% boost. This is at the cost of image size. In 2.0, I always do static builds instead of shared builds but the addition of libxul results in a build error when I try to static builds in 3.0 so I'm doing shared builds. Static builds allow more whole program optimization (-GL).

On the code side, there are a few other things though my memory is a little foggy as the performance code was done before 2007. Reducing the copying around
of data certainly helps. I had an optimization to get rid of the division operations in GFXImageFrame. I may have had a small Huffman optimization - not sure if that was Win32 or x64. And I think that I'm using optimized memset, memcpy and memmove routines for large memory operations. The optimized memcpy/memmove routine is about 1,200 lines of assembler. I think that I inlined some of the Mozilla string functions or switched to inline Microsoft string functions and improved some locking code in the PL area. I'm not sure if this affected JPEG rendering.

Microsoft's VS2005 SSE2 memcpy/memmove implementation is pretty poor. There's a test for SSE2 and if it is available, there are tests to see whether the source and destination are both 16-byte aligned. If they are, then a MOVDQA copy loop is run to do the copy with some scalar code to do the remainder. If not, then the regular scalar code is used (sliding doubleworld copy I think). If you don't align structures, at worst, you'd use the SSE2 copy one out of 256 times. Most structures are probabily aligned on four bytes so that's one out of 16 times.

Optimzed mem* routines typically look at the size of what you're copying and, if large, use MMX or SSE register copies. I've read that some even use f87 for Pentium 1s. One can take advantage of SIMD instruction parallelism, non-temporal  writes, data prefetching and data alignment to get the best performance in the mem* routines.

One of the reasons why I want to play with ICC is that I think that they have heavily optimized c-runtime libraries, at least for mem*.

Comment 25

10 years ago
(In reply to comment #18)
           37 seconds
> Trunk build          37 seconds
> Trunk:             4.79 seconds (improvements between beta2 and today?)
> Official 2.0.0.11: 6.02 seconds
> 
> The Windows results look reasonable. The Mac OSX results look very
> unattractive.
> 

I can confirm a big disparity between Windows/Mac.  On same hardware using today's trunk running the same test I get:

Vista: 3s
Mac: 24s

You can watch the mac draw the JPEG's - so something is very very wrong.   I've got a series of test images that I'll upload once my machine is back up and running later today.  I'll file a separate bug on that.
(Assignee)

Comment 26

10 years ago
(In reply to comment #25)
> (In reply to comment #18)
>            37 seconds
> > Trunk build          37 seconds
> > Trunk:             4.79 seconds (improvements between beta2 and today?)
> > Official 2.0.0.11: 6.02 seconds
> > 
> > The Windows results look reasonable. The Mac OSX results look very
> > unattractive.
> > 
> 
> I can confirm a big disparity between Windows/Mac.  On same hardware using
> today's trunk running the same test I get:
> 
> Vista: 3s
> Mac: 24s
> 
> You can watch the mac draw the JPEG's - so something is very very wrong.   I've
> got a series of test images that I'll upload once my machine is back up and
> running later today.  I'll file a separate bug on that.

The images that I used in my Windows tests weren't the same as on the Mac tests. What I wanted to show is that Mac Trunk is taking twice as long as Mac 2.0.0.11 while Windows Trunk is faster than Windows 2.0.0.11. I'm pretty sure that jpeg rendering on Mac is quite a bit slower than Windows with both on 2.0.0.11. But those using the Mac want to see something faster than what they have now. Getting something that takes quite as long would be considered a step backwards.

Updated

10 years ago
Depends on: 412396

Comment 27

10 years ago
You want to check bug 411718, where another optimization is in the patch review process for JPEG decoding.

Yet another optimization is to make the JPEG decoder in libjpeg itself convert the colorplanes directly to Cairo format (RGB24, which is actually a DWORD with (B | G>>8 | R>>16). Currently libjpeg converts from colorpanes to packed RGB (3 bytes per pixel) after which the decoder converts packed RGB to Cairo's RGB24.
This can be done adding a new 'color converter' into libjpeg (and that one could then also be SSE2 optimized...)

Comment 28

10 years ago
Comment on attachment 296901 [details] [diff] [review]
Changes after Alfred's comments

A few more minor review-nits:
1. Alignment is in multiple places not correct: e.g jdapimin.c line 53 is indented too much. Indenting according to the overall style in libjpeg is 2 space per indent.
2. For a patch with so many 'whitespace only change' it would be good to also add a whitespace ignorant diff (-ws) for the review of real changes.
3. In jdcolor.c, in my_color_deconverter the four table pointers are no longer required (use #if 0), otherwise space will still be allocated for them.
4. Use the reference to the static table directly instead of via a variable:
     const int * Crrtab = Cr_r_tab; // This is not needed...

With that addressed, R+
Attachment #296901 - Flags: review?(alfredkayser) → review+

Comment 29

10 years ago
I've been comparing Mac versions (pre-this-patch), and am getting 4 secs on fx2 vs 8.5 secs on recent trunk, so consistent wth Michael's numbers. The same test on Safari reports a tiny fraction of a second, I suspect that they are rendering asynchronously to JS execution, but even so the page is coming up in considerably less wall clock time, even from a cold start. (A second run comes up instantaneously, they must cache CGImages in memory or something.)

(Assignee)

Comment 30

10 years ago
(In reply to comment #29)
> I've been comparing Mac versions (pre-this-patch), and am getting 4 secs on fx2
> vs 8.5 secs on recent trunk, so consistent wth Michael's numbers. The same test
> on Safari reports a tiny fraction of a second, I suspect that they are
> rendering asynchronously to JS execution, but even so the page is coming up in
> considerably less wall clock time, even from a cold start. (A second run comes
> up instantaneously, they must cache CGImages in memory or something.)

The Javascript timer doesn't work on Safari and Opera. I believe that they report "done" when the screen is finished painting which obviously is only a part of the rendering of images.

Comment 31

10 years ago
(In reply to comment #30)
> (In reply to comment #29)
> > I've been comparing Mac versions (pre-this-patch), and am getting 4 secs on fx2
> > vs 8.5 secs on recent trunk, so consistent wth Michael's numbers. The same test
> > on Safari reports a tiny fraction of a second, I suspect that they are
> > rendering asynchronously to JS execution, but even so the page is coming up in
> > considerably less wall clock time, even from a cold start. (A second run comes
> > up instantaneously, they must cache CGImages in memory or something.)
> 
> The Javascript timer doesn't work on Safari and Opera. I believe that they
> report "done" when the screen is finished painting which obviously is only a
> part of the rendering of images.
> 

The onload handler does fire on Safari before the images are decoded/painted.   The scoll event I put in the testcase (in the URL I provided) should force Safari to paint (and thus decode) but I wouldn't worry to much about those #'s.  The 2->3 FF numbers are the ones to watch.

Comment 32

10 years ago
So the good news is that the patch applies cleanly to my trunk build, and seems to work correctly on OS X 10.4. The bad news is that I'm not seeing any noticeable speedup. Out of paranoia, I ran it under the debugger and confirmed that indeed it was considering SSE2 to be available, executing the intrinsics code in ycc_rgb_convert, and that the code generation looked reasonable.

Random comment: The imgITools helpers I recently added might be useful for testing here; the decodeImageData() routine basically works directly with the image decoders.

For example, see:

http://mxr.mozilla.org/seamonkey/source/modules/libpr0n/test/unit/test_imgtools.js#174
(Assignee)

Comment 34

10 years ago
(In reply to comment #32)
> So the good news is that the patch applies cleanly to my trunk build, and seems
> to work correctly on OS X 10.4. The bad news is that I'm not seeing any
> noticeable speedup. Out of paranoia, I ran it under the debugger and confirmed
> that indeed it was considering SSE2 to be available, executing the intrinsics
> code in ycc_rgb_convert, and that the code generation looked reasonable.

What are you using for a benchmark? I'm seeing the improvement greatly reduced with the other Mac OSX issue around. I'll try the patch on 2.0.0.11 to see if it applies. It should be easier to see the difference without the other noise.

Comment 35

10 years ago
I've got a stash of 25 500K-1M plant photos (read: noisy), in your benchmark framework. I should try dialing it up to 100, in case the perf difference is too small to show clearly - I am getting some random jumping around between 2nd, 3rd, 4th, etc runs, so maybe my macbook is being churned by something else.

Comment 36

10 years ago
(In reply to comment #35)
> I've got a stash of 25 500K-1M plant photos (read: noisy), in your benchmark
> framework. I should try dialing it up to 100, in case the perf difference is
> too small to show clearly - I am getting some random jumping around between
> 2nd, 3rd, 4th, etc runs, so maybe my macbook is being churned by something
> else.
> 

Given the profile in the dependent bug:

40.3% CoreGraphics argb32_image_mark_rgb32
24.7% CoreGraphics resample_byte_h_3cpp_vector
8.5%  CoreGraphics decode_byte_8bpc_3
6.0%  CoreGraphics decode_data
5.1%  XUL          jpeg_idct_islow
4.1%  XUL          ycc_rgb_convert
2.8%  XUL          decode_mcu
1.1%  XUL          h2v1_fancy_upsample


I'm not surprised you are seeing no advantages on mac.  I'd suggest we fix that bug first, then see what this does.
(Assignee)

Comment 37

10 years ago
There is clearly a very wide performance variation in this bug. The 3.0 patch didn't apply against 2.0.0.11 as there were two failures. I have limited time on the Mac in the evenings so I didn't press for the machine.

If you want to see the difference in performance on the Mac, I can only point you to http://www.vector64.com/elliott.html where you can compare the 2.0.0.6 build to the official 2.0.0.* build. That build has the kitchen sink in it including jdcolor.

Comment 38

10 years ago
Michael, possibly this patch is better split into two: static tables and SSE2 optimization.

Even more, check out bug 412753 where I moved this YCrCb conversion to the JPEG decoder (to allow to generate directly into Ciaro format). It would be even better to apply SSE2 tricks that direct conversion.
(Assignee)

Comment 39

10 years ago
I think that it would make more sense to optimize the color code in nsJPEGDecoder.cpp than it would here as the routine in jdcolor.c will never get called. Perhaps it would be better to change this bug into just doing the static tables and putting in the plumbing for portable SSE2.

What I wanted to do was to get in the infrastructure and color code and then do portable idct and maybe jdsample in other transactions. And follow that up by moving the SSE2 infrastructure out of JPEG and into a common area of code that could be used for other areas of code such as PNG.

Regarding libjpeg, when that option is used, are the pointers changed to use different routine names for the jpeg routines or does the mozilla copy of jpeg not get compiled? If the latter, then we'd need to put the SSE2 infrastucture somewhere else right away.

Comment 40

10 years ago
Having the optimized color conversion code in nsJPEGDecoder makes it so that that code will be used in both scenarios: compiled/linkedin libjpeg v. a system linked libjpeg. (note as far as I know the 'system linked libjpeg' is only done on Linux).
(Assignee)

Comment 41

10 years ago
Created attachment 297746 [details] [diff] [review]
Prototype nsJPEGDecoder SSE2 color optimization.

This patch contains Alfred's changes from Bug 411718 and Bug 412753 and SSE2 optimization of the latter bug. It doesn't have platform stuff though which I'm working on next. This code will likely go into another Bug. Overall performance improvement from Alfred's two bugs, the static table stuff here and the SSE2 optimization of nsJPEGDecoder is 30% using my old benchmark.

Comment 42

10 years ago
Comment on attachment 296901 [details] [diff] [review]
Changes after Alfred's comments

is it possible to get this patch without all the tab removal changes?  it is kind of a pain to review with them. (do we want them at all?)

Comment 43

10 years ago
also, should we use the cpuid intrinsic with msvc2005 rather than the inline assembly?  probably doesn't matter.
(Assignee)

Comment 44

10 years ago
(In reply to comment #42)
> (From update of attachment 296901 [details] [diff] [review])
> is it possible to get this patch without all the tab removal changes?  it is
> kind of a pain to review with them. (do we want them at all?)

I was taking them all out this morning before leaving for the office. I was going to ask Alfred to give me a pass on the tab/space stuff given that this was an external contribution and it appears that changes to the code since then haven't been untabified.

The code change will be a lot simpler without the color code as that is moving to nsJPEGDecoder.cpp. This change is just going to be the static tables and intrinsics infrastructure. Unfortunately I will have to replicate that in nsJPEGDecoder (or somewhere else if that's more appropriate).

I'll look into the CPUID intrinsic. It would certainly make the code a lot neater. As long as there are no objections to not running on compilers below
VS2005.

(Assignee)

Comment 45

10 years ago
Created attachment 298016 [details] [diff] [review]
SSE2 Intrinsics Support and Static Tables

I removed the space changes (converting tabs to spaces) and left the original spacing as intact as I could and what's left is infrastructure for intrinsics and using static tables instead of dynamic tables. I can redo the spacing stuff if it's required to get this in.
Attachment #296901 - Attachment is obsolete: true
Attachment #296901 - Flags: superreview?(pavlov)
(Assignee)

Comment 46

10 years ago
Created attachment 298019 [details] [diff] [review]
nsJPEGDecoder.cpp SSE2 jdcolor (cairo) optimization

This patch includes Alfred's code to do certain color/cairo conversion in nsJPEGDecoder.cpp.
Attachment #297746 - Attachment is obsolete: true
(Assignee)

Comment 47

10 years ago
Does Alfred need to review this again or does it go to someone else now?

Updated

10 years ago
Attachment #298019 - Flags: review?(pavlov)

Comment 48

10 years ago
(In reply to comment #47)
> Does Alfred need to review this again or does it go to someone else now?
> 

Let's have pav take the final review..

Comment 49

10 years ago
Comment on attachment 298019 [details] [diff] [review]
nsJPEGDecoder.cpp SSE2 jdcolor (cairo) optimization

rather than ycc_rgb_convert_cairo and mUseCairo, use argb/ARGB instead of 'cairo'?

this patch looks fine, but it is hard to tell what is in this patch vs what is in alfred's patch in another bug.  Would be great to split this patch out.
(Assignee)

Comment 50

10 years ago
Sorry, the patch to review was "SSE2 Intrinsics Support and Static Tables" which is the second to last patch. The last one has the changes that include Alfred's patches which I was just saving here until Alfred's two patches show up. The "SSE2 Intrinsics Support and Static Tables" patch is orthogonal to Alfred's two changes and the changes in the last patch

Updated

10 years ago
Attachment #298019 - Flags: review?(pavlov)

Comment 51

10 years ago
Comment on attachment 298016 [details] [diff] [review]
SSE2 Intrinsics Support and Static Tables

Adjusting review flags
Attachment #298016 - Flags: review?(pavlov)

Comment 52

10 years ago
Comment on attachment 298016 [details] [diff] [review]
SSE2 Intrinsics Support and Static Tables

can you add some comments around the if 0s or perhaps remove them entirely?  aside from that this looks fine.
Attachment #298016 - Flags: review?(pavlov) → review+
Is the output from this code expected to be 100% pixel identical to the existing code? Or might there be +/- 1 rounding differences? I have a set of JPEGs to use for tests (bug 411626), but they'll be compared against reference PNGs.
(Assignee)

Comment 54

10 years ago
(In reply to comment #53)
> Is the output from this code expected to be 100% pixel identical to the
> existing code? Or might there be +/- 1 rounding differences? I have a set of
> JPEGs to use for tests (bug 411626), but they'll be compared against reference
> PNGs.

This code is expected to be 100% pixel identical to the existing code but this code doesn't do that much. It just takes the dynamically generated tables (which are the same for every image) and puts them into a static array. The size and type of the static array is the same as that of the dynamically generated array as the static array was generated from the output of the dynamically generated array. The rest of the changes just adds intrinsics infrastructure.

The color changes to go in later may have +/- 1 rounding differences as they use 16-bit intermediate calculations instead of 32-bit intermediate calculations for addition and subtraction. Multiplication uses 32-bit intermediate calculations but results are stored in 16-bits. This is how you get the parallelism of doing eight operations at the same time. This kind of processing is already in the Windows code for systems supporting SSE2 in the form of the current Inverse Discrete Cosine Transform code.
(Assignee)

Comment 55

10 years ago
(In reply to comment #52)
> (From update of attachment 298016 [details] [diff] [review])
> can you add some comments around the if 0s or perhaps remove them entirely? 
> aside from that this looks fine.

Will do.

(Assignee)

Comment 56

10 years ago
Created attachment 298856 [details] [diff] [review]
Comment added for #if 0

There are three #if 0 statements that I added and one didn't have a comment so I added a comment there. On one of the others, I moved the comment to before the #if 0.
Attachment #298016 - Attachment is obsolete: true
Keywords: checkin-needed

Updated

10 years ago
Attachment #298856 - Flags: superreview+
Attachment #298856 - Flags: review+

Comment 57

10 years ago
Some more minor comments:

1. in jdcolor.c:
  /* These fields are needed anymore as these are now static tables */
--> not needed anymore

2. The comment on line 241 is no longer valid:
    241  * We assume build_ycc_rgb_table has been called.

3. In ycck_cmyk_convert:
convert: 
    outptr[0] = range_limit[MAXJSAMPLE - (y + Cr_r_tab[cr])]
    etc...
also to:
    range_limit_y = range_limit + MAXJSAMPLE - y;
    outptr[0] = range_limit_y + Cr_r_tab[cr];
    etc...
    	
4. In prepare_range_limit_table (j_decompress_ptr cinfo)
  cinfo->sample_range_limit = (JSAMPLE *) static_range_table + (MAXJSAMPLE+1);;
Remove the second ;
(Assignee)

Comment 58

10 years ago
Created attachment 298929 [details] [diff] [review]
Fixes to Alfred's latest comments (4)

The four items have been addressed. I wasn't able to find an image to test the ycck_cmyk_convert change as it seems to be pretty rare on the web.
Attachment #298856 - Attachment is obsolete: true

Comment 59

10 years ago
Comment on attachment 298929 [details] [diff] [review]
Fixes to Alfred's latest comments (4)

Final Review..
Attachment #298929 - Flags: review?(pavlov)

Updated

10 years ago
Attachment #298929 - Flags: review?(pavlov) → review+

Updated

10 years ago
Flags: in-testsuite?
Leaving open for the second part of the fix. Please address all nits and request review.

Checking in jpeg/jconfig.h;
/cvsroot/mozilla/jpeg/jconfig.h,v  <--  jconfig.h
new revision: 3.12; previous revision: 3.11
done
Checking in jpeg/jdapimin.c;
/cvsroot/mozilla/jpeg/jdapimin.c,v  <--  jdapimin.c
new revision: 3.8; previous revision: 3.7
done
Checking in jpeg/jdcolor.c;
/cvsroot/mozilla/jpeg/jdcolor.c,v  <--  jdcolor.c
new revision: 3.3; previous revision: 3.2
done
Checking in jpeg/jddctmgr.c;
/cvsroot/mozilla/jpeg/jddctmgr.c,v  <--  jddctmgr.c
new revision: 3.4; previous revision: 3.3
done
Checking in jpeg/jdmaster.c;
/cvsroot/mozilla/jpeg/jdmaster.c,v  <--  jdmaster.c
new revision: 3.3; previous revision: 3.2
done
Checking in jpeg/jmorecfg.h;
/cvsroot/mozilla/jpeg/jmorecfg.h,v  <--  jmorecfg.h
new revision: 3.20; previous revision: 3.19
done
Status: NEW → ASSIGNED
Keywords: checkin-needed
Target Milestone: --- → mozilla1.9 M11

Comment 61

10 years ago
(In reply to comment #60)
> Leaving open for the second part of the fix. Please address all nits and
> request review.
> 

What second part?
(Assignee)

Comment 62

10 years ago
The original patch did SSE2 color processing, added general SSE2 intrinsics support and put in static tables instead of dynamic tables. Alfred has a change pending that moves the color processing and byteswap into nsJPEGDecoder.cpp so I changed this patch to just do SSE2 intrinsics support and static tables - I took out the color SSE2 processing. After Alfred's change goes through (improve performance by 10%), I will add the SSE2 color processing to Alfred's combined code in nsJPEGDecoder.cpp.

I was planning to do that in another bug but can do it here.

Comment 63

10 years ago
Is there a bug for the Mac jpeg performance regression?

Comment 64

10 years ago
(In reply to comment #63)
> Is there a bug for the Mac jpeg performance regression?
> 

bug 412396
(In reply to comment #62)
> I was planning to do that in another bug but can do it here.

Actually, another bug is fine. I'll resolve this bug then, and you can open a new bug.
Status: ASSIGNED → RESOLVED
Last Resolved: 10 years ago
Resolution: --- → FIXED

Updated

10 years ago
Blocks: 414072
Depends on: 414042

Comment 66

9 years ago
the recent addition of this line:
jdapimin.c:
  #include "intrin.h"   

Is *busting* my build. My system does not have this include file ...

I had to add an: 

  #undef HAVE_MMX_INTEL_MNEMONICS 

 to work around ... 

Please adviese ...

Comment 67

9 years ago
I think we have to move the SSE2 detection code to somewhere else, as the mozilla/jpeg code is not used when compiling with system-jpeg.
A good place would be to put the SSE2 detection is in gfxPlatform.h/cpp.
All image/rendering code include gfxPlatform.
See also bug 414947.
(Assignee)

Comment 68

9 years ago
Should we have it in the JPEG library as well as gfxPlatform.h? It would be nice to call it from browser initialization so that it is called before anything else uses it and so that we don't have to test whether the init was called each time.

Comment 69

9 years ago
For code inside libjpeg that does SSE2, the check is needed inside libjpeg.
For code inside image decoders and libraries we need to have this check in gfxPlatform.

See also bug 415054.

Note, Michael, it looks as if the code in the block copy/convert in bug 414947 is something that could be SSE2 enabled (as a followup bug to that one).
(Assignee)

Comment 70

9 years ago
Is there initialization code for GFX that gets called on browser startup where we can put the CPUID check? I would rather do it once instead of checking to see that we sniffed cpu capabilities for every use of SSE2 in GFX.
gfxPlatform

Comment 72

9 years ago
(In reply to comment #15)
> Created an attachment (id=296230) [details]
> Image benchmark framework

Appropo of nothing:

I populated this HTML file with 2600 JPEG images, pulled from my web proxy logs after 10 days of casual browsing.  After several runs on SeaMonkey v1.1.8/Linux, this is what the high-level system profiling looks like:

samples  %        app name                 symbol name
62231    24.1899  libnecko.so              nsMemoryCacheDevice::EvictEntriesIfNecessary()
23650     9.1930  libjpeg.so.62.1.0        decode_mcu
16191     6.2936  libgdk-x11-2.0.so.0.1000.14 (no symbols)
15001     5.8311  libfb.so                 (no symbols)
9709      3.7740  libc-2.6.so              memcpy
9013      3.5035  libjpeg.so.62.1.0        ..@16.bs
7583      2.9476  libjpeg.so.62.1.0        jpeg_idct_islow_sse2.column_end
6376      2.4784  libjpeg.so.62.1.0        jpeg_idct_islow_sse2.columnDCT
6301      2.4493  Xorg                     (no symbols)
3614      1.4048  nvidia_drv.so            (no symbols)
3374      1.3115  libgklayout.so           nsLineBox::IndexOf(nsIFrame*) const
3147      1.2233  libqt-mt.so.3.3.8        (no symbols)
2823      1.0973  nvidia                   (no symbols)
2795      1.0864  libc-2.6.so              _int_malloc

I just mention this because I'm intrigued by the amount of time spent in disk cache maintainance, nearly as much time as is spent overall in the JPEG library.  Hmmm...

Comment 73

9 years ago
(In reply to comment #72)
> (In reply to comment #15)
> > Created an attachment (id=296230) [details] [details]
> > Image benchmark framework
> 
> Appropo of nothing:
> 
> I populated this HTML file with 2600 JPEG images, pulled from my web proxy logs
> after 10 days of casual browsing.  After several runs on SeaMonkey
> v1.1.8/Linux, this is what the high-level system profiling looks like:
> 
> samples  %        app name                 symbol name
> 62231    24.1899  libnecko.so             
> nsMemoryCacheDevice::EvictEntriesIfNecessary()

> I just mention this because I'm intrigued by the amount of time spent in disk
> cache maintainance, nearly as much time as is spent overall in the JPEG
> library.  Hmmm...
> 

Filed bug 417154 to capture this.
Depends on: 520672
You need to log in before you can comment on or make changes to this bug.