Closed Bug 573948 Opened 14 years ago Closed 13 years ago

Replace libjpeg with libjpeg-turbo

Categories

(Core :: Graphics: ImageLib, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla5
Tracking Status
firefox5 + fixed
blocking2.0 --- -
status2.0 --- ?

People

(Reporter: atzaus, Assigned: justin.lebar+bug)

References

(Blocks 3 open bugs)

Details

Attachments

(6 files, 12 obsolete files)

1.67 MB, patch
jrmuizel
: review+
Details | Diff | Splinter Review
3.53 KB, patch
Details | Diff | Splinter Review
869.19 KB, text/plain
Details
195.00 KB, text/plain
Details
2.57 KB, patch
Details | Diff | Splinter Review
2.80 KB, patch
jrmuizel
: review+
Details | Diff | Splinter Review
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4
Build Identifier: 

To quote the URL
"libjpeg-turbo is a high-speed version of libjpeg  for x86 and x86-64 processors which uses SIMD instructions (MMX, SSE2, etc.) to accelerate baseline JPEG compression and decompression. libjpeg-turbo is generally 2-4x as fast as the unmodified version of libjpeg, all else being equal.

libjpeg-turbo was originally based on libjpeg/SIMD by Miyasaka Masaru, but the TigerVNC and VirtualGL projects made numerous enhancements to the codec, including improved support for Mac OS X, 64-bit support, support for 32-bit and big endian pixel formats (RGBA, ABGR, etc.), accelerated Huffman encoding/decoding, and various bug fixes. The goal was to produce a fully open source codec that could replace the partially closed source TurboJPEG/IPP codec used by VirtualGL and TurboVNC. libjpeg-turbo generally achieves 80-120% of the performance of TurboJPEG/IPP. It is faster in some areas but slower in others."

Fedora is already working on incorporating it into their distribution.
http://fedoraproject.org/wiki/Features/libjpeg-turbo

The license is the wxWindows license for some parts and the LGPL for the rest.

This would close a lot of bugs regarding libjpeg, SSE optimization, and 64-bit porting.

Reproducible: Always
Blocks: 477728, 475225
You'll probably need to consult with licensing@mozilla.org about the license. See guidelines here: http://www.mozilla.org/MPL/license-policy.html

When looking at third-party libraries for video decoding and YCbCr conversion, the LGPL was an issue.
Using libjpeg-simd is probably an easier option here than libjpeg-turbo license-wise.
According to
http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/trunk/README-turbo.txt

"Some of the optimizations to the Huffman encoder/decoder were borrowed from
VirtualGL, and thus the libjpeg-turbo distribution, as a whole, falls under the
wxWindows Library Licence, Version 3.1.  A copy of this license can be found in
this directory under LICENSE.txt.  The wxWindows Library License is based on
the LGPL but includes provisions which allow the Library to be statically
linked into proprietary libraries and applications without requiring the
resulting binaries to be distributed under the terms of the LGPL.

The rest of the source code, apart from these modifications, falls under a less
restrictive, BSD-style license (see README.)"

So it seems that the wxWindows license is a more liberal version of the LGPL and the rest is BSD (I was mistaken in the initial posting).
(In reply to comment #2)
> Using libjpeg-simd is probably an easier option here than libjpeg-turbo license
> wise.

Sorry for the bug spam

Regarding libjpeg-simd

From the Google translation of http://cetus.sakura.ne.jp/softlab/jpeg-x86simd/jpegsimd.html

"In addition, AMD64 (EM64T/Intel64) in 64bit mode environment (currently) not supported."

"AMD64 compatible version is under construction...(October 26, 2006: The production process is currently stopped. Seems slow release schedule. Sorry.)"

Jpeg-simd would have to be ported to 64-bit which from what I have read is a cumbersome task.
(In reply to comment #4)
> Jpeg-simd would have to be ported to 64-bit which from what I have read is a
> cumbersome task.

From the diffs between the 64-bit version and the 32-bit version it seems that the process would be mostly mechanical. In fact, it should be possible to use the same source for 64-bit as 32-bit like libvpx does. Further, the 64-bit versions in libjpeg-turbo should be under the same license as jpeg-simd so we should be able to just use them directly.
blocking2.0: --- → ?
Hi, all.  I'm the maintainer and one of the creators of libjpeg-turbo, so I thought I'd throw in some comments.  For starters, libjpeg-turbo is not licensed under pure LGPL.  It uses the wxWindows Library License, which eliminates some of the viral provisions of the LGPL.  My understanding of the differences between the two is spelled out here:

http://www.virtualgl.org/About/License

I have confirmed with the wxWindows developers who created the license that this matches their understanding as well.  Essentially, it is our understanding that wxWindows nullifies Sections 6-7 of LGPL v2.1, which means you can statically link the Library with any application without requiring the application as a whole to fall under the provisions of the LGPL.  You are still bound by other sections of the LGPL, which means you would still be required to distribute source for the Library along with the application.

That being said, however, I can understand the desire for a fully unencumbered version of the source.  The only part of the code which is licensed under wxWindows is the Huffman encoding/decoding optimizations (portions of jchuff.c and jdhuff.c.)  These were developed while I was a Sun employee, so unfortunately I don't have the right to dual license them, or else I certainly would do so.

What I could do, however, is develop a mechanism whereby you could easily build the code without the Huffman optimizations ('configure --without-huffman-opt' or something like that.)  This would simply involve building libjpeg-turbo with the "stock" versions of jchuff.c and jdhuff.c from libjpeg-6b.  I could even go one step farther and make it so that the wxWindows-licensed files are not included in the distribution tarball if you 'configure --without-huffman-opt' and then 'make dist'.

This would slow down the overall performance by 20-25%, but it will still be miles faster than regular libjpeg on x86/x86-64 systems.
What DRC suggests seems like the smart thing to do, to begin with. Let's take the parts which are unambiguously BSD and still provide a significant win. 

We currently have a policy of only taking code which is compatible with all three of our licenses. We'd have to do some analysis on the wxWindows licence to see whether it fits our policy. (Please, people, stop rolling your own licences! :-) If someone could open another bug for that, that would be great.

But if it's not compatible in that way, I think it's highly unlikely we'd want to introduce a new licence into the Mozilla mix just for a 25% win in JPEG decoding (particularly if we'd already recently had a big win over what we'd been using for years).

Gerv
wxWindows is actually a very mature license at this point-- it just isn't well known.  It was created to solve a specific problem -- allowing static linking of an open source library with proprietary code without forcing the proprietary application to fall under the terms of the LGPL, but also guaranteeing the open nature of the code (which BSD-style licenses do not do.)  OpenSceneGraph uses it, and that's where I learned about it originally.

Let me know if there's anything I can do to facilitate this on my end.  If you're just going to merge the code into your build tree, then you can just as easily replace jchuff.c and jdhuff.c with their original versions when you do that, but I'm still willing to add a configure switch to libjpeg-turbo as well.
(In reply to comment #8)
> Let me know if there's anything I can do to facilitate this on my end.  If
> you're just going to merge the code into your build tree, then you can just as
> easily replace jchuff.c and jdhuff.c with their original versions when you do
> that, but I'm still willing to add a configure switch to libjpeg-turbo as well.

This seems like it might be best option.
Has any work been done on ARM or other platforms?
(In reply to comment #10)
> Has any work been done on ARM or other platforms?

One thing I've been meaning to do is switch mobile to use the IFAST dct instead of the ISLOW dct. Android uses IFAST and in fact has an ARM implementation of the IFAST dct. Unfortunately, it is licensed under the Apache license. Perhaps we can convince Google to relicense it?
(In reply to comment #10)
> Has any work been done on ARM or other platforms?

We should write NEON SIMD code if we change to libjpeg-turbo.


(In reply to comment #11)
> (In reply to comment #10)
> > Has any work been done on ARM or other platforms?
> 
> One thing I've been meaning to do is switch mobile to use the IFAST dct instead
> of the ISLOW dct. Android uses IFAST and in fact have an ARM implementation of
> the IFAST dct. Unfortunately, it is licensed with the apache license. Perhaps
> we can convince Google to relicense it?

Last year, Michael Moy tried to change to IFAST on some platforms in bug 416157, but many test cases failed.  If we change to IFAST, we will have to modify many test cases, too.
DRC: the problem is not just incompatibility, but simplicity. A library under the straight LGPL would be compatible with the MPL, the LGPL and the GPL (our three licences) but would increase the licensing complexity of our codebase, something we have historically tried to avoid.

Gerv
So, just so I understand, inbound code to the Mozilla project has to either be licensed under the tri-license or something less restrictive that can be re-licensed under the tri-license?  So if it were possible to re-license the offending code in libjpeg-turbo under your tri-license, then that would solve the problem?
Our goal is that people should be able to redistribute all the code that goes into our products under the terms of any one of the MPL, LGPL, or GPL. So:
-- code that is licensed with a license compatible with all of those (e.g., BSD) is fine
-- code that is licensed with multiple license options that cover the three above licenses is also fine (e.g., dual licensed MPL/LGPL, since the LGPL option is compatible with the GPL as well)

Note that we don't relicense other people's code. E.g., when we import BSD-licensed code into our tree, it keeps the original license.
Unless I'm missing something fundamental, I think wxWindows-licensed code could be legally redistributed under all three licenses.  However, I fully understand the desire not to introduce new licenses into your code tree.

One question-- do you currently copy the libjpeg code into the Mozilla source or do you build it separately?
(In reply to comment #16)
> One question-- do you currently copy the libjpeg code into the Mozilla source
> or do you build it separately?

It's in our tree, but it's not by any means the original libjpeg, as it's been patched to pieces.

http://mxr.mozilla.org/mozilla-central/source/jpeg/
OK, then it seems as if merging in the "turbo" features (minus the Huffman coding) into your existing code tree would be the way to go.
Status: UNCONFIRMED → NEW
Ever confirmed: true
(In reply to comment #13)
> DRC: the problem is not just incompatibility, but simplicity. A library under
> the straight LGPL would be compatible with the MPL, the LGPL and the GPL (our
> three licences) but would increase the licensing complexity of our codebase,
> something we have historically tried to avoid.

Original license of SIMD code is http://cetus.sakura.ne.jp/softlab/jpeg-x86simd/jpegsimd.html#conditions

In Japanese, "この SIMD 拡張版 IJG JPEG software の使用条件については、オリジナル版の IJG JPEG software の使用条件が適用されます。"

In English by Google Translate, "This SIMD extension for IJG JPEG software terms of use, the original version of the IJG JPEG software is subject to terms of use."

So the original author, MIYASAKA-san, says that this code is under the same license as libjpeg.  I believe the license is no problem if we port MIYASAKA-san's code to our tree instead of using libjpeg-turbo.

Gerv, what do you think?
Why does the idea of using the old libjpeg/SIMD code keep coming up in this discussion?  This has already been covered multiple times-- libjpeg/SIMD does not have 64-bit support or many of the other bug fixes we made to libjpeg-turbo.

libjpeg-turbo also bears the same license as libjpeg, if you don't use our accelerated version of jchuff.c and jdhuff.c.

In short-- merge in libjpeg-turbo, minus jchuff.c and jdhuff.c, and there is no licensing problem.
There seems to be some confusion around what should happen here... the right thing, I think, is:

1. Look at our in-tree libjpeg and how it differs from stock libjpeg.

2. Based on that, either merge the delta from libjpeg -> libjpeg-turbo into our tree, OR merge delta of stock libjpeg and in-tree libjpeg into libjpeg-turbo, and then pull that in.

(minus j[cd]huff.c, of course).

I don't really know how extensive the changes in #1 are, and I don't know what, if any, conflicts there would be between the changes for -turbo and our tree version of libjpeg.  But I think it's worth it to find out; the perf wins wouldn't hurt as we try to do decode on draw, and it sounds like there are some pretty significant ones here.
Forgot to ask -- who wants to make a patch? :-)
After a cursory look at the code, it seems like libjpeg-turbo may want to borrow Mozilla's enhancements, where applicable.  In particular, the use of static tables in choice areas like RGB->YUV conversion is likely to improve our performance as well.

Also, there is some of the code which is going to conflict,  because your version already has some SIMD acceleration.  I think that most of it could be replaced with ours, but the areas in which you use SSE2 would probably need to be benchmarked against our version to see which is faster (integer IDCT, for instance.)
(In reply to comment #23)
> After a cursory look at the code, it seems like libjpeg-turbo may want to
> borrow Mozilla's enhancements, where applicable.  In particular, the use of
> static tables in choice areas like RGB->YUV conversion is likely to improve our
> performance as well.
> 
> Also, there is some of the code which is going to conflict,  because your
> version already has some SIMD acceleration.  I think that most of it could be
> replaced with ours, but the areas in which you use SSE2 would probably need to
> be benchmarked against our version to see which is faster (integer IDCT, for
> instance.)

Our integer SSE2 idct is buggy (bug 477728), so while it would be nice to see how the performance compares, I think I would still switch to the libjpeg-turbo one even if ours is faster.
Blocks: jpeg-perf
(In reply to comment #22)
> Forgot to ask -- who wants to make a patch? :-)

I'll try.
Status: NEW → ASSIGNED
(In reply to comment #6)
> What I could do, however, is develop a mechanism whereby you could easily build
> the code without the Huffman optimizations ('configure --without-huffman-opt'
> or something like that.)

My guess is that we'll not want to include j{c,d}huff.c in our source tree at all, if they're encumbered with an incompatible license.  Is this right?
jlebar: that seems wisest to me.

Gerv
blocking2.0: ? → -
Assignee: nobody → justin.lebar+bug
To the build team: Are we willing to add a dependency on nasm so we can get this library?  If I understand correctly, we'd need nasm on both Windows and *nix.
Can you get it building with YASM? YASM is already a build requirement everywhere except Win32. It claims to support nasm-compatible syntax.
I think yasm will work.
I have something which is building and running (with yasm) on x64.  I'd love to run some performance tests so we can see whether this is worth investing in further.

I figure I should try and run tprender tests, as in

  dist/bin/firefox -tprender -tp manifest -tpformat text

but this runs the test and produces no output.

Can anyone suggest a way to make this work or an alternative benchmarking methodology?
Attached file JPEG loading test (obsolete) —
I used this to test some of mmoy's JPEG patches back in the day. It's got a bit of scatter, but I found that it was reasonably consistent if I did 10-20 runs (unfortunately, manually on each one). The images you see referenced in the file were just a bunch of ~3MB pictures from a digital camera.
(In reply to comment #31)
> I have something which is building and running (with yasm) on x64.  I'd love to
> run some performance tests so we can see whether this is worth investing in
> further.
> 
> I figure I should try and run tprender tests, as in
> 
>   dist/bin/firefox -tprender -tp manifest -tpformat text

I don't know what tprender does so I'm not sure it's useful. If you post a patch I can get some tp4 (page load benchmark) numbers to compare against our current baseline.
Trender isn't going to help because that's just repaint speed -- it runs after jpeg decode has already been done.  I think you'll have a hard time coming up with an in-browser benchmark, though decoding a number of (very) large images is probably the best you can do.
Depends on: 580518
Okay.  I filed adding timers for the image decoder as bug 580518.
Some preliminary performance results, on Linux x64.  We're comparing unvectorized libjpeg to libjpeg-turbo's vectorized 64-bit code.  In brief, it appears that the part of decoding that was vectorized runs 2.3x as fast as the unvectorized part.

These benchmarks were run on a quad-core Xeon W3520 (2.67GHz) running Ubuntu 10.04, kernel 2.6.32-23-generic.

Methodology:

$ perf record -ig dist/bin/firefox -tp manifest
    (manifest contains the path to a local copy of [1], which weighs in at 3,888 × 2,592 pixels, 2.95 MB)
$ perf report -g graph

If you're going to try this yourself, note that you need to save the result of perf report before rebuilding.  Your generated perf.data will be invalid (and unreadable) as soon as you modify the binaries.

perf only tells us how long we spend in a function relative to all the other functions in the program.  To work around this, I added a spin loop to the startup sequence.  The loop always does the same amount of work, so we can compare the fraction of our time spent spinning to the fraction of time spent in the JPEG decoder to determine the speedup.
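
For reference, the spin loop only needs to burn a fixed, deterministic amount of CPU.  A minimal sketch of such a loop (the actual SpinBriefly() in the test patch may well differ):

  /* Hypothetical fixed-work spin loop used only as a profiling yardstick.
   * The volatile accumulator keeps the compiler from optimizing the loop
   * away, so every run performs the same amount of work. */
  static void SpinBriefly(void)
  {
    volatile unsigned int sink = 0;
    unsigned int i;
    for (i = 0; i < 500000000u; i++)
      sink += i;
  }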

Original:

    11.93%  jpeg_idct_islow
     8.94%  SpinBriefly()
     8.12%  decode_mcu
     7.59%  qcms_transform_data_rgb_out_lut_sse2
     5.71%  ycc_rgb_convert
     2.37%  0x000000001d2170 (ld.so)
     2.31%  jpeg_fill_bit_buffer
     1.97%  __pthread_mutex_unlock_internal

libjpeg-turbo:

    10.84%  SpinBriefly()
     9.49%  qcms_transform_data_rgb_out_lut_sse2
     7.04%  decode_mcu
     6.29%  0x00000000f90c10 (libxul.so)
     2.41%  0x000000001c8e80 (ld.so)
     2.36%  nsJPEGDecoder::OutputScanlines(int*)
     2.14%  __pthread_mutex_unlock_internal
     1.97%  pthread_mutex_lock

I think the unidentified libxul function in the libjpeg-turbo profile is libjpeg-turbo's vectorized jpeg_idct_islow routine.  If so, we can conclude that the speedup for this function is (11.93/8.94) / (6.29/10.84) = 2.30.  The speedup for the JPEG decoder as a whole is of course smaller.

[1] http://commons.wikimedia.org/wiki/File:Schloss_Lichtenstein_04-2010.jpg
Justin,

The comparison you made would only be valid if the two tests were equal in terms of total execution time, but my read of the above is that the amount of work done was the same but the execution time for both tests was different.  Is that correct?
The assumption is that SpinBriefly() takes the same amount of time in both tests.  I don't actually know that -- all I know is that it does the same work.  It's the same code in both cases.  But with the assumption, I think we can compare execution times by normalizing the reported percentages against the percentage reported for SpinBriefly.

To illustrate, say we knew that SpinBriefly() takes 1s.  Then in the original case, we could say that 8.94% = 1s, so 11.93% = (11.93/8.94)s = 1.3s.  We'd similarly compute that the 6.29% reported for libjpeg-turbo counts for (6.29/10.84)s = .58s.  And then we'd conclude that our speedup is 1.3s / .58s = 2.3 as before.  But of course we don't need to know how long SpinBriefly is in order to do this computation.
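
In other words, each function's percentage is normalized against SpinBriefly's percentage from the same run to get comparable time units.  A tiny helper illustrating the arithmetic (values taken from the profiles above):

  /* Speedup of a function between two profiles, using SpinBriefly() as the
   * yardstick: normalize each percentage by the SpinBriefly() percentage
   * from the same run, then divide old by new. */
  static double speedup(double old_pct, double old_spin_pct,
                        double new_pct, double new_spin_pct)
  {
    return (old_pct / old_spin_pct) / (new_pct / new_spin_pct);
  }

  /* speedup(11.93, 8.94, 6.29, 10.84) ~= 2.3, matching the figure above. */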
Fair enough.  My other question is why the slow IDCT is the only accelerated function.  We've established that the accelerated Huffman decode can't be used, but aren't there other functions that can be accelerated using libjpeg-turbo?
(In reply to comment #39)
> My other question is why the slow IDCT is the only accelerated function.

Good question.  Here's where the other functions in the profile live:

qcms_transform_data_rgb_out_lut_sse2 : gfx/qcms/qcmsint.c
decode_mcu : jpeg/jdhuff.c
nsJPEGDecoder::OutputScanlines(int*) : libpr0n/decoders/jpeg/nsJPEGDecoder.cpp

So only the slow IDCT and decode_mcu are functions in libjpeg.  I was actually testing with libjpeg-turbo's jdhuff implementation here, and you can see that decode_mcu is substantially faster with libjpeg-turbo: (8.12/8.94) / (7.04/10.84) = 1.40x speedup.
On 32-bit, yasm .8 (which is what Debian is shipping) hits an infinite loop.  yasm 1.0 seems to work.

I imagine we wouldn't be willing to require yasm 1.0.  But maybe we could disable JPEG SIMD if the machine's yasm isn't new enough.
Pretty sure we can require yasm 1.0, especially if there are easily installable packages for debian/ubuntu/fedora around (even if they're not stock provided).
Debian unstable is still on yasm 0.8, fwiw.  1.0 was released April 8.
(In reply to comment #41)
> On 32-bit, yasm .8 (which is what Debian is shipping) hits an infinite loop. 
> yasm 1.0 seems to work.

Can we work around the infinite loop?
(In reply to comment #45)
> Can we work around the infinite loop?

I think we can.  jccol{mmx,ss2}.asm includes jcclr{mmx,ss2}.asm multiple times, and I think this is what's causing the hang.  We should be able to split up jccol into separate files, one for each include of jcclr.

Do the libjpeg-turbo guys have any thoughts on whether or not this would be a good idea?  I'd prefer to diverge from upstream as little as possible, of course.
I don't have a problem with making those changes to the libjpeg-turbo source, but prototype it first on your end and make sure it fixes the problem.
I successfully got jccol{mmx,ss2} to assemble, but now yasm is hanging in jcsammmx.

I think yasm 0.8 may be a game of whack-a-mole we don't want to play.
(I should add: The reason for the hang in jcsammmx is less obvious than in the other files.  I tried collapsing the includes, which aren't as tricky as the jccol includes, to no avail.)
(In reply to comment #48)
> I think yasm 0.8 may be a game of whack-a-mole we don't want to play.

Yup.  Add a dependency to yasm 1.0.  Worst case we build a static binary of it and include it as part of our build setup for linux.
Macports is shipping yasm 1.0.
Just got this working on OSX 64.  Seeing a speedup similar to what was observed on Linux 64:

Original:

  3.8% SpinBriefly()
  7.8% jpeg_idct_islow()

libjpeg-turbo:

  5.4% SpinBriefly()
  4.2% decoding

Speedup: (7.8/3.8) / (4.2/5.4) = 2.6.
yasm 1.0 on OSX 10.6 64-bit seems to be emitting incorrect 32-bit code.  At least, it works when I use nasm, but the yasm build decodes all jpegs as garbage.

The nasm flags I've used are:

  -fmacho -DMACHO -DPIC

and the yasm flags I've tried are:

  -fmacho -DMACHO -DPIC -rnasm -pnasm

and

  -fmacho32 -DMACHO -DPIC -rnasm -pnasm

I'm looking into this, but I have basically no idea why this is happening.  Any ideas would be appreciated...
(In reply to comment #53)
> I'm looking into this, but I have basically no idea why this is happening.  Any
> ideas would be appreciated...

Do you know what the difference is in the generated code?
(In reply to comment #54)
> Do you know what the difference is in the generated code?

Well...here's *a* difference I've been looking at, in _jsimd_rgb_ycc_convert_sse2 (jccolss2.asm):

nasm:
  leal 0x00001bf3(%ecx,%ebp,8),%ebx

yasm:
  leal 0xffffffd3(%ecx,%ebp,8),%ebx

Here's more context:

..@12.adjust:
  pushl   %ebp
  xorl    %ebp,%ebp
  leal    0x00001bf3(%ecx,%ebp,8),%ebx    <---
..@12.ref:
  popl    %ebp
  movl    %ebx,0xffffff7c(%ebp)
  movl    0x08(%eax),%ecx
  testl   %ecx,%ecx
  je      0x00000412

Just the leal line is different between the two assemblers.

Here's the preprocessed source which is generating this.  It's the same for both yasm and nasm.

..@26.adjust:
  push ebp
  xor ebp,ebp
  db 0x8D,0x9C         <----
  jmp near const_base  <----
..@26.ref:
  pop ebp
  mov dword [ebp-(8-(0))*16-4],ebx
  mov ecx, dword[(eax)+8]
  test ecx,ecx
  jz near .return

Before we try and understand how a db plus a jmp combined to form a leal, let's look at the unprocessed source, from jcclrss2.asm:

  get_GOT ebx                     ; get GOT address
  movpic  POINTER [gotptr], ebx   ; save GOT address
  mov     ecx, JDIMENSION [img_width(eax)]
  test    ecx,ecx
  jz      near .return

get_GOT is a macro defined in jsimdext.inc.  Here's the whole thing:

  ; At present, nasm doesn't seem to support PIC generation for Mach-O. 
  ; The PIC support code below is a little tricky. 

          SECTION SEG_CONST 
  const_base: 

  %define GOTOFF(got,sym) (got) + (sym) - const_base 

  %imacro get_GOT 1 
          ; NOTE: this macro destroys ecx resister. 
          call    %%geteip 
          add     ecx, byte (%%ref - $) 
          jmp     short %%adjust 
  %%geteip: 
          mov     ecx, POINTER [esp] 
          ret 
  %%adjust: 
          push    ebp 
          xor     ebp,ebp         ; ebp = 0 
  %ifidni %1,ebx  ; (%1 == ebx) 
          ; db 0x8D,0x9C + jmp near const_base = 
          ;   lea ebx, [ecx+ebp*8+(const_base-%%ref)] ; 8D,9C,E9,(offset32) 
          db      0x8D,0x9C               ; 8D,9C 
          jmp     near const_base         ; E9,(const_base-%%ref) 
  %%ref: 
  %else  ; (%1 != ebx) 
          ; db 0x8D,0x8C + jmp near const_base = 
          ;   lea ecx, [ecx+ebp*8+(const_base-%%ref)] ; 8D,8C,E9,(offset32) 
          db      0x8D,0x8C               ; 8D,8C                       <----
          jmp     near const_base         ; E9,(const_base-%%ref)       <----
  %%ref:  mov     %1, ecx 
  %endif ; (%1 == ebx) 
          pop     ebp 
  %endmacro 

It appears that we're taking the %1 == ebx branch in the macro above (compare the arguments to db).

The apparent key point here is that the db / jmp lines marked with arrows become a leal instruction.  Why?  I dunno.  But perhaps this is what yasm is interpreting differently than nasm.  I'll keep looking after lunch.
I put an e-mail into the yasm-devel list about this.
(In reply to comment #29)
> Can you get it building with YASM? YASM is already a build requirement
> everywhere except Win32. It claims to support nasm-compatible syntax.

YASM won't work on Win32. YASM doesn't output object files which can link with /SAFESEH, so we can't link YASM object files into xul.dll. For libvpx we worked around this by converting the YASM assembly to MASM syntax using http://mxr.mozilla.org/mozilla-central/source/media/libvpx/yasm2masm-as.sh 

Ideally we'd convince the YASM developers to fix their win32 safeseh section generation, or submit a patch which fixes that ourselves upstream.
Given that the code barely works under yasm, I'm not hopeful that we'll be able to easily translate to masm, unless the syntax for macros and whatnot is very close.  But it's definitely worth looking at.

For now, I'm just going to use nasm on OSX-32.  I think a new-enough version of nasm ships on OSX by default.  (My understanding is that the version of nasm that ships isn't new enough to build libjpeg-turbo on 64-bit, but there, yasm works fine.)  It would be really nice to get yasm to work, though.
ted, bsmedberg, cpearce: Do you have any advice as to how to proceed here wrt Windows builds?  It looks like the options are:

* Fix yasm so it works with /SAFESEH or otherwise work around this issue
* Ship nasm in Windows build package (assuming that some version of nasm works on Windows)
* Translate the libjpeg-turbo code to a MASM-compatible syntax
* Ship libjpeg-turbo on Linux/Mac and use our current libjpeg implementation on Windows
I'm actually leaning towards the first option since we're going to be distributing yasm ourselves anyway...
Fixing yasm certainly sounds to me like the Right Thing to do.  It looks like there's some support for SAFESEH [1] [2], but perhaps it's insufficient [3].

I don't have a good grasp of what the issue is even after doing some reading, so I'm not sure I'm the right person to try and hack the assembler.  But if nobody has time to hack it themselves, I can try, provided someone can sit down with me and explain what's going on.

[1] http://www.tortall.net/projects/yasm/ticket/130
[2] http://www.tortall.net/projects/yasm/changeset/2032
[3] http://www.tortall.net/projects/yasm/ticket/139
cpearce probably has the best idea at the moment
In response to the second option (shipping NASM on Windows), NASM does work fine to build libjpeg-turbo on Windows, but I don't know whether it also suffers from the /SAFESEH problem.  As with other platforms, you'll need NASM 2.05 or later to build the 64-bit version.

In response to the third option, you'd be responsible for maintaining any such translation, since it is of no use to the upstream project.
(In reply to comment #59)
> * Fix yasm so it works with /SAFESEH or otherwise work around this issue

Ideally we should fix YASM, as we need a working YASM for libvpx as well.

YASM has partial support for SafeSeh, but it doesn't go far enough. YASM ticket 139 seems to diagnose the problem. I don't know much about object formats or assemblers, so I'd need to get up to speed before I could be much help fixing YASM. Maybe an email to the YASM mailing list would be helpful?
Looks like the yasm devs fixed the OSX-32 bug in their trunk [1].  I'll spin a build and see for myself on Monday.

[1] http://www.tortall.net/projects/yasm/changeset/2345
The SAFESEH issue may also have been fixed this weekend.  Hooray!

http://www.tortall.net/projects/yasm/changeset/2343
FWIW, I agree with roc. We want to use YASM to build libvpx, so we should fix YASM and require it on Windows. bug 570477 is on file for adding it to MozillaBuild. If we can get a working version (even a nightly build or equivalent) I'd be happy to include it in MozillaBuild. I'd even spin a 1.5.1 release just for that, as long as you can provide me the binaries.
Success on Mac32!  Speedup of jpeg_idct_islow is 

(8.2/3.2) / (2.8/4.7) = 4.3

I'm not so good at reading Shark traces, but it looks like ycc_rgb_convert is also much faster:

  (8.3/3.2) / (1.3/4.7) = 9.3

But maybe the trace is somehow under-reporting for libjpeg-turbo.  In any case, I think it's faster.  :)
Depends on: 583924
I managed to get xperf running on Windows, but khuey and I can't figure out how to get it to load symbols from my build.  The article at [1] seems to describe only the settings for retrieving symbols from our symbol server.

I'm going to see if I have more luck with a different profiler (jst suggested eqtime), but if anyone has suggestions for making this work with xperf, I'm all ears.

[1] https://developer.mozilla.org/en/Profiling_with_Xperf
jrmuizel to the rescue.  I have symbols on Windows.
Feel free to poke me if you have problems with xperf, but for symbols, note the bits in that article about making sure you have the latest xperf and the "For a local firefox build" line -- or did that not work?
Benchmarks for Win32.

Methodology:

xperf -on latency -stackwalk profile
dist/bin/firefox -tp manifest    (same manifest as before)
xperf -d out.etl

xperf outputs the raw sample counts, which is kind of nice.   There's some variance in these (I've observed differences of 10-20% between runs), but it shouldn't be a big deal.

Unless I'm incorrectly counting some functions as belonging to libjpeg (please correct me if I am!), the speedup with libjpeg-turbo is 2.7x.

libjpeg-turbo:

  qcms_transform: 89 samples
  SpinBriefly: 72
  decode_mcu_slow: 50
  jsimd_idct_islow_sse2: 30
  jpeg_fill_bit_buffer: 19
  nsJPEGDecoder::OutputScanlines: 15
  decompress_onepass: 12
  fetch_pixel_x8r8g8b8: 9
  jsimd_ycc_rgb_convert_sse2: 8

  JPEG:
    jsimd_idct_islow_sse2: 30
    jsimd_ycc_rgb_convert_sse2: 8
  TOTAL: 38
  
original:

  qcms_transform: 65 samples
  SpinBriefly: 65
  decode_mcu: 62
  ycc_rgb_convert: 47
  nsJPEGDecoder::OutputScanlines: 19
  dct_8x8_inv_16s: 18
  jpeg_fill_bit_buffer: 17
  bits_image_fetch_transformed: 16
  jpeg_idct_islow_sse2: 9
  decompress_onepass: 7
  ownpj_QuantInv_8x8_16s: 7
  ownpj_Add128_x8x_16s8u: 6
  fetch_pixel_x8r8g8b8: 6
  ...
  ippiDCTQuantInv8x8LS_JPEG_16s8u_C1R: 2

  JPEG:
    ycc_rgb_convert: 47
    dct_8x8_inv_16s: 18
    bits_image_fetch_transformed: 16
    jpeg_idct_islow_sse2: 9
    ownpj_QuantInv_8x8_16s: 7
    ownpj_Add128_x8x_16s8u: 6
    ...
    ippiDCTQuantInv8x8LS_JPEG_16s8u_C1R: 2
  TOTAL: 103

103/38 = 2.7
Blocks: 584652
The full patch weighs in at 3.8MB, which is too large for Bugzilla.  I've therefore split it into three parts: Removing libjpeg, adding libjpeg-turbo, and build changes.

There's little that's interesting in this patch; just a lot of - signs.

In the interests of full disclosure: I haven't tested this particular patch on all platforms.  It's working locally, and I'll of course push to try once the assembler is updated on the remote boxes.

I think we probably want configure to try to assemble the optimized code only if it finds a sufficiently new version of yasm -- otherwise, people with outdated versions are going to be confused by the hangs and incorrect code generation.  I have an XXX in configure.in for this; I'm not really sure how to do it.
Unfortunately this part is still too big for bugzilla.  Sorry about gzipping it.

Interesting bits here are probably just modules/libjpeg-turbo/Makefile.in and modules/libjpeg-turbo/simd/Makefile.in.
Fwiw, if we can create three self-contained patches in this order: add libjpeg-turbo, build changes, remove libjpeg and push them like that, that would be better than the order listed in comment 73, right?  And may be better than pushing a single mega-patch, though it's hard to tell.
Attachment #463214 - Flags: review?(ted.mielczarek)
Attachment #463217 - Flags: review?(ted.mielczarek)
Attachment #463225 - Flags: review?(ted.mielczarek)
blocking2.0: - → ?
Is there a good reason to not put libjpeg-turbo on top of libjpeg? Personally I'd prefer that and it should make the patch smaller.
I thought that it was kind of weird that jpeg got its own top-level directory, which is why I moved it to modules.

But I could move jpeg to modules/libjpeg-turbo and then put the libjpeg-turbo source on top of it, if we wanted.  That would probably reduce the patch size...
(In reply to comment #78)
> I thought that it was kind of weird that jpeg got its own top-level directory,
> which is why I moved it to modules.
> 
> But I could move jpeg to modules/libjpeg-turbo and then put the libjpeg-turbo
> source on top of it, if we wanted.  That would probably reduce the patch
> size...

I would suggest doing the reverse. Patch libjpeg-turbo on top of jpeg in the current location, and then we can move it to a saner place in a different bug.
brendan is king of repo locations, but my understanding is that jpeg-at-top-level is a historical accident, and it could easily be somewhere more sane. (It's just that when we were on CVS, moving it was a big pain.)

Gerv
You should get an imglib peer to r+ the add/remove bits. I can review the build system changes.
Attachment #463214 - Flags: review?(ted.mielczarek) → review?(bobbyholley+bmo)
Attachment #463217 - Flags: review?(ted.mielczarek) → review?(bobbyholley+bmo)
Have you run reftests on this patch?
I'll run them now.
(In reply to comment #77)
> Is there a good reason to not put libjpeg-turbo on top of libjpeg? Personally
> I'd prefer that and it should make the patch smaller.

In my opinion, it depends on whether the patch is useful or not. If the delta is reasonable enough to make it useable in the case of "oh hey, something broke when we upgraded to libjpeg-turbo", then we should patch it. But if it's not a useful diff, I think we should first remove it and then add it.

At the moment I'd probably prefer to put this in modules/libimg/jpeg, even if we eventually want to get rid of libimg (so that at least in the mean time, everything's in the same place). Joe might disagree though.

Given that this review is going to be more or less a rubber stamp, I'd like to at least know that it passes all our tests on all platforms first. Specifically, I'm concerned about our jpeg reftests, little-endian architectures like ARM, and architectures where we're not running the asm you're testing with. As such, I'm going to cancel review until we've got all that sorted out.

How much code changed? Do we need a security audit for this? (Not like the one we got for LCMS did us any good...grumble grumble). Dveditz, can you weigh in?
Attachment #463214 - Flags: review?(bobbyholley+bmo)
Attachment #463217 - Flags: review?(bobbyholley+bmo)
(In reply to comment #84)
> Given that this review is going to be more or less a rubber stamp, I'd like to
> at least know that it passes all our tests on all platforms first.

I think that's fair.  Releng is working on getting the requisite yasm version out to the build servers, and once it's out there, I'll push to try and see what happens.
You mean for tryservers? I don't think we can push this to mozilla-central until there's a mozillabuild with the version of YASM we want people to use.
To the question of how much has changed, the answer is: A lot, but not everything.  :)

The vast majority of changes are in the simd folder, which contains 22000 lines of assembly.  We're of course not using all of that.

We could slim down the libjpeg-turbo patch a bit by removing files we don't need, but I'm not sure it's worth messing with.
(In reply to comment #86)
> You mean for tryservers? I don't think we can push this to mozilla-central
> until there's a mozillabuild with the version of YASM we want people to use.

Yes, I think we're blocking on bug 570477.
Depends on: 570477
OK, I think we should have separate remove/add patches. A 2.4 meg jpeg library is bigger than we'd like it to be (the current one is 1.5 megs), but it's not quite worth making it harder to apply upstream patches by yanking things out.
(In reply to comment #87)
> Created attachment 463272 [details] [diff] [review]
> Patch v1 - Patching libjpeg-turbo on top of libjpeg
> 
> To the question of how much has changed, the answer is: A lot, but not
> everything.  :)

The diff doesn't look too unwieldy. I can review this if you like.
(In reply to comment #90)

> The diff doesn't look too unwieldy. I can review this if you like.

If you care enough that you're willing to give it a non-rubber-stamp review, we can do it your way. ;-)
(In reply to comment #91)
> (In reply to comment #90)
> 
> > The diff doesn't look too unwieldy. I can review this if you like.
> 
> If you care enough that you're willing to give it a non-rubber-stamp review, we
> can do it your way. ;-)

I think we should actually require it. libjpeg-turbo hasn't had nearly the same security exposure as libjpeg so we should at least go through to look for areas of increased risk.
My understanding is that we don't want to add new stuff to modules/.

For video/audio we created a new toplevel directory media/ to contain third-party libraries (media/libogg, media/libvpx, etc). For gfx we've put third-party libraries under gfx/ (gfx/cairo, gfx/harfbuzz, gfx/ycbcr, etc). I think this policy is working well.

So I suggest creating a new toplevel directory for image codecs, say images/. Move libpr0n and libpng there as well.
This is a bit bikeshed-dy, but the fewer toplevel dirs we have the better; jpeg, png, etc. would fit in fine under 'media', I think.
(In reply to comment #93)
> My understanding is that we don't want to add new stuff to modules/.
> 
> For video/audio we created a new toplevel directory media/ to contain
> third-party libraries (media/libogg, media/libvpx, etc). For gfx we've put
> third-party libraries under gfx/ (gfx/cairo, gfx/harfbuzz, gfx/ycbcr, etc). I
> think this policy is working well.
> 
> So I suggest creating a new toplevel directory for image codecs, say images/.
> Move libpr0n and libpng there as well.

I think we're all on the same page here, but I don't think it's fair to make this patch implement that change. We've already got libpng in modules/libimg, and so I think it makes sense to keep libjpeg with libpng until we figure out the right place for both of them. This is something I plan on getting to "soonish" (ie, after the branch).
agreed on media/ for image decoders
Don't move existing code before hg (including hg web) can handle it properly, please.
I've filed bug 584894 for moving the jpeg code.
(In reply to comment #96)
> agreed on media/ for image decoders

OK. This also might turn out to be handy if we ever start using a video codec for decoding static images as well (I've heard claims that VP8 can compress better than JPEG for example.)

(In reply to comment #95)
> I think we're all on the same page here, but I don't think it's fair to make
> this patch implement that change.

Sure.

(In reply to comment #97)
> Don't move existing code before hg (including hg web) can handle it properly,
> please.

Do you know if that's being worked on?
> (In reply to comment #97)
> > Don't move existing code before hg (including hg web) can handle it properly,
> > please.
> 
> Do you know if that's being worked on?
http://mercurial.selenic.com/bts/issue1576 has been open for awhile.
(In reply to comment #93)
> So I suggest creating a new toplevel directory for image codecs, say images/.
> Move libpr0n and libpng there as well.

Sounds good to me (in case anyone cares :-P).

/be
Or under media -- comment 99 first para reply makes a good point.

Cc'ing djc about the Hg issue.

/be
I'd prefer we put libjpeg-turbo in jpeg/ initially, and then move it later.
We're failing some of the jpeg reftests tests on Linux.

In all but one case, I can't distinguish the libjpeg-turbo image from the reference.  The one exception is jpg-cmyk-2.jpg, where the libjpeg-turbo image looks a bit sharper than the reference.  I can put the rendered image up here if anyone is curious.

How should we deal with these reftest failures?  Do we just declare libjpeg-turbo to be the new gold standard of jpeg decoding and change the references, or do we investigate what's causing the differences?
We definitely need to investigate what is causing these reftest failures. If there are bugs, they are likely to be very subtle, and so I'm afraid it'll be some hard slogging.

Start by finding out the magnitude of the differences. (1 pixel value - say r=125 to r=126 - can often be ignored as rounding differences, though not without determining where those rounding differences came from.) Use a screen magnifier that lets you view the pixel values under the cursor, or use something like Gimp that lets you sample any pixel on the screen.

Also, the reftest analyzer has code that'll circle the differences between two images. That might be helpful!

Finally, this doesn't need to block. I'll take a patch (on a relatively short leash), but I'd release without libjpeg-turbo.
blocking2.0: ? → -
smaug, bsmedberg: issue1576 is the only remaining issue for moving in hg, right? I'll try to conjure up some round tuits to get that into 1.6.3.
Comment on attachment 463225 [details] [diff] [review]
Patch v1 - Part 3: Changes to the build

Canceling build review, since I need to at least move some things around.

I'll still need some help figuring out how to get configure to reject yasm binaries older than whatever version we require (likely to be 1.0.2).
Attachment #463225 - Flags: review?(ted.mielczarek)
Interesting.  There's actually a substantial difference in how libjpeg-turbo and libjpeg render this test image.
Attachment #463214 - Attachment is obsolete: true
Attachment #463217 - Attachment is obsolete: true
Attachment #463225 - Attachment is obsolete: true
Attachment #463272 - Attachment is obsolete: true
Attachment #458811 - Attachment is obsolete: true
Attached patch Method dispatch fix (obsolete) — Splinter Review
Ah.  The rendering difference is due to what appears to be a typo in libjpeg-turbo.

Updated patch now passes the jpeg reftests locally.  I'll roll up a patch with this fix and submit this patch upstream in a sec.
Attachment #463677 - Attachment is obsolete: true
I've tried to build FF/TB with this patch, but it failed because of a missing yasm (yasm: no such file or directory). After installing yasm via MacPorts it now builds just fine. So, does this mean that, with this patch, yasm becomes a new dependency for building mozilla applications? Or will yasm be included in the mozilla source?

On Mac OS X, JPEGs are OK with this patch (quick visual test) if you build FF in x86_64. But if I build FF/TB in i386, then all JPEGs are misrendered.
http://img291.imageshack.us/img291/3319/i386q.jpg

I've applied patch v2.1 and built with the 10.6 SDK.
(In reply to comment #112)
> After installing yasm via MacPorts it now builds just fine. So, does this mean, with this patch, yasm gets a new dependency to build mozilla applications?

Please see comment #50.

> On Mac OS X JPEGs are OK with this patch (quick visual test) if you build FF in x86_64. But if I build FF/TB in i386, than all JPEGs are misrendered.
> http://img291.imageshack.us/img291/3319/i386q.jpg

See comment #65.
Oh, seems I overlooked that...


> > On Mac OS X JPEGs are OK with this patch (quick visual test) if you build FF in x86_64. But if I build FF/TB in i386, than all JPEGs are misrendered.
> > http://img291.imageshack.us/img291/3319/i386q.jpg
> 
> See comment #65.
Thanks, this fixed it. :-)
Right now you can build FF to use the system libjpeg.  I wonder if we'll want to disable this with this patch.  libjpeg-turbo has a number of features we can use to make other code faster (see e.g. bug 584652, maybe bug 571739).  We could write code to conditionally call into libjpeg-turbo only if it's enabled, but since we won't be testing with libjpeg, it's probably safer just to assume we have libjpeg-turbo.
Attachment #463675 - Attachment is obsolete: true
Attached patch Patch v1.3 (obsolete) — Splinter Review
Updated to tip of libjpeg-turbo svn, which includes the method dispatch fix.
Attachment #463692 - Attachment is obsolete: true
Attachment #463695 - Attachment is obsolete: true
Comment on attachment 464561 [details] [diff] [review]
Patch v1.3

Tagging jrmuizel for review.  Jeff, how do you think we should proceed here?
Attachment #464561 - Flags: review?(jmuizelaar)
Comment on attachment 464561 [details] [diff] [review]
Patch v1.3

I'm probably not going to have time to get to this review that quickly as this is lower priority than some of the other work I'm doing. One comment I do have so far is that you should fix up the licensing documentation to match the code that we're including. i.e. the LGPL and WXWindows licenses should not be needed.
Blocks: 586320
I can remove LICENSE.txt and LGPL.txt.  The only other place I see wxWindows and (L)GPL referenced is in README-turbo, but I'm not sure it's worth patching README-turbo to remove the references to those licenses.

Do you agree?
I should really clarify the language in README-turbo anyhow. What I mean for that file to say is that the overall distribution of LJT is covered by wxWindows if the distribution uses the accelerated Huffman codec, but not otherwise.
Oops... that's the reason why it works.
Now that bug 563088 has landed, I think this bug is more important. Changing back to a tab that hasn't been active for a while, I can visibly see images being re-decoded. If the speed gains here really are significant then this will help.
jlebar - can you throw together a try build with libjpeg-turbo based on trunk? We'd like to see how noticeable the post-discarding redecode speedup is tabbing back to:

http://flavors.me/johnath
and
http://demos.hacks.mozilla.org/openweb/CSSMAKESUSICK/

after waiting a period longer than image.mem.discarding_timeout_ms * 2 milliseconds.

If it's significant enough, we may re-evaluate the blocking status of libjpeg-turbo.
I just pushed to try, but then I remembered that the build machines are still running the old, broken version of yasm.  I believe we'll get working Linux and Mac 64-bit builds, and maybe we'll get a working Linux 32-bit build.

It might just be easier for you to pull the patch, install yasm 1.1, and build it yourself.
My push to try is yielding debug, not opt, builds for some reason.  I'll try and get this sorted out...
Builds are up: http://ftp.mozilla.org/pub/mozilla.org/firefox/tryserver-builds/jlebar@mozilla.com-53de5036b768/

The Mac 32-bit build is busted -- it'll run but decode images as garbage.  The Linux 32- and 64-bit and the Mac 64-bit builds should all work fine.
Attached patch Patch v1.3.1 (obsolete) — Splinter Review
Fixed typo in configure.in.
Attachment #464561 - Attachment is obsolete: true
Attachment #471713 - Flags: review?(jmuizelaar)
Attachment #464561 - Flags: review?(jmuizelaar)
Nice patch; this also speeds up displaying jpeg-picture attachments in Thunderbird, especially for small files. I now see them immediately. For bigger attachments it's nearly the same. (I've ported the config changes from this patch to c-c for my build.)
FYI, v1.0.1 was released on September 9th.

http://sourceforge.net/projects/libjpeg-turbo/files/
Jeff, ping me before you review this and I'll update to whatever is libjpeg-turbo's tip of trunk at that time.
Version: unspecified → Trunk
Quick note-- the libjpeg-turbo subversion trunk is used for the latest & greatest (and not necessarily stable) code.  I would suggest continuing to base your in-tree copy of LJT on 1.0.1.  I will create a 1.0.x stable branch as soon as it is needed (i.e. as soon as any bug fixes come in for LJT 1.0.1.)
Is this going to make it into FF4? If so, can it be marked as blocking final at least? Don't mean to rush, of course. Just wish to set my expectations appropriately. ;-)
This bug is *not* blocking, as the flag indicates. And at this point, I really don't think we should take it. We actually need to ship at some point.
Does anyone foresee that the libjpeg-turbo code will be added to FF 4.1 as a feature in the future?
(In reply to comment #138)
> Does anyone foresee that the libjpeg-turbo code will be added to FF 4.1 as a
> feature in the future?

I think that's when it's likely to land.
Depends on: post2.0
This isn't yet reviewed, and before it gets reviewed, it needs to be updated to the latest version of the library.  It looks like the post2.0 tracking bug tracks bugs which are ready to land, so I'm removing that dependency.
No longer depends on: post2.0
FYI:

Stable branch for libjpeg-turbo 1.0.x has been created:

https://libjpeg-turbo.svn.sourceforge.net/svnroot/libjpeg-turbo/branches/1.0.x

This is probably what you want to keep your in-tree copy synced with for now, unless you need any of the features in 1.1.
The bug for integrating libjpeg-turbo into Google Chrome already has status Verified Fixed...

http://code.google.com/p/chromium/issues/detail?id=48789
Maybe a different reviewer should be assigned to possibly get this in for Fx5.0? 

According to the draft plan for the new development process, the 5.0 branch could be cut early (3 weeks).
(In reply to comment #144)
> Maybe a different reviewer should be assigned to possibly get this in for
> Fx5.0? 

Jeff, you reviewing?

/be
(In reply to comment #145)
> (In reply to comment #144)
> > Maybe a different reviewer should be assigned to possibly get this in for
> > Fx5.0? 
> 
> Jeff, you reviewing?
> 
> /be

I've completed a large chunk of the review. I haven't had time to get to the rest of it yet. I am hoping to get to it for 5.

-Jeff
I'll be starting full-time on 3/28, so I'll be able to update the patch and whatnot then.
Comment on attachment 471713 [details] [diff] [review]
Patch v1.3.1

I did not review the simd code as that's more work and lower risk because it's mostly just optimized versions of existing functions. I did go through the rest of the stuff, hopefully I was thorough enough (there was a lot of change).

A few of these comments are on Mozilla changes, the others are on the upstream code. I don't consider it critical that the upstream stuff be addressed before landing.

>diff --git a/jpeg/MOZCHANGES b/jpeg/MOZCHANGES
>--- a/jpeg/MOZCHANGES
>+++ b/jpeg/MOZCHANGES
>@@ -1,10 +1,42 @@
>+    jchuff.c, jdhuff.c, jdhuff.h
>+  since the versions of these files in libjpeg-turbo are also under the
>+  wxWindows license.  (It would have been nicer to revert them to the new
>+  libjepg-8b code, but that doesn't easily integrate with libjepg-turbo.)
s/jepg/jpeg/ (both of them)

>diff --git a/jpeg/jpegint.h b/jpeg/jpegint.h
>--- a/jpeg/jpegint.h
>+++ b/jpeg/jpegint.h
>@@ -364,17 +364,17 @@ EXTERN(void) jinit_color_deconverter JPP
> /* Utility routines in jutils.c */
> EXTERN(long) jdiv_round_up JPP((long a, long b));
>-EXTERN(long) jround_up JPP((long a, long b));
>+EXTERN(size_t) jround_up JPP((size_t a, size_t b));

This change of type here is sort of interesting. It is extended to 64 bits on
Windows and changed to unsigned. 
The checkin comment for the change was "Bleepin' Windows uses LLP64, not LP64",
however it seems like the only code that needs the new size is code added on top of jpeg-6b.
(jmemmgr.c and huffman encoder/decoder not included in this patch)

This change removes the type parallelism between jround_up and its sister function jdiv_round_up,
it's also sort of an abuse of the size_t type because the function is used for more than just
sizes of objects. Some callers haven't been updated and still do jround_up((long) val)

I'd suggest using a less general align-to-power-of-two function in jmemmgr.c; I didn't investigate
why the huffman code needs to round 64-bit values.
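
A less general helper along those lines (just a sketch of the suggestion, not what jmemmgr.c actually does) could look like:

  /* Round `value` up to the next multiple of `alignment`, which must be a
   * power of two.  Unlike jround_up(), this is only meant for object sizes,
   * so size_t is the natural type. */
  static size_t
  align_up_pow2 (size_t value, size_t alignment)
  {
    return (value + alignment - 1) & ~(alignment - 1);
  }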

>diff --git a/jpeg/jchuff.c b/jpeg/jchuff.c
>--- a/jpeg/jchuff.c
>+++ b/jpeg/jchuff.c
>@@ -14,16 +14,19 @@
> #include "jchuff.h"		/* Declarations shared with jcphuff.c */
> 
>+/* MOZILLA CHANGE: libjpeg-turbo doesn't define INLINE in its config file, so
>+ * we define it here. */
>+#define INLINE

This doesn't seem like a very good definition of INLINE; using
the one removed by this patch seems like a better idea.
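
For reference, a typical compiler-conditional definition looks something like this (a sketch; the exact definition removed by the patch may differ):

  /* Map INLINE onto whatever inline keyword the compiler provides,
   * falling back to nothing for compilers without one. */
  #ifndef INLINE
  #if defined(__GNUC__)
  #define INLINE __inline__
  #elif defined(_MSC_VER)
  #define INLINE __inline
  #else
  #define INLINE
  #endif
  #endif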

>diff --git a/jpeg/jcphuff.c b/jpeg/jcphuff.c
>--- a/jpeg/jcphuff.c
>+++ b/jpeg/jcphuff.c
>@@ -218,17 +218,16 @@ dump_buffer (phuff_entropy_ptr entropy)
> 
>-INLINE
> LOCAL(void)
> emit_bits (phuff_entropy_ptr entropy, unsigned int code, int size)
> /* Emit some bits, unless we are in gather mode */
>@@ -271,17 +270,16 @@ flush_bits (phuff_entropy_ptr entropy)
> 
>-INLINE
> LOCAL(void)
> emit_symbol (phuff_entropy_ptr entropy, int tbl_no, int symbol)

I'm not sure why INLINE needs to go...

>diff --git a/jpeg/jpeglib.h b/jpeg/jpeglib.h
>--- a/jpeg/jpeglib.h
>+++ b/jpeg/jpeglib.h
>@@ -1,58 +1,41 @@
> /*
>  * jpeglib.h
>  *
>  * Copyright (C) 1991-1998, Thomas G. Lane.
>+ * Copyright (C) 2009, D. R. Commander.
>  * This file is part of the Independent JPEG Group's software.
>  * For conditions of distribution and use, see the accompanying README file.
>  *
>  * This file defines the application interface for the JPEG library.
>  * Most applications using the library need only include this file,
>  * and perhaps jerror.h if they want to know the exact error codes.
>  */
> 
> #ifndef JPEGLIB_H
> #define JPEGLIB_H
> 
>-#ifdef XP_OS2
>-/*
>- * On OS/2, the system will have defined RGB_* so we #undef 'em to avoid warnings
>- * from jmorecfg.h.
>- */
>-#ifdef RGB_RED
>-	#undef RGB_RED
>-#endif
>-#ifdef RGB_GREEN
>-	#undef RGB_GREEN
>-#endif
>-#ifdef RGB_BLUE
>-	#undef RGB_BLUE
>-#endif
>-

This is probably needed for OS/2 but I'd rather it be fixed upstream.

>diff --git a/jpeg/jcdctmgr.c b/jpeg/jcdctmgr.c
>--- a/jpeg/jcdctmgr.c
>+++ b/jpeg/jcdctmgr.c
> /*
>+ * Find the highest bit in an integer through binary search.
>+ */
>+LOCAL(int)
>+flss (UINT16 val)
>+{

It would be good to use compiler intrinsics like BitScanReverse and __builtin_clz here when they're available.
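
A sketch of what that could look like, assuming flss() returns the 1-based index of the highest set bit (with flss(0) == 0):

  #if defined(_MSC_VER)
  #include <intrin.h>
  #endif

  /* Return the 1-based index of the highest set bit (0 if val == 0),
   * using compiler intrinsics instead of the binary search. */
  static int flss_intrinsic(unsigned short val)
  {
  #if defined(__GNUC__) || defined(__clang__)
    return val ? 32 - __builtin_clz(val) : 0;
  #elif defined(_MSC_VER)
    unsigned long index;
    return _BitScanReverse(&index, val) ? (int)index + 1 : 0;
  #else
    int bits = 0;
    while (val) { bits++; val >>= 1; }
    return bits;
  #endif
  }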

>diff --git a/jpeg/jccolor.c b/jpeg/jccolor.c
>--- a/jpeg/jccolor.c
>+++ b/jpeg/jccolor.c
>+  24, 24, 24, 24, 24, 24, 24, 25, 25, 25, 25, 25, 25, 25, 25, 25,
>+  26, 26, 26, 26, 26, 26, 26, 26, 26, 27, 27, 27, 27, 27, 27, 27,
>+  27, 27, 28, 28, 28, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, 29
>+};

It would be nice to know where these tables come from.

>@@ -141,20 +212,20 @@ rgb_ycc_convert (j_compress_ptr cinfo,
> 
>   while (--num_rows >= 0) {
>     inptr = *input_buf++;
>     outptr0 = output_buf[0][output_row];
>     outptr1 = output_buf[1][output_row];
>     outptr2 = output_buf[2][output_row];
>     output_row++;
>     for (col = 0; col < num_cols; col++) {
>-      r = GETJSAMPLE(inptr[RGB_RED]);
>-      g = GETJSAMPLE(inptr[RGB_GREEN]);
>-      b = GETJSAMPLE(inptr[RGB_BLUE]);
>-      inptr += RGB_PIXELSIZE;
>+      r = GETJSAMPLE(inptr[rgb_red[cinfo->in_color_space]]);
>+      g = GETJSAMPLE(inptr[rgb_green[cinfo->in_color_space]]);
>+      b = GETJSAMPLE(inptr[rgb_blue[cinfo->in_color_space]]);
>+      inptr += rgb_pixelsize[cinfo->in_color_space];

If the compiler can't lift the rgb_* lookups out of the loop, it may be worth doing it by hand.
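
For example, the same loop with the lookups hoisted by hand (a fragment of the hunk above, not standalone code; the local names are made up):

  int rindex = rgb_red[cinfo->in_color_space];
  int gindex = rgb_green[cinfo->in_color_space];
  int bindex = rgb_blue[cinfo->in_color_space];
  int pixelsize = rgb_pixelsize[cinfo->in_color_space];

  for (col = 0; col < num_cols; col++) {
    r = GETJSAMPLE(inptr[rindex]);
    g = GETJSAMPLE(inptr[gindex]);
    b = GETJSAMPLE(inptr[bindex]);
    inptr += pixelsize;
    /* ... Y/Cb/Cr table lookups unchanged ... */
  }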

>@@ -183,36 +254,46 @@ rgb_ycc_convert (j_compress_ptr cinfo,
>-      r = GETJSAMPLE(inptr[RGB_RED]);
>-      g = GETJSAMPLE(inptr[RGB_GREEN]);
>-      b = GETJSAMPLE(inptr[RGB_BLUE]);
>-      inptr += RGB_PIXELSIZE;
>+    for (; outptr < maxoutptr; outptr++, inptr += rgbstride) {
>       /* Y */
>-      outptr[col] = (JSAMPLE)
>-		((ctab[r+R_Y_OFF] + ctab[g+G_Y_OFF] + ctab[b+B_Y_OFF])
>-		 >> SCALEBITS);
>+      #if BITS_IN_JSAMPLE == 8
>+      *outptr = red_lut[inptr[rindex]] + green_lut[inptr[gindex]]
>+	    + blue_lut[inptr[bindex]];

It's not clear what the advantage of this is...
Attachment #471713 - Flags: review?(jmuizelaar) → review+
I strongly suggest updating libjpeg-turbo from 1.0.0 to 1.1.0 before spending time on a complete review. That would be wiser because many changes were made.



Significant changes since 1.0.0:

[1] The Huffman decoder will now handle erroneous Huffman codes (for instance, from a corrupt JPEG image.) Previously, these would cause libjpeg-turbo to crash under certain circumstances.

[2] Fixed typo in SIMD dispatch routines which was causing 4:2:2 upsampling to be used instead of 4:2:0 when decompressing JPEG images using SSE2 code.

[3] configure script will now automatically determine whether the INCOMPLETE_TYPES_BROKEN macro should be defined.



Significant changes since 1.0.1:

[1] Added emulation of the libjpeg v7 and v8b APIs and ABIs. See README-turbo.txt for more details. This feature was sponsored by CamTrace SAS.

[2] Created a new CMake-based build system for the Visual C++ and MinGW builds.

[3] TurboJPEG/OSS can now compress from/decompress to grayscale bitmaps.

[4] jpgtest can now be used to test decompression performance with existing JPEG images.

[5] If the default install prefix (/opt/libjpeg-turbo) is used, then 'make install' now creates /opt/libjpeg-turbo/lib32 and /opt/libjpeg-turbo/lib64 sym links to duplicate the behavior of the binary packages.

[6] All symbols in the libjpeg-turbo dynamic library are now versioned, even when the library is built with libjpeg v6b emulation.

[7] Added arithmetic encoding and decoding support (can be disabled with configure or CMake options)

[8] Added a TJ_YUV flag to TurboJPEG/OSS which causes both the compressor and decompressor to output planar YUV images.

[9] Added an extended version of tjDecompressHeader() to TurboJPEG/OSS which allows the caller to determine the type of subsampling used in a JPEG image.

[10] Added further protections against invalid Huffman codes.



Significant changes since 1.0.90 (1.1 beta1):

[1] The algorithm used by the SIMD quantization function cannot produce correct results when the JPEG quality is >= 98 and the fast integer forward DCT is used. Thus, the non-SIMD quantization function is now used for those cases, and libjpeg-turbo should now produce identical output to libjpeg v6b in all cases.

[2] Despite the above, the fast integer forward DCT still degrades somewhat for JPEG qualities greater than 95, so TurboJPEG/OSS will now automatically use the slow integer forward DCT when generating JPEG images of quality 96 or greater. This reduces compression performance by as much as 15% for these high-quality images but is necessary to ensure that the images are perceptually lossless. It also ensures that the library can avoid the performance pitfall created by [1].

[3] Ported jpgtest.cxx to pure C to avoid the need for a C++ compiler.

[4] Fixed visual artifacts in grayscale JPEG compression caused by a typo in the RGB-to-chrominance lookup tables.

[5] The Windows distribution packages now include the libjpeg run-time programs (cjpeg, etc.)

[6] All packages now include jpgtest.

[7] The TurboJPEG dynamic library now uses versioned symbols.

[8] Added two new TurboJPEG API functions, tjEncodeYUV() and tjDecompressToYUV(), to replace the somewhat hackish TJ_YUV flag.
Per comment 147, I intend to update to the new version before checking in.  That sound good to you, Jeff?
(In reply to comment #150)
> Per comment 147, I intend to update to the new version before checking in. 
> That sound good to you, Jeff?

That's fine. An interdiff would be good to have for a quick look over. Also, one thing that I forgot to mention in the review comments: I'd prefer that we only include the files that are needed for actually building libjpeg-turbo, and not check in all of the extra programs and documentation.
>diff --git a/jpeg/jpeglib.h b/jpeg/jpeglib.h
>-#ifdef XP_OS2
>-/*
>- * On OS/2, the system will have defined RGB_* so we #undef 'em to avoid warnings
>- * from jmorecfg.h.
>- */
>-#ifdef RGB_RED
>-	#undef RGB_RED
>-#endif
>-#ifdef RGB_GREEN
>-	#undef RGB_GREEN
>-#endif
>-#ifdef RGB_BLUE
>-	#undef RGB_BLUE
>-#endif

> This is probably needed for OS/2 but I'd rather it be fixed upstream.

Do you mean upstream of libjpeg-turbo, i.e., in libjpeg?  libjpeg-turbo is a fork of an old version of libjpeg, so there's really no "upstream" to speak of.
(In reply to comment #148)
> Comment on attachment 471713 [details] [diff] [review]
> Patch v1.3.1
[snip] 
> >diff --git a/jpeg/jcdctmgr.c b/jpeg/jcdctmgr.c
> >--- a/jpeg/jcdctmgr.c
> >+++ b/jpeg/jcdctmgr.c
> > /*
> >+ * Find the highest bit in an integer through binary search.
> >+ */
> >+LOCAL(int)
> >+flss (UINT16 val)
> >+{
> 
> It would be good to use compiler intrinsics like BitScanReverse and
> __builtin_clz here when they're available.
[snip]

https://sourceforge.net/tracker/?func=detail&aid=3249446&group_id=303195&atid=1278160

It's an improvement in both speed and size, but probably not enough to make a compelling case for including the patch.  We've seen better results in JS & NSPR because they both examine 32 bits.  The libjpeg-turbo code only looks at 16 bits, so there is less to be gained from the BSR instruction.
(In reply to comment #152)
> Do you mean upstream of libjpeg-turbo, i.e., in libjpeg?  libjpeg-turbo is a
> fork of an old version of libjpeg, so there's really no "upstream" to speak of.

I meant upstream in libjpeg-turbo instead of patching our local copy.
Oh, I see; that's a change we made, but which isn't in libjpeg-turbo.

Until it's fixed upstream, why don't we just guard the #define's in jmorecfg.h?  We already keep that file out of sync with libjpeg-turbo's mainline.
(In reply to comment #155)
> Oh, I see; that's a change we made, but which isn't in libjpeg-turbo.
> 
> Until it's fixed upstream, why don't we just guard the #define's in jmorecfg.h?
>  We already keep that file out of sync with libjpeg-turbo's mainline.

That sounds fine.
Attachment #471713 - Attachment is obsolete: true
Attachment #522449 - Attachment mime type: application/octet-stream → text/diff
Attachment #522449 - Attachment mime type: text/diff → text/plain
This interdiff is the same as attachment 522449 [details], but it doesn't include the files I deleted, in an attempt to make it easier to read.
Attachment #522445 - Flags: review?(jmuizelaar)
DRC, what would be a good medium for handling Jeff's remaining review comments?

I'm going to push the latest patches to try now.
I'll address the review comments that are relevant to the upstream code here:

(1) jround_up() argument types:  this was the result of a bad game of Celebrity Whack-a-Mole that I played while trying to port the code to Win64.  Jeff's idea of creating a local, power-of-2-specific round-up function for jmemmgr.c is a better idea, and I've done so in a new patch (attachment 522544 [details] [diff] [review]) (actually, I made it a macro instead of a function.)  Please let me know if that looks like a reasonable solution.

(2) RGB_{RED|GREEN|BLUE} macro conflicts on OS/2 (OS/2?  Really?):  This seems like an innocuous enough patch, although I'd rather have it in jmorecfg.h instead of jpeglib.h.  I'd also like to know more of the back story as to why OS/2 is defining those macros.  Is this a generic problem that anyone building libjpeg-turbo on OS/2 would encounter?  My lingering doubt here is that libjpeg supports OS/2, so I wonder why they haven't encountered this problem on that platform.  Is Mozilla including some header file that causes those macros to be defined?

(3) bitscan instructions:  I tested the proposed patch to make jcdctmgr.c use GCC 3.4+ compiler intrinsics to replace flss() with bitscan instructions, but I cannot observe any measurable speedup from this (tested several different image types and platforms, as well as 32-bit and 64-bit.)  If it were even 2% faster, I would agree to include it, but a performance patch that causes no appreciable increase in performance does nothing but needlessly obfuscate the code.

(4) The tables in jccolor.c are part of the optimized RGB-to-grayscale conversion routines.  These are pre-computed R-to-luminance, G-to-luminance, and B-to-luminance conversion values.  The problem with this approach is that it loses a bit of accuracy due to round-off, so I recently checked in code to the SVN trunk (1.2 working version) that uses new SIMD routines to perform RGB-to-grayscale conversion.  This is faster and eliminates the round-off issues, as well as the need for the tables in jccolor.c and the #ifdef in rgb_gray_convert().  (A sketch of how these tables are built follows after this list.)

(5) The rest of the concerns regarding jccolor.c have already been addressed in trunk as well (pulling rgb_red, rgb_green, rgb_blue out of the loop, etc.)
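
To make (4) concrete, those tables are built from the usual luma weights, roughly like this (a simplified sketch; the exact scaling and rounding in jccolor.c are not reproduced):

  #include <math.h>

  static int red_lut[256], green_lut[256], blue_lut[256];

  /* Precompute per-component contributions to luminance so that the
   * pixel loop can do Y = red_lut[r] + green_lut[g] + blue_lut[b].
   * Rounding each entry separately is where the accuracy loss
   * mentioned above comes from. */
  static void
  build_gray_luts (void)
  {
    int i;
    for (i = 0; i < 256; i++) {
      red_lut[i]   = (int) floor (0.29900 * i + 0.5);
      green_lut[i] = (int) floor (0.58700 * i + 0.5);
      blue_lut[i]  = (int) floor (0.11400 * i + 0.5);
    }
  }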
> (2) RGB_{RED|GREEN|BLUE} macro conflicts on OS/2 (OS/2?  Really?):  This seems
> like an innocuous enough patch, although I'd rather have it in jmorecfg.h
> instead of jpeglib.h.

Attachment 522446 [details] [diff] puts the #undef's in jmorecfg.h, right above the #defines.
And just FYI, another thing that I'm currently being contracted to do is investigate rewriting the performance optimizations in j{c|d}huff*.  This will hopefully allow me to re-license those files in libjpeg-turbo 1.2.
Attachment #522445 - Flags: review?(jmuizelaar) → review+
Comment on attachment 522544 [details] [diff] [review]
Proposed upstream patch for addressing concern about jround_up() argument types.


>+#define PAD(size, align) (((size) + (align) - 1) & (~((align) - 1)))

I usually prefer functions over macros for the additional type safety and to avoid evaluating arguments twice (e.g. PAD(x, a++))
(In reply to comment #163)
> (2) RGB_{RED|GREEN|BLUE} macro conflicts on OS/2 (OS/2?  Really?):  This seems
> like an innocuous enough patch, although I'd rather have it in jmorecfg.h
> instead of jpeglib.h.  I'd also like to know more of the back story as to why
> OS/2 is defining those macros.  Is this a generic problem that anyone building
> libjpeg-turbo on OS/2 would encounter?  My lingering doubt here is that libjpeg
> supports OS/2, so I wonder why they haven't encountered this problem on that
> platform.  Is Mozilla including some header file that causes those macros to be
> defined?

I'd be fine with dropping this altogether. If OS/2 still needs it, we can always add it back.

> 
> (3) bitscan instructions:  I tested the proposed patch to make jcdctmgr.c use
> GCC 3.4+ compiler intrinsics to replace flss() with bitscan instructions, but I
> cannot observe any measurable speedup from this (tested several different image
> types and platforms, as well as 32-bit and 64-bit.)  If it were even 2% faster,
> I would agree to include it, but a performance patch that causes no appreciable
> increase in performance does nothing but needlessly obfuscate the code.

It's probably worth adding a comment to flss saying that it's not performance critical enough to justify using compiler intrinsics.

> (4) The tables in jccolor.c are part of the optimized RGB-to-grayscale
> conversion routines.  These are pre-computed R-to-luminance, G-to-luminance,
> and B-to-luminance conversion values.  The problem with this approach is that
> it loses a bit of accuracy due to round-off, so I recently checked in code to
> the SVN trunk (1.2 working version) that uses new SIMD routines to perform
> RGB-to-grayscale conversion.  This is faster and eliminates the round-off
> issues, as well as the need for the tables in jccolor.c and the #ifdef in
> rgb_gray_convert().

That sounds like a good solution to me.
(In reply to comment #165)
> And just FYI, another thing that I'm currently being contracted to do is
> investigate rewriting the performance optimizations in j{c|d}huff*.  This will
> hopefully allow me to re-license those files in libjpeg-turbo 1.2.

This sounds great.
(In reply to comment #166)
> >+#define PAD(size, align) (((size) + (align) - 1) & (~((align) - 1)))
> 
> I usually prefer functions over macros for the additional type safety and to
> avoid evaluating arguments twice (e.g. PAD(x, a++))

My experience has been that the compiler does a good job of optimizing out that duplicated evaluation, and avoiding the overhead of an additional function call usually makes up for it.  In the specific case we're referring to, though, it doesn't matter enough to worry about, and a function call is more consistent with the way it worked before.  Re-submitted as attachment 522583 [details] [diff] [review].
(In reply to comment #170)
> (In reply to comment #166)
> > >+#define PAD(size, align) (((size) + (align) - 1) & (~((align) - 1)))
> > 
> > I usually prefer functions over macros for the additional type safety and to
> > avoid evaluating arguments twice (e.g. PAD(x, a++))
> 
> My experience has been that the compiler does a good job of optimizing out that
> duplication of effort, and the lack of overhead of the additional function call
> usually makes up for it, but in the specific case we're referring to, it
> doesn't matter enough to worry about, and using a function call is more
> consistent with the way it worked before.  Re-submitted as attachment 522583 [details] [diff] [review] 

I was referring to the fact that PAD(x, a++) does a++ twice which is not what the caller is likely to expect. Regardless, I like the new version better.
Ah, the orange is due to unexpected success.  That's not so bad.  :)

I think we can just back out bug 582850.

http://hg.mozilla.org/mozilla-central/rev/4643426a1523
Yep, and this should close bug 477728
Attachment #522674 - Flags: review?(jmuizelaar) → review+
Why is support for arithmetic encoding/decoding disabled?
Don't we need to use the libjpeg v8 ABI in order to get this?
(In reply to comment #176)
> Why is support for arithmetic encoding/decoding disabled?

We haven't decided whether to enable this yet. This is a separate issue and needs a separate bug, as well as a lot more thought/research.
(In reply to comment #177)
> Don't we need to use the libjpeg v8 ABI in order to get this?

No, it works when emulating the v6b ABI as well.  It just adds two additional functions without breaking compatibility with the rest of the ABI.
http://hg.mozilla.org/mozilla-central/rev/6a8baba56a65
http://hg.mozilla.org/mozilla-central/rev/3eb5a574319f

Please file follow-ups as needed.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Apparently this causes an OOM in yasm 0.8.something; we may need to raise the minimum version.
More specifically, with yasm 0.8.0.2194, which is the latest available by default with Ubuntu 10.04. yasm 1.1.0.2352 works fine.
Fedora 13 has old yasm too
Depends on: 646485
Depends on: 646489
(Though honestly we'd really prefer to make it buildable with nasm or gas.  But that's another bug...)
Depends on: 646704
NOTE:  libjpeg-turbo trunk now contains a version of jdhuff.c and jdhuff.h (Huffman decoder) with the optimizations re-written and re-licensed under the same BSD-style license as the rest of libjpeg.  Looking into jchuff* as well, but I figured you were probably more interested in the decoder than the encoder.
I've filed bug 650899 for updating libjpeg-turbo.
Blocks: 650899
Target Milestone: --- → mozilla5
Jesse, why did you nominate this bug for tracking-firefox5?
Because bsmedberg told me to in http://groups.google.com/group/mozilla.dev.planning/msg/bfc2fabaecacdd97.  Also because I think it should be listed on https://wiki.mozilla.org/Features/Release_Tracking to make sure it gets the right security reviews, license file additions, changelog entries, crash stats scrutiny, and whatever else people follow that page for.
NOTE:  libjpeg-turbo trunk now contains a version of jchuff.c
(Huffman encoder) with the optimizations re-written and re-licensed under the
same BSD-style license as the rest of libjpeg.
... and the bug for tracking that is bug 650899.
That bug is for tracking the decoder.  I'm talking about the encoder now.
FYI, the Chrome team has apparently done a security review of the library:
http://code.google.com/p/chromium/issues/detail?id=48789#c3
The security team finished fuzzing as planned; we are good here.
Blocks: 662346