Closed Bug 496298 Opened 15 years ago Closed 12 years ago

Implement ARM NEON optimized IDCT for JPEG decoding

Categories: Core :: Graphics: ImageLib (enhancement)
Hardware/OS: ARM / Linux
Severity: normal
Status: RESOLVED FIXED
People: Reporter: ilpo.ruotsalainen; Assignee: Unassigned
References: Blocks 1 open bug
Attachments: 1 file

Attached patch: WIP patch (Splinter Review)
Implementing the IDCT in libjpeg with NEON can give a large performance benefit for decoding JPEG images.

This is a WIP patch: the NEON IDCT code works, but runtime NEON detection and general cleanup are still missing.
Some benchmarking numbers (both test runs include the earlier NEON colorspace code):

Without patch:
643 runs, min/max/avg/stddev: 314.13/356.16/345.72/5.94

With patch:
126 runs, min/max/avg/stddev: 290.23/312.19/301.20/5.46
(average 345.72 -> 301.20, i.e. roughly a 15% speedup)
Depends on: 573948
(In reply to comment #2)
> What are we doing with this bug since bug #573948 landed ?

We should get this or something like it landed in upstream libjpeg-turbo.
(In reply to comment #3)
> We should get this or something like it landed in upstream libjpeg-turbo.

Agreed.  DRC, would you take patches for NEON support?  Ilpo, see bug 583958 for NEON detection.
ARM isn't currently a platform we support, mainly because we can't support what we can't test.  I also don't take WIPs.  However, there is interest in the community for accelerated ARM support, so I would accept the patch into the evolving LJT code base under the following conditions:

(1) It must be integrated with libjpeg-turbo's build system:
    (a) configure.ac needs to be modified such that an automake conditional, SIMD_NEON, is enabled if the platform is ARM and with_simd=yes.
    (b) Rather than hack jddctmgr.c, a new SIMD stub file (jsimd_neon.c) needs to be created for NEON (use jsimd_none.c as a template).
    (c) Makefile.am should include jsimd_neon.c and jidctneo.c in the build if the SIMD_NEON AM conditional is defined.

(2) If NEON is not universally available on all ARM platforms, then appropriate run-time detection needs to be implemented.  The easiest way to do this is in the body of the jsimd_can_idct_islow() function (see the sketch after this list).

(3) Someone needs to volunteer to test the upstream modifications once I've checked them in, so it can be verified that they build correctly and that they produce the same 15% speedup reported above.
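
For illustration, a minimal sketch of what (2) could look like in jsimd_neon.c, assuming the common Linux approach of parsing /proc/cpuinfo (hypothetical code, not the eventual upstream implementation; the real stub would use libjpeg's GLOBAL() linkage macro):

    #include <stdio.h>
    #include <string.h>

    static int neon_checked = 0;
    static int neon_present = 0;

    /* Parse /proc/cpuinfo once and cache whether "neon" appears in the
       Features line. */
    static int have_neon (void)
    {
      FILE *f;
      char line[1024];

      if (neon_checked)
        return neon_present;
      neon_checked = 1;
      if ((f = fopen("/proc/cpuinfo", "r")) != NULL) {
        while (fgets(line, sizeof(line), f))
          if (!strncmp(line, "Features", 8) && strstr(line, " neon"))
            neon_present = 1;
        fclose(f);
      }
      return neon_present;
    }

    int jsimd_can_idct_islow (void)
    {
      /* Claim the SIMD islow IDCT only when NEON is actually present. */
      return have_neon();
    }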

I don't mind doing (1) myself as long as someone else can check my work.  I'm not sure how I would go about testing any of this.
We had actually done some ARM NEON optimizations for both jpeg encoding and decoding for the Nokia N900, based on libjpeg6. But it took a long time to get approval to open source these patches, which finally happened several months ago in the form of a quick and simple conversion of the old ARM NEON patchset to libjpeg-turbo: https://maemo.gitorious.org/meego-image-editor/libjpeg-turbo

While we were waiting for this approval, the world moved on, and there also happens to be some ongoing work on the Linaro side: https://blueprints.launchpad.net/linaro-multimedia-wg/+spec/multimedia-linaro-optimize-jpeg-decoding
Since then I dropped the ball and just waited to see what would happen, which may not have been the best idea. But Mozilla's decision to finally take libjpeg-turbo into use now surely adds some clarity to the whole picture.

Sorry for a relatively long post; this is just how things look from my side.

Anyway, I think I can provide you with ARM NEON optimizations of acceptable quality for libjpeg-turbo. I wonder what the best place is to submit patches to libjpeg-turbo for review. Would you prefer them posted to a mailing list or attached in a bug tracker?
Regarding this particular patch from Ilpo Ruotsalainen: has anybody checked whether its license header is acceptable/compatible with both libjpeg-turbo and Firefox? Is it safe to take?
Gerv, does this license header look ok to you?
Yep. It will need adding to about:license. You can either do that in the same patch, or file a follow-up bug (CC me, and I'll review that part).

Gerv
> quality for libjpeg-turbo. I wonder what the best place is to submit patches
> to libjpeg-turbo for review. Would you prefer them posted to a mailing list
> or attached in a bug tracker?

If you want them reviewed upstream, please submit them to the Patches tracker on the libjpeg-turbo Sourceforge page.  I'm interested in such patches.  I would just need someone else to test them once they're integrated.
(In reply to comment #10)
> If you want them reviewed upstream, please submit them to the Patches tracker
> on the libjpeg-turbo Sourceforge page.

OK, thanks, I will try that at the beginning of next week, after I finish some other things.

> I'm interested in such patches.  I would just need someone else to test them 
> once they're integrated.

I keep in contact with the ARM maintainers of the Gentoo Linux distro and also Ångström, so additional testing should not be a problem.
It took a bit longer to verify everything, but now I have added some initial ARM NEON patches to the libjpeg-turbo patches tracker:
https://sourceforge.net/tracker/?func=detail&aid=3291291&group_id=303195&atid=1278160

And also some other things which are not directly related, but would be nice to fix:
https://sourceforge.net/tracker/?func=detail&aid=3291302&group_id=303195&atid=1278160
https://sourceforge.net/tracker/?func=detail&aid=3291305&group_id=303195&atid=1278160

And the 'jpeg_idct_islow' patch from Qualcomm attached here seems to produce results which differ from the C implementation, hence failing the libjpeg-turbo tests. If this is unacceptable, then I'm afraid the best option might be to drop it and re-implement 'jpeg_idct_islow' from scratch. Anyway, if anyone wants to experiment with this Qualcomm idct code, I have put it here:
  http://cgit.freedesktop.org/~siamashka/libjpeg-turbo/log/?h=qualcomm-idct
> And the 'jpeg_idct_islow' patch from Qualcomm attached here seems to produce
> results which differ from the C implementation, hence failing the
> libjpeg-turbo tests.

I think we (and, I imagine, DRC) would like at least to understand why the tests are failing.
(In reply to comment #13)
> I think we (and, I imagine, DRC) would like at least to understand why the
> tests are failing.

It could be some bug in the code (for example an unexpected overflow), or maybe somewhat relaxed precision to gain a bit more performance. The latter probably can't be ruled out, because bug 489148 explicitly mentions "accuracy is slightly worse than the C code" for another jpeg NEON optimization.

Another possibility is that I messed up and applied the patch incorrectly. That's why I provided a branch with this attempt in a git repository, so that anyone can review it.

In any case, dealing with non-bitexact optimizations is an extra pain and a maintenance burden, simply because it's hard to say for sure whether the code is wrong or just different. At the very least, it is highly desirable to have some reference C code which provides the same results as the assembly-optimized code.

But I guess what Mozilla actually needs in the context of this bug is just some correct and fast ARM NEON optimized idct for the jpeg decoder. We can try to fix the attached patch, implement a new idct, or maybe even just wait and see the outcome of Linaro's jpeg optimization efforts.
From my POV, I can't accept anything upstream right now that breaks mathematical compatibility with libjpeg v6b.  That includes, as Siarhei and I both discovered, back-porting some of the IDCT scaling optimizations from libjpeg v8b.

libjpeg-turbo at least claims to be 100% compatible with jpeg-6b.  If it isn't, I consider that to be a bug.  I am really unsure how to deal with the optimized DCT/IDCT scaling routines in libjpeg v8.  They do improve performance a bit (as much as 10%) when using scaled decompression, but I would have to modify the SIMD code to match their output.  This is tricky, because I didn't write the SIMD code in question, and in any case, it would break compatibility with jpeg-6b.

The only way around the dilemma would seemingly be to enable the faster routines only if libjpeg-turbo is configured with --with-jpeg8, but of course that creates a support nightmare, because now I have to implement two different sets of tests.

Ugh.  None of the above is impossible, but seeing as how it's unpaid work, I'm very uninterested in hacking up the LJT code base in that manner just to get a 10% speedup on the non-SIMD code.

Same goes for ARM.  I don't want to check in anything that doesn't pass the unit tests.
(In reply to comment #15)
> Same goes for ARM.  I don't want to check in anything that doesn't pass the
> unit tests.

This leaves us with just this set of patches for now: https://sourceforge.net/tracker/?func=detail&aid=3291291&group_id=303195&atid=1278160
Review would be very much welcome, and I'm ready to fix the code according to your feedback. I tried not to break anything, and passing the unit tests surely goes without saying. Moreover, I did some additional verification, because the default tests are a bit lacking in coverage (the test suite can still be improved, and I may provide some more patches for it later). Basically, this patchset is what Mozilla needs for jpeg decoding, provided Fennec can be configured to use 'jpeg_idct_ifast' (which is no worse than using some other NEON optimized idct that in fact does not behave exactly as 'jpeg_idct_islow' should). I plan to make a real bitexact ARM NEON optimized 'jpeg_idct_islow' later, primarily because that is what normal Linux users want.

Regarding the other things: I'm not even sure whether Mozilla needs the scaled decoding feature or jpeg encoding. As I already mentioned, they are not directly related to this bug and are more like collateral discoveries :)
Scaled decoding came up in the WebP decoder as well, so we should probably have a separate bug to consider it since there's no infrastructure to handle it right now.
I think we could probably switch mobile to ifast if it's a 2x improvement and bitwise identical to libjpeg's ifast.

We may have some bugs relating to this, but bugzilla search isn't working for me at the moment.
libjpeg-turbo SVN trunk now has a basic minimum of ARM NEON optimizations for faster jpeg decoding. Some more optimizations are also available as separate patches, which (I assume) could possibly be applied too if that is OK for the libjpeg-turbo and Mozilla release schedules.
What's the performance like?
(In reply to comment #20)
> What's the performance like?

The detailed benchmarking/profiling data is available in commit messages from http://cgit.freedesktop.org/~siamashka/libjpeg-turbo/log/?h=sent/20110422-arm-neon-try1

With pure C code, the iDCT (ifast) and YUV->RGB conversion together used to take ~70% of the CPU time spent on jpeg decoding. The use of ARM NEON shrinks the time spent in these functions by ~4x and ~10x respectively which, by Amdahl's law, works out to roughly a 2x overall speedup.

Additionally optimizing fancy upsampling is likely to provide ~10% extra speedup or more. But the iDCT and YUV->RGB conversion are definitely the most important low-hanging fruit.
The NEON YUV->RGB conversion code also provides optimized JCS_EXTENSIONS (bug 584652). The standalone djpeg program decodes into 24-bit RGB format, but Mozilla needs 32-bit pixels, or on Maemo/Android actually more likely r5g6b5, which would be a very nice new feature for libjpeg-turbo.
You're not the first person to ask for 5-6-5, but it's not exactly a straightforward thing to do.  The RGB colorspace extensions flowed naturally out of the existing libjpeg and libjpeg/SIMD code, since the upstream code already had the ability to put R, G, and B into an arbitrary position in the pixel via the RGB_{RED|GREEN|BLUE|PIXELSIZE} macros.  I simply extended this same concept so that the behavior could be controlled at run time instead of compile time.

5-6-5 would require a completely different set of color conversion routines, including new SIMD code.  That's a big enough task that it's just not going to happen unless someone wants it badly enough to pay me for the labor to implement it.
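
(As a rough editorial illustration of that concept, with hypothetical names rather than libjpeg-turbo's actual code: the macros fix the component layout at compile time, while the colorspace extensions make the same choice at run time by indexing with per-layout offsets.)

    /* Hypothetical sketch: run-time offsets replace the compile-time
       RGB_RED/RGB_GREEN/RGB_BLUE/RGB_PIXELSIZE macros. */
    typedef struct {
      int red_off, green_off, blue_off, pixel_size;
    } rgb_layout;

    static const rgb_layout RGB_LAYOUT  = { 0, 1, 2, 3 };  /* classic 24-bit RGB */
    static const rgb_layout BGRX_LAYOUT = { 2, 1, 0, 4 };  /* one 32-bit variant */

    static void store_pixel (unsigned char *out, const rgb_layout *l,
                             unsigned char r, unsigned char g, unsigned char b)
    {
      out[l->red_off]   = r;
      out[l->green_off] = g;
      out[l->blue_off]  = b;  /* the 4th byte (X/alpha), if any, is left alone */
    }

A 5-6-5 target, by contrast, can't be expressed as byte offsets at all; it needs shifting and packing into 16-bit words, hence the completely different set of conversion routines described above.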
The problem with efficient RGB565 support is even broader if we look at the browser as a whole. In order to look nice on screen, the images also need dithering, and dithering is better applied after scaling, in the case where scaling is needed. Moreover, bilinear scaling ARGB8888->RGB565 is faster than bilinear scaling RGB565->RGB565 on ARM, but ARGB8888 takes a lot of space in memory. One more interesting thing is that unscaled YUV->RGB conversion is typically a memory-bandwidth-constrained operation when using ARM NEON optimizations. That means just keeping the images in YUV format and doing direct blits from them, with the color conversion performed on the fly, is both memory efficient and fast.

I understand that I may just be adding more confusion here :) But the point is that it makes sense to apply all the easy optimizations (for jpeg, png, ...) and use them as "building blocks" to get better performance right now. Getting even better performance for handling images in Firefox extends beyond the scope of the libjpeg-turbo library.

BTW, are there any good browser-specific jpeg loading benchmarks? I tried to hack something together, but it has way too much other stuff happening: http://people.freedesktop.org/~siamashka/images/jpg-load-bench.html

Justin, do you have any plan or schedule regarding jpeg optimizations?
> Justin, do you have any plan or schedule regarding jpeg optimizations?

Aside from this bug and bug 650899, I'm not tracking any jpeg optimization bugs at the moment.  Should I be?  :)
(In reply to comment #25)
> Aside from this bug and bug 650899, I'm not tracking any jpeg optimization
> bugs at the moment.  Should I be?  :)

Bug 584652 is quite important; otherwise everyone without SSE2 (and ARM devices surely don't have it) gets the following C code used for colorspace conversion, overriding any NEON optimizations from libjpeg-turbo: http://mxr.mozilla.org/mozilla2.0/source/modules/libpr0n/decoders/nsJPEGDecoder.cpp#1121

But I still would like to get some answer to the "are there any good browser-specific jpeg loading benchmarks?" question. Or is jpeg decoding performance a really low priority thing for Mozilla, so it's not being tracked at all? I personally would guess that it's quite likely, considering how long this particular bug has been around without being resolved.
Also, the code in libjpeg-turbo SVN currently has the same problem as bug 656782. This needs to be taken into account when pulling a new libjpeg-turbo snapshot into the Mozilla source tree, unless it gets fixed in libjpeg-turbo SVN before that.
> Or is jpeg decoding performance a really low priority thing for Mozilla, so 
> it's not being tracked at all?

I don't think it's fair to say that JPEG decoding performance is a "really low priority" -- it took a large amount of work from a number of people to land libjpeg-turbo in the first place.  There's also a tracking bug for jpeg perf: bug 579106.

I think it is fair to say, however, that we tend to concentrate first on those issues which noticeably impact users and second on those issues which we can numerically demonstrate to be significant.

As it is, it seems that the colorspace conversion isn't a large portion of the jpeg decoding runtime.  (That's basically bug 584652 comment 13.)  If you have other numbers (or if you'd like help benchmarking), please comment in that bug!
(In reply to comment #27)
> Also, the code in libjpeg-turbo SVN currently has the same problem as bug
> 656782. This needs to be taken into account when pulling a new libjpeg-turbo
> snapshot into the Mozilla source tree, unless it gets fixed in libjpeg-turbo
> SVN before that.

We have a plan to upgrade to NDK r5b.  After upgrading to it, this will be resolved.  (NDK r5's gcc 4.4.3 defines __linux__.)

Also, the latest libjpeg-turbo NEON optimization seems to be ifast-only.  If we change to ifast, we have to modify some reftests (two years ago we tried changing to ifast on some platforms in bug 416157, but due to the resulting test oranges it was backed out immediately).
(In reply to comment #29)
> We have a plan to upgrade to NDK r5b.  After upgrading to it, this will be
> resolved.  (NDK r5's gcc 4.4.3 defines __linux__.)

Thanks for the information. Do you happen to know if this change was actually intended? Does the Android documentation explicitly mention it somewhere? They seem to have a track record of screwing up in this area:
    http://groups.google.com/group/android-ndk/browse_thread/thread/71ff7f7f548b5e5b/573450fddf2b8dab

> Also, the latest libjpeg-turbo NEON optimization seems to be ifast-only.
> If we change to ifast, we have to modify some reftests (two years ago we
> tried changing to ifast on some platforms in bug 416157, but due to the
> resulting test oranges it was backed out immediately).

Yes, this is to be expected, and it surely makes testing more complicated. Based on comment 18, I was assuming that it may be acceptable for Mozilla to make such an adjustment for mobile. After all, mobile Firefox already uses nearest scaling instead of bilinear if NEON is not available, and this must be handled by the automated tests somehow (is it?).

But I agree that just implementing a NEON optimized 'jpeg_idct_islow' may be easier than tweaking the tests.

That said, if Mozilla needs any assistance from me with NEON optimizations for jpeg decoding, it's better to get this done sooner rather than later. I simply can't be on 'standby' forever and also have other things to do (the same was probably true for Ilpo Ruotsalainen, who originally submitted this bug). One tricky thing is that no stable libjpeg-turbo releases are planned before 2012, as stated in DRC's last comment:
    https://sourceforge.net/tracker/?func=detail&aid=3291291&group_id=303195&atid=1278160
Does it make any sense to backport the NEON optimizations to libjpeg-turbo 1.1.1?
(In reply to comment #30)
> (In reply to comment #29)
> > We have a plan to upgrade to NDK r5b.  After upgrading to it, this will be
> > resolved.  (NDK r5's gcc 4.4.3 defines __linux__.)
> 
> Thanks for the information. Do you happen to know if this change was
> actually intended? Does the Android documentation explicitly mention it
> somewhere? They seem to have a track record of screwing up in this area:
>    
> http://groups.google.com/group/android-ndk/browse_thread/thread/
> 71ff7f7f548b5e5b/573450fddf2b8dab

There is no documentation about __linux__.  I think this is because the build specs were changed in NDK r5 / gcc 4.4.3.  But since __ANDROID__ is defined by -mandroid (an option which became the default setting in 4.4.3), using __ANDROID__ is better for supporting multiple gcc versions.
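
(For illustration, the kind of guard this implies; a minimal sketch assuming the -mandroid behavior described above:)

    /* Prefer __ANDROID__, which -mandroid defines across NDK gcc versions,
       over __linux__, which older Android NDK toolchains did not define. */
    #if defined(__ANDROID__) || defined(ANDROID)
      /* Android-specific path */
    #elif defined(__linux__)
      /* generic Linux path */
    #endif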
If you guys want to back-port to LJT 1.1.1, you're on your own.  That is a stable branch in the upstream code, so I would not check in a potentially disruptive feature such as this.

The re-licensing work in the libjpeg-turbo trunk has been completed, so really ARM support is the only other major feature that needs to solidify before 1.2 beta.  Still waiting for some agreement between the iOS and Android camps as to the appropriate set of patches, as well as benchmarking information to verify the speedup of the NEON code.
To make this a little easier for others to follow:

(From comment 30)
> One tricky thing is that no stable libjpeg-turbo releases are planned before
> 2012, as stated in DRC's last comment:
>   https://sourceforge.net/tracker/?func=detail&aid=3291291&group_id=303195&atid=1278160

The comment being referred to is:

(DRC)
> LJT 1.2 is slated for early 2012, but the beta may land this Fall if I can
> finish the re-licensing work in a shorter timeframe.

But the re-licensing work is done, which explains the first sentence of the second paragraph of comment 32.  So, since we don't necessarily need to wait until next year for a stable LJT with NEON support, I don't think we need to worry about backporting.
(In reply to comment #32)
> The re-licensing work in the libjpeg-turbo trunk has been completed, so
> really ARM support is the only other major feature that needs to solidify
> before 1.2 beta.

OK, sounds good.

> Still waiting for some agreement between the iOS and Android camps as to
> the appropriate set of patches,

I'm actually from the "traditional Linux" camp myself, and there is no disagreement between me and Martin Aumüller at:
    https://sourceforge.net/tracker/?func=detail&aid=3303840&group_id=303195&atid=1278160
The whole issue really consists of two parts:
1. Tweak the assembly code to make it less problematic for the iOS assembler (refactor the code to use a smaller subset of GNU assembler features without causing any regressions).
2. Add iOS support itself.

The first 3 patches from him solve the first part. But I can't do much about the second part other than confirm that it does not seem to immediately break Linux/Android.

> as well as benchmarking information to verify the speedup of the NEON code.

Some benchmarking data has been available since comment 21 (tested with a small collection of 5 MPix pictures taken by my phone camera) and in the commit messages from my git repository:
 - http://cgit.freedesktop.org/~siamashka/libjpeg-turbo/commit/?h=sent/20110422-arm-neon-try1&id=eddaf138bdf3c08699577c4cc18646eca991f29a
 - http://cgit.freedesktop.org/~siamashka/libjpeg-turbo/commit/?h=sent/20110422-arm-neon-try1&id=cef39af8622fe26e7e452d138d9042755a7a2fd4

I also just found some time for this and posted 'tjbench' benchmark numbers as you asked:
   https://sourceforge.net/mailarchive/message.php?msg_id=27613210
For what it's worth, I have translated the NEON iDCT implementation in libjpeg-turbo trunk (Siarhei's, I think?) into compiler intrinsics, which should be compatible with both GCC and RealView.  It would be very interesting to see what the iOS toolchain does with this.

Released GCC versions are not yet very good at optimizing the NEON register allocation and load/store scheduling, but work on this is taking place as part of the Linaro toolchain effort.  (Even with the rather awful register management in 4.5.x, my preliminary tjbench runs show JPEG decompression performance within about 10% of the assembly version.)
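
(For readers unfamiliar with the style, here is a minimal editorial sketch of what such a translation looks like; it is illustrative only and not taken from the actual patch:)

    #include <arm_neon.h>

    /* One even/odd butterfly step of a fixed-point iDCT pass, written with
       intrinsics; the compiler chooses registers and schedules the code.
       c_fix is a Q15 fixed-point constant. */
    static inline void idct_butterfly (int16x8_t *a, int16x8_t *b, int16_t c_fix)
    {
      int16x8_t sum  = vaddq_s16(*a, *b);
      int16x8_t diff = vsubq_s16(*a, *b);
      /* saturating doubling multiply, keeping the high 16 bits */
      *b = vqdmulhq_s16(diff, vdupq_n_s16(c_fix));
      *a = sum;
    }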

Patch here:  https://github.com/mkedwards/crosstool-ng/blob/master/patches/libjpeg-turbo/trunk/0001-Implement-jsimd_idct_ifast-using-NEON-intrinsics.patch

GCC 4.6 patch here, to work around ICEs when scheduling bulk spill/reload of NEON registers to/from stack:  https://github.com/mkedwards/crosstool-ng/blob/master/patches/gcc/linaro-4.6-bzr/1004-reload-large-vectors.patch

However, I ran into something rather unpleasant in the course of trying to test this code.  libjpeg-turbo's "tjunittest" seems to be insensitive to most of the sign-flips and other damage I introduced in order to get the unit test to break.  (Sufficiently blatant damage does break it, so the unit test is running the new code.)  So I can't make any claims about the correctness of the translation (which I have made no attempt yet to test visually) without a more quantitative accuracy test.

How does the Mozilla team assess the accuracy of a candidate JPEG decompression library?
I'm very open to suggestions regarding better unit tests.  TJUnitTest was never really meant to validate anything but the TurboJPEG wrapper code.  The TurboJPEG wrapper is an abstraction layer for VirtualGL, implemented so that VirtualGL could take advantage of a variety of proprietary codecs (IPP, mediaLib, Pegasus Imaging, Quicktime, libjpeg, etc.)  Most of these implementations have long since been deprecated, but the wrapper was brought into libjpeg-turbo to allow people to easily drop in libjpeg-turbo as a replacement for TurboJPEG/IPP and the other implementations that VirtualGL commonly used.  TJUnitTest largely makes the assumption that the underlying codec is mathematically correct, and what it's really testing is the correctness of the wrapper.

The other unit tests, which do byte compares against images that were compressed/decompressed with the unmodified libjpeg code, are really the tests that are validating the mathematical correctness of the underlying libjpeg-turbo codec.  If your intentional errors aren't making those tests fail, then we probably need to add some new tests.
(In reply to comment #35)
> How does the Mozilla team assess the accuracy of a candidate JPEG
> decompression library?

We have a set of JPEG reftests. I can't vouch for how complete they are, but upgrading to libjpeg-turbo did cause some of them to pass when they were failing before.

https://developer.mozilla.org/en/Creating_reftest-based_unit_tests
Ah, so I'm doing it wrong.  What a relief.  ;-)  Apologies if my comment seemed critical; I was just puzzled by the insensitivity of the unit test to numerical changes.  Duh; that's because it's a *unit* test of the layer above.

It looks like I need to run the tests that start with "./djpeg -dct fast" to get the IFAST implementation of inverse DCT.  I suppose I will need to alter the constants that represent the fractional part of the iDCT polynomial coefficients back to the 7-bit values in the NEON assembly version.  I'll do that, and debug, and update the patch.

The question for the Mozilla team remains: are you going strictly on the byte-for-byte match with djpeg, or is there some tool you use to measure decompression accuracy against the original (or against an "ideal" with the expected artifacts at that compression level)?
Joe Drew:  oops, race.  Is there a good subset of the reftests for exercising JPEG decoding?
(In reply to comment #39)
> Joe Drew:  oops, race.  Is there a good subset of the reftests for
> exercising JPEG decoding?

Our only JPEG reftests are in modules/libpr0n/test/reftest/jpeg and yes, they're pixel-perfect tests.
I fixed bugs and did some tuning for performance and accuracy, using a quantitative measure (the "signal-to-noise ratio" reported by pnmpsnr: http://netpbm.sourceforge.net/doc/pnmpsnr.html).  I'll need to run more images through it, but the initial results look good:

build@ctbu-bld5:~/CTBU/MonoLake-foreign/opensource-core$ pnmpsnr build-trees/libjpeg-turbo/testorig.ppm build-trees/libjpeg-turbo/testimgflt.ppm 
pnmpsnr: PSNR between build-trees/libjpeg-turbo/testorig.ppm and build-trees/libjpeg-turbo/testimgflt.ppm:
pnmpsnr: Y  color component: 57.81 dB
pnmpsnr: Cb color component: 62.75 dB
pnmpsnr: Cr color component: 57.71 dB

build@ctbu-bld5:~/CTBU/MonoLake-foreign/opensource-core$ pnmpsnr build-trees/libjpeg-turbo/testorig.ppm build-trees/libjpeg-turbo/testimgfst.ppm 
pnmpsnr: PSNR between build-trees/libjpeg-turbo/testorig.ppm and build-trees/libjpeg-turbo/testimgfst.ppm:
pnmpsnr: Y  color component: 50.27 dB
pnmpsnr: Cb color component: 50.63 dB
pnmpsnr: Cr color component: 50.06 dB

build@ctbu-bld5:~/CTBU/MonoLake-foreign/opensource-core$ pnmpsnr build-trees/libjpeg-turbo/testorig.ppm build-trees/libjpeg-turbo/testoutfst.ppm 
pnmpsnr: PSNR between build-trees/libjpeg-turbo/testorig.ppm and build-trees/libjpeg-turbo/testoutfst.ppm:
pnmpsnr: Y  color component: 56.79 dB
pnmpsnr: Cb color component: 61.29 dB
pnmpsnr: Cr color component: 57.68 dB

What this means is that this implementation is now a great deal more faithful to the original than the "reference" IFAST implementation, and nearly as good as the reference floating point implementation.  Decode throughput is on the order of 20% less than the hand-coded assembly version, again because GCC does a less than ideal job of allocating NEON registers.  On the other hand, GCC is able to adjust the instruction scheduling for the target core, add preload instructions, and so forth; so I tend to prefer the C-with-intrinsics version over assembly.  (This is in no way a criticism of the implementer(s) of the assembly version; I couldn't have gotten here without it!)

The patch at https://github.com/mkedwards/crosstool-ng/blob/master/patches/libjpeg-turbo/trunk/0001-Implement-jsimd_idct_ifast-using-NEON-intrinsics.patch has been updated.
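
(For context, the figure of merit above is just the peak signal-to-noise ratio over each color component. A minimal sketch of the computation for 8-bit samples follows; pnmpsnr itself works per component on maxval-scaled PNM data:)

    #include <math.h>
    #include <stddef.h>

    /* PSNR in dB between two 8-bit sample arrays of length n. */
    double psnr (const unsigned char *a, const unsigned char *b, size_t n)
    {
      double mse = 0.0;
      size_t i;

      for (i = 0; i < n; i++) {
        double d = (double)a[i] - (double)b[i];
        mse += d * d;
      }
      mse /= (double)n;
      if (mse == 0.0)
        return INFINITY;  /* identical images; pnmpsnr prints "doesn't differ" */
      return 10.0 * log10((255.0 * 255.0) / mse);
    }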
(In reply to comment #35)
> Released GCC versions are not yet very good at optimizing the NEON register
> allocation and load/store scheduling, but work on this is taking place as
> part of the Linaro toolchain effort.  (Even with the rather awful register
> management in 4.5.x, my preliminary tjbench runs show JPEG decompression
> performance within about 10% of the assembly version.)

You seem to be implying here that the 10% (or 20% with the latest revision of your patch) worse overall jpeg decoding throughput measured by tjbench is not a big deal. But just have a look at the last profiling report from 
    http://cgit.freedesktop.org/~siamashka/libjpeg-turbo/commit/?h=sent/20110422-arm-neon-try1&id=cef39af8622fe26e7e452d138d9042755a7a2fd4

For decoding typical jpeg images taken with a phone camera, the contribution of the NEON optimized idct to the CPU usage is only about ~18-19%. Now guess how much slower the idct function would have to become to cause a 10-20% slowdown overall: roughly 1.5-2x, since it accounts for less than a fifth of the total time.

> GCC 4.6 patch here, to work around ICEs when scheduling bulk spill/reload of
> NEON registers to/from stack: 
> https://github.com/mkedwards/crosstool-ng/blob/master/patches/gcc/linaro-4.6-
> bzr/1004-reload-large-vectors.patch

So which is it? You have a patch for gcc 4.6, but still did benchmarks with gcc 4.5, which has "awful register management" as you said. Is gcc 4.6 already good enough performance wise or not?

(In reply to comment #41)
> What this means is that this implementation is now a great deal more
> faithful to the original than the "reference" IFAST implementation, and
> nearly as good as the reference floating point implementation.

If you want to introduce a new idct or any other great improvements, please submit patches to the upstream libjpeg-turbo project:
    https://sourceforge.net/projects/libjpeg-turbo/    

I doubt that Mozilla is going to take any custom modifications not approved by upstream, because this obviously adds a great deal of extra maintenance burden for them. And if you read all the comments here, you might have noticed that Mozilla is quite passive about taking new jpeg performance improvements. So I can hardly imagine them being particularly excited to use your patch.

> On the other hand, GCC is able to adjust the instruction scheduling for the
> target core, add preload instructions, and so forth; so I tend to prefer the
> C-with-intrinsics version over assembly.

Let me ask you straight: did you convince some manager of this GCC instruction-scheduling goodness, and do you now have a work assignment to "replace assembly with compiler-scheduled intrinsics" or something like that? I can't find any other good reason why you would persist in pursuing this goal even though the odds are clearly against you.

I already provided you with the link to my report in the gcc Bugzilla when we were discussing the same thing on the Beagleboard mailing list:
    http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
The point is that the test case from that gcc bug is very simple code, which does some trivial arithmetic operations in place on the data from an array. This is, to some extent, a simplified model of what an idct does. And as long as gcc fails to optimize that trivial code snippet, it can't do any better on an idct either.

I don't have any prejudice against gcc. Making a good optimizing compiler is an extremely hard task, and the gcc developers are very talented people. I'm glad that Linaro is sponsoring their efforts, which should clearly bring some benefits in the long run. But currently their focus seems to be a bit more academic than practical.
Michael,

I don't think we're quite ready to take code which is slower than an alternative we have, requires compiler support which isn't quite done yet, and doesn't produce entirely correct results.

I think this is an interesting project, though, especially inasmuch as we can get GCC to a point where NEON intrinsics are usable.  Perhaps you could file a new bug for your intrinsic IDCT work so we can track its progress.  If you do, please cc me.
(In reply to comment #43)
> Michael,
> 
> I don't think we're quite ready to take code which is slower than an
> alternative we have, requires compiler support which isn't quite done yet,
> and doesn't produce entirely correct results.

Oh no, I wouldn't suggest that you take my code at this stage.  It's really just meant as a heads-up, in case for some reason it would be easier to get the compiler-intrinsic version to converge under iOS than the hand-coded assembly version.  But I'm not sure what you mean by "doesn't produce entirely correct results".  The metrics available to me suggest that this code decodes JPEGs much more accurately than the assembly version (or the reference integer iDCT), simply because it makes use of the extra two bits' precision and rounding capabilities available more or less for free in the NEON instruction set.  It would be trivial to adjust the assembly implementation to match, and thus to bring its decode accuracy nearly up to the "top-quality" floating point version.

> I think this is an interesting project, though, especially inasmuch as we
> can get GCC to a point where NEON intrinsics are usable.  Perhaps you could
> file a new bug for your intrinsic IDCT work so we can track its progress. 
> If you do, please cc me.

I'll do that once I've put a little more effort into testing with a broad corpus of images.  It's really more interesting to me as a compiler stress test than for the sake of JPEG decoding as such.  (Our shipping product has been getting by just fine on the performance of the jpeg6b version.)
From the point of view of libjpeg-turbo, at the moment we still use jpeg-6b as a yardstick for accuracy, since we're trying to bill ourselves as a faster version of libjpeg but also as a drop-in replacement for it. I feel like we can't go off and implement accuracy enhancements or other modifications that would break mathematical compatibility with libjpeg at this point. No one really knows what's going to happen with libjpeg, since it isn't exactly being developed in the open. However, I'm steeling myself for the possibility that there will be a libjpeg v9, it very well may break API/ABI compatibility like v7 and v8 did, and it may garner enough adoption that LJT has to at least support the interface and retain drop-in compatibility.

Barring any of that as a concern, I would still reject accuracy enhancements that affected only one platform and not others. All platforms have to produce mathematically identical results.

As far as iOS, we seem to have reached a state in the LJT trunk where both iOS and Android developers are able to build and use it (there's an entry in our patch tracker on Sourceforge with much more info on this.) Since I currently have no way of testing either, I'm relying on other developers to collaborate between the iOS and Android camps and tell me when the code works in both. I don't want to have to re-visit that process unless there is a compelling performance reason to do so.

The other thing to be aware of is that the iOS build platform seems to be rather particular; it's somewhat doubtful that you could use bleeding-edge versions of GCC with it. I would say that once the necessary enhancements are a regular part of the standard GCC release that Apple has adopted, and performance is equal to the GAS code, then it would be the right time to entertain a move from GAS to intrinsics.
(In reply to comment #42)
> (In reply to comment #35)
> > Released GCC versions are not yet very good at optimizing the NEON register
> > allocation and load/store scheduling, but work on this is taking place as
> > part of the Linaro toolchain effort.  (Even with the rather awful register
> > management in 4.5.x, my preliminary tjbench runs show JPEG decompression
> > performance within about 10% of the assembly version.)
> 
> You seem to be implying here that the 10% (or 20% with the latest revision
> of your patch) worse overall jpeg decoding throughput measured by tjbench is
> not a big deal. But just have a look at the last profiling report from 
>    
> http://cgit.freedesktop.org/~siamashka/libjpeg-turbo/commit/?h=sent/20110422-
> arm-neon-try1&id=cef39af8622fe26e7e452d138d9042755a7a2fd4
> 
> For decoding typical jpeg images taken with a phone camera, the contribution
> of the NEON optimized idct to the CPU usage is only about ~18-19%. Now guess
> how much slower the idct function would have to become to cause a 10-20%
> slowdown overall?

Right: once the iDCT has been vectorized, it no longer dominates the cost of decoding a JPEG.  (It's now mostly in the Huffman decode, which doesn't vectorize easily.)  So I don't really care if converting the iDCT to NEON intrinsics makes some microbenchmark take twice as long -- as long as it meets an accuracy bar, works on all the platforms I need it to, and makes efficient use of available bandwidth to DRAM.  That's where cache-line-sized loads and core-appropriate PLD scheduling come in; and I like having the compiler's help with these.

> > GCC 4.6 patch here, to work around ICEs when scheduling bulk spill/reload of
> > NEON registers to/from stack: 
> > https://github.com/mkedwards/crosstool-ng/blob/master/patches/gcc/linaro-4.6-
> > bzr/1004-reload-large-vectors.patch
> 
> So which is it? You have a patch for gcc 4.6, but still did benchmarks with
> gcc 4.5, which has "awful register management" as you said. Is gcc 4.6
> already good enough performance wise or not?

GCC 4.5.x was already more than good enough for me, performance-wise; the code it generates is ugly but mostly seems to fiddle with registers while waiting for a memory transaction to complete.  Linaro's variant of GCC 4.6.x is better -- not so much on the targets that anyone else cares about, but on my weird neon-d16-fp16 model, which would be totally unsuitable for an official Mozilla build.  I offer the patch against Linaro 4.6 just in case anyone else is benchmarking with it.

> (In reply to comment #41)
> > What this means is that this implementation is now a great deal more
> > faithful to the original than the "reference" IFAST implementation, and
> > nearly as good as the reference floating point implementation.
> 
> If you want to introduce a new idct or any other great improvements, please
> submit patches to the upstream libjpeg-turbo project:
>     https://sourceforge.net/projects/libjpeg-turbo/    

Maybe I'll do that, if and when it reaches a level of maturity that would make it interesting to upstream.  I don't see it as a "great improvement", just as an experiment that might or might not prove worthwhile.

> I doubt that Mozilla is going to take any custom modifications not approved
> by upstream, because this obviously adds a great deal of extra maintenance
> burden for them. And if you read all the comments here, you might have
> noticed that Mozilla is quite passive about taking new jpeg performance
> improvements. So I can hardly imagine them being particularly excited to use
> your patch.

I don't think they should be excited.  Frankly, this bug's history shows that they've been more receptive to performance-centric changes in unreleased code than I would be in their position.  But I thought some of the subscribers to this bug might be interested to know about a third option besides "slow" and "assembly", and might even be interested in doing their own benchmarks -- whether as a preview of something that might be suitable for merge in the future, or as a figure of merit for a compiler's manipulation of NEON registers and vectorized datatypes.  And I also thought I'd get some good suggestions about how to test the implementation -- which I did, from DRC and Joe Drew.

> > On the other hand, GCC is able to adjust the instruction scheduling for the
> > target core, add preload instructions, and so forth; so I tend to prefer the
> > C-with-intrinsics version over assembly.
> 
> Let me ask you straight: did you convince some manager of this GCC
> instruction-scheduling goodness, and do you now have a work assignment to
> "replace assembly with compiler-scheduled intrinsics" or something like
> that? I can't find any other good reason why you would persist in pursuing
> this goal even though the odds are clearly against you.

<snicker>  That's not really the sort of relationship I have to my management.  They trust me to spend most of my time putting rabbits into hats so that I can pull them out later.  I became interested in evaluating the state of NEON intrinsics in GCC for my own reasons, and picked libjpeg-turbo's iDCT more or less at random as an exercise in translation.  If posting a few comments to a mailing list and a Bugzilla is your idea of persistence, you haven't seen me when I *really* get a bee in my bonnet.

> I already provided you with the link to my report in the gcc Bugzilla when
> we were discussing the same thing on the Beagleboard mailing list:
>     http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
> The point is that the test case from that gcc bug is very simple code, which
> does some trivial arithmetic operations in place on the data from an array.
> This is, to some extent, a simplified model of what an idct does. And as long
> as gcc fails to optimize that trivial code snippet, it can't do any better on
> an idct either.

I don't really care whether it "optimizes" the generated assembly to your standards.  I care whether it generates correct code, and does a tight enough job on the arithmetic and stack churning to saturate what little DRAM throughput is left over when all the other cores on the SoC (video capture/encode/decode/display, GPU, audio) are doing their thing.  And while that's a quite difficult thing to measure in a microbenchmark, a library that decodes "99%" 4:4:4 JPEGs at 7+ Mpixel/sec is way beyond any need my system has or is likely to develop.

> I don't have any prejudice against gcc. Making a good optimizing compiler is
> an extremely hard task, and the gcc developers are very talented people. I'm
> glad that Linaro is sponsoring their efforts, which should clearly bring some
> benefits in the long run. But currently their focus seems to be a bit more
> academic than practical.

"Academic"?  What a strange thing to say.  I'm not directly involved with Linaro, merely a fellow-traveler; but it's one of the most impressively well organized software development projects that I've seen first-hand, and as far as I can tell all of the companies that matter in the ARM Linux ecosystem are on board.  (Not that all of the important work in that sphere is company-sponsored -- far from it -- but you can't call that "academic".)

Let me repeat: I really appreciate having had an assembly implementation (yours, I think) to start from.  I don't necessarily consider what I've done to be an improvement for anyone's purposes but my own.  (You might, however, consider adopting some of the accuracy-oriented changes to constants and rounding; they should be trivial to backport.)  I don't intend to push this on anyone.  I've already gotten what I wanted out of the interaction, but would be more than happy to support others in making whatever use of this code pleases them.  Feel free to ping me at m.k.edwards@gmail.com or michaedw@cisco.com, or catch me on #linaro (where I can often be found learning and/or kibitzing).

Cheers,
- Michael
(In reply to comment #45)
> From the point of view of libjpeg-turbo, at the moment we still use jpeg-6b
> as a yardstick for accuracy, since we're trying to bill ourselves as a
> faster version of libjpeg but also as a drop-in replacement for it. I feel
> like we can't go off and implement accuracy enhancements or other
> modifications that would break mathematical compatibility with libjpeg at
> this point. No one really knows what's going to happen with libjpeg, since
> it isn't exactly being developed in the open. However, I'm steeling myself
> for the possibility that there will be a libjpeg v9, it very well may break
> API/ABI compatibility like v7 and v8 did, and it may garner enough adoption
> that LJT has to at least support the interface and retain drop-in
> compatibility.

That seems like a reasonable stance.  However, I personally am not too worried about compatibility with post-6b libjpeg.  This does not sound to me like a probable source of major innovations:  http://usenetmessages.com/view.php?c=computer&g=4404&id=460029&p=10 .  But I suppose stranger things have happened ....

> Barring any of that as a concern, I would still reject accuracy enhancements
> that affected only one platform and not others. All platforms have to
> produce mathematically identical results.

That is one approach; but it doesn't seem likely to produce the best accuracy/speed trade-offs on any platform.  In the long run, I would like to see a testsuite that uses some computable proxy for perceptual quality, rather than byte-for-byte comparison against a "reference" decoded image, artifacts and all.  I know it would be a lot of work to produce such a testsuite; but since you now seem to be the custodian of the preferred fork of libjpeg, that work might be something for which sponsorship could be found.

> As far as iOS, we seem to have reached a state in the LJT trunk where both
> iOS and Android developers are able to build and use it (there's an entry in
> our patch tracker on Sourceforge with much more info on this.) Since I
> currently have no way of testing either, I'm relying on other developers to
> collaborate between the iOS and Android camps and tell me when the code
> works in both. I don't want to have to re-visit that process unless there is
> a compelling performance reason to do so.

Fair enough.  I can't spare the attention to test on iOS at the moment myself.  Maybe once I've translated the colorspace conversion and the 2x2 and 4x4 iDCT; and I thought I might take a shot at the forward DCT as well, to see whether it noticeably improves compression speed.

> The other thing to be aware of is that the iOS build platform seems to be
> rather particular; it's somewhat doubtful that you could use bleeding-edge
> versions of GCC with it. I would say that once the necessary enhancements
> are a regular part of the standard GCC release that Apple has adopted, and
> performance is equal to the GAS code, then it would be the right time to
> entertain a move from GAS to intrinsics.

Apple doesn't track FSF GCC; they're in the LLVM camp.  Support for the NEON intrinsics appears to be in good shape upstream:  http://blog.llvm.org/2010/04/arm-advanced-simd-neon-intrinsics-and.html .  I've been reasonably careful to treat the intrinsics as strongly typed, using vreinterpret* and all that, so the code should compile under LLVM with no more than trivial touch-ups -- apart from the one bit of assembly needed to access VSWP, which uses the undocumented %e/%f selectors for access to the lower/upper dwords of quadword-sized parameters.  This may need massaging or replacement for iOS.
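
(For concreteness, the construct in question looks roughly like this; an illustrative sketch using GCC-style inline assembly, where %e0/%f0 name the low/high D halves of the Q register holding operand 0:)

    #include <arm_neon.h>

    /* Swap the low and high 64-bit halves of a Q register in place. */
    static inline int16x8_t swap_d_halves (int16x8_t v)
    {
      __asm__ ("vswp %e0, %f0" : "+w" (v));
      return v;
    }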
> That seems like a reasonable stance.  However, I personally am not too
> worried about compatibility with post-6b libjpeg.  This does not sound to me
> like a probable source of major innovations: 
> http://usenetmessages.com/view.php?c=computer&g=4404&id=460029&p=10 .  But I
> suppose stranger things have happened ....

Wow.  I had not read that thread before, but I was aware in general of the political situation vis-a-vis the Independent JPEG Group (or, should I say, the Independent JPEG Guido.)  The basic problem, as I see it, is that JPEG and libjpeg are both deeply entrenched, so users really expect them to improve via evolution, not revolution.


> That is one approach; but it doesn't seem likely to produce the best
> accuracy/speed trade-offs on any platform.  In the long run, I would like to
> see a testsuite that uses some computable proxy for perceptual quality,
> rather than byte-for-byte comparison against a "reference" decoded image,
> artifacts and all.  I know it would be a lot of work to produce such a
> testsuite; but since you now seem to be the custodian of the preferred fork
> of libjpeg, that work might be something for which sponsorship could be
> found.

I've done many such studies.  Perceptual quality is important to me, because three of the primary consumers of libjpeg-turbo are VirtualGL, TurboVNC, and TigerVNC, three projects which use high-speed JPEG as a way to deliver rendered 3D images remotely.  These programs are used, for instance, by medical imaging professionals as diagnostic tools, so they are very keen on getting perceptual losslessness.  The thing is, though, that in most cases perceptual losslessness can be achieved quite handily with the unmodified libjpeg v6b using a sufficiently high image quality (90 and above, but we set the defaults in VirtualGL and TurboVNC to 95 to be safe) and no chroma subsampling.  This is using the fast forward DCT.  We don't need to use the slow integer (ISLOW) forward DCT until the quality gets above 95.

I'm not saying that it wouldn't be nice to have better image quality in general.  If someone came up with a DCT across all platforms that performed as well as IFAST but had image quality as good as ISLOW, I'd certainly use it in my own software, but I think we'd still have to provide the legacy methods for backward compatibility, at least for a while.  libjpeg is such an old codebase that people just expect IFAST and ISLOW to behave a certain way.  It also goes without saying that such a new DCT method would need corresponding SIMD routines in order to be of much use in libjpeg-turbo.  At the moment, we are able to work within the limitations of the existing JPEG standard and the existing libjpeg behavior to produce perceptually lossless images when we need them, so quality enhancements which fundamentally change the mathematical output of libjpeg are really not a priority for me, and there doesn't seem to be any motivation in the libjpeg-turbo community around doing that, either.

When I added libjpeg v7/v8 compatibility, I wasn't able to add the new, supposedly more accurate DCT/iDCT routines from libjpeg v7/v8, because those would break mathematical compatibility with v6b (not to mention require re-written SIMD routines.)  So far, no one has seemed to care.  People just wanted v7/v8 compatibility because some software had already moved to the latest IJG code, so there was software out there that relied on the new ABI.

I don't know that we're the "preferred" fork of libjpeg.  At the end of the day, I'm not really a JPEG expert and don't claim to be.  What I am is a computer architecture and performance expert.  Thus, I'm not in a very good position to take libjpeg-turbo in directions that diverge too much from libjpeg.  I still see us as tracking the "upstream" IJG code.  However, that could definitely change in the future.  It just takes having some money thrown at the project and some additional developers on board who are more steeped in the algorithms than I am ... both of those things would be a product of community demand for such improvements, and IMHO, such demand needs a "killer app" to spark it.
You probably don't want to track Guido's changes to iDCT.  The scalings from 8x8 to 4x4 / 2x2 aren't "more accurate"; they are faster, but only because they're *wrong*.  (I say this based on code inspection rather than measurement; we'll see whether I understand the math.)

The revised downscale routines in jpeg8c drop all components of order higher than the target block size, which a correct downsampler shouldn't.  You won't see it in most images, because they've already lost those coefficients in quantization.  However, high-quality images containing macroblocks with significant power in those higher order terms -- specifically, for both 1/2 and 1/4 scaling, the 5th and 7th order terms -- will suffer.  (The 1/2 scale also drops the 6th order term, and the 1/4 scale also drops the 3rd order term, both incorrectly.  The terms that can safely be dropped -- and should be to reduce the risk of overflow in intermediate results -- are limited to order 4 when 1/2 scaling and orders 2, 4, and 6 when 1/4 scaling.)

If you want to run some comparisons, I suggest you take some PPMs with lots of texture, crop to a multiple of 8 in each dimension, and run them through four different versions of 1/2 rescale:  pnmscale, jpeg6b decode with float, jpeg6b decode with ifast, and jpeg8c decode with ifast.  Encode with jpeg6b float at 99% quality, decode with -scale 1/2.  Compare visually and using pnmpsnr, with the pnmscale output as your reference image.  You should see the resampling problem easily; look for places where there is detail on top of an overall gradient.  The gradient has power in all the odd terms, and the texture moves more of it from the 1st-order terms into the higher odd orders.

The results of the above tests with current libjpeg-turbo should, as I understand it, be identical to jpeg6b.  (The SIMD rescaling implementations in libjpeg-turbo's jsimd_arm_neon.S look plausible to me, although I haven't tested them in detail.)  Do it again with -scale 1/4, and you should see even more artifacts in the jpeg8c decode, which will look sort of like bad deinterlacing along both axes.  That's because the 3rd order term contributes the most to gradient (after the 1st order term, of course), and almost always survives quantization.

As another comparison, I updated my "NEON intrinsics" patch to add the 2x2 and 4x4 methods, with essentially the same iDCT arithmetic as the 8x8 patch and the resampling done explicitly afterward.  (I made a point of zeroing out the coefficients that *can* be dropped; there is no multiply that can correctly be omitted in the 4x4 case, and only one in the 2x2 case, which I didn't bother to optimize out.)  If you do run some tests, you might try this as well, and let me know how it compares to your eye; it should be similar to the jpeg6b output, possibly a little cleaner if I've managed the rounding errors properly.  (I'm mostly tuning with pnmpsnr for now, but that is of course not a real perceptual measure.)

I'm testing, and will push soon, a further update to that patch to add NEON intrinsic implementations of the decode-side color space conversions.  After that, I'll add the encode-side color space conversions and the fdct, and then maybe concoct a better metric of decode quality than pnmpsnr alone.  I'll probably run an FFT over pnmdiff output and look for power in individual frequency bins, as well as some rank tests on absolute pixel differences (so I catch bad range clamping during color space conversion, etc.).
Thanks for the explanation.  Yes, libjpeg-turbo should currently produce mathematically identical results to jpeg-6b on all platforms.  If it doesn't, I consider that to be a bug.  Your explanation further confirms in my mind that maintaining this mathematical compatibility is the right approach for now.  I didn't have any plans to implement the new DCT/iDCT algorithms in jpeg-7/8, mainly because it would be a royal PITA (it would require developing a whole new set of SIMD DCT routines and using the new routines only when jpeg-7/8 emulation was enabled.)  However, if they're less accurate, I have no desire to mess with them.
I pushed a couple of additional patches: one that does color space conversion with NEON intrinsics, and one that plumbs through the additional 2 bits of fractional precision that I think the NEON iDCT allows (due to careful use of saturating and sign-extending arithmetic).
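
(Illustrative editorial sketch, not from the patch: NEON's rounding, saturating narrow is what makes carrying extra fraction bits essentially free, because the final descale is a single instruction.)

    #include <arm_neon.h>

    /* Narrow 32-bit accumulators to 16 bits with rounding and saturation;
       the shift count absorbs however many fraction bits were carried. */
    static inline int16x4_t descale (int32x4_t acc)
    {
      return vqrshrn_n_s32(acc, 13);
    }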

As an initial trial, I chose a black-and-white image with interesting detail more or less at random (http://www.flickr.com/photos/dirkbee/5799351325/sizes/o/in/pool-16978849@N00/), converted it to PPM with jpegtopnm, and re-compressed it with "./cjpeg -dct float -quality 99 -outfile ir.jpg ir.ppm", just to make sure I was dealing with a vanilla JFIF.  The difference in decoding accuracy on the resulting JPEG (irasm = original IFAST assembly, irfast = my patch, irint = ISLOW, irfloat = float) is pretty dramatic:

build@ctbu-bld5:~/CTBU/MonoLake-foreign/opensource-core$ pnmpsnr ir.ppm irfast.ppm 
pnmpsnr: PSNR between ir.ppm and irfast.ppm:
pnmpsnr: Y  color component: 54.67 dB
pnmpsnr: Cb color component  doesn't differ.
pnmpsnr: Cr color component doesn't differ.

build@ctbu-bld5:~/CTBU/MonoLake-foreign/opensource-core$ pnmpsnr ir.ppm irasm.ppm 
pnmpsnr: PSNR between ir.ppm and irasm.ppm:
pnmpsnr: Y  color component: 44.22 dB
pnmpsnr: Cb color component  doesn't differ.
pnmpsnr: Cr color component doesn't differ.

build@ctbu-bld5:~/CTBU/MonoLake-foreign/opensource-core$ pnmpsnr ir.ppm irint.ppm 
pnmpsnr: PSNR between ir.ppm and irint.ppm:
pnmpsnr: Y  color component: 57.88 dB
pnmpsnr: Cb color component  doesn't differ.
pnmpsnr: Cr color component doesn't differ.

build@ctbu-bld5:~/CTBU/MonoLake-foreign/opensource-core$ pnmpsnr ir.ppm irfloat.ppm 
pnmpsnr: PSNR between ir.ppm and irfloat.ppm:
pnmpsnr: Y  color component: 57.17 dB
pnmpsnr: Cb color component  doesn't differ.
pnmpsnr: Cr color component doesn't differ.

The differences between the old IFAST and the other three are quite noticeable; I can't tell the other three apart.  Difference images generated with pnmdiff tell the story: the old IFAST version shows 8x8 artifact blocks tracing all the high-contrast edges in the image.  There are some visible differences along the same trails in the "new IFAST" diff, but they don't have the conspicuous block-pattern structure, and they're a lot fainter (10x fainter, according to the headline numbers from pnmpsnr).  At least on casual inspection, I can't spot anything at all in the pnmdiff output for ISLOW or float.

I will play with this some more, to improve my confidence in the "safety" of those extra two bits of fractional precision.  If it holds up, I might actually go to the trouble of getting someone with domain knowledge in this area to review it.  (I still know a few people from my astronomy days ...)
Very interesting.  The major problem still is that it's taking advantage of a platform-specific feature to improve accuracy.  I'd entertain such accuracy improvements upstream, but only if they were available on all platforms.  That would mean that the x86 IFAST inverse DCT SIMD code would have to be similarly enhanced, and I don't know that that's possible, because that code is almost all 16-bit precision (it operates on eight 16-bit words at a time using the 128-bit SSE2 registers.)

Personally, I don't ever use the IFAST inverse DCT.  The projects I work on that are consumers of libjpeg-turbo (TigerVNC, TurboVNC, VirtualGL) all use the IFAST forward DCT for quality<=95, the ISLOW forward DCT for quality>=96 (IFAST degrades rather severely at these high quality levels), and the ISLOW inverse DCT.  The IFAST forward DCT provides enough of a speedup (8-10%), and the quality difference is negligible enough, for lower quality levels to make it worthwhile relative to ISLOW.  That 8-10% isn't as critical when decoding, because we're dealing with an image pipeline in which the decoding step is almost always much faster than the encoding step anyhow.

I guess what I'm driving at is this-- your code is, by your own admission, slower than our existing NEON IFAST inverse DCT code.  I wonder aloud whether we can achieve the same performance and quality as your enhancements by simply implementing the ISLOW inverse DCT using NEON instructions.  The latter would definitely be something I would want to integrate upstream.  I could probably even throw it together myself if financial sponsorship and ARM test equipment were provided.
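For the curious, the reason a NEON ISLOW is more work than NEON IFAST: jpeg-6b's ISLOW does its butterflies in 32-bit precision and descales by CONST_BITS, so a NEON version has to widen and narrow around every rotation.  A minimal sketch of that pattern, with illustrative names (not an actual routine from any patch):

#include <arm_neon.h>

#define CONST_BITS 13   /* jpeg-6b's fixed-point scaling for ISLOW */

/* One ISLOW-style rotation: acc = a*c0 + b*c1 in 32 bits, then a
 * rounding narrow back to 16 bits.  IFAST stays in 16 bits
 * throughout, which is why it's faster and less accurate. */
static inline int16x4_t rotate_descale(int16x4_t a, int16x4_t b,
                                       int16_t c0, int16_t c1)
{
    int32x4_t acc = vmull_n_s16(a, c0);     /* widen: 16x16 -> 32  */
    acc = vmlal_n_s16(acc, b, c1);          /* accumulate b*c1     */
    return vrshrn_n_s32(acc, CONST_BITS);   /* round, narrow to 16 */
}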
If we want to be done with this bug and with the related bug 489148 eventually (and hopefully sooner), then I think that:
1. Mozilla people need to provide the list of libjpeg-turbo functionality they want to have optimized for ARM NEON. Just clearly confirming that ISLOW inverse DCT is needed would be the first step. Maybe something else might be needed too.
2. DRC needs to confirm which features still might be required or are blocking libjpeg-turbo 1.2.0 beta formal release.

And if we match (1) against (2), then we'll have a clear plan for moving forward.

Formal release of libjpeg-turbo will encourage a lot more people to test it and take it into use. Also, my understanding is that Mozilla is only going to take stable releases of libjpeg-turbo into its source tree. So this bug can't be resolved until a stable libjpeg-turbo version with ARM NEON optimizations gets released.
In terms of features, ARM support is the only thing that is really blocking 1.2 beta at this point.  If Mozilla aims to be a consumer of it, then I want to make sure that the necessary features are implemented so that it will be as easy to integrate as possible.

There are other players, however.  A group of developers is investigating the possibility of merging libjpeg-turbo into Android, for instance.

I tend to put out betas whenever I feel like the new features have more or less gelled.  I don't get that sense yet with this stuff.  It needs more eyes looking at it.
(In reply to comment 53)

> 1. Mozilla people need to provide the list of libjpeg-turbo functionality they
> want to have optimized for ARM NEON. Just clearly confirming that ISLOW inverse
> DCT is needed would be the first step. Maybe something else might be needed
> too.

AFAICT, we currently use ISLOW and only ISLOW [1].  I don't think we'd object to using a different method on principle, but we'd have to adjust some of our tests.

[1] http://mxr.mozilla.org/mozilla-central/search?string=JDCT_
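For concreteness, that choice is just the dct_method field on the (de)compression struct.  A minimal sketch, not Mozilla's actual code:

#include <stdio.h>
#include <jpeglib.h>

/* Select the inverse DCT algorithm before jpeg_start_decompress().
 * JDCT_ISLOW is what mozilla-central requests today; JDCT_IFAST is
 * the method the NEON code in this bug accelerates. */
void select_idct(j_decompress_ptr cinfo)
{
    cinfo->dct_method = JDCT_ISLOW;   /* or JDCT_IFAST / JDCT_FLOAT */
}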
(In reply to comment #55)
> (In reply to comment 53)
> 
> > 1. Mozilla people need to provide the list of libjpeg-turbo functionality they
> > want to have optimized for ARM NEON. Just clearly confirming that ISLOW inverse
> > DCT is needed would be the first step. Maybe something else might be needed
> > too.
> 
> AFAICT, we currently use ISLOW and only ISLOW [1].  I don't think we'd
> object to using a different method on principle, but we'd have to adjust
> some of our tests.
> 
> [1] http://mxr.mozilla.org/mozilla-central/search?string=JDCT_

In fact we have bug 649020 about switching to IFAST on mobile. We just need to resolve the test issues and that can land. So no, we don't really have any short-term need for an ISLOW NEON implementation.
(In reply to comment #52)
> Very interesting.  The major problem still is that it's taking advantage of
> a platform-specific feature to improve accuracy.  I'd entertain such
> accuracy improvements upstream, but only if they were available on all
> platforms.  That would mean that the x86 IFAST inverse DCT SIMD code would
> have to be similarly enhanced, and I don't know that that's possible,
> because that code is almost all 16-bit precision (it operates on eight
> 16-bit words at a time using the 128-bit SSE2 registers.)

There's nothing particularly platform-specific about this implementation; it's all bog-standard integer SIMD operations, though it does use NEON niceties (like saturating/rounding shifts and multiplies) where they're available.  The DCT arithmetic is still all done with 16-bit precision, except for a couple of the accumulation stages in the 8x8->4x4 and 8x8->2x2 downsampling iDCTs, where I used the "pairing add" instructions for convenience.  There's no reason why the x86 implementation couldn't be adjusted to match.  I might even take a look at it myself, though probably not this month.
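The "pairing add" mentioned above, in a minimal sketch (illustrative, not lifted from the patch):

#include <arm_neon.h>

/* Fold each adjacent pair of 16-bit partial sums into a widened
 * 32-bit lane in one instruction.  Handy when collapsing 8x8
 * coefficient data toward 4x4 or 2x2 output, since the pairwise
 * sums cannot overflow. */
static inline int32x4_t fold_adjacent_pairs(int16x8_t partial)
{
    return vpaddlq_s16(partial);   /* {a0+a1, a2+a3, a4+a5, a6+a7} */
}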

> Personally, I don't ever use the IFAST inverse DCT.  The projects I work on
> that are consumers of libjpeg-turbo (TigerVNC, TurboVNC, VirtualGL) all use
> the IFAST forward DCT for quality<=95, the ISLOW forward DCT for quality>=96
> (IFAST degrades rather severely at these high quality levels), and the ISLOW
> inverse DCT.  The IFAST forward DCT provides enough of a speedup (8-10%),
> and the quality difference is negligible enough, for lower quality levels to
> make it worthwhile relative to ISLOW.  That 8-10% isn't as critical when
> decoding, because we're dealing with an image pipeline in which the decoding
> step is almost always much faster than the encoding step anyhow.

I have now also implemented the forward DCT, sample conversion, and quantization using similar techniques.  The result compresses about 10% faster than the current IFAST -- about 4 Mpixel/sec at quality 99 with 4:4:4 sampling, or just over half the throughput of IFAST decompression -- and appears to be essentially indistinguishable from the *floating-point* implementation for a round-trip to JPEG and back at that quality, at least on the images where I have tried it:

\u@\h:\w\$ ./cjpeg -dct fast -sample 1x1 -quality 99 -outfile matryoshki-fast.jpg matryoshki-float.ppm
\u@\h:\w\$ ./djpeg -dct fast -ppm -outfile matryoshki-fastfast.ppm matryoshki-fast.jpg
\u@\h:\w\$ ./cjpeg -dct float -sample 1x1 -quality 99 -outfile matryoshki-float.jpg matryoshki-float.ppm
\u@\h:\w\$ ./djpeg -dct float -ppm -outfile matryoshki-floatfloat.ppm matryoshki-float.jpg

build@ctbu-bld5:~/CTBU/MonoLake-foreign/opensource-core$ pnmpsnr matryoshki-fastfast.ppm matryoshki-float.ppm
pnmpsnr: PSNR between matryoshki-fastfast.ppm and matryoshki-float.ppm:
pnmpsnr: Y  color component: 54.23 dB
pnmpsnr: Cb color component: 53.28 dB
pnmpsnr: Cr color component: 53.11 dB

build@ctbu-bld5:~/CTBU/MonoLake-foreign/opensource-core$ pnmpsnr matryoshki-floatfloat.ppm matryoshki-float.ppm
pnmpsnr: PSNR between matryoshki-floatfloat.ppm and matryoshki-float.ppm:
pnmpsnr: Y  color component: 54.79 dB
pnmpsnr: Cb color component: 53.75 dB
pnmpsnr: Cr color component: 53.18 dB

build@ctbu-bld5:~/CTBU/MonoLake-foreign/opensource-core$ pnmpsnr matryoshki-floatfloat.ppm matryoshki-fastfast.ppm
pnmpsnr: PSNR between matryoshki-floatfloat.ppm and matryoshki-fastfast.ppm:
pnmpsnr: Y  color component: 53.00 dB
pnmpsnr: Cb color component: 53.85 dB
pnmpsnr: Cr color component: 53.20 dB

54 dB corresponds to an RMS error of about one part in 2^9 (PSNR = 20*log10(255/RMSE), and 20*log10(512) is roughly 54 dB); this may be a difference in rounding strategy, which I'll probably track down at some point.  I have visually inspected the difference image, and it's completely featureless to my eye.  I suspect that the limiting factor on quality -- irrespective of the choice of DCT -- is now the fact that the results of color space conversion are only retained to 8 bits of precision.  Fixing that involves changes to some internal APIs (padding, downsampling, etc.) that operate on color-space-converted buffers.

> I guess what I'm driving at is this-- your code is, by your own admission,
> slower than our existing NEON IFAST inverse DCT code.  I wonder aloud
> whether we can achieve the same performance and quality as your enhancements
> by simply implementing the ISLOW inverse DCT using NEON instructions.  The
> latter would definitely be something I would want to integrate upstream.  I
> could probably even throw it together myself if financial sponsorship and
> ARM test equipment were provided.

I suppose you could rename this from IFAST to ISLOW and call it done.  ;-)  I don't see a lot of point in localized code-bumming for that last 25% of throughput until it's clear that the code is as correct as it can possibly be.  There's no reason why we shouldn't be able to squeeze out an extra 3-6 dB, at least in the luminance channel, by keeping an extra 3 bits of precision in the results of color space conversion.  This might actually speed my compression code up a bit further, because I left-shift by 3 bits during sample conversion in order to retain precision during the forward DCT.
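The left-shift-by-3 sample conversion mentioned above is cheap on NEON, because widening with a shift is a single instruction.  A minimal sketch with my own names:

#include <arm_neon.h>
#include <stdint.h>

/* Load eight 8-bit samples, widen to 16 bits with a left shift of 3,
 * and level-shift by CENTERJSAMPLE (128) in the same scaled domain.
 * The three extra fractional bits ride through the forward DCT and
 * are rounded away at quantization. */
static inline int16x8_t load_samples_x8(const uint8_t *src)
{
    uint8x8_t pix = vld1_u8(src);
    int16x8_t s   = vreinterpretq_s16_u16(vshll_n_u8(pix, 3));
    return vsubq_s16(s, vdupq_n_s16(128 << 3));
}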
(In reply to comment #53)

> Formal release of libjpeg-turbo will encourage a lot more people to test it
> and take into use. Also my understanding is that Mozilla is only going to
> take stable releases of libjpeg-turbo into its source tree. So this bug
> can't be resolved until some stable libjpeg-turbo version with ARM NEON
> optimizations gets released.

I would certainly support release of libjpeg-turbo with Siarhei's assembly version of NEON IFAST.  It's very well executed, and a good base for further development.  Personally, I wouldn't delay 1.2 for Android/iOS support; that's something that can safely be added in a 1.2.1 release.  But it's DRC's call, of course.
1.2.1 would not be a feature release, so only bug fixes can be added to that.  Thus, I want to ensure that we have the necessary set of features to make Mozilla happy in 1.2.

Referring to Comment #57, if your algorithm in fact is as accurate as Float but performs as well as IFAST across a wider set of images, then it may very well eliminate the need for ISLOW and Float in the long term.  I'm not in a position where I can make that decision, though.  It needs to be peer reviewed by JPEG domain experts.

Right now, ISLOW and IFAST need to retain mathematical compatibility with jpeg-6b.  I don't care how the algorithm works behind the scenes, but it must produce the same output as either the existing ISLOW or IFAST algorithms.

This highlights the problem in the JPEG community right now.  Ideally, such an algorithmic enhancement as yours should be submitted to some group for peer review, but IJG definitely isn't that group.  I'm not sure if the libjpeg-turbo project is, either, although I have done quite a bit of research in image quality/performance tradeoffs.

Personally, I'd be open to accepting the more accurate IFAST implementation upstream if:

(a) it was implemented either as a new DCT algorithm (JDCT_INEW, for instance) or as a flag that could modify an existing DCT algorithm so the increased accuracy could be switched on and off.

(b) I could clearly diff it against the existing IFAST implementation to understand how one is based on the other.

I think any discussion of that is outside of the scope of this thread, however, and is probably best moved to the libjpeg-turbo-devel list or somewhere else upstream.
(In reply to comment #59)
> 1.2.1 would not be a feature release, so only bug fixes can be added to
> that.  Thus, I want to ensure that we have the necessary set of features to
> make Mozilla happy in 1.2.

Makes sense.  It just wasn't clear to me whether Android/iOS support fell into the "bug fix" category (it's just Makefile fixes and a tweak to one or two bits of assembly, right?) and/or the "necessary to make Mozilla happy" category.

> Referring to Comment #57, if your algorithm in fact is as accurate as Float
> but performs as well as IFAST across a wider set of images, then it may very
> well eliminate the need for ISLOW and Float in the long term.  I'm not in a
> position where I can make that decision, though.  It needs to be peer
> reviewed by JPEG domain experts.

Oh, absolutely.  I just thought you might like to play with it, since you're interested in both speed and accuracy on the high end of the quality range.  I don't care that much about JPEGs per se, I'm just using it as an exercise in implementing practical algorithms with high precision using NEON intrinsics.

> Right now, ISLOW and IFAST need to retain mathematical compatibility with
> jpeg-6b.  I don't care how the algorithm works behind the scenes, but it
> must produce the same output as either the existing ISLOW or IFAST
> algorithms.

That's fair, certainly as long as we don't have a satisfactory set of quality metrics and a corpus of test images.  You can't eyeball-test every tweak that adds 1 dB of PSNR on some images to make sure it doesn't overflow or generate conspicuous artifacts on others.

> This highlights the problem in the JPEG community right now.  Ideally, such
> an algorithmic enhancement as yours should be submitted to some group for
> peer review, but IJG definitely isn't that group.  I'm not sure if the
> libjpeg-turbo project is, either, although I have done quite a bit of
> research in image quality/performance tradeoffs.

I've done a certain amount of quality evaluation on image processing pipelines myself, back in my astronomy days; but not much centered on perceptual metrics (vs. photometry, astrometry, etc.).  Still, maybe someone in that community could make some referrals.

> Personally, I'd be open to accepting the more accurate IFAST implementation
> upstream if:
> 
> (a) it was implemented either as a new DCT algorithm (JDCT_INEW, for
> instance) or as a flag that could modify an existing DCT algorithm so the
> increased accuracy could be switched on and off.

It can currently be switched off by an environment variable (JSIMD_FORCE_ARM_NEON_ASM=1).  That could easily be adjusted to a flag, etc.

> (b) I could clearly diff it against the existing IFAST implementation to
> understand how one is based on the other.

That's a little trickier, since you more or less have to match up the NEON intrinsics against the assembly equivalents, and read through the ARM ARM (the Architecture Reference Manual) to understand some of the differences (shifting/rounding strategy and such).  I suppose I could write something up in plain English.  Probably not this month, though; I'm pretty heads-down on coding work, or should be.

> I think any discussion of that is outside of the scope of this thread,
> however, and is probably best moved to the libjpeg-turbo-devel list or
> somewhere else upstream.

Sure.  Is that the libjpeg-turbo-devel list on SourceForge (which appears to have one subscriber and no archives), or is it hosted somewhere else?
(In reply to comment #60)
> > (a) it was implemented either as a new DCT algorithm (JDCT_INEW, for
> > instance) or as a flag that could modify an existing DCT algorithm so the
> > increased accuracy could be switched on and off.
> 
> It can currently be switched off by an environment variable
> (JSIMD_FORCE_ARM_NEON_ASM=1).  That could easily be adjusted to a flag, etc.

As I've said before, the new algorithm would need to be implemented on all platforms, not just NEON, to be acceptable upstream.  It also cannot be implemented in such a way that existing application behavior is changed.  There are so many applications out there that rely on libjpeg, and I'm sure that many of them make certain assumptions based on the image quality generated by the current ISLOW and IFAST algorithms.

The new algorithm needs to be something that an application explicitly enables via the API, not an environment variable.

> > (b) I could clearly diff it against the existing IFAST implementation to
> > understand how one is based on the other.
> 
> That's a little trickier, since you more or less have to match up the NEON
> intrinsics against the assembly equivalents, and read through to the ARM ARM
> to understand some of the differences (shifting/rounding strategy and such).
> I suppose I could write something up in plain English.  Probably not this
> month, though; I'm pretty heads-down on coding work, or should be.

What I'm saying is that, to be acceptable upstream, the code would have to be converted to assembly.  Intrinsics are currently not acceptable for a variety of reasons previously discussed.

> Sure.  Is that the libjpeg-turbo-devel list on SourceForge (which appears to
> have one subscriber and no archives), or is it hosted somewhere else?

Yes, on SourceForge.  It's a brand new list.
(In reply to comment #61)
> As I've said before, the new algorithm would need to be implemented on all
> platforms, not just NEON, to be acceptable upstream.  It also cannot be
> implemented in such a way that existing application behavior is changed. 
> There are so many applications out there that rely on libjpeg, and I'm sure
> that many of them make certain assumptions based on the image quality
> generated by the current ISLOW and IFAST algorithms.
> 
> The new algorithm needs to be something that an application explicitly
> enables via the API, not an environment variable.

I'm in agreement with all of the above.  I just meant to describe the current state of the patch, which is that you can easily turn the new code off for comparison purposes.  The "right thing" is probably to make it a new "algorithm", JDCT_INEW or whatever.  And on the platforms where it makes sense, based on quality and speed benchmarks, the default and/or JDCT_FASTEST could map to it.  (Assuming that it can in fact be made "fastest"; but I see no particular reason why it couldn't be, as it is intrinsically no more complex than the current SIMD IFAST, just mildly impaired by so-so register allocation in the compiler.)

> What I'm saying is that, to be acceptable upstream, the code would have to
> be converted to assembly.  Intrinsics are currently not acceptable for a
> variety of reasons previously discussed.

Again, I'm in full agreement; unless and until compilers that do a clean job on intrinsic-heavy code are in wide circulation, it doesn't make sense to incorporate them upstream.  Conversion to assembly shouldn't be hard.  But I personally am not likely to make time for it any time soon, unless someone has cash and/or favors to trade (in which case, please do ping me at m.k.edwards@gmail.com and/or michaedw@cisco.com).  That's why I've published it as a patch; anyone sufficiently skilled and motivated should be able to take it from here.

> > Sure.  Is that the libjpeg-turbo-devel list on SourceForge (which appears to
> > have one subscriber and no archives), or is it hosted somewhere else?
> 
> Yes, on SourceForge.  It's a brand new list.

I'm signed up; discussion not specific to Mozilla can continue there.
(In reply to comment #54)
> In terms of features, ARM support is the only thing that is really blocking
> 1.2 beta at this point.  If Mozilla aims to be a consumer of it, then I want
> to make sure that the necessary features are implemented so that it will be
> as easy to integrate as possible.

Based on comment 55 and comment 56, Mozilla is not sure whether they need anything else. There is still a shade of doubt, though (see comment 29, for instance). I even think that if Mozilla decided that they need a NEON-optimized ISLOW iDCT and set it as a requirement, that would make everything much clearer and easier for everyone.

> There are other players, however.  A group of developers is investigating
> the possibility of merging libjpeg-turbo into Android, for instance.

Is there any visibility into their progress, and any kind of ETA? I mean, if they run into any problems, it would be very nice for those to show up on the libjpeg-turbo bug tracker. And just for the record, I have done some testing of libjpeg-turbo using the Android standalone toolchain:
    http://developer.android.com/sdk/ndk/overview.html
I don't expect any obvious problems there, at least not in the libjpeg-turbo C and assembly sources.

> I tend to put out betas whenever I feel like the new features have more or
> less gelled.  I don't get that sense yet with this stuff.  It needs more
> eyes looking at it.

Could you clarify the source of your concerns? Is it the possibility of bugs or regressions in the new ARM NEON optimized code? Or do you actually want better performance for some of the use cases, which would mean that more optimized functions are needed? These goals are kind of mutually exclusive.

This is really dragging on slowly. At first I expected that Linaro could help at least with the upstream communication and clean ARM NEON code integration (runtime NEON detection, build system tweaks, ensuring compatibility across various systems like iOS and Android, testing and quality assurance), but waiting clearly did not do any good, and in the end I had to handle all of this myself. Also, I initially decided to play it safe and just start with the few basic things needed by Mozilla. I guess that actually helped to minimize the number of tweaks needed for proper iOS support, so that work was a bit easier for Martin Aumüller. But now we need to move on. And I wonder whether it's a good idea to add a lot more NEON optimizations for JPEG encoding/decoding targeting libjpeg-turbo 1.2, or to just handle that in a quick follow-up 1.3 release?

And if it would be of any help and add more confidence, I can probably provide you with SSH access to one of the Cortex-A8 boards running Linux at my home. Just send me your public SSH key if you are interested. But the connection between the U.S. and Europe would be rather laggy.
> Based on comment 55 and comment 56, Mozilla is not sure whether they need 
> anything else. There is some shade of doubt though, like comment 29. I even 
> think that if Mozilla decided that they need NEON optimized ISLOW iDCT and set 
> it as a requirement, that would make everything much more clear and easier for 
> everyone.

If the only NEON routine were ISLOW and it was faster than our current IFAST, I imagine we'd take that.  But we're saying that IFAST would be perfectly fine.

If we were handed a fast NEON IFAST but no NEON ISLOW, I bet we'd get our act together in bug 649020 and turn on IFAST for mobile.  :)

I don't really understand what's not clear about our requirements, but I'm happy to try to clarify, and if I can't, I'll make sure Jeff stops by.
(In reply to comment #63)
> Is there any visibility for their progress and any kind of ETA? I mean if
> they have any problems, these would be very nice to show up on libjpeg-turbo
> bugtracker. And just for the record, I had done some testing of
> libjpeg-turbo using android standalone toolchain:
>     http://developer.android.com/sdk/ndk/overview.html
> I don't expect any obvious problems there, at least not in the libjpeg-turbo
> C and assembly sources.

It's all occurring on a Linaro discussion list, and I'm having trouble figuring out who is doing what.  There is still some confusion there as to how to get things up and running on Android, and I can't provide any assistance, since I've never done it.

> Could you clarity the source of your concerns? Is it the possibility of bugs
> or regressions in the new ARM NEON optimized code? Or you actually want
> better performance for some of the use cases, which means that actually more
> optimized functions are needed? These goals are kind of mutually exclusive.

I need a clear answer from Mozilla as to whether they need a NEON-optimized ISLOW implementation or whether the current IFAST implementation is good enough.  If they need ISLOW, then that needs to be implemented in libjpeg-turbo 1.2.

> And if that would be of any help and add more confidence, I probably can
> provide you with ssh access to one of the Cortex-A8 boards running linux at
> my home. Just send me your public ssh key if you are interested. But the
> connection between the U.S. and Europe would be rather laggy.

I really need a board of my own.  It would be nice if one of the companies involved would also provide financial sponsorship for my efforts, as I am not making a salary to work on this stuff.
(In reply to comment #65)
> I need a clear answer from Mozilla as to whether they need a NEON-optimized
> ISLOW implementation or whether the current IFAST implementation is good
> enough.  If they need ISLOW, then that needs to be implemented in
> libjpeg-turbo 1.2.

We do not need a NEON-optimized ISLOW implementation. The current IFAST implementation is good enough.
We use libjpeg-turbo 1.2 now, which includes ARM NEON optimizations. FIXED?
Indeed.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED