Open Bug 822023 Opened 12 years ago Updated 2 years ago

Speed up jpeg huffman decoding

Categories: Core :: Graphics: ImageLib, defect
Platform: x86 macOS

People: (Reporter: jrmuizel, Assigned: jrmuizel)

References: (Blocks 1 open bug)

Attachments: (2 files, 3 obsolete files)

Attached patch Speed up jpeg huffman decoding (obsolete) — Splinter Review
This takes the approach suggested at http://cbloomrants.blogspot.ca/2010/08/08-12-10-lost-huffman-paper.html and adapts it for libjpeg. It doesn't bring a huge performance improvement, but it is noticeable (from 98mbps to 102mbps).

The patch may still be a little rough, but I'm interested in feedback on it.
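For readers who don't want to dig through the post: the core idea is canonical-Huffman decoding against a left-justified bit buffer, so the code length is found with plain unsigned compares. A minimal sketch, assuming maxcode[l] stores the largest length-l code left-justified and padded with one-bits; all names here are illustrative, not the patch's actual code:

#include <stdint.h>

typedef struct {
  uint64_t maxcode[18];    /* per-length upper bound, left-justified,
                              trailing bits set to 1 (assumed layout) */
  int valoffset[18];       /* first-symbol index minus first code, per length */
  const uint8_t *huffval;  /* symbol values in canonical order */
} huff_tbl;

static int huff_decode(const huff_tbl *tbl,
                       uint64_t *get_buffer, int *bits_left)
{
  int l = 1;
  while (*get_buffer > tbl->maxcode[l])  /* no masks or shifts in the loop */
    l++;
  int code = (int)(*get_buffer >> (64 - l));
  *get_buffer <<= l;                     /* consume the l code bits */
  *bits_left -= l;
  return tbl->huffval[code + tbl->valoffset[l]];
}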
Attachment #692621 - Flags: review?(tterribe)
I'm going to try to get around to running some quick & dirty tests with this patch this week to see if I can reproduce your speedup numbers.  If so, then 5% is compelling enough to look at integrating this upstream.

Huffman encoding/decoding is probably the biggest piece of low-hanging performance fruit in libjpeg-turbo.  One of the biggest issues with it is that there is a large performance gap between 32-bit and 64-bit that I'd really like to close.  This gap is due in part to the fact that, on 64-bit platforms, the Huffman codec has more bits to work with and thus has to reload the register less frequently, but the encoder in particular also just runs out of registers on 32-bit platforms.  I think that the use of assembly (even just hand-tuned x86 assembly, not SIMD) would probably speed things up more than algorithmic changes.  At least one developer in the Linaro project has proposed using SIMD to do the job as well, with the first target of that effort being ARM (but the resulting code, if it ever got written, could be ported to SSE2).
Comment on attachment 692621 [details] [diff] [review]
Speed up jpeg huffman decoding

Review of attachment 692621 [details] [diff] [review]:
-----------------------------------------------------------------

This looks mostly fine as-is, but there's still room for improvement. I encourage you to look at media/libtheora/lib/huffdec.c (and for adventure, media/libtheora/lib/arm/armbits.s). In Theora we have to solve a slightly more general problem, since there's no requirement that the bit patterns of the Huffman codes are ordered by length (and the default VP3 tables were not so ordered), but there are still lots of commonalities that apply here.

::: jdhuff.c
@@ +227,2 @@
>      } else {
> +      dtbl->maxcode[l] = last_max;	/* -1 if no codes of this length */

Comment is no longer accurate.

@@ +236,5 @@
>     * First we set all the table entries to 0, indicating "too long";
>     * then we iterate through the Huffman codes that are short enough and
>     * fill in all the entries that correspond to bit sequences starting
>     * with that code.
>     */

L1 cache footprint is important for this table, and currently it's twice as large as it needs to be on most platforms (int is 32 bits, but we only use 8 bits for the code value and 4 bits for the code length). Using an explicit 16-bit type would help.
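A sketch of what such a packed entry could look like; the exact field split (length in the high byte, symbol in the low byte) is an assumption:

#include <stdint.h>

typedef uint16_t lookup_entry;  /* half the footprint of a 32-bit int */

#define ENTRY_PACK(len, sym) ((lookup_entry)(((len) << 8) | (sym)))
#define ENTRY_LEN(e)         ((int)((e) >> 8))
#define ENTRY_SYM(e)         ((int)((e) & 0xFF))

With HUFF_LOOKAHEAD = 8, a 256-entry table then occupies 512 bytes instead of 1 KiB, halving the number of L1 cache lines it touches.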

@@ +382,1 @@
>        bits_left = MIN_GET_BITS;

You can now set this to a really large value to never hit this branch again.
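That is, in the out-of-data branch, something along these lines (a sketch; the exact sentinel doesn't matter as long as it exceeds any possible nbits request):

/* The buffer is already zero-padded past the real data, so advertise an
 * effectively infinite bit count; bits_left < nbits never fires again. */
bits_left = BIT_BUF_SIZE * 1024;  /* hypothetical sentinel value */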

@@ +408,5 @@
>    register int c0, c1; \
>    c0 = GETJOCTET(*buffer++); \
>    c1 = GETJOCTET(*buffer); \
>    /* Pre-execute most common case */ \
> +  get_buffer = get_buffer | ((bit_buf_type)c0 << (BIT_BUF_SIZE - 8 - bits_left)); \

So, when you call GET_BYTE repeatedly (as in FILL_BIT_BUFFER_FAST), what you actually want to do is store the shift amount (BIT_BUF_SIZE - 8 - bits_left) in a variable, and _subtract_ 8 from that each time you add a byte, and then convert that shift amount back to the corresponding bits_left when you're done. Otherwise the compiler will repeatedly translate back and forth between the two values, wasting operations and/or registers.
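A sketch of that restructuring, ignoring the 0xFF-marker handling that the real GET_BYTE performs (macro and variable names follow the patch):

/* Keep the shift amount live across the unrolled byte loads instead of
 * recomputing it from bits_left for every byte; convert back once. */
#define FILL_BIT_BUFFER_FAST_SKETCH { \
  int shift = BIT_BUF_SIZE - 8 - bits_left; \
  get_buffer |= (bit_buf_type)GETJOCTET(*buffer++) << shift; shift -= 8; \
  get_buffer |= (bit_buf_type)GETJOCTET(*buffer++) << shift; shift -= 8; \
  get_buffer |= (bit_buf_type)GETJOCTET(*buffer++) << shift; shift -= 8; \
  /* ... as many bytes as the buffer width allows ... */ \
  bits_left = BIT_BUF_SIZE - 8 - shift; \
}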

::: jdhuff.h
@@ +28,5 @@
> +#define BIT_BUF_SIZE  64		/* size of buffer in bits */
> +
> +#else
> +
> +typedef unsigned int bit_buf_type;	/* type of bit-extraction buffer */

Aww, did libjpeg drop real-mode DOS support? Otherwise this needs to be UINT32, etc.

@@ +159,5 @@
>  	    if (! jpeg_fill_bit_buffer(&(state),get_buffer,bits_left,nbits))  \
>  	      { action; }  \
>  	    get_buffer = (state).get_buffer; bits_left = (state).bits_left; } }
>  
> +/* XXX: requiring 'tmp' here makes this really ugly */

I'd personally prefer you take an l-value as one of the macro parameters, to make it more transparent. You wouldn't need the comma-operator contortions to make sure things are done in the right order, either.
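For example (a sketch of the l-value form; as a statement it can't sit in expression position the way the current GET_BITS does, so call sites would need adjusting):

/* Taking the result as an l-value lets the macro read the buffer before
 * shifting it, removing both 'tmp' and the comma-operator sequencing. */
#define GET_BITS_LV(nbits, result) \
  do { \
    (result) = (int)(get_buffer >> (BIT_BUF_SIZE - (nbits))); \
    get_buffer <<= (nbits); \
    bits_left -= (nbits); \
  } while (0)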

@@ +164,2 @@
>  #define GET_BITS(nbits) \
> +	(tmp = get_buffer, get_buffer <<= nbits, ((int) (tmp >> (bits_left -= (nbits), BIT_BUF_SIZE - (nbits)))))

I guess I'm confused why |bits_left -= (nbits)| is in a nested comma expression instead of being out in front.

@@ +209,2 @@
>    } else { \
>  slowlabel: \

So, one of the things that we found very helpful for Theora was not wasting the lookup, even when the code was too long to fit in the initial lookup table. You should be able to store in the table the minimum number of bits you know the code will need, and jump directly to that iteration of the loop.

@@ +225,2 @@
>      /* Equivalent of jpeg_huff_decode() */ \
> +    while (get_buffer > htbl->maxcode[nb]) { \

I.e., right here, if you had set nb to the actual number of bits needed for this code instead of HUFF_LOOKAHEAD+1, then in most cases the while loop would terminate on the first iteration (improving branch predictability as well as saving operations). It also means you don't need lmax (saving another lookup in the caller).
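Concretely, something like this, reusing the illustrative huff_tbl layout sketched earlier (storing the minimum length in the lookup entry is the assumption):

/* If the lookup entry for an over-long prefix records the minimum code
 * length that shares that prefix, the slow path starts its scan there. */
static int huff_decode_slow(const huff_tbl *tbl, uint64_t get_buffer,
                            int min_len /* from the lookup table */)
{
  int nb = min_len;                      /* not HUFF_LOOKAHEAD + 1 */
  while (get_buffer > tbl->maxcode[nb])  /* usually exits immediately */
    nb++;
  int s = (int)(get_buffer >> (64 - nb));
  return tbl->huffval[(s + tbl->valoffset[nb]) & 0xFF];
}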

@@ +230,1 @@
>      s = htbl->pub->huffval[ (int) (s + htbl->valoffset[nb]) & 0xFF ]; \

The only thing pub appears to be used for is access to the huffval[] table. If you put a copy of it in d_derived_tbl, you could do this lookup directly using the same base register as all the other htbl fields, saving an indirection and a register. If you make |lookup| smaller as I suggested above, you'll still come out ahead in terms of memory usage.
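A sketch of the layout change; the struct name here is illustrative, the real one being libjpeg's d_derived_tbl:

typedef struct {
  /* ... maxcode[], valoffset[], lookup[] ... */
  const UINT8 *huffval;  /* copied from pub->huffval at derivation time */
} d_derived_tbl_sketch;

/* The lookup then becomes:
 *   s = htbl->huffval[(int)(s + htbl->valoffset[nb]) & 0xFF];
 * using the same base register as the other htbl fields. */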
Attachment #692621 - Flags: review?(tterribe) → review+
Unfortunately, this patch does not pass the upstream unit tests.  I cannot evaluate its performance until I have some confidence that it is producing correct data in all cases.

Please apply it to the libjpeg-turbo trunk and ensure that both 32-bit and 64-bit builds can successfully complete 'make test'.  If you are doing quick & dirty performance analysis, also be sure that you are testing 32-bit and 64-bit code.  The Huffman codec in libjpeg-turbo is the way it is in part because certain optimizations I tried to do would benefit 64-bit but would slow down 32-bit.
(In reply to DRC from comment #3)
> Unfortunately, this patch does not pass the upstream unit tests.  I cannot
> evaluate its performance until I have some confidence that it is producing
> correct data in all cases.

Indeed, there's a problem in jpeg_huff_decode when there are no codes of length 1: maxcode[1] is 0, so when get_buffer is 0 the comparison against maxcode[1] ends the scan at length 1 even though no length-1 code exists. Let me see if I can figure out a way to fix it.
Also, I went ahead and ran benchmarks on the decoding paths that were not affected by the bug.  Unfortunately, I can't reproduce any significant speedup at all, nor even a measurable trend in that direction.  The results were all within about ±1% of the baseline.  Perhaps this new algorithm is something that only affects small images or certain types of images.  At any rate, until it can be shown to provide a significant speedup for at least several different types of images, I am not interested in pursuing it further.  It may be that the new algorithm is faster but that some of the hand-tuned optimizations to the Huffman codec effectively broke when it was implemented.  Given how long it took me to optimize the Huffman codec originally, it wouldn't surprise me if re-optimizing it with the new algorithm was at least a 50-hour job, which is definitely not something I'm able to undertake pro bono.
This version passes the test suite and should even be a little bit faster than the first one. It may still have some problems handling corrupt JPEGs, but that shouldn't be hard to fix.

These are the results of tjbench on http://people.mozilla.com/~jmuizelaar/plant.jpg, on x86-64 built with clang ~3.2:

I also measured noticeable improvement on ARM with the earlier patch.

Unfortunately, I couldn't figure out how to get libjpeg-turbo to build 32-bit on OS X, so I don't have real results for 32-bit.

BIT_BUF_SIZE = 32
pre
D--> Frame rate:           101.909076 fps
     Dest. throughput:     220.843489 Megapixels/sec

post
D--> Frame rate:           104.796332 fps
     Dest. throughput:     227.100359 Megapixels/sec

BIT_BUF_SIZE = 64
pre
D--> Frame rate:           101.856891 fps
     Dest. throughput:     220.730402 Megapixels/sec

post
Image size: 2014 x 1076
D--> Frame rate:           107.303705 fps
     Dest. throughput:     232.533996 Megapixels/sec
Attachment #692621 - Attachment is obsolete: true
FWIW, I also see measurable speedups on smaller images, and I measured on a Sandy Bridge core.
Attached patch The correct patch. (obsolete) — Splinter Review
Attachment #693749 - Attachment is obsolete: true
This adopts Tim's suggestion for improving FILL_BIT_BUFFER_FAST

With this I now get the following on plant.jpg

Image size: 2014 x 1076
D--> Frame rate:           109.335227 fps
     Dest. throughput:     236.936435 Megapixels/sec
Attachment #693760 - Attachment is obsolete: true
I confirm that the new patch passes the unit tests now and definitely is faster, but I'm still getting only ~2-3% speedup on 64-bit code and ~1-3% speedup on 32-bit code (NOTE: the instructions on how to build 32-bit code on your Mac are in the libjpeg-turbo build instructions, BUILDING.txt.)

This isn't a big enough difference to make me care very much yet.  Attaching my spreadsheet.
For reference here is the fps improvement on a Pandaboard:

Pre:
15.8102223333 ± 0.181260011062

Post:
16.5738538333 ± 0.172450939812
Maybe someone can add this bug to block Bug 579106 - (jpeg-perf) [Tracking] JPEG decoder should be faster
I don't quite understand the motivation for 579106 anymore.  That bug predates the inclusion of libjpeg-turbo, unless I miss my guess.  It depends on several "bugs" that should have been marked as closed with the inclusion of libjpeg-turbo: 489148, 584652, 462796.  Personally, I think 579106 should be closed as well.  libjpeg-turbo improved performance by 2-4x.  What you're proposing here improves performance by at most 5% (and more like 3% in my testing).  A user will not notice that.
(In reply to DRC from comment #10)
> I confirm that the new patch passes the unit tests now and definitely is
> faster, but I'm still getting only ~2-3% speedup on 64-bit code and ~1-3%
> speedup on 32-bit code (NOTE: the instructions on how to build 32-bit code
> on your Mac are in the libjpeg-turbo build instructions, BUILDING.txt.)
> 
> This isn't a big enough difference to make me care very much yet.  Attaching
> my spreadsheet.

DRC, can you send me the JPEG files that you used for testing? I'm interested in getting some performance results on different processors.
Severity: normal → S3