Open Bug 822023 Opened 12 years ago Updated 2 years ago

Speed up jpeg huffman decoding

Categories: Core :: Graphics: ImageLib, defect
Platform: x86 macOS

People: (Reporter: jrmuizel, Assigned: jrmuizel)

References: (Blocks 1 open bug)

Attachments: (2 files, 3 obsolete files)

Attached patch Speed up jpeg huffman decoding (obsolete) — Splinter Review
This takes the approach suggested at http://cbloomrants.blogspot.ca/2010/08/08-12-10-lost-huffman-paper.html and adapts it for libjpeg. It doesn't bring a huge performance improvement, but it is noticeable (from 98mbps to 102mbps).

The patch may still be a little rough, but I'm interested in feedback on it.
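For readers who don't want to dig through the post: the core idea is canonical-Huffman decoding against a left-justified bit buffer, so the code length is found with plain unsigned compares. A minimal sketch, assuming maxcode[l] stores the largest length-l code left-justified and padded with one-bits; all names here are illustrative, not the patch's actual code:

#include <stdint.h>

typedef struct {
  uint64_t maxcode[18];    /* per-length upper bound, left-justified,
                              trailing bits set to 1 (assumed layout) */
  int valoffset[18];       /* first-symbol index minus first code, per length */
  const uint8_t *huffval;  /* symbol values in canonical order */
} huff_tbl;

static int huff_decode(const huff_tbl *tbl,
                       uint64_t *get_buffer, int *bits_left)
{
  int l = 1;
  while (*get_buffer > tbl->maxcode[l])  /* no masks or shifts in the loop */
    l++;
  int code = (int)(*get_buffer >> (64 - l));
  *get_buffer <<= l;                     /* consume the l code bits */
  *bits_left -= l;
  return tbl->huffval[code + tbl->valoffset[l]];
}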
Attachment #692621 - Flags: review?(tterribe)
I'm going to try to get around to running some quick & dirty tests with this patch this week to see if I can reproduce your speedup numbers.  If so, then 5% is compelling enough to look at integrating this upstream.

Huffman encoding/decoding is probably the biggest piece of low-hanging performance fruit in libjpeg-turbo.  One of the biggest issues with it is that there is a large performance gap between 32-bit and 64-bit that I'd really like to close.  This gap is due in part to the fact that, on 64-bit platforms, the Huffman codec has more bits to work with and thus has to reload the register less frequently, but the encoder in particular also just runs out of registers on 32-bit platforms.  I think that the use of assembly (even just hand-tuned x86 assembly, not SIMD) would probably speed things up more than algorithmic changes.  At least one developer in the Linaro project has proposed using SIMD to do the job as well, with the first target of that effort being ARM (but the resulting code, if it ever got written, could be ported to SSE2).
Comment on attachment 692621 [details] [diff] [review]
Speed up jpeg huffman decoding

Review of attachment 692621 [details] [diff] [review]:
-----------------------------------------------------------------

This looks mostly fine as-is, but there's still room for improvement. I encourage you to look at media/libtheora/lib/huffdec.c (and for adventure, media/libtheora/lib/arm/armbits.s). In Theora we have to solve a slightly more general problem, since there's no requirement that the bit patterns of the Huffman codes are ordered by length (and the default VP3 tables were not so ordered), but there are still lots of commonalities that apply here.

::: jdhuff.c
@@ +227,2 @@
>      } else {
> +      dtbl->maxcode[l] = last_max;	/* -1 if no codes of this length */

Comment is no longer accurate.

@@ +236,5 @@
>     * First we set all the table entries to 0, indicating "too long";
>     * then we iterate through the Huffman codes that are short enough and
>     * fill in all the entries that correspond to bit sequences starting
>     * with that code.
>     */

L1 cache footprint is important for this table, and currently it's twice as large as it needs to be on most platforms (int is 32 bits, but we only use 8 bits for the code value and 4 bits for the code length). Using an explicit 16-bit type would help.
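A sketch of what such a packed entry could look like; the exact field split (length in the high byte, symbol in the low byte) is an assumption:

#include <stdint.h>

typedef uint16_t lookup_entry;  /* half the footprint of a 32-bit int */

#define ENTRY_PACK(len, sym) ((lookup_entry)(((len) << 8) | (sym)))
#define ENTRY_LEN(e)         ((int)((e) >> 8))
#define ENTRY_SYM(e)         ((int)((e) & 0xFF))

With HUFF_LOOKAHEAD = 8, a 256-entry table then occupies 512 bytes instead of 1 KiB, halving the number of L1 cache lines it touches.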

@@ +382,1 @@
>        bits_left = MIN_GET_BITS;

You can now set this to a really large value to never hit this branch again.
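That is, in the out-of-data branch, something along these lines (a sketch; the exact sentinel doesn't matter as long as it exceeds any possible nbits request):

/* The buffer is already zero-padded past the real data, so advertise an
 * effectively infinite bit count; bits_left < nbits never fires again. */
bits_left = BIT_BUF_SIZE * 1024;  /* hypothetical sentinel value */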

@@ +408,5 @@
>    register int c0, c1; \
>    c0 = GETJOCTET(*buffer++); \
>    c1 = GETJOCTET(*buffer); \
>    /* Pre-execute most common case */ \
> +  get_buffer = get_buffer | ((bit_buf_type)c0 << (BIT_BUF_SIZE - 8 - bits_left)); \

So, when you call GET_BYTE repeatedly (as in FILL_BIT_BUFFER_FAST), what you actually want to do is store the shift amount (BIT_BUF_SIZE - 8 - bits_left) in a variable, and _subtract_ 8 from that each time you add a byte, and then convert that shift amount back to the corresponding bits_left when you're done. Otherwise the compiler will repeatedly translate back and forth between the two values, wasting operations and/or registers.
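A sketch of that restructuring, ignoring the 0xFF-marker handling that the real GET_BYTE performs (macro and variable names follow the patch):

/* Keep the shift amount live across the unrolled byte loads instead of
 * recomputing it from bits_left for every byte; convert back once. */
#define FILL_BIT_BUFFER_FAST_SKETCH { \
  int shift = BIT_BUF_SIZE - 8 - bits_left; \
  get_buffer |= (bit_buf_type)GETJOCTET(*buffer++) << shift; shift -= 8; \
  get_buffer |= (bit_buf_type)GETJOCTET(*buffer++) << shift; shift -= 8; \
  get_buffer |= (bit_buf_type)GETJOCTET(*buffer++) << shift; shift -= 8; \
  /* ... as many bytes as the buffer width allows ... */ \
  bits_left = BIT_BUF_SIZE - 8 - shift; \
}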

::: jdhuff.h
@@ +28,5 @@
> +#define BIT_BUF_SIZE  64		/* size of buffer in bits */
> +
> +#else
> +
> +typedef unsigned int bit_buf_type;	/* type of bit-extraction buffer */

Aww, did libjpeg drop real-mode DOS support? Otherwise this needs to be UINT32, etc.

@@ +159,5 @@
>  	    if (! jpeg_fill_bit_buffer(&(state),get_buffer,bits_left,nbits))  \
>  	      { action; }  \
>  	    get_buffer = (state).get_buffer; bits_left = (state).bits_left; } }
>  
> +/* XXX: requiring 'tmp' here makes this really ugly */

I'd personally prefer you take an l-value as one of the macro parameters, to make it more transparent. You wouldn't need the comma-operator contortions to make sure things are done in the right order, either.
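For example (a sketch of the l-value form; as a statement it can't sit in expression position the way the current GET_BITS does, so call sites would need adjusting):

/* Taking the result as an l-value lets the macro read the buffer before
 * shifting it, removing both 'tmp' and the comma-operator sequencing. */
#define GET_BITS_LV(nbits, result) \
  do { \
    (result) = (int)(get_buffer >> (BIT_BUF_SIZE - (nbits))); \
    get_buffer <<= (nbits); \
    bits_left -= (nbits); \
  } while (0)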

@@ +164,2 @@
>  #define GET_BITS(nbits) \
> +	(tmp = get_buffer, get_buffer <<= nbits, ((int) (tmp >> (bits_left -= (nbits), BIT_BUF_SIZE - (nbits)))))

I guess I'm confused why |bits_left -= (nbits)| is in a nested comma expression instead of being out in front.

@@ +209,2 @@
>    } else { \
>  slowlabel: \

So, one of the things that we found very helpful for Theora was not wasting the lookup, even when the code was too long to fit in the initial lookup table. You should be able to store in the table the minimum number of bits you know the code will need, and jump directly to that iteration of the loop.

@@ +225,2 @@
>      /* Equivalent of jpeg_huff_decode() */ \
> +    while (get_buffer > htbl->maxcode[nb]) { \

I.e., right here, if you had set nb to the actual number of bits needed for this code instead of HUFF_LOOKAHEAD+1, then in most cases the while loop would terminate on the first iteration (improving branch predictability as well as saving operations). It also means you don't need lmax (saving another lookup in the caller).
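Concretely, something like this, reusing the illustrative huff_tbl layout sketched earlier (storing the minimum length in the lookup entry is the assumption):

/* If the lookup entry for an over-long prefix records the minimum code
 * length that shares that prefix, the slow path starts its scan there. */
static int huff_decode_slow(const huff_tbl *tbl, uint64_t get_buffer,
                            int min_len /* from the lookup table */)
{
  int nb = min_len;                      /* not HUFF_LOOKAHEAD + 1 */
  while (get_buffer > tbl->maxcode[nb])  /* usually exits immediately */
    nb++;
  int s = (int)(get_buffer >> (64 - nb));
  return tbl->huffval[(s + tbl->valoffset[nb]) & 0xFF];
}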

@@ +230,1 @@
>      s = htbl->pub->huffval[ (int) (s + htbl->valoffset[nb]) & 0xFF ]; \

The only thing pub appears to be used for is access to the huffval[] table. If you put a copy of it in d_derived_tbl, you could do this lookup directly using the same base register as all the other htbl fields, saving an indirection and a register. If you make |lookup| smaller as I suggested above, you'll still come out ahead in terms of memory usage.
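A sketch of the layout change; the struct name here is illustrative, the real one being libjpeg's d_derived_tbl:

typedef struct {
  /* ... maxcode[], valoffset[], lookup[] ... */
  const UINT8 *huffval;  /* copied from pub->huffval at derivation time */
} d_derived_tbl_sketch;

/* The lookup then becomes:
 *   s = htbl->huffval[(int)(s + htbl->valoffset[nb]) & 0xFF];
 * using the same base register as the other htbl fields. */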
Attachment #692621 - Flags: review?(tterribe) → review+
Unfortunately, this patch does not pass the upstream unit tests.  I cannot evaluate its performance until I have some confidence that it is producing correct data in all cases.

Please apply it to the libjpeg-turbo trunk and ensure that both 32-bit and 64-bit builds can successfully complete 'make test'.  If you are doing quick & dirty performance analysis, also be sure that you are testing 32-bit and 64-bit code.  The Huffman codec in libjpeg-turbo is the way it is in part because certain optimizations I tried to do would benefit 64-bit but would slow down 32-bit.
(In reply to DRC from comment #3)
> Unfortunately, this patch does not pass the upstream unit tests.  I cannot
> evaluate its performance until I have some confidence that it is producing
> correct data in all cases.

Indeed, there's a problem in jpeg_huff_decode when there are no codes of length 1: maxcode[1] is 0, so when get_buffer is 0 the comparison against maxcode[1] ends the scan at length 1 even though no length-1 code exists. Let me see if I can figure out a way to fix it.
Also, I went ahead and ran benchmarks on the decoding paths that were not affected by the bug.  Unfortunately, I can't reproduce any significant speedup at all, nor even a measurable trend in that direction.  The results were all within about ±1% of the baseline.  Perhaps this new algorithm is something that only affects small images or certain types of images.  At any rate, until it can be shown to provide a significant speedup for at least several different types of images, I am not interested in pursuing it further.  It may be that the new algorithm is faster but that some of the hand-tuned optimizations to the Huffman codec effectively broke when it was implemented.  Given how long it took me to optimize the Huffman codec originally, it wouldn't surprise me if re-optimizing it with the new algorithm was at least a 50-hour job, which is definitely not something I'm able to undertake pro bono.
This version passes the test suite and should even be a little bit faster than the first one. It may still have some problems handling corrupt JPEGs, but that shouldn't be hard to fix.

These are the results of tjbench on http://people.mozilla.com/~jmuizelaar/plant.jpg, on x86-64 built with clang ~3.2:

I also measured noticeable improvement on ARM with the earlier patch.

Unfortunately, I couldn't figure out how to get libjpeg-turbo to build 32-bit on OS X, so I don't have real results for 32-bit.

BIT_BUF_SIZE = 32
pre
D--> Frame rate:           101.909076 fps
     Dest. throughput:     220.843489 Megapixels/sec

post
D--> Frame rate:           104.796332 fps
     Dest. throughput:     227.100359 Megapixels/sec

BIT_BUF_SIZE = 64
pre
D--> Frame rate:           101.856891 fps
     Dest. throughput:     220.730402 Megapixels/sec

post
Image size: 2014 x 1076
D--> Frame rate:           107.303705 fps
     Dest. throughput:     232.533996 Megapixels/sec
Attachment #692621 - Attachment is obsolete: true
FWIW, I also see measurable speedups on smaller images, and I measured on a Sandy Bridge core.
Attached patch The correct patch. (obsolete) — Splinter Review
Attachment #693749 - Attachment is obsolete: true
This adopts Tim's suggestion for improving FILL_BIT_BUFFER_FAST

With this I now get the following on plant.jpg

Image size: 2014 x 1076
D--> Frame rate:           109.335227 fps
     Dest. throughput:     236.936435 Megapixels/sec
Attachment #693760 - Attachment is obsolete: true
I confirm that the new patch passes the unit tests now and definitely is faster, but I'm still getting only ~2-3% speedup on 64-bit code and ~1-3% speedup on 32-bit code (NOTE: the instructions on how to build 32-bit code on your Mac are in the libjpeg-turbo build instructions, BUILDING.txt.)

This isn't a big enough difference to make me care very much yet.  Attaching my spreadsheet.
For reference here is the fps improvement on a Pandaboard:

Pre:
15.8102223333 ± 0.181260011062

Post:
16.5738538333 ± 0.172450939812
Maybe someone can add this bug to block Bug 579106 - (jpeg-perf) [Tracking] JPEG decoder should be faster
I don't quite understand the motivation for 579106 anymore.  That bug predates the inclusion of libjpeg-turbo, unless I miss my guess.  It depends on several "bugs" that should have been marked as closed with the inclusion of libjpeg-turbo: 489148, 584652, 462796.  Personally, I think 579106 should be closed as well.  libjpeg-turbo improved performance by 2-4x.  What you're proposing here improves performance by at most 5% (and more like 3% in my testing).  A user will not notice that.
(In reply to DRC from comment #10)
> I confirm that the new patch passes the unit tests now and definitely is
> faster, but I'm still getting only ~2-3% speedup on 64-bit code and ~1-3%
> speedup on 32-bit code (NOTE: the instructions on how to build 32-bit code
> on your Mac are in the libjpeg-turbo build instructions, BUILDING.txt.)
> 
> This isn't a big enough difference to make me care very much yet.  Attaching
> my spreadsheet.

DRC, can you send me the JPEG files that you used for testing? I'm interested in getting some performance results on different processors.
Severity: normal → S3