Closed Bug 444661 Opened 11 years ago Closed 11 years ago

Decrease Performance Penalty of color management linear interpolations

Categories

(Core :: GFX: Color Management, defect)

Tracking

RESOLVED FIXED

People

(Reporter: bholley, Assigned: bholley)

Attachments

(1 file, 4 obsolete files)

Initial performance analysis (viewing a jpg of several megs with an embedded color profile from the local disk) for firefox color management on my mac indicated that roughly 8.2% of firefox's CPU time was spent in cmsLinearInterpLUT16. As such, I figured it would be a good candidate for optimization.

Closer inspection revealed that the function does in fact have hand-tuned inline assembly, but only for windows (masm). I spent some time porting it to GAS syntax (slightly non-trivial due to register allocation issues relating to gcc's use of ebx as the PIC register), and finally had analogous working assembly for mac and linux. Running it through Shark, I was surprised to find that the performance was almost twice as bad (16.1% as compared to 8.2%). I had a feeling this had to do with the div instructions, and in fact it did: roughly 80% of the samples Shark recorded landed in the instructions immediately following the divs, which makes sense since div apparently isn't pipelined. Commenting out the divs brings performance to a nice 7.1% and results in pretty trippy color rendering. ;-)

This means that the assembly is actually to blame for poor performance on windows, not on mac and linux. This makes sense, because the performance numbers stuart was talking about were a bit worse than the ones I was seeing.

I've been doing some reading (http://blogs.msdn.com/devdev/archive/2005/12/12/502980.aspx) on some heuristics to do division by constants with shifts and adds. I hope to be able to replace the divs with a hand-tuned solution. 

After that, I want to look into MMX/SSE/SIMD to see if I can speed things up that way. The function is currently called once for each 2-byte value, so hopefully we can get performance speedups that way (I would imagine I'd have to rearchitect the call structure a bit). One concern is that the code currently consults a LUT for each value, which may limit the amount of SIMD optimizations we can do (we still need N arbitrary memory accesses for N values no matter what instruction set we use).
CC-ing schrep regarding the assembly ninjas
Michael and Steve - Schrep mentioned that you guys were good at assembly optimization. I'm working on optimizing this function for color management at the moment, so I was hoping that one or both of you could review it once it's done and give suggestions for things I might have missed. I'll let you know when it's done.
Page 136 of the AMD K10 Software Optimization Guide has code for two programs that derive the assembler sequences for constant integer division via MUL + SHR. The code for unsigned division by 65535 is:

MOV EAX, 080008001h
MUL dividend
SHR EDX, 15

And for signed division:

MOV EAX, 080008001h
IMUL dividend
MOV EAX, dividend
ADD EDX, EAX
SAR EDX, 15
SHR EAX, 31
ADD EDX, EAX
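
For reference, the unsigned variant in portable C (a sketch, not part of any patch here; the magic constant and shift count come straight from the recipe above), checked against real division over a sample range:

#include <stdio.h>

/* MUL leaves the high 32 bits of the product in EDX, so the SHR EDX, 15
   makes the total shift 32 + 15 = 47 bits. */
static unsigned int div65535(unsigned int x)
{
    return (unsigned int) (((unsigned long long) x * 0x80008001ULL) >> 47);
}

int main(void)
{
    unsigned int x;
    for (x = 0; x < 20000000u; x++)
        if (div65535(x) != x / 65535)
            printf("mismatch at %u\n", x);
    return 0;
}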

How is the LUT table generated? Can we precompute (y1 - y0) efficiently?
Having finished porting their crappy assembly, I've been focusing more on the intent of the function. I _think_ the division can actually (and should actually) just be done with a right-shift by 16, and indeed it looks right when I do that. However, since color management is all about the fine points, if it's off by even a little bit it's no good. I'll dig deeper into that one tomorrow.

One question though - how does the performance of MUL compare to the performance of single and double-precision shifts (I know multipliers are slow in Verilog, but I don't know how they behave on x86)? If I can do something with one or two shifts as opposed to a multiplication, is it worth it, or is div the only major culprit?

Good idea on precomputing the difference in the LUT. I'll look into that tomorrow.
It should be interesting to examine the generated code from the build. If I compile this snippet of code using VS2005 and -O2,

#include <stdio.h>

int main()
{
  int a;

  scanf("%d", &a);
  printf("%d\n", a/65535);

  return 0;
}

I get:

; 5    :   int a;
; 6    :
; 7    :   scanf("%d", &a);

        lea     eax, DWORD PTR _a$[esp+4]
        push    eax
        push    OFFSET ??_C@_02DPKJAMEF@?$CFd?$AA@
        call    _scanf

; 8    :   printf("%d\n", a/65535);

        mov     ecx, DWORD PTR _a$[esp+12]
        mov     eax, -2147450879                        ; 80008001H
        imul    ecx
        add     edx, ecx
        sar     edx, 15                                 ; 0000000fH
        mov     ecx, edx
        shr     ecx, 31                                 ; 0000001fH
        add     ecx, edx
        push    ecx
        push    OFFSET ??_C@_03PMGGPEJJ@?$CFd?6?$AA@
        call    _printf

so VS2005 at -O2 does the multiply, add and double shift to replace the division. I'd guess that GCC does the same thing at -O2 but someone would have to test that.
Yep:

...
        call    L_scanf$stub
        movl    -12(%ebp), %ecx
        movl    $-2147450879, %edx
        movl    %ecx, %eax
        imull   %edx
        leal    LC1-"L00000000001$pb"(%ebx), %eax
        movl    %eax, (%esp)
        addl    %ecx, %edx
        sarl    $15, %edx
        sarl    $31, %ecx
        subl    %ecx, %edx
        movl    %edx, 4(%esp)
        call    L_printf$stub
...
(Also, in the patch, don't blow away the MSC_VER version!  But do fix it, along with a gcc version :)
(In reply to comment #5)
> Having finished porting their crappy assembly, I've been focusing more on the
> intent of the function. I _think_ the division can actually (and should
> actually) just be done with a right-shift by 16, and indeed it looks right when
> I do that. However, since color management is all about the fine points, if
> it's off by even a little bit it's no good. I'll dig deeper into that one
> tomorrow.
> 
> One question though - How does the performance of MUL compare to the
> performance of single and double-precision shifts (I know multipliers are slow
> in verilog, but I don't know about how they work on an x86)? If I can do
> something with one or two shifts as opposed to a multiplication, is it worth
> it, or is div the only major culprit?

On the Pentium 4, imul takes from 10 to 14 cycles. Shifts take from 1 to 4 cycles. On Core, imul takes 3 to 4 cycles and shifts take one cycle. On AMD processors, imul takes 3 on k10 (maybe a cycle or two more on k8), and shifts take one cycle. I don't have the SOGs for processors older than k8.

If the arguments to the three ToFixedDomain calls can be guaranteed to be positive, then passing an unsigned argument instead of the current signed argument would improve performance (compare the generated code for the synthetic signed and unsigned division in comment #4).

It may be better to make some C code improvements with -O2 optimization instead of going with inline assembler. -GL optimization would allow for inter-routine optimization and -PGO could optimize the two conditionals.
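
For reference, the suggested -GL / PGO workflow on VS2005 looks roughly like this (a sketch from memory, not a tested build recipe; adapt the file lists to the real build):

    cl /O2 /GL cmsintrp.c ...        (whole-program, cross-module optimization)
    link /LTCG:PGINSTRUMENT ...      (produce an instrumented build)
    ... run a representative image-decoding workload ...
    link /LTCG:PGOPTIMIZE ...        (rebuild using the collected profile)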

If you want to go SIMD, using intrinsics would make maintenance easier and allow the compiler to do the instruction scheduling.
I ran a program to look for differences in dividing by 2^16 instead of 2^16 - 1 and found a bunch of off-by-one differences.

#include "stdio.h"

main()
{
  int a;

  for (a = -20000000; a < 20000000; a++)
    if ( ((a + 32767)/65535) != ((a + 32767)/65536))
      printf("%d %d %d\n", a, (a + 32767)/65535, ((a + 32767)/65536));
}
One last comment for the evening: doing SIMD code is frequently difficult when you have table lookups as is done with the LutTable because there are no table lookup SIMD instructions. Sometimes it is faster to compute table results on the fly in parallel to implement an SIMD solution.
Vlad - Don't worry, just a working patch to test the idea.

Michael - Unfortunately, the LUTs are read straight out of the ICC profile, so computing them is impossible.

I've started to think the assembly optimization isn't going to take us very far. Fundamentally, the problem is that we're running this function a lot, not that it's particularly inefficient. SIMD would probably be a good boost, but the need to look stuff up in the table makes that pretty difficult.

As a result, I've been looking into caching. My first attempt involved setting up a 256-entry cache table for each LUT to store the results for different input values (proof of concept patch attached). This doubled the performance on one of my test images (a NASA photo with a lot of black space and a bright yellow sun), but had more or less no benefit on a more realistic picture (mountains and rivers). I then tried making the cache bigger and covering the entire range of input values (rather than mapping 256->1). This doubled performance on the mountain photo, and almost tripled it on the space photo.

Given that the LUTs are per-color profile, I'm now looking into simply precomputing the entire range of values for the system profile and for sRGB. This should take care of most common cases, and will hopefully prevent color management from having any noticeable performance hit for day to day browsing. Code to follow.
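
To make the full-range idea concrete, here's a minimal sketch (hypothetical names and a simplified stand-in interpolator, not the actual patch): precompute one output per possible 16-bit input so the per-pixel work collapses to a single array read.

#include <stdlib.h>

typedef unsigned short WORD;

/* Simplified stand-in for lcms' per-value linear interpolation over an
   npoints-entry gamma table (the real routine lives in cmsintrp.c). */
static WORD Interp16(WORD In, const WORD *Lut, int npoints)
{
    unsigned long long p = ((unsigned long long) In * (npoints - 1) << 16) / 0xFFFF;
    unsigned int cell = (unsigned int) (p >> 16);
    unsigned int rest = (unsigned int) (p & 0xFFFF);
    int y0 = Lut[cell];
    int y1 = Lut[cell + 1 < (unsigned int) npoints ? cell + 1 : cell];
    return (WORD) (y0 + (int) (((long long) (y1 - y0) * rest) >> 16));
}

typedef struct {
    WORD Cache[0x10000];    /* one precomputed output per possible input */
} Precache16;

/* Built once per long-lived profile (system output profile, sRGB)... */
static Precache16 *BuildPrecache16(const WORD *Lut, int npoints)
{
    Precache16 *p = (Precache16 *) malloc(sizeof(*p));
    unsigned int v;
    if (p == NULL) return NULL;
    for (v = 0; v <= 0xFFFF; v++)
        p->Cache[v] = Interp16((WORD) v, Lut, npoints);
    return p;
}

/* ...after which each per-pixel evaluation is just: out = p->Cache[in]; */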
Where do the ICC profiles come from or where do they live?
The destination color profile is read in either from a file (if it's specified in about:config) or from memory (queried from the windowing system).

The source color profile is either created from known parameters (if it's sRGB) or created from the image data of images that have embedded source profiles.
Perhaps you could try changing the following:

a1 = ToFixedDomain(dif * rest);
a1 = ToFixedDomain((- dif) * rest);

to:

a1 = ToFixedDomain((unsigned int) (dif * rest));
a1 = ToFixedDomain((unsigned int) ((- dif) * rest));

as "rest" is positive and dif is positive based on the logic of the if statement.
renaming the bug to be more general (since we're considering caching mechanisms also).
Summary: Optimize Inline Assembly For cmsLinearInterpLUT16 → Decrease Performance Penalty of cmsLinearInterpLUT16
Depends on: 444829
I've written a patch that does precaching for the system output profile and sRGB (as an input profile) that seems to have pretty good results. I've thrown it up on the tryserver and will hopefully post it tomorrow morning once I see the results and fix some whitespace issues.

I realized that I did something dumb that hid another big perf issue for color management. When I was making my test image, I grabbed a high-res photo off the internet and slapped a random profile from my system folder onto it as its profile. Unfortunately, I now realize that the profile I used was in fact my primary display profile. The composition of the two matrices yielded an identity matrix, which LCMS was smart enough to optimize away. It now looks like there's an equally significant performance hit in MAT3evalW. This might be harder to optimize (oh how i wish we had a GPU), but hopefully the perf gains from this will get us into reasonable territory. I'll look more into that tomorrow.
again changing the title of the bug to reflect its more general nature.
Summary: Decrease Performance Penalty of cmsLinearInterpLUT16 → Decrease Performance Penalty of color management linear interpolations
I just realized that the matrix multiplications are probably a perfect candidate for optimization with SIMD intrinsics. I'll look into that tomorrow.
For best performance with SSE2, you'd want to change the vectors from three doublewords to four and align them on a double-quadword (16-byte) boundary, whether they're malloc'd or allocated on the stack.

The "W" flavor of the matrix arithmetic uses 64-bit scaled integer arithmetic and there should be enough support in SSE2 for this using the packed quadword multiply, add and shift instructions. Double-precision floating point instructions could be used but then there would be conversion overhead.

Can these operations be streamed or are they strictly executed one at a time?
I've made a separate bug for the matrix optimization. See bug 445552.
Attached patch proposed precache patch (obsolete) — Splinter Review
Adding a proposed patch for the precaching strategy. I did my best to stick to lcms' whitespace style, but it's pretty much impossible since lcms is horribly inconsistent.

This patch must be applied on top of the patch for bug 444829, which hasn't landed but has review and will be landed soon.

Flagging joe for review.
Attachment #328992 - Attachment is obsolete: true
Attachment #329122 - Attachment is obsolete: true
Attachment #329898 - Flags: review?(joe)
What's our test coverage like for lcms?
Joe - no idea. Vlad, Any insight on this?
I ran a build with and without this change and it's about 6% faster on 84 MB of random jpeg images. This is off the 3.0 source code base. I had to make a minor change to a file as it wouldn't apply on 3.0. Hardware is a 2.5 GHz Penryn DC. OS is Windows XP.
Two comments:

1) I'd suggest building cmsintrp.c with -O2 on Windows. I think that this is built with -O1 by default. -O2 gets the multiply/shift optimization for the divisions and also inlines the multiply/shift instead of calling a function.

2) I'd suggest using an unsigned int cast in the last two calls to ToFixedDomain in cmsLinearInterpLUT1.
(In reply to comment #26)

> 2) I'd suggest using an unsigned int cast in the last two calls to
> ToFixedDomain in cmsLinearInterpLUT1.

I assume you mean cmsLinearInterpLUT16, but further -- how come?
See comment #4.

(In reply to comment #27)
> (In reply to comment #26)
> 
> > 2) I'd suggest using an unsigned int cast in the last two calls to
> > ToFixedDomain in cmsLinearInterpLUT1.
> 
> I assume you mean cmsLinearInterpLUT16, but further -- how come?


Comment on attachment 329898 [details] [diff] [review]
proposed precache patch

First comment: This looks good in principle, but needs some tweaking in implementation.


>+        LCMSBOOL status = 
>+            cmsPrecacheProfile(gCMSOutputProfile, CMS_PRECACHE_LILUT16_REVERSE);

>+        LCMSBOOL status = 
>+            cmsPrecacheProfile(gCMSsRGBProfile, CMS_PRECACHE_LIF_FORWARD);

It'd be nice if, given that precaching the transformations is such a win, this happened automatically instead of having to be opted-in.

>+// Type specifier for precaches
>+typedef enum {
>+
>+             // Precache the results of cmsLinearIntrpLUT16 on inverse (output) gamma tables
>+             CMS_PRECACHE_LILUT16_REVERSE,   
>+
>+             // Precache the results of cmsLinearIntrpFixed on forward (input) gamma tables
>+             CMS_PRECACHE_LIF_FORWARD 
>+
>+             } LCMSPRECACHETYPE;
>+#define PRECACHE_TYPE_COUNT ((CMS_PRECACHE_LIF_FORWARD - CMS_PRECACHE_LILUT16_REVERSE) + 1)

Please make PRECACHE_TYPE_COUNT the last member of the enum, and specify the values (0, 1) for _REVERSE and _FORWARD.

However, I'm not sure if this type will be necessary. See below.

>+               // Different types of precaches require different structures. We use a union
>+               // to handle them with the same code when we can.
>+               union {
>+                     LCMSPRECACHELILUT16IMPL LILUT16;
>+                     LCMSPRECACHELIFIMPL     LIF;
>+                     } Impl;

Having this union seems to just make actually referring to the precached data difficult. I'd prefer to see two different reference-counted precache structures, one for input and one for output. They're never mixed, from what I can tell, so having two separate types should only make things cleaner. And, so long as you name things nicely, you can still use your PRECACHE_ADDREF and PRECACHE_RELEASE macros.

This is going to require a lot of code changes, which is the main reason for the r-.

>+// Mozilla found this function to be almost twice as fast as the
>+// hand-optimized assembly, mostly due to the slowness of the div
>+// instructions in the assembler version. The assembly has thus been removed.

This comment should be in the checkin comment or ChangeLog, not the source.
 
>+              InVect.n[VX] = MatShaper->L2_LIF_Precache->Impl.LIF.Cache[0][In[0]];
>+              InVect.n[VY] = MatShaper->L2_LIF_Precache->Impl.LIF.Cache[1][In[1]];
>+              InVect.n[VZ] = MatShaper->L2_LIF_Precache->Impl.LIF.Cache[2][In[2]];

This function does a lot of this mixing of VX, VY, VZ and 0, 1, 2. I know you're just following convention, but it's kind of ugly. If you can find a way to use named indices instead of hardcoded numbers, that'd be nice. :)

>+++ b/modules/lcms/src/cmsprecache.c
>@@ -0,0 +1,178 @@
>+//  Little cms
>+//  Copyright (C) 2008 Mozilla Foundation

>+#define CMS_DEBUG 1

Do we want this in checked-in code?

Also, please address mmoy's comments in comment 26. And is LittleCMS actively maintained upstream? Is it possible to get the upstream developers to look this over? (Ideally we could get it pushed upstream.)

Finally, before any of your LCMS patches are applied, I want some form of CMS testing added, especially if we're going to enable it by default. At minimum, I'd like tests that see whether CMS does anything (i.e., does this jpeg with a profile embedded look different than the same jpeg without a profile), but ideally it'd test the quality of the CM transformation. That will require a file with a non-sRGB profile, and an equivalent file in sRGB, which we then test for equality. Since we're using the fixed-point LCMS implementation, this should be reasonably solid, but we'll have to see what happens with all the vagaries of compiler optimization and CPU type.

The rest of the code looks pretty good. I've checked it over (keeping in mind I don't know LCMS code too well) and it seems to handle failure/etc fairly well. And I've tested it too -- seems to work nicely!
Attachment #329898 - Flags: review?(joe) → review-
>It'd be nice if, given that precaching the transformations is such a win, this
>happened automatically instead of having to be opted-in.

I don't think we want to precache just any profile that comes through LCMS. Embedded profiles are read from images at render time, and it would be wasteful in both time and space to generate a precache for a profile with a short lifetime. Moreover, LCMS doesn't know whether a profile is an input profile or an output profile (or both) at creation time, so it wouldn't know the appropriate precache to create. Finally, other lcms apps (GIMP, inkscape, etc) would likely object to such a big increase in their memory footprint when color management is not a bottleneck for them.

>Please make PRECACHE_TYPE_COUNT the last member of the enum

Any particular reason to do this? I was under the impression that both were accepted conventions and it doesn't look like lcms sets a precedent in either direction.

>Having this union seems to just make actually referring to the precached data
>difficult. I'd prefer to see two different reference-counted precache
>structures, one for input and one for output. They're never mixed, from what I
>can tell, so having two separate types should only make things cleaner. And, so
>long as you name things nicely, you can still use your PRECACHE_ADDREF and
>PRECACHE_RELEASE macros.

I disagree. The union does add the annoyance of having to type things like Impl.LIF. in data structure access. However, it adds a generality to the precaching code that makes it more likely to be accepted upstream. Right now, we're precaching the results of two very specific functions (cmsLinearInterpLUT16 and cmsLinearInterpFixed) because they happen to lie in the critical path for gecko (one of them is used for our input transform and the other is used for our output transform). However, lcms is a library with dozens of other functions which can be called instead depending on function parameters. As such, I wanted to make the precaching code as general as possible such that most of it would serve as a framework for other precaching operations. This means that we don't want to pollute the profile structure with a line for each type of precache, and we want adding a new precache to involve adding as few hooks as possible. For example, right now the profile closing code is agnostic to the number and type of precaches attached to the profile. This is only possible when we store the different types of caches in an array, which is in turn possible only with unions, wasted space, or dangerous type casting.
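
To illustrate that last point, here's a rough, self-contained sketch of the array-of-unions layout (stand-in struct bodies and a hypothetical release helper; the real definitions and the PRECACHE_RELEASE macro are in the patch):

#include <stdlib.h>

typedef struct { int placeholder; } LCMSPRECACHELILUT16IMPL;   /* stand-in body */
typedef struct { int placeholder; } LCMSPRECACHELIFIMPL;       /* stand-in body */

typedef enum { CMS_PRECACHE_LILUT16_REVERSE, CMS_PRECACHE_LIF_FORWARD } LCMSPRECACHETYPE;
#define PRECACHE_TYPE_COUNT ((CMS_PRECACHE_LIF_FORWARD - CMS_PRECACHE_LILUT16_REVERSE) + 1)

typedef struct {
               LCMSPRECACHETYPE Type;
               int              RefCount;
               union {
                     LCMSPRECACHELILUT16IMPL LILUT16;
                     LCMSPRECACHELIFIMPL     LIF;
                     } Impl;
               } LCMSPRECACHE;

/* The profile keeps one slot per precache type, so the close path can loop
   over the array without knowing which caches exist or what they contain. */
static void ReleaseAllPrecaches(LCMSPRECACHE *Precache[PRECACHE_TYPE_COUNT])
{
       int i;
       for (i = 0; i < PRECACHE_TYPE_COUNT; i++) {
           if (Precache[i] != NULL && --Precache[i]->RefCount == 0)
               free(Precache[i]);
           Precache[i] = NULL;
       }
}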
Attached patch updated precache patch (obsolete) — Splinter Review
I finally got around to updating the precache patch. This should allay some of joe's concerns, and it also adopts a more structured naming convention that helps for later patches.

Flagging joe for review. LCMS still lacks a testing framework, and I promise I'll write one before any of this code gets checked in. I'm just hoping to stabilize this patch to ease things for the patches I have on top of it.
Attachment #329898 - Attachment is obsolete: true
Attachment #332296 - Flags: review?(joe)
oh as a side note - mmoy's first concern is being addressed in bug 449130
Comment on attachment 332296 [details] [diff] [review]
updated precache patch

r+ as long as we get some testing in. Also, please check upstream to see if they have any requests.
Attachment #332296 - Flags: review?(joe) → review+
roger that - pinged the upstream maintainer
Comment on attachment 332296 [details] [diff] [review]
updated precache patch

flagging vlad for sr, since I think he already read the patch
Attachment #332296 - Flags: superreview?(vladimir)
I found a bug in the precaching patch where it fails in a crappy way when there aren't any TRCs in the profile to precache. This should make it fail silently and still work, just slower.

Flagging joe for r and vlad for sr. This one should be quick - just diff it against the previous patch.
Attachment #332296 - Attachment is obsolete: true
Attachment #333655 - Flags: superreview?(vladimir)
Attachment #333655 - Flags: review?(joe)
Attachment #332296 - Flags: superreview?(vladimir)
Comment on attachment 333655 [details] [diff] [review]
updated patch to fix a bug I found with the test suite

Looks fine.
Attachment #333655 - Flags: superreview?(vladimir)
Attachment #333655 - Flags: superreview+
Attachment #333655 - Flags: review?(joe)
Attachment #333655 - Flags: review+
pushed in ba60f155cc44
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Core → Core Graveyard
Component: GFX → GFX: Color Management
Product: Core Graveyard → Core
QA Contact: general → color-management