Open Bug 594864 Opened 14 years ago Updated 2 years ago

Vectorize RGB -> FRGB conversions

Categories

(Core :: Graphics, defect)

x86_64
Linux
defect

People

(Reporter: justin.lebar+bug, Unassigned)

Details

The conversion from RGB to FRGB (e.g. in nsJPEGDecoder::OutputScanlines()) shows up in my profiles and is something we could vectorize.

On x86, I think it could be done using the PSHUFB instruction from SSSE3. Unfortunately that instruction is available only on relatively new chips (Atom, Core 2 Duo, i{3,5,7}). Perhaps there's a clever way to do without the shuffle instruction.

On ARM, this should be really easy. Do a NEON interleaved load into three registers followed by an interleaved store of four registers (the three you loaded, plus a register of all 1s). See [1] for something similar.

[1] http://blogs.arm.com/software-enablement/coding-for-neon-part-1-load-and-stores/
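For illustration, here is a minimal sketch of the two approaches described above, written with C intrinsics. It is not code from the tree: the function name rgb_to_frgb_simd is hypothetical, and the output layout (source bytes kept in memory order, with 0xFF as the fourth byte of each pixel) is an assumption rather than the decoder's confirmed format. The SSSE3 path uses one PSHUFB per four pixels; the NEON path uses an interleaved three-register load and a four-register store. Builds with neither extension enabled fall through to the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: expand npixels packed 3-byte pixels at src into
   4-byte pixels at dst, setting the fourth byte of each pixel to 0xFF.
   The byte order within a pixel is an assumption of this sketch. */
static void rgb_to_frgb_simd(const uint8_t *src, uint8_t *dst, size_t npixels)
{
    size_t i = 0;

#if defined(__SSSE3__)
    /* Shuffle control: spread 12 input bytes over 16 output bytes.
       A control byte with its high bit set (-1) yields zero. */
    const __m128i shuf = _mm_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1,
                                       6, 7, 8, -1, 9, 10, 11, -1);
    /* 0xFF in every fourth byte, OR-ed in after the shuffle. */
    const __m128i fill = _mm_setr_epi8(0, 0, 0, -1, 0, 0, 0, -1,
                                       0, 0, 0, -1, 0, 0, 0, -1);
    /* Each iteration loads 16 bytes but consumes only 12, so iterate
       only while a full 16-byte load stays inside the input. */
    for (; 3 * i + 16 <= 3 * npixels; i += 4) {
        __m128i in = _mm_loadu_si128((const __m128i *)(src + 3 * i));
        __m128i out = _mm_or_si128(_mm_shuffle_epi8(in, shuf), fill);
        _mm_storeu_si128((__m128i *)(dst + 4 * i), out);
    }
#elif defined(__ARM_NEON)
    /* De-interleave 16 pixels into three registers, then store them
       interleaved together with a register of all 1s. */
    for (; i + 16 <= npixels; i += 16) {
        uint8x16x3_t in = vld3q_u8(src + 3 * i);
        uint8x16x4_t out;
        out.val[0] = in.val[0];
        out.val[1] = in.val[1];
        out.val[2] = in.val[2];
        out.val[3] = vdupq_n_u8(0xFF);
        vst4q_u8(dst + 4 * i, out);
    }
#endif

    /* Scalar tail (and the whole job when no SIMD path is compiled in). */
    for (; i < npixels; i++) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 0xFF;
    }
}
```

The SSSE3 path only compiles in when the build enables it (e.g. -mssse3 with GCC or Clang); a shipping version would presumably need runtime CPU detection instead, which this sketch leaves out.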
The fastest pre-SSSE3 method is probably to just use small, unaligned loads, e.g.:

  movd mm0, (src)
  punpckldq mm0, (src+3)
  movd mm1, (src+6)
  punpckldq mm1, (src+9)
  por mm0,0xFF000000FF000000 (e.g., load this constant in mm7)
  por mm1,0xFF000000FF000000
  movq (dst),mm0
  movq (dst+8),mm1

Unroll to taste. Note the last read reads 5 bytes past the end of the 12 input bytes.

You may want to experiment with movq2dq/punpcklqdq to reduce the number of stores and allow more unrolling, but I'm skeptical that it would actually be faster, especially on older machines where movq2dq was more expensive (8 cycle latency on a P4!). movntq may be better than movq if you've got SSE (unless you're on an Athlon, in which case it's much slower).
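The same idea (one small unaligned load per pixel, then an OR with a constant) can be sketched in portable C. This is only an illustration under stated assumptions: the name rgb_to_frgb_loads is hypothetical, the 0xFF fill byte is assumed to be the fourth byte of each output pixel in memory, and memcpy stands in for the unaligned loads and stores (compilers typically lower it to a single move). Unlike the assembly, the sketch handles the last pixel bytewise so that it never reads past the end of the input.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: expand npixels packed 3-byte pixels at src into
   4-byte pixels at dst, with 0xFF as the fourth byte of each pixel. */
static void rgb_to_frgb_loads(const uint8_t *src, uint8_t *dst, size_t npixels)
{
    /* Build the OR constant from bytes so it is correct on either
       endianness: 0xFF lands in the fourth byte in memory. */
    static const uint8_t fill_bytes[4] = { 0, 0, 0, 0xFF };
    uint32_t fill;
    size_t i = 0;

    memcpy(&fill, fill_bytes, sizeof fill);

    /* Each 4-byte load picks up one byte of the following pixel, which
       the OR overwrites with 0xFF. Stop one pixel early so the load
       always stays inside the input buffer. */
    for (; i + 1 < npixels; i++) {
        uint32_t px;
        memcpy(&px, src + 3 * i, sizeof px);   /* unaligned 32-bit load  */
        px |= fill;                            /* force fourth byte to FF */
        memcpy(dst + 4 * i, &px, sizeof px);   /* unaligned 32-bit store */
    }

    /* Last pixel: copy bytewise to avoid reading past the input. */
    if (i < npixels) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 0xFF;
    }
}
```

Whether this beats a lookup table or a shuffle-based version would need to be measured on the target hardware.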
A year ago, I filed bug 496503 for testing. It is a 25%-50% win for conversion.
(In reply to comment #2)
> a year ago, I filed bug 496503 for testing. It is 25%-50% win for conversion.

FWIW, the lookup table that's currently in there gave me a 50% speedup on canvas (bug 519400 comment 21). But I think this is still worth experimenting with.
(In reply to comment #3)
> (In reply to comment #2)
> > a year ago, I filed bug 496503 for testing. It is 25%-50% win for conversion.
>
> FWIW, the lookup table that's currently in there gave me a 50% speedup on
> canvas (bug 519400 comment 21). But I think this is still worth experimenting
> with.

I believe that using a table for GFX_PACKED_PIXEL is faster than SSE2 code like the code in bug 496503. When I tested replacing the table from bug 519400 with SSE2, SSE2 was slower than the table in overall canvas testing, because a SIMD version of GFX_PACKED_PIXEL needs many multiplies and shuffles. With AVX, though, it may be faster than the table, since AVX is 256-bit SIMD.
Severity: normal → S3