Open
Bug 594864
Opened 14 years ago
Updated 2 years ago
Vectorize RGB -> FRGB conversions
Categories
(Core :: Graphics, defect)
NEW
People
(Reporter: justin.lebar+bug, Unassigned)
Details
The conversion from RGB to FRGB (e.g. in nsJPEGDecoder::OutputScanlines()) shows up in my profiles and is something we could vectorize.
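For reference, a minimal scalar sketch of the conversion being discussed. The function name and the exact output layout (one 32-bit word per pixel, 0xFFRRGGBB) are assumptions for illustration and are not taken from the decoder source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar baseline (hypothetical helper, not the decoder's code):
 * expand packed 3-byte R,G,B pixels into 32-bit words with the top
 * byte forced to 0xFF. The 0xFFRRGGBB layout is an assumption. */
static void rgb_to_frgb_scalar(const uint8_t *src, uint32_t *dst, size_t npixels)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < npixels; i++) {
        uint32_t r = src[3 * i + 0];
        uint32_t g = src[3 * i + 1];
        uint32_t b = src[3 * i + 2];
        dst[i] = 0xFF000000u | (r << 16) | (g << 8) | b;
    }
}
```

Any vectorized version has to produce the same words as this loop, so it doubles as a test oracle.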
On x86, I think it could be done using the PSHUFB instruction from SSSE3. Unfortunately that instruction is available only on relatively new chips (Atom, Core 2 Duo, i{3,5,7}). Perhaps there's a clever way to do without the shuffle instruction.
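A rough sketch of what the PSHUFB approach might look like with intrinsics, assuming GCC or Clang and an output layout of one 0xFFRRGGBB word per pixel. The function names are hypothetical. Because SSSE3 is not available everywhere, the sketch checks for it at run time and falls back to a scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 is PSHUFB */
#define HAVE_SSSE3_PATH 1

/* Converts 4 pixels per iteration; returns how many pixels were done. */
__attribute__((target("ssse3")))
static size_t frgb_ssse3_bulk(const uint8_t *src, uint32_t *dst, size_t npixels)
{
    /* Output byte j takes input byte mask[j]; -1 (high bit set) yields 0.
     * Memory order per pixel becomes B,G,R,0 = 0x00RRGGBB little-endian. */
    const __m128i mask = _mm_setr_epi8(2, 1, 0, -1, 5, 4, 3, -1,
                                       8, 7, 6, -1, 11, 10, 9, -1);
    const __m128i alpha = _mm_set1_epi32((int)0xFF000000u);
    size_t i = 0;
    /* Each load reads 16 bytes but consumes 12, so stop while a whole
     * 16-byte load still fits inside the input buffer. */
    for (; 3 * i + 16 <= 3 * npixels; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + 3 * i));
        v = _mm_or_si128(_mm_shuffle_epi8(v, mask), alpha);
        _mm_storeu_si128((__m128i *)(dst + i), v);
    }
    return i;
}
#endif

static void rgb_to_frgb(const uint8_t *src, uint32_t *dst, size_t npixels)
{
    size_t i = 0;
    assert(src != NULL && dst != NULL);
#ifdef HAVE_SSSE3_PATH
    if (__builtin_cpu_supports("ssse3")) /* GCC/Clang builtin */
        i = frgb_ssse3_bulk(src, dst, npixels);
#endif
    for (; i < npixels; i++) { /* scalar tail, or the whole image */
        dst[i] = 0xFF000000u | ((uint32_t)src[3 * i] << 16) |
                 ((uint32_t)src[3 * i + 1] << 8) | src[3 * i + 2];
    }
}
```

The scalar tail also covers the last few pixels, where a 16-byte load would run past the end of the input.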
On ARM, this should be really easy. Do a NEON interleaved load into three registers followed by an interleaved store of four registers (the three you loaded, plus a register of all 1s). See [1] for something similar.
[1] http://blogs.arm.com/software-enablement/coding-for-neon-part-1-load-and-stores/
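A sketch of the NEON idea using the arm_neon.h intrinsics vld3q_u8 and vst4q_u8. The function name is hypothetical, and the output byte order in memory (B, G, R, 0xFF, i.e. 0xFFRRGGBB as a little-endian word) is an assumption; the order of the four stored registers would change for a different layout. On targets without NEON only the scalar loop is compiled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_PATH 1
#endif

static void rgb_to_frgb_bytes(const uint8_t *src, uint8_t *dst, size_t npixels)
{
    size_t i = 0;
    assert(src != NULL && dst != NULL);
#ifdef HAVE_NEON_PATH
    for (; i + 16 <= npixels; i += 16) {
        /* De-interleaving load: val[0] = 16 R, val[1] = 16 G, val[2] = 16 B. */
        uint8x16x3_t in = vld3q_u8(src + 3 * i);
        uint8x16x4_t out;
        out.val[0] = in.val[2];        /* B */
        out.val[1] = in.val[1];        /* G */
        out.val[2] = in.val[0];        /* R */
        out.val[3] = vdupq_n_u8(0xFF); /* the register of all 1s */
        /* Interleaving store writes B,G,R,FF for each of the 16 pixels. */
        vst4q_u8(dst + 4 * i, out);
    }
#endif
    for (; i < npixels; i++) { /* scalar tail, or everything without NEON */
        dst[4 * i + 0] = src[3 * i + 2];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 0];
        dst[4 * i + 3] = 0xFF;
    }
}
```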
Comment 1•14 years ago
The fastest pre-SSSE3 method is probably to just use small, unaligned loads, e.g.:
movd mm0, (src)
punpckldq mm0, (src+3)
movd mm1, (src+6)
punpckldq mm1, (src+9)
por mm0,0xFF000000FF000000 (e.g., load this constant in mm7)
por mm1,0xFF000000FF000000
movq (dst),mm0
movq (dst+8),mm1
Unroll to taste. Note that the last read goes 5 bytes past the end of the 12 input bytes. You may want to experiment with movq2dq/punpcklqdq to reduce the number of stores and allow more unrolling, but I'm skeptical that it would actually be faster, especially on older machines where movq2dq was more expensive (8 cycle latency on a P4!). movntq may be better than movq if you've got SSE (unless you're on an Athlon, in which case it's much slower).
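The load pattern above can be modeled in portable C, with a 4-byte memcpy standing in for movd. This is a scalar illustration of the overlapping-load idea, not the MMX code itself; the function name is hypothetical. As in the assembly, the channel order is left unchanged (R, G, B, 0xFF in memory), and the final pixel is handled separately so the sketch never reads past the input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void rgb_to_rgbf_overlapping(const uint8_t *src, uint8_t *dst, size_t npixels)
{
    /* Build the alpha mask from bytes so it lands on the 4th byte in
     * memory regardless of endianness. */
    static const uint8_t alpha_bytes[4] = {0, 0, 0, 0xFF};
    uint32_t alpha;
    size_t i;

    assert(src != NULL && dst != NULL);
    memcpy(&alpha, alpha_bytes, sizeof alpha);
    if (npixels == 0)
        return;

    for (i = 0; i + 1 < npixels; i++) {
        uint32_t px;
        /* Unaligned 4-byte load: R,G,B of this pixel plus the next
         * pixel's R, which the OR below overwrites with 0xFF. */
        memcpy(&px, src + 3 * i, sizeof px);
        px |= alpha;
        memcpy(dst + 4 * i, &px, sizeof px);
    }

    /* Last pixel: a 4-byte load here would read one byte past the end. */
    dst[4 * i + 0] = src[3 * i + 0];
    dst[4 * i + 1] = src[3 * i + 1];
    dst[4 * i + 2] = src[3 * i + 2];
    dst[4 * i + 3] = 0xFF;
}
```

Compilers typically turn each fixed-size memcpy into a single 32-bit load or store, so the loop stays close to the movd sequence.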
Comment 2•14 years ago
A year ago, I filed bug 496503 for testing. It is a 25%-50% win for conversion.
Reporter
Comment 3•14 years ago
(In reply to comment #2)
> a year ago, I filed bug 496503 for testing. It is 25%-50% win for conversion.
FWIW, the lookup table that's currently in there gave me a 50% speedup on canvas (bug 519400 comment 21). But I think this is still worth experimenting with.
Comment 4•14 years ago
(In reply to comment #3)
> (In reply to comment #2)
> > a year ago, I filed bug 496503 for testing. It is 25%-50% win for conversion.
>
> FWIW, the lookup table that's currently in there gave me a 50% speedup on
> canvas (bug 519400 comment 21). But I think this is still worth experimenting
> with.
I believe that using a table for GFX_PACKED_PIXEL is faster than SSE2 code like the code in bug 496503. When I tested replacing the table from bug 519400 with SSE2, the SSE2 version was slower than the table in overall canvas testing, because a SIMD version of GFX_PACKED_PIXEL needs many multiplies and shuffles. With AVX, however, it may be faster than the table, since AVX is 256-bit SIMD.
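To make the comparison concrete, here is a sketch of what a table-based packed pixel might look like. It assumes that GFX_PACKED_PIXEL premultiplies each color channel by alpha, which the mention of multiplies suggests but this thread does not spell out. The names and the round-to-nearest formula are hypothetical and not taken from the patch in bug 519400.

```c
#include <assert.h>
#include <stdint.h>

/* premul_table[a][c] approximates c * a / 255 (assumed rounding). */
static uint8_t premul_table[256][256];
static int premul_table_ready = 0;

static void premul_table_init(void)
{
    for (unsigned a = 0; a < 256; a++)
        for (unsigned c = 0; c < 256; c++)
            premul_table[a][c] = (uint8_t)((a * c + 127) / 255);
    premul_table_ready = 1;
}

/* One table row per alpha value: three lookups replace three
 * multiply-and-divide steps. */
static uint32_t packed_pixel_table(uint8_t a, uint8_t r, uint8_t g, uint8_t b)
{
    const uint8_t *row;
    assert(premul_table_ready);
    row = premul_table[a];
    return ((uint32_t)a << 24) | ((uint32_t)row[r] << 16) |
           ((uint32_t)row[g] << 8) | (uint32_t)row[b];
}
```

The table costs 64 KiB, so whether it beats SIMD arithmetic depends on cache behavior as well as instruction counts.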
Updated•2 years ago
Severity: normal → S3