Closed
Bug 869829
Opened 12 years ago
Closed 11 years ago
Add a ColorMatrix CPU path to Moz2D (with SSE2 and NEON support)
Categories
(Core :: Graphics, defect)
Tracking
()
RESOLVED
DUPLICATE
of bug 924102
People
(Reporter: gal, Assigned: gal)
References
Details
Attachments
(1 file, 3 obsolete files)
24.18 KB,
patch
|
Details | Diff | Splinter Review |
Most (non-custom) CSS filter effects are basic color matrix filters. Originally I wanted to use Skia as CPU fallback implementation, but turns out the Skia code is pretty slow, so we can as well roll our own (Blur2D-like).
This is a snapshot of my current state. The NEON code is untested. The SSE2 vector implementation itself is tested, the ColorMatrix code is not yet. My goal is to add more tests, and then land this and convert BlurSSE2 over to use SIMD.h as well (blurring is insanely slow on ARM, lets fix that since we are already in Rome).
Assignee | ||
Comment 1•12 years ago
|
||
Assignee: nobody → gal
Assignee | ||
Comment 2•12 years ago
|
||
Looks like -O9 and -ffast-math and -fno-rtti and -fno-exceptions all have no effect on performance, which is good.
furball:2d gal$ llvm-gcc -O9 -ffast-math ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m64 -o test && ./test
FPU 1.728736 45.491735mp/s
SSE2 0.471666 166.734935mp/s
furball:2d gal$ llvm-gcc -fno-rtti -fno-exceptions -O9 -ffast-math ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m64 -o test && ./test
FPU 1.742586 45.130169mp/s
SSE2 0.473836 165.971349mp/s
furball:2d gal$ llvm-gcc -O9 -ffast-math ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m64 -o test && ./test
FPU 1.741896 45.148046mp/s
SSE2 0.473000 166.264693mp/s
furball:2d gal$ llvm-gcc -O3 -ffast-math ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m64 -o test && ./test
FPU 1.736425 45.290295mp/s
SSE2 0.473575 166.062820mp/s
furball:2d gal$ llvm-gcc -O3 ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m64 -o test && ./test
FPU 1.728127 45.507767mp/s
SSE2 0.473695 166.020752mp/s
furball:2d gal$ llvm-gcc -O3 ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m32 -o test && ./test
FPU 1.722271 45.662500mp/s
SSE2 0.478623 164.311368mp/s
furball:2d gal$ llvm-gcc -O3 ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m32 -o test && ./test
Also, the 32-bit version is just as fast as the 64-bit version, despite a lot more spilling, for the FPU and the SSE2 path, which is also good.
Assignee | ||
Comment 3•12 years ago
|
||
Tests for SIMD.h. This passes on x86 (comparing SSE2 and FPU).
Assignee | ||
Comment 4•12 years ago
|
||
FPU/arm:
0.000000 0.000000 0.000000 0.000000
1.000000 1.000000 1.000000 1.000000
1.000000 2.000000 3.000000 4.000000
1.000000 2.000000 3.000000 4.000000
1.000000 1.000000 1.000000 1.000000
2.000000 2.000000 2.000000 2.000000
3.000000 3.000000 3.000000 3.000000
4.000000 4.000000 4.000000 4.000000
5.000000 2.000000 3.000000 4.000000
1.000000 5.000000 3.000000 4.000000
1.000000 2.000000 5.000000 4.000000
1.000000 2.000000 3.000000 5.000000
7.000000 7.000000 7.000000 7.000000
3.500000 3.500000 3.500000 3.500000
Inf Inf Inf Inf
0.000000 0.000000 0.000000 0.000000
14.000000 14.000000 14.000000 14.000000
1.000000 2.000000 3.000000 4.000000
00000000 00000000 00000000 00000000
00000001 00000001 00000001 00000001
00000001 00000002 00000003 00000004
00000001 00000002 00000003 00000004
00000001 00000001 00000001 00000001
00000002 00000002 00000002 00000002
00000003 00000003 00000003 00000003
00000004 00000004 00000004 00000004
00000067 00000045 00000023 00000001
67452312 67452312 67452312 67452312
0000ffff 0000ffff 0000ffff 0000ffff
00000001 00000002 00000003 00000004
00000000 00000001 00000002 00000003
NEON:
0.000000 0.000000 0.000000 0.000000
1.000000 1.000000 1.000000 1.000000
1.000000 2.000000 3.000000 4.000000
1.000000 2.000000 3.000000 4.000000
1.000000 1.000000 1.000000 1.000000
2.000000 2.000000 2.000000 2.000000
3.000000 3.000000 3.000000 3.000000
4.000000 4.000000 4.000000 4.000000
5.000000 2.000000 3.000000 4.000000
1.000000 5.000000 3.000000 4.000000
1.000000 2.000000 5.000000 4.000000
1.000000 2.000000 3.000000 5.000000
7.000000 7.000000 7.000000 7.000000
3.499987 3.499987 3.499987 3.499987
Inf Inf Inf Inf
0.000000 0.000000 0.000000 0.000000
14.000000 14.000000 14.000000 14.000000
1.000000 2.000000 3.000000 4.000000
3f800000 40000000 40400000 40800000
00000000 00000000 00000000 00000000
00000001 00000001 00000001 00000001
00000001 00000002 00000003 00000004
00000001 00000002 00000003 00000004
00000001 00000001 00000001 00000001
00000002 00000002 00000002 00000002
00000003 00000003 00000003 00000003
00000004 00000004 00000004 00000004
00028168 00000000 81e80002 00000000
67452312 67452312 67452312 67452312
00000001 00000002 00000003 00000004
00000002 000088cc 00008960 00000001
Reference (SSE2/FPU on x86):
0.000000 0.000000 0.000000 0.000000
1.000000 1.000000 1.000000 1.000000
1.000000 2.000000 3.000000 4.000000
1.000000 2.000000 3.000000 4.000000
1.000000 1.000000 1.000000 1.000000
2.000000 2.000000 2.000000 2.000000
3.000000 3.000000 3.000000 3.000000
4.000000 4.000000 4.000000 4.000000
5.000000 2.000000 3.000000 4.000000
1.000000 5.000000 3.000000 4.000000
1.000000 2.000000 5.000000 4.000000
1.000000 2.000000 3.000000 5.000000
7.000000 7.000000 7.000000 7.000000
3.500000 3.500000 3.500000 3.500000
inf inf inf inf
0.000000 0.000000 0.000000 0.000000
14.000000 14.000000 14.000000 14.000000
1.000000 2.000000 3.000000 4.000000
00000000 00000000 00000000 00000000
00000001 00000001 00000001 00000001
00000001 00000002 00000003 00000004
00000001 00000002 00000003 00000004
00000001 00000001 00000001 00000001
00000002 00000002 00000002 00000002
00000003 00000003 00000003 00000003
00000004 00000004 00000004 00000004
00000067 00000045 00000023 00000001
67452312 67452312 67452312 67452312
0000ffff 0000ffff 0000ffff 0000ffff
00000001 00000002 00000003 00000004
00000000 00000001 00000002 00000003
Not bad for untested code. Looks like integer shuffling is a bit borked. Probably easy to fix. Division seems to be a bit imprecise. For this specific use case thats probably pretty ok.
Assignee | ||
Comment 5•12 years ago
|
||
0.000000 0.000000 0.000000 0.000000
1.000000 1.000000 1.000000 1.000000
1.000000 2.000000 3.000000 4.000000
1.000000 2.000000 3.000000 4.000000
1.000000 1.000000 1.000000 1.000000
2.000000 2.000000 2.000000 2.000000
3.000000 3.000000 3.000000 3.000000
4.000000 4.000000 4.000000 4.000000
5.000000 2.000000 3.000000 4.000000
1.000000 5.000000 3.000000 4.000000
1.000000 2.000000 5.000000 4.000000
1.000000 2.000000 3.000000 5.000000
7.000000 7.000000 7.000000 7.000000
3.499987 3.499987 3.499987 3.499987
Inf Inf Inf Inf
0.000000 0.000000 0.000000 0.000000
14.000000 14.000000 14.000000 14.000000
1.000000 2.000000 3.000000 4.000000
00000000 00000000 00000000 00000000
00000001 00000001 00000001 00000001
00000001 00000002 00000003 00000004
00000001 00000002 00000003 00000004
00000001 00000001 00000001 00000001
00000002 00000002 00000002 00000002
00000003 00000003 00000003 00000003
00000004 00000004 00000004 00000004
a4b89b18 b0000000 31f83300 b001b001
67452312 67452312 67452312 67452312
0000ffff 0000ffff 0000ffff 0000ffff
00000001 00000002 00000003 00000004
00000000 00000001 00000002 00000003
One failing test remaining. Turns out my printing code in the test was wrong, which caused most of the failures.
Assignee | ||
Comment 6•12 years ago
|
||
Attachment #746813 -
Attachment is obsolete: true
Attachment #746818 -
Attachment is obsolete: true
Assignee | ||
Comment 7•12 years ago
|
||
The FPU version is ungodly slow on my Android device. Not really surprising though. We are likely spilling ourselves to death, and the i2f round trip is costly. The NEON path is much more acceptable (still 10x slower than my beefy laptop FPU, and 30x slower than SSE2). We really don't want to end up on the CPU/NEON path on mobile. This makes me pretty hopeful for being able to speed up Blur a lot.
FPU 105.493011 0.745483mp/s
NEON 16.183075 4.859596mp/s
Assignee | ||
Comment 8•12 years ago
|
||
prefetching is a 50% speedup: 12.184723 6.454246mp/s
So we are definitely memory bound here, which makes me somewhat more confident that the code we emit with NEON is reasonable/sufficient.
Assignee | ||
Comment 9•12 years ago
|
||
For a matrix where we don't have to un-premultiply (doesn't munge alpha) x86 looks like this:
FPU 1.051189 74.813568mp/s
SSE2 0.265869 295.796802mp/s
and ARM:
NEON 9.881073 7.958974mp/s
Assignee | ||
Comment 10•12 years ago
|
||
Attachment #747085 -
Attachment is obsolete: true
Assignee | ||
Comment 11•12 years ago
|
||
Comment on attachment 747674 [details] [diff] [review]
patch to land
I would like to go ahead and land this and then do a follow-up patch that adds some missing functionality for BlurSS2.cpp and turns BlurSSE2.cpp into BlurSIMD.cpp using SIMD.h so we get NEON support there, before I try to land the Moz2D interface next. Makes sense?
Attachment #747674 -
Flags: review?(bas)
Comment 12•12 years ago
|
||
Comment on attachment 747674 [details] [diff] [review]
patch to land
Review of attachment 747674 [details] [diff] [review]:
-----------------------------------------------------------------
I'll look at this tomorrow. Derf should look at this to as he's better able to find any inefficiencies in the SIMD code.
Attachment #747674 -
Flags: review?(tterribe)
Assignee | ||
Comment 13•12 years ago
|
||
Please feel free to bike-shed the naming convention for vec4f/vec4i and also the convention for methods on it (since we are about to add more for Blur). I am not a huge fan of the current convention, but I have no ideas what else to use. For methods we could use more expressive names and caps (PackAndSaturateU8?).
Comment 14•12 years ago
|
||
(In reply to Bas Schouten (:bas.schouten) from comment #12)
> I'll look at this tomorrow. Derf should look at this to as he's better able
> to find any inefficiencies in the SIMD code.
ColorMatrixSIMD.cpp appears to be missing?
Assignee | ||
Comment 15•12 years ago
|
||
oops. Its just an #include of ColorMatrix-inline.h with #define USE_SIMD. Will add. Also, I have a long plane ride tomorrow. Please feel free to bike shed the name and structure of the vector class. I don't particularly like the current setup myself.
One option would be a scalar implementation of a template
template <typename T, int N>
Vector {
};
with specializations for
Vector<float, 4> for SSE2 and NEON etc and then maybe a vec4i shorthand?
Comment 16•11 years ago
|
||
is this still needed ?
Comment 17•11 years ago
|
||
An SSE2 ColorMatrix implementation was added in bug 924102, and bug 961759 tracks the NEON implementation.
I'll close this bug.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
Updated•11 years ago
|
Attachment #747674 -
Flags: review?(tterribe)
Attachment #747674 -
Flags: review?(bas)
You need to log in
before you can comment on or make changes to this bug.
Description
•