Closed Bug 869829 Opened 12 years ago Closed 11 years ago

Add a ColorMatrix CPU path to Moz2D (with SSE2 and NEON support)

Categories

(Core :: Graphics, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED DUPLICATE of bug 924102

People

(Reporter: gal, Assigned: gal)

References

Details

Attachments

(1 file, 3 obsolete files)

Most (non-custom) CSS filter effects are basic color matrix filters. Originally I wanted to use Skia as CPU fallback implementation, but turns out the Skia code is pretty slow, so we can as well roll our own (Blur2D-like). This is a snapshot of my current state. The NEON code is untested. The SSE2 vector implementation itself is tested, the ColorMatrix code is not yet. My goal is to add more tests, and then land this and convert BlurSSE2 over to use SIMD.h as well (blurring is insanely slow on ARM, lets fix that since we are already in Rome).
Blocks: 869828
No longer blocks: 869828
Blocks: 869828
Attached patch patch, WIP (obsolete) — Splinter Review
Assignee: nobody → gal
Looks like -O9 and -ffast-math and -fno-rtti and -fno-exceptions all have no effect on performance, which is good. furball:2d gal$ llvm-gcc -O9 -ffast-math ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m64 -o test && ./test FPU 1.728736 45.491735mp/s SSE2 0.471666 166.734935mp/s furball:2d gal$ llvm-gcc -fno-rtti -fno-exceptions -O9 -ffast-math ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m64 -o test && ./test FPU 1.742586 45.130169mp/s SSE2 0.473836 165.971349mp/s furball:2d gal$ llvm-gcc -O9 -ffast-math ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m64 -o test && ./test FPU 1.741896 45.148046mp/s SSE2 0.473000 166.264693mp/s furball:2d gal$ llvm-gcc -O3 -ffast-math ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m64 -o test && ./test FPU 1.736425 45.290295mp/s SSE2 0.473575 166.062820mp/s furball:2d gal$ llvm-gcc -O3 ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m64 -o test && ./test FPU 1.728127 45.507767mp/s SSE2 0.473695 166.020752mp/s furball:2d gal$ llvm-gcc -O3 ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m32 -o test && ./test FPU 1.722271 45.662500mp/s SSE2 0.478623 164.311368mp/s furball:2d gal$ llvm-gcc -O3 ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m32 -o test && ./test Also, the 32-bit version is just as fast as the 64-bit version, despite a lot more spilling, for the FPU and the SSE2 path, which is also good.
Attached file Tests for SIMD.h (obsolete) —
Tests for SIMD.h. This passes on x86 (comparing SSE2 and FPU).
FPU/arm: 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 3.000000 4.000000 1.000000 2.000000 3.000000 4.000000 1.000000 1.000000 1.000000 1.000000 2.000000 2.000000 2.000000 2.000000 3.000000 3.000000 3.000000 3.000000 4.000000 4.000000 4.000000 4.000000 5.000000 2.000000 3.000000 4.000000 1.000000 5.000000 3.000000 4.000000 1.000000 2.000000 5.000000 4.000000 1.000000 2.000000 3.000000 5.000000 7.000000 7.000000 7.000000 7.000000 3.500000 3.500000 3.500000 3.500000 Inf Inf Inf Inf 0.000000 0.000000 0.000000 0.000000 14.000000 14.000000 14.000000 14.000000 1.000000 2.000000 3.000000 4.000000 00000000 00000000 00000000 00000000 00000001 00000001 00000001 00000001 00000001 00000002 00000003 00000004 00000001 00000002 00000003 00000004 00000001 00000001 00000001 00000001 00000002 00000002 00000002 00000002 00000003 00000003 00000003 00000003 00000004 00000004 00000004 00000004 00000067 00000045 00000023 00000001 67452312 67452312 67452312 67452312 0000ffff 0000ffff 0000ffff 0000ffff 00000001 00000002 00000003 00000004 00000000 00000001 00000002 00000003 NEON: 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 3.000000 4.000000 1.000000 2.000000 3.000000 4.000000 1.000000 1.000000 1.000000 1.000000 2.000000 2.000000 2.000000 2.000000 3.000000 3.000000 3.000000 3.000000 4.000000 4.000000 4.000000 4.000000 5.000000 2.000000 3.000000 4.000000 1.000000 5.000000 3.000000 4.000000 1.000000 2.000000 5.000000 4.000000 1.000000 2.000000 3.000000 5.000000 7.000000 7.000000 7.000000 7.000000 3.499987 3.499987 3.499987 3.499987 Inf Inf Inf Inf 0.000000 0.000000 0.000000 0.000000 14.000000 14.000000 14.000000 14.000000 1.000000 2.000000 3.000000 4.000000 3f800000 40000000 40400000 40800000 00000000 00000000 00000000 00000000 00000001 00000001 00000001 00000001 00000001 00000002 00000003 00000004 00000001 00000002 00000003 00000004 00000001 00000001 00000001 00000001 00000002 00000002 00000002 00000002 00000003 00000003 00000003 00000003 00000004 00000004 00000004 00000004 00028168 00000000 81e80002 00000000 67452312 67452312 67452312 67452312 00000001 00000002 00000003 00000004 00000002 000088cc 00008960 00000001 Reference (SSE2/FPU on x86): 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 3.000000 4.000000 1.000000 2.000000 3.000000 4.000000 1.000000 1.000000 1.000000 1.000000 2.000000 2.000000 2.000000 2.000000 3.000000 3.000000 3.000000 3.000000 4.000000 4.000000 4.000000 4.000000 5.000000 2.000000 3.000000 4.000000 1.000000 5.000000 3.000000 4.000000 1.000000 2.000000 5.000000 4.000000 1.000000 2.000000 3.000000 5.000000 7.000000 7.000000 7.000000 7.000000 3.500000 3.500000 3.500000 3.500000 inf inf inf inf 0.000000 0.000000 0.000000 0.000000 14.000000 14.000000 14.000000 14.000000 1.000000 2.000000 3.000000 4.000000 00000000 00000000 00000000 00000000 00000001 00000001 00000001 00000001 00000001 00000002 00000003 00000004 00000001 00000002 00000003 00000004 00000001 00000001 00000001 00000001 00000002 00000002 00000002 00000002 00000003 00000003 00000003 00000003 00000004 00000004 00000004 00000004 00000067 00000045 00000023 00000001 67452312 67452312 67452312 67452312 0000ffff 0000ffff 0000ffff 0000ffff 00000001 00000002 00000003 00000004 00000000 00000001 00000002 00000003 Not bad for untested code. Looks like integer shuffling is a bit borked. Probably easy to fix. Division seems to be a bit imprecise. For this specific use case thats probably pretty ok.
0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 3.000000 4.000000 1.000000 2.000000 3.000000 4.000000 1.000000 1.000000 1.000000 1.000000 2.000000 2.000000 2.000000 2.000000 3.000000 3.000000 3.000000 3.000000 4.000000 4.000000 4.000000 4.000000 5.000000 2.000000 3.000000 4.000000 1.000000 5.000000 3.000000 4.000000 1.000000 2.000000 5.000000 4.000000 1.000000 2.000000 3.000000 5.000000 7.000000 7.000000 7.000000 7.000000 3.499987 3.499987 3.499987 3.499987 Inf Inf Inf Inf 0.000000 0.000000 0.000000 0.000000 14.000000 14.000000 14.000000 14.000000 1.000000 2.000000 3.000000 4.000000 00000000 00000000 00000000 00000000 00000001 00000001 00000001 00000001 00000001 00000002 00000003 00000004 00000001 00000002 00000003 00000004 00000001 00000001 00000001 00000001 00000002 00000002 00000002 00000002 00000003 00000003 00000003 00000003 00000004 00000004 00000004 00000004 a4b89b18 b0000000 31f83300 b001b001 67452312 67452312 67452312 67452312 0000ffff 0000ffff 0000ffff 0000ffff 00000001 00000002 00000003 00000004 00000000 00000001 00000002 00000003 One failing test remaining. Turns out my printing code in the test was wrong, which caused most of the failures.
Attachment #746813 - Attachment is obsolete: true
Attachment #746818 - Attachment is obsolete: true
The FPU version is ungodly slow on my Android device. Not really surprising though. We are likely spilling ourselves to death, and the i2f round trip is costly. The NEON path is much more acceptable (still 10x slower than my beefy laptop FPU, and 30x slower than SSE2). We really don't want to end up on the CPU/NEON path on mobile. This makes me pretty hopeful for being able to speed up Blur a lot. FPU 105.493011 0.745483mp/s NEON 16.183075 4.859596mp/s
prefetching is a 50% speedup: 12.184723 6.454246mp/s So we are definitely memory bound here, which makes me somewhat more confident that the code we emit with NEON is reasonable/sufficient.
For a matrix where we don't have to un-premultiply (doesn't munge alpha) x86 looks like this: FPU 1.051189 74.813568mp/s SSE2 0.265869 295.796802mp/s and ARM: NEON 9.881073 7.958974mp/s
Attached patch patch to landSplinter Review
Attachment #747085 - Attachment is obsolete: true
Comment on attachment 747674 [details] [diff] [review] patch to land I would like to go ahead and land this and then do a follow-up patch that adds some missing functionality for BlurSS2.cpp and turns BlurSSE2.cpp into BlurSIMD.cpp using SIMD.h so we get NEON support there, before I try to land the Moz2D interface next. Makes sense?
Attachment #747674 - Flags: review?(bas)
Comment on attachment 747674 [details] [diff] [review] patch to land Review of attachment 747674 [details] [diff] [review]: ----------------------------------------------------------------- I'll look at this tomorrow. Derf should look at this to as he's better able to find any inefficiencies in the SIMD code.
Attachment #747674 - Flags: review?(tterribe)
Please feel free to bike-shed the naming convention for vec4f/vec4i and also the convention for methods on it (since we are about to add more for Blur). I am not a huge fan of the current convention, but I have no ideas what else to use. For methods we could use more expressive names and caps (PackAndSaturateU8?).
(In reply to Bas Schouten (:bas.schouten) from comment #12) > I'll look at this tomorrow. Derf should look at this to as he's better able > to find any inefficiencies in the SIMD code. ColorMatrixSIMD.cpp appears to be missing?
oops. Its just an #include of ColorMatrix-inline.h with #define USE_SIMD. Will add. Also, I have a long plane ride tomorrow. Please feel free to bike shed the name and structure of the vector class. I don't particularly like the current setup myself. One option would be a scalar implementation of a template template <typename T, int N> Vector { }; with specializations for Vector<float, 4> for SSE2 and NEON etc and then maybe a vec4i shorthand?
is this still needed ?
An SSE2 ColorMatrix implementation was added in bug 924102, and bug 961759 tracks the NEON implementation. I'll close this bug.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
Attachment #747674 - Flags: review?(tterribe)
Attachment #747674 - Flags: review?(bas)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: