Closed Bug 869829 Opened 12 years ago Closed 11 years ago

Add a ColorMatrix CPU path to Moz2D (with SSE2 and NEON support)

Tracking

()

Status:

RESOLVED DUPLICATE of bug 924102

People

(Reporter: gal, Assigned: gal)

References

Details

Attachments

(1 file, 3 obsolete files)

patch, WIP 12 years ago Andreas Gal :gal 21.43 KB, patch		Details \| Diff \| Splinter Review
Tests for SIMD.h 12 years ago Andreas Gal :gal 1.87 KB, text/x-c++src		Details
patch, SIMD.h working on all platforms, with TestSIMD.cpp 12 years ago Andreas Gal :gal 24.24 KB, patch		Details \| Diff \| Splinter Review
patch to land 12 years ago Andreas Gal :gal 24.18 KB, patch		Details \| Diff \| Splinter Review

Andreas Gal :gal

Assignee

Description

•

12 years ago

Most (non-custom) CSS filter effects are basic color matrix filters. Originally I wanted to use Skia as CPU fallback implementation, but turns out the Skia code is pretty slow, so we can as well roll our own (Blur2D-like). This is a snapshot of my current state. The NEON code is untested. The SSE2 vector implementation itself is tested, the ColorMatrix code is not yet. My goal is to add more tests, and then land this and convert BlurSSE2 over to use SIMD.h as well (blurring is insanely slow on ARM, lets fix that since we are already in Rome).

Andreas Gal :gal

Assignee

Updated

•

12 years ago

Blocks: 869828

Andreas Gal :gal

Assignee

Updated

•

12 years ago

No longer blocks: 869828

Andreas Gal :gal

Assignee

Updated

•

12 years ago

Blocks: 869828

Andreas Gal :gal

Assignee

Comment 1

•

12 years ago

Attached patch patch, WIP (obsolete) — Details — Splinter Review

Assignee: nobody → gal

Andreas Gal :gal

Assignee

Comment 2

•

12 years ago

Looks like -O9 and -ffast-math and -fno-rtti and -fno-exceptions all have no effect on performance, which is good. furball:2d gal$ llvm-gcc -O9 -ffast-math ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m64 -o test && ./test FPU 1.728736 45.491735mp/s SSE2 0.471666 166.734935mp/s furball:2d gal$ llvm-gcc -fno-rtti -fno-exceptions -O9 -ffast-math ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m64 -o test && ./test FPU 1.742586 45.130169mp/s SSE2 0.473836 165.971349mp/s furball:2d gal$ llvm-gcc -O9 -ffast-math ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m64 -o test && ./test FPU 1.741896 45.148046mp/s SSE2 0.473000 166.264693mp/s furball:2d gal$ llvm-gcc -O3 -ffast-math ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m64 -o test && ./test FPU 1.736425 45.290295mp/s SSE2 0.473575 166.062820mp/s furball:2d gal$ llvm-gcc -O3 ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m64 -o test && ./test FPU 1.728127 45.507767mp/s SSE2 0.473695 166.020752mp/s furball:2d gal$ llvm-gcc -O3 ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m32 -o test && ./test FPU 1.722271 45.662500mp/s SSE2 0.478623 164.311368mp/s furball:2d gal$ llvm-gcc -O3 ColorMatrix.cpp ColorMatrixSIMD.cpp TimeSIMD.cpp -m32 -o test && ./test Also, the 32-bit version is just as fast as the 64-bit version, despite a lot more spilling, for the FPU and the SSE2 path, which is also good.

Andreas Gal :gal

Assignee

Comment 3

•

12 years ago

Attached file Tests for SIMD.h (obsolete) — Details

Tests for SIMD.h. This passes on x86 (comparing SSE2 and FPU).

Andreas Gal :gal

Assignee

Comment 4

•

12 years ago

FPU/arm: 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 3.000000 4.000000 1.000000 2.000000 3.000000 4.000000 1.000000 1.000000 1.000000 1.000000 2.000000 2.000000 2.000000 2.000000 3.000000 3.000000 3.000000 3.000000 4.000000 4.000000 4.000000 4.000000 5.000000 2.000000 3.000000 4.000000 1.000000 5.000000 3.000000 4.000000 1.000000 2.000000 5.000000 4.000000 1.000000 2.000000 3.000000 5.000000 7.000000 7.000000 7.000000 7.000000 3.500000 3.500000 3.500000 3.500000 Inf Inf Inf Inf 0.000000 0.000000 0.000000 0.000000 14.000000 14.000000 14.000000 14.000000 1.000000 2.000000 3.000000 4.000000 00000000 00000000 00000000 00000000 00000001 00000001 00000001 00000001 00000001 00000002 00000003 00000004 00000001 00000002 00000003 00000004 00000001 00000001 00000001 00000001 00000002 00000002 00000002 00000002 00000003 00000003 00000003 00000003 00000004 00000004 00000004 00000004 00000067 00000045 00000023 00000001 67452312 67452312 67452312 67452312 0000ffff 0000ffff 0000ffff 0000ffff 00000001 00000002 00000003 00000004 00000000 00000001 00000002 00000003 NEON: 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 3.000000 4.000000 1.000000 2.000000 3.000000 4.000000 1.000000 1.000000 1.000000 1.000000 2.000000 2.000000 2.000000 2.000000 3.000000 3.000000 3.000000 3.000000 4.000000 4.000000 4.000000 4.000000 5.000000 2.000000 3.000000 4.000000 1.000000 5.000000 3.000000 4.000000 1.000000 2.000000 5.000000 4.000000 1.000000 2.000000 3.000000 5.000000 7.000000 7.000000 7.000000 7.000000 3.499987 3.499987 3.499987 3.499987 Inf Inf Inf Inf 0.000000 0.000000 0.000000 0.000000 14.000000 14.000000 14.000000 14.000000 1.000000 2.000000 3.000000 4.000000 3f800000 40000000 40400000 40800000 00000000 00000000 00000000 00000000 00000001 00000001 00000001 00000001 00000001 00000002 00000003 00000004 00000001 00000002 00000003 00000004 00000001 00000001 00000001 00000001 00000002 00000002 00000002 00000002 00000003 00000003 00000003 00000003 00000004 00000004 00000004 00000004 00028168 00000000 81e80002 00000000 67452312 67452312 67452312 67452312 00000001 00000002 00000003 00000004 00000002 000088cc 00008960 00000001 Reference (SSE2/FPU on x86): 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 3.000000 4.000000 1.000000 2.000000 3.000000 4.000000 1.000000 1.000000 1.000000 1.000000 2.000000 2.000000 2.000000 2.000000 3.000000 3.000000 3.000000 3.000000 4.000000 4.000000 4.000000 4.000000 5.000000 2.000000 3.000000 4.000000 1.000000 5.000000 3.000000 4.000000 1.000000 2.000000 5.000000 4.000000 1.000000 2.000000 3.000000 5.000000 7.000000 7.000000 7.000000 7.000000 3.500000 3.500000 3.500000 3.500000 inf inf inf inf 0.000000 0.000000 0.000000 0.000000 14.000000 14.000000 14.000000 14.000000 1.000000 2.000000 3.000000 4.000000 00000000 00000000 00000000 00000000 00000001 00000001 00000001 00000001 00000001 00000002 00000003 00000004 00000001 00000002 00000003 00000004 00000001 00000001 00000001 00000001 00000002 00000002 00000002 00000002 00000003 00000003 00000003 00000003 00000004 00000004 00000004 00000004 00000067 00000045 00000023 00000001 67452312 67452312 67452312 67452312 0000ffff 0000ffff 0000ffff 0000ffff 00000001 00000002 00000003 00000004 00000000 00000001 00000002 00000003 Not bad for untested code. Looks like integer shuffling is a bit borked. Probably easy to fix. Division seems to be a bit imprecise. For this specific use case thats probably pretty ok.

Andreas Gal :gal

Assignee

Comment 5

•

12 years ago

0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 3.000000 4.000000 1.000000 2.000000 3.000000 4.000000 1.000000 1.000000 1.000000 1.000000 2.000000 2.000000 2.000000 2.000000 3.000000 3.000000 3.000000 3.000000 4.000000 4.000000 4.000000 4.000000 5.000000 2.000000 3.000000 4.000000 1.000000 5.000000 3.000000 4.000000 1.000000 2.000000 5.000000 4.000000 1.000000 2.000000 3.000000 5.000000 7.000000 7.000000 7.000000 7.000000 3.499987 3.499987 3.499987 3.499987 Inf Inf Inf Inf 0.000000 0.000000 0.000000 0.000000 14.000000 14.000000 14.000000 14.000000 1.000000 2.000000 3.000000 4.000000 00000000 00000000 00000000 00000000 00000001 00000001 00000001 00000001 00000001 00000002 00000003 00000004 00000001 00000002 00000003 00000004 00000001 00000001 00000001 00000001 00000002 00000002 00000002 00000002 00000003 00000003 00000003 00000003 00000004 00000004 00000004 00000004 a4b89b18 b0000000 31f83300 b001b001 67452312 67452312 67452312 67452312 0000ffff 0000ffff 0000ffff 0000ffff 00000001 00000002 00000003 00000004 00000000 00000001 00000002 00000003 One failing test remaining. Turns out my printing code in the test was wrong, which caused most of the failures.

Andreas Gal :gal

Assignee

Comment 6

•

12 years ago

Attached patch patch, SIMD.h working on all platforms, with TestSIMD.cpp (obsolete) — Details — Splinter Review

Attachment #746813 - Attachment is obsolete: true

Attachment #746818 - Attachment is obsolete: true

Andreas Gal :gal

Assignee

Comment 7

•

12 years ago

The FPU version is ungodly slow on my Android device. Not really surprising though. We are likely spilling ourselves to death, and the i2f round trip is costly. The NEON path is much more acceptable (still 10x slower than my beefy laptop FPU, and 30x slower than SSE2). We really don't want to end up on the CPU/NEON path on mobile. This makes me pretty hopeful for being able to speed up Blur a lot. FPU 105.493011 0.745483mp/s NEON 16.183075 4.859596mp/s

Andreas Gal :gal

Assignee

Comment 8

•

12 years ago

prefetching is a 50% speedup: 12.184723 6.454246mp/s So we are definitely memory bound here, which makes me somewhat more confident that the code we emit with NEON is reasonable/sufficient.

Andreas Gal :gal

Assignee

Comment 9

•

12 years ago

For a matrix where we don't have to un-premultiply (doesn't munge alpha) x86 looks like this: FPU 1.051189 74.813568mp/s SSE2 0.265869 295.796802mp/s and ARM: NEON 9.881073 7.958974mp/s

Andreas Gal :gal

Assignee

Comment 10

•

12 years ago

Attached patch patch to land — Details — Splinter Review

Attachment #747085 - Attachment is obsolete: true

Andreas Gal :gal

Assignee

Comment 11

•

12 years ago

Comment on attachment 747674 [details] [diff] [review] patch to land I would like to go ahead and land this and then do a follow-up patch that adds some missing functionality for BlurSS2.cpp and turns BlurSSE2.cpp into BlurSIMD.cpp using SIMD.h so we get NEON support there, before I try to land the Moz2D interface next. Makes sense?

Attachment #747674 - Flags: review?(bas)

Bas Schouten (:bas.schouten)

Comment 12

•

12 years ago

Comment on attachment 747674 [details] [diff] [review] patch to land Review of attachment 747674 [details] [diff] [review]: ----------------------------------------------------------------- I'll look at this tomorrow. Derf should look at this to as he's better able to find any inefficiencies in the SIMD code.

Attachment #747674 - Flags: review?(tterribe)

Andreas Gal :gal

Assignee

Comment 13

•

12 years ago

Please feel free to bike-shed the naming convention for vec4f/vec4i and also the convention for methods on it (since we are about to add more for Blur). I am not a huge fan of the current convention, but I have no ideas what else to use. For methods we could use more expressive names and caps (PackAndSaturateU8?).

Timothy B. Terriberry (:derf)

Comment 14

•

12 years ago

(In reply to Bas Schouten (:bas.schouten) from comment #12) > I'll look at this tomorrow. Derf should look at this to as he's better able > to find any inefficiencies in the SIMD code. ColorMatrixSIMD.cpp appears to be missing?

Andreas Gal :gal

Assignee

Comment 15

•

12 years ago

oops. Its just an #include of ColorMatrix-inline.h with #define USE_SIMD. Will add. Also, I have a long plane ride tomorrow. Please feel free to bike shed the name and structure of the vector class. I don't particularly like the current setup myself. One option would be a scalar implementation of a template template <typename T, int N> Vector { }; with specializations for Vector<float, 4> for SSE2 and NEON etc and then maybe a vec4i shorthand?

Mayank Bansal

Comment 16

•

11 years ago

is this still needed ?

Markus Stange [:mstange]

Comment 17

•

11 years ago

An SSE2 ColorMatrix implementation was added in bug 924102, and bug 961759 tracks the NEON implementation. I'll close this bug.

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → DUPLICATE

Markus Stange [:mstange]

Updated

•

11 years ago

Attachment #747674 - Flags: review?(tterribe)

Attachment #747674 - Flags: review?(bas)

You need to log in before you can comment on or make changes to this bug.