Closed Bug 482200 Opened 15 years ago Closed 6 years ago

Investigate how best to do 4x4 matrices in JS

Categories

(Core :: JavaScript Engine, enhancement)

x86
Linux
enhancement
Not set
normal

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: ilmari.heikkinen, Unassigned)

Details

Attachments

(2 files, 2 obsolete files)

4x4 matrix multiplications are used in 3D engines to build the transformation matrices for each object. The amount of matrix multiplications you can do per frame affects the amount of objects you can have in a scene. So the faster the mmult, the more complex scenes you can manage.

On my 2GHz Core 2 box and 2009-03-01 3.2a1pre Minefield, it takes ~23 ms to do 1000 4x4 mmults in JavaScript. Which would limit a JS engine running at 30 fps to 300-1000 objects (depending on whether you're memoizing the object transform stack for static objects or not.)

At 60 fps, the amount objects is at least halved, as you're also seeing static overhead increase in proportion (10 ms per-frame overhead at 30 fps leaves 23 ms for the engine, but only 6 ms at 60 fps.)
Actually, a C port of mmul4x4 compiled with -O2 runs only 4 times faster. And a bit of googling suggests that using SSE might give a 2-4x speedup. So maybe I'm just doing the wrong assumption here by thinking that matrix-by-matrix multiplication is the best way to go for object transformations.
That was going to be my next question, yeah.  That's pretty encouraging jit-wise, though obviously it could be better, especially with SIMD stuff.  I don't think you can get away from matrix-by-matrix mults though; I mean, you can optimize for translation-only matrices for example, but I don't think that'll help that much.
Attached file C port of mmul4x4 (obsolete) —
Timed by setting iters to 10k and `time ./mmul`
Ah, sorry about that, using alloca makes things slow in the C example. Replacing it with malloc (or posix_memalign(&ptr, 16, sz)) reduces the time to 160 ms for a million 4x4 mmults at -O2, 80 ms at -O3. So 150-300x faster than JavaScript.

80 ms is around what you'd expect for a million mmults, as the code does 16x {8x movsd, 4x mulsd, 3x addsd, 2.5x addq}. ADDSD throughput is 1/cycle, MULSD 2, MOVSD 2, ADDQ 1, for a total of 184 cycles: 1e6 / (2.13e9 / 184) = 0.086.
Unrolled the two inner loops, dropping the runtime to 9 ms.
Attachment #366269 - Attachment is obsolete: true
Component: Canvas: 2D → Canvas: WebGL
Hey Ilmari.

Just a suggestion, maybe you'd want to make your testcase a bit larger, since when I tried it on my machine it completed in 1ms w/o even making a noticeable spike in the CPU on one core (latest nightly FF3.7a3pre)

http://learningwebgl.com/blog/?p=1828

And that might interest you too - ran into it on Planet WebGL - offloading matrix mult onto the GPU
BTW, there are some links to other Bugzilla bugs and optimisation suggestions in comments to that post.
The original benchmark here runs quite fast now, because of what's said in comment #7.  However, optimizing matrix multiplication in browsers would be a worthwhile goal -- investigating the fastest ways to represent matrices in JS, and whether things like CSSMatrix etc. are useful perf-wise would be good.
Summary: [c3d] JavaScript 4x4 matrix multiplication benchmark → Investigate how best to do 4x4 matrices in JS
Severity: normal → enhancement
Assignee: nobody → general
Component: Canvas: WebGL → JavaScript Engine
QA Contact: canvas.2d → general
Assignee: general → nobody
Old performance bug, no longer really applicable, because the overall system is now faster. Therefore resolving as INCOMPLETE.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: