Closed Bug 661624 Opened 13 years ago Closed 5 years ago

Implement a JavaScript Matrix API

Category: Core :: JavaScript Engine (defect)
Platform: x86_64 Linux
Severity: normal
Status: RESOLVED WONTFIX
Reporter: bobby (Unassigned)

This morning, this was passed around: http://developer.apple.com/library/safari/#documentation/AudioVideo/Reference/WebKitCSSMatrixClassReference/WebKitCSSMatrix/WebKitCSSMatrix.html

We do a lot of work writing good JS matrix libraries in CubicVR.js, and it would benefit us quite a bit performance-wise to have a simple matrix API do the work for us.

The problem we have with the implementation offered by WebKit is one of overhead. We don't need anything fancier than some functions that operate directly on Float32Arrays.

For instance, in JS: 
"mat4Multiply(someFloatArray, anotherFloatArray, destinationArray);"
or
"var newMatrix = mat4Multiply(someFloatArray, anotherFloatArray);"

This way, there is no OO overhead, and engine-writers like us can deal with the data as we want to.
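As a pure-JS sketch of this calling convention (column-major Float32Array storage assumed; the implementation here is illustrative, not a proposed one):

```javascript
// Illustrative sketch of the proposed destination-array style:
// multiply two column-major 4x4 matrices stored in Float32Arrays,
// writing into a caller-supplied destination to avoid allocating
// a result object per call.
function mat4Multiply(a, b, dest) {
  for (let col = 0; col < 4; col++) {
    for (let row = 0; row < 4; row++) {
      let sum = 0;
      for (let k = 0; k < 4; k++) {
        // element (row, k) of a times element (k, col) of b
        sum += a[k * 4 + row] * b[col * 4 + k];
      }
      dest[col * 4 + row] = sum;
    }
  }
  return dest;
}
```

Because `dest` is supplied by the caller, a render loop can reuse one scratch array for every multiply instead of allocating per operation.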
(In reply to comment #0)
> For instance, in JS: 
> "mat4Multiply(someFloatArray, anotherFloatArray, destinationArray);"
> or
> "var newMatrix = mat4Multiply(someFloatArray, anotherFloatArray);"
> 
> This way, there is no OO overhead, and engine-writers like us can deal with
> the data as we want to.

To avoid overhead, you really want the first of these two versions:

  multiply(someFloatArray, anotherFloatArray, destinationArray);

to avoid creating a temporary object to hold the result. This is where WebKit's API is slow, as has already been discussed on the WebGL list.

Just covering small sizes (up to 4x4 matrices) and only matrix multiplication (and perhaps matrix inversion) would already cover the needs of WebGL apps. If we support arbitrary sizes, we should make it clear that it's only fast for small sizes --- nobody wants to have to ship a fast GEMM in the browser.
I think that separate code paths for each handled size is a requirement for good performance: in particular, for very small matrix sizes, loop unrolling is critical.

So here's what seems to be needed:

  2x2 times 2x2 matrix product
  2x2 times 2x1 matrix-vector product
  2x2 matrix inversion
  2D vector dot product

  3x3 times 3x3 matrix product
  3x3 times 3x1 matrix-vector product
  3x3 matrix inversion
  3D vector dot product
  3D vector cross product

  4x4 times 4x4 matrix product
  4x4 times 4x1 matrix-vector product
  4x4 matrix inversion
  4D vector dot product

Total 13 functions.
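The point above about size-specific unrolled code paths can be illustrated with one of the listed entry points, the 3x3 matrix-vector product (a pure-JS sketch; column-major storage assumed, and the name mat3MulVec3 is illustrative, not a proposed API):

```javascript
// Fully unrolled 3x3 matrix-vector product on Float32Arrays,
// column-major: element (row, col) lives at m[col * 3 + row].
// No loops, no allocation: the destination is caller-supplied.
function mat3MulVec3(m, v, dest) {
  const x = v[0], y = v[1], z = v[2];
  dest[0] = m[0] * x + m[3] * y + m[6] * z;
  dest[1] = m[1] * x + m[4] * y + m[7] * z;
  dest[2] = m[2] * x + m[5] * y + m[8] * z;
  return dest;
}
```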
Depends on: 644389
Evilpies marked this bug as depending on bug 644389. I don't think it actually depends on it, but it's related. Bug 644389 (SIMD intrinsics) would allow JS coders to do the same for 4x4 matrices and 4D vectors, but it would not help much for other sizes (2x2 and 3x3 matrices).

I've also been wondering whether we'd want to add transpose() and combined transpose-multiply/transpose-inverse functions, but at the moment I don't think so. We must keep the number of functions low; transpose() can be implemented with decent speed by just manipulating the coefficients of a typed array; and the speed advantage of combined transpose-multiply/transpose-inverse functions, while real, is probably not big enough to justify having them as separate functions.
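For illustration, here is what "manipulating the coefficients of a typed array" looks like for a 4x4 transpose (a pure-JS sketch; mat4Transpose is an illustrative name, not a proposed API):

```javascript
// In-place 4x4 transpose: swap each coefficient above the diagonal
// with its mirror below it. Element (row, col) of a column-major
// matrix lives at m[col * 4 + row].
function mat4Transpose(m) {
  for (let row = 0; row < 4; row++) {
    for (let col = row + 1; col < 4; col++) {
      const a = row * 4 + col, b = col * 4 + row;
      const t = m[a]; m[a] = m[b]; m[b] = t;
    }
  }
  return m;
}
```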
I'm not a matrix library expert, but my concerns are
 - maybe only 13 functions, but a fair amount of code that needs impls for all supported architectures in our current codebase
 - also needs tjit/mjit/ti etc. hookup
 - ties authors to particular representation(s)
 - expected to be subsumed by bug 644389.  If so, the work here is "wasted" in that if we continued to support these interfaces, we would rip them out of C++ and move to JS.

There's also a political problem getting these spec'd.  Which group's purview do they fall under?  ECMA?

I could maybe see some value in this if we implemented it with bug 644389 in mind, meaning that we learn about the hard problems and lay the groundwork for generic SIMD so as not to waste backend code.  I.e., the explicit goal should be to eventually implement these in pure JS.
There's no good way to mark a bug as merely related, and I think it's common to mark bugs as dependencies when they're actually only related.

I think we want to implement this with SSE or AVX, so float arrays of size 4 would be preferable.
I'm all for making matrix math faster.

One question I have up front is, for some typical application, what is the performance of the pure JS version vs. one using ideal matrix API functions? I ask because there exist floating-point microbenchmarks that run just as fast in a modern JS engine as in C++.

As Benoit says in comments 2 and 3, I would guess that slowdowns in floating-point matrix math are more likely to come from memory allocation than the actual computations. So an API to cheaply create struct-like containers for vectors and points and such might solve most of the perf problems.
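To make the allocation point concrete, here is the same small transform written both ways (a pure-JS sketch with illustrative names; 2x2 column-major storage assumed):

```javascript
// Allocation-free variant: the caller supplies the output array,
// so a hot loop can reuse one scratch Float32Array forever.
function transformInto(m, v, dest) {
  dest[0] = m[0] * v[0] + m[2] * v[1];
  dest[1] = m[1] * v[0] + m[3] * v[1];
  return dest;
}

// Allocating variant: creates a fresh result object on every call,
// which is where WebKitCSSMatrix-style APIs pay their overhead.
function transformFresh(m, v) {
  return transformInto(m, v, new Float32Array(2));
}
```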
(In reply to comment #4)
> I'm not a matrix library expert, but my concerns are
>  - maybe only 13 functions, but a fair amount of code that needs impls for
> all supported architectures in our current codebase

Well, a generic implementation in C++, not using any platform-specific optimization, would already be worthwhile. Then we'd want SSE and NEON, yes.

>  - also needs tjit/mjit/ti etc. hookup

True

>  - ties authors to particular representation(s)

True, it's quite low-level, though not as low-level as the SIMD API we're discussing in bug 644389.

>  - expected to be subsumed by bug 644389.

Only the 4x4 case is subsumed by bug 644389. You can't do 3x3 operations with SIMD, at least not efficiently. Similarly, SIMD doesn't help for the 2x2 product.

Also, even in the 4x4 case, while it's true that SIMD intrinsics (bug 644389) would allow the JS developer to do that, it is quite difficult to implement 4x4 matrix inverse efficiently using SIMD intrinsics.

> There's also a political problem getting these spec'd.  Which group's
> purview do they fall under?  ECMA?

Did Apple go through any committee for WebKitCSSMatrix or is that just their own thing?
(In reply to comment #7)
> (In reply to comment #4)
> >  - ties authors to particular representation(s)
> 
> True, it's quite low-level, though not as low-level as the SIMD API we're
> discussing in bug 644389.

Yes, but because the SIMD API is simpler/lower-level, it's easier to spec.  (IMHO.)

But David is right, benchmarks are sorely needed here.
(In reply to comment #4)
> There's also a political problem getting these spec'd.  Which group's
> purview do they fall under?  ECMA?

We should definitely keep in touch with TC39 people on any effort like this. But this seems like more of a library than a language change, so I could see it going like the Canvas API--someone just makes it, and if devs like it, work on standardizing.

> I could maybe see some value in this if we implemented it with bug 644389 in
> mind, meaning that we learn about the hard problems and lay the groundwork
> for generic SIMD so as not to waste backend code.  I.e., the explicit goal
> should be to eventually implement these in pure JS.

Good point.
Ahoy, thought I'd add some benchmark results to help out:

I've created a simple test case for inverting a 4x4 matrix using various JS methods and the MesaGL C version of gluInvertMatrix.

My rig is an Intel i7 960 running 64-bit ubuntu linux.

The JS Version is here: http://jsperf.com/inverse-matrix-4x4/5
and CubicVR.inverse$2, the slowest of the bunch, is a near 1-to-1 (unrolled) match to gluInvertMatrix.

The highest score I achieved here was with osgjsInverse, which gives me 2,477,413 ops/sec.


The C version is here: https://gist.github.com/1005551 and performs 1 million calls to gluInvertMatrix

Note that it has some added handicaps: it allocates and de-allocates the matrix in the inner loop, uses 64-bit doubles, and its determinant multiplication loop isn't unrolled.

I tried with both no optimization, and -O3 optimization, here are the results:

ccliffe@galaxy:~$ ./inverse_test 
Total Milliseconds: 311
Total Seconds: 0.311000
Cycles/Second: 3215434.083601

ccliffe@galaxy:~$ ./inverse_test-O3 
Total Milliseconds: 122
Total Seconds: 0.122000
Cycles/Second: 8196721.311475

Cheers,
-CJ
(In reply to comment #7)
> Also, even in the 4x4 case, while it's true that SIMD intrinsics (bug
> 644389) would allow the JS developer to do that, it is quite difficult to
> implement 4x4 matrix inverse efficiently using SIMD intrinsics.

How would that be efficiently implemented, then? I mean does it matter whether SIMD intrinsics are used by JS code or compiled C++, or do you use another unit or units (CPU + GPU + short-vector-FPU)?

> > There's also a political problem getting these spec'd.  Which group's
> > purview do they fall under?  ECMA?
> 
> Did Apple go through any committee for WebKitCSSMatrix or is that just their
> own thing?

Apple has done stuff through various standards bodies but for the iPhone they had to blaze some trails.

This is not in general the model for Mozilla to follow. In general (that phrase again), we prefer to propose draft standards and then prototype-implement them, with a process that invites other vendors to edit and implement too.

Still, http://developer.apple.com/library/safari/#documentation/AudioVideo/Reference/WebKitCSSMatrixClassReference/WebKitCSSMatrix/WebKitCSSMatrix.html looks pretty tasty. Does it have momentum?

Ecma TC39 has definitely looked at the general area of "value types" and "binary data" but not (yet) with a focus on data parallelism. We will get there pretty soon, I think, and more data and experience before we go to Ecma would be good.

We're working with Intel folks on data parallel extensions to JS. More about that another time.

/be
(In reply to comment #11)
> (In reply to comment #7)
> > Also, even in the 4x4 case, while it's true that SIMD intrinsics (bug
> > 644389) would allow the JS developer to do that, it is quite difficult to
> > implement 4x4 matrix inverse efficiently using SIMD intrinsics.
> 
> How would that be efficiently implemented, then? I mean does it matter
> whether SIMD intrinsics are used by JS code or compiled C++, or do you use
> another unit or units (CPU + GPU + short-vector-FPU)?

I didn't mean that SIMD was not the best way to implement 4x4 matrix inverse --- it is. I meant that efficiently implementing this is nontrivial. Intel has published an efficient implementation; see section 5.3 in

  http://download.intel.com/design/PentiumIII/sml/24504301.pdf

The basic problem is that this involves computing 16 3x3 determinants, and SIMD isn't well suited to computing 3x3 determinants because packets contain 4 floats. I believe the best approach is to vectorize horizontally, i.e. compute four 3x3 determinants at once.
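In scalar pure JS, the cofactor building block looks like this (an illustrative sketch; the Intel paper's trick is to compute four such determinants per SIMD packet, which this scalar helper doesn't attempt):

```javascript
// 3x3 determinant by cofactor expansion along the first row.
// The 4x4 inverse needs 16 of these (one per cofactor of the matrix),
// which is exactly the step that is awkward to map onto 4-float packets.
function det3(a, b, c, d, e, f, g, h, i) {
  return a * (e * i - f * h)
       - b * (d * i - f * g)
       + c * (d * h - e * g);
}
```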

> 
> > > There's also a political problem getting these spec'd.  Which group's
> > > purview do they fall under?  ECMA?
> > 
> > Did Apple go through any committee for WebKitCSSMatrix or is that just their
> > own thing?
> 
> Apple has done stuff through various standards bodies but for the iPhone
> they had to blaze some trails.
> 
> This is not in general the model for Mozilla to follow. In general (that
> phrase again), we prefer to propose draft standards and then
> prototype-implement them, with a process that invites other vendors to edit
> and implement too.
> 
> Still,
> http://developer.apple.com/library/safari/#documentation/AudioVideo/
> Reference/WebKitCSSMatrixClassReference/WebKitCSSMatrix/WebKitCSSMatrix.html
> looks pretty tasty. Does it have momentum?

Given that WebKitCSSMatrix methods return new WebKitCSSMatrix objects, losing a lot of performance creating these objects, I don't think it's the right approach. As an Apple engineer wrote on the WebGL list:

https://www.khronos.org/webgl/public-mailing-list/archives/1012/msg00152.html

Quote: "Not a benchmark, but I've done informal testing. In J3DI I can turn on and off CSSMatrix and the native copy operation. I called the normal matrix operations many times per frame to get the percentage of the matrix ops where they are significant. Using CSSMatrix without copy gave moderate speedup, maybe 30%. But using copy without CSSMatrix almost doubled performance! The test I didn't do was to make CSSMatrix mutable so a new matrix didn't have to be created for each operation. I'm willing to bet that doing copy + native, mutable CSSMatrix would at least triple performance from today's best effort. But that's just a guess."

There's been a discussion on the WebGL list:
https://www.khronos.org/webgl/public-mailing-list/archives/1012/msg00080.html
This WebKit bug is relevant to this discussion:
https://bugs.webkit.org/show_bug.cgi?id=50633
> Did Apple go through any committee for WebKitCSSMatrix or is that just their
> own thing?

WebKitCSSMatrix is an implementation of the CSSMatrix interface defined in both the CSS3 2D and 3D transforms modules:

http://www.w3.org/TR/css3-2d-transforms/#cssmatrix-interface
http://www.w3.org/TR/css3-3d-transforms/#cssmatrix-interface
(In reply to comment #14)

Oh, the CSS 3D transforms spec :-) Thanks, good to know that CSSMatrix is described there, and that that part of the spec is actually precise enough to be useful (unlike, e.g., http://www.w3.org/TR/css3-3d-transforms/#perspective-origin).
I'm not sure if I can find enough time for it, but since it doesn't seem to be high prio for anyone I'd like to take this bug.
Assignee: general → ejpbruel
I haven't been able to make time for this bug so far, unfortunately. Moreover, it looks like there's no clear consensus on how we want to implement this anyway, so I'm dropping this for the time being.
Assignee: ejpbruel → general
Assignee: general → nobody

Resolving as Won't Fix, because any ECMA-262 changes should now happen through the TC39 process.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX