
IonMonkey: gl-matrix.js: use NEON to optimize vectorized functions.

Status: RESOLVED DUPLICATE of bug 832718 (opened 4 years ago; last updated 4 years ago)
Product: Core
Component: JavaScript Engine
Reporter: snorp (Unassigned)
Firefox Tracking Flags: Not tracked
Attachments: 1 attachment

I've been profiling some WebGL content on mobile recently and noticed that we frequently spend a lot of time doing matrix arithmetic. We could accelerate these operations with SSE and NEON and improve things quite a bit, though of course we'd have to expose some "native" matrix types.
One example of this can be found here: http://media.tojicode.com/webgl-samples/md5Mesh.html

And here's the profile: http://people.mozilla.com/~bgirard/cleopatra/#report=a32b6fad57ee69e801aafafc3c4dc45eebdce7be
Essentially we need to do this, but for js: http://tirania.org/blog/archive/2008/Nov-03.ht
Wow, copy/paste fail: http://tirania.org/blog/archive/2008/Nov-03.html
Created attachment 709782 [details]
JIT Profiler output.

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #3)
> Wow, copy/paste fail: http://tirania.org/blog/archive/2008/Nov-03.html

See Bug 832718 (SIMD support for general JS).

Then we should look at a dump of http://media.tojicode.com/webgl-samples/js/util/gl-matrix-min.js (the multiplyVec3 function):

// From gl-matrix-min.js: c = a * b, where a is a 3x3 column-major matrix and
// b is a 3-vector; the result is written back into b when c is omitted.
multiplyVec3:function(a,b,c){
  c||(c=b);
  var d=b[0],e=b[1],b=b[2];
  c[0]=d*a[0]+e*a[3]+b*a[6];
  c[1]=d*a[1]+e*a[4]+b*a[7];
  c[2]=d*a[2]+e*a[5]+b*a[8];
  return c
}

I think there is another issue to address before playing with any SSE/NEON instructions: we first need to get rid of the CallSetElement.
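To make that measurable, here is a minimal micro-benchmark sketch (not part of the original report; the function body is copied from the quoted gl-matrix-min.js with the reused variable names untangled, and the loop count is arbitrary) that times the hot function with plain dense arrays versus Float32Array storage, the distinction the rest of this bug hinges on:

// Illustrative micro-benchmark only: times the quoted multiplyVec3 with the two
// storage types discussed in this bug. Loop counts are arbitrary.
function multiplyVec3(a, b, c) {
  c || (c = b);
  var x = b[0], y = b[1], z = b[2];
  c[0] = x * a[0] + y * a[3] + z * a[6];
  c[1] = x * a[1] + y * a[4] + z * a[7];
  c[2] = x * a[2] + y * a[5] + z * a[8];
  return c;
}

function bench(makeArray) {
  var m = makeArray(9), v = makeArray(3), out = makeArray(3);
  for (var i = 0; i < 9; i++) m[i] = Math.random();
  for (var i = 0; i < 3; i++) v[i] = Math.random();
  var start = Date.now();
  for (var i = 0; i < 1e6; i++) multiplyVec3(m, v, out);
  return Date.now() - start;
}

console.log("dense arrays: ", bench(function (n) { return new Array(n); }), "ms");
console.log("Float32Array: ", bench(function (n) { return new Float32Array(n); }), "ms");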
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 644389
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #5)
> 
> *** This bug has been marked as a duplicate of bug 644389 ***

As mentioned previously, the problem here is not that we are not using SIMD instructions.  The problem is that we produce a CallSetElement, which causes a huge slowdown in this function.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Summary: Add support for accelerated matrix operations → IonMonkey: gl-matrix.js is slow because it uses a CallSetElement.
(In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #6)
> (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #5)
> > 
> > *** This bug has been marked as a duplicate of bug 644389 ***
> 
> As mentioned previously, the problem here is not that we are not using SIMD
> instructions.  The problem is that we produce a CallSetElement, which causes
> a huge slowdown in this function.

Oh, I somehow missed that comment, sorry! How much of an improvement do you think it will make if that is fixed? I do still think we should use SIMD, because it should make an operation like this something like 4 instructions (plus js overhead), but obviously anything we can do in the meantime would help too.
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #7)
> (In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #6)
> > (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #5)
> > > 
> > > *** This bug has been marked as a duplicate of bug 644389 ***
> > 
> > As mentioned previously, the problem here is not that we are not using SIMD
> > instructions.  The problem is that we produce a CallSetElement, which causes
> > a huge slowdown in this function.
> 
> Oh, I somehow missed that comment, sorry! How much of an improvement do you
> think it will make if that is fixed?

A factor of at least 20 is the approximate cost I attribute to this kind of function call, since it can be substituted by a few lines of assembly instead of a call to a VM function wrapper that handles the fully generic case.

> I do still think we should use SIMD,
> because it should make an operation like this something like 4 instructions
> (plus js overhead)

I agree, but we are not there yet in this test case, and I don't think this would be the next blocker after fixing CallSetElement.
Ok, it seems that the problem I noticed with CallSetElement no longer exists.  I cannot reproduce it with a newer version of the browser (rather than one that is 6 weeks old).

Both the JIT inspector and stepping through the code generation of the function confirm that we are no longer producing any calls there, hopefully.  Still, profiling in the browser (not on mobile) shows that the cost of multiplyVec3 relative to _loadMesh is 32.4% (similar to the profile attached here, which shows 36.8%).

I am also surprised, because gl-matrix has a switch to choose between typed arrays and normal arrays, yet all the setelem operations done in multiplyVec3 (line 33) are made on dense arrays, not typed arrays.

Brian: (see next paragraph)

Even the JIT inspector reports that all getelem operations in this function expect dense arrays (because we attempt to unbox int32 values before converting them to doubles), which is odd given Brian Hackett's latest patch, which enforces a double type if the array contains any doubles.  I would expect these operations to be mostly double operations, and thus the storage too.

// line 33 of gl-matrix-min.js: rotates the 3-vector b by the quaternion a
// (a = [x, y, z, w]), writing the result into c (or into b when c is omitted).
multiplyVec3 = function(a,b,c) {
  c||(c=b);
  var d = b[0], e = b[1], g = b[2],
      b = a[0], f = a[1], h = a[2], a = a[3],
      j = a * d + f * g - h * e,
      i = a * e + h * d - b * g,
      k = a * g + b * e - f * d,
      d = -b* d - f * e - h * g;
  c[0] = j * a + d *-b + i *-h - k *-f;
  c[1] = i * a + d *-f + k *-b - j *-h;
  c[2] = k * a + d *-h + j *-f - i *-b;
  return c
};

As for SIMD instructions: as long as this test case manipulates dense arrays, this bug would be a duplicate of Bug 832718; if we start manipulating Float32Arrays instead, it would be a duplicate of Bug 644389.
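As a side note on that switch: a hedged sketch of the mechanism is below. The glMatrixArrayType name is an assumption based on 0.9.x-era gl-matrix and may not match the minified build in the demo; the point is only that the library picks one backing constructor up front, which determines whether this test case falls under Bug 832718 or Bug 644389.

// Sketch of the storage switch in 0.9.x-era gl-matrix (the glMatrixArrayType
// name is an assumption; the minified build used by the demo may differ).
var glMatrixArrayType =
    (typeof Float32Array !== "undefined") ? Float32Array : Array;

// Forcing plain dense arrays reproduces the case profiled here (Bug 832718):
// glMatrixArrayType = Array;

function createVec3() {
  // Either new Float32Array(3) or new Array(3), depending on the switch above.
  return new glMatrixArrayType(3);
}

var v = createVec3();
v[0] = 1; v[1] = 2; v[2] = 3;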
Flags: needinfo?(bhackett1024)
Testing this test case on 3 different platforms:

- Nightly, Desktop, x86-64 Linux, Intel(R) Core(TM) i7-2820QM CPU @ 2.30GHz, performance governor: 60 FPS (call frequency of requestAnimationFrame)
- Firefox for Android (Nightly), Galaxy Nexus, 1.2 GHz dual-core processor, on battery: 29 FPS
- B2G, test driver device, on battery: 8 FPS

The test driver device is not a gaming platform; at this point this deserves some optimization in the code used to do these matrix multiplications.  The first step in optimizing it would be to enable TI and then Ion.

The Galaxy Nexus corresponds to what you seem to have in your profile (> 24 FPS); in addition, I will note that you are above 24 FPS even while profiling and taking samples every 1 ms.

Profiles on desktop and Firefox for Android are similar (relative to the _loadMesh function), so we can safely assume that Firefox for Android is producing the same MIR, specialized for ARM instead of x64.  The small delta between 32.4% and 36.8% does not seem significant enough to indicate a major ARM-specific issue.

I am marking this bug as a duplicate of the SIMD bug for general JS (Bug 832718), as this test case is not using typed arrays.  I am cancelling Brian's need-info, as performance is not critically bad in the reported profile nor on a Galaxy Nexus phone.

Currently SIMD support is a low priority.  Do you have anything that renders the same kind of animation at a faster rate while running exclusively under the Dalvik VM, so that we can decide whether to raise the priority of SIMD support?
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago
Flags: needinfo?(bhackett1024) → needinfo?(snorp)
Resolution: --- → DUPLICATE
Duplicate of bug: 832718
OS: All → Android
Hardware: All → ARM
Summary: IonMonkey: gl-matrix.js is slow because it uses a CallSetElement. → IonMonkey: gl-matrix.js: use NEON to optimize vectorized functions.
(In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #10)
<snip>
> I am marking this bug as a duplicate of the SIMD bug for general JS (Bug
> 832718), as this test case is not using typed arrays.  I am cancelling
> Brian's need-info, as performance is not critically bad in the reported
> profile nor on a Galaxy Nexus phone.

Maybe I'm not reading bug 832718 right, but it sounds like you want to use some heuristic to figure out whether or not you can vectorize a path like this one? While that would be great, it also sounds pretty difficult to me. Wouldn't it be simpler to just have some explicit Vector types of various sizes that people can use directly? The mono guys did something similar a while back and it has been very successful with their game developer crowd: http://docs.go-mono.com/index.aspx?link=N:Mono.Simd

I'm obviously not an expert in JS engines, but it seems something like that could be knocked out in a week and have a pretty dramatic effect on WebGL apps.
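To make that concrete, here is a purely hypothetical sketch of what explicit vector types could look like from JS; none of these names exist in SpiderMonkey, and a real implementation would lower the operations to SSE/NEON rather than the scalar fallback shown here.

// Hypothetical API sketch, in the spirit of Mono.Simd: a fixed-width float
// vector with element-wise operations. The scalar bodies only define the
// semantics; an engine-provided version would map them to SIMD instructions.
var Float32x4 = {
  create: function (x, y, z, w) { return new Float32Array([x, y, z, w]); },
  add: function (a, b) {
    return new Float32Array([a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]);
  },
  mul: function (a, b) {
    return new Float32Array([a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]]);
  }
};

// Usage: one conceptual vector operation instead of four scalar double ops.
var va = Float32x4.create(1, 2, 3, 4);
var vb = Float32x4.create(5, 6, 7, 8);
var vc = Float32x4.add(va, vb);  // [6, 8, 10, 12]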

> 
> Currently SIMD support is a low priority.  Do you have anything that renders
> the same kind of animation at a faster rate while running exclusively under
> the Dalvik VM, so that we can decide whether to raise the priority of SIMD
> support?

Sorry, I don't. Do you think Dalvik already uses NEON for this kind of operation?
Flags: needinfo?(snorp)
Since we seem to be commenting on this bug, I'd like to bring up the fact that NEON is likely not applicable here, since NEON works on integer types, and single precision floating point, but *not* double precision floating point.
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #11)
> (In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #10)
> <snip>
> > I am marking this bug as a duplicate of the SIMD bug for general JS (Bug
> > 832718), as this test case is not using typed arrays.  I am cancelling
> > Brian's need-info, as performance is not critically bad in the reported
> > profile nor on a Galaxy Nexus phone.
> 
> Maybe I'm not reading bug 832718 right, but it sounds like you want to use
> some heuristic to figure out whether or not you can vectorize a path like
> this one? While that would be great, it also sounds pretty difficult to me.
> Wouldn't it be simpler to just have some explicit Vector types of various
> sizes that people can use directly? The mono guys did something similar a
> while back and it has been very successful with their game developer crowd:
> http://docs.go-mono.com/index.aspx?link=N:Mono.Simd

The problem you raised is one test case with a profile, and that test case uses an outdated version of gl-matrix (at least I cannot find multiplyVec3 in the gl-matrix git repository).  What I am saying in this bug is that there is no remaining issue related to IonMonkey.  If you want to make the current code fast, then you need SIMD on non-typed arrays, and that is not a priority.

If you want to fix it, feel free to contribute and to justify your approach across multiple platforms in a way that can be standardized.  In the meantime, I am not the right person to decide whether standard functions are justified or not.
(In reply to Marty Rosenberg [:mjrosenb] from comment #12)
> Since we seem to be commenting on this bug, I'd like to bring up the fact
> that NEON is likely not applicable here, since NEON works on integer types,
> and single precision floating point, but *not* double precision floating
> point.

Yeah, I guess we'd have to demote to single precision. This happens anyway with the values that go into OpenGL, I'd assume. Not sure if that negates the performance gain of SIMD or not.
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #14)
> (In reply to Marty Rosenberg [:mjrosenb] from comment #12)
> > Since we seem to be commenting on this bug, I'd like to bring up the fact
> > that NEON is likely not applicable here, since NEON works on integer types,
> > and single precision floating point, but *not* double precision floating
> > point.
> 
> Yeah, I guess we'd have to demote to single precision. This happens anyway
> with the values that go into OpenGL, I'd assume. Not sure if that negates
> the performance gain of SIMD or not.

Oh wait, we actually have 32-bit floats as part of Float32Array anyway, so I guess there's no problem here.
AFAICT, the JS semantics for Float32Array are: "the values in the array are 32-bit floats, but as soon as we read one out of the array, we promote it to a double, then do all math with doubles."  I believe there are even tests in place to make sure that we don't accidentally do single-precision math.
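A small illustration of those semantics (plain JS, nothing engine-specific): the store rounds to the nearest 32-bit float, and the read comes back as a promoted double, so any further arithmetic is double-precision on the rounded value.

var f = new Float32Array(1);
f[0] = 0.1;                  // stored as the float32 closest to 0.1
console.log(f[0] === 0.1);   // false: reads back as a double, ~0.10000000149
console.log(f[0] + f[0]);    // double-precision addition of the rounded value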
(In reply to Marty Rosenberg [:mjrosenb] from comment #16)
> AFAICT, the JS semantics for Float32Array are: "the values in the array are
> 32-bit floats, but as soon as we read one out of the array, we promote it to
> a double, then do all math with doubles."  I believe there are even tests in
> place to make sure that we don't accidentally do single-precision math.

That makes sense I guess. The vectorized version would need to circumvent that, which could be viewed as "wrong", I suppose. I'm imagining something like the following:

// 4x4 matrix
let a = new Float32Array(16);
let b = new Float32Array(16);

// populate a & b with stuff

// Multiply a * b at offset 0, length 4
a.multiply(b, 0, 4);
a.multiply(b, 4, 4);
a.multiply(b, 8, 4);
a.multiply(b, 12, 4);

We would never "unpack" the elements from the array, so I think it could work out.
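For reference, here is a scalar sketch of one possible reading of that hypothetical API (no such method exists on Float32Array): multiply(b, offset, len) treats this[offset .. offset+len) as a row vector and multiplies it in place by b interpreted as a len x len row-major matrix, so the four calls above would compute the full 4x4 product. A NEON/SSE implementation would do the same work with vector loads and multiply-accumulates, without ever materializing the elements as doubles.

// Hypothetical polyfill, scalar semantics only; the name and behavior are
// assumptions about the proposed API, not an existing Float32Array method.
Float32Array.prototype.multiply = function (b, offset, len) {
  var row = new Float64Array(len);               // scratch copy of the input row
  for (var i = 0; i < len; i++) row[i] = this[offset + i];
  for (var j = 0; j < len; j++) {
    var sum = 0;
    for (var k = 0; k < len; k++) sum += row[k] * b[k * len + j];
    this[offset + j] = sum;                      // a SIMD version keeps this in registers
  }
  return this;
};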