Closed Bug 644389 Opened 13 years ago Closed 9 years ago

Add a vector/SIMD API for web content

Categories: Core :: JavaScript Engine (enhancement)
Severity: normal
Status: RESOLVED DUPLICATE of bug 894105
People: (Reporter: cjones, Unassigned)
Whiteboard: [tech-p1]
Attachments: (2 files)

dmandelin and I discussed this briefly at the last all-hands.  The motivation is, once scripts are JIT'd and are using typed arrays, there's still an "easy" performance win for perf-critical code: SIMD.  For example, it's already known that DFTs/DCTs benefit from vectorization; see FFTW.  I CC'd humph who might have run into some of these walls in his work.  (humph, ISTR that you guys have a pure-JS DFT implementation, right?)  JS that manipulates images is another "easy" winner with this kind of API.  I think everyone would agree that having a web app that could filter images as fast as Photoshop would be a pretty compelling demo.

Note that this is *not* auto-vectorization of scripts, although that might be an interesting project too (probably not though ;)).  This would be a way for "power users" to harness SIMD from JS to make their hot loops run faster.

So, the hard part is spec'ing this.  Some initial considerations were brought up in the context of a simpler problem, bug 486800, and dmandelin and I discussed some more:

 - Should this have "SIMD" or "vector" semantics?  The former would be fixed-width, the latter variable-width.  I strongly lean towards a vector API; I would hate to pollute the web with another NaCl travesty, tied to a particular ISA as of ~2010.  A compromise might be exposing a SIMD_WIDTH constant or something that's set at runtime, based on SIMD extensions present.  SIMD semantics is easier to spec and implement, but puts a higher burden on authors.  Vector semantics puts more burden on the compiler.

 - How would the SIMD/vector programs be written?  Typed arrays would definitely be involved.  One possibility is to implement this API with something like the Math object; say, Vector.add, etc.  This would make it easy for JS libraries to provide cross-browser compat, by creating fallback implementations.  It would be harder to make this fast; it would be easy to drown the wins from SIMD in overhead from function calls or dynamic typing.  Another possibility is to create a (statically typed) DSL and offer an eval()-like interface to compile the DSL.  Another possibility (which I think would be coolest) is to follow the example of WebGL shaders and make a new <script> type for a DSL.  It's not entirely clear what interface those programs would offer to <script type="text/javascript">.
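
To make the Math-object-style option concrete, here is a minimal sketch, assuming a hypothetical Vector.add(dst, a, b) builtin over typed arrays (the name and signature are invented for illustration, not a proposal):

  // Fallback a JS library could ship for engines without the builtin:
  if (typeof Vector === "undefined") {
    var Vector = { add: function (dst, x, y) {
      for (var i = 0; i < dst.length; i++) dst[i] = x[i] + y[i];
    } };
  }

  // Hypothetical builtin-style use: element-wise dst[i] = a[i] + b[i].
  var a = new Float32Array(1024), b = new Float32Array(1024);
  var dst = new Float32Array(1024);
  Vector.add(dst, a, b);  // an engine could lower this to SIMD loads/adds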

Rob Cameron has done some low-level work, linked in bug 486800, that definitely seems worth referencing.  Intel's Ct language is also relevant.  There are a bajillion other data-parallel languages from the past that could be cribbed from.  If there's sufficient interest, I'd be happy to noodle around with a strawman spec, but I'd be equally happy handing this project off to someone else.
(In reply to comment #0)
> 
>  - Should this have "SIMD" or "vector" semantics?  The former would be
> fixed-width, the latter variable-width.  I strongly lean towards a vector API;
> I would hate to pollute the web with another NaCl travesty, tied to a
> particular ISA as of ~2010.  A compromise might be exposing a SIMD_WIDTH
> constant or something that's set at runtime, based on SIMD extensions present. 
> SIMD semantics is easier to spec and implement, but puts a higher burden on
> authors.  Vector semantics puts more burden on the compiler.

As a compiler-writer my immediate inclination is towards SIMD semantics :)  Pretty much all the important architectures have pretty similar SIMD implementations now, no?  The Adobe guys have been talking about e.g. adding a 'float4' (4 x 32-bit float) type to Nanojit.  That (and other similar types) seems a reasonable way to do it, because it maps simply to common SIMD implementations without explicitly guaranteeing/requiring anything from the hardware.
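
For illustration, a float4-style type surfaced to JS might look something like this sketch (surface syntax entirely hypothetical; no such type exists in JS today):

  var v = float4(1.0, 2.0, 3.0, 4.0);  // 4 x 32-bit float, one SIMD register
  var w = float4.add(v, v);            // a single packed add on SSE or NEON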
Yes, I wrote a bunch of shell tests for sayrer and Kraken with dsp.js, dft being one of them.  I've attached all of them.  They are basically just math on typed arrays, so might be useful to you.  I've simulated audio samples by extracting a bunch from a real song, and then just statically including the resulting framebuffer arrays.
AFAIK there hasn't been an actual vector *ISA* for many many years; they're all SIMD nowadays.  I'm not sure how best to add SIMD support to nanojit; that's bug 486800, and we need to figure that out.  However, both of those are distinct from the API we present to web content.

My worry with a SIMD API is that SIMD widths change from ISA to ISA and from year to year, and tying the API to, say, 128-bit width is bad for forward compatibility.  Another problem with a SIMD API is that the granularity of operations is finer than what they would be in a vector API, which means more JS code has to run which brings in more type checks etc.  We could speculate or optimize those away, but that puts a higher burden on the JITs.
(In reply to comment #2)
> Created attachment 521349 [details]
> stand-alone dsp.js jsshell tests
> 
> Yes, I wrote a bunch of shell tests for sayrer and Kraken with dsp.js, dft
> being one of them.  I've attached all of them.  They are basically just math on
> typed arrays, so might be useful to you.  I've simulated audio samples by
> extracting a bunch from a real song, and then just statically including the
> resulting framebuffer arrays.

Awesome!  To answer some of these questions, we'll probably need trial and error on good examples.  In exchange, hopefully we can hand back code that's 4x faster :).
If you can guide us a bit, I'm sure we can supply you with good use cases.  Corban's the master of this stuff on our team, and is CC'ed.
(In reply to comment #3)
> 
> My worry with a SIMD API is that SIMD widths change from ISA to ISA and from
> year to year, and tying the API to, say, 128-bit width is bad for forward
> compatibility.

The float4 instructions supported now will still be supported in future versions.  And we can add float8 later if necessary.

I don't have strong opinions about this; it's just that I can imagine how a float4-style approach would work, whereas I can't see how other approaches would work.  But I'm no guru on parallelism.
(In reply to comment #5)
> If you can guide us a bit, I'm sure we can supply you with good use cases. 
> Corban's the master of this stuff on our team, and is CC'ed.

The ideal use cases are hot loops, kernels from real code, that have been carefully written to run as fast as possible through JIT'ing and typed arrays.  Also, problems where you find yourself saying "I wish I could use X but it's just not fast enough" are great too.

The task on those would be to design a SIMD/vector API that wins perf where it wasn't obviously possible to win more, or that "changes the game" by enabling the use of expensive algorithms that just can't be used today.  (Like JITs did :).)
(In reply to comment #6)
> (In reply to comment #3)
> > 
> > My worry with a SIMD API is that SIMD widths change from ISA to ISA and from
> > year to year, and tying the API to, say, 128-bit width is bad for forward
> > compatibility.
> 
> The float4 instructions supported now will still be supported in future
> versions.  And we can add float8 later if necessary.

Then your kernel needs float4, float8, floatK implementations, and you need runtime detection of available float* extensions.

I'm not arguing that this is inherently bad; it's obviously what we have to do in C++.  It's just that if it's possible to expose a friendlier API to JS with equivalent (or better!) performance, that seems a preferable route.
It is _very_ important to decide on what our goals are. "SIMD" is a pretty generic term, and a lot of people mean a lot of different things when they use that word. If you want to speed up audio processing, you may accept one set of trade-offs, while if you want to allow a decently performing video codec (or Photoshop-equivalent) written in JavaScript, you'll need a very different set.

For audio processing things tend to be relatively straightforward, because you really are just applying operations across huge vectors (an FFT is a bit more complicated, but not much: you have to operate at varying strides, but you're still applying common butterfly kernels across a reasonably large vector).

For image processing things are a lot different.  You're typically working with integers, not floating point, dynamic range and type promotion become _hugely_ important, and there are lots of special-purpose instructions designed solely to make certain algorithms faster (PSADBW, PMOVMSKB, to name a few of the useful ones, MPSADBW, PHMINPOSUW to name some of the more ridiculous ones).  The available operations (including mixes of sizes and types) vary from platform to platform, and aren't always available on older processors, or are simply very slow, and emulation can be quite slow as well.  The challenge here is to avoid spending so much time shuffling your data around and packing and unpacking it to and from different bit-widths that you wipe out the benefits of parallelism.

Just to give one example, NEON lacks an 8x8->high 8 bits unsigned multiply (important for things like bilinear blending, alpha compositing, etc.), and the 8x8->16 bit multiply it does have has a latency of over 6 cycles. Meaning unless you have something else to do during those cycles, you pretty much might as well just do the multiplies with scalar code. At 16-bits things are a little better: there's a 16x16 -> bits [15:30] signed multiply (i.e., instead of the high 16 bits, it returns the 16 bits underneath the highest bit; this instruction was obviously designed by an audio DSP engineer who did a lot of Q15 fixed-point processing, and nothing else). You could almost make this do what you want with a few sacrifices (i.e., 7-bit blend weights would get the result in the right place, but requires fix-ups for using a signed vs. unsigned multiply), but nothing like it is available on Intel chips, and any algorithm you designed to use it would be slow to emulate there (and vice versa).
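
For reference, a scalar version of the blend that the missing NEON multiply makes awkward, in the usual 8-bit alpha-compositing formulation (exact rounding conventions vary):

  function blend8(src, dst, w) {
    // src, dst, w are bytes in [0, 255]; the final ">> 8" is the
    // "high 8 bits of a multiply" step that NEON lacks a cheap
    // instruction for.
    return (src * w + dst * (255 - w) + 128) >> 8;
  }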
What Mono did, and which got them a lot of mileage with the physics/game applications, was simply expose some types and specialize with SIMD when generating code that used them: http://tirania.org/blog/archive/2008/Nov-03.html

We could express different operations via SIMD.shuffle or SIMD.m16x16to16 and detect those at the JIT levels.  I don't think we need to make it a lot easier to write cross-platform SIMD stuff in JS than it is in C++, to start, so if people have to do a bunch of |"m16x16to16" in SIMD| tests to condition their algorithms, I would take that as an initial victory condition.
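
The feature-test-and-fallback pattern this suggests might look like the following (the operation name is the hypothetical one from above, and I'm assuming it means the high 16 bits of the 16x16 product):

  var mulHigh16;
  if (typeof SIMD !== "undefined" && "m16x16to16" in SIMD) {
    mulHigh16 = SIMD.m16x16to16;  // native, vectorizable
  } else {
    // Scalar fallback; exact, since a 16x16 product fits in a double.
    mulHigh16 = function (a, b) { return ((a * b) >>> 16) & 0xffff; };
  }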
(In reply to comment #9)
> It is _very_ important to decide on what our goals are. "SIMD" is a pretty
> generic term, and a lot of people mean a lot of different things when they use
> that word. If you want to speed up audio processing, you may accept one set of
> trade-offs, while if you want to allow a decently performing video codec (or
> Photoshop-equivalent) written in JavaScript, you'll need a very different set.

Agreed, and with the rest.  I don't think it's wise to take the union of every SIMD ISA under the sun and try to emulate what's not natively supported.  I'm not really sure initially what trade-offs we want to make; it's probably best to start with a set of kernels we want to ensure are fast and let those guide the design.  We have what humph posted, and there's also [1] and [2],
which overlap I'm sure.  I think another reasonable target would be to enable writing a JS-BLAS kind of library that's not embarrassingly slow compared to BLAS.  (I'd also love to allow using Workers to optimize something like JS-BLAS, but that's another bug.)  There's been some rumble about adding a matrix-multiplication web API, but the OS-designer side of me would prefer to add the primitives needed to write something like that in JS.

[1] http://people.mozilla.com/~vladimir/demos/darkroom/darkroom.html
[2] http://people.mozilla.com/~vladimir/demos/jsfft/jsfft.html

(In reply to comment #10)
> What Mono did, and which got them a lot of mileage with the physics/game
> applications, was simply expose some types and specialize with SIMD when
> generating code that used them:
> http://tirania.org/blog/archive/2008/Nov-03.html

Very interesting, thanks.  (And moar benchmarks to test against!)

> We could express different operations via SIMD.shuffle or SIMD.m16x16to16 and
> detect those at the JIT levels.  I don't think we need to make it a lot easier
> to write cross-platform SIMD stuff in JS than it is in C++, to start, so if
> people have to do a bunch of |"m16x16to16" in SIMD| tests to condition their
> algorithms, I would take that as an initial victory condition.

That's a fair point, and is certainly better scoped.  I suspect we would still win in many cases by emitting optimal scalar-instruction emulations, but I guess that's still orthogonal to |"m16x16to16" in SIMD|-kinds of tests, since we'll probably want to evolve the set.
(In reply to comment #11)
> There's been some rumble about adding a
> matrix-multiplication web API, but the OS-designer side of me would prefer to
> add the primitives needed to write something like that in JS.

At least in the AAA-and-similar game development community, there is not much consensus on how matrix libraries should look.  Standardizing them is a quixotic undertaking periodically brought up by some purist middleware person, and I've only seen it end in tears.
(In reply to comment #3)
> My worry with a SIMD API is that SIMD widths change from ISA to ISA and from
> year to year, and tying the API to, say, 128-bit width is bad for forward
> compatibility.

FWIW Intel's new AVX instructions already support 256-bit widths.

(In reply to comment #10)
> We could express different operations via SIMD.shuffle or SIMD.m16x16to16 and
> detect those at the JIT levels.  I don't think we need to make it a lot easier
> to write cross-platform SIMD stuff in JS than it is in C++, to start, so if
> people have to do a bunch of |"m16x16to16" in SIMD| tests to condition their
> algorithms, I would take that as an initial victory condition.

Won't there be interop problems? People writing JS code that only runs on SSE>=2? Urgh.

I think we might have to limit the scope to just operations that can be implemented reasonably efficiently everywhere.
(In reply to comment #13)
> (In reply to comment #3)
> > My worry with a SIMD API is that SIMD widths change from ISA to ISA and from
> > year to year, and tying the API to, say, 128-bit width is bad for forward
> > compatibility.
> 
> FWIW Intel's new AVX instructions already support 256-bit widths.

I know.

> (In reply to comment #10)
> > We could express different operations via SIMD.shuffle or SIMD.m16x16to16 and
> > detect those at the JIT levels.  I don't think we need to make it a lot easier
> > to write cross-platform SIMD stuff in JS than it is in C++, to start, so if
> > people have to do a bunch of |"m16x16to16" in SIMD| tests to condition their
> > algorithms, I would take that as an initial victory condition.
> 
> Won't there be interop problems? People writing JS code that only runs on
> SSE>=2? Urgh.

Most basic instructions should be efficiently emulate-able.  It's not clear yet that we need to add anything exotic.

> I think we might have to limit the scope to just operations that can be
> implemented reasonably efficiently everywhere.

I don't disagree.  But, as I mentioned above, if we ever decided to grow the instruction set, authors would need these kinds of tests.
(In reply to comment #13)
> Won't there be interop problems? People writing JS code that only runs on
> SSE>=2? Urgh.

Another option would be to put a .accelerated property on all of the operations, so that people could still tune, but could also just use emulated (read: omg slow) implementations if a platform doesn't natively support the operation.

Mono.SIMD provides CIL backups for its (limited, admittedly) set of SIMD-aware operations, and would probably be a good start.

(Inferring SIMD-able types for addition and such might even be feasible in some cases; I don't quite understand from their example why it's not possible.)
dsp.js https://github.com/corbanbrook/dsp.js contains DFT and 2 separate FFT algos (FFT and RFFT). The repo contains tests and benchmarks for these classes.

Efforts have been made to keep these on trace.
A couple of hand-wavy thoughts:

 - It might be better to explicitly version instruction sets, which would be cumulative, instead of adding instructions piecemeal.  Then authors could do something like
  if (!("SIMD" in window)) slow();
  else if (SIMD.version < 2) fast();
  else if (SIMD.version < 3) faster();
  ...

 - I'm not convinced that scalar emulations of common SIMD instructions will necessarily be omgslow.  At best, when data dependencies work out well, the author essentially just told us to unroll their inner loop K times, and we could reorder the scalar instructions to hopefully win (sketch below).  At worst, with complicated instructions or inconvenient data dependencies, we could hit omgslow even with simple emulations, because of bad register allocation and instruction scheduling.  Not sure off-hand which would be more common.
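
A minimal sketch of that unrolling point, using a hypothetical float4 add emulated with scalars (four independent operations that an instruction scheduler is free to interleave):

  function emulatedFloat4Add(d, a, b) {
    d[0] = a[0] + b[0];
    d[1] = a[1] + b[1];
    d[2] = a[2] + b[2];
    d[3] = a[3] + b[3];
  }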
(In reply to comment #12)
> At least in the AAA-and-similar game development community, there is not much
> consensus on how matrix libraries should look.  Standardizing them is a
> Quixotian undertaking periodically brought up by some purist middleware person,
> and I've only seen it end in tears.

+1. As someone who wrote a matrix library and spent a lot of time discussing its API, I confirm that there is endless room for debate there, whereas web APIs need at least a baseline of consensus.
Adding some SIMD intrinsics that are general enough to be equally implementable on SSE2/NEON and have good non-SIMD fallbacks is a good idea, and is AFAICS the only useful thing to do here.

As Tim says, SIMD support for integers is very sparse (e.g. multiplication of packets of 4 32-bit ints was only added in SSE4!) so I think that it would be a lot more practical to sacrifice integers and focus on floating point.  That would make everything easier to implement and give more consistent performance, at the cost of giving up on certain use cases.

For users who want more than just low-level intrinsics, there's going to be WebCL anyway.

I would like to stress that any high-level use case, e.g. a matrix library, is going to need compiler collaboration to be really efficient [1].  That would make this an unreasonably huge endeavour.  Meanwhile, WebCL offers a nice approach and brings its own specialized computation kernel compiler, so it solves the compiler problem.

I still agree that "writing a JS-BLAS that's not embarrassingly slow" would be a worthwhile design goal here to guide the design of this intrinsics system, but to be clear WebCL is going to be the right tool for implementing a BLAS.

[1]: If you want to be efficient with arbitrary arithmetic expressions involving arrays, you need to collaborate with the compiler to let it know how to e.g. eliminate temporaries, traverse arrays, etc. This is typically done in C++ using template metaprogramming. Doing that in JS would require built-in support in the compiler, so that doesn't seem realistic to me. Meanwhile WebCL brings a dedicated language and compiler specialized for this use case.
(In reply to comment #8)
> (In reply to comment #6)
> > (In reply to comment #3)
> > > 
> > > My worry with a SIMD API is that SIMD widths change from ISA to ISA and from
> > > year to year, and tying the API to, say, 128-bit width is bad for forward
> > > compatibility.
> > 
> > The float4 instructions supported now will still be supported in future
> > versions.  And we can add float8 later if necessary.
> 
> Then your kernel needs float4, float8, floatK implementations, and you need
> runtime detection of available float* extensions.

I think that we really don't want to get there! Instead, we should focus on a small set of things that can be supported everywhere.

SIMD instruction sets will just have to keep supporting small 128-bit packets because many very important use cases require them; for example, float[4] vectors are hugely important to games, and that alone guarantees that SIMD support for 128-bit packets isn't going away anytime soon.

So, a packet-of-4-floats type would have long-term SIMD support.

Then we could also have a packet-of-8-floats type available everywhere, using emulation with 2 registers if needed. Surely the compiler would be able to reorder instructions appropriately, making this fast everywhere.

I wouldn't consider even wider packets ("floatK") at this point, as emulation using multiple registers would be less and less practical as the packet size increases.
Thanks for bringing up WebCL.  It's not clear to me how WebCL and a SIMD extension for JS like what's proposed here would interact.  Let's chat about this soon-ish.

(In reply to comment #20)
> (In reply to comment #8)
> > (In reply to comment #6)
> > > (In reply to comment #3)
> > > > 
> > > > My worry with a SIMD API is that SIMD widths change from ISA to ISA and from
> > > > year to year, and tying the API to, say, 128-bit width is bad for forward
> > > > compatibility.
> > > 
> > > The float4 instructions supported now will still be supported in future
> > > versions.  And we can add float8 later if necessary.
> > 
> > Then your kernel needs float4, float8, floatK implementations, and you need
> > runtime detection of available float* extensions.
> 
> I think that we really don't want to get there! Instead, we should focus on a
> small set of things that can be supported everywhere.

What you propose in the rest of this comment sounds like "going there" :).  I don't think we disagree though.
(In reply to comment #21)
> Thanks for bringing up WebCL.  It's not clear to me how WebCL and a SIMD
> extension for JS like what's proposed here would interact.  Let's chat about
> this soon-ish.

I don't think that WebCL and JS.SIMD would really interact directly; instead, they would be two separate ways of performing computations on JS Typed Arrays.

> 
> (In reply to comment #20)
> > (In reply to comment #8)
> > I think that we really don't want to get there! Instead, we should focus on a
> > small set of things that can be supported everywhere.
> 
> What you propose in the rest of this comment sounds like "going there" :).  I
> don't think we disagree though.

What I was referring to was your proposal to let the JS programmer query the SIMD packet size at runtime and act accordingly. I was saying that I disagree about having JS programmers do this kind of low-level platform detection. My argument is that we don't have to put that burden on the JS programmer, since it is actually possible to support the same packet sizes everywhere: 16 bytes is supported everywhere and will remain so; 32 bytes could be emulated with 2 registers on platforms that only have 16 bytes registers. Also, my fear with having the JS programmer do runtime packet size detection is that many JS programmers won't bother and will end up writing non-portable code.
(In reply to comment #22)
> (In reply to comment #21)
> > Thanks for bringing up WebCL.  It's not clear to me how WebCL and a SIMD
> > extension for JS like what's proposed here would interact.  Let's chat about
> > this soon-ish.
> 
> I don't think that WebGL and JS.SIMD would really interact directly; instead,
> they would be two separate ways of performing computations on JS Typed Arrays.

I meant whether the goals overlap enough that it would be better to just invest in WebCL (I'm not very familiar with OpenCL).  Judging by https://bitbucket.org/tallion/webclworker/wiki/Home, that won't be a feasible option cross-platform for a long time.

> > 
> > (In reply to comment #20)
> > > (In reply to comment #8)
> > > I think that we really don't want to get there! Instead, we should focus on a
> > > small set of things that can be supported everywhere.
> > 
> > What you propose in the rest of this comment sounds like "going there" :).  I
> > don't think we disagree though.
> 
> What I was referring to was your proposal to let the JS programmer query the
> SIMD packet size at runtime and act accordingly. I was saying that I disagree
> about having JS programmers do this kind of low-level platform detection. My
> argument is that we don't have to put that burden on the JS programmer, since
> it is actually possible to support the same packet sizes everywhere: 16 bytes
> is supported everywhere and will remain so; 32 bytes could be emulated with 2
> registers on platforms that only have 16 bytes registers. Also, my fear with
> having the JS programmer do runtime packet size detection is that many JS
> programmers won't bother and will end up writing non-portable code.

Ah, I misunderstood "Then we could also have a packet-of-8-floats" to mean "In the future we could also ...".

What you're proposing isn't forwards compatible.  Arguing against forwards compatibility because "JS programmers sux0r" isn't very compelling IMHO; we expect them to use feature detection all over the place.  And this is an API targeted at very advanced users.  Note too that we might want to expand this to support more kinds of instructions, not just increased width.  Maybe.

A more interesting argument IMHO might be that by the time we start seeing 512-bit-wide SIMD on CPUs, WebCL and platform/device support will be mature enough to cover the most interesting use cases of |window.SIMD|.  I suspect there'll be room for both, but if WebCL is the forwards compatibility plan then I'd be less concerned about forwards compatibility for window.SIMD.
Feature detection is kind of a stop-gap measure that Web developers often fail to use correctly. If we can avoid it, we should.

I like the plan of offering 128-bit and 256-bit primitives everywhere. If/when 512-bit primitives appear, we can add them. It would make sense to also offer a hint as to what the largest "fast" size is.
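
Concretely, that hint could be as simple as a property (name purely illustrative):

  var bits = SIMD.maxFastWidth;  // e.g. 128 on plain SSE2/NEON, 256 with AVX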
Blocks: 661624
Just out of curiosity, to anyone here who may know: how much would this sort of feature benefit projects like this: http://badassjs.com/post/12035631618/broadway-an-h-264-decoder-in-javascript-running-at
Assignee: general → evilpies
I am going to write a blog post once I really start working on this, but for now here is one example.

With the help of TI, this is able to inline simd.add(a, 0, b, 0, n^2) (scalar sketch below).  The main takeaways, as you might expect, are:
a) JM + TI is already very fast with typed arrays
b) we need to hoist as much as possible (index + length check, maybe the data slot, too)
c) if we have huge arrays SSE really pays off (still some win possible by loop unrolling and more clever code)
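
For clarity, a scalar reading of the simd.add call above; the signature (dst, dstOffset, src, srcOffset, count) is inferred from the comment, and in-place accumulation is an assumption:

  function simd_add(dst, dstOff, src, srcOff, count) {
    for (var i = 0; i < count; i++)
      dst[dstOff + i] += src[srcOff + i];
  }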
Are you planning to add vector primitives like in comment #20?
Currently no; I like the approach of using typed arrays and a count parameter.  This way we might be able to optimize more, people would not have to reason about the performance details themselves, and we as compiler writers can do our best.
River Trail is basically this approach, isn't it? Perhaps we should just take that?

Personally I think that relying on heroic compiler optimizations like large-scale loop fusion will be less appealing for many people than hand-writing vector code. The latter should give more predictable performance.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #29)
> River Trail is basically this approach, isn't it? Perhaps we should just
> take that?

RiverTrail goes farther: it works on immutable array objects (ParallelArray), which allows for more compiler optimization, and it brings a dedicated not-quite-javascript-subset language that it can compile to OpenCL kernels. See recent discussions on dev-platform.

Without the dedicated language and the immutability of arrays, I don't believe that it's possible to offer a useful, efficient arrays-computation API for JavaScript. Indeed, you could offer some primitives, but arrays would be retraversed for each primitive. It would be very hard to have the compiler optimize across multiple primitives to do array traversal in the right way.

That's why I've been advocating here that we focus on just plain SIMD intrinsics, on 128/256-bit vectors ("packets") matching the size of a register, leaving all the array traversal up to the JS programmer. The spot for fast JS array libraries is already taken by the RiverTrail and WebCL proposals.

> Personally I think that relying on heroic compiler optimizations like
> large-scale loop fusion will be less appealing for many people than
> hand-writing vector code. The latter should give more predictable
> performance.

Agree; I don't even think that the latter would be practical at all, because of the useless temporaries and lack of control on array traversal.

Another aspect, that is hugely important for gaming, is tiny objects. People will want to implement fast quaternion multiplication, fast 4x4 matrix inverse, etc. This can easily be done using plain SIMD intrinsics as I've been advocating; but falls outside of the scope of a runtime-size-arrays library.
(In reply to Benoit Jacob [:bjacob] from comment #30)
> (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #29)
> > River Trail is basically this approach, isn't it? Perhaps we should just
> > take that?
> 
> RiverTrail goes farther: it works on immutable array objects (ParallelArray)
> which allows for more compiler optimization, and it brings a dedicated
> > not-quite-javascript-subset language that it can compile to OpenCL kernels.
> See recent discussions on dev-platform.
> 

RiverTrail and simpler SIMD stuff are similar, but both are useful for different things I think.

Regarding RiverTrail, I believe I heard from dherman that the immutability was actually not necessary. Also a new type, ParallelArray, is not needed. Instead, it is enough to just add a method like .parallelMap to typed arrays. So when you have

  myTypedArray.parallelMap(function(x) { .. });

then if the lambda is a subset of JavaScript that is safe to compile in a parallel way, it will be optimized, otherwise it will run normally. In both cases a new array is returned (which is why immutability is not necessary).
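
For example (hypothetical usage of the suggested method):

  // Returns a new typed array; the engine may run the lambda in
  // parallel if it is in the safe subset, else serially.
  var doubled = myTypedArray.parallelMap(function (x) { return 2 * x; });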
(In reply to Benoit Jacob [:bjacob] from comment #30)
> > Personally I think that relying on heroic compiler optimizations like
> > large-scale loop fusion will be less appealing for many people than
> > hand-writing vector code. The latter should give more predictable
> > performance.
> 
> Agree; I don't even think that the latter would be practical at all, because
> of the useless temporaries and lack of control on array traversal.

Just as a matter of perspective, vectorization was first added to gcc over four years ago, and after four years of work, pretty much the only loops it can vectorize in most people's code are simple initialization loops that set an array to a constant value. The FFmpeg/libav people ran some tests recently and found enabling auto-vectorization actually regressed performance by something like 30%. Heroic compiler optimizations sounds like an okay avenue of exploration for a "nice to have", but I don't think we should count on them as plan A.
(In reply to Timothy B. Terriberry (:derf) from comment #32)
> (In reply to Benoit Jacob [:bjacob] from comment #30)
> > > Personally I think that relying on heroic compiler optimizations like
> > > large-scale loop fusion will be less appealing for many people than
> > > hand-writing vector code. The latter should give more predictable
> > > performance.
> > 
> > Agree; I don't even think that the latter would be practical at all, because
> > of the useless temporaries and lack of control on array traversal.
> 
> Just as a matter of perspective, vectorization was first added to gcc over
> four years ago, and after four years of work, pretty much the only loops it
> can vectorize in most people's code are simple initialization loops that set
> an array to a constant value. The FFmpeg/libav people ran some tests
> recently and found enabling auto-vectorization actually regressed
> performance by something like 30%. Heroic compiler optimizations sounds like
> an okay avenue of exploration for a "nice to have", but I don't think we
> should count on them as plan A.

River Trail, IIUC, doesn't involve auto-vectorization or heroic optimization in general. It's about taking things like a |map| call and parallelizing it in a relatively straightforward way.

I agree that by definition heroic compiler optimizations are researchy and not something we want to develop in the near future.
(In reply to Alon Zakai (:azakai) from comment #31)
> (In reply to Benoit Jacob [:bjacob] from comment #30)
> > (In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #29)
> > > River Trail is basically this approach, isn't it? Perhaps we should just
> > > take that?
> > 
> > RiverTrail goes farther: it works on immutable array objects (ParallelArray)
> > which allows for more compiler optimization, and it brings a dedicated
> > not-quite-javascript-subset language that it can compile to OpenCL kernels.
> > See recent discussions on dev-platform.
> > 
> 
> RiverTrail and simpler SIMD stuff are similar, but both are useful for
> different things I think.
> 
> Regarding RiverTrail, I believe I heard from dherman that the immutability
> was actually not necessary. Also a new type, ParallelArray, is not needed.
> Instead, it is enough to just add a method like .parallelMap to typed
> arrays. So when you have
> 
>   myTypedArray.parallelMap(function(x) { .. });
> 
> then if the lambda is a subset of JavaScript that is safe to compile in a
> parallel way, it will be optimized, otherwise it will run normally. In both
> cases a new array is returned (which is why immutability is not necessary).

I think the key property they want is that things can be parallelized without (a) crashing, (b) changing the output, and (c) thinking about it too hard. So, for map, for example, you want to know that the closure argument doesn't mutate anything, including the input array. In that case, it can be run in parallel without risk of bad stuff.

So yes, it seems that immutability is not a fundamental requirement. I can't quite remember now why they made it immutable. It might have been so that users don't even expect to be able to mutate such arrays in closure arguments. Or maybe it was related to the idea that the implementation would use some special buffer (OpenCL or something) that couldn't be conveniently mutated by regular sets.
(In reply to David Mandelin from comment #33)
> River Trail, IIUC, doesn't involve auto-vectorization or heroic optimization
> in general. It's about taking things like a |map| call and parallelizing it
> in a relatively straightforward way.

Yes, that's what it does, but a key thing to keep in mind is that few real-world problems can be solved by just one array operation like that; instead, real-world problems require performing a /sequence/ of such array operations, and the key to efficiency is to remove temporaries and minimize the number of times that arrays are traversed.

This is already crucial on a CPU, where one wants to avoid unnecessary memory accesses (to/from useless temporary buffers). And when one targets GPUs this is even more important, as the cost of transferring data to the GPU is very high, and the cost of getting data back, syncing with the GPU, is far higher still.

For example (coming back to CPU world for simplicity), it is crucial to avoid generating code like

    for (var i = 0; i < n; i++) tmp[i] = input1[i] + input2[i];
    for (var i = 0; i < n; i++) result[i] = tmp[i] * input3[i];

and instead generate

    for (var i = 0; i < n; i++) result[i] = (input1[i] + input2[i]) * input3[i];

I was initially skeptical that RiverTrail would be able to do that (let alone the GPU case), so I asked questions in this dev-platform thread, and Rick's and Brendan's answers, IIUC, were about how immutability was key to allowing such optimizations (including running fast on a GPU by avoiding memory transfers for each RiverTrail statement). See dev-platform emails from 24--25 September, "RiverTrail, parallelizability, and sync points".
(In reply to Alon Zakai (:azakai) from comment #31)
> Regarding RiverTrail, I believe I heard from dherman that the immutability
> was actually not necessary. Also a new type, ParallelArray, is not needed.

Regardless of immutability, in the case of RiverTrail, a new type _is_ needed, if only to abstract the fact that it might live on the GPU side; as I tried to explain in my previous comment, the key to performance is to allow it to stay on the GPU across multiple RiverTrail commands instead of having to read it back every time.
> Yes, that's what it does, but a key thing to keep in mind is that few real-world problems can be solved by just one array operation like that; instead, real-world problems require performing a /sequence/ of such array operations, and the key to efficiency is to remove temporaries and minimize the number of times that arrays are traversed.

Some of this can be done without immutability in the RiverTrail approach, if

  vec2 = vec1.parallelMap(..).parallelMap(..).parallelMap(..);

(say, multiple passes of a blurring transformation) then it is clear that no temporary state needs to be saved, and this can be optimized more than if the intermediary stages were saved to a variable. So all the work could be done on the GPU. Immutability helps, but you can still do stuff without it. I don't know how much that changes things, though.
(In reply to Alon Zakai (:azakai) from comment #37)
> > Yes, that's what it does, but a key thing to keep in mind is that few real-world problems can be solved by just one array operation like that; instead, real-world problems require performing a /sequence/ of such array operations, and the key to efficiency is to remove temporaries and minimize the number of times that arrays are traversed.
> 
> Some of this can be done without immutability in the RiverTrail approach, if
> 
>   vec2 = vec1.parallelMap(..).parallelMap(..).parallelMap(..);

In RiverTrail, anything that could be done by such a chain of parallelMap(), could also be done with a single parallelMap(), as you can just put more code in a single RiverTrail kernel.

So this doesn't even begin to touch the case where multiple RiverTrail operations would be needed. An example of such a case would be implementing a matrix product (GEMM) or some other parallel matrix algorithm.
In some cases one might do such chaining for readability (to not write one huge function), but I agree that that is not the typical case. The examples you just gave are clearly more important.
As with bug 661624, it doesn't look like I can do much here. It might be interesting to implement RiverTrail in Ion, though.
Assignee: evilpies → general
I believe that this conversation is obsoleted by asm.js now. Shall we close it? Is there a bug filed already to have simd-friendly constructs in asm.js?
I think RiverTrail is a closer fit for SIMD/vector operations in JS, basically if the parallel map is run on a simple kernel, it should be compiled into SIMD. (asm.js doesn't have a plan for SIMD type stuff yet, it's far too early.)
There is not a bug atm; we are still working on v.1.  Also, asm.js doesn't necessarily obviate the need for builtin functions.  Take e.g. the proposal to add Math.imul (bug 808148); there was no good way to express int32 multiplication in pure JS.  With SIMD, it's not clear we could express all SIMD operations as pure JS patterns or whether we'd want to add new builtins that would be incorporated by asm.js.
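
For context, this is the well-known pure-JS fallback for int32 multiplication (the standard polyfill pattern, not code from this bug); the operand-splitting it requires is why a builtin was wanted:

  function imul(a, b) {
    // (a * b) | 0 is wrong once the exact product exceeds 2^53, so
    // split each operand into 16-bit halves.
    var ah = (a >>> 16) & 0xffff, al = a & 0xffff;
    var bh = (b >>> 16) & 0xffff, bl = b & 0xffff;
    return ((al * bl) + (((ah * bl + al * bh) << 16) >>> 0)) | 0;
  }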
We really need a plan here, as WebGL on mobile can easily burn a lot of time doing matrix multiplication.
We'll get a lot of the way here with asm.js and rivertrail, but I still think this would be useful.
Blocks: b2g-v-next
Access to a SIMD engine is needed for competitive parity on major use cases, but whether we get it through something more like a direct API or more like rivertrail is an "implementation detail".
Whiteboard: [tech-p1]
W3C DSP group looking at this, so is Ecma TC39. See bug 894105.

/be
Assignee: general → nobody
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE