Closed Bug 486800 Opened 16 years ago Closed 12 years ago

nanojit: Add vector instruction support

Categories

(Core Graveyard :: Nanojit, enhancement)

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: cjones, Unassigned)

Soon, some platform code will start using vector parallelism.  This is likely to happen first in the charset converters.  Dealing with vector/SIMD extensions in a cross-platform way is a real pain in the @$$, so for engineering reasons, it would be great if this code could be centralized.  nanojit seems to be a good place to put this code, and additionally, it could allow some JavaScript codes to be auto-vectorized.

The usage I have in mind is something like:

  class CharsetConvertUTF16ToUTF8 {
    LIRInstruction LIRVectorConvert[] = {
      // vectorized converter code
    };
    Fn* VectorConvert;
    
    CharsetConvertUTF16ToUTF8 () {
      VectorConvert = nanojit.compile(LIRVectorConvert);
    }

    UTF8Str Convert(UTF16String str) {
      // setup args
      VectorConvert(str);
      // postprocess result
      return UTF8Str(convertedData);
    }
  };

where |LIRVectorConvert| is a hand-written LIR subroutine.

This seems to entail:

  (1) Extend the LIR instruction set with vector instructions (a non-trivial design problem)
  (2) Detect available CPU vector/SIMD features
  (3) Emit vector/SIMD instructions when possible, tight loops around scalar instructions when not
  ??? (4) Add an auto-vectorization pass to the nanojit compiler

Items (1)-(3) would satisfy me.  Item (4) might be useful to SpiderMonkey guys, I don't know.
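As a rough illustration of items (2) and (3), here is a minimal C++ sketch of feature detection plus dispatch to either a vector path or a tight scalar loop. It is ordinary ahead-of-time C++, not nanojit or LIR; the names `add_bytes_scalar`, `add_bytes_sse2`, `cpu_has_sse2` and `select_add_bytes` are hypothetical, and the runtime probe assumes GCC or Clang on x86 (anywhere else the sketch simply falls back to the scalar loop).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Scalar fallback: a tight loop around scalar adds.
void add_bytes_scalar(const std::uint8_t* a, const std::uint8_t* b,
                      std::uint8_t* out, std::size_t n) {
    assert(n == 0 || (a && b && out));
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<std::uint8_t>(a[i] + b[i]);
}

#if defined(__SSE2__)
// Vector path: 16 bytes per iteration, scalar loop for the tail.
void add_bytes_sse2(const std::uint8_t* a, const std::uint8_t* b,
                    std::uint8_t* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i),
                         _mm_add_epi8(va, vb));
    }
    add_bytes_scalar(a + i, b + i, out + i, n - i);
}
#endif

// Hypothetical probe: does the running CPU report SSE2?
// Assumption: GCC/Clang builtin on x86; other targets report "no".
bool cpu_has_sse2() {
#if defined(__SSE2__) && defined(__GNUC__) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;
#endif
}

using AddBytesFn = void (*)(const std::uint8_t*, const std::uint8_t*,
                            std::uint8_t*, std::size_t);

// Pick the implementation once; callers only see the function pointer.
AddBytesFn select_add_bytes() {
#if defined(__SSE2__)
    if (cpu_has_sse2()) return add_bytes_sse2;
#endif
    return add_bytes_scalar;
}
```

A JIT would make the same decision at code-generation time instead of through a function pointer, but the shape of the choice is the same.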

I don't mind doing this work, but it's low on my TODO list right now.  If anyone else is interested, please take it.
Summary: nanojit: Add vector instruction support to LIR → nanojit: Add vector instruction support
I think we'll probably want to do this at the same time that we create the first SIMD-using nanojit application. I'm glad you filed this now, though, because the LIR binary recoding might be redesigned soon, and it will be good to have this in mind. 

Do you have any sample programs using SIMD that could at least start the thinking process?
> Do you have any sample programs using SIMD that could at least start the
> thinking process?

Sure, I can write out a few vectorized string/charset processing codes.  That's what I'll be interested in first (down the road).

The biggest design obstacle I see here is how to handle the vector width.  We have three options:

 (1) Fixed width (probably 16 bytes, since that's the common width today)
 (2) Specify the width through an immediate param to the instruction, or if the number of instruction operands is fixed and too small, through a 'vwidth' instruction.  For example:
      vaddw 16 target src1 src2
   --or--
      vsetw 16
      vaddw vtarget vsrc1 vsrc2
 (3) Fully data-dependent vector width, like some older vector processors would handle (and some people at Berkeley and MIT are reincarnating).  This is like the example above, except the vector width doesn't have to be an immediate param.

(Designing the vector registers/stack might also be a pain, but should be easier if the width problem is solved first.)

After some thought this weekend, I lean towards option (2) above.  I think it's a good tradeoff between portability/future-proofing and ease of compilation.

But I agree that the discussion would be more productive with some concrete examples in mind.  This is now on my todo list.
I would lean towards (1). If we see a move towards 32 byte vectors, I think that move will happen across the board, at which point we can switch internally to 32 byte vectors. There is an inherent overhead for non-fixed width vectors; why take that hit for the possibility that at some point in the future there might be a switch? It will already be hard enough to find a reasonable common API for all the different vector instruction sets. So my vote is for (1), with a very clear portable use of a VECTOR_WIDTH constant which can easily be bumped in the future if needed.
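To make the "portable use of a VECTOR_WIDTH constant" idea concrete, here is a small C++ sketch of client code written only in terms of that constant. Apart from the name `VECTOR_WIDTH`, everything here (`block_is_ascii`, `ascii_prefix_len`) is hypothetical, and the per-block loop stands in for what would be a single vector operation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// The one place the width is stated; bumping it to 32 changes nothing below.
constexpr std::size_t VECTOR_WIDTH = 16;

// True if every byte in one VECTOR_WIDTH-sized block is 7-bit ASCII.
// OR-reducing the block and testing the high bit mirrors a vector OR + mask.
bool block_is_ascii(const std::uint8_t* p) {
    std::uint8_t acc = 0;
    for (std::size_t i = 0; i < VECTOR_WIDTH; ++i) acc |= p[i];
    return (acc & 0x80) == 0;
}

// Length of the leading ASCII-only prefix of s[0..n).
// Whole blocks are skipped first; the remainder is scanned byte by byte.
std::size_t ascii_prefix_len(const std::uint8_t* s, std::size_t n) {
    assert(s != nullptr || n == 0);
    std::size_t i = 0;
    while (i + VECTOR_WIDTH <= n && block_is_ascii(s + i)) i += VECTOR_WIDTH;
    while (i < n && s[i] < 0x80) ++i;
    return i;
}
```

Finding the ASCII prefix is a typical first step in a charset converter's fast path, which is why it is used as the example here.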
(In reply to comment #3)
> There is an inherent overhead for non-fixed width vectors,
> why take that hit for the possibility that at some point in the future there
> might be a switch. It will be already hard enough to find a reasonable common
> API for all the different vector instruction sets. 

Are you talking about an inherent compile-time or run-time overhead?  Option (2) imposes a compile-time overhead, but probably no worse than that of compiling (1) on a machine without vector instructions (assuming that the immediate vector width is limited to multiples of 16 bytes).  Option (3) definitely has a potentially non-trivial run-time penalty, agreed.

> So my vote is for 1), with a
> very clear portable use of a VECTOR_WIDTH constant which can easily be bumped
> in the future if needed.

This is fair, and suits me.  For the sake of argument, if (2) doesn't incur a run-time penalty, and only a vanishing compile-time penalty, would you still lean towards (1)?
I think (2) has a compile-time and implementation-time penalty since there is no guarantee the vector sizes are identical across code. So every instruction has to check and compensate for vector size. Also, I would question the usefulness of switching around vector sizes. At most I would do (2) using a code-local (but not per-instruction) vector size configuration word (i.e. per trace or fragment or something).
(In reply to comment #5)
> I think (2) has a compile-time and implementation-time penalty since there is
> no guarantee the vector sizes are identical across code. So every instruction
> has to check and compensate for vector size. Also, I would question the
> usefulness of switching around vector sizes. At most I would do (2) using a
> code-local (but not per-instruction) vector size configuration word (i.e. per
> trace or fragment or something).

I imagined a lowering pass to eliminate variable vector widths, emitting multiple instructions of the architecture's width when the variable width is wider than the architecture width.  (I just realized that a |vsetwidth| instruction couldn't exist, since it would make the vector width context and path sensitive.  The width would have to be immediate to each vector instruction.)  This seems to be approximately what would be necessary for machines without vector instructions even if the vector width is fixed.
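A minimal C++ sketch of such a lowering pass follows. The instruction structs, the `Op` enum and `NATIVE_WIDTH` are all hypothetical and are not nanojit's representation; operands are assumed to be byte offsets into some vector stack area, and the per-instruction width is assumed to be an immediate that is a multiple of the native width.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Op { VAdd, VSub };

// Hypothetical variable-width instruction: the width is an immediate.
struct WideIns {
    Op op;
    std::size_t width;            // bytes, multiple of NATIVE_WIDTH
    std::size_t dst, src1, src2;  // byte offsets into a vector stack area
};

// Hypothetical fixed-width instruction as the backend would see it.
struct NativeIns {
    Op op;
    std::size_t dst, src1, src2;
};

constexpr std::size_t NATIVE_WIDTH = 16;

// Lower each wide instruction into width / NATIVE_WIDTH native ones,
// advancing every operand offset by one native vector per step.
std::vector<NativeIns> lower(const std::vector<WideIns>& in) {
    std::vector<NativeIns> out;
    for (const WideIns& w : in) {
        assert(w.width > 0 && w.width % NATIVE_WIDTH == 0);
        for (std::size_t off = 0; off < w.width; off += NATIVE_WIDTH)
            out.push_back({w.op, w.dst + off, w.src1 + off, w.src2 + off});
    }
    return out;
}
```

Because the width travels with each instruction, the pass needs no context or path information, which is the property a separate width-setting instruction would lose.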

The usefulness of a variable width is another issue.  Since I can't think of a compelling example at the moment, I'll stick with option (1) in examples.
If our target ISAs all have 16-byte vector instructions as a de facto standard, then I don't think we gain much with option (2) over option (1) that we couldn't gain later, as well.  And in the meantime, it seems simpler to have the data-width baked into the instructions just like it is for the existing 4-byte and 8-byte instructions.

My second thought is that you're not going to want to write

    LIRInstruction LIRVectorConvert[] = {
      // vectorized converter code
    };

by hand, with hex numbers, right?  This argues at least for a LIR assembler that generates snippets of LIR code, for use cases like this.

(I'm aware of Graydon's prototype but not the latest status)...
> My second thought is that you're not going to want to write
> 
>     LIRInstruction LIRVectorConvert[] = {
>       // vectorized converter code
>     };
> 
> by hand, with hex numbers, right? 

An assembler would be great, but if it's too much work, I wouldn't mind writing something like

  LIRInstruction fn[] = {
    { LIR_VADDW, ... },
    ...
  };

(If that's possible with your current representation ... I admit to complete ignorance of the internals of nanojit.)

But yeah, hex is not going to cut it.
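For illustration, the table-of-structs style mentioned above could look like the following C++ sketch. Only the name `LIR_VADDW` comes from this thread; the other opcodes, the field layout and the toy `interpret` function are hypothetical and have nothing to do with nanojit's real LIR encoding. The interpreter exists only to show that such a table is directly usable.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

enum LOpcode : std::uint8_t { LIR_VLOAD, LIR_VADDW, LIR_VSTORE };

// Hypothetical fixed-size record: opcode plus three small operands.
struct LIRInstruction {
    LOpcode op;
    std::uint8_t dst, a, b;
};

// Eight 16-bit "words" make one 16-byte vector.
using Vec = std::array<std::uint16_t, 8>;

const LIRInstruction fn[] = {
    { LIR_VLOAD,  0, 0, 0 },  // v0 = mem[0]
    { LIR_VLOAD,  1, 1, 0 },  // v1 = mem[1]
    { LIR_VADDW,  2, 0, 1 },  // v2 = v0 + v1, lane by lane
    { LIR_VSTORE, 2, 2, 0 },  // mem[2] = v2
};

// Toy interpreter over four vector registers, for demonstration only.
void interpret(const LIRInstruction* code, std::size_t n, Vec* mem) {
    Vec regs[4] = {};
    for (std::size_t i = 0; i < n; ++i) {
        const LIRInstruction& ins = code[i];
        assert(ins.a < 4 && ins.b < 4);
        switch (ins.op) {
        case LIR_VLOAD:   // dst is a register, a is a memory slot
            assert(ins.dst < 4);
            regs[ins.dst] = mem[ins.a];
            break;
        case LIR_VADDW:   // dst, a and b are registers
            assert(ins.dst < 4);
            for (std::size_t k = 0; k < 8; ++k)
                regs[ins.dst][k] =
                    static_cast<std::uint16_t>(regs[ins.a][k] + regs[ins.b][k]);
            break;
        case LIR_VSTORE:  // dst is a memory slot, a is a register
            mem[ins.dst] = regs[ins.a];
            break;
        }
    }
}
```

Named opcodes in an aggregate initializer read almost like assembly, which may be enough until a real LIR assembler exists.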
It would be very nice to have a set of standards for SIMD support configuration, detection and selection.

We're in the process of integrating our SIMD UTF-8 to UTF-16 conversion routines (u8u16.costar.sfu.ca) into the Mozilla code base.

We have versions for MMX, SSE, Altivec and SPU.  We use one set of generic SIMD operations, with different implementations for each architecture.  For the most part, the very same code used for little-endian 64-bit MMX SIMD is also used for big-endian 128-bit Altivec.  However, some algorithm components are chosen based on the specific target architecture.
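The "one set of generic operations, several implementations" approach can be sketched in C++ roughly as follows. This is not the u8u16 API: `simd_t`, `simd_load`, `simd_or`, `simd_zero`, `simd_any_high_bit` and `has_non_ascii` are hypothetical names, only an SSE2 backend and a plain scalar backend are shown, and the width is fixed at 16 bytes for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__)
#include <emmintrin.h>
// SSE2 backend for the generic operations.
using simd_t = __m128i;
inline simd_t simd_load(const std::uint8_t* p) {
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
}
inline simd_t simd_or(simd_t a, simd_t b) { return _mm_or_si128(a, b); }
inline simd_t simd_zero() { return _mm_setzero_si128(); }
inline bool simd_any_high_bit(simd_t v) { return _mm_movemask_epi8(v) != 0; }
#else
// Scalar backend with the same interface, for targets without SSE2.
struct simd_t { std::uint8_t b[16]; };
inline simd_t simd_load(const std::uint8_t* p) {
    simd_t v;
    for (int i = 0; i < 16; ++i) v.b[i] = p[i];
    return v;
}
inline simd_t simd_or(simd_t a, simd_t b) {
    for (int i = 0; i < 16; ++i) a.b[i] |= b.b[i];
    return a;
}
inline simd_t simd_zero() { return simd_t{}; }
inline bool simd_any_high_bit(simd_t v) {
    for (int i = 0; i < 16; ++i)
        if (v.b[i] & 0x80) return true;
    return false;
}
#endif

constexpr std::size_t SIMD_BYTES = 16;

// Architecture-independent code written only against the generic operations:
// true if any byte in n_blocks * 16 bytes has its high bit set.
bool has_non_ascii(const std::uint8_t* p, std::size_t n_blocks) {
    assert(p != nullptr || n_blocks == 0);
    simd_t acc = simd_zero();
    for (std::size_t i = 0; i < n_blocks; ++i)
        acc = simd_or(acc, simd_load(p + i * SIMD_BYTES));
    return simd_any_high_bit(acc);
}
```

Only the small block of backend functions changes per architecture; `has_non_ascii` compiles unchanged against either one.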
  
However, we will very soon go beyond 128-bit (16-byte) SIMD, with Intel's 256-bit AVX and 512-bit Larrabee coming.
See bug 475779 for some comments on general SWAR (SIMD within a register) detection.  I'm not quite sure of the etiquette of what to put in which bug, but I assume we'll figure it out over time.
Assignee: general → nobody
Component: JavaScript Engine → Nanojit
QA Contact: general → nanojit
Would still love to see jseng grow vector support, but nanojit isn't the route anymore.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INVALID
Product: Core → Core Graveyard