Closed Bug 566767 Opened 15 years ago Closed 13 years ago

Type-specific JS arrays

Categories

(Core :: JavaScript Engine, defect)

Platform: x86 macOS
Priority: Not set
Severity: normal

Tracking

RESOLVED DUPLICATE of bug 689745

People

(Reporter: n.nethercote, Assigned: n.nethercote)

Attachments

(2 files, 4 obsolete files)

The aim of this bug is to special-case array-of-int and probably array-of-double, and use them in preference to array-of-jsval when possible.  This will avoid lots of boxing/unboxing and tag checks and should be a big performance win on SS, V8 and real web code.  It will build on work already done by Shaver and Gal (ssh://hg.mozilla.org/users/shaver_mozilla.com/array-ng).
Note also bug 486356.
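For orientation, here's a minimal sketch (an illustration only, not the patch's actual data structures) of the cost being attacked: with an array-of-jsval every read pays a tag check plus an unbox, while a type-specific array read is a plain load.

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Boxed representation: each element carries a tag, so every read
  // must branch on the tag and unbox the payload.
  enum Tag { TAG_INT, TAG_DOUBLE };
  struct BoxedVal { Tag tag; union { int32_t i; double d; } u; };

  double getBoxed(const std::vector<BoxedVal>& a, std::size_t i) {
      const BoxedVal& v = a[i];
      return v.tag == TAG_INT ? double(v.u.i)  // tag check + int unbox
                              : v.u.d;         // double payload
  }

  // Type-specific representation: no tag, no branch, just a load.
  int32_t getInt(const std::vector<int32_t>& a, std::size_t i) {
      return a[i];
  }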
I did some work on this just over a year ago. I think we want to first introduce a "solid" array specialization which is like dense but holeless. Then we can specialize these solid arrays' stores without needing to reserve a spot in the value space for holes (unless you can think of a way around it!). If you do go the holey route, note that you can use a whole range of NaNs to indicate a double hole, so you can avoid loading an extra 32 bits in this case. That's all my wisdom that's probably still relevant.
Attached patch WIP patch, v1 (obsolete) — Splinter Review
Just a snapshot of where I am.  I've converted most of the relevant array paths to cope with dense arrays of ints, but I'm not actually instantiating any such arrays at the moment, so the dense-array-of-int code isn't being exercised at all.
Attached patch WIP patch, v2 (obsolete) — Splinter Review
Dense-array-of-int code is now being exercised;  when njn_DEFAULT_INT is defined I get 42 trace-test failures.  When njn_DEFAULT_INT is not defined the dense-array-of-int code isn't exercised and both trace-tests and jstests pass.

On some microbenchmarks I'm getting a roughly 2x speedup, which is promising.
Attachment #450027 - Attachment is obsolete: true
Attached patch WIP patch, v3 (obsolete) — Splinter Review
Down to 11 trace-test failures and 55 jstests failures.  And that's it from me until Tuesday, because it's the Queen's Birthday holiday in Australia on Monday.
Attachment #450598 - Attachment is obsolete: true
Attached patch WIP patch, v4 (obsolete) — Splinter Review
Down to 0 trace-test failures and 8 jstests failures, 6 of which are because splice() isn't working yet.
Attachment #450615 - Attachment is obsolete: true
Attached patch WIP patch, v5 — Splinter Review
This patch is rebased against the current tip, among other changes.  Same test results as the last patch.
Attachment #451200 - Attachment is obsolete: true
=============================================================================

** TOTAL **:           -                 699.3ms +/- 0.6%   697.1ms +/- 0.6% 

=============================================================================

  3d:                  *1.122x as slow*  111.5ms +/- 0.6%   125.1ms +/- 0.6%     significant
    cube:              *1.180x as slow*   38.5ms +/- 0.7%    45.4ms +/- 0.7%     significant
    morph:             *1.044x as slow*   24.2ms +/- 0.6%    25.3ms +/- 0.6%     significant
    raytrace:          *1.115x as slow*   48.9ms +/- 0.7%    54.5ms +/- 0.6%     significant

  access:              1.171x as fast    108.0ms +/- 0.6%    92.3ms +/- 0.7%     significant
    binary-trees:      ??                 20.3ms +/- 0.7%    20.4ms +/- 0.7%     not conclusive: might be *1.004x as slow*
    fannkuch:          1.40x as fast      51.3ms +/- 0.6%    36.6ms +/- 0.8%     significant
    nbody:             -                  22.2ms +/- 0.7%    22.0ms +/- 0.7% 
    nsieve:            1.075x as fast     14.2ms +/- 0.7%    13.2ms +/- 0.8%     significant

  bitops:              *1.091x as slow*   32.1ms +/- 0.8%    35.1ms +/- 0.8%     significant
    3bit-bits-in-byte: -                   0.8ms +/- 9.3%     0.8ms +/- 9.9% 
    bits-in-byte:      *1.37x as slow*     9.3ms +/- 1.0%    12.7ms +/- 0.8%     significant
    bitwise-and:       ??                  2.1ms +/- 2.7%     2.1ms +/- 3.3%     not conclusive: might be *1.029x as slow*
    nsieve-bits:       1.028x as fast     19.9ms +/- 0.8%    19.4ms +/- 0.9%     significant

  controlflow:         1.023x as fast      8.2ms +/- 1.0%     8.0ms +/- 1.1%     significant
    recursive:         1.023x as fast      8.2ms +/- 1.0%     8.0ms +/- 1.1%     significant

  crypto:              1.102x as fast     48.6ms +/- 0.9%    44.2ms +/- 0.9%     significant
    aes:               1.145x as fast     28.4ms +/- 1.2%    24.8ms +/- 1.2%     significant
    md5:               1.031x as fast     12.8ms +/- 1.0%    12.4ms +/- 0.8%     significant
    sha1:              1.072x as fast      7.4ms +/- 1.3%     6.9ms +/- 1.2%     significant

  date:                ??                 83.1ms +/- 0.7%    83.7ms +/- 0.7%     not conclusive: might be *1.007x as slow*
    format-tofte:      *1.017x as slow*   48.6ms +/- 0.8%    49.5ms +/- 0.8%     significant
    format-xparb:      -                  34.5ms +/- 0.8%    34.2ms +/- 0.7% 

  math:                *1.100x as slow*   29.3ms +/- 0.7%    32.2ms +/- 0.7%     significant
    cordic:            *1.27x as slow*     9.7ms +/- 1.1%    12.3ms +/- 0.7%     significant
    partial-sums:      -                  13.5ms +/- 0.7%    13.4ms +/- 0.7% 
    spectral-norm:     *1.069x as slow*    6.1ms +/- 1.3%     6.5ms +/- 1.5%     significant

  regexp:              -                  42.9ms +/- 0.5%    42.8ms +/- 0.5% 
    dna:               -                  42.9ms +/- 0.5%    42.8ms +/- 0.5% 

  string:              -                 235.5ms +/- 0.6%   233.8ms +/- 0.6% 
    base64:            1.055x as fast     11.7ms +/- 0.9%    11.1ms +/- 1.0%     significant
    fasta:             1.037x as fast     32.5ms +/- 0.7%    31.4ms +/- 0.7%     significant
    tagcloud:          -                  83.1ms +/- 0.6%    82.8ms +/- 0.6% 
    unpack-code:       ??                 81.8ms +/- 0.7%    82.2ms +/- 0.7%     not conclusive: might be *1.006x as slow*
    validate-input:    -                  26.5ms +/- 0.6%    26.4ms +/- 0.7%
Results are mixed at the moment.  fannkuch is the clearest winner, 1.4x faster.  There are moderate wins elsewhere, esp. the crypto ones.  In V8, v8-crypto is about 1.1x faster, the rest are in the noise.

But there are losses too, esp. the 3d ones.  These all involve arrays of doubles, and AFAICT there are two causes of slowdown.  

First, demotion of dense-array-of-int to dense-array-of-jsval.  This is a small effect, but it accounts for most of 3d-morph's slowdown, for example.  (3d-morph creates an array of ~40,000 zero elements and ~25,000 holes and then fills it up with doubles, so there's an expensive demotion there.)

Second, and more important, we generate code assuming arrays-of-int, and then they change to arrays-of-jsval, which requires (a) early exits and (b) generating more code.  This is at play with 3d-cube, 3d-raytrace, and math-spectral-norm.

I have no idea what's happening with bits-in-byte and math-cordic;  the generated code has barely changed, and Cachegrind says instruction counts et al have also barely changed.  Maybe they're just noise.

I guess the obvious thing to do is to add dense-arrays-of-doubles as well, and hope that removes most/all of the slowdowns while not slowing down other things too much.  One danger with that is that it'll remove a lot of doubles from the heap, which could be a big win;  but that win will also occur anyway with fatvals, so measurements could be misleading.

I'm also interested to see what happens if I replace arrays-of-ints with arrays-of-doubles.  Probably the crypto ones will go badly.
Another point is that this patch makes GETELEM much better in the good cases, but SETELEM isn't much better.  Even in the fastest case, we have these steps (on the right are the actions taken if each test fails):

  - test isDenseInt32Array (on-trace)  --> fall off trace
  - call js_DenseInt32Array_setelem_int
    - test value != INT32_HOLE         --> demote array and continue
    - test index < capacity            --> grow and continue, or slowify and stop
    - test getelem != HOLE             --> bookkeep and continue, or stop
    - setElement
  - test the return value              --> fall off trace if we hit a "stop" case

Two on-trace guards, a C call, and three tests in the C function, yuk.
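For concreteness, here's a hedged sketch of that fast path in C++;  the object layout, the INT32_HOLE value, and the hole bookkeeping are all assumptions for illustration, not the patch's actual code.

  #include <cstdint>

  static const int32_t INT32_HOLE = INT32_MIN;  // assumed hole sentinel

  enum SetResult { SET_OK, SET_STOP };  // SET_STOP --> fall off trace

  struct DenseInt32Array {  // assumed object layout
      int32_t* elems;
      uint32_t capacity;
      uint32_t holeCount;
  };

  SetResult setelem_int(DenseInt32Array* a, uint32_t i, int32_t v) {
      if (v == INT32_HOLE)            // value would alias the sentinel:
          return SET_STOP;            //   demote the array off trace
      if (i >= a->capacity)           // out of bounds:
          return SET_STOP;            //   grow, or slowify and stop
      if (a->elems[i] == INT32_HOLE)  // overwriting a hole:
          a->holeCount--;             //   bookkeep and continue
      a->elems[i] = v;                // the actual setElement
      return SET_OK;
  }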
FWIW, here's how GETELEM improves in the best case:

       ldi7 = ldi.o $stack5[4]
       ~JSSLOT_CLASS_MASK_BITS = immi -4
-      andi6 = andi ldi7, ~JSSLOT_CLASS_MASK_BITS/*-4*/
+      andi4 = andi ldi7, ~JSSLOT_CLASS_MASK_BITS/*-4*/
       clasp = immi ......
-      guard(class is Array) = eqi andi6, clasp/*......*/
-      xf7: xf guard(class is Array) -> pc=...... imacpc=(nil) sp+32 rp+0 (GuardID=002)
+      guard(class is DenseInt32Array) = eqi andi4, clasp/*......*/
+      xf7: xf guard(class is DenseInt32Array) -> pc=...... imacpc=(nil) sp+32 rp+0 (GuardID=002)
       dslots = ldi.o $stack5[28]
       minLenCap = ldi.o $stack5[24]
       ltui1 = ltui rshi1, minLenCap
       xf6: xf ltui1 -> pc=...... imacpc=(nil) sp+32 rp+0 (GuardID=003) 
-      JSVAL_DOUBLE = immi 2
-      lshi2 = lshi rshi1, JSVAL_DOUBLE/*2*/
+      immi8 = immi 2
+      lshi2 = lshi rshi1, immi8/*2*/
       addi1 = addi dslots, lshi2
       ldi6 = ldi.o addi1[0]
-      JSVAL_TAGMASK = immi 7
-      andi5 = andi ldi6, JSVAL_TAGMASK/*7*/
-      eqi11 = eqi andi5, JSVAL_DOUBLE/*2*/
-      JSVAL_INT = immi 1
-      andi4 = andi ldi6, JSVAL_INT/*1*/
-      ori2 = ori andi4, eqi11
-      eqi10 = eqi ori2, immi4/*0*/
+      INT32_HOLE = immi ......
+      eqi10 = eqi ldi6, INT32_HOLE/*......*/
       xt6: xt eqi10 -> pc=...... imacpc=(nil) sp+32 rp+0 (GuardID=004)
-      js_UnboxDouble1 = calld. #js_UnboxDouble ( ldi6 )
-      std.s sp[16] = js_UnboxDouble1
+      sti.s sp[16] = ldi6

There are two main improvements.  First, the awful is-it-an-int-or-a-double checking is replaced with is-it-INT32_HOLE checking.  Second, no need for the js_UnboxDouble() call.
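In C terms, the "+" path above is roughly the following (my reading of the LIR, with the sentinel value assumed):

  #include <cstdint>

  static const int32_t INT32_HOLE = INT32_MIN;  // assumed hole sentinel

  bool getelem_int(const int32_t* dslots, uint32_t minLenCap,
                   uint32_t index, int32_t* out) {
      if (index >= minLenCap) return false;  // xf6: bounds guard
      int32_t v = dslots[index];             // ldi6 = ldi.o addi1[0]
      if (v == INT32_HOLE) return false;     // xt6: hole guard
      *out = v;                              // sti.s sp[16] = ldi6
      return true;
  }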
(In reply to comment #10)
>
> I guess the obvious thing to do is to add dense-arrays-of-doubles as well, and
> hope that removes most/all of the slowdowns while not slowing down other things
> too much.  One danger with that is that it'll remove a lot of doubles from the
> heap, which could be a big win, but that is also a win that will occur anyway
> with fatvals so measurements could be misleading.

I just hit an unexpected problem with this.  We have a template class:

  template <typename T, T H> class DenseTHArrayObject;

For dense-array-of-int we instantiate it as follows:

  typedef DenseTHArrayObject<int32, INT32_HOLE> DenseInt32ArrayObject;

For dense-array-of-double we correspondingly do this:

  typedef DenseTHArrayObject<jsdouble, DOUBLE_HOLE> DenseDoubleArrayObject;

But that's not legal C++;  you can't use a floating-point value as a non-type template parameter.  I didn't know that.  It's because of the usual issues surrounding equality of floating-point types.

Hmm, quite a stumbling block.  Why do we have the 'T H' parameter anyway?  Why not just have:

  template <typename T> class DenseTArrayObject;

and then make 'T hole' a static member of that class?  It's because we need these two:

  typedef DenseTArrayObject<int32> DenseInt32ArrayObject;
  typedef DenseTArrayObject<jsval> DenseJsvalArrayObject;

which doesn't seem like a problem until you realize that on 32-bit machines jsval is equivalent to int32.  The 'T H' parameter is required to distinguish these two types.

Perhaps fatvals will provide salvation, because jsval will be a non-integral type, so the 'T H' parameter won't be necessary.
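FWIW, one conventional workaround (a sketch of what could be done, not what the patch does) is to pass the hole via a traits class instead of a value parameter;  that sidesteps the floating-point restriction and still keeps the int32 and jsval instantiations distinct.

  #include <cstdint>
  #include <cstring>

  struct Int32Hole { static int32_t hole() { return INT32_MIN; } };
  struct JsvalHole { static int32_t hole() { return INT32_MIN; } };
  struct DoubleHole {
      static double hole() {
          uint64_t bits = 0x7FF8DEADFFFFFFFFULL;  // an assumed private NaN
          double d;
          std::memcpy(&d, &bits, sizeof d);
          return d;
      }
  };

  template <typename T, typename HoleTraits>
  class DenseTHArrayObject { /* ... */ };

  // Distinct traits types disambiguate int32 and a same-sized jsval.
  typedef DenseTHArrayObject<int32_t, Int32Hole>  DenseInt32ArrayObject;
  typedef DenseTHArrayObject<double,  DoubleHole> DenseDoubleArrayObject;
  typedef DenseTHArrayObject<int32_t, JsvalHole>  DenseJsvalArrayObject;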
(In reply to comment #13)
> But that's not legal C++;  you can't use a floating-point value as a non-type
> template parameter.  I didn't know that.  It's because of the usual issues
> surrounding equality of floating-point types.

You can avoid this if you implement my suggested double hole strategy from comment #2. There will probably be a perf speedup as well since you'd be comparing only 4 bytes instead of 8 (and an integer compare at that).

> Hmm, quite a stumbling block.  Why do we have the 'T H' parameter anyway?  Why
> not just have:
> 
>   template <typename T> class DenseTArrayObject;
> 
> and then make 'T hole' a static member of that class?  It's because we need
> these two:
> 
>   typedef DenseTArrayObject<int32> DenseInt32ArrayObject;
>   typedef DenseTArrayObject<jsval> DenseJsvalArrayObject;
> 
> which doesn't seem like a problem until you realize that on 32-bit machines
> jsval is equivalent to int32.  The 'T H' parameter is required to distinguish
> these two types.
> 
> Perhaps fatvals will provide salvation, because jsval will be a non-integral
> type, so the 'T H' parameter won't be necessary.

I ended up making my integer specialization type unsigned but still did signed operations on it to work around this unfortunate C++ behavior.
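A sketch of that workaround as I understand it (exact values assumed): the element type becomes uint32, which is a distinct type from a 32-bit jsval's int32 and is legal as a non-type parameter, and loads/stores just cast at the boundary.

  #include <cstdint>

  template <typename T, T H> class DenseTHArrayObject { /* ... */ };

  // uint32_t no longer collides with the int32-typed jsval instantiation,
  // and an unsigned non-type template parameter is perfectly legal.
  static const uint32_t UINT32_HOLE = 0x80000000u;  // assumed sentinel
  typedef DenseTHArrayObject<uint32_t, UINT32_HOLE> DenseIntArrayObject;

  // The arithmetic stays signed: cast at the load/store boundary.
  inline int32_t  loadElem(uint32_t raw) { return int32_t(raw); }
  inline uint32_t storeElem(int32_t v)   { return uint32_t(v); }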
(In reply to comment #14)
> (In reply to comment #13)
> > But that's not legal C++;  you can't use a floating-point value as a non-type
> > template parameter.  I didn't know that.  It's because of the usual issues
> > surrounding equality of floating-point types.
> 
> You can avoid this if you implement my suggested double hole strategy from
> comment #2. There will probably be a perf speedup as well since you'd be
> comparing only 4 bytes instead of 8 (and an integer compare at that).

I didn't understand that strategy.  Can you elaborate?
Rather than have one particular hole value that indicates an empty slot in the double-array, use a class of values, in particular use a subset of the NaNs. If you look at the bit encoding of NaNs, they've got all the exponent bits set and some nonzero number of mantissa bits set. You can pick a subset of these NaNs to be your hole such that the first 32 bits are all the same. Thus you can perform the hole check by comparing only the first 32 bits of the stored value and the value-to-store.

The actual bits set in the mantissa of a naturally occurring NaN vary depending on the processor and the operation causing the NaN, so when you try to store a naturally occurring NaN that's in this special set you can either rewrite it to be a different NaN or do the usual hole-store fallback. I don't think it's possible to determine the exact NaN from JS, so the rewriting strategy is probably OK (not sure if embedders might care).
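A sketch of the resulting hole test (the specific bit pattern is an assumption); the point is that the hole check becomes a single 32-bit integer compare on the high word:

  #include <cstdint>
  #include <cstring>

  // Assumed: every hole NaN shares this high word (exponent all ones,
  // quiet bit set, plus a distinguishing mantissa prefix).
  static const uint32_t HOLE_HI = 0x7FF8DEADu;

  inline uint32_t hiWord(double d) {
      uint64_t bits;
      std::memcpy(&bits, &d, sizeof bits);
      return uint32_t(bits >> 32);
  }

  // One 32-bit integer compare, no 64-bit FP compare.
  inline bool isHole(double d) { return hiWord(d) == HOLE_HI; }

  // On store, rewrite a naturally occurring NaN that happens to fall in
  // the hole range to a different, canonical NaN.
  inline double sanitizeForStore(double d) {
      if (isHole(d)) {
          uint64_t canonical = 0x7FF8000000000000ULL;  // quiet NaN
          std::memcpy(&d, &canonical, sizeof d);
      }
      return d;
  }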
(In reply to comment #16)
> Rather than have one particular hole value that indicates an empty slot in the
> double-array, use a class of values, in particular use a subset of the NaNs.

I don't see how this helps with the FP non-type template parameter problem.  To start using this strategy I'd have to solve that anyway, by either using unsigned ints or moving to fatvals.  And if I've got that far it doesn't seem like comparing only the first 32 bits would save much.  Am I missing something?
Oh, this is annoying... the heart of bitops-nsieve-bits is this:

function primes(isPrime, n) {
  var i, count = 0, m = 10000<<n, size = m+31>>5;
    
  for (i=0; i<size; i++) isPrime[i] = 0xffffffff;

  for (i=2; i<m; i++)
    if (isPrime[i>>5] & 1<<(i&31)) {
      for (var j=i+i; j<m; j+=i)
        isPrime[j>>5] &= ~(1<<(j&31));
      count++;
    }
}

It's clearly all integral code, but 0xffffffff is outside the range of signed integers, so we end up doing it all via jsvals, with or without the patch.  Hmm.  I wonder if it's worth it (or even possible) to allow a dense-array-of-int to convert to a dense-array-of-uint.
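The problem in one line: 0xffffffff is 4294967295, which fits a double or a uint32 but not an int32, so the very first store forces a demotion.  A hypothetical representability check:

  #include <cstdint>

  // True iff v can be stored losslessly in an int32 element.  (A real
  // engine would also special-case -0.0, which round-trips here but
  // must be stored as a double.)
  inline bool fitsInt32(double v) {
      return v >= INT32_MIN && v <= INT32_MAX && double(int32_t(v)) == v;
  }

  // fitsInt32(4294967295.0) is false -- the value exceeds INT32_MAX --
  // so isPrime[i] = 0xffffffff demotes the whole array to jsvals.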
(In reply to comment #17)
> (In reply to comment #16)
> > Rather than have one particular hole value that indicates an empty slot in the
> > double-array, use a class of values, in particular use a subset of the NaNs.
> 
> I don't see how this helps with the FP non-type template parameter problem.  To
> start using this strategy I'd have to solve that anyway, by either using
> unsigned ints or moving to fatvals.  And if I've got that far it doesn't seem
> like comparing only the first 32 bits would save much.  Am I missing something?

You can change the template to:
template <typename T, uint32 H> class DenseTHArrayObject;

and that's enough to distinguish the jsval and int32 cases and solve the jsdouble case, right? It would also seem to allow an unsigned int case if you wanted that too.

I think you're going to want to compare only the first 32 bits for performance reasons anyway, be it the fatval tag or doubles. I remember that bz found it rather easily when he ran a profiler over my earlier attempt last year.
I've tried a different tack, doing dense-array-of-doubles instead of
dense-array-of-ints by default.  The wins are smaller on the int-heavy cases,
but there aren't any losses on benchmarks that actually use arrays-of-doubles:


=============================================================================
** TOTAL **:           1.062x as fast    535.1ms +/- 3.0%   503.8ms +/- 2.2%     significant
=============================================================================

  3d:                  1.096x as fast     91.1ms +/- 2.2%    83.1ms +/- 1.8%     significant
    cube:              1.074x as fast     32.0ms +/- 2.0%    29.8ms +/- 2.1%     significant
    morph:             1.093x as fast     24.4ms +/- 2.7%    22.3ms +/- 2.1%     significant
    raytrace:          1.119x as fast     34.7ms +/- 2.7%    31.0ms +/- 2.1%     significant

  access:              1.141x as fast     71.5ms +/- 2.8%    62.7ms +/- 2.0%     significant
    binary-trees:      -                  14.1ms +/- 2.6%    13.7ms +/- 2.2% 
    fannkuch:          1.28x as fast      33.8ms +/- 3.0%    26.4ms +/- 2.2%     significant
    nbody:             1.068x as fast     13.3ms +/- 2.8%    12.4ms +/- 1.9%     significant
    nsieve:            -                  10.4ms +/- 3.2%    10.1ms +/- 2.4% 

  bitops:              1.155x as fast     25.6ms +/- 3.3%    22.2ms +/- 3.7%     significant
    3bit-bits-in-byte: -                   0.8ms +/- 10.9%     0.7ms +/- 14.3% 
    bits-in-byte:      -                   8.8ms +/- 4.4%     8.5ms +/- 3.4% 
    bitwise-and:       -                   1.9ms +/- 5.9%     1.9ms +/- 7.5% 
    nsieve-bits:       1.27x as fast      14.2ms +/- 3.0%    11.2ms +/- 5.7%     significant

  controlflow:         -                   7.0ms +/- 7.2%     6.5ms +/- 2.7% 
    recursive:         -                   7.0ms +/- 7.2%     6.5ms +/- 2.7% 

  crypto:              1.082x as fast     36.4ms +/- 4.7%    33.6ms +/- 3.4%     significant
    aes:               1.111x as fast     20.9ms +/- 5.0%    18.8ms +/- 3.2%     significant
    md5:               -                  10.5ms +/- 5.4%    10.0ms +/- 5.1% 
    sha1:              -                   5.0ms +/- 3.7%     4.8ms +/- 2.9% 

  date:                -                  67.6ms +/- 3.9%    66.7ms +/- 3.0% 
    format-tofte:      -                  41.5ms +/- 4.0%    41.2ms +/- 3.1% 
    format-xparb:      -                  26.1ms +/- 3.8%    25.6ms +/- 2.8% 

  math:                1.067x as fast     31.0ms +/- 3.2%    29.1ms +/- 2.7%     significant
    cordic:            1.101x as fast     12.0ms +/- 3.4%    10.9ms +/- 2.8%     significant
    partial-sums:      -                  14.2ms +/- 3.3%    13.6ms +/- 2.7% 
    spectral-norm:     1.070x as fast      4.9ms +/- 3.7%     4.6ms +/- 6.0%     significant

  regexp:              -                  30.4ms +/- 2.9%    29.7ms +/- 2.5% 
    dna:               -                  30.4ms +/- 2.9%    29.7ms +/- 2.5% 

  string:              -                 174.4ms +/- 3.1%   170.2ms +/- 2.6% 
    base64:            -                   8.1ms +/- 2.8%     8.0ms +/- 5.5% 
    fasta:             -                  20.2ms +/- 3.2%    19.7ms +/- 3.2% 
    tagcloud:          -                  61.4ms +/- 2.8%    60.3ms +/- 2.6% 
    unpack-code:       -                  69.6ms +/- 3.7%    67.5ms +/- 2.9% 
    validate-input:    -                  15.1ms +/- 3.3%    14.7ms +/- 2.7% 


Rob is right, FP equality comparisons on x86 suck (the code generated by NJ
on i386 is particularly bad;  the above numbers are for x64).  So either a
compare-only-the-first-32-bits strategy or a don't-allow-holes-in-the-middle
strategy would help some more.

But it's hard to know how fatvals will affect the above numbers.  How much
of the improvement is due to avoiding boxing, and how much is due to not
putting doubles on the heap?  I don't know.
Blocks: 551368
Just a ping to see if this is still being pursued after fatvals landed...
(In reply to comment #21)
> Just a ping to see if this is still being pursued after fatvals landed...

Not yet.  Andreas has been overhauling array accesses a great deal, changing the slots and capacity arrangement, and thus how GETELEM and SETELEM work, and that is still ongoing.  Once that's done it might be feasible to take this patch up again, though I also wonder how many massive patches we can absorb during the current beta period -- JM still has to be merged. 

Sayre, do you have any thoughts about this?
(In reply to comment #12)
> FWIW, here's how GETELEM improves in the best case:
>
> [...] 
> 
> There are two main improvements.  First, the awful is-it-an-int-or-a-double
> checking is replaced with is-it-INT32_HOLE checking.  Second, no need for the
> js_UnboxDouble() call.

With fatvals merged, the is-it-an-int-or-a-double test is now *much* nicer:

  JSVAL_UPPER_INCL_TAG_OF_NUMBER_SET = immi ......
  leui1 = leui ldi3, JSVAL_UPPER_INCL_TAG_OF_NUMBER_SET/*......*/
  xf3: xf leui1 -> pc=...... imacpc=(nil) sp+24 rp+0 (GuardID=004)

And when bug 580534 lands, the js_UnboxDouble() call will become this:

  i = ldi v[offset + sPayloadOffset]
  d1 = i2d i
  d2 = ldd v[offset]
  c = eqi ldi3, JSVAL_TAG_INT32
  cmovd c, d1, d2

(cmovd causes a small diamond to be generated by the back-end.)
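
In C terms that sequence is roughly this (the layout and tag constant are placeholders, not the real fatval encoding):

  #include <cstdint>
  #include <cstring>

  struct Value { uint64_t bits; };  // assumed 64-bit fatval: tag in the
                                    // high word, payload in the low word

  static const uint32_t JSVAL_TAG_INT32 = 0xFFFFFFF1u;  // placeholder value

  inline uint32_t tagOf(Value v)     { return uint32_t(v.bits >> 32); }
  inline int32_t  payloadOf(Value v) { return int32_t(uint32_t(v.bits)); }

  // Compute both alternatives, then select -- the cmovd shape above.
  inline double unboxNumber(Value v) {
      double d1 = double(payloadOf(v));      // i2d of the int payload
      double d2;
      std::memcpy(&d2, &v.bits, sizeof d2);  // ldd: raw bits as a double
      return tagOf(v) == JSVAL_TAG_INT32 ? d1 : d2;  // cmovd c, d1, d2
  }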

And if/when phi nodes land (bug 575527), it'll probably change to an explicit diamond in the LIR, so we only have to do one of the alternatives, instead of doing both and then picking the right one.  (In other words we'll move to short-circuit evaluation.)  So we'll execute a test and a load (if the jsval is a double) or a test and a load and an i2d (if the jsval is an int).  With type-specific arrays, assuming we use arrays-of-doubles, we'll just do a load.

Also, fatvals means that doubles are no longer stored on the heap.  So the advantages of type-specific arrays have dropped dramatically.

If we were able to specialize to arrays-of-ints or arrays-of-doubles accurately, then I suspect the advantage would be bigger.  Let's hope type inference (bug 557407) will give us that!

Maybe with the optimized SETELEM (bug 580752) type-specific arrays would gain some more advantage;  not sure.
Nick, sounds like a great number of *large* perf wins are on the verge of landing!
When do you expect JM to be merged?
(In reply to comment #24)
> Nick, sounds like a great number of *large* perf wins are on the verge of
> landing!

Some have landed, some are close, some are less close.

> When do you expect JM to be merged?

No idea.  Bug 578133 might be a better place to ask.
(In reply to comment #25)
> (In reply to comment #24)
> 
> > When do you expect JM to be merged?

September 1st.
This bug is dead.  Bug 689745 is the new hope;  maybe type inference will finally bring this old idea to fruition.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX
Resolution: WONTFIX → DUPLICATE