Closed Bug 580752 Opened 10 years ago Closed 9 years ago

TM: optimize setelem

Categories

(Core :: JavaScript Engine, defect)

Other Branch
x86
macOS
defect
Not set

Tracking

()

RESOLVED FIXED

People

(Reporter: gal, Assigned: njn)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(2 files, 3 obsolete files)

We currently do 2809193 array stores in SS. Each of these is a builtin call. 2493752 of these (88%) overwrite an existing value and don't require a proto walk to deal with holes. Only 29876 of these calls grows the array (1%). We should be able to emit a lot more specialized code here.
Does JM need a version of this bug?
Not yet, I think - we have one on a specific array-related slowdown though, bug 580355
(In reply to comment #0)
> 2493752 of these (88%) overwrite an existing value and don't require a proto
> walk to deal with holes. Only 29876 of these calls grows the array (1%). We
> should be able to emit a lot more specialized code here.

That would be nice but do you have any suggestions how?
Attached patch patch (obsolete) — Splinter Review
4.5% speedup on SS
Depends on: 580877
Comment on attachment 459305 [details] [diff] [review]
patch

>diff --git a/js/src/jsarray.cpp b/js/src/jsarray.cpp
>--- a/js/src/jsarray.cpp
>+++ b/js/src/jsarray.cpp
>@@ -482,16 +482,30 @@ SetArrayElement(JSContext *cx, JSObject 
>     if (!IndexToId(cx, obj, index, NULL, idr.addr(), JS_TRUE))
>         return JS_FALSE;
>     JS_ASSERT(!JSID_IS_VOID(idr.id()));
> 
>     Value tmp = v;
>     return obj->setProperty(cx, idr.id(), &tmp);
> }
> 
>+#ifdef JS_TRACER
>+JSBool JS_FASTCALL
>+js_EnsureDenseArrayCapacity(JSContext *cx, JSObject *obj, jsint i)
>+{
>+    jsuint u = jsuint(i);
>+    jsuint capacity = obj->getDenseArrayCapacity();
>+    if ((u >= capacity) && (INDEX_TOO_SPARSE(obj, u) || !obj->ensureDenseArrayElements(cx, u + 1)))
>+        return false;
>+    return true;

Those gratuitous parens won't save you :-P, and they just help hide the fact that you have a silent failure (false return with no error reported or exception thrown).

It would be much more clear that the false return means an error if you invert and put predicted/prefetched best foot forward:

    if (u < capacity)
        return true;
    if (INDEX_TOO_SPARSE(obj, u))
        return true;
    return obj->ensureDenseArrayElements(cx, u + 1);    

Righteous speedup!

/be
I'll look at this in the morning.  In the meantime, a short description of the patch would be nice!
The patch inlines the array set code. I branch around a call to the grow call, which is only needed in 1% of the invocations. This is about 5% slower than guarding but manages with a single trace (guarding would need a 2nd trace). This patch depends on the removal of MinLenCap and Count.
I wanted to try this patch.  I applied it on top of the patches from bug 580846 and bug 580877 but it didn't apply cleanly.  I think you have a patch removing js_UnboxDouble in your patch stack below this one?

If you could update this patch to account for changes to the two preceding bugs that'd be great!  I really want to see what the LIR looks like.
Attached patch patch (obsolete) — Splinter Review
Attachment #459305 - Attachment is obsolete: true
Depends on: 582081
Attachment #460393 - Flags: review?(nnethercote)
With this patch I am seeing sub 400ms times on my laptop now. Exciting.

============================================
RESULTS (means and 95% confidence intervals)
--------------------------------------------
Total:                  398.8ms +/- 0.4%
--------------------------------------------
Patch still doesn't apply cleanly to TM tip.  Do you still have UnboxDouble removal code in there?
Yes, dependent bug is linked.
The patch for bug 580877 has landed on TM.  The patch for bug 582081 has not, but AFAICT isn't what's causing this patch to fail.  I suspect the patch for bug 580534 is the problem, ie. that you have it in your workspace, but you haven't marked that as blocking this bug.

In general, if I write a patch that applies on top of a patch in another bug, I'll mention that and explain the sequencing.  I won't expect a reviewer to work out which of my current in-flight patches need to be applied first, and I don't think others should have to work that out.  Am I being unreasonable?
Sorry. Other patch should be in your queue too. I meant to mark it as dependent. Not at my desktop right now.
Summary: TM: optimize array stores → TM: optimize setelem
Depends on: 580534
Attached patch patch (obsolete) — Splinter Review
Attachment #460393 - Attachment is obsolete: true
Attachment #460393 - Flags: review?(nnethercote)
Attachment #460984 - Flags: review?(nnethercote)
580534 must be applied first
Comment on attachment 460984 [details] [diff] [review]
patch

I get 20 trace-tests failures on x86 and 28 on x86-64.  I applied the patch on top of the patch from 580534.  I'll review again when these failures are fixed.
Attachment #460984 - Flags: review?(nnethercote) → review-
This bug now depends on bug 585525.
There's a bug in the patch -- Array.length isn't updated when required.  I'm working on fixing it now.
I'm attaching two patches.  Both pass trace-tests.  They only differ in how
they handle setting a hole element -- the first one deals with it entirely
on-trace, the second one falls back to a helper function.  The performance
of the two is very similar, about a 1.045x speedup on 32-bit (with TM+JM),
and less on 64-bit.

I'm leaning towards the second one because I find the LIR generated by the
first one long-winded and hard to read, and also because there's potential
to improve the code generated when we have a C call on a slow path.


Here's example code before either patch:

      # isDenseArray?
      ldi2 = ldi.o $global0[4]
      clasp = immi 0x8338a00
      guard(class is Array) = eqi ldi2, clasp/*0x8338a00*/
      xf2: xf guard(class is Array) -> pc=0x94e1c82 imacpc=(nil) sp+24 rp+0 (GuardID=002)

      # set elem, via a C call
      allocp1 = allocp 8
      tag = immi 0xffff0005
      sti.o allocp1[4] = tag/*0xffff0005*/
      sti.o allocp1[0] = ATOM_TO_STRING(atom)/*0x833cee8*/
      js_Array_dense_setelem1 = calli.all #js_Array_dense_setelem ( cx $global0 ldi1 allocp1 )
      eqi1 = eqi js_Array_dense_setelem1, immi3/*0*/
      xt1: xt eqi1 -> pc=0x94e1c82 imacpc=(nil) sp+24 rp+0 (GuardID=003)


Here's an example with the first patch:

      # isDenseArray?
      ldi6 = ldi.o $global0[4]
      clasp = immi 0x83389e0
      guard(class is Array) = eqi ldi6, clasp/*0x83389e0*/
      xf5: xf guard(class is Array) -> pc=0x9858c82 imacpc=(nil) sp+24 rp+0 (GuardID=002)

      # check capacity, grow if necessary
      capacity = ldi.o $global0[40]
      ltui2 = ltui ldi1, capacity
      jt ltui2 -> label3
      js_EnsureDenseArrayCapacity1 = calli.all #js_EnsureDenseArrayCapacity ( cx $global0 ldi1 )
      eqi2 = eqi js_EnsureDenseArrayCapacity1, immi8/*0*/
      xt3: xt eqi2 -> pc=0x9858c82 imacpc=(nil) sp+24 rp+0 (GuardID=003)

      # get address of the element
      label3:
      dslots = ldi.o $global0[24]
      immi7 = immi 3
      lshi1 = lshi ldi1, immi7/*3*/
      addi1 = addi dslots, lshi1

      # if element is a hole...
      JSVAL_TAG_MAGIC = immi 0xffff0004
      ldi5 = ldi.o addi1[4]
      eqi1 = eqi ldi5, JSVAL_TAG_MAGIC/*0xffff0004*/
      jf eqi1 -> label2

      # ...check the proto chain has no indexed properties...
      ldi4 = ldi.o $global0[16]
      immi4 = immi 0
      guard(proto-not-null) = eqi ldi4, immi4/*0*/
      xt2: xt guard(proto-not-null) -> pc=0x9858c82 imacpc=(nil) sp+24 rp+0 (GuardID=004)
      objShape = ldi.o ldi4[12]
      immi6 = immi 212
      guard(shape) = eqi objShape, immi6/*212*/
      xf4: xf guard(shape) -> pc=0x9858c82 imacpc=(nil) sp+24 rp+0 (GuardID=005)
      ldi3 = ldi.o ldi4[16]
      guard(proto-not-null)2 = eqi ldi3, immi4/*0*/
      xt1: xt guard(proto-not-null)2 -> pc=0x9858c82 imacpc=(nil) sp+24 rp+0 (GuardID=006)
      objShape2 = ldi.o ldi3[12]
      immi5 = immi 173
      guard(shape)2 = eqi objShape2, immi5/*173*/
      xf3: xf guard(shape)2 -> pc=0x9858c82 imacpc=(nil) sp+24 rp+0 (GuardID=007)
      ldi2 = ldi.o ldi3[16]
      guard(proto-not-null)3 = eqi ldi2, immi4/*0*/
      xf2: xf guard(proto-not-null)3 -> pc=0x9858c82 imacpc=(nil) sp+24 rp+0 (GuardID=008)

      # ...and update .length if necessary
      length = ldi.o $global0[32]
      ltui1 = ltui ldi1, length
      jt ltui1 -> label2
      immi3 = immi 1
      addi2 = addi ldi1, immi3/*1*/
      sti.o $global0[32] = addi2

      # actually do the setelem
      label2:
      tag = immi 0xffff0005
      sti.o addi1[4] = tag/*0xffff0005*/
      sti.o addi1[0] = ATOM_TO_STRING(atom)/*0x833cec8*/

It would be nice if that proto-chain walk could be optimized somehow.


Here's an example of the hole-related part with the second patch:

      # if element is a hole...
      # ...check the proto chain has no indexed properties...
      # ...and update .length if necessary
      JSVAL_TAG_MAGIC = immi 0xffff0004
      ldi2 = ldi.o addi1[4]
      eqi2 = eqi ldi2, JSVAL_TAG_MAGIC/*0xffff0004*/
      jf eqi2 -> label2
      hasNoIndexedProperties = calli.all #js_Array_dense_setelem_hole ( cx $global0 mulxovi1 )
      immi3 = immi 0
      eqi1 = eqi hasNoIndexedProperties, immi3/*0*/
      xt1: xt eqi1 -> pc=0x9369c89 imacpc=(nil) sp+24 rp+0 (GuardID=006)


This patch also removes the dependency on bug 580534, which wasn't
fundamental but just a reflection of Gal's queue at a particular point in
time.

Here are Cachegrind results.  The timings reflected these fairly closely, but
with a bit more noise.  I don't include crypto-aes in my Cachegrind results
because its non-deterministic but it gets about a 1.04x speedup on timings.

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|   106.795   109.805 (0.973x) |    44.765    45.176 (0.991x) | 3d-cube
|    56.590    46.433 (1.219x) |    22.315    24.778 (0.901x) | 3d-morph
|   116.183   116.985 (0.993x) |    43.885    43.909 (0.999x) | 3d-raytrace
|    73.028    73.026 (------) |    15.220    15.220 (------) | access-binary-
|   130.540   108.369 (1.205x) |    79.768    89.534 (0.891x) | access-fannkuc
|    37.455    37.454 (------) |    16.430    16.430 (------) | access-nbody
|    59.788    49.928 (1.197x) |    27.009    29.741 (0.908x) | access-nsieve
|    15.956    15.954 (------) |     2.863     2.863 (------) | bitops-3bit-bi
|    43.480    43.479 (------) |    30.281    30.281 (------) | bitops-bits-in
|    22.978    22.977 (------) |    10.219    10.219 (------) | bitops-bitwise
|    61.859    54.633 (1.132x) |    36.551    40.406 (0.905x) | bitops-nsieve-
|    49.009    49.008 (------) |    19.438    19.438 (------) | controlflow-re
|    47.551    47.398 (1.003x) |     5.916     6.110 (0.968x) | crypto-md5
|    30.243    30.060 (1.006x) |     6.741     6.952 (0.970x) | crypto-sha1
|    98.330    98.329 (------) |    23.053    23.053 (------) | date-format-to
|    83.673    83.671 (------) |     9.747     9.747 (------) | date-format-xp
|    53.289    53.287 (------) |    30.317    30.317 (------) | math-cordic
|    29.439    29.438 (------) |     6.137     6.137 (------) | math-partial-s
|    30.573    30.599 (0.999x) |    12.941    12.973 (0.998x) | math-spectral-
|    58.519    58.517 (------) |    34.592    34.592 (------) | regexp-dna
|    39.539    39.538 (------) |     9.220     9.220 (------) | string-base64
|   116.996   115.498 (1.013x) |    24.787    25.125 (0.987x) | string-fasta
|   138.698   138.697 (------) |    17.588    17.588 (------) | string-tagclou
|   180.813   181.014 (0.999x) |    21.896    21.896 (------) | string-unpack-
|    51.365    51.363 (------) |     8.346     8.346 (------) | string-validat
-------
|  1732.702  1685.473 (1.028x) |   560.038   580.064 (0.965x) | all


For the ones like 3d-cube where we do worse it's due to extra time in
Nanojit compiling the additional code.
Attached patch njn's 1st patchSplinter Review
Assignee: gal → nnethercote
Attachment #460984 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #471425 - Flags: feedback?(gal)
Something else worth mentioning:  this patch might make the patch in bug 542905 worthwhile, as it exposes more internal array access operations to CSE, but there are lots of branches involved.
I'm nominating this for blocking2.0 because a 4% speedup on SunSpider for a single patch is pretty impressive.
blocking2.0: --- → ?
Nick thanks so much for stealing this from me. You rock.
Attachment #471426 - Flags: feedback?(gal) → review?(gal)
(In reply to comment #25)
> Nick thanks so much for stealing this from me. You rock.

Thanks!  Do you want to review?  It's mostly your code so it shouldn't take long :)
Andreas is on varcaton or something like that (I didn't understand what he was talking about), so someone else might want to steal the review.
Comment on attachment 471426 [details] [diff] [review]
njn's 2nd patch (against TM 52722:b360dd204dc6)

>-static JS_ALWAYS_INLINE JSBool FASTCALL
>-dense_grow(JSContext* cx, JSObject* obj, jsint i, const Value &v)
>+JSBool FASTCALL
>+js_Array_dense_setelem_hole(JSContext* cx, JSObject* obj, jsint i)
> {
>-    JS_ASSERT(obj->isDenseArray());
>+    if (js_PrototypeHasIndexedProperties(cx, obj))
>+        return false;
> 
>-    /*
>-     * Let the interpreter worry about negative array indexes.
>-     */
>-    JS_ASSERT((MAX_DSLOTS_LENGTH > MAX_DSLOTS_LENGTH32) == (sizeof(intptr_t) != sizeof(uint32)));
>-    if (MAX_DSLOTS_LENGTH > MAX_DSLOTS_LENGTH32) {
>-        /*
>-         * Have to check for negative values bleeding through on 64-bit machines only,
>-         * since we can't allocate large enough arrays for this on 32-bit machines.
>-         */
>-        if (i < 0)
>-            return JS_FALSE;
>-    }
>-
>-    /*
>-     * If needed, grow the array as long it remains dense, otherwise fall off trace.
>-     */
>     jsuint u = jsuint(i);
>-    jsuint capacity = obj->getDenseArrayCapacity();
>-    if ((u >= capacity) && (INDEX_TOO_SPARSE(obj, u) || !obj->ensureDenseArrayElements(cx, u + 1)))
>-        return JS_FALSE;
>-
>-    if (obj->getDenseArrayElement(u).isMagic()) {
>-        if (js_PrototypeHasIndexedProperties(cx, obj))
>-            return JS_FALSE;
>-
>-        if (u >= obj->getArrayLength())
>-            obj->setArrayLength(u + 1);
>-    }
>-
>-    obj->setDenseArrayElement(u, v);
>-    return JS_TRUE;
>+    if (u >= obj->getArrayLength())
>+        obj->setArrayLength(u + 1);
>+    return true;

We could inline this and then only call the above check. Separate bug if you prefer or no bug if you think its stupid.
Attachment #471426 - Flags: review?(gal) → review+
I figure if we have to call a C function, there's no point doing part of the work in LIR, that's just more LIR instructions to process.
We could easily inline that though. For arrays its always exactly 2 loops. There is a loop for this in the array read path.
Attachment #471425 - Flags: feedback?(gal)
(In reply to comment #30)
> We could easily inline that though. For arrays its always exactly 2 loops.
> There is a loop for this in the array read path.

See comment 20 -- I liked the C call better.  It's slower when we hit a hole, but we end up with a lot less LIR to compile.  For SunSpider inlining the code makes some tests slightly faster and some slightly slower, and it ends up a wash.

http://hg.mozilla.org/tracemonkey/rev/03fbde5c25b6
Whiteboard: fixed-in-tracemonkey
blocking2.0: ? → ---
http://hg.mozilla.org/mozilla-central/rev/03fbde5c25b6
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.