Closed Bug 584279 Opened 14 years ago Closed 14 years ago

TM: improve alias analysis by adding many more access regions

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: n.nethercote, Assigned: n.nethercote)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file, 2 obsolete files)

This patch increase the accuracy of alias analysis in TM by increasing the
number of access regions from 3 to 23.

- Introduce loadFromState/storeToState macros because 'state' accesses are
  very common.

- Note that most of the AccSet regions in jsregexp.cpp and jsrecursion.cpp
  are coarse because these files are going to disappear soon.

- Add AccSet args where needed, eg. import(), importImpl().

- The 32-bit and 64-bit versions of box_value_into_alloc were identical, I
  merged them.

- Beefs up checkAccSet() a lot.

The patch will look a little different once bug 584275 lands.

I'm marking this as blocked by bug 558451, to save Brendan having to rebase his patch queue yet again.

LICM (bug 545406) needs this to have any hope of doing well.
I forgot to mention: this is a clear 10% win for v8-richards.js on TM tip, other wins are minor.
This patch:

- Replaces the existing three access regions (STACK, RSTACK, OTHER) with 25
  regions.  That may seem like overkill -- only about half a dozen actually
  help CSE -- but I wanted greatly to avoid having an "OTHER" annotation as
  such an annotation cannot be sanity checked, and it's really important
  these annotations not be incorrect.

- Beefs up ValidateWriter::checkAccSet() a *lot*.  It's a bit ugly, but I'm
  not happy adding all these new access regions without some heavy checking.

  I'm not completely happy with how C++ functions are annotated with
  AccSets, as there it zero checking of the annotations, but I can't see how
  to add checking.  The current approach is to be very conservative and add
  comments, yuk.

- Adds some macros/functions for common cases:  loadFromState(),
  storeToState(), dslots(), getStringLengthAndFlags().

- Renames stobj_get_fslot_private_ptr() as stobj_get_const_fslot_ptr(),
  which is a more accurate name.

- Merges the two versions of box_value_into_alloc(), they are identical.

- Uses addName() to label instructions getting "parent", "private" and "proto".

- Takes advantage of the CseFilter's new suspend/resume feature to allow CSE
  to occur across a couple of diamonds that occur in SETELEM.  There will be
  other places in jstracer.cpp where this feature will be useful, this is
  just a start.


The main performance effects are:

- If a single array has both GETELEM and SETELEM operations performed on it,
  the "guard(class is Array)" test can be CSE'd out so it's only performed
  once.

- CSE is very slightly slower in some cases because when we hit a label (and
  we're not suspended) 25 tables have to be reset instead of 3.

Cachegrind results are a wash for SS - a big win for access-nbody is
mostly countered by small losses elsewhere.

There are moderate improvements for V8 and Kraken.

More importantly, LICM (bug 545406) -- which hopefully will be a big win for
Kraken -- cannot work without this more precise alias analysis from this
patch.

This patch will conflict a bit with bhackett's flexible-length-JSObject
patch (bug 584917).  My bug number is older, maybe I should get precedence? :)

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|   100.629   100.582 (------) |    47.026    47.028 (------) | 3d-cube
|    47.134    47.120 (------) |    25.480    25.436 (1.002x) | 3d-morph
|   107.433   107.287 (1.001x) |    43.185    43.184 (------) | 3d-raytrace
|    43.791    43.809 (------) |    14.619    14.620 (------) | access-binary-
|   111.998   111.096 (1.008x) |    92.713    91.703 (1.011x) | access-fannkuc
|    38.341    36.428 (1.052x) |    17.191    15.284 (1.125x) | access-nbody
|    48.442    48.460 (------) |    28.185    28.186 (------) | access-nsieve
|    16.408    16.417 (0.999x) |     3.251     3.251 (------) | bitops-3bit-bi
|    45.790    45.806 (------) |    32.524    32.524 (------) | bitops-bits-in
|    24.829    24.833 (------) |    12.020    12.020 (------) | bitops-bitwise
|    52.260    53.120 (0.984x) |    37.968    38.810 (0.978x) | bitops-nsieve-
|    30.218    30.218 (------) |    17.051    17.051 (------) | controlflow-re
|    41.704    41.765 (0.999x) |     4.792     4.792 (------) | crypto-md5
|    28.948    28.995 (0.998x) |     6.427     6.472 (0.993x) | crypto-sha1
|    78.478    78.508 (------) |    21.933    21.933 (------) | date-format-to
|    80.647    80.872 (0.997x) |     9.773     9.771 (------) | date-format-xp
|    53.906    53.925 (------) |    31.071    31.071 (------) | math-cordic
|    29.591    29.365 (1.008x) |     6.405     6.175 (1.037x) | math-partial-s
|    31.084    31.164 (0.997x) |    13.415    13.416 (------) | math-spectral-
|    58.335    58.345 (------) |    34.590    34.590 (------) | regexp-dna
|    39.812    39.843 (0.999x) |     9.414     9.415 (------) | string-base64
|    95.420    95.478 (0.999x) |    24.622    24.622 (------) | string-fasta
|   121.573   121.390 (1.002x) |    17.369    17.369 (------) | string-tagclou
|   174.464   174.529 (------) |    21.495    21.496 (------) | string-unpack-
|    52.084    52.156 (0.999x) |     8.572     8.572 (------) | string-validat
-------
|  1553.329  1551.523 (1.001x) |   581.104   578.802 (1.004x) | all

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|  1809.695  1706.399 (1.061x) |  1661.257  1514.630 (1.097x) | v8-crypto
|  1628.462  1623.650 (1.003x) |  1097.576  1093.388 (1.004x) | v8-deltablue
|  1207.800  1207.970 (------) |   519.115   519.107 (------) | v8-earley-boye
|  1244.078  1244.520 (------) |   303.633   303.634 (------) | v8-raytrace
|   913.511   914.842 (0.999x) |   374.896   374.899 (------) | v8-regexp
|  1203.803  1175.489 (1.024x) |  1162.872  1134.652 (1.025x) | v8-richards
|  1204.657  1187.404 (1.015x) |   125.468   125.245 (1.002x) | v8-splay
-------
|  9212.009  9060.276 (1.017x) |  5244.819  5065.558 (1.035x) | all

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|  3326.435  3326.479 (------) |  3023.173  3023.153 (------) | ai-astar
|  3162.861  3096.349 (1.021x) |  1768.434  1702.343 (1.039x) | audio-beat-det
|  1539.755  1540.244 (------) |  1344.271  1344.736 (------) | audio-dft
|  3104.504  3023.259 (1.027x) |  1745.518  1664.196 (1.049x) | audio-fft
|  2807.480  2815.702 (0.997x) |  1856.166  1864.357 (0.996x) | audio-oscillat
|  7854.903  7851.785 (------) |  5634.202  5631.104 (1.001x) | imaging-gaussi
|  3494.170  3440.144 (1.016x) |   892.334   838.598 (1.064x) | imaging-darkro
|  6850.624  6893.119 (0.994x) |  4739.456  4781.961 (0.991x) | imaging-desatu
|   722.685   722.689 (------) |    12.113    12.113 (------) | json-parse-fin
|   519.799   519.778 (------) |     6.327     6.327 (------) | json-stringify
|   ...
|   800.933   800.791 (------) |   373.848   373.952 (------) | stanford-crypt
|  1916.932  1916.762 (------) |  1171.965  1171.703 (------) | stanford-crypt
|   572.820   573.300 (0.999x) |   233.727   234.384 (0.997x) | stanford-crypt


The Kraken results are incomplete, one of the stanford tests doesn't run to
completion, I haven't yet worked out why.  (It happens even without this
patch.)
Attachment #462676 - Attachment is obsolete: true
Attachment #478142 - Flags: review?(gal)
Blocks: 600143
gal: ping?
Blocks: 602703
Attachment #478142 - Flags: review?(gal)
Rebased.  The description from comment 2 still applies.  Perf results are very slightly better now, not sure why, maybe the JSObject rearrangement helped things.

This bug is blocking bug 602703, which we'd like to get done before the Fx 4.0 branch, so a quick review would be very welcome!  Thanks.
Attachment #478142 - Attachment is obsolete: true
Attachment #483074 - Flags: review?(bhackett1024)
Comment on attachment 483074 [details] [diff] [review]
patch, v3 (against TM 55602:0b754642eedb)

High level: the ACCSET annotations are great, and really help readability of the code.  I'm not too familiar with the tracer code, but having these makes it a lot easier to follow along.

>+    /* Partially check the CallInfo's storeAccSet is correct. */
>+    JS_ASSERT(obj->clasp == origObjClasp);

It would be really cool to check the correctness of the ACCSET annotations by asserting when the clasp of a pre-existing object is updated.  For followup I think, but it seems you could (when #ifdef DEBUG) add a flag vector to thread data for which areas can currently be written and write immediates to it before and after calling from the JIT code.  Then (incrementally) add JSObject::setClass and so forth which check those flags.  I'm worried that work unrelated to the tracer could subtly introduce a bug by  changing the access set of a native.

>+static const nanojit::AccSet ACCSET_STATE         = (1 <<  0);
>+static const nanojit::AccSet ACCSET_STACK         = (1 <<  1);
>+static const nanojit::AccSet ACCSET_RSTACK        = (1 <<  2);
>+static const nanojit::AccSet ACCSET_CX            = (1 <<  3);
>+static const nanojit::AccSet ACCSET_EOS           = (1 <<  4);
>+static const nanojit::AccSet ACCSET_ALLOC         = (1 <<  5);
>+static const nanojit::AccSet ACCSET_FRAMEREGS     = (1 <<  6);
>+static const nanojit::AccSet ACCSET_STACKFRAME    = (1 <<  7);
>+static const nanojit::AccSet ACCSET_RUNTIME       = (1 <<  8);
>+
>+// Nb: JSObject::{lastProp,map,flags} don't have an AccSet because they are never accessed on trace
>+static const nanojit::AccSet ACCSET_OBJ_CLASP     = (1 <<  9);
>+static const nanojit::AccSet ACCSET_OBJ_SHAPE     = (1 << 10);
>+static const nanojit::AccSet ACCSET_OBJ_PROTO     = (1 << 11);
>+static const nanojit::AccSet ACCSET_OBJ_PARENT    = (1 << 12);
>+static const nanojit::AccSet ACCSET_OBJ_PRIVATE   = (1 << 13);
>+static const nanojit::AccSet ACCSET_OBJ_CAPACITY  = (1 << 14);
>+static const nanojit::AccSet ACCSET_OBJ_SLOTS     = (1 << 15);  // the pointer to the slots
>+
>+static const nanojit::AccSet ACCSET_SLOTS         = (1 << 16);  // the slots themselves
>+static const nanojit::AccSet ACCSET_TARRAY        = (1 << 17);
>+static const nanojit::AccSet ACCSET_TARRAY_DATA   = (1 << 18);
>+static const nanojit::AccSet ACCSET_ITER          = (1 << 19);
>+static const nanojit::AccSet ACCSET_ITER_PROPS    = (1 << 20);
>+static const nanojit::AccSet ACCSET_STRING        = (1 << 21);
>+static const nanojit::AccSet ACCSET_STRING_MCHARS = (1 << 22);
>+static const nanojit::AccSet ACCSET_TYPEMAP       = (1 << 23);
>+static const nanojit::AccSet ACCSET_FCSLOTS       = (1 << 24);
>+static const nanojit::AccSet ACCSET_ARGS_DATA     = (1 << 25);

Short comments on the rest of these would be good.

> const char*
> nanojit::LInsPrinter::accNames[] = {
>-    "s",    // (1 << 0) == ACCSET_STACK
>-    "r",    // (1 << 1) == ACCSET_RSTACK
>-    "o",    // (1 << 2) == ACCSET_OTHER
>-              "?", "?", "?", "?", "?", "?", "?", "?",   //  3..10 (unused)
>-    "?", "?", "?", "?", "?", "?", "?", "?", "?", "?",   // 11..20 (unused)
>-    "?", "?", "?", "?", "?", "?", "?", "?", "?", "?",   // 21..30 (unused)
>-    "?"                                                 //     31 (unused)
>+    "state",        // (1 <<  0) == ACCSET_STATE
>+    "sp",           // (1 <<  1) == ACCSET_STACK
>+    "rp",           // (1 <<  2) == ACCSET_RSTACK
>+    "cx",           // (1 <<  3) == ACCSET_CX
>+    "eos",          // (1 <<  4) == ACCSET_EOS
>+    "alloc",        // (1 <<  5) == ACCSET_ALLOC
>+    "regs",         // (1 <<  6) == ACCSET_FRAMEREGS
>+    "sf",           // (1 <<  7) == ACCSET_STACKFRAME
>+    "rt",           // (1 <<  8) == ACCSET_RUNTIME
>+
>+    "objclasp",     // (1 <<  9) == ACCSET_OBJ_CLASP
>+    "objshape",     // (1 << 10) == ACCSET_OBJ_SHAPE
>+    "objproto",     // (1 << 11) == ACCSET_OBJ_PROTO
>+    "objparent",    // (1 << 12) == ACCSET_OBJ_PARENT
>+    "objprivate",   // (1 << 13) == ACCSET_OBJ_PRIVATE
>+    "objcapacity",  // (1 << 14) == ACCSET_OBJ_CAPACITY
>+    "objslots",     // (1 << 15) == ACCSET_OBJ_SLOTS
>+
>+    "slots",        // (1 << 16) == ACCSET_SLOTS
>+    "tarray",       // (1 << 17) == ACCSET_TARRAY
>+    "tdata",        // (1 << 18) == ACCSET_TARRAY_DATA
>+    "iter",         // (1 << 19) == ACCSET_ITER
>+    "iterprops",    // (1 << 20) == ACCSET_ITER_PROPS
>+    "str",          // (1 << 21) == ACCSET_STRING
>+    "strmchars",    // (1 << 22) == ACCSET_STRING_MCHARS
>+    "typemap",      // (1 << 23) == ACCSET_TYPEMAP
>+    "fcslots",      // (1 << 24) == ACCSET_FCSLOTS
>+    "argsdata",     // (1 << 25) == ACCSET_ARGS_DATA
>+
>+    "?!"            // this entry should never be used, have it just in case
> };

Maybe remove the "?!" entry and JS_STATIC_ASSERT the array length is TM_NUM_USED_ACCS?

>         LIns *vshape_ins = addName(
>             lir->insLoad(LIR_ldi,
>                          addName(lir->insLoad(LIR_ldp, cx_ins, offsetof(JSContext, runtime),
>-                                              ACCSET_OTHER, LOAD_CONST),
>+                                              ACCSET_CX, LOAD_CONST),
>                                  "runtime"),
>-                         offsetof(JSRuntime, protoHazardShape), ACCSET_OTHER),
>+                         offsetof(JSRuntime, protoHazardShape), ACCSET_RUNTIME),
>             "protoHazardShape");

Not related to this patch, but why isn't the address of the runtime baked in?

>+        suspendCSE();   // It's important that CSE works around this control-flow diamond.
>         LIns* br;
>         if (condBranch(LIR_jt, lir->ins2(LIR_ltui, idx_ins, capacity_ins), &br)) {
>             LIns* args[] = { idx_ins, obj_ins, cx_ins };
>             LIns* res_ins = lir->insCall(&js_EnsureDenseArrayCapacity_ci, args);
>             guard(false, lir->insEqI_0(res_ins), mismatchExit);
>             labelForBranch(br);
>         }
>+        resumeCSE();

It would be good to have more of an explanation why this is important, either a short example or a bug #.

>+        suspendCSE();   // It's important that CSE works around this control-flow diamond.
>         LIns* br2;
>         if (condBranch(LIR_jf, cond, &br2)) {
>             LIns* args[] = { idx_ins, obj_ins, cx_ins };
>             LIns* res_ins = addName(lir->insCall(&js_Array_dense_setelem_hole_ci, args),
>                                      "hasNoIndexedProperties");
>             guard(false, lir->insEqI_0(res_ins), mismatchExit);
>             labelForBranch(br2);
>         }
>+        resumeCSE();

Ditto.

>+void ValidateWriter::checkAccSet(LOpcode op, LIns* base, int32_t disp, AccSet accSet)
>+{
>+    bool ok;
...
>+}

This is really nice.  It seems like you could get more complete checking for e.g. the object and string access regions by annotating the return types of natives and tagging instructions (if DEBUG) when they are pointers to particular regions.  Don't know if this would be worth the effort.

>+    /* 
>+     * These can be put around a control-flow diamond if it's important that
>+     * CSE work across the diamond.
>+     */
>+    void suspendCSE()   { if (cse_filter) cse_filter->suspend(); }
>+    void resumeCSE()    { if (cse_filter) cse_filter->resume();  }

Are there any restrictions on when these can be used, other than making sure the control flow follows a diamond pattern (besides side exits from guards)?  I take it that for any call or store inside the diamond, the CSE filter will still throw away expressions involving modified access regions.
Attachment #483074 - Flags: review?(bhackett1024) → review+
> Not related to this patch, but why isn't the address of the runtime baked in?

Silly oversight. protoHazardShape and all shape state (heap, tree, well-known empty shape singletons) should move into comparments, but is it true that due to JITted code being compiled per global if not per compartment (and I hope we end up with a compartment per global), we can still bake in compartment pointers?

/be
Both JITs should be able to bake in compartment pointers.  The trace JIT can also bake in pointers to the thread data, while the method JIT can't (see bug 589193).  Having one compartment per global would be nice, but COMPILE_N_GO already lets JIT code bake in per-global objects like standard class prototypes.
(In reply to comment #5)
> 
> High level: the ACCSET annotations are great, and really help readability of
> the code.  I'm not too familiar with the tracer code, but having these makes it
> a lot easier to follow along.

Good!  I was worried that the opposite would happen, that it would seem like over-engineering.


> >+    /* Partially check the CallInfo's storeAccSet is correct. */
> >+    JS_ASSERT(obj->clasp == origObjClasp);
> 
> It would be really cool to check the correctness of the ACCSET annotations by
> asserting when the clasp of a pre-existing object is updated.  For followup I
> think, but it seems you could (when #ifdef DEBUG) add a flag vector to thread
> data for which areas can currently be written and write immediates to it before
> and after calling from the JIT code.  Then (incrementally) add
> JSObject::setClass and so forth which check those flags.  I'm worried that work
> unrelated to the tracer could subtly introduce a bug by  changing the access
> set of a native.

I agree that checking the AccSets on natives is highly desirable, esp. when a native calls additional C++ functions.  Currently the AccSets on natives are very conservative, in all cases where the native is impure the accset is ACCSET_STORE_ANY, the only exceptions were added in this patch because they helped performance.  So I'd like to do better checking but I don't understand your proposal... more specifically, I don't see how/when this flag vector would be updated.  I guess you'd somehow incorporate it into the code that calls natives?  Also, where would the immediates be written -- into the flag vector itself?

Pushed with minor modifications made as suggested:

http://hg.mozilla.org/tracemonkey/rev/c961a413660c
Whiteboard: fixed-in-tracemonkey
Second attempt, first one was backed out due to assertion failures:

http://hg.mozilla.org/tracemonkey/rev/0084457d2cfc
(In reply to comment #8)
> I agree that checking the AccSets on natives is highly desirable, esp. when a
> native calls additional C++ functions.  Currently the AccSets on natives are
> very conservative, in all cases where the native is impure the accset is
> ACCSET_STORE_ANY, the only exceptions were added in this patch because they
> helped performance.  So I'd like to do better checking but I don't understand
> your proposal... more specifically, I don't see how/when this flag vector would
> be updated.  I guess you'd somehow incorporate it into the code that calls
> natives?  Also, where would the immediates be written -- into the flag vector
> itself?

Yeah, you would have to modify the code generator to do the writes before and after making a native call.  When emitting a call to a native Foo with a given ACCSET (again, only in DEBUG mode):

mov(ACCSET, &cx->threadData->accessSets) // this pointer can be baked in
call(...)
mov(ACCSET_STORE_ANY, &cx->threadData->accessSets) // stores anywhere now allowed

Something would probably have to be done wrt bail/deep-bail (I don't know what), and there might be other complexities.
(In reply to comment #10)
> 
> mov(ACCSET, &cx->threadData->accessSets) // this pointer can be baked in
> call(...)
> mov(ACCSET_STORE_ANY, &cx->threadData->accessSets) // stores anywhere now
> allowed

So that sets the flag vector... how then does the checking work?  Eg. if the function is marked as never touching the stack (ACCSET_STACK) how is that checked?
For complete coverage, there would need to be asserts on the thread data's current allowed regions whenever a region is written to; this is more feasible for isolated regions that can be put into accessor functions (e.g. obj->clasp) rather than things like ACCSET_STACK where there are ad-hoc writes all over the place.  ACCSET_STACK probably isn't worth it, but other regions might be.  I don't know if this is something that should be done, but at least it could be (sort of) done.
http://hg.mozilla.org/mozilla-central/rev/0084457d2cfc
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: