TM: more implicit constant propagation after guards

RESOLVED WONTFIX

Status

()

RESOLVED WONTFIX
8 years ago
7 years ago

People

(Reporter: njn, Assigned: njn)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Assignee)

Description

8 years ago
Continuing on from the example in bug 600127 comment 0, the code for
"a[i] += 3":

    #------------------------------------------------------------------------
    # getelem
    #------------------------------------------------------------------------
    ldi5 = ldi.o $stack5[4]
    clasp = immi 0x2ee060
    guard(class is Array) = eqi ldi5, clasp/*0x2ee060*/
    xf4: xf guard(class is Array) -> pc=0x613532 imacpc=0x0 sp+32 rp+0
(GuardID=002)
    capacity = ldi.o $stack5[40]
    ltui1 = ltui ldi1, capacity
    xf3: xf ltui1 -> pc=0x613532 imacpc=0x0 sp+32 rp+0 (GuardID=003)
    dslots = ldi.o $stack5[24]
    immi5 = immi 3
    lshi2 = lshi ldi1, immi5/*3*/
    addi2 = addi dslots, lshi2
    ldi4 = ldi.o addi2[4]
    JSVAL_TAG_INT32 = immi 0xffff0001
*   eqi4 = eqi ldi4, JSVAL_TAG_INT32/*0xffff0001*/
    xf2: xf eqi4 -> pc=0x613532 imacpc=0x0 sp+32 rp+0 (GuardID=004)
    ldi3 = ldi.o addi2[0]
    sti.s sp[16] = ldi3
    #--------------------------------------------------------------------
    sti.s sp[24] = immi5/*3*/
    addxovi2 = addxovi ldi3, immi5/*3*/ -> pc=0x613535 imacpc=0x0 sp+32 rp+0
(GuardID=005)
    sti.s sp[16] = addxovi2
    #------------------------------------------------------------------------
    # setelem
    #------------------------------------------------------------------------
    JSVAL_TAG_MAGIC = immi 0xffff0004
**  eqi2 = eqi ldi4, JSVAL_TAG_MAGIC/*0xffff0004*/
**  jf eqi2 -> label2
**  hasNoIndexedProperties = calli.all #js_Array_dense_setelem_hole ( cx
$stack5 ldi1 )
**  immi3 = immi 0
**  eqi1 = eqi hasNoIndexedProperties, immi3/*0*/
**  xt1: xt eqi1 -> pc=0x613536 imacpc=0x0 sp+24 rp+0 (GuardID=008)
**  label2:
    JSVAL_TAG_INT32_2 = immi 0xffff0001
    sti.o addi2[4] = JSVAL_TAG_INT32_2/*0xffff0001*/
    sti.o addi2[0] = addxovi2
    #------------------------------------------------------------------------

We can see that ldi4 must be equal to JSVAL_TAG_INT32 after xf2 (marked with
a '*').  Therefore eqi2 must be false.  Therefore all the lines marked '**'
can be removed.  All this leaves for the SETELEM is writing the value and
tag.  (The tag write is redundant but that's harder to determine and fodder
for a follow-up bug.)  Also, by removing the js_Array_dense_setelem_hole()
it subsequently makes life easier for CSE.

Attached is a half-working patch, but one that doesn't correctly handle
non-linear control flow.  It might be better to do this in CseFilter.

Kraken's imaging-desaturate.js looks to benefit hugely from this.  It's main
loop looks like this:

            while (p--) {
                data[pix-=4] = data[pix1=pix+1] = data[pix2=[pix+2] = (data[pix]*0.3 + data[pix1]*0.59 + data[pix2]*0.11);

ie. it does three GETELEMS followed by three SETELEMS of the same elements;
this patch will allow the SETELEMS to be done almost for free;  however, bug
584279 will need to land as well to make CSE strong enough for all this to
fall into place.  I suspect other Kraken benchmarks will benefit a lot too,
eg. audio-fft and audio-beat-detection are both dominated by a single loop
where some array elements are get and then set.
(Assignee)

Comment 1

8 years ago
Created attachment 489387 [details] [diff] [review]
patch (against TM 57064:805c1a5d5cc6)

Kraken results:

---------------------------------------------------------------
| millions of instructions executed                           |
| total                        | on-trace (may overestimate)  |
---------------------------------------------------------------
|  4433.097  4433.098 (------) |  4142.906  4142.906 (------) | ai-astar
|  2934.580  2864.798 (1.024x) |  1391.178  1320.148 (1.054x) | audio-beat-det
|  1307.400  1307.410 (------) |  1140.652  1140.652 (------) | audio-dft
|  2683.524  2612.184 (1.027x) |  1347.583  1276.516 (1.056x) | audio-fft
|  2665.113  2665.127 (------) |  1856.255  1856.255 (------) | audio-oscillat
|  6950.825  6950.836 (------) |  4784.012  4784.013 (------) | imaging-gaussi
|  3315.591  3305.341 (1.003x) |   748.440   738.383 (1.014x) | imaging-darkro
|  5995.455  5952.050 (1.007x) |  3836.520  3793.159 (1.011x) | imaging-desatu
|   681.309   681.310 (------) |    10.073    10.073 (------) | json-parse-fin
|   497.739   497.739 (------) |     5.954     5.954 (------) | json-stringify
|  1272.720  1272.851 (------) |   748.719   748.719 (------) | stanford-crypt
|   710.550   710.096 (1.001x) |   362.991   362.697 (1.001x) | stanford-crypt
|  1152.121  1137.421 (1.013x) |   585.029   571.448 (1.024x) | stanford-crypt
|   503.629   497.694 (1.012x) |   199.421   195.863 (1.018x) | stanford-crypt
-------
| 35103.659 34887.962 (1.006x) | 21159.740 20946.793 (1.010x) | all

imaging-desaturate didn't improve much, turns out there's enough slots aliasing in the GETELEM/GETELEM/GETELEM/SETELEM/SETELEM/SETELEM sequence that this optimization doesn't hit for the most part, which is annoying.

I'm ambivalent about this patch.  It's a small Kraken speed-up, but it puts some quite TM-specific code into NJ.
(Assignee)

Comment 2

8 years ago
Comment on attachment 489387 [details] [diff] [review]
patch (against TM 57064:805c1a5d5cc6)

Ed, do you think this too gross for words?
Attachment #489387 - Flags: feedback?(edwsmith)

Comment 3

8 years ago
Sort of pushing the envelope on CSE a bit, but the cure could be worse than the disease -- trying to factor that stuff out into a subclass would introduce some pretty weird hooks.

I see where you're going with it; at the point of a use of an expression, we know more information than at the point of a def, because of predicates in the surrounding control flow.  So, you're building up tables of information about values at certain points, that get cleared when CSE resets.

In TR we have a similar hashtable for tracking the possible-null-ness of a bunch of pointer values (see VarTracker in CodegenLIR).  Managing the hashtable is similar to managing the CSE tables, except we work harder to merge states at control-flow joins and not blow it away at labels with only one predecessor (so either branch of an if is handled well... not applicable for guards).

As long as the tables stay fairly small, it is manageable.  But when they get big because of long sequences of code, I wish there were a cleaner way to do it, maybe with pseudo- LIR instructions, rather than a side table.  But, we'd be inserting pseudo-instructions at the exact points we're inserting constants, using similar side tables.

SSI representation aims to solve this exact problem, by inserting special instructions at control-flow splits, which allow more exact information to be stored with the instruction. (like what value it must have).  but its too heavweight.  Jeremy Singer's dissertation is a great read on the topic.
http://www.dcs.gla.ac.uk/~jsinger/phd.html

Updated

8 years ago
Attachment #489387 - Flags: feedback?(edwsmith) → feedback+

Updated

8 years ago
Attachment #489387 - Flags: feedback?(wmaddox)
(Assignee)

Comment 4

8 years ago
(In reply to comment #3)
> 
> SSI representation aims to solve this exact problem, by inserting special
> instructions at control-flow splits, which allow more exact information to be
> stored with the instruction. (like what value it must have).  but its too
> heavweight.  Jeremy Singer's dissertation is a great read on the topic.
> http://www.dcs.gla.ac.uk/~jsinger/phd.html

Ha, I know Jeremy well, I did my PhD at the same time as him, he was in the office next door and we'd play cricket in the hallways :)  He used to go on about SSI all the time, though I never absorbed much of the details.

Comment 5

8 years ago
Comment on attachment 489387 [details] [diff] [review]
patch (against TM 57064:805c1a5d5cc6)

Since this is in a similar vein to the existing knownCmpValues machinery, I won't raise an objection here, though it would be nice if there were some way to factor out these sorts of specialized optimizations so they are included only with the front-end that needs them.

GCC has a pass that does a fairly general optimization of this sort, decomposing the condition in an if-statement to generate a set of conditions known to be true or false in each branch.  Since matching of candidate expressions against these is done by CSE-style hash table lookup rather than a backward-chaining inference engine, all of the possible expressions that will be eligible for this optimization have to be generated up-front (forward-chaining).  This allows interesting optimizations for comparisons not involving a constant, for example.  I can't imagine this approach is very fast, nor represents a good cost/benefit tradeoff for Nanojit, however, since CSE is already a bottleneck for us.
Attachment #489387 - Flags: feedback?(wmaddox) → feedback+
(Assignee)

Comment 6

8 years ago
(In reply to comment #5)
> 
> Since this is in a similar vein to the existing knownCmpValues machinery, I
> won't raise an objection here, though it would be nice if there were some way
> to factor out these sorts of specialized optimizations so they are included
> only with the front-end that needs them.

I agree, it's just that this fits perfectly into the CSE pass :/  Either way, I don't plan on landing it any time soon as it's not enough of a win to be worth it.
(Assignee)

Comment 7

7 years ago
TM's days are numbered:  WONTFIX.
Status: ASSIGNED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.