Closed Bug 536458 Opened 15 years ago Closed 15 years ago

Too many charCodeAt builtin calls when using bitops and arithmetic

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED DUPLICATE of bug 551039

People

(Reporter: bzbarsky, Assigned: gal)

References

Details

(Keywords: perf)

Attachments

(3 files)

Attached file shell testcase
BUILD: tracemonkey js shell built from rev 9d0fe1237c26

STEPS TO REPRODUCE:
1)  Add printfs to js_String_p_charCodeAt_int and js_String_p_charCodeAt so you
    can see when they're called (and in particular so you can tell which one is
    called).
2)  Run the attached shell testcase with jit on

EXPECTED RESULTS: 3 calls to js_String_p_charCodeAt_int

ACTUAL RESULTS: 2 calls to js_String_p_charCodeAt_int and 2 calls to
                js_String_p_charCodeAt

If you increase the iteration count, both numbers increase with it.  It looks like we're calling _both_ builtins for each charCodeAt call.  The "+1" is key to get this behavior; if I take it out, I just get calls to the int builtin.
Blocks: 534364
blocking2.0: --- → ?
Keywords: perf
And in particular, if I take out the +1, I only get 2 calls to js_String_p_charCodeAt_int.
I guess only 2 calls are expected, since we seem to now trace on loop #3 and run on loop #4, instead of 2 and 3 respectively...
Ignore comment 3; one iteration is effectively "lost" due to some global shape evolution stuff.  So we still only expect 2 calls, instead of 2 and 2.
OK, a TMFLAGS=liveness log shows that with the +1 the LIR after liveness analysis has:

  js_String_p_charCodeAt1 = fcall #js_String_p_charCodeAt ( $global0 $global1 )
                                 $global0 state sp eos ld2 js_String_p_charCodeAt1 
  stqi sp[0] = js_String_p_charCodeAt1
                                 $global0 state sp eos ld2 
  js_String_p_charCodeAt_int1 = icall #js_String_p_charCodeAt_int ( $global0 ld2 )
                                 state sp js_String_p_charCodeAt_int1 eos ld2 
  sti sp[0] = js_String_p_charCodeAt_int1

As in, the liveness analysis fails to DCE the non-int call.
I should note that I don't quite see where we treat CSE-able calls as not liveness roots in the first place...
OK. So what happens in the "working" case is that the stqi to sp[0] is filtered out by StackFilter because d <= top (both are 0, in fact).  

In the "non-working" case, d < top, and in particular top - d == 4.  And we have bit 4 set due to the sti from js_String_p_charCodeAt_int1, but we do NOT have bit 3 set (because the js_String_p_charCodeAt_int1 only writes 4 bytes to the stack).  So the StackFilter leaves the stqi, and then liveness treats it as a statement and entrains the js_String_p_charCodeAt1.
And the reason top > d in that case is that with the + 1 there we have:

  0x8137c8  and1 = and js_String_p_charCodeAt_int_int1, 255
                                 state sp and1 eos ld2 
  0x8137e0  sti sp[0] = and1               state sp and1 eos ld2 
  0x813804  sti sp[8] = 1                  state sp and1 eos ld2 
  0x813810  add1 = add and1, 1             add1 state sp eos ld2 
  0x813818  ov1 = ov add1                  add1 ov1 state sp eos ld2 
  0x813824  xt1: xt ov1 -> pc=0x40e690 imacpc=0x0 sp+16 rp+0 (GuardID=004)

and the StackFilter looks at stack-top whenever it sees a guard (so at that xt guard).  And at that point stack-top is in fact at 16 (which we shift right by 2 to get the 4 we're seeing).  In other words, our attempt to do integer addition here is actually screwing us over in terms of the builtins we call.
If I hack StackFilter to assume all stores are effectively 4-byte stores, the attached HTML testcase gets much faster and sunspider gets an 8% speedup in browser (though only about 1% in shell, for reasons I don't understand).
bz, this is huge (well, in-browser SS speedup -- more info craved on diff there vs. shell). Cc'ing NJ peeps.

/be
bz, let me see if I understand correctly.  You have a sequence like this:

  stqi sp[0] = x
  ...
  sti sp[0] = y

Ie. the sti clobbers half of the value written by the stqi.  And you'd like the stqi to be removed, in this case because computing x is expensive but x isn't used subsequently.  I can see why you'd want that, but what if the instruction immediately after the sti was 'z = ld sp[4]'?
Nicholas, that's exactly the sequence I have, and that's exactly what I'd like to happen.  I understand why StackFilter currently doesn't do it, but if I understand correctly, in our actual usage of nanojit all stack access is 8-byte aligned (perhaps modulo function calls?), so the next instruction cannot be |ld sp[4]|.  I might be understanding incorrectly, of course...
(In reply to comment #13)
> if I
> understand correctly, in our actual usage of nanojit all stack access is 8-byte
> aligned (perhaps modulo function calls?), so the next instruction cannot be |ld
> sp[4]|.

I don't know if that's true, but I imagine that the optimisation would be safe w.r.t. how TM uses the stack.  But expecting NJ to know about TM's use of the stack doesn't seem right.  Maybe the bytecode-to-LIR translation should be better.
I just reran sunspider in browser a few times with and without a hacked StackFilter, and the 8% might have been a fluke.  Even with the hack to StackFilter I have never gotten a run that fast again....  That makes me a little happier at least in terms of us measuring shell sunspider as an ok proxy for browser.

The 1% improvement in shell might be real, though.
I suggest waiting until another commit has gone in, then time the shell again.  That helps separate noise from real changes.
Assignee: general → gal
(In reply to comment #12)
> bz, let me see if I understand correctly.  You have a sequence like this:
> 
>   stqi sp[0] = x
>   ...
>   sti sp[0] = y

Thinking about this more, this is a very strange sequence.
Strange in what sense?
Why does sp[0] have two different types in quick succession?  (The stqi is now stfi, btw.)  I guess the float value is stored and then is dead?  In which case it seems surprising that it was stored.  Maybe it's an artifact of the stack machine.  I looked at the bytecode but had some trouble understanding how it got mapped to the LIR.
> Why does sp[0] have two different types in quick succession? 

Because of the code landed in bug 534364 (which this bug is a regression from; see dependencies), which basically ends up with us making two insCall calls in a row, one for a function returning a double and one for a function returning an int.
Aha, so it is strange code!  Is there a plan to avoid the two calls at the jstracer level?
blocking2.0: ? → -
I have no idea.  Ask Andreas or dvander?  That would fix this bug as filed, of course.
I've seen this show up in a few more testcases out on the web.
blocking2.0: - → ?
Marking this as a dup of 551039 because both bugs have the same root cause -- shortcomings in StackFilter.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → DUPLICATE
blocking2.0: ? → ---
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: