Last Comment Bug 670493 - JM: for loop with "i--" as loop condition is 2x slower than same loop with "i-- > 0" as loop condition
: JM: for loop with "i--" as loop condition is 2x slower than same loop with "i...
Status: RESOLVED FIXED
js-triage-done
: perf
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: Trunk
: x86 Mac OS X
: -- normal (vote)
: mozilla9
Assigned To: Jan de Mooij [:jandem]
:
Mentors:
http://jsperf.com/forloopspeed
Depends on: 684789 684797 686919 707141 716900
Blocks: WebJSPerf 679710
  Show dependency treegraph
 
Reported: 2011-07-09 21:35 PDT by Boris Zbarsky [:bz]
Modified: 2012-09-03 19:31 PDT (History)
11 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Testcase (331 bytes, text/plain)
2011-07-09 21:35 PDT, Boris Zbarsky [:bz]
no flags Details
WIP (5.12 KB, patch)
2011-09-02 12:44 PDT, Jan de Mooij [:jandem]
no flags Details | Diff | Splinter Review
Patch (5.19 KB, patch)
2011-09-03 07:55 PDT, Jan de Mooij [:jandem]
bhackett1024: review+
Details | Diff | Splinter Review

Description Boris Zbarsky [:bz] 2011-07-09 21:35:15 PDT
Created attachment 545046 [details]
Testcase

See attached testcase.  I see f2() taking 2x longer than f1(), for both JM and JM+TI.

Running under -j doesn't seem to show this 2x slowdown.
Comment 1 Marty Rosenberg [:mjrosenb] 2011-08-01 18:01:11 PDT
f2 generates the opcodes:
00038:   5  zero
00039:   5  gt
00040:   5  ifne 32 (-8)

where f2 generates the opcodes:
00038:  13  ifne 32 (-6)

Then gt; ifne gets fused into a single operation that we generate code for, 
whereas the lone ifne code mostly only works if the argument is a boolean.  In this case, It is an Int32, so we fall back to the slow C++ code.
The relevant code is in booleanJumpScript:
>    if (type.isSet()) {
>        jmpNotBool.setJump(masm.testBoolean(Assembler::NotEqual, type.reg()));
>    } else {
Comment 2 David Mandelin [:dmandelin] 2011-08-02 11:14:10 PDT
Looks like a job for a peephole optimizer

It'd be interesting to see what other engines do with this.
Comment 3 Marty Rosenberg [:mjrosenb] 2011-08-03 07:53:43 PDT
Sorry if I was not clear.  It looks like the root cause of the problem is that we don't have a fast path for doing integer checks.  I suspect it would be about +10 lines to fix this (and we'd add in about 2 instructions), but I have bigger fish to fry for the time being.  

sstangl had the idea to put asserts in our ool slow paths to make sure that we were not falling back to C++ for integer/double operations which could be trivially jitted.
Comment 4 Jan de Mooij [:jandem] 2011-08-03 08:05:26 PDT
booleanJumpScript should handle known int32 values (my first JM patch, hah), so on the TI branch we should be able to handle this inline. Looking at booleanJumpScript there seems to be more low-hanging fruit (regalloc for instance).
Comment 5 Boris Zbarsky [:bz] 2011-08-03 09:15:44 PDT
> so on the TI branch we should be able to handle this inline

As in, you don't yet but you could?

The timings I see on TI branch are:

-m:
  419
  905
-j:
  176
  178
-m -n:
  133
  291

Note that having an integer as test is the common case for loops, so I think we should seriously look into making it fast.
Comment 6 Jan de Mooij [:jandem] 2011-09-02 05:17:58 PDT
(In reply to Boris Zbarsky (:bz) from comment #5)
> > so on the TI branch we should be able to handle this inline
> 
> As in, you don't yet but you could?

We do, but somehow it's less efficient than the code generated for i-- > 0.

The fix for this will also help bug 679710. With Ion coming up I don't want to spend too much time fixing JM perf issues, but this one seems to be quite common.
Comment 7 Jan de Mooij [:jandem] 2011-09-02 12:44:28 PDT
Created attachment 557919 [details] [diff] [review]
WIP

This refactors booleanJumpScript and gets rid of some stack stores and jumps in the generated code. And it removes some code: 1 files changed, 43 insertions(+), 55 deletions(-)

For the attached testcase:

old: 132 310
new: 133 148

More than 2x faster, but still 15 ms slower than version 1. The reason is that we are able to hoist the overflow check for i-- > 0 but not for the i--. The generated code for the two versions is very similar. The only difference is test/jne vs. cmp/jg and the overflow check:
--
   0x8a1710:	mov    %edi,%esi
   0x8a1712:	add    $0xffffffff,%esi
>> 0x8a1715:	jo     0x8a1bc7
   0x8a171b:	mov    %esi,0x48(%ebp)
   0x8a171e:	mov    %edi,%ebx
   0x8a1720:	mov    %esi,%edi
   0x8a1722:	test   %ebx,%ebx
   0x8a1724:	jne    0x8a16ec
   0x8a172a:	jmp    0x8a173c
--
I will add the TI optimizations in bug 679710. This patch seems to be a nice win so I would like to get it in first
Comment 8 Emanuel Hoogeveen [:ehoogeveen] 2011-09-02 13:06:48 PDT
(In reply to Jan de Mooij from comment #7)
> ... The reason is
> that we are able to hoist the overflow check for i-- > 0 but not for the
> i--.
As someone following along at home, why is that? i-- can overflow if i == 0 and i is unsigned, but both loops test the value before the decrement, right? Is this a peculiarity of JS, or am I missing something simple?
Comment 9 Boris Zbarsky [:bz] 2011-09-02 13:15:54 PDT
Seems like if we know that i > 0 then i-- can't overflow (we work only with signed ints for purposes of working with ints).

But if all we know is |i != 0|, then i-- could overflow on a signed int i, right?

The above is total guess; jandem is the one who knows what's going on here.  ;)
Comment 10 Emanuel Hoogeveen [:ehoogeveen] 2011-09-02 13:35:30 PDT
Hmm, I suppose it would overflow from the maximum negative value to.. something else in that case (not sure what exactly, bit rusty on integer representations). I guess it makes sense that the overflow check has to be made in that case - though I don't know what happens on overflow :) Does TI have type barriers for unsigned values, or is signed assumed throughout the codebase? (not to start some lengthy discussion on the JS engine, I think I understand the behavior difference now)
Comment 11 Brian Hackett (:bhackett) 2011-09-02 13:43:04 PDT
We can only remove the overflow check if i is always >= 0 within the loop body, else we could get the overflow on INT_MIN and need to start using the equivalent double (all numbers in JS are really doubles, we go through this integer song and dance for perf).  With some analysis we should be able to determine that if i >= 0 when we enter the loop, either when it initially starts or via an interpreter transition, that it will stay >= 0.  Then we just have to do that one test when entering the loop, and don't need to test each iteration.  LoopState does similar analysis to this for other loop forms, but needs to be more robust for this case.
Comment 12 Boris Zbarsky [:bz] 2011-09-02 13:49:10 PDT
> not sure what exactly

In practice on most hardware using 2s-complement, INT_MIN - 1 = INT_MAX.  But JS engine behavior should be that INT_MIN-1 = double(INT_MIN) - 1.
Comment 13 Emanuel Hoogeveen [:ehoogeveen] 2011-09-02 13:50:50 PDT
Thanks, that just about answers all my questions!
Comment 14 Jan de Mooij [:jandem] 2011-09-03 07:55:31 PDT
Created attachment 558066 [details] [diff] [review]
Patch

Can you especially check whether this comment is still correct: "Note: this cannot overwrite slots holding loop invariants."?
Comment 15 Brian Hackett (:bhackett) 2011-09-03 09:45:25 PDT
Comment on attachment 558066 [details] [diff] [review]
Patch

Nice.
Comment 16 Jan de Mooij [:jandem] 2011-09-03 13:18:07 PDT
Adding checkin-needed. The tracemonkey repo required level 2 commit access but for inbound I need level 3.

This was green on try (except some seemingly unrelated Android orange).
Comment 18 Ed Morley [:emorley] 2011-09-04 16:48:34 PDT
http://hg.mozilla.org/mozilla-central/rev/6e67cafd75c8

Note You need to log in before you can comment on or make changes to this bug.