Open Bug 895127 Opened 11 years ago Updated 2 years ago

ARM: Chrome faster than Firefox on nsieve asm.js port

Categories

(Core :: JavaScript Engine, defect)

ARM
Android
defect

Tracking


REOPENED

People

(Reporter: djvj, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(7 files, 1 obsolete file)

This shows up on ARM only (tested on my Android device); on my Mac, Firefox is faster.

I ported nsieve to C and built with emscripten at -O2.  The asm.js code can be accessed at:

vijayan.ca/nsieve/nsieve.html

Chrome on Android performs about 1.4x faster than Firefox (Nightly) on Android.  Given that we do better than Chrome on x64, this seems like it might be an ARM codegen issue.
Summary: Chrome faster than Firefox on nsieve asm.js port → ARM: Chrome faster than Firefox on nsieve asm.js port
Attached file C source code.
What do you see on x86?  This could be the issue that ARM/x86 load/store bounds checks aren't hoisted.
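For concreteness, a hedged sketch of the difference (heap and heapSize are hypothetical stand-ins for the asm.js heap and its length):

    #include <stdint.h>
    #include <stddef.h>

    extern uint8_t heap[];     /* hypothetical asm.js heap */
    extern size_t heapSize;    /* hypothetical heap length */

    /* per-access form, roughly what the ARM/x86 backends emit today */
    void store_checked(int i, int m) {
        for (int k = i + i; k <= m; k += i)
            if ((size_t)k < heapSize)   /* re-checked on every iteration */
                heap[k] = 0;
    }

    /* hoisted form: one range check before the loop covers every access,
       assuming k only grows (i >= 1, as in nsieve) */
    void store_hoisted(int i, int m) {
        if (i >= 1 && m >= 0 && (size_t)m < heapSize)
            for (int k = i + i; k <= m; k += i)
                heap[k] = 0;            /* no per-iteration check */
    }
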
Profiling results:
 30.29% nsieve.html:2570:1221: Function aS - Block 25
 13.96% nsieve.html:2570:1010: Function aS - Block 15
 12.81% nsieve.html:2570:1296: Function aS - Block 29
  6.71% nsieve.html:2570:807: Function aS - Block 5
  5.50% nsieve.html:2570:1085: Function aS - Block 19
  3.12% nsieve.html:2570:882: Function aS - Block 9
  2.73% nsieve.html:2570:1268: Function aS - Block 28
  2.71% nsieve.html:2570:1057: Function aS - Block 18
  2.18% nsieve.html:0:0: Function aS - Block 26
  2.13% nsieve.html:2570:1192: Function aS - Block 23
  1.72% nsieve.html:2570:1183: Function aS - Block 22
  1.69% nsieve.html:0:0: Function aS - Block 16

-=-
The hottest block.

Original source: for (k = i + i; k <= m; k += i) isPrime[k] = false;

Minified JS: {g=e;do{a[b+g|0]=0;g=g+f|0;}while((g|0)<=8e4)}
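The attached C file is authoritative; for readers following along, the shootout nsieve kernel that this loop comes from looks roughly like the sketch below, consistent with the malloc/memset/free calls visible in the annotated dumps later in this bug.

    #include <stdlib.h>
    #include <string.h>

    static int nsieve(int m) {
        int count = 0;
        char *isPrime = malloc(m + 1);
        memset(isPrime + 2, 1, m - 1);       /* mark 2..m as candidate primes */
        for (int i = 2; i <= m; i++) {
            if (isPrime[i]) {
                /* the hot inner loop: Block 25 in the profile above */
                for (int k = i + i; k <= m; k += i)
                    isPrime[k] = 0;
                count++;
            }
        }
        free(isPrime);
        return count;
    }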


400335f8 28 nsieve.html:2570:1221: Function aS - Block 25
   0x400335f8: ldr	r3, [sp, #8]
   0x400335fc: adds	r4, r3, r0
   0x40033600: mov	r5, #0
   0x40033604: lsrs	r12, r4, #24     <- bounds check
   0x40033608: strbeq	r5, [r11, r4]    <- heap store
   0x4003360c: adds	r4, r0, r1
   0x40033610: movw	r12, #14464
   0x40033614: movt	r12, #1
   0x40033618: cmp	r4, r12
   0x4003361c: bgt	0x4003362c

Note that the low-bit masking is not needed in this case because this is a byte-sized operation.


* Removing the low bit masking and the bounds check gives around a 10% reduction in run time.

400cb578 24 nsieve.html:2570:1221: Function aS - Block 25
   0x400cb578: ldr	r3, [sp, #8]
   0x400cb57c: adds	r4, r3, r0
   0x400cb580: mov	r5, #0
   0x400cb584: strb	r5, [r11, r4]
   0x400cb588: adds	r4, r0, r1
   0x400cb58c: movw	r12, #14464	; 0x3880
   0x400cb590: movt	r12, #1
   0x400cb594: cmp	r4, r12
   0x400cb598: bgt	0x400cb5a8


* Using high-bit masking, rather than the heap bounds check, gives little change.  Note that for byte-sized heap accesses the low-bit mask was not necessary, so the bounds check is just being replaced by the high-bit mask.

4001d5c0 28 nsieve.html:2570:1221: Function aS - Block 25
   0x4001d5c0: ldr	r3, [sp, #8]
   0x4001d5c4: adds	r4, r3, r0
   0x4001d5c8: bic	r5, r4, #-67108864	; 0xfc000000
   0x4001d5cc: mov	r4, #0
   0x4001d5d0: strb	r4, [r11, r5]
   0x4001d5d4: adds	r4, r0, r1
   0x4001d5d8: movw	r12, #14464	; 0x3880
   0x4001d5dc: movt	r12, #1
   0x4001d5e0: cmp	r4, r12
   0x4001d5e4: bgt	0x4001d5f4
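In C terms, the two guard strategies compared here look roughly as follows; the 2^24 bound and the 2^26 mask are read off the lsrs/bic constants in the listings above, and how the engine maps the heap around them is an assumption.

    #include <stdint.h>

    extern uint8_t heap[];   /* hypothetical asm.js heap */

    /* matches 'lsrs r12, r4, #24' + 'strbeq': store only if idx < 2^24 */
    void store_bounds_checked(uint32_t idx) {
        if ((idx >> 24) == 0)
            heap[idx] = 0;
    }

    /* matches 'bic r5, r4, #0xfc000000': wrap idx into a 2^26 region
       instead of skipping the store */
    void store_high_bit_masked(uint32_t idx) {
        heap[idx & 0x03ffffff] = 0;
    }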


* Clang generated code:

	lsl	r3, r2, #1
	b	.LBB0_5
.LBB0_4:
	strb	r1, [r4, r3]
	add	r3, r3, r2
.LBB0_5:
	cmp	r3, r5
	ble	.LBB0_4

Note that the constants are hoisted, the loop blocks are better ordered, and the store instruction is able to make better use of its 'base plus index' encoding.

* GCC generated code:

.L22:
	strb	ip, [r1, r3]
	add	r3, r3, r2
	cmp	r0, r3
	bge	.L22

Very similar to the clang code.
If you remove bounds checking altogether, how does our perf compare to Chrome's? (I'm wondering if this is due to bounds-check hoisting.)
(In reply to Luke Wagner [:luke] from comment #6)
> If you remove bounds checking altogether, how does our perf compare to
> Chrome's? (I'm wondering if this is due to bounds-check hoisting.)

With the bounds checking removed altogether, the performance improved by around 10%.  I have not actually compared the results to Chrome, but given the reported 1.4x speed difference, this still leaves a good gap.

For the best performance it might be necessary to allow the heap base to be hoisted too, so the instruction encodings within such loops can be put to better use.
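A sketch of the desired loop shape, mirroring the clang/GCC output above (names are hypothetical): with the heap base pinned in a register and the limit materialized once, each iteration is just a store, an add, a compare, and a branch.

    #include <stdint.h>

    void clear_multiples(uint8_t *heapBase, int i, int limit) {
        /* heapBase held in a register enables the 'strb rX, [rbase, rk]'
           base-plus-index encoding; limit (the movw/movt #20000 pair in the
           dumps) is loaded once instead of once per iteration */
        for (int k = i + i; k <= limit; k += i)
            heapBase[k] = 0;
    }
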
Depends on: 897412
Depends on: 897425
Already mentioned on IRC, but perhaps we should look into the run-time characteristics of malloc/memset.
I replaced the malloc() call with a use of a static array, and here were the times:
mjrosenb@eve:~/bench$ js.ion.opt  ./nsieve_static.js 
warning: successfully compiled asm.js code (total compilation time 6ms)
CXX Time: 23.450000  (count=143020000)
mjrosenb@eve:~/bench$ js.ion.opt  ./nsieve.10000.js 
warning: successfully compiled asm.js code (total compilation time 42ms)
CXX Time: 34.527000  (count=143020000)
I used emscripten to compile the new test, but not the old one, so I should make sure there aren't differences in emscripten that are causing these timing differences.
More information!
I had a theory that emscripten simply generates better code for static array accesses than it does for dynamically allocated arrays.  To test this, I used the original code, along with a custom malloc implementation (in C) that can basically handle nsieve, and nothing else.
The results surprised me in that the dynamically allocated test was a non-trivial amount *faster* than the statically allocated test.

mjrosenb@eve:~/bench$ js.ion.opt ./nsieve_static.js 
warning: successfully compiled asm.js code (total compilation time 6ms)
CXX Time: 23.754000  (count=143020000)
mjrosenb@eve:~/bench$ js.ion.opt ./nsieve_neutered_malloc.js 
warning: successfully compiled asm.js code (total compilation time 6ms)
CXX Time: 20.274000  (count=143020000)
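The custom allocator is not attached; a minimal sketch of what a malloc that can "basically handle nsieve, and nothing else" might look like, assuming one live allocation at a time and a fixed maximum size:

    #include <stddef.h>

    /* one static arena; free() recycles the whole block immediately, so the
       allocator never needs to grow the heap after startup */
    static char arena[1 << 18];   /* assumption: covers the largest flags vector */
    static int in_use = 0;

    void *neutered_malloc(size_t n) {
        if (in_use || n > sizeof(arena))
            return NULL;          /* nsieve never hits this path */
        in_use = 1;
        return arena;
    }

    void neutered_free(void *p) {
        (void)p;                  /* only one allocation is ever live */
        in_use = 0;
    }
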
I am confused about why my testing shows that we're spending some absurd amount of time in malloc() (or at least that malloc slows us down by an ABSURD amount), and yet profiling doesn't show any time spent there.  Does the profiling code not emit information for JS that doesn't have source maps back to C++ code?
More finger pointing:
mjrosenb@eve:~/bench$ perf stat js.ion.opt nsieve_malloc.js; perf stat js.ion.opt ./nsieve_neutered_malloc.js 
warning: successfully compiled asm.js code (total compilation time 320ms)
CXX Time: 34.346000  (count=143020000)

 Performance counter stats for 'js.ion.opt nsieve_malloc.js':

      34884.707897 task-clock                #    0.995 CPUs utilized          
               814 context-switches          #    0.023 K/sec                  
                 0 CPU-migrations            #    0.000 K/sec                  
             3,169 page-faults               #    0.091 K/sec                  
    58,297,313,868 cycles                    #    1.671 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
    83,104,489,736 instructions              #    1.43  insns per cycle        
    13,868,335,782 branches                  #  397.548 M/sec                  
       207,125,841 branch-misses             #    1.49% of all branches        

      35.049975289 seconds time elapsed

warning: successfully compiled asm.js code (total compilation time 6ms)
CXX Time: 20.271000  (count=143020000)

 Performance counter stats for 'js.ion.opt ./nsieve_neutered_malloc.js':

      20192.141046 task-clock                #    0.993 CPUs utilized          
               372 context-switches          #    0.018 K/sec                  
                 0 CPU-migrations            #    0.000 K/sec                  
             2,331 page-faults               #    0.115 K/sec                  
    34,323,928,941 cycles                    #    1.700 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
    61,163,271,375 instructions              #    1.78  insns per cycle        
    14,229,247,827 branches                  #  704.692 M/sec                  
       205,919,866 branch-misses             #    1.45% of all branches        

      20.333820067 seconds time elapsed

mjrosenb@eve:~/bench$ 

It looks like 1/4 of our instructions are spent in malloc, which is a totally unreasonable amount.

For comparison, running the same test on x86 gives:
~/bench; perf stat js.ion.opt ./nsieve_malloc.js; perf stat js.ion.opt ./nsieve_neutered_malloc.js
warning: successfully compiled asm.js code (total compilation time 13ms)
CXX Time: 7.029000  (count=143020000)

 Performance counter stats for 'js.ion.opt ./nsieve_malloc.js':

       7077.914139 task-clock                #    1.001 CPUs utilized          
                28 context-switches          #    0.004 K/sec                  
                 0 CPU-migrations            #    0.000 K/sec                  
            11,953 page-faults               #    0.002 M/sec                  
    22,930,948,302 cycles                    #    3.240 GHz                     [83.32%]
     5,258,175,147 stalled-cycles-frontend   #   22.93% frontend cycles idle    [83.32%]
     3,418,331,293 stalled-cycles-backend    #   14.91% backend  cycles idle    [66.68%]
    43,925,367,676 instructions              #    1.92  insns per cycle        
                                             #    0.12  stalled cycles per insn [83.37%]
     8,572,309,463 branches                  # 1211.135 M/sec                   [83.37%]
       174,790,702 branch-misses             #    2.04% of all branches         [83.32%]

       7.073529927 seconds time elapsed

warning: successfully compiled asm.js code (total compilation time 1ms)
CXX Time: 6.744000  (count=143020000)

 Performance counter stats for 'js.ion.opt ./nsieve_neutered_malloc.js':

       6779.736561 task-clock                #    1.001 CPUs utilized          
                17 context-switches          #    0.003 K/sec                  
                 0 CPU-migrations            #    0.000 K/sec                  
            10,547 page-faults               #    0.002 M/sec                  
    22,163,384,075 cycles                    #    3.269 GHz                     [83.33%]
     8,100,893,711 stalled-cycles-frontend   #   36.55% frontend cycles idle    [83.35%]
     4,770,117,610 stalled-cycles-backend    #   21.52% backend  cycles idle    [66.70%]
    36,908,598,483 instructions              #    1.67  insns per cycle        
                                             #    0.22  stalled cycles per insn [83.35%]
     7,167,986,291 branches                  # 1057.266 M/sec                   [83.35%]
       179,414,425 branch-misses             #    2.50% of all branches         [83.30%]

       6.775314255 seconds time elapsed

I am concerned that we are executing almost twice the number of instructions on ARM as we are on x86, even in the neutered testcase.  Current theories are:
1) codegen for dlmalloc is *awful* on ARM
2) codegen for dlmalloc is broken on ARM, and we're taking some branches that we shouldn't be (and somehow or other aren't tripping any asserts or corrupting memory)
3) dlmalloc is somehow not a good fit for ARM's memory characteristics

I can likely test 2 by dumping memory / an execution trace and making sure they are the same on x86 and ARM.  Profiling and/or inspecting the code generated for dlmalloc is probably the only way to distinguish 1 from 3.
Perhaps FFI cost is higher on ARM? Did you measure the time spent in FFI calls (Luke has a patch)?
It's hard to measure FFI calls now since they've been optimized (to be direct asm.js-to-Ion calls).  If you put 'return false' at the top of TryEnablingIon, you can put a counter in InvokeFromAsmJS_(Ignore,ToInt32,ToNumber) and see how many FFI calls there are; if there are more than a small number then we should probably change the benchmark to not do that.
I can likely just instrument all of the FFI calls that we can make from dlmalloc, and see how many times we call each of those.

The obvious choice was sbrk.  Since only 4 sizes are being allocated, and there is only one thing live at any given time, it is not unreasonable for a malloc implementation to just return all of the freed memory as soon as it is freed.

I instrumented the ffi sbrk function with a print statement, and the result was:
mjrosenb@eve:~/bench$ js.ion.opt ./nsieve_malloc_sbrkcnt.js
warning: successfully compiled asm.js code (total compilation time 46ms)
_SBRK
_SBRK
_SBRK
_SBRK
_SBRK
CXX Time: 34.288000  (count=143020000)
mjrosenb@eve:~/bench$
so we call sbrk 5 times total.
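The instrumentation here was a print statement in the JS-side sbrk import; an equivalent native-side sketch (hypothetical, not the code actually used) would be:

    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    static int sbrk_calls = 0;

    /* counting wrapper: route the allocator's MORECORE requests through this
       to see how often the heap actually grows */
    void *counting_sbrk(intptr_t increment) {
        sbrk_calls++;
        fprintf(stderr, "_SBRK\n");
        return sbrk(increment);
    }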

Out of curiosity, I poked around in emscripten to see if HAVE_MMAP is unset or set to 0, or if MORECORE_CANNOT_TRIM is set.  I didn't see either.  Turning off code that we can't possibly use seems like a good strategy no matter what.
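For reference, the build-time switches in question look like this; HAVE_MMAP and MORECORE_CANNOT_TRIM are dlmalloc's own configuration macros, though whether emscripten's build honors them exactly this way is an assumption.

    /* dlmalloc configuration: compile out paths emscripten can never use */
    #define HAVE_MMAP 0             /* no real mmap behind the asm.js heap */
    #define MORECORE_CANNOT_TRIM 1  /* sbrk'd memory is never given back */
    #include "malloc.c"             /* dlmalloc's single-file source */
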
Attached file C source - noline. (obsolete) —
Attached file C source - noinline
Attachment #787315 - Attachment is obsolete: true
This example hoists the 'malloc' operation, reusing the flags vector, to help measure the impact of 'malloc' on the run time.  Timing results on ARM show little difference, which suggests malloc is not hot; this is consistent with the perf results, which did not show malloc as a hot function.

These were compiled with:
em++ -O2 -g2 -o nsieve.js nsieve.c
em++ -O2 -g2 -o nsieves.js nsieves.c
This test combines four variations.  The first two use a static vector for isPrime: the first with a larger offset for isPrime, the second with a small offset.  The cost of loading large integer constants on ARM might be significant; hoisting such loads should be explored, along with hoisting the heap base+offset calculation.  The third case hoists the 'malloc', and the last is the original; these are very close, so malloc does not appear to be hot.

Static vector 1:        CXX Time: 121.194000  (count=143020000)
Static vector 2:        CXX Time: 103.530000  (count=143020000)
Dynamic hoisted vector: CXX Time: 108.764000  (count=143020000)
Dynamic vector:         CXX Time: 109.119000  (count=143020000)
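In source terms, the "dynamic hoisted vector" variant simply allocates once and reuses the buffer across sizes; a sketch, with the harness bounds an assumption read off the annotated non-unrolled dump later in this bug:

    #include <stdlib.h>
    #include <string.h>

    extern int sieve(char *flags, int m);    /* hypothetical nsieve kernel */

    int run_hoisted(void) {
        int count = 0;
        int maxM = 10000 << 3;               /* assumed largest size (8e4) */
        char *flags = malloc(maxM + 1);      /* allocated once, not per size */
        for (int i = 1; i <= 3; i++) {
            int m = 10000 << i;
            memset(flags + 2, 1, m - 1);     /* re-initialize the reused buffer */
            count += sieve(flags, m);
        }
        free(flags);
        return count;
    }
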
It has been noted that with less unrolling, or perhaps less inlining, the code runs faster.  The hot block is as follows, and can be compared with the hot block noted in comment 5.

   0xb64f0538:	adds	r5, r2, r3
   0xb64f053c:	mov	r6, #0
   0xb64f0540:	lsrs	r12, r5, #24
   0xb64f0544:	strbeq	r6, [r11, r5]
   0xb64f0548:	adds	r5, r3, r1
   0xb64f054c:	cmp	r5, r4
   0xb64f0550:	bgt	0xb64f055c
   0xb64f0554:	mov	r3, r5
   0xb64f0558:	b	0xb64f0538

A significant difference is that the loop termination count is kept in a register (r4) rather than being loaded on each iteration.  There also appears to be less register pressure, as there is no stack load and store on each iteration.  The loop block order could still be improved: move the termination test to the tail to reduce the number of branches.
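Moving the termination test to the tail is classic loop rotation; a quick sketch of the control-flow difference (a compiler may rotate either form, so this illustrates layout rather than source semantics):

    /* head-tested layout, like the emitted code above: a conditional exit
       branch (bgt) plus an unconditional back edge (b) per iteration */
    void clear_head_tested(char *isPrime, int k, int i, int m) {
        while (k <= m) {
            isPrime[k] = 0;
            k += i;
        }
    }

    /* rotated layout, like the clang/GCC output earlier in this bug: one
       guard before entry, then a single conditional back edge per iteration */
    void clear_rotated(char *isPrime, int k, int i, int m) {
        if (k <= m)
            do {
                isPrime[k] = 0;
                k += i;
            } while (k <= m);
    }
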
Jan has a fix for the issue of storing the variable isPrime back to the heap on every iteration of the loop, but there is also the problem that we *load* it on every iteration of the loop.  Some of my findings relayed on IRC:

13:29 < mjrosenb|ARM> jandem: so investigation results: backtracking does a bit better because it pulls the load/store combo out of a tight loop.
13:30 < mjrosenb|ARM> it looks like bt also needs the store elimination logic, although I'm not sure such a simple test will be sufficient.
13:31 < mjrosenb|ARM> lastly, I want to try both of the interval splitting optimizations.  BT has a simple form of one, but it could likely be improved
13:32 < mjrosenb|ARM> I think getting these optimizations in will be a bit annoying, since nothing seems to rely on the actual location/nesting level of 
                      blocks.
Thoroughly annotated partial nsieve-unrolled dump, using the backtracking allocator:

0x76892450:  push    {lr}            ; (str lr, [sp, #-4]!)
0x76892454:  sub     sp, sp, #36     ; 0x24
0x76892458:  movw    r12, #28404     ; 0x6ef4
0x7689245c:  movt    r12, #121       ; 0x79
0x76892460:  ldr     lr, [r12]
0x76892464:  cmp     lr, sp
0x76892468:  bcs     0x76899b54
0x7689246c:  movw    r0, #20001      ; 0x4e21
0x76892470:  tst     sp, #7
0x76892474:  beq     0x7689247c
0x76892478:  bkpt    0x0001
0x7689247c:  bl      0x76892988     // call malloc
0x76892480:  str     r0, [sp, #16]  // store result, due to memset
0x76892484:  mov     r1, r0         // this happens because we *CANNOT* specify the same register on ALU ops.
0x76892488:  adds    r0, r1, #2     // this should just be add r0, r0, #2
0x7689248c:  mov     r1, #1         // setting each byte to 1
0x76892490:  movw    r2, #19999      ; 0x4e1f
0x76892494:  tst     sp, #7
0x76892498:  beq     0x768924a0
0x7689249c:  bkpt    0x0002
0x768924a0:  bl      0x76897800      // call memset
0x768924a4:  mov     r2, #0          // not exactly like the C code, we start the loop by assuming isPrime[...] is true
0x768924a8:  mov     r0, #0
0x768924ac:  mov     r1, #2          // initialize i to 2
outer head:
0x768924b0:  cmp     r2, #0          // isPrime is inverted before we get to this point.
0x768924b4:  beq     0x768924c0      // jump to tight loop prehead
0x768924b8:  str     r0, [sp, #32]   // update stack value of count :(
0x768924bc:  b       0x76892510      // jump to outer loopcont.
tight loop prehead:
0x768924c0:  lsl     r2, r1, #1       // initialize k to 2i
0x768924c4:  movw    r12, #20000
0x768924c8:  cmp     r2, r12         // initial check to see if k > m
0x768924cc:  bgt     0x76892508
0x768924d0:  str     r0, [sp, #32]   // update stack value of count
0x768924d4:  ldr     r0, [sp, #16]   // load isPrime :(
tight loop head:
0x768924d8:  adds    r3, r0, r2      // calculate &isPrime[k]
0x768924dc:  mov     r4, #0          // generate immediate to store.
0x768924e0:  lsrs    r12, r3, #24    // bounds check!
0x768924e4:  strbeq  r4, [r11, r3]   // store if we passed the bounds check.
0x768924e8:  adds    r3, r2, r1      // k' <- k+i
0x768924ec:  movw    r12, #20000     ; 0x4e20
0x768924f0:  cmp     r3, r12         // see if k' > m
0x768924f4:  bgt     0x76892500      // jump to after tight loop.
0x768924f8:  mov     r2, r3          // k <- k'
0x768924fc:  b       0x768924d8      // jump to tight loop head
after tight loop:
0x76892500:  str     r0, [sp, #16]   // sync back isPrime
0x76892504:  ldr     r0, [sp, #32]   //load count
0x76892508:  adds    r2, r0, #1      // increment count
0x7689250c:  str     r2, [sp, #32]   // sync back count.
outer loop cont.
0x76892510:  adds    r3, r1, #1      // increment i into new variable
0x76892514:  movw    r12, #20000     ; 0x4e20
0x76892518:  cmp     r3, r12         // check if i is larger than m
0x7689251c:  bgt     0x76892550      // conditionally jump to post-loop.
0x76892520:  ldr     r1, [sp, #16]   // load isPrime (yet again!), ooh, it has a different register!
0x76892524:  adds    r0, r1, r3      // compute &isPrime[i]
0x76892528:  lsrs    r12, r0, #24    // bounds check
0x7689252c:  ldrsbeq r1, [r11, r0]   // conditional load
0x76892530:  movne   r1, #0          // default to 0 on oob accesses
0x76892534:  and     r0, r1, #1      // start godawful tmp = (isPrime[i] & 1) == 0 check
0x76892538:  cmp     r0, #0          // ?= 0
0x7689253c:  mov     r2, #0          // default to 0
0x76892540:  moveq   r2, #1          // x==0 => 1
0x76892544:  mov     r1, r3          // fix-up i.
0x76892548:  ldr     r0, [sp, #32]   // load count
0x7689254c:  b       0x768924b0      // branch to outer head.
post-loop:
0x76892550:  ldr     r0, [sp, #16]   // load isPrime as an argument to free()
0x76892554:  tst     sp, #7
0x76892558:  beq     0x76892560
0x7689255c:  bkpt    0x0003
0x76892560:  bl      0x76895f38      // call free()
Annotation for the same chunk of C/JS code, this time compiled with LSRA:

   0x76892450:  push    {lr}            ; (str lr, [sp, #-4]!)
   0x76892454:  sub     sp, sp, #28
   0x76892458:  movw    r12, #28404     ; 0x6ef4
   0x7689245c:  movt    r12, #121       ; 0x79
   0x76892460:  ldr     lr, [r12]
   0x76892464:  cmp     lr, sp
   0x76892468:  bcs     0x76899b50
   0x7689246c:  movw    r0, #20001      ; 0x4e21
   0x76892470:  tst     sp, #7
   0x76892474:  beq     0x7689247c
   0x76892478:  bkpt    0x0001
   0x7689247c:  bl      0x768929e4 // call malloc?
   0x76892480:  adds    r1, r0, #2
   0x76892484:  mov     r2, #1
   0x76892488:  movw    r3, #19999      ; 0x4e1f
   0x7689248c:  str     r0, [sp, #24] // stash the address of isPrime
   0x76892490:  mov     r0, r1
   0x76892494:  mov     r1, r2
   0x76892498:  mov     r2, r3
   0x7689249c:  tst     sp, #7
   0x768924a0:  beq     0x768924a8
   0x768924a4:  bkpt    0x0002
   0x768924a8:  bl      0x76897954 // call memset.

// loop preheader
   0x768924ac:  mov     r0, #0 // initialize temp for holding the current prime?
   0x768924b0:  mov     r1, #0
   0x768924b4:  mov     r2, #2 // initialize i to 2.
// some sort of loop header.
   0x768924b8:  cmp     r0, #0 // test to see if the current number is a prime number // this is inverted?
   0x768924bc:  beq     0x768924cc 
   0x768924c0:  ldr     r3, [sp, #24] // it is not prime! [sp, #24] contains isPrime!!
   0x768924c4:  mov     r0, r1
   0x768924c8:  b       0x76892518
// it isn't 
   0x768924cc:  lsl     r0, r2, #1    // initialize k -- add_i instantiation
   0x768924d0:  movw    r12, #20000     ; 0x4e20
   0x768924d4:  cmp     r0, r12 
   0x768924d8:  ble     0x768924e4     // make sure that k <= m
   0x768924dc:  ldr     r3, [sp, #24]  // loading r3 *everywhere*
   0x768924e0:  b       0x76892514     // don't enter the loop.
// loop head.
   0x768924e4:  ldr     r3, [sp, #24]  // load in tight loop is bad.
   0x768924e8:  adds    r4, r3, r0     // compute &isPrime[k]
   0x768924ec:  mov     r5, #0         // make a temp to store 0 -- this may be hoistable, debatable benefits.
   0x768924f0:  lsrs    r12, r4, #24   // heap check
   0x768924f4:  strbeq  r5, [r11, r4]  // write 0 into isPrime
   0x768924f8:  adds    r4, r0, r2     // add i to k, rename k to r4.
   0x768924fc:  movw    r12, #20000     ; 0x4e20
   0x76892500:  cmp     r4, r12
   0x76892504:  bgt     0x76892514     // terminate inner loop
   0x76892508:  str     r3, [sp, #24]  // we don't change r3, why do we store it back?
   0x7689250c:  mov     r0, r4         // sync back change to k.
   0x76892510:  b       0x768924e4
     // inner loop termination
   0x76892514:  adds    r0, r1, #1
     // JOIN for primality check?
   0x76892518:  adds    r1, r2, #1     // increment i.
   0x7689251c:  movw    r12, #20000     ; 0x4e20
   0x76892520:  cmp     r1, r12
   0x76892524:  bgt     0x76892568    // terminate whole loop?
   0x76892528:  adds    r2, r3, r1    // compute &isPrime[i]
   0x7689252c:  lsrs    r12, r2, #24  // bounds check
   0x76892530:  ldrsbeq r2, [r11, r2] // load isPrime[i]
   0x76892534:  movne   r2, #0        // fallback to loading 0
   0x76892538:  and     r4, r2, #1    // mask out the bottom bit
   0x7689253c:  cmp     r4, #0        // superfluous comparison
   0x76892540:  mov     r2, #0        // oh god, why?
   0x76892544:  moveq   r2, #1        // we *really* should do better than this.

   // this looks like a movegroup shuffle.
   0x76892548:  sub     sp, sp, #8    
   0x7689254c:  str     r3, [sp, #32]
   0x76892550:  str     r1, [sp]
   0x76892554:  mov     r1, r0
   0x76892558:  mov     r0, r2
   0x7689255c:  ldr     r2, [sp]
   0x76892560:  add     sp, sp, #8
   // movegroup over.

   0x76892564:  b       0x768924b8


   0x76892568:  str     r0, [sp, #20]
   0x7689256c:  mov     r0, r3
   0x76892570:  tst     sp, #7
   0x76892574:  beq     0x7689257c
   0x76892578:  bkpt    0x0003
   0x7689257c:  bl      0x76895e8c      // call free.
aaand a commented non-unrolled case.  I believe this is with the backtracking allocator:
0x76893450:  push    {lr}            ; (str lr, [sp, #-4]!)
0x76893454:  sub     sp, sp, #28
0x76893458:  movw    r12, #28404     ; 0x6ef4
0x7689345c:  movt    r12, #121       ; 0x79
0x76893460:  ldr     lr, [r12]
0x76893464:  cmp     lr, sp
0x76893468:  bcs     0x7689ab54
0x7689346c:  mov     r1, #0          // initialize ? count_07?
0x76893470:  mov     r0, #1          // initialize outer i to 1.
0x76893474:  str     r0, [sp, #24]   // sync ? and outer i to the stack
0x76893478:  str     r1, [sp, #20]
top of all loops
0x7689347c:  movw    r0, #10000      // compute 10000
0x76893480:  ldr     r2, [sp, #24]   // grab outer i from the stack-- really?
0x76893484:  and     r1, r2, #31     // bitmask for shift.  bounds check should eliminate this.
0x76893488:  lsl     r1, r0, r1      // compute m = 10000 * (1 << i) == 10000 << i
0x7689348c:  str     r1, [sp, #12]   // store m to the stack
0x76893490:  adds    r0, r1, #1      // we allocate m+1 elements
0x76893494:  tst     sp, #7
0x76893498:  beq     0x768934a0
0x7689349c:  bkpt    0x0001
0x768934a0:  bl      0x768937d0      // call malloc
0x768934a4:  str     r0, [sp, #16]   // store flags to the stack
0x768934a8:  ldr     r0, [sp, #12]   // grab m from the stack-- can we use callee saved regs?
0x768934ac:  cmp     r0, #2          // m < 2? that will *never* happen.
0x768934b0:  bge     0x768934bc
0x768934b4:  mov     r0, #0          // wtf? why do we only initialize count to 0 here? this is LLVM generating absurd code.
0x768934b8:  b       0x76893564      // b to end of middle
0x768934bc:  ldr     r1, [sp, #16]   // re-load flags -- argument to memset
0x768934c0:  adds    r0, r1, #2      // actually start at &flags[2]
0x768934c4:  ldr     r1, [sp, #12]   // third argument is m-1; load m
0x768934c8:  subs    r2, r1, #1      // subs? we should really generate sub here.
0x768934cc:  mov     r1, #1          // second argument is 1.
0x768934d0:  tst     sp, #7
0x768934d4:  beq     0x768934dc
0x768934d8:  bkpt    0x0002
0x768934dc:  bl      0x76898648      // call memset
0x768934e0:  mov     r0, #0          // zero count (called $count_020_i in js)
0x768934e4:  mov     r1, #2          // set i_119_i to 2? that sounds like the inner i.
outer loop head:
0x768934e8:  ldr     r2, [sp, #16]   // re-load flags
0x768934ec:  adds    r3, r2, r1      // compute &flags[inner i]
0x768934f0:  lsrs    r12, r3, #24    // bounds check
0x768934f4:  ldrsbeq r3, [r11, r3]   // load  flags[inner i]
0x768934f8:  movne   r3, #0          // oob handler
0x768934fc:  and     r4, r3, #1      // extract bottom bit, kind of pointless.
0x76893500:  cmp     r4, #0          // tst... the instruction you want is tst.
0x76893504:  beq     0x7689354c      // branch to composite case
0x76893508:  lsl     r3, r1, #1      // set k = i + i (inner i, here)
0x7689350c:  ldr     r4, [sp, #12]   // grab m from the stack
0x76893510:  cmp     r3, r4          // is k's initial value out of bounds?
0x76893514:  bgt     0x76893544      // if it is, go to after tight loop.
0x76893518:  ldr     r2, [sp, #16]   // re-load flags --- again ?!
tight loop head:
0x7689351c:  adds    r4, r2, r3      // compute flags[k]
0x76893520:  mov     r5, #0          // load 0 for store-back
0x76893524:  lsrs    r12, r4, #24    // bounds check
0x76893528:  strbeq  r5, [r11, r4]   // flags[k] = 0
0x7689352c:  adds    r4, r3, r1      // k' = k + i
0x76893530:  ldr     r3, [sp, #12]   // m, yet again.
0x76893534:  cmp     r4, r3          // if m > k, abort
0x76893538:  bgt     0x76893544      // go to after tight loop
0x7689353c:  mov     r3, r4          // synchronize variables
0x76893540:  b       0x7689351c      // always go to tight loop head
after tight loop
0x76893544:  adds    r2, r0, #1      // increment count
0x76893548:  mov     r0, r2          // synchronize registers
composite case
0x7689354c:  adds    r2, r1, #1      // increment i-inner
0x76893550:  ldr     r1, [sp, #12]   // re-load m (AGAIN)
0x76893554:  cmp     r2, r1          // upper bounds check on i inner
0x76893558:  bgt     0x76893564      // branch to end of middle
0x7689355c:  mov     r1, r2          // sync incremented i back to r1
0x76893560:  b       0x768934e8      // go to the outer loop head.
end of middle:
0x76893564:  ldr     r1, [sp, #20]   // load count
0x76893568:  adds    r2, r0, r1      // add inner count to outer count. 
0x7689356c:  str     r2, [sp, #8]    // store outer count back to the stack.
0x76893570:  ldr     r0, [sp, #16]   // load flags for argument to free.
0x76893574:  tst     sp, #7
0x76893578:  beq     0x76893580
0x7689357c:  bkpt    0x0003
0x76893580:  bl      0x76896d80      // call free
0x76893584:  ldr     r1, [sp, #24]   // load outer i
0x76893588:  adds    r0, r1, #1      // increment outer i
0x7689358c:  cmp     r0, #4          // see if the outer loop terminated
0x76893590:  bge     0x768935b0      // branch to after all loops
0x76893594:  str     r0, [sp, #24]   // cycle fixup
0x76893598:  push    {lr}            ; (str lr, [sp, #-4]!)
0x7689359c:  ldr     lr, [sp, #12]
0x768935a0:  str     lr, [sp, #24]
0x768935a4:  ldr     lr, [sp]
0x768935a8:  add     sp, sp, #4
0x768935ac:  b       0x7689347c
after all loops:
0x768935b0:  ldr     r0, [sp, #8]    // load outer count for returning
0x768935b4:  add     sp, sp, #28
0x768935b8:  pop     {pc}            ; (ldr pc, [sp], #4)
First column is without this "improvement", second column is with it.

TEST                         COMPARISON            FROM                 TO             DETAILS

=============================================================================


And since Octane doesn't come with a diff function:
mjrosenb@eve:~/bench/octane$ BETTER_SPLIT=1 ~/src/central/central-regalloc/js/objs/armhf-opt/shell/js ./run.js; ~/src/central/central-regalloc/js/objs/armhf-opt/shell/js ./run.js; 
Richards: 7300
DeltaBlue: 6856
Crypto: 6717
RayTrace: 7357
EarleyBoyer: 6321
RegExp: 894
Splay: 4669
NavierStokes: 9043
PdfJS: 2963
Mandreel: 1377
Gameboy: 7136
CodeLoad: 7199
Box2D: 2028
----
Score (version 8): 4427
Richards: 7442
DeltaBlue: 6850
Crypto: 6175
RayTrace: 7322
EarleyBoyer: 6323
RegExp: 936
Splay: 4425
NavierStokes: 8877
PdfJS: 3003
Mandreel: 1350
Gameboy: 8029
CodeLoad: 7545
Box2D: 1918
----
Score (version 8): 4429

which isn't saying all that much, since evidently the variation from one run to the next is appreciable.
** TOTAL **:                 ??                5499.5ms +/- 0.2%   5531.1ms +/- 0.9%     not conclusive: might be *1.006x as slow*

=============================================================================

  ai:                        ??                 415.8ms +/- 1.4%    425.9ms +/- 2.4%     not conclusive: might be *1.024x as slow*
    astar:                   ??                 415.8ms +/- 1.4%    425.9ms +/- 2.4%     not conclusive: might be *1.024x as slow*

  audio:                     *1.013x as slow*  2074.3ms +/- 0.4%   2100.4ms +/- 1.1%     significant
    beat-detection:          ??                 592.7ms +/- 0.6%    607.7ms +/- 3.3%     not conclusive: might be *1.025x as slow*
    dft:                     *1.024x as slow*   954.4ms +/- 0.5%    977.4ms +/- 2.2%     significant
    fft:                     1.048x as fast     216.8ms +/- 1.9%    206.9ms +/- 1.5%     significant
    oscillator:              -                  310.4ms +/- 1.6%    308.4ms +/- 1.9% 

  imaging:                   ??                1329.4ms +/- 0.2%   1333.4ms +/- 1.5%     not conclusive: might be *1.003x as slow*
    gaussian-blur:           ??                 396.2ms +/- 0.1%    403.7ms +/- 2.5%     not conclusive: might be *1.019x as slow*
    darkroom:                1.025x as fast     585.9ms +/- 0.1%    571.4ms +/- 0.2%     significant
    desaturate:              *1.032x as slow*   347.3ms +/- 0.8%    358.3ms +/- 3.1%     significant

  json:                      ??                 302.9ms +/- 0.5%    303.6ms +/- 0.6%     not conclusive: might be *1.002x as slow*
    parse-financial:         -                  148.8ms +/- 1.1%    148.6ms +/- 1.3% 
    stringify-tinderbox:     ??                 154.1ms +/- 0.7%    155.0ms +/- 0.4%     not conclusive: might be *1.006x as slow*

  stanford:                  -                 1377.1ms +/- 0.4%   1367.8ms +/- 0.7% 
    crypto-aes:              -                  334.7ms +/- 1.5%    330.9ms +/- 0.6% 
    crypto-ccm:              -                  286.0ms +/- 1.1%    286.1ms +/- 0.7% 
    crypto-pbkdf2:           -                  549.7ms +/- 0.6%    546.3ms +/- 1.8% 
    crypto-sha256-iterative: 1.011x as fast     206.7ms +/- 0.4%    204.5ms +/- 0.4%     significant
Oh, I forgot to mention: this brings the testcase (run for 100 iterations) from 35 seconds down to 23 seconds, which is *most* of the benefit that I saw from not unrolling the loop.
With this patch applied, it dies on many tests, one of which is:
 BETTER_SPLIT=1 ./shell/js --ion-eager     --ion-eager /home/mjrosenb/src/central/central-regalloc/js/src/jit-test/tests/jaeger/loops/hoist-08.js

Assertion failure: *def->output() == alloc, at /home/mjrosenb/src/central/central-regalloc/js/src/jit/RegisterAllocator.cpp:207
Segmentation fault

The reason for this is the ordering of movegroups:
Block 0 [successor 1]
[2,3 Label]
[4,5 Parameter] [def v1 arg:0]
[6,7 Parameter] [def v2 arg:8]
[8,9 Parameter] [def v3 arg:16]
[10,11 Start] [use c c] [use v1:* arg:0] [use v2:* arg:8] [use v3:* arg:16] [use c c]
[12,13 CheckOverRecursed]
[14,15 OsiPoint] [use c c] [use v1:* arg:0] [use v2:* arg:8] [use v3:* arg:16] [use c c]
[0,1 MoveGroup] [arg:8 -> =rax]
[16,17 Unbox] [def v4 =rcx] [use v2:r =rax] [use c c] [use v1:* arg:0] [use v2:* =rax] [use v3:* arg:16] [use c c]
[0,1 MoveGroup] [=rcx -> stack:i3]
[18,19 Integer] [def v5 =rdx]
[0,1 MoveGroup] [=rax -> arg:16]
[0,1 MoveGroup] [arg:16 -> =rax]
[20,21 Goto]

For the final Goto, vreg 3 is allocated to rax, and in at least one of the successor blocks arg:16 is captured by an OsiPoint, so we spill it (we don't need to spill here; in fact, we *never* need to spill arguments, but I suspect this can happen in other cases).
So the per-block register fixup code inserts an rax -> arg:16 store.  Later, while the stack is being re-synchronized for each individual instruction, LSRA discovers that [20,21 Goto] expects vreg 3 in rax, so it generates a load, which gets inserted immediately before the Goto.  Unfortunately, these two movegroups are now in the wrong order.

Jan, you said you could poke at this with the patch + testcase, so is this something that is just a fundamental flaw, and should be worked around, or did I miss something obvious?
Attachment #796360 - Flags: feedback?(jdemooij)
Ok, I just went and tested this on the asm.js microbenchmarks. I ran the whole suite 100 times, then computed the improvement:
copy        1.1293700
corrections 0.1252000
fannkuch    1.6535700
fasta       0.6401500
life        2.5245000
memops      0.2542300
primes      0.0143900
skinning    0.0945500
Note these are percentages improved, so primes is 0.01% "faster" with the patch.
So it looks like across-the-board improvements (very tiny) on the microbenchmarks.  I'll test the apps next.
Comment on attachment 796360 [details] [diff] [review]
bad_better_split.diff

Sorry for the delay. Do you still see the problems with this patch on tip?
Attachment #796360 - Flags: feedback?(jdemooij)
Assignee: general → nobody
Per policy at https://wiki.mozilla.org/Bug_Triage/Projects/Bug_Handling/Bug_Husbandry#Inactive_Bugs. If this bug is not an enhancement request or a bug not present in a supported release of Firefox, then it may be reopened.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INACTIVE
Status: RESOLVED → REOPENED
Resolution: INACTIVE → ---
Severity: normal → S3