[meta] Investigate performance gap between asm.js code and native code

Status: NEW
Assignee: Unassigned
Component: Core :: JavaScript Engine: JIT
Priority: P5, normal
Reported: 2 years ago; last modified: a year ago
Reporter: bhackett
Depends on: 1 bug
Version: Trunk
Tracking flags: firefox43 affected
Attachments: 2

Description (Reporter, 2 years ago)
On AWFY x64 asm.js compiled code performance is at least 50% worse than native on the bullet, zlib, and box2d throughput benchmarks.

AFAICT there are several factors that could account for this difference:

1. asm.js startup compilation overhead.  On these benchmarks this doesn't seem to be much of a factor.

2. Heap access masking.  On heap accesses larger than one byte the lower bits in the index are masked out; these instructions won't execute in the native version.  This masking is easy to disable for testing and improves things by up to 5-10%, which is only a fraction of the total difference.

3. Other effects of having a base pointer for heap accesses.  This affects the selection of instructions to emit and in some cases requires additional instructions to be emitted, such as using EffectiveAddress instructions to compute indexes into the heap.

4. Ion backend optimization/codegen issues.  When presented with the same IR, how does generated code performance compare between Ion and, say, clang's backend?  This is an open question.

There might be others I'm missing here.  The goal of this bug, though, is to get a better understanding of what is contributing to the gap on these simple benchmarks and what can be done to close it.
Updated (Reporter, 2 years ago): Depends on: 1195545
Updated (Reporter, 2 years ago): Depends on: 1200407

Comment 1 (2 years ago)
That's a good list.  A few I'd add:
 - address-taken C++ variables go into the asm.js heap (which has a simulated stack pointer global)
 - function pointer calls go through an extra masking and indirection compared to native
 - if relooping fails, a dynamic state variable and switch is used
 - only double-precision transcendental functions (esp. sin/cos: bug 967709)
 - Ion doesn't have callee-saved registers (bug 985065)
 - some pre-codegen backend LLVM optimizations happen after Emscripten grabs the IR (don't know how much this adds up to; iiuc, the LLVM-wasm backend will get "more" of the pipeline)
 - some other minutiae: add-with-carry (bug 1043365), min/max (bug 1060635), double-to-int conversion requires branching (we might be able to use signal handling)

FWIW, for (2) WebAssembly won't require the alignment mask.  For (3), bug 897425 had some initial ideas on hoisting "HeapReg + base" computations to free up effective addressing.
Comment 2 (Reporter, 2 years ago)
OK, so the main function I'm looking at for now is inflate_fast in zlib.  During an inflate benchmark zlib spends almost all its time either in this function or in adler32, and now that bug 1195545 has landed the asm.js and clang versions have similar instruction counts in adler32.  inflate_fast, however, executes 70% more instructions in the asm.js version than in the clang version.

I'm trying to get a finer correspondence between the generated code in the asm.js and clang versions, but one issue is that the two compilations are structurally different --- the asm.js version has an extra variable used during control flow.  I'm guessing this is due to a relooper failure, but I don't know --- there's no switch statement, and the variable is only used in a few places.  It's kind of strange that the relooper would fail here: while the original C function uses a couple of gotos, they are well structured and could be replaced with loops.  Maybe the control flow graph which Emscripten sees is not as friendly as the original source?

I modified the generated JS for inflate_fast to avoid using the control flow variable in some hot places, and will attach the before/after here.  This improves the number of executed instructions in inflate_fast by 10% and the number in the overall inflate benchmark by 7.5%, but doesn't have much effect on the benchmark's runtime on my machine.
Comment 3 (Reporter, 2 years ago)

Created attachment 8655627 [details]
original inflate_fast js
Comment 4 (Reporter, 2 years ago)

Created attachment 8655628 [details]
modified inflate_fast js
Comment 5 (Reporter, 2 years ago)
OK, so I went through the assembly which clang generated for inflate_fast by hand and matched up parts of it to the corresponding bits of the asm.js-compiled code in the modified inflate_fast .js above.  Here's a description of what's going on in the 5 hottest blocks of code in the function, which together account for about 70% of the instructions spent by both versions.

tl;dr: Extra regalloc moves seem to be the biggest problem, but Emscripten artifacts are also a major issue.

#1

CLANG 1410200000 instructions
ASMJS 1923000000 instructions

This is for the body of the loop below (|out| and |from| are char* variables):

#define PUP(a) *++(a)
do {
    PUP(out) = PUP(from);
    PUP(out) = PUP(from);
    PUP(out) = PUP(from);
    len -= 3;
} while (len > 2);

Here is the Emscripten compiled version, which matches up well with the original C. 

do {
    HEAP8[i5 + 1 >> 0] = HEAP8[i7 + 1 >> 0] | 0;
    HEAP8[i5 + 2 >> 0] = HEAP8[i7 + 2 >> 0] | 0;
    i7 = i7 + 3 | 0;
    i5 = i5 + 3 | 0;
    HEAP8[i5 >> 0] = HEAP8[i7 >> 0] | 0;
    i6 = i6 + -3 | 0;
} while (i6 >>> 0 > 2);

Here is the clang compiled code:

   0x00000000004092c0 <+1296>:    mov    -0x2(%rdx,%rax,1),%bl
   0x00000000004092c4 <+1300>:    mov    %bl,0x1(%r11,%rax,1)
   0x00000000004092c9 <+1305>:    mov    -0x1(%rdx,%rax,1),%bl
   0x00000000004092cd <+1309>:    mov    %bl,0x2(%r11,%rax,1)
   0x00000000004092d2 <+1314>:    mov    (%rdx,%rax,1),%bl
   0x00000000004092d5 <+1317>:    mov    %bl,0x3(%r11,%rax,1)
   0x00000000004092da <+1322>:    add    $0xfffffffffffffffd,%rcx
   0x00000000004092de <+1326>:    lea    (%r10,%rcx,1),%esi
   0x00000000004092e2 <+1330>:    add    $0x3,%rax
   0x00000000004092e6 <+1334>:    cmp    $0x2,%esi
   0x00000000004092e9 <+1337>:    ja     0x4092c0 <inflate_fast+1296>

Here is the asm.js compiled code:

movsbl     0x1(%r15,%rdi,1), %ecx
movb       %cl, 0x1(%r15,%r12,1)
movsbl     0x2(%r15,%rdi,1), %ecx
movb       %cl, 0x2(%r15,%r12,1)
leal       0x3(%rdi), %esi
leal       0x3(%r12), %ecx
movl       %ecx, 0xc(%rsp)
movsbl     0x3(%r15,%rdi,1), %ecx
movb       %cl, 0x3(%r15,%r12,1)
addl       $-3, %eax
cmpl       $0x2, %eax
jbe        .Lfrom19805
movl       %esi, %edi
movl       0xc(%rsp), %r12d
jmp        .Llabel19734

There is one main wonky bit in the asm.js version.  The registers rdi and r12, which hold the memory positions of |out| and |from|, are incremented using leal instructions that store the result in different registers, which ends up requiring the regalloc to fix up the register values at the loop backedge with moves.  It would be better if Ion reordered the loop body so that the leals could have the same source and target register and the moves could be removed.  It would be even better to do what clang is doing and use linear equalities to remove all but one of the index variables (we have an analysis for linear equalities in Ion, but it is only used to remove bounds checks).

#2

CLANG 728035000 instructions
ASMJS 1549425000 instructions

The original C:

        here = lcode[hold & lmask];
        op = (unsigned)(here.bits);
        hold >>= op;
        bits -= op;
        op = (unsigned)(here.op);
        if (op == 0) {

The Emscripten version:

        i2 = i3 & i28;
        i7 = HEAP8[i26 + (i2 << 2) >> 0] | 0;
        i6 = HEAP16[i26 + (i2 << 2) + 2 >> 1] | 0;
        i2 = HEAPU8[i26 + (i2 << 2) + 1 >> 0] | 0;
        i3 = i3 >>> i2;
        i2 = i5 - i2 | 0;
        if (i7 << 24 >> 24) {

The clang compiled code:

   0x0000000000408efe <+334>:    movzbl 0x1(%r15,%rax,4),%ecx
   0x0000000000408f04 <+340>:    sub    %ecx,%edx
   0x0000000000408f06 <+342>:    shr    %cl,%rsi
   0x0000000000408f09 <+345>:    movzwl 0x2(%r15,%rax,4),%edi
   0x0000000000408f0f <+351>:    movzbl (%r15,%rax,4),%eax
   0x0000000000408f14 <+356>:    test   %eax,%eax
   0x0000000000408f16 <+358>:    jne    0x408eea <inflate_fast+314>

The asm.js compiled code:

movsbl     0x0(%r15,%rcx,1), %eax
movl       %eax, %edx
movswl     0x2(%r15,%rcx,1), %esi
movl       %esi, 0xc(%rsp)
movzbl     0x1(%r15,%rcx,1), %ecx
shrl       %cl, %ebp
movl       0x14(%rsp), %esi
subl       %ecx, %esi
movl       %esi, 0x14(%rsp)
movl       %edx, %ecx
shll       $24, %ecx
sarl       $24, %ecx
testl      %ecx, %ecx
je         .Lfrom18608

There are a couple of problems here.  The biggest one AFAICS is from the 'i7 << 24 >> 24' test in the Emscripten compiled code.  I don't know what that's doing here, but i7 was just loaded from an int8 array, so these two shifts are together a nop (which Ion could optimize out if desired).  The second problem is some stupidity on the part of the regalloc, where it performs the first load into eax, then moves that value into edx and later ecx.  Only one of these copies should be necessary, and without the shifts neither of them should be.

#3

CLANG 619740000 instructions
ASMJS 1446060000 instructions

The original C:

    } while (in < last && out < end);

The Emscripten version:

    } while (i1 >>> 0 < i32 >>> 0 & i4 >>> 0 < i36 >>> 0);

The clang compiled code:

   0x000000000040934b <+1435>:    mov    %r8d,%edx
   0x000000000040934e <+1438>:    cmp    -0x40(%rsp),%rbp
   0x0000000000409353 <+1443>:    mov    $0x1,%ebx
   0x0000000000409358 <+1448>:    jae    0x409365 <inflate_fast+1461>
   0x000000000040935a <+1450>:    cmp    -0x38(%rsp),%r11
   0x000000000040935f <+1455>:    jb     0x408ea0 <inflate_fast+240>

The asm.js compiled code:

movl       %ebx, %r8d
cmpl       0x54(%rsp), %r8d
setb       %al
movzbl     %al, %eax
movl       0x10(%rsp), %esi
cmpl       0x58(%rsp), %esi
setb       %cl
movzbl     %cl, %ecx
testl      %eax, %ecx
je         .Lfrom21320
movl       0x40(%rsp), %r12d
movl       0x44(%rsp), %ebx
movl       %esi, 0x10(%rsp)
jmp        .Llabel18414

Again, there are a couple of problems here.  For some reason the '&&' in the test ended up being emitted by Emscripten as a bitwise '&'.  This requires Ion to compile the compares and set registers with the results, which both requires a lot of additional instructions and eats into the registers available to the regalloc.  It would be possible for Ion to optimize this by compiling it like a '&&', but that's a complicated transformation that introduces new basic blocks, and it'd be nice to know why this '&' was emitted in the first place.  There's also a lot of spill code in this loop from the regalloc, but I'll need to dig around some more to see how this can be improved --- this is the tail of the main loop in the function, and these things are being loaded for the next loop iteration.

#4

CLANG 583550000 instructions
ASMJS 758615000 instructions

The original C:

            hold += (unsigned long)(PUP(in)) << bits;
            bits += 8;
            hold += (unsigned long)(PUP(in)) << bits;
            bits += 8;

The Emscripten version:

            i12 = i4 + 2 | 0;
            i5 = i2 + 16 | 0;
            i3 = ((HEAPU8[i4 + 1 >> 0] | 0) << i2) + i3 + ((HEAPU8[i12 >> 0] | 0) << i2 + 8) | 0;
            i4 = i12;

The clang compiled code:

   0x0000000000408eaa <+250>:    movzbl 0x2(%rbp),%eax
   0x0000000000408eae <+254>:    lea    0x8(%rdx),%ecx
   0x0000000000408eb1 <+257>:    shl    %cl,%rax
   0x0000000000408eb4 <+260>:    movzbl 0x1(%rbp),%esi
   0x0000000000408eb8 <+264>:    mov    %dl,%cl
   0x0000000000408eba <+266>:    shl    %cl,%rsi
   0x0000000000408ebd <+269>:    add    %r13,%rsi
   0x0000000000408ec0 <+272>:    add    %rax,%rsi
   0x0000000000408ec3 <+275>:    add    $0x10,%edx
   0x0000000000408ec6 <+278>:    add    $0x2,%rbp

The asm.js compiled code:

movl       0x10(%rsp), %esi
leal       0x2(%rsi), %eax
leal       0x10(%rcx), %edx
movzbl     0x1(%r15,%rsi,1), %edi
shll       %cl, %edi
addl       %ebp, %edi
movzbl     0x2(%r15,%rsi,1), %esi
addl       $8, %ecx
shll       %cl, %esi
leal       0x0(%rdi,%rsi,1), %ebp
movl       %edx, 0x14(%rsp)
movl       %eax, 0x10(%rsp)
jmp        .Lfrom18502

The two versions here are pretty close, though the asm.js one is still suffering from extra regalloc moves due (apparently) to the leals having different source and target registers, as in #1.  I don't know if this is the regalloc's fault or a lowering issue.

#5

CLANG 346420000 instructions
ASMJS 602350000 instructions

The original C:

            PUP(out) = (unsigned char)(here.val);

The Emscripten version:

                i5 = i6 & 255;
                i1 = i1 + 1 | 0;
                HEAP8[i1 >> 0] = i5;

The clang compiled code:

   0x0000000000408f18 <+360>:    mov    %dil,0x1(%r11)
   0x0000000000408f1c <+364>:    inc    %r11
   0x0000000000408f1f <+367>:    mov    %rsi,%r13
   0x0000000000408f22 <+370>:    jmpq   0x40934e <inflate_fast+1438>

The asm.js compiled code:

movl       0xc(%rsp), %ecx
movl       0x10(%rsp), %esi
andl       $0xff, %ecx
movl       %ecx, 0xc(%rsp)
leal       0x1(%r8), %ebx
movb       %cl, 0x1(%r15,%r8,1)
movl       %esi, 0x10(%rsp)

A couple of issues here.  First, the regalloc in its infinite wisdom has decided to load a value off the stack into %esi, not use it, then store it back to the same place.  Second, the mask being done here should not be necessary.  Similar to the shifts in #2, the mask isn't needed because the value will be stored into an int8 array.  Unfortunately, this one is non-trivial to optimize away: i5 is live after this block of code, but only because of a spurious dependency apparently related to the relooper failure.  There are some statements after the main loop body which test the relooper variable and correspond to failure paths within the loop; i5 is live within these statements, but will only be used if a particular path is taken out of the loop which sets i5 to a different value.
Priority: -- → P5