Closed
Bug 1200406
Opened 9 years ago
Closed 3 years ago
[meta] Investigate performance gap between asm.js code and native code
Categories
(Core :: JavaScript Engine: JIT, defect, P5)
Core
JavaScript Engine: JIT
Tracking
()
RESOLVED
WONTFIX
Tracking | Status | |
---|---|---|
firefox43 | --- | affected |
People
(Reporter: bhackett1024, Unassigned)
References
(Depends on 1 open bug)
Details
(Keywords: meta)
Attachments
(2 files)
On AWFY x64 asm.js compiled code performance is at least 50% worse than native on the bullet, zlib, and box2d throughput benchmarks. AFAICT there are several factors that could account for this difference: 1. asm.js startup compilation overhead. On these benchmarks this doesn't seem to be much of a factor. 2. Heap access masking. On heap accesses larger than one byte the lower bits in the index are masked out; these instructions won't execute in the native version. This masking is easy to disable for testing and improves things by up to 5-10%, which is only a fraction of the total difference. 3. Other effects of having a base pointer for heap accesses. This affects the selection of instructions to emit and in some cases requires additional instructions to be emitted, such as using EffectiveAddress instructions to compute indexes into the heap. 4. Ion backend optimization/codegen issues. When presented with the same IR, how does generated code performance compare between Ion and, say, clang's backend? This is an open question. There might be others I'm missing here. The goal of this bug though is to get a better understanding of what is contributing to the gap on these simple benchmarks and what can be done to close it.
Comment 1•9 years ago
|
||
That's a good list. A few I'd add: - address-taken C++ variables go into the asm.js heap (which has a simulated stack pointer global) - function pointer calls go through an extra masking and indirection compared to native - if relooping fails, a dynamic state variable and switch is used - only double-precision transcendental functions (esp. sin/cos: bug 967709) - Ion doesn't have callee-saved registers (bug 985065) - some pre-codegen backend LLVM optimizations happen after Emscripten grabs the IR (don't know how much this adds up to; iiuc, the LLVM-wasm backend will get "more" of the pipeline) - some other minutae: add-with-carry (bug 1043365), min/max (bug 1060635), double-to-int conversion requires branching (we might be able to use signal handling) FWIW, for (2) WebAssembly won't require the alignment mask. For (3), bug 897425 had some initial ideas on hoisting "HeapReg + base" computations to free up effective addressing.
Reporter | ||
Comment 2•9 years ago
|
||
OK, so the main function I'm looking at now for now is inflate_fast in zlib. During an inflate benchmark zlib spends almost all its time either in this function or in adler32, and now that bug 1195545 has landed the asm.js and clang versions have similar instruction counts in adler32. inflate_fast, however, executes 70% more instructions in the asm.js version than the clang version. I'm trying to get a finer correspondence between the generated code in the asm.js and clang versions, but one issue is that the two compilations are structurally different --- the asm.js version has an extra variable used during control flow. I'm guessing this is due to a relooper failure, but I don't know --- there's no switch statement, and the variable is only used in a few places. It's kind of strange that the relooper would fail here, as while the original C function uses a couple gotos they are well structured and could be replaced with loops. Maybe the control flow graph which Emscripten sees is not as friendly as the original source? I modified the generated JS for inflate_fast to avoid using the control flow variable in some hot places, and will attach the before/after here. This improves the number of executed instructions in inflate_fast by 10% and the number in the overall inflate benchmark by 7.5%, but doesn't have much effect on the benchmark's runtime on my machine.
Reporter | ||
Comment 3•9 years ago
|
||
Reporter | ||
Comment 4•9 years ago
|
||
Reporter | ||
Comment 5•9 years ago
|
||
OK, so I went through the assembly which clang generated for inflate_fast by hand and matched up parts of it to the corresponding bits of the asm.js-compiled code on the modified inflate_fast .js above. Here's a description of what's going on in the 5 hottest blocks of code in the function, which together account for about 70% of the instructions spent by both versions. tl;dr: Extra regalloc moves seem to be the biggest problem, but Emscripten artifacts are also a major issue. #1 CLANG 1410200000 instructions ASMJS 1923000000 instructions This is for the body of the loop below (|out| and |from| are char* variables): #define PUP(a) *++(a) do { PUP(out) = PUP(from); PUP(out) = PUP(from); PUP(out) = PUP(from); len -= 3; } while (len > 2); Here is the Emscripten compiled version, which matches up well with the original C. do { HEAP8[i5 + 1 >> 0] = HEAP8[i7 + 1 >> 0] | 0; HEAP8[i5 + 2 >> 0] = HEAP8[i7 + 2 >> 0] | 0; i7 = i7 + 3 | 0; i5 = i5 + 3 | 0; HEAP8[i5 >> 0] = HEAP8[i7 >> 0] | 0; i6 = i6 + -3 | 0; } while (i6 >>> 0 > 2); Here is the clang compiled code: 0x00000000004092c0 <+1296>: mov -0x2(%rdx,%rax,1),%bl 0x00000000004092c4 <+1300>: mov %bl,0x1(%r11,%rax,1) 0x00000000004092c9 <+1305>: mov -0x1(%rdx,%rax,1),%bl 0x00000000004092cd <+1309>: mov %bl,0x2(%r11,%rax,1) 0x00000000004092d2 <+1314>: mov (%rdx,%rax,1),%bl 0x00000000004092d5 <+1317>: mov %bl,0x3(%r11,%rax,1) 0x00000000004092da <+1322>: add $0xfffffffffffffffd,%rcx 0x00000000004092de <+1326>: lea (%r10,%rcx,1),%esi 0x00000000004092e2 <+1330>: add $0x3,%rax 0x00000000004092e6 <+1334>: cmp $0x2,%esi 0x00000000004092e9 <+1337>: ja 0x4092c0 <inflate_fast+1296> Here is the asm.js compiled code: movsbl 0x1(%r15,%rdi,1), %ecx movb %cl, 0x1(%r15,%r12,1) movsbl 0x2(%r15,%rdi,1), %ecx movb %cl, 0x2(%r15,%r12,1) leal 0x3(%rdi), %esi leal 0x3(%r12), %ecx movl %ecx, 0xc(%rsp) movsbl 0x3(%r15,%rdi,1), %ecx movb %cl, 0x3(%r15,%r12,1) addl $-3, %eax cmpl $0x2, %eax jbe .Lfrom19805 movl %esi, %edi movl 0xc(%rsp), %r12d jmp .Llabel19734 There is one main wonky bit with the asm.js version. The registers rdi and r12, which are the memory positions of |out| and |from|, are incremented using leal instructions and store the result in different registers, which ends up requiring the regalloc to fixup the register values at the loop backedge with moves. It would be better if Ion reordered the loop body so that the leals could have the same source and target register and the moves could be removed. It would be even better to do what clang is doing, and use linear equalities to remove all but one of the index variables (we have an analysis for linear equalities in Ion but it is only used to remove bounds checks). #2 CLANG 728035000 instructions ASMJS 1549425000 instructions The original C: here = lcode[hold & lmask]; op = (unsigned)(here.bits); hold >>= op; bits -= op; op = (unsigned)(here.op); if (op == 0) { The Emscripten version: i2 = i3 & i28; i7 = HEAP8[i26 + (i2 << 2) >> 0] | 0; i6 = HEAP16[i26 + (i2 << 2) + 2 >> 1] | 0; i2 = HEAPU8[i26 + (i2 << 2) + 1 >> 0] | 0; i3 = i3 >>> i2; i2 = i5 - i2 | 0; if (i7 << 24 >> 24) { The clang compiled code: 0x0000000000408efe <+334>: movzbl 0x1(%r15,%rax,4),%ecx 0x0000000000408f04 <+340>: sub %ecx,%edx 0x0000000000408f06 <+342>: shr %cl,%rsi 0x0000000000408f09 <+345>: movzwl 0x2(%r15,%rax,4),%edi 0x0000000000408f0f <+351>: movzbl (%r15,%rax,4),%eax 0x0000000000408f14 <+356>: test %eax,%eax 0x0000000000408f16 <+358>: jne 0x408eea <inflate_fast+314> The asm.js compiled code: movsbl 0x0(%r15,%rcx,1), %eax movl %eax, %edx movswl 0x2(%r15,%rcx,1), %esi movl %esi, 0xc(%rsp) movzbl 0x1(%r15,%rcx,1), %ecx shrl %cl, %ebp movl 0x14(%rsp), %esi subl %ecx, %esi movl %esi, 0x14(%rsp) movl %edx, %ecx shll $24, %ecx sarl $24, %ecx testl %ecx, %ecx je .Lfrom18608 There are a couple problems here. The biggest one AFAICS is from the 'i7 << 24 >> 24' test in the Emscripten compiled code. I don't know what that's going here, but i7 was just loaded from an int8 array so these two shifts are together a nop (which Ion could optimize out if desired). The second problem is some stupidity on the part of the regalloc where it performs the first load into eax, then moves that value into edx and later ecx. Only one of these copies should be necessary, and without the shifts neither of them should be. #3 CLANG 619740000 instructions ASMJS 1446060000 instructions The original C: } while (in < last && out < end); The Emscripten version: } while (i1 >>> 0 < i32 >>> 0 & i4 >>> 0 < i36 >>> 0); The clang compiled code: 0x000000000040934b <+1435>: mov %r8d,%edx 0x000000000040934e <+1438>: cmp -0x40(%rsp),%rbp 0x0000000000409353 <+1443>: mov $0x1,%ebx 0x0000000000409358 <+1448>: jae 0x409365 <inflate_fast+1461> 0x000000000040935a <+1450>: cmp -0x38(%rsp),%r11 0x000000000040935f <+1455>: jb 0x408ea0 <inflate_fast+240> The asm.js compiled code: movl %ebx, %r8d cmpl 0x54(%rsp), %r8d setb %al movzbl %al, %eax movl 0x10(%rsp), %esi cmpl 0x58(%rsp), %esi setb %cl movzbl %cl, %ecx testl %eax, %ecx je .Lfrom21320 movl 0x40(%rsp), %r12d movl 0x44(%rsp), %ebx movl %esi, 0x10(%rsp) jmp .Llabel18414 Again, there are a couple problems here. For some reason the '&&' in the test ended up being emitted by Emscripten using a bitwise '&'. This requires Ion to compile the compares and set registers with the result, which both requires a lot of additional instructions and eats into the registers available to the regalloc. It would be possible to optimize this from Ion by compiling it like a '&&', but that's a complicated transformation that introduces new basic blocks and it'd be nice to know why this '&' was emitted in the first place. There's also a lot of spill code in this loop from the regalloc, but I'll need to dig around some more to see how this can be improved --- this is the tail of the main loop in the function, and these things are being loaded for the next loop iteration. #4 CLANG 583550000 instructions ASMJS 758615000 instructions The original C: hold += (unsigned long)(PUP(in)) << bits; bits += 8; hold += (unsigned long)(PUP(in)) << bits; bits += 8; The Emscripten version: i12 = i4 + 2 | 0; i5 = i2 + 16 | 0; i3 = ((HEAPU8[i4 + 1 >> 0] | 0) << i2) + i3 + ((HEAPU8[i12 >> 0] | 0) << i2 + 8) | 0; i4 = i12; The clang compiled code: 0x0000000000408eaa <+250>: movzbl 0x2(%rbp),%eax 0x0000000000408eae <+254>: lea 0x8(%rdx),%ecx 0x0000000000408eb1 <+257>: shl %cl,%rax 0x0000000000408eb4 <+260>: movzbl 0x1(%rbp),%esi 0x0000000000408eb8 <+264>: mov %dl,%cl 0x0000000000408eba <+266>: shl %cl,%rsi 0x0000000000408ebd <+269>: add %r13,%rsi 0x0000000000408ec0 <+272>: add %rax,%rsi 0x0000000000408ec3 <+275>: add $0x10,%edx 0x0000000000408ec6 <+278>: add $0x2,%rbp The emscripten compiled code: movl 0x10(%rsp), %esi leal 0x2(%rsi), %eax leal 0x10(%rcx), %edx movzbl 0x1(%r15,%rsi,1), %edi shll %cl, %edi addl %ebp, %edi movzbl 0x2(%r15,%rsi,1), %esi addl $8, %ecx shll %cl, %esi leal 0x0(%rdi,%rsi,1), %ebp movl %edx, 0x14(%rsp) movl %eax, 0x10(%rsp) jmp .Lfrom18502 The two versions here are pretty close, though the asm.js one is still suffering from extra regalloc moves due (apparently) to the leals having different source and target registers, like #1. I don't know if this is the regalloc's fault or is a lowering issue. #5 CLANG 346420000 instructions ASMJS 602350000 instructions The original C: PUP(out) = (unsigned char)(here.val); The Emscripten version: i5 = i6 & 255; i1 = i1 + 1 | 0; HEAP8[i1 >> 0] = i5; The clang compiled code: 0x0000000000408f18 <+360>: mov %dil,0x1(%r11) 0x0000000000408f1c <+364>: inc %r11 0x0000000000408f1f <+367>: mov %rsi,%r13 0x0000000000408f22 <+370>: jmpq 0x40934e <inflate_fast+1438> The asm.js compiled code: movl 0xc(%rsp), %ecx movl 0x10(%rsp), %esi andl $0xff, %ecx movl %ecx, 0xc(%rsp) leal 0x1(%r8), %ebx movb %cl, 0x1(%r15,%r8,1) movl %esi, 0x10(%rsp) A couple issues here. First, the regalloc in its infinite wisdom has decided to load a value off the stack and into %esi, not use it, then store it back to the same place. Second, the mask that's being done here should not be necessary. Similar to the shifts in #2, the mask here isn't necessary because the value will be stored into an int8 array. Unfortunately, this one is non-trivial to optimize away: i5 is live after this block of code, but only because of a spurious dependency related apparently to the relooper failure. There are some statements after the main loop body which test the relooper variable and correspond to failure paths within the loop; i5 is live within these statements, but will only be used if a particular path is taken out of the loop which sets i5 to a different value.
Updated•8 years ago
|
Priority: -- → P5
Comment 6•6 years ago
|
||
Per policy at https://wiki.mozilla.org/Bug_Triage/Projects/Bug_Handling/Bug_Husbandry#Inactive_Bugs. If this bug is not an enhancement request or a bug not present in a supported release of Firefox, then it may be reopened.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INACTIVE
Updated•6 years ago
|
Status: RESOLVED → REOPENED
Resolution: INACTIVE → ---
Updated•3 years ago
|
Status: REOPENED → RESOLVED
Closed: 6 years ago → 3 years ago
Resolution: --- → WONTFIX
You need to log in
before you can comment on or make changes to this bug.
Description
•