Closed Bug 876057 Opened 12 years ago Closed 11 months ago

get our codegen to match native on float loop

Categories

(Core :: JavaScript Engine, defect)

defect

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: luke, Unassigned)

References

Details

This is a portion of an experiment sunfish performed comparing -O2 llvm to asm.js. As pointed out below, there are a bunch of low-level optimizations we are missing which would help all Ion-compiled code. I'm filing this bug to motivate and track the individual backend optimizations. C code: volatile float array[5000]; int main(int argc, char *argv[]) { for (int i = 0; i < 5000; ++i) { array[i] += 1.0f; } return 0; } asm.js: function _main($argc, $argv) { $argc = $argc | 0; $argv = $argv | 0; var $i_03 = 0, $arrayidx = 0, $inc = 0; $i_03 = 0; while (1) { $arrayidx = 8 + ($i_03 << 2) | 0; HEAPF32[$arrayidx >> 2] = +HEAPF32[$arrayidx >> 2] + 1.0; $inc = $i_03 + 1 | 0; if (($inc | 0) < 5e3) { $i_03 = $inc; } else { break; } } return 0; } Odin/Ion code: a1: mov %rax,%rcx a4: shl $0x2,%ecx a7: add $0x8,%ecx aa: movss (%r15,%rcx,1),%xmm0 b0: cvtss2sd %xmm0,%xmm0 b4: pcmpeqw %xmm1,%xmm1 b8: psllq $0x36,%xmm1 bd: psrlq $0x2,%xmm1 c2: addsd %xmm0,%xmm1 c6: cvtsd2ss %xmm1,%xmm15 cb: movss %xmm15,(%r15,%rcx,1) d1: mov $0x1,%ecx d6: add %eax,%ecx d8: cmp $0x1388,%ecx de: jge e9 e4: jmpq ee e9: jmpq f6 ee: mov %rcx,%rax f1: jmpq a1 C compiler: 10: movss 0x0(,%eax,4),%xmm0 19: addss %xmm1,%xmm0 1d: movss %xmm0,0x0(,%eax,4) 26: add $0x1,%eax 29: cmp $0x1388,%eax 2e: jne 10
Oh, and here's the commentary (again, credit to sunfish): Let's break down what's happening on the JavaScript side. The first thing that happens is the array load: a1: mov %rax,%rcx a4: shl $0x2,%ecx a7: add $0x8,%ecx aa: movss (%r15,%rcx,1),%xmm0 This is a good opportunity to use of x86's complex addressing modes. The add and and the shift could both be folded into the load's address computation. That would then allow the mov to be eliminated too; the only reason this code needs a copy of %rax is because it's using instructions which clobber their input. Also while we're here, we can see the use of asm.js' heap register, r15. It's not a big problem here, since I happened to use a statically allocated array. However, if the array were dynamically allocated, it would require an extra register in the address, so it would require an extra add somewhere. Next is a convert to double-precision: b0: cvtss2sd %xmm0,%xmm0 asm.js semantics currently require all floating-point be done in double precision. Note that it would be theoretically possible to fold the array load into the cvtss2sd here. However, this is actually slower, due to a horrible x86 quirk. Unary load-execute scalar SSE instructions don't write to the full output register, so they have false dependencies on earlier instructions. Next is this: b4: pcmpeqw %xmm1,%xmm1 b8: psllq $0x36,%xmm1 bd: psrlq $0x2,%xmm1 This is IonMonkey's special sequence for producing the floating-point constant 1.0 in an xmm register. The desired bitpattern is 0x3ff0000000000000, so it uses pcmpeqw to set xmm1 to all-ones, and then it uses psllq and psrlq to shift off bits on the left and the right to end up with the desired value. This seems like a lot of work, but it's actually an optimization over IonMonkey's generic sequence for floating-point constants, which looks something like this: movabsq $0x3ff0000000000000, %r11 movq %r11, %xmm1 Of course, neither of this is as good as just allocating the constant in memory somewhere and loading it into a register with a single instruction. This is what the C compiler is doing. The other problem here is that the production of that 1.0 constant is inside the loop. Regardless of how the constant gets produced into a register, it only needs to be done once, before the loop, instead of in each iteration of the loop. The C compiler also does this, which is why you don't see the code for it in the snippet I made. This is an LICM optimization. IonMonkey has an LICM pass, however it doesn't know that producing a constant value into a register is something which needs to be hoisted. Next is the add, conversion back to 32-bit, and the array store: c2: addsd %xmm0,%xmm1 c6: cvtsd2ss %xmm1,%xmm15 cb: movss %xmm15,(%r15,%rcx,1) The issues here are already covered by the discussion above. Next, increment the induction variable: d1: mov $0x1,%ecx d6: add %eax,%ecx This is an obvious missed opportunity to fold the immediate into the add. Finally, there's the loop exit: d8: cmp $0x1388,%ecx de: jge e9 e4: jmpq ee e9: jmpq f6 ee: mov %rcx,%rax f1: jmpq a1 Jump to a jump to exit the loop, otherwise fall through to jump to a jump to go back to the loop header. That's four jump instructions, where the C compiler uses one. This little CFG pretzel is actually a direct translation of the JavaScript: while (1) { $arrayidx = 8 + ($i_03 << 2) | 0; HEAPF32[$arrayidx >> 2] = +HEAPF32[$arrayidx >> 2] + 1.0; $inc = $i_03 + 1 | 0; if (($inc | 0) < 5e3) { $i_03 = $inc; } else { break; } } That $i_03, which is showing up as the "mov %rcx,%rax" at the end, is the instruction that copies the incremented induction variable value, "i+1", back into the original induction variable. LLVM structures induction variables this way because it uses SSA form. The induction variable is a separate variable from the incremented value. However asm.js is not in SSA form, so what we'd really want here is for Emscripten to recognize that $i_03's lifetime never overlaps with $inc's lifetime, and coalesce them into a single variable. Either that, or we'd like IonMonkey's register allocator to coalesce them into a single register. We'd also like to write the loop exit test to use a single branch, with a fallthrough that directly exits the loop.
Summary: get our codgen to match native on float loop → get our codegen to match native on float loop
I knew there was something fishy-looking about that summary...
Depends on: 876064
With the latest optimizations on the emscripten side, -O2 --minify 0 gives function aD(a, b) { a = a | 0; b = b | 0; b = 0; do { a = 8 + (b << 2) | 0; g[a >> 2] = +g[a >> 2] + 1.0; b = b + 1 | 0; } while ((b | 0) < 5e3); return 0; }
Alon's emscripten changes improve the loop from 4 branches to 2. The remaining unnecessary branch is due to Ion splitting critical edges; I now have a patch underway which should fix this. In other news, Hannes' fix for bug 876607 fixed the add-immediate issue.
Depends on: 876607
Depends on: 881390
Depends on: 881412
With all the fixes, Odin/Ion's output for the testcase now looks like this: [Codegen] movq %rax, %rcx [Codegen] shll $2, %ecx [Codegen] addl $0x8, %ecx [Codegen] movss 0(%r15,%rcx,1), %xmm1 [Codegen] cvtss2sd %xmm1, %xmm1 [Codegen] addsd %xmm0, %xmm1 [Codegen] cvtsd2ss %xmm1, %xmm15 [Codegen] movss %xmm15, 0(%r15,%rcx,1) [Codegen] addl $0x1, %eax [Codegen] cmpl $0x1388, %eax [Codegen] jl ((801)) The biggest issues remaining are address-mode folding and 32-bit floats.
Depends on: 888109
Assignee: general → nobody
Note that float32 optimizations require -s PRECISE_F32=1 on the Emscripten compile.
No longer blocks: 904918
Depends on: 986981, 904918
Depends on: 1116773
Severity: normal → S3
Status: NEW → RESOLVED
Closed: 11 months ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.