Closed
Bug 876057
Opened 12 years ago
Closed 11 months ago
get our codegen to match native on float loop
Categories
(Core :: JavaScript Engine, defect)
Core
JavaScript Engine
Tracking
()
RESOLVED
INCOMPLETE
People
(Reporter: luke, Unassigned)
References
Details
This is a portion of an experiment sunfish performed comparing -O2 llvm to asm.js. As pointed out below, there are a bunch of low-level optimizations we are missing which would help all Ion-compiled code. I'm filing this bug to motivate and track the individual backend optimizations.
C code:
volatile float array[5000];
int main(int argc, char *argv[]) {
for (int i = 0; i < 5000; ++i) {
array[i] += 1.0f;
}
return 0;
}
asm.js:
function _main($argc, $argv) {
$argc = $argc | 0;
$argv = $argv | 0;
var $i_03 = 0, $arrayidx = 0, $inc = 0;
$i_03 = 0;
while (1) {
$arrayidx = 8 + ($i_03 << 2) | 0;
HEAPF32[$arrayidx >> 2] = +HEAPF32[$arrayidx >> 2] + 1.0;
$inc = $i_03 + 1 | 0;
if (($inc | 0) < 5e3) {
$i_03 = $inc;
} else {
break;
}
}
return 0;
}
Odin/Ion code:
a1: mov %rax,%rcx
a4: shl $0x2,%ecx
a7: add $0x8,%ecx
aa: movss (%r15,%rcx,1),%xmm0
b0: cvtss2sd %xmm0,%xmm0
b4: pcmpeqw %xmm1,%xmm1
b8: psllq $0x36,%xmm1
bd: psrlq $0x2,%xmm1
c2: addsd %xmm0,%xmm1
c6: cvtsd2ss %xmm1,%xmm15
cb: movss %xmm15,(%r15,%rcx,1)
d1: mov $0x1,%ecx
d6: add %eax,%ecx
d8: cmp $0x1388,%ecx
de: jge e9
e4: jmpq ee
e9: jmpq f6
ee: mov %rcx,%rax
f1: jmpq a1
C compiler:
10: movss 0x0(,%eax,4),%xmm0
19: addss %xmm1,%xmm0
1d: movss %xmm0,0x0(,%eax,4)
26: add $0x1,%eax
29: cmp $0x1388,%eax
2e: jne 10
Reporter | ||
Comment 1•12 years ago
|
||
Oh, and here's the commentary (again, credit to sunfish):
Let's break down what's happening on the JavaScript
side. The first thing that happens is the array load:
a1: mov %rax,%rcx
a4: shl $0x2,%ecx
a7: add $0x8,%ecx
aa: movss (%r15,%rcx,1),%xmm0
This is a good opportunity to use of x86's complex addressing modes.
The add and and the shift could both be folded into the load's address
computation. That would then allow the mov to be eliminated too; the
only reason this code needs a copy of %rax is because it's using
instructions which clobber their input.
Also while we're here, we can see the use of asm.js' heap register,
r15. It's not a big problem here, since I happened to use a
statically allocated array. However, if the array were dynamically
allocated, it would require an extra register in the address, so it
would require an extra add somewhere.
Next is a convert to double-precision:
b0: cvtss2sd %xmm0,%xmm0
asm.js semantics currently require all floating-point be done in
double precision.
Note that it would be theoretically possible to fold the array load
into the cvtss2sd here. However, this is actually slower, due to a
horrible x86 quirk. Unary load-execute scalar SSE instructions don't
write to the full output register, so they have false dependencies on
earlier instructions.
Next is this:
b4: pcmpeqw %xmm1,%xmm1
b8: psllq $0x36,%xmm1
bd: psrlq $0x2,%xmm1
This is IonMonkey's special sequence for producing the floating-point
constant 1.0 in an xmm register. The desired bitpattern is
0x3ff0000000000000, so it uses pcmpeqw to set xmm1 to all-ones, and
then it uses psllq and psrlq to shift off bits on the left and the
right to end up with the desired value. This seems like a lot of work,
but it's actually an optimization over IonMonkey's generic sequence
for floating-point constants, which looks something like this:
movabsq $0x3ff0000000000000, %r11
movq %r11, %xmm1
Of course, neither of this is as good as just allocating the constant
in memory somewhere and loading it into a register with a single
instruction. This is what the C compiler is doing.
The other problem here is that the production of that 1.0 constant is
inside the loop. Regardless of how the constant gets produced into a
register, it only needs to be done once, before the loop, instead of
in each iteration of the loop. The C compiler also does this, which is
why you don't see the code for it in the snippet I made. This is an
LICM optimization. IonMonkey has an LICM pass, however it doesn't know
that producing a constant value into a register is something which
needs to be hoisted.
Next is the add, conversion back to 32-bit, and the array store:
c2: addsd %xmm0,%xmm1
c6: cvtsd2ss %xmm1,%xmm15
cb: movss %xmm15,(%r15,%rcx,1)
The issues here are already covered by the discussion above. Next,
increment the induction variable:
d1: mov $0x1,%ecx
d6: add %eax,%ecx
This is an obvious missed opportunity to fold the immediate into the add.
Finally, there's the loop exit:
d8: cmp $0x1388,%ecx
de: jge e9
e4: jmpq ee
e9: jmpq f6
ee: mov %rcx,%rax
f1: jmpq a1
Jump to a jump to exit the loop, otherwise fall through to jump to a
jump to go back to the loop header. That's four jump instructions,
where the C compiler uses one. This little CFG pretzel is actually a
direct translation of the JavaScript:
while (1) {
$arrayidx = 8 + ($i_03 << 2) | 0;
HEAPF32[$arrayidx >> 2] = +HEAPF32[$arrayidx >> 2] + 1.0;
$inc = $i_03 + 1 | 0;
if (($inc | 0) < 5e3) {
$i_03 = $inc;
} else {
break;
}
}
That $i_03, which is showing up as the "mov %rcx,%rax" at the end, is
the instruction that copies the incremented induction variable value,
"i+1", back into the original induction variable.
LLVM structures induction variables this way because it uses SSA form.
The induction variable is a separate variable from the incremented
value. However asm.js is not in SSA form, so what we'd really want
here is for Emscripten to recognize that $i_03's lifetime never
overlaps with $inc's lifetime, and coalesce them into a single
variable. Either that, or we'd like IonMonkey's register allocator to
coalesce them into a single register.
We'd also like to write the loop exit test to use a single branch,
with a fallthrough that directly exits the loop.
Reporter | ||
Updated•12 years ago
|
Summary: get our codgen to match native on float loop → get our codegen to match native on float loop
Comment 2•12 years ago
|
||
I knew there was something fishy-looking about that summary...
Comment 3•11 years ago
|
||
With the latest optimizations on the emscripten side, -O2 --minify 0 gives
function aD(a, b) {
a = a | 0;
b = b | 0;
b = 0;
do {
a = 8 + (b << 2) | 0;
g[a >> 2] = +g[a >> 2] + 1.0;
b = b + 1 | 0;
} while ((b | 0) < 5e3);
return 0;
}
Comment 4•11 years ago
|
||
Alon's emscripten changes improve the loop from 4 branches to 2. The remaining unnecessary branch is due to Ion splitting critical edges; I now have a patch underway which should fix this.
In other news, Hannes' fix for bug 876607 fixed the add-immediate issue.
Comment 5•11 years ago
|
||
With all the fixes, Odin/Ion's output for the testcase now looks like this:
[Codegen] movq %rax, %rcx
[Codegen] shll $2, %ecx
[Codegen] addl $0x8, %ecx
[Codegen] movss 0(%r15,%rcx,1), %xmm1
[Codegen] cvtss2sd %xmm1, %xmm1
[Codegen] addsd %xmm0, %xmm1
[Codegen] cvtsd2ss %xmm1, %xmm15
[Codegen] movss %xmm15, 0(%r15,%rcx,1)
[Codegen] addl $0x1, %eax
[Codegen] cmpl $0x1388, %eax
[Codegen] jl ((801))
The biggest issues remaining are address-mode folding and 32-bit floats.
Assignee | ||
Updated•10 years ago
|
Assignee: general → nobody
Comment 6•10 years ago
|
||
Note that float32 optimizations require -s PRECISE_F32=1 on the Emscripten compile.
Updated•2 years ago
|
Severity: normal → S3
Updated•11 months ago
|
Status: NEW → RESOLVED
Closed: 11 months ago
Resolution: --- → INCOMPLETE
You need to log in
before you can comment on or make changes to this bug.
Description
•