get our codegen to match native on float loop

NEW
Unassigned

Status

()

5 years ago
4 years ago

People

(Reporter: luke, Unassigned)

Tracking

(Depends on: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

5 years ago
This is a portion of an experiment sunfish performed comparing -O2 llvm to asm.js.  As pointed out below, there are a bunch of low-level optimizations we are missing which would help all Ion-compiled code.  I'm filing this bug to motivate and track the individual backend optimizations.

C code:

  volatile float array[5000];

  int main(int argc, char *argv[]) {
    for (int i = 0; i < 5000; ++i) {
      array[i] += 1.0f;
    }
    return 0;
  }

asm.js:

  function _main($argc, $argv) {
    $argc = $argc | 0;
    $argv = $argv | 0;
    var $i_03 = 0, $arrayidx = 0, $inc = 0;
    $i_03 = 0;
    while (1) {
      $arrayidx = 8 + ($i_03 << 2) | 0;
      HEAPF32[$arrayidx >> 2] = +HEAPF32[$arrayidx >> 2] + 1.0;
      $inc = $i_03 + 1 | 0;
      if (($inc | 0) < 5e3) {
        $i_03 = $inc;
      } else {
        break;
      }
    }
    return 0;
  }

Odin/Ion code:

    a1:       mov    %rax,%rcx
    a4:       shl    $0x2,%ecx
    a7:       add    $0x8,%ecx
    aa:       movss  (%r15,%rcx,1),%xmm0
    b0:       cvtss2sd %xmm0,%xmm0
    b4:       pcmpeqw %xmm1,%xmm1
    b8:       psllq  $0x36,%xmm1
    bd:       psrlq  $0x2,%xmm1
    c2:       addsd  %xmm0,%xmm1
    c6:       cvtsd2ss %xmm1,%xmm15
    cb:       movss  %xmm15,(%r15,%rcx,1)
    d1:       mov    $0x1,%ecx
    d6:       add    %eax,%ecx
    d8:       cmp    $0x1388,%ecx
    de:       jge    e9
    e4:       jmpq   ee
    e9:       jmpq   f6
    ee:       mov    %rcx,%rax
    f1:       jmpq   a1

C compiler:

    10:       movss  0x0(,%eax,4),%xmm0
    19:       addss  %xmm1,%xmm0
    1d:       movss  %xmm0,0x0(,%eax,4)
    26:       add    $0x1,%eax
    29:       cmp    $0x1388,%eax
    2e:       jne    10
(Reporter)

Comment 1

5 years ago
Oh, and here's the commentary (again, credit to sunfish):

Let's break down what's happening on the JavaScript
side. The first thing that happens is the array load:

    a1:       mov    %rax,%rcx
    a4:       shl    $0x2,%ecx
    a7:       add    $0x8,%ecx
    aa:       movss  (%r15,%rcx,1),%xmm0

This is a good opportunity to use of x86's complex addressing modes.
The add and and the shift could both be folded into the load's address
computation. That would then allow the mov to be eliminated too; the
only reason this code needs a copy of %rax is because it's using
instructions which clobber their input.

Also while we're here, we can see the use of asm.js' heap register,
r15.  It's not a big problem here, since I happened to use a
statically allocated array.  However, if the array were dynamically
allocated, it would require an extra register in the address, so it
would require an extra add somewhere.

Next is a convert to double-precision:

    b0:       cvtss2sd %xmm0,%xmm0

asm.js semantics currently require all floating-point be done in
double precision.

Note that it would be theoretically possible to fold the array load
into the cvtss2sd here. However, this is actually slower, due to a
horrible x86 quirk. Unary load-execute scalar SSE instructions don't
write to the full output register, so they have false dependencies on
earlier instructions.

Next is this:

    b4:       pcmpeqw %xmm1,%xmm1
    b8:       psllq  $0x36,%xmm1
    bd:       psrlq  $0x2,%xmm1

This is IonMonkey's special sequence for producing the floating-point
constant 1.0 in an xmm register. The desired bitpattern is
0x3ff0000000000000, so it uses pcmpeqw to set xmm1 to all-ones, and
then it uses psllq and psrlq to shift off bits on the left and the
right to end up with the desired value. This seems like a lot of work,
but it's actually an optimization over IonMonkey's generic sequence
for floating-point constants, which looks something like this:

        movabsq    $0x3ff0000000000000, %r11
        movq       %r11, %xmm1

Of course, neither of this is as good as just allocating the constant
in memory somewhere and loading it into a register with a single
instruction. This is what the C compiler is doing.

The other problem here is that the production of that 1.0 constant is
inside the loop. Regardless of how the constant gets produced into a
register, it only needs to be done once, before the loop, instead of
in each iteration of the loop. The C compiler also does this, which is
why you don't see the code for it in the snippet I made. This is an
LICM optimization. IonMonkey has an LICM pass, however it doesn't know
that producing a constant value into a register is something which
needs to be hoisted.

Next is the add, conversion back to 32-bit, and the array store:

    c2:       addsd  %xmm0,%xmm1
    c6:       cvtsd2ss %xmm1,%xmm15
    cb:       movss  %xmm15,(%r15,%rcx,1)

The issues here are already covered by the discussion above. Next,
increment the induction variable:

    d1:       mov    $0x1,%ecx
    d6:       add    %eax,%ecx

This is an obvious missed opportunity to fold the immediate into the add.

Finally, there's the loop exit:

    d8:       cmp    $0x1388,%ecx
    de:       jge    e9
    e4:       jmpq   ee
    e9:       jmpq   f6
    ee:       mov    %rcx,%rax
    f1:       jmpq   a1

Jump to a jump to exit the loop, otherwise fall through to jump to a
jump to go back to the loop header. That's four jump instructions,
where the C compiler uses one. This little CFG pretzel is actually a
direct translation of the JavaScript:

    while (1) {
      $arrayidx = 8 + ($i_03 << 2) | 0;
      HEAPF32[$arrayidx >> 2] = +HEAPF32[$arrayidx >> 2] + 1.0;
      $inc = $i_03 + 1 | 0;
      if (($inc | 0) < 5e3) {
        $i_03 = $inc;
      } else {
        break;
      }
    }

That $i_03, which is showing up as the "mov %rcx,%rax" at the end, is
the instruction that copies the incremented induction variable value,
"i+1", back into the original induction variable.

LLVM structures induction variables this way because it uses SSA form.
The induction variable is a separate variable from the incremented
value. However asm.js is not in SSA form, so what we'd really want
here is for Emscripten to recognize that $i_03's lifetime never
overlaps with $inc's lifetime, and coalesce them into a single
variable. Either that, or we'd like IonMonkey's register allocator to
coalesce them into a single register.

We'd also like to write the loop exit test to use a single branch,
with a fallthrough that directly exits the loop.
(Reporter)

Updated

5 years ago
Summary: get our codgen to match native on float loop → get our codegen to match native on float loop

Comment 2

5 years ago
I knew there was something fishy-looking about that summary...
(Reporter)

Updated

5 years ago
Depends on: 876064
Depends on: 875135

Comment 3

5 years ago
With the latest optimizations on the emscripten side, -O2 --minify 0 gives

function aD(a, b) {
  a = a | 0;
  b = b | 0;
  b = 0;
  do {
    a = 8 + (b << 2) | 0;
    g[a >> 2] = +g[a >> 2] + 1.0;
    b = b + 1 | 0;
  } while ((b | 0) < 5e3);
  return 0;
}
Alon's emscripten changes improve the loop from 4 branches to 2. The remaining unnecessary branch is due to Ion splitting critical edges; I now have a patch underway which should fix this.

In other news, Hannes' fix for bug 876607 fixed the add-immediate issue.
(Reporter)

Updated

5 years ago
Depends on: 876607

Updated

5 years ago
Depends on: 881390

Updated

5 years ago
Depends on: 881412
With all the fixes, Odin/Ion's output for the testcase now looks like this:

[Codegen] movq       %rax, %rcx
[Codegen] shll       $2, %ecx
[Codegen] addl       $0x8, %ecx
[Codegen] movss      0(%r15,%rcx,1), %xmm1
[Codegen] cvtss2sd   %xmm1, %xmm1
[Codegen] addsd      %xmm0, %xmm1
[Codegen] cvtsd2ss   %xmm1, %xmm15
[Codegen] movss      %xmm15, 0(%r15,%rcx,1)
[Codegen] addl       $0x1, %eax
[Codegen] cmpl       $0x1388, %eax
[Codegen] jl         ((801))

The biggest issues remaining are address-mode folding and 32-bit floats.

Updated

5 years ago
Depends on: 888109
Blocks: 904918
(Assignee)

Updated

4 years ago
Assignee: general → nobody
Note that float32 optimizations require -s PRECISE_F32=1 on the Emscripten compile.
No longer blocks: 904918
Depends on: 986981, 904918

Updated

4 years ago
Depends on: 1116773
You need to log in before you can comment on or make changes to this bug.