Closed Bug 1253952 Opened 8 years ago Closed 1 month ago

AsmJS table switch optimization: skip indirect jumps

Categories

(Core :: JavaScript Engine: JIT, enhancement, P5)

Tracking

RESOLVED INCOMPLETE

People

(Reporter: pipcet, Unassigned)

Attachments

(2 files, 1 obsolete file)

I recently started implementing a somewhat obvious but worthwhile optimization in the JIT engine: when jumping to a table switch whose dispatch expression is constant, thread the jump straight through to the corresponding case block instead of performing the range check and indirect jump.

One thing I'm doing a lot in asm.js is simulating goto or writing state machines that look like this:

while(1) {
    switch(x|0) {
    case 0:
        ...
        x = 7; continue;
    ...
    case 7:
        ...
    }
}

Right now, the code we generate for this case corresponds to the asm.js source quite literally: we jump back to the while(1) loop header, check that x is in range, and perform an indirect jump to the case 7: label. The range check and the assignment to x shouldn't be too expensive, but indirect jumps are, and here they're unnecessary: after setting x we could jump straight to the case label.

That optimization sounds simple, and it would probably take an expert only a few minutes to implement atop the existing infrastructure, but I had a hard time figuring out how to do it.
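
To make the intent concrete, here is a minimal sketch of the edge-threading step in JavaScript. All of the helper names (constantOnEdge, caseBlock, redirectSuccessor, and so on) are invented for illustration; this is not the SpiderMonkey API:

    // For each predecessor edge of the switch's block: if the value being
    // switched on is a known constant along that edge, retarget the edge
    // at the case block, skipping the range check and the indirect jump.
    function threadSwitchEdges(tableSwitch) {
      for (const pred of tableSwitch.block().predecessors()) {
        const v = constantOnEdge(tableSwitch.input(), pred); // null if unknown
        if (v === null)
          continue;
        const target = tableSwitch.inRange(v) ? tableSwitch.caseBlock(v)
                                              : tableSwitch.defaultBlock();
        pred.redirectSuccessor(tableSwitch.block(), target);
      }
    }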

This turned out to be hard to do within the JIT compiler's design, because we can't simply turn existing edges into backedges. The approach I've chosen is to leave the CFG unchanged, but to record the "path" through which the jump happens in a new MIR/LIR instruction opcode, XGoto. When we visit the XGoto, we replay the move groups along that path and then emit a direct jump instead of the indirect one.
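
In code-generation terms, the scheme reads roughly like this (a sketch of the approach just described, with invented names; not the actual patch):

    // An XGoto records the blocks the threaded jump conceptually passes
    // through. At codegen time we replay the regalloc-inserted move groups
    // of those blocks, then emit a direct jump to the case label.
    function visitXGoto(xgoto) {
      for (const block of xgoto.recordedPath())
        for (const group of block.moveGroups())
          emitMoveGroup(group);          // replayed at the new jump site
      emitJump(xgoto.caseTarget());      // direct jump: no range check,
                                         // no indirect branch
    }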

This isn't particularly clean, and I'd love to see a better way of implementing it, but the main reason I'm reporting this now is that my patch has stopped working due to recent changes which make this optimization much harder.

So the patch is very much a work in progress, but it may be helpful to anyone wanting to implement this in a form that can land in the tree. Maybe we can work together on it?

Performance improves significantly for me (15-20% on my actual workload, and much more on constructed examples), so I think this optimization is worth pursuing even if the XGoto hack is too ugly to land.

While ugly, I think the XGoto approach can be extended to one more optimization: "half-inlining" functions. The idea is to jump to a stub that replays what would have happened at the start of the callee (including optimizations based on constant arguments and the like) up to a certain point, and then continues with the normal body of the function. In particular, when a state machine like the one above is called with a constant argument, we could save ourselves the initial indirect jump. (Unfortunately, asm.js doesn't allow the default label to be used for this purpose, because it has to come last.)

If nothing else, the patch can be used to get a rough idea of the potential performance improvement.

Comment 1 - Nicolas B. Pierron [:nbp]:

This approach sounds weird, as we would expect this modification to happen in the MIR graph, before any register allocation.

The problem seems to be that we currently have a single way of writing backward edges in the MIR graph, and these involve having loops, with loop headers. I wonder what would be needed to accept this kind of edge in the MIRGraph.

Inlining the following blocks into the current one sounds like a solution that would not benefit from the register allocator, and might end up doing too many moves.

On the other hand, we could use the loop-unrolling algorithm + GVN + UCE to convert the interpreted state machine into larger linear sequences of code (macro instructions), as a tracing compiler would do, and fall back to a normal loop whenever a non-threaded path is taken.

Thus we could convert the code into something which looks like:

while (1) {
  switch (x) {
  case 0:
      while (1) {
          // case 0
          ...
          // case 7
          ...
          break;
      }
      continue;
  case 7:
      ...
  }
}

Comment 2 - Benjamin Bouvier [:bbouvier]:

Interesting. I think jump threading would be valuable not only for AsmJS. I seem to recall that interpreters compiled to AsmJS ran into this performance pitfall, because interpreter loops are mostly goto loops that get compiled as in your example in comment 1.

Making it a compiler optimization pass (as e.g. GVN or others) would probably be easier than replaying move groups after regalloc, indeed.
Status: UNCONFIRMED → NEW
Ever confirmed: true

pipcet (reporter):

Thanks for the responses!

(In reply to Nicolas B. Pierron [:nbp] from comment #1)
> This approach sounds weird as we would expect this modification to happen in
> the MIR Graph, before any register allocation.

Well, the XGoto generation happens at the right point, but the actual code is generated after regalloc. Two reasons for that: I couldn't get the other way to work easily, and it allows us to combine move groups and constant moves across the jump.
 
> Inlining the next blocks into the current one sounds like a solution which
> would not benefit from register allocator, and might be doing too many moves.

Indeed. There's a limit in the code, IIRC, but keep in mind that these moves would happen anyway, and we can combine them more aggressively (I'm playing with other optimizations on top of this, most pertinently clobbering of registers in move groups, which actually reduces the number of moves we do).

(In reply to Benjamin Bouvier [:bbouvier] from comment #2)
> Interesting.

That sounds much better than "weird", though I think we can agree it's both.

> I think jump threading would be something valuable not only for
> AsmJS.

I haven't tested non-AsmJS code, but it might work.

> Making it a compiler optimization pass (as e.g. GVN or others) would
> probably be easier than replaying move groups after regalloc, indeed.

Maybe for someone who knows their way around the code better than I do. Replaying move groups (and extending move groups to include clobbers and constant moves) is surprisingly easy, and allows us to optimize things across the XGoto branch.

I realize the answer is probably "no", but would it be possible to go back to generating the table switch subtraction in the JIT code rather than in the asmjs/wasm code?

Some more detail on the cross-jump optimization:

 - I've modified move groups to include constant moves and clobbers, as well as a compose operation (which I'll write as "*", with the left-hand side being evaluated first). [a -> b] * [clobber -> b] turns into the empty move group, and so does [a -> b, b -> a] * [a -> b, b -> a]; see the worked sketch after the assembly snippet below.
 - Ideally, the code after the newly non-indirect branch clobbers the "x" register, so we don't need to load that constant at all.
 - There are indeed still cases where very many moves are replayed. It helps a lot to modify the register allocator to treat some registers as call-saved, though.
 - I added a non-indirect fast branch to the first case label of switch statements, because my state machines want that jump to happen in most cases:

sub $constant, %eax    # bias the index by the table's low bound; sets ZF
je first_case          # biased index is zero: direct jump to the first case
cmp $constant, %eax    # range check (this constant is the table size)
jae default_case       # out of range (unsigned): take the default
<indirect jump>        # otherwise, the usual jump through the table
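
To make the compose operation concrete, here is a toy model of it in JavaScript. This is deliberately simplified, with invented names; the real LIR move-group machinery is more involved:

    // A move group is modelled as a Map from destination to source, with
    // all reads happening before all writes (parallel moves). CLOBBER as
    // a source marks a register whose value becomes undefined.
    const CLOBBER = Symbol("clobber");

    // compose(first, second): one group equivalent to running `first`,
    // then `second`.
    function compose(first, second) {
      const out = new Map();
      // Destinations written only by the first group survive unchanged.
      for (const [dst, src] of first)
        if (!second.has(dst))
          out.set(dst, src);
      // The second group reads the post-`first` state, so rewrite its
      // sources in terms of the original inputs.
      for (const [dst, src] of second)
        out.set(dst, first.has(src) ? first.get(src) : src);
      // Identity moves and moves of dead (clobbered) values disappear.
      for (const [dst, src] of out)
        if (dst === src || src === CLOBBER)
          out.delete(dst);
      return out;
    }

    // [a -> b] * [clobber -> b] => empty group:
    console.log(compose(new Map([["b", "a"]]),
                        new Map([["b", CLOBBER]])));         // Map(0) {}
    // [a -> b, b -> a] * [a -> b, b -> a] => empty group (a swap undoes itself):
    console.log(compose(new Map([["b", "a"], ["a", "b"]]),
                        new Map([["b", "a"], ["a", "b"]]))); // Map(0) {}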

With all those things, I've been able to improve performance from 25 s of post-compilation runtime to ~15 s for my example code (the native equivalent takes about 7 s), and I think I've managed to reduce emitted code size as well. While it's definitely a secondary concern, the disassembled code is more readable as well.

So that's definitely worth it! I'll think more about how to do these things properly.
New patch series; patch 3/8 is the main change.
Attachment #8727188 - Attachment is obsolete: true
I've thought about this some more, and disassembled a lot, and I'm increasingly convinced that my approach is the best way to get this working. Adding non-loop backedges is highly nontrivial and, I think, would require what amounts to a redesign of the register allocator.

It's unfortunate that we cannot use ->foldsTo() to simplify the inlined code, but it's a design issue: foldsTo works on the MIR graph, which is then lowered to the LIR graph, on which the register allocator is run, and only after the register allocator is done can we add the non-loop backedges that we would need to be in place before running foldsTo.
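
Spelled out, the phase ordering that creates the problem is roughly (my summary of the constraint described above, not an exhaustive pass list):

    build MIR -> MIR optimizations (foldsTo, GVN, UCE, ...) -> lower to LIR
        -> register allocation -> codegen (only here can the XGoto backedges exist)

so by the time the backedges exist, every pass that could have exploited them has already run.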

I've attached a patch series that includes all optimizations (including some code to replace two 32-bit moves by a 64-bit move in some cases, which does not appear to be worth it). Any thoughts would be appreciated.
Revert the change to always generate tableswitches with low=0 for AsmJS code. This needs to be applied before any of the patch series for it to have any effect on performance.
Priority: -- → P5
Severity: normal → S3
Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → INCOMPLETE