Closed Bug 609899 Opened 13 years ago Closed 12 years ago

JM: Allocate registers across branches and joins


(Core :: JavaScript Engine, defect)

(Reporter: bhackett1024, Assigned: bhackett1024)


(Blocks 1 open bug)



(1 file, 3 obsolete files)

Using the information computed by the analysis in bug 608346, JM can compute a register allocation to use at join points, and maintain variables in registers when branching or joining.
Attached patch WIP (obsolete) — Splinter Review
This works on toy examples.  This does an analysis pass to precompute some live and soon-to-be-used variables at each bytecode and hands things off to the FrameState to maintain registers within basic blocks and pick a consistent register allocation at join points.

I'm not sure this is the best approach.  The basic problem is that this doesn't fully utilize the registers across branches --- with 5 registers total to use on x86, only 3 are carried across to ensure that 2 are available for testing the branch condition.  This hurts on microbenchmarks:

Everything fits in 3 registers.

function foo(x, y) {
  var res = 0;
  for (var i = 0; i < x; i++)
foo(10000000, 1);

v8            28.8
jsc           32.1
js -j         22.2
js -m (old)   33.0
js -m (types) 28.8
js -m (new)   14.5

Larger loop, res is spilled.

function foo(x, y) {
  var res = 0;
  for (var i = 0; i < x; i++)
    res += y;
}
foo(10000000, 1);

v8            28.7
jsc           36.0
js -j         28.4
js -m (old)   43.0
js -m (types) 25.2
js -m (new)   24.7

I'd like to investigate using Linear Scan register allocation or a variant, which is (apparently) the algorithm used in the HotSpot client JVM.  Linear Scan does a backwards pass to compute live variables and their ranges, then a forward pass to assign registers.  This is similar to what this patch already does, except that it has full liveness information and would need info about intermediate stack temporaries and temporaries within a compiled opcode.
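To make the two passes concrete, here is a minimal sketch of the classic Linear Scan assignment phase, assuming the backwards liveness pass has already produced intervals as {name, start, end} objects.  The interval data, register names, and spill heuristic here are illustrative, not taken from the patch:

```javascript
// Linear Scan sketch: walk intervals in start order, expire dead ones,
// and spill the interval with the furthest end point when out of registers.
function linearScan(intervals, numRegs) {
  const sorted = [...intervals].sort((a, b) => a.start - b.start);
  const active = [];               // intervals currently holding a register
  const free = [];
  for (let i = 0; i < numRegs; i++) free.push("r" + i);
  const assignment = new Map();

  for (const iv of sorted) {
    // Expire intervals that ended before this one starts, freeing registers.
    for (let i = active.length - 1; i >= 0; i--) {
      if (active[i].end < iv.start) {
        free.push(assignment.get(active[i].name));
        active.splice(i, 1);
      }
    }
    if (free.length > 0) {
      assignment.set(iv.name, free.pop());
      active.push(iv);
    } else {
      // No free register: spill whichever live interval ends furthest away.
      active.sort((a, b) => a.end - b.end);
      const spill = active[active.length - 1];
      if (spill.end > iv.end) {
        assignment.set(iv.name, assignment.get(spill.name));
        assignment.set(spill.name, "spill");
        active.pop();
        active.push(iv);
      } else {
        assignment.set(iv.name, "spill");
      }
    }
  }
  return assignment;
}
```

With three registers the variables of the small loops above all stay register-resident; shrink the pool and the longest-lived interval is the one pushed to memory, which matches the spill behavior the timings show.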
Attached patch WIP 2 (obsolete) — Splinter Review

I like this design a lot more.  This is based on a binpacking Linear Scan allocator ("Quality and Speed in Linear-scan Register Allocation", by Traub et al.), combining a fast liveness prepass with a forward, allocate-as-you-go walk, as the FrameState does now.  So the processing of individual ops doesn't change much at all, and the FrameState just gets some eviction heuristics (not implemented yet).

The tricky stuff is (still) carrying registers around loops, which Linear Scan work generally avoids talking about (hmm...).  Instead of making a guess at a good loop head allocation, we defer its generation, watching for variables which are accessed early in the loop body and retroactively deciding they were in registers at the loop head.  This requires patching up previous stub code rejoin paths (but not previous sync code --- registers at loop heads are always synced), not implemented yet.

This approach lets any number of registers be carried around the loop, but only ones which will be used quickly end up getting carried.  This gives a good allocation both on simple loops (both loops in comment 1 take 14.5ms) and on loops with lots of register pressure (am3 in v8-crypto, only simulated so far).
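The deferred loop-head idea above can be sketched roughly as follows: scan the first few ops of the loop body, and only the variables touched in that window are retroactively treated as having been in registers at the loop head.  The op representation, register names, and window size here are all illustrative assumptions, not the FrameState's actual interface:

```javascript
// Deferred loop-head allocation sketch: variables used within the first
// `window` ops of the body get carried around the backedge, up to the
// number of available registers; everything else stays synced to memory.
function loopHeadAllocation(bodyOps, regs, window) {
  const carried = [];
  for (const op of bodyOps.slice(0, window)) {
    for (const v of op.uses) {
      if (!carried.includes(v) && carried.length < regs.length)
        carried.push(v);
    }
  }
  // Retroactive decision: each carried variable is deemed to have held
  // this register at the loop head all along.
  const alloc = new Map();
  carried.forEach((v, i) => alloc.set(v, regs[i]));
  return alloc;
}
```

The point of the window is the "used quickly" heuristic: any number of registers may be carried, but a variable only earns a carried register by being read soon after the loop head, so cold variables never displace hot ones.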
Assignee: general → bhackett1024
Attachment #490459 - Attachment is obsolete: true
Attached patch WIP 3 (obsolete) — Splinter Review
Passes trace-tests with --enable-type-inference.  Still need to get methodjit/tracejit integration working.
Attachment #494548 - Attachment is obsolete: true
Attached patch JM patch — Splinter Review
Patch landed on JM.  This needs more tuning in the Compiler (most conditional branches still syncAndForgetEverything()) before measuring effectiveness, but the FrameState and analysis changes are in good shape and things work with and without type inference.
Attachment #495445 - Attachment is obsolete: true
Blocks: 619423
No longer blocks: TypeInference
Closed: 12 years ago
Resolution: --- → FIXED
Depends on: 621487
(In reply to comment #1)
> I'd like to investigate using Linear Scan register allocation or a variant,
> which is the algorithm (apparently) used in the HotSpot client JVM.
Just to note, the server compiler uses a graph-coloring allocator; while it takes more time, it yields more efficient results, which could be worthwhile if/once off-thread compilation gets implemented in JM.
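For contrast with Linear Scan, a bare-bones greedy coloring pass over an interference graph looks like this.  This is a hedged sketch of the general technique only (no simplify/spill worklist, as a full Chaitin-style allocator would have), and the variable names and graph are made up for illustration:

```javascript
// Greedy graph-coloring sketch: give each variable the first register not
// already held by a neighbor in the interference graph; spill if none fits.
function colorGraph(vars, interferes, regs) {
  const alloc = new Map();
  for (const v of vars) {
    const taken = new Set(
      (interferes[v] || [])
        .map(n => alloc.get(n))
        .filter(r => r !== undefined && r !== "spill"));
    const reg = regs.find(r => !taken.has(r));
    alloc.set(v, reg !== undefined ? reg : "spill");
  }
  return alloc;
}
```

The extra cost mentioned in the comment comes from building and repeatedly restructuring the interference graph, which is why coloring suits an off-thread or ahead-of-time compiler better than a baseline JIT.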