Closed Bug 814966 Opened 7 years ago Closed 7 years ago

IonMonkey: implement LLVM's greedy backtracking register allocator

Categories

(Core :: JavaScript Engine, defect)

Other Branch
x86
macOS
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla20

People

(Reporter: bhackett, Assigned: bhackett)

References

Details

Attachments

(2 files, 2 obsolete files)

Last year a post was made to the LLVM project blog about the new register allocator used in LLVM 3.0

http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html

Since this post there has been a lot of interest inside and outside the JS team in trying this allocator out in IonMonkey.  The appealing features are (a) potentially faster generated code than a well tuned LSRA, and (b) less tuning required to generate good code than in LSRA.

Summarizing and extrapolating a bit the linked post, the design of this allocator is very similar to LSRA, selecting an allocation based on the live ranges of a set of virtual registers and allowing those live ranges to be split during allocation.  The key differences are:

a) Assigning physical locations to vregs using a priority queue rather than visiting them in linear order.  High priority vregs have high use densities in their live range.  By itself this is kind of a wash and generates similar code to LSRA, but making this change enables (b).

b) Allowing unassignment of previous allocations.  This is different from spilling an existing allocation (as LSRA does), and works by simply freeing the vreg's physical register and placing it back in the priority queue for a second chance in a different register.

The idea with the (b) is that visiting vregs in priority order can easily generate a suboptimal (in the graph coloring sense) allocation.  For example, with live ranges such as:

v1 ..
v2   ...
v3 .....

On a two register machine, visiting these in the order v1,v2,v3 may allocate different physical registers to v1 and v2, even though they do not interfere, and leaving no option for allocating v3.  With this backtracking, however, allocating v3 could unassign v2, give v3 a physical register, then revisit v2 and give it the remaining one.  If the example were larger and a physical register could not be found for v3, its live range can be split at arbitrary points while it is in the priority queue, to facilitate easier allocation.

I like the flexibility and cleanliness of this design a lot.  My main concern is running time of the allocator, as even an argument for termination cannot be made for the general approach.  It seems like such an argument could be made with a few more constraints for when it is possible to split and unassign vregs, and without much tuning this allocator could do a good job even on large scripts by quickly allocating for the large boring parts and then focusing in on refining a good allocation for short loop bodies scattered about.
Depends on: 817213
Depends on: 817769
Attached patch WIP (e5082df10222) (obsolete) — Splinter Review
WIP, generates an ok allocation on am3() but still a little worse than LSRA (on x86, 32.7% spill code vs. 31.6% for LSRA).  Needs more tuning, including better handling for phis and MUST_REUSE_INPUT use/def chains.
Assignee: general → bhackett1024
Attached patch WIP (e5082df10222) (obsolete) — Splinter Review
Patch with more tuning for am3/x86.  On an am3 microbenchmark (adjusted so that type information / etc. is the same as on v8-crypto) this improves the spill code % from 31.6% to 26.4%, giving an icount reduction of 7.4% and a raw speedup of 5.1%.  I think there's still more to be had here (plus examining behavior on other benchmarks and architectures) but for now want to get this initial version passing jit-tests and into the tree.
Attachment #690385 - Attachment is obsolete: true
Attached file am3 microbenchmark
Actually, playing a bit more with the microbenchmark, the performance curve is very sensitive to the length of the input arrays.  Varying this the raw speedup from the backtracking allocator is between 5% and 25% (!) though seems to stabilize around 10% for larger arrays.  Microbenchmark attached; the input array lengths in the full benchmark are usually either 19 or 1, though I don't think it's worth trying to game the latter case.  Mostly interested in am3 for the intense register pressure in its inner loop.
Depends on: 814396
Depends on: 813671
Depends on: 821151
Passes jit-tests --ion on x86 (with the dependent bug fixes), and preserves the regalloc improvements on am3.  Still a good ways to go, but I'm happy with this allocator as a starting point.
Attachment #690593 - Attachment is obsolete: true
Attachment #691683 - Flags: review?(jdemooij)
Comment on attachment 691683 [details] [diff] [review]
patch (e5082df10222)

Review of attachment 691683 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, from reading the code and comments it's pretty clear how the algorithm works. 

Can you check how this allocator does on the micro-benchmark in bug 688078? It's based on the safe_add function in SS crypto-md5/sha1 and has an inner loop with high register pressure. We should also see how running time compares to LSRA and how this allocator does on e.g. Kraken and Emscripen code, but we can do that after landing the initial version.

Maybe also good to ask for fuzzing before improving it further.

::: js/src/ion/BacktrackingAllocator.cpp
@@ +57,5 @@
> +        if (block->mir()->isLoopHeader())
> +            backedge = block->mir()->backedge()->lir();
> +
> +        if (block == backedge) {
> +            LBlock *header = block->mir()->getSuccessor(0)->lir();

Nit: s/getSuccessor(0)/loopHeaderOfBackedge()

@@ +127,5 @@
> +    if (IonSpewEnabled(IonSpew_RegAlloc))
> +        dumpRegisterGroups();
> +
> +    // Allocate, spill and split register intervals until finished.
> +    while (!queuedIntervals.empty()) {

Nit: check shouldCancel()

@@ +218,5 @@
> +
> +bool
> +BacktrackingAllocator::groupAndQueueRegisters()
> +{
> +    for (size_t i = 0; i < graph.numVirtualRegisters(); i++) {

Nit: check shouldCancel()

@@ +232,5 @@
> +            }
> +        }
> +
> +        LDefinition *def = reg.def();
> +        if (def && def->policy() == LDefinition::MUST_REUSE_INPUT) {

Can def be NULL here?

@@ +336,5 @@
> +                    return true;
> +            }
> +
> +            // Spill intervals which have no hint or register requirement.
> +            if (interval->requirement()->kind() == Requirement::NONE) {

Can we assert interval->hint()->kind() is NONE?

@@ +379,5 @@
> +
> +bool
> +BacktrackingAllocator::isReusedInput(LUse *use, LInstruction *ins, bool considerCopy)
> +{
> +    if (ins->numDefs() == 0 || ins->getDef(0)->policy() != LDefinition::MUST_REUSE_INPUT)

What about temps with MUST_REUSE_INPUT policy, as created by tempCopy? If we can ignore these it deserves a comment.

@@ +653,5 @@
> +                LiveInterval *prevInterval = reg->getInterval(k);
> +                if (prevInterval->start() != interval->start())
> +                    break;
> +                if (*prevInterval->getAllocation() == *interval->getAllocation())
> +                    skip = true;

We can "break;" here too right?

@@ +661,5 @@
> +
> +            LBlock *block = insData[interval->start()].block();
> +            if (interval->start() > inputOf(block->firstId())) {
> +                CodePosition start = interval->start();
> +                InstructionData *data = &insData[start];

Nit: can move these before the "if" so that you can use data->block() to get the block.

@@ +1030,5 @@
> +    size_t usesTotal = 0;
> +
> +    if (interval->index() == 0) {
> +        VirtualRegister *reg = &vregs[interval->vreg()];
> +        if (reg->def()->policy() == LDefinition::PRESET && reg->def()->output()->isRegister()) {

Nit: no braces

::: js/src/ion/BacktrackingAllocator.h
@@ +12,5 @@
> +
> +#include "ds/PriorityQueue.h"
> +#include "ds/SplayTree.h"
> +
> +// Backtracking priority queue based register allocator based on that described

Can you add a comment somewhere explaining how the live ranges differ from LSRA? Ideally both allocators would use exactly the same liveness information, but if doing it this way simplifies the algorithm it should be fine, as long as we explain how and why.

::: js/src/ion/LIR-Common.h
@@ +829,5 @@
>      }
>  };
>  
>  // Takes in either an integer or boolean input and tests it for truthiness.
> +class LTestDAndBranch : public LInstructionHelper<0, 1, 0>

Good catch.

::: js/src/ion/LIR.h
@@ +200,5 @@
>  
> +#ifdef DEBUG
> +    const char * toString() const;
> +#else
> +    const char * toString() const { return "???"; }

Nit: no space after *

::: js/src/ion/LiveRangeAllocator.cpp
@@ +617,5 @@
> +                        AnyRegister reg = temp->output()->toRegister();
> +                        if (!addFixedRangeAtHead(reg, inputOf(*ins), outputOf(*ins)))
> +                            return false;
> +                    } else {
> +                        if (!vregs[temp].getInterval(0)->addRangeAtHead(inputOf(*ins), outputOf(*ins)))

JS_ASSERT(!ins->isCall());

::: js/src/ion/RegisterAllocator.cpp
@@ +94,1 @@
>  #ifdef DEBUG

Nit: move this #ifdef before the if (IonSpewEnabled(...))
Attachment #691683 - Flags: review?(jdemooij) → review+
https://hg.mozilla.org/mozilla-central/rev/dc4887f61d2e
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
Blocks: 822116
Blocks: 825447
Depends on: 826734
Blocks: 826741
You need to log in before you can comment on or make changes to this bug.